
SAFETY, RELIABILITY AND RISK ANALYSIS: THEORY, METHODS AND APPLICATIONS
PROCEEDINGS OF THE EUROPEAN SAFETY AND RELIABILITY CONFERENCE, ESREL 2008,
AND 17TH SRA-EUROPE, VALENCIA, SPAIN, SEPTEMBER 22–25, 2008

Safety, Reliability and Risk Analysis:
Theory, Methods and Applications

Editors
Sebastián Martorell
Department of Chemical and Nuclear Engineering,
Universidad Politécnica de Valencia, Spain

C. Guedes Soares
Instituto Superior Técnico, Technical University of Lisbon, Lisbon, Portugal

Julie Barnett
Department of Psychology, University of Surrey, UK

VOLUME 1
Cover picture designed by Centro de Formación Permanente - Universidad Politécnica de Valencia

CRC Press/Balkema is an imprint of the Taylor & Francis Group, an informa business

© 2009 Taylor & Francis Group, London, UK

Typeset by Vikatan Publishing Solutions (P) Ltd., Chennai, India


Printed and bound in Great Britain by Antony Rowe (A CPI-group Company), Chippenham, Wiltshire.

All rights reserved. No part of this publication or the information contained herein may be reproduced, stored
in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, by photocopying,
recording or otherwise, without prior written permission from the publisher.

Although all care is taken to ensure the integrity and quality of this publication and the information herein, no
responsibility is assumed by the publishers or the author for any damage to property or persons as a result
of operation or use of this publication and/or the information contained herein.

Published by: CRC Press/Balkema


P.O. Box 447, 2300 AK Leiden, The Netherlands
e-mail: Pub.NL@taylorandfrancis.com
www.crcpress.com – www.taylorandfrancis.co.uk – www.balkema.nl

ISBN: 978-0-415-48513-5 (set of 4 volumes + CD-ROM)


ISBN: 978-0-415-48514-2 (vol 1)
ISBN: 978-0-415-48515-9 (vol 2)
ISBN: 978-0-415-48516-6 (vol 3)
ISBN: 978-0-415-48792-4 (vol 4)
ISBN: 978-0-203-88297-9 (e-book)

Table of contents

Preface XXIV
Organization XXXI
Acknowledgment XXXV
Introduction XXXVII

VOLUME 1

Thematic areas
Accident and incident investigation
A code for the simulation of human failure events in nuclear power plants: SIMPROC 3
J. Gil, J. Esperón, L. Gamo, I. Fernández, P. González, J. Moreno, A. Expósito,
C. Queral, G. Rodríguez & J. Hortal
A preliminary analysis of the ‘Tlahuac’ incident by applying the MORT technique 11
J.R. Santos-Reyes, S. Olmos-Peña & L.M. Hernández-Simón
Comparing a multi-linear (STEP) and systemic (FRAM) method for accident analysis 19
I.A. Herrera & R. Woltjer
Development of a database for reporting and analysis of near misses in the Italian
chemical industry 27
R.V. Gagliardi & G. Astarita
Development of incident report analysis system based on m-SHEL ontology 33
Y. Asada, T. Kanno & K. Furuta
Forklifts overturn incidents and prevention in Taiwan 39
K.Y. Chen, S.-H. Wu & C.-M. Shu
Formal modelling of incidents and accidents as a means for enriching training material
for satellite control operations 45
S. Basnyat, P. Palanque, R. Bernhaupt & E. Poupart
Hazard factors analysis in regional traffic records 57
M. Mlynczak & J. Sipa
Organizational analysis of availability: What are the lessons for a high risk industrial company? 63
M. Voirin, S. Pierlot & Y. Dien
Thermal explosion analysis of methyl ethyl ketone peroxide by non-isothermal
and isothermal calorimetry application 71
S.H. Wu, J.M. Tseng & C.M. Shu

Crisis and emergency management
A mathematical model for risk analysis of disaster chains 79
Xuewei Ji, Wenguo Weng & Pan Li
Effective learning from emergency responses 83
K. Eriksson & J. Borell
On the constructive role of multi-criteria analysis in complex decision-making:
An application in radiological emergency management 89
C. Turcanu, B. Carlé, J. Paridaens & F. Hardeman

Decision support systems and software tools for safety and reliability
Complex, expert based multi-role assessment system for small and medium enterprises 99
S.G. Kovacs & M. Costescu
DETECT: A novel framework for the detection of attacks to critical infrastructures 105
F. Flammini, A. Gaglione, N. Mazzocca & C. Pragliola
Methodology and software platform for multi-layer causal modeling 113
K.M. Groth, C. Wang, D. Zhu & A. Mosleh
SCAIS (Simulation Code System for Integrated Safety Assessment): Current
status and applications 121
J.M. Izquierdo, J. Hortal, M. Sánchez, E. Meléndez, R. Herrero, J. Gil, L. Gamo,
I. Fernández, J. Esperón, P. González, C. Queral, A. Expósito & G. Rodríguez
Using GIS and multivariate analyses to visualize risk levels and spatial patterns
of severe accidents in the energy sector 129
P. Burgherr
Weak signals of potential accidents at "Seveso" establishments 137
P.A. Bragatto, P. Agnello, S. Ansaldi & P. Pittiglio

Dynamic reliability
A dynamic fault classification scheme 147
B. Fechner
Importance factors in dynamic reliability 155
R. Eymard, S. Mercier & M. Roussignol
TSD, a SCAIS suitable variant of the SDTPD 163
J.M. Izquierdo & I. Cañamón

Fault identification and diagnostics


Application of a vapour compression chiller lumped model for fault detection 175
J. Navarro-Esbrí, A. Real, D. Ginestar & S. Martorell
Automatic source code analysis of failure modes causing error propagation 183
S. Sarshar & R. Winther
Development of a prognostic tool to perform reliability analysis 191
M. El-Koujok, R. Gouriveau & N. Zerhouni
Fault detection and diagnosis in monitoring a hot dip galvanizing line using
multivariate statistical process control 201
J.C. García-Díaz
Fault identification, diagnosis and compensation of flatness errors in hard turning
of tool steels 205
F. Veiga, J. Fernández, E. Viles, M. Arizmendi, A. Gil & M.L. Penalva

From diagnosis to prognosis: A maintenance experience for an electric locomotive 211
O. Borgia, F. De Carlo & M. Tucci

Human factors
A study on the validity of R-TACOM measure by comparing operator response
time data 221
J. Park & W. Jung
An evaluation of the Enhanced Bayesian THERP method using simulator data 227
K. Bladh, J.-E. Holmberg & P. Pyy
Comparing CESA-Q human reliability analysis with evidence from simulator:
A first attempt 233
L. Podofillini & B. Reer
Exploratory and confirmatory analysis of the relationship between social norms
and safety behavior 243
C. Fugas, S.A. da Silva & J.L. Melià
Functional safety and layer of protection analysis with regard to human factors 249
K.T. Kosmowski
How employees’ use of information technology systems shape reliable operations
of large scale technological systems 259
T.K. Andersen, P. Næsje, H. Torvatn & K. Skarholt
Incorporating simulator evidence into HRA: Insights from the data analysis of the
international HRA empirical study 267
S. Massaiu, P.Ø. Braarud & M. Hildebrandt
Insights from the "HRA international empirical study": How to link data
and HRA with MERMOS 275
H. Pesme, P. Le Bot & P. Meyer
Operators’ response time estimation for a critical task using the fuzzy logic theory 281
M. Konstandinidou, Z. Nivolianitou, G. Simos, C. Kiranoudis & N. Markatos
The concept of organizational supportiveness 291
J. Nicholls, J. Harvey & G. Erdos
The influence of personal variables on changes in driver behaviour 299
S. Heslop, J. Harvey, N. Thorpe & C. Mulley
The key role of expert judgment in CO2 underground storage projects 305
C. Vivalda & L. Jammes

Integrated risk management and risk-informed decision-making


All-hazards risk framework—An architecture model 315
S. Verga
Comparisons and discussion of different integrated risk approaches 323
R. Steen & T. Aven
Management of risk caused by domino effect resulting from design
system dysfunctions 331
S. Sperandio, V. Robin & Ph. Girard
On some aspects related to the use of integrated risk analyses for the decision
making process, including its use in the non-nuclear applications 341
D. Serbanescu, A.L. Vetere Arellano & A. Colli
On the usage of weather derivatives in Austria—An empirical study 351
M. Bank & R. Wiesner

Precaution in practice? The case of nanomaterial industry 361
H. Kastenholz, A. Helland & M. Siegrist
Risk based maintenance prioritisation 365
G. Birkeland, S. Eisinger & T. Aven
Shifts in environmental health risk governance: An analytical framework 369
H.A.C. Runhaar, J.P. van der Sluijs & P.P.J. Driessen
What does "safety margin" really mean? 379
J. Hortal, R. Mendizábal & F. Pelayo

Legislative dimensions of risk management


Accidents, risk analysis and safety management—Different perspective at a
Swedish safety authority 391
O. Harrami, M. Strömgren, U. Postgård & R. All
Evaluation of risk and safety issues at the Swedish Rescue Services Agency 399
O. Harrami, U. Postgård & M. Strömgren
Regulation of information security and the impact on top management commitment—
A comparative study of the electric power supply sector and the finance sector 407
J.M. Hagen & E. Albrechtsen
The unintended consequences of risk regulation 415
B.H. MacGillivray, R.E. Alcock & J.S. Busby

Maintenance modelling and optimisation


A hybrid age-based maintenance policy for heterogeneous items 423
P.A. Scarf, C.A.V. Cavalcante, R.W. Dwight & P. Gordon
A stochastic process model for computing the cost of a condition-based maintenance plan 431
J.A.M. van der Weide, M.D. Pandey & J.M. van Noortwijk
A study about influence of uncertain distribution inputs in maintenance optimization 441
R. Mullor, S. Martorell, A. Sánchez & N. Martinez-Alzamora
Aging processes as a primary aspect of predicting reliability and life of aeronautical hardware 449
J. Żurek, M. Zieja, G. Kowalczyk & T. Niezgoda
An alternative imperfect preventive maintenance model 455
J. Clavareau & P.E. Labeau
An imperfect preventive maintenance model with dependent failure modes 463
I.T. Castro
Condition-based maintenance approaches for deteriorating system influenced
by environmental conditions 469
E. Deloux, B. Castanier & C. Bérenguer
Condition-based maintenance by particle filtering 477
F. Cadini, E. Zio & D. Avram
Corrective maintenance for aging air conditioning systems 483
I. Frenkel, L. Khvatskin & A. Lisnianski
Exact reliability quantification of highly reliable systems with maintenance 489
R. Briš
Genetic algorithm optimization of preventive maintenance scheduling for repairable
systems modeled by generalized renewal process 497
P.A.A. Garcia, M.C. Sant’Ana, V.C. Damaso & P.F. Frutuoso e Melo

Maintenance modelling integrating human and material resources 505
S. Martorell, M. Villamizar, A. Sánchez & G. Clemente
Modelling competing risks and opportunistic maintenance with expert judgement 515
T. Bedford & B.M. Alkali
Modelling different types of failure and residual life estimation for condition-based maintenance 523
M.J. Carr & W. Wang
Multi-component systems modeling for quantifying complex maintenance strategies 531
V. Zille, C. Bérenguer, A. Grall, A. Despujols & J. Lonchampt
Multiobjective optimization of redundancy allocation in systems with imperfect repairs via
ant colony and discrete event simulation 541
I.D. Lins & E. López Droguett
Non-homogeneous Markov reward model for aging multi-state system under corrective
maintenance 551
A. Lisnianski & I. Frenkel
On the modeling of ageing using Weibull models: Case studies 559
P. Praks, H. Fernandez Bacarizo & P.-E. Labeau
On-line condition-based maintenance for systems with several modes of degradation 567
A. Ponchet, M. Fouladirad & A. Grall
Opportunity-based age replacement for a system under two types of failures 575
F.G. Badía & M.D. Berrade
Optimal inspection intervals for maintainable equipment 581
O. Hryniewicz
Optimal periodic inspection of series systems with revealed and unrevealed failures 587
M. Carvalho, E. Nunes & J. Telhada
Optimal periodic inspection/replacement policy for deteriorating systems with explanatory
variables 593
X. Zhao, M. Fouladirad, C. Bérenguer & L. Bordes
Optimal replacement policy for components with general failure rates submitted to obsolescence 603
S. Mercier
Optimization of the maintenance function at a company 611
S. Adjabi, K. Adel-Aissanou & M. Azi
Planning and scheduling maintenance resources in a complex system 619
M. Newby & C. Barker
Preventive maintenance planning using prior expert knowledge and multicriteria method
PROMETHEE III 627
F.A. Figueiredo, C.A.V. Cavalcante & A.T. de Almeida
Profitability assessment of outsourcing maintenance from the producer (big rotary machine study) 635
P. Fuchs & J. Zajicek
Simulated annealing method for the selective maintenance optimization of multi-mission
series-parallel systems 641
A. Khatab, D. Ait-Kadi & A. Artiba
Study on the availability of a k-out-of-N system given limited spares under (m, NG)
maintenance policy 649
T. Zhang, H.T. Lei & B. Guo
System value trajectories, maintenance, and its present value 659
K.B. Marais & J.H. Saleh

The maintenance management framework: A practical view to maintenance management 669
A. Crespo Márquez, P. Moreu de León, J.F. Gómez Fernández, C. Parra Márquez & V. González
Workplace occupation and equipment availability and utilization, in the context of maintenance
float systems 675
I.S. Lopes, A.F. Leitão & G.A.B. Pereira

Monte Carlo methods in system safety and reliability


Availability and reliability assessment of industrial complex systems: A practical view
applied on a bioethanol plant simulation 687
V. González, C. Parra, J.F. Gómez, A. Crespo & P. Moreu de León
Handling dependencies between variables with imprecise probabilistic models 697
S. Destercke & E. Chojnacki
Monte Carlo simulation for investigating the influence of maintenance strategies on the production
availability of offshore installations 703
K.P. Chang, D. Chang, T.J. Rhee & E. Zio
Reliability analysis of discrete multi-state systems by means of subset simulation 709
E. Zio & N. Pedroni
The application of Bayesian interpolation in Monte Carlo simulations 717
M. Rajabalinejad, P.H.A.J.M. van Gelder & N. van Erp

Occupational safety
Application of virtual reality technologies to improve occupational & industrial safety
in industrial processes 727
J. Rubio, B. Rubio, C. Vaquero, N. Galarza, A. Pelaz, J.L. Ipiña, D. Sagasti & L. Jordá
Applying the resilience concept in practice: A case study from the oil and gas industry 733
L. Hansson, I. Andrade Herrera, T. Kongsvik & G. Solberg
Development of an assessment tool to facilitate OHS management based upon the safe
place, safe person, safe systems framework 739
A.-M. Makin & C. Winder
Exploring knowledge translation in occupational health using the mental models approach:
A case study of machine shops 749
A.-M. Nicol & A.C. Hurrell
Mathematical modelling of risk factors concerning work-related traffic accidents 757
C. Santamaría, G. Rubio, B. García & E. Navarro
New performance indicators for the health and safety domain: A benchmarking use perspective 761
H.V. Neto, P.M. Arezes & S.D. Sousa
Occupational risk management for fall from height 767
O.N. Aneziris, M. Konstandinidou, I.A. Papazoglou, M. Mud, M. Damen, J. Kuiper, H. Baksteen,
L.J. Bellamy, J.G. Post & J. Oh
Occupational risk management for vapour/gas explosions 777
I.A. Papazoglou, O.N. Aneziris, M. Konstandinidou, M. Mud, M. Damen, J. Kuiper, A. Bloemhoff,
H. Baksteen, L.J. Bellamy, J.G. Post & J. Oh
Occupational risk of an aluminium industry 787
O.N. Aneziris, I.A. Papazoglou & O. Doudakmani
Risk regulation bureaucracies in EU accession states: Drinking water safety in Estonia 797
K. Kangur

Organization learning
Can organisational learning improve safety and resilience during changes? 805
S.O. Johnsen & S. Håbrekke
Consequence analysis as organizational development 813
B. Moltu, A. Jarl Ringstad & G. Guttormsen
Integrated operations and leadership—How virtual cooperation influences leadership practice 821
K. Skarholt, P. Næsje, V. Hepsø & A.S. Bye
Outsourcing maintenance in services providers 829
J.F. Gómez, C. Parra, V. González, A. Crespo & P. Moreu de León
Revising rules and reviving knowledge in the Norwegian railway system 839
H.C. Blakstad, R. Rosness & J. Hovden

Risk management in systems: Learning to recognize and respond to weak signals 847
E. Guillaume
Author index 853

VOLUME 2

Reliability and safety data collection and analysis


A new step-stress Accelerated Life Testing approach: Step-Down-Stress 863
C. Zhang, Y. Wang, X. Chen & Y. Jiang
Application of a generalized lognormal distribution to engineering data fitting 869
J. Martín & C.J. Pérez
Collection and analysis of reliability data over the whole product lifetime of vehicles 875
T. Leopold & B. Bertsche
Comparison of phase-type distributions with mixed and additive Weibull models 881
M.C. Segovia & C. Guedes Soares
Evaluation methodology of industry equipment functional reliability 891
J. Kamenický
Evaluation of device reliability based on accelerated tests 899
E. Nogueira Díaz, M. Vázquez López & D. Rodríguez Cano
Evaluation, analysis and synthesis of multiple source information: An application to nuclear
computer codes 905
S. Destercke & E. Chojnacki
Improving reliability using new processes and methods 913
S.J. Park, S.D. Park & K.T. Jo
Life test applied to Brazilian friction-resistant low alloy-high strength steel rails 919
D.I. De Souza, A. Naked Haddad & D. Rocha Fonseca
Non-homogeneous Poisson Process (NHPP), stochastic model applied to evaluate the economic
impact of the failure in the Life Cycle Cost Analysis (LCCA) 929
C. Parra Márquez, A. Crespo Márquez, P. Moreu de León, J. Gómez Fernández & V. González Díaz
Risk trends, indicators and learning rates: A new case study of North sea oil and gas 941
R.B. Duffey & A.B. Skjerve
Robust estimation for an imperfect test and repair model using Gaussian mixtures 949
S.P. Wilson & S. Goyal

Risk and evidence based policy making
Environmental reliability as a requirement for defining environmental impact limits
in critical areas 957
E. Calixto & E. Lèbre La Rovere
Hazardous aid? The crowding-out effect of international charity 965
P.A. Raschky & M. Schwindt
Individual risk-taking and external effects—An empirical examination 973
S. Borsky & P.A. Raschky
Licensing a Biofuel plant transforming animal fats 981
J.-F. David
Modelling incident escalation in explosives storage 987
G. Hardman, T. Bedford, J. Quigley & L. Walls
The measurement and management of Deca-BDE—Why the continued certainty of uncertainty? 993
R.E. Alcock, B.H. McGillivray & J.S. Busby

Risk and hazard analysis


A contribution to accelerated testing implementation 1001
S. Hossein Mohammadian M., D. Aït-Kadi, A. Coulibaly & B. Mutel
A decomposition method to analyze complex fault trees 1009
S. Contini
A quantitative methodology for risk assessment of explosive atmospheres according to the
ATEX directive 1019
R. Lisi, M.F. Milazzo & G. Maschio
A risk theory based on hyperbolic parallel curves and risk assessment in time 1027
N. Popoviciu & F. Baicu
Accident occurrence evaluation of phased-mission systems composed of components
with multiple failure modes 1035
T. Kohda
Added value in fault tree analyses 1041
T. Norberg, L. Rosén & A. Lindhe
Alarm prioritization at plant design stage—A simplified approach 1049
P. Barbarini, G. Franzoni & E. Kulot
Analysis of possibilities of timing dependencies modeling—Example of logistic support system 1055
J. Magott, T. Nowakowski, P. Skrobanek & S. Werbinska
Applications of supply process reliability model 1065
A. Jodejko & T. Nowakowski
Applying optimization criteria to risk analysis 1073
H. Medina, J. Arnaldos & J. Casal
Chemical risk assessment for inspection teams during CTBT on-site inspections of sites
potentially contaminated with industrial chemicals 1081
G. Malich & C. Winder
Comparison of different methodologies to estimate the evacuation radius in the case
of a toxic release 1089
M.I. Montoya & E. Planas
Conceptualizing and managing risk networks. New insights for risk management 1097
R.W. Schröder

Developments in fault tree techniques and importance measures 1103
J.K. Vaurio
Dutch registration of risk situations 1113
J.P. van’t Sant, H.J. Manuel & A. van den Berg
Experimental study of jet fires 1119
M. Gómez-Mares, A. Palacios, A. Peiretti, M. Muñoz & J. Casal
Failure mode and effect analysis algorithm for tunneling projects 1125
K. Rezaie, V. Ebrahimipour & S. Shokravi
Fuzzy FMEA: A study case on a discontinuous distillation plant 1129
S.S. Rivera & J.E. Núñez Mc Leod
Risk analysis in extreme environmental conditions for Aconcagua Mountain station 1135
J.E. Núñez Mc Leod & S.S. Rivera
Geographic information system for evaluation of technical condition and residual life of pipelines 1141
P. Yukhymets, R. Spitsa & S. Kobelsky
Inherent safety indices for the design of layout plans 1147
A. Tugnoli, V. Cozzani, F.I. Khan & P.R. Amyotte
Minmax defense strategy for multi-state systems 1157
G. Levitin & K. Hausken
Multicriteria risk assessment for risk ranking of natural gas pipelines 1165
A.J. de M. Brito, C.A.V. Cavalcante, R.J.P. Ferreira & A.T. de Almeida
New insight into PFDavg and PFH 1173
F. Innal, Y. Dutuit, A. Rauzy & J.-P. Signoret
On causes and dependencies of errors in human and organizational barriers against major
accidents 1181
J.E. Vinnem
Quantitative risk analysis method for warehouses with packaged hazardous materials 1191
D. Riedstra, G.M.H. Laheij & A.A.C. van Vliet
Ranking the attractiveness of industrial plants to external acts of interference 1199
M. Sabatini, S. Zanelli, S. Ganapini, S. Bonvicini & V. Cozzani
Review and discussion of uncertainty taxonomies used in risk analysis 1207
T.E. Nøkland & T. Aven
Risk analysis in the frame of the ATEX Directive and the preparation of an Explosion Protection
Document 1217
A. Pey, G. Suter, M. Glor, P. Lerena & J. Campos
Risk reduction by use of a buffer zone 1223
S.I. Wijnant-Timmerman & T. Wiersma
Safety in engineering practice 1231
Z. Smalko & J. Szpytko
Why ISO 13702 and NFPA 15 standards may lead to unsafe design 1239
S. Medonos & R. Raman

Risk control in complex environments


Is there an optimal type for high reliability organization? A study of the UK offshore industry 1251
J.S. Busby, A. Collins & R. Miles
The optimization of system safety: Rationality, insurance, and optimal protection 1259
R.B. Jongejan & J.K. Vrijling

Thermal characteristic analysis of Y type zeolite by differential scanning calorimetry 1267
S.H. Wu, W.P. Weng, C.C. Hsieh & C.M. Shu
Using network methodology to define emergency response team location: The Brazilian
refinery case study 1273
E. Calixto, E. Lèbre La Rovere & J. Eustáquio Beraldo

Risk perception and communication


(Mis-)conceptions of safety principles 1283
J.-T. Gayen & H. Schäbe
Climate change in the British press: The role of the visual 1293
N.W. Smith & H. Joffe
Do the people exposed to a technological risk always want more information about it?
Some observations on cases of rejection 1301
J. Espluga, J. Farré, J. Gonzalo, T. Horlick-Jones, A. Prades, C. Oltra & J. Navajas
Media coverage, imaginary of risks and technological organizations 1309
F. Fodor & G. Deleuze
Media disaster coverage over time: Methodological issues and results 1317
M. Kuttschreuter & J.M. Gutteling
Risk amplification and zoonosis 1325
D.G. Duckett & J.S. Busby
Risk communication and addressing uncertainties in risk assessments—Presentation of a framework 1335
J. Stian Østrem, H. Thevik, R. Flage & T. Aven
Risk communication for industrial plants and radioactive waste repositories 1341
F. Gugliermetti & G. Guidi
Risk management measurement methodology: Practical procedures and approaches for risk
assessment and prediction 1351
R.B. Duffey & J.W. Saull
Risk perception and cultural theory: Criticism and methodological orientation 1357
C. Kermisch & P.-E. Labeau
Standing in the shoes of hazard managers: An experiment on avalanche risk perception 1365
C.M. Rheinberger
The social perception of nuclear fusion: Investigating lay understanding and reasoning about
the technology 1371
A. Prades, C. Oltra, J. Navajas, T. Horlick-Jones & J. Espluga

Safety culture
"Us" and "Them": The impact of group identity on safety critical behaviour 1377
R.J. Bye, S. Antonsen & K.M. Vikland
Does change challenge safety? Complexity in the civil aviation transport system 1385
S. Høyland & K. Aase
Electromagnetic fields in the industrial environment 1395
J. Fernández, A. Quijano, M.L. Soriano & V. Fuster
Electrostatic charges in industrial environments 1401
P. LLovera, A. Quijano, A. Soria & V. Fuster
Empowering operations and maintenance: Safe operations with the "one directed team"
organizational model at the Kristin asset 1407
P. Næsje, K. Skarholt, V. Hepsø & A.S. Bye

Leadership and safety climate in the construction industry 1415
J.L. Meliá, M. Becerril, S.A. Silva & K. Mearns
Local management and its impact on safety culture and safety within Norwegian shipping 1423
H.A. Oltedal & O.A. Engen
Quantitative analysis of the anatomy and effectiveness of occupational safety culture 1431
P. Trucco, M. De Ambroggi & O. Grande
Safety management and safety culture assessment in Germany 1439
H.P. Berg
The potential for error in communications between engineering designers 1447
J. Harvey, R. Jamieson & K. Pearce

Safety management systems


Designing the safety policy of IT system on the example of a chosen company 1455
I.J. Jóźwiak, T. Nowakowski & K. Jóźwiak
Determining and verifying the safety integrity level of the control and protection systems
under uncertainty 1463
T. Barnert, K.T. Kosmowski & M. Sliwinski
Drawing up and running a Security Plan in an SME type company—An easy task? 1473
M. Gerbec
Efficient safety management for subcontractor at construction sites 1481
K.S. Son & J.J. Park
Production assurance and reliability management—A new international standard 1489
H. Kortner, K.-E. Haugen & L. Sunde
Risk management model for industrial plants maintenance 1495
N. Napolitano, M. De Minicis, G. Di Gravio & M. Tronci
Some safety aspects on multi-agent and CBTC implementation for subway control systems 1503
F.M. Rachel & P.S. Cugnasca

Software reliability
Assessment of software reliability and the efficiency of corrective actions during the software
development process 1513
R. Savić
ERTMS, deals on wheels? An inquiry into a major railway project 1519
J.A. Stoop, J.H. Baggen, J.M. Vleugel & J.L.M. Vrancken
Guaranteed resource availability in a website 1525
V.P. Koutras & A.N. Platis
Reliability oriented electronic design automation tool 1533
J. Marcos, D. Bóveda, A. Fernández & E. Soto
Reliable software for partitionable networked environments—An experience report 1539
S. Beyer, J.C. García Ortiz, F.D. Muñoz-Escoí, P. Galdámez, L. Froihofer,
K.M. Goeschka & J. Osrael
SysML aided functional safety assessment 1547
M. Larisch, A. Hänle, U. Siebold & I. Häring
UML safety requirement specification and verification 1555
A. Hänle & I. Häring

Stakeholder and public involvement in risk governance
Assessment and monitoring of reliability and robustness of offshore wind energy converters 1567
S. Thöns, M.H. Faber, W. Rücker & R. Rohrmann
Building resilience to natural hazards. Practices and policies on governance and mitigation
in the central region of Portugal 1577
J.M. Mendes & A.T. Tavares
Governance of flood risks in The Netherlands: Interdisciplinary research into the role and
meaning of risk perception 1585
M.S. de Wit, H. van der Most, J.M. Gutteling & M. Bočkarjova
Public intervention for better governance—Does it matter? A study of the "Leros Strength" case 1595
P.H. Lindøe & J.E. Karlsen
Reasoning about safety management policy in everyday terms 1601
T. Horlick-Jones
Using stakeholders’ expertise in EMF and soil contamination to improve the management
of public policies dealing with modern risk: When uncertainty is on the agenda 1609
C. Fallon, G. Joris & C. Zwetkoff

Structural reliability and design codes


Adaptive discretization of 1D homogeneous random fields 1621
D.L. Allaix, V.I. Carbone & G. Mancini
Comparison of methods for estimation of concrete strength 1629
M. Holicky, K. Jung & M. Sykora
Design of structures for accidental design situations 1635
J. Marková & K. Jung
Developing fragility function for a timber structure subjected to fire 1641
E.R. Vaidogas, Virm. Juocevičius & Virg. Juocevičius
Estimations in the random fatigue-limit model 1651
C.-H. Chiu & W.-T. Huang
Limitations of the Weibull distribution related to predicting the probability of failure
initiated by flaws 1655
M.T. Todinov
Simulation techniques of non-Gaussian random loadings in structural reliability analysis 1663
Y. Jiang, C. Zhang, X. Chen & J. Tao
Special features of the collection and analysis of snow loads 1671
Z. Sadovský, P. Faško, K. Mikulová, J. Pecho & M. Vojtek
Structural safety under extreme construction loads 1677
V. Juocevičius & A. Kudzys
The modeling of time-dependent reliability of deteriorating structures 1685
A. Kudzys & O. Lukoševiciene
Author index 1695

VOLUME 3

System reliability analysis


A copula-based approach for dependability analyses of fault-tolerant systems with
interdependent basic events 1705
M. Walter, S. Esch & P. Limbourg

A depth first search algorithm for optimal arrangements in a circular
consecutive-k-out-of-n:F system 1715
K. Shingyochi & H. Yamamoto
A joint reliability-redundancy optimization approach for multi-state series-parallel systems 1723
Z. Tian, G. Levitin & M.J. Zuo
A new approach to assess the reliability of a multi-state system with dependent components 1731
M. Samrout & E. Chatelet
A reliability analysis and decision making process for autonomous systems 1739
R. Remenyte-Prescott, J.D. Andrews, P.W.H. Chung & C.G. Downes
Advanced discrete event simulation methods with application to importance measure
estimation 1747
A.B. Huseby, K.A. Eide, S.L. Isaksen, B. Natvig & J. Gåsemyr
Algorithmic and computational analysis of a multi-component complex system 1755
J.E. Ruiz-Castro, R. Pérez-Ocón & G. Fernández-Villodre
An efficient reliability computation of generalized multi-state k-out-of-n systems 1763
S.V. Amari
Application of the fault tree analysis for assessment of the power system reliability 1771
A. Volkanovski, M. Čepin & B. Mavko
BDMP (Boolean logic driven Markov processes) as an alternative to event trees 1779
M. Bouissou
Bivariate distribution based passive system performance assessment 1787
L. Burgazzi
Calculating steady state reliability indices of multi-state systems using dual number algebra 1795
E. Korczak
Concordance analysis of importance measure 1803
C.M. Rocco S.
Contribution to availability assessment of systems with one shot items 1807
D. Valis & M. Koucky
Contribution to modeling of complex weapon systems reliability 1813
D. Valis, Z. Vintr & M. Koucky
Delayed system reliability and uncertainty analysis 1819
R. Alzbutas, V. Janilionis & J. Rimas
Efficient generation and representation of failure lists out of an information flux model
for modeling safety critical systems 1829
M. Pock, H. Belhadaoui, O. Malassé & M. Walter
Evaluating algorithms for the system state distribution of multi-state k-out-of-n:F system 1839
T. Akiba, H. Yamamoto, T. Yamaguchi, K. Shingyochi & Y. Tsujimura
First-passage time analysis for Markovian deteriorating model 1847
G. Dohnal
Model of logistic support system with time dependency 1851
S. Werbinska
Modeling failure cascades in network systems due to distributed random disturbances 1861
E. Zio & G. Sansavini
Modeling of the changes of graphite bore in RBMK-1500 type nuclear reactor 1867
I. Žutautaite-Šeputiene, J. Augutis & E. Ušpuras

Modelling multi-platform phased mission system reliability 1873
D.R. Prescott, J.D. Andrews & C.G. Downes
Modelling test strategies effects on the probability of failure on demand for safety
instrumented systems 1881
A.C. Torres-Echeverria, S. Martorell & H.A. Thompson
New insight into measures of component importance in production systems 1891
S.L. Isaksen
New virtual age models for bathtub shaped failure intensities 1901
Y. Dijoux & E. Idée
On some approaches to defining virtual age of non-repairable objects 1909
M.S. Finkelstein
On the application and extension of system signatures in engineering reliability 1915
J. Navarro, F.J. Samaniego, N. Balakrishnan & D. Bhattacharya
PFD of higher-order configurations of SIS with partial stroke testing capability 1919
L.F.S. Oliveira
Power quality as accompanying factor in reliability research of electric engines 1929
I.J. Jóźwiak, K. Kujawski & T. Nowakowski
RAMS and performance analysis 1937
X. Quayzin, E. Arbaretier, Z. Brik & A. Rauzy
Reliability evaluation of complex system based on equivalent fault tree 1943
Z. Yufang, Y. Hong & L. Jun
Reliability evaluation of III-V Concentrator solar cells 1949
N. Núñez, J.R. González, M. Vázquez, C. Algora & I. Rey-Stolle
Reliability of a degrading system under inspections 1955
D. Montoro-Cazorla, R. Pérez-Ocón & M.C. Segovia
Reliability prediction using petri nets for on-demand safety systems with fault detection 1961
A.V. Kleyner & V. Volovoi
Reliability, availability and cost analysis of large multi-state systems with ageing components 1969
K. Kolowrocki
Reliability, availability and risk evaluation of technical systems in variable operation conditions 1985
K. Kolowrocki & J. Soszynska
Representation and estimation of multi-state system reliability by decision diagrams 1995
E. Zaitseva & S. Puuronen
Safety instrumented system reliability evaluation with influencing factors 2003
F. Brissaud, D. Charpentier, M. Fouladirad, A. Barros & C. Bérenguer
Smooth estimation of the availability function of a repairable system 2013
M.L. Gámiz & Y. Román
System design optimisation involving phased missions 2021
D. Astapenko & L.M. Bartlett
The Natvig measures of component importance in repairable systems applied to an offshore
oil and gas production system 2029
B. Natvig, K.A. Eide, J. Gåsemyr, A.B. Huseby & S.L. Isaksen
The operation quality assessment as an initial part of reliability improvement and low cost
automation of the system 2037
L. Muslewski, M. Woropay & G. Hoppe

Three-state modelling of dependent component failures with domino effects 2045
U.K. Rakowsky
Variable ordering techniques for the application of Binary Decision Diagrams on PSA
linked Fault Tree models 2051
C. Ibáñez-Llano, A. Rauzy, E. Meléndez & F. Nieto
Weaknesses of classic availability calculations for interlinked production systems
and their overcoming 2061
D. Achermann

Uncertainty and sensitivity analysis


A critique of Info-Gap’s robustness model 2071
M. Sniedovich
Alternative representations of uncertainty in system reliability and risk analysis—Review
and discussion 2081
R. Flage, T. Aven & E. Zio
Dependence modelling with copula in probabilistic studies, a practical approach
based on numerical experiments 2093
A. Dutfoy & R. Lebrun
Event tree uncertainty analysis by Monte Carlo and possibility theory 2101
P. Baraldi & E. Zio
Global sensitivity analysis based on entropy 2107
B. Auder & B. Iooss
Impact of uncertainty affecting reliability models on warranty contracts 2117
G. Fleurquin, P. Dehombreux & P. Dersin
Influence of epistemic uncertainties on the probabilistic assessment of an emergency operating
procedure in a nuclear power plant 2125
M. Kloos & J. Peschke
Numerical study of algorithms for metamodel construction and validation 2135
B. Iooss, L. Boussouf, A. Marrel & V. Feuillard
On the variance upper bound theorem and its applications 2143
M.T. Todinov
Reliability assessment under Uncertainty Using Dempster-Shafer and Vague Set Theories 2151
S. Pashazadeh & N. Grachorloo
Types and sources of uncertainties in environmental accidental risk assessment: A case study
for a chemical factory in the Alpine region of Slovenia 2157
M. Gerbec & B. Kontic
Uncertainty estimation for monotone and binary systems 2167
A.P. Ulmeanu & N. Limnios

Industrial and service sectors


Aeronautics and aerospace
Condition based operational risk assessment for improved aircraft operability 2175
A. Arnaiz, M. Buderath & S. Ferreiro
Is optimized design of satellites possible? 2185
J. Faure, R. Laulheret & A. Cabarbaye

Model of air traffic in terminal area for ATFM safety analysis 2191
J. Skorupski & A.W. Stelmach
Predicting airport runway conditions based on weather data 2199
A.B. Huseby & M. Rabbe
Safety considerations in complex airborne systems 2207
M.J.R. Lemes & J.B. Camargo Jr
The Preliminary Risk Analysis approach: Merging space and aeronautics methods 2217
J. Faure, R. Laulheret & A. Cabarbaye
Using a Causal model for Air Transport Safety (CATS) for the evaluation of alternatives 2223
B.J.M. Ale, L.J. Bellamy, R.P. van der Boom, J. Cooper, R.M. Cooke, D. Kurowicka, P.H. Lin,
O. Morales, A.L.C. Roelen & J. Spouge

Automotive engineering
An approach to describe interactions in and between mechatronic systems 2233
J. Gäng & B. Bertsche
Influence of the mileage distribution on reliability prognosis models 2239
A. Braasch, D. Althaus & A. Meyna
Reliability prediction for automotive components using Real-Parameter Genetic Algorithm 2245
J. Hauschild, A. Kazeminia & A. Braasch
Stochastic modeling and prediction of catalytic converters degradation 2251
S. Barone, M. Giorgio, M. Guida & G. Pulcini
Towards a better interaction between design and dependability analysis: FMEA derived from
UML/SysML models 2259
P. David, V. Idasiak & F. Kratz

Biotechnology and food industry


Application of tertiary mathematical models for evaluating the presence of staphylococcal
enterotoxin in lactic acid cheese 2269
I. Steinka & A. Blokus-Roszkowska
Assessment of the risk to company revenue due to deviations in honey quality 2275
E. Doménech, I. Escriche & S. Martorell
Attitudes of Japanese and Hawaiian toward labeling genetically modified fruits 2285
S. Shehata
Ensuring honey quality by means of effective pasteurization 2289
E. Doménech, I. Escriche & S. Martorell
Exposure assessment model to combine thermal inactivation (log reduction) and thermal injury
(heat-treated spore lag time) effects on non-proteolytic Clostridium botulinum 2295
J.-M. Membrë, E. Wemmenhove & P. McClure
Public information requirements on health risk of mercury in fish (1): Perception
and knowledge of the public about food safety and the risk of mercury 2305
M. Kosugi & H. Kubota
Public information requirements on health risks of mercury in fish (2): A comparison of mental
models of experts and public in Japan 2311
H. Kubota & M. Kosugi
Review of diffusion models for the social amplification of risk of food-borne zoonoses 2317
J.P. Mehers, H.E. Clough & R.M. Christley

Risk perception and communication of food safety and food technologies in Flanders,
The Netherlands, and the United Kingdom 2325
U. Maris
Synthesis of reliable digital microfluidic biochips using Monte Carlo simulation 2333
E. Maftei, P. Pop & F. Popenţiu Vlădicescu

Chemical process industry


Accidental scenarios in the loss of control of chemical processes: Screening the impact
profile of secondary substances 2345
M. Cordella, A. Tugnoli, P. Morra, V. Cozzani & F. Barontini
Adapting the EU Seveso II Directive for GHS: Initial UK study on acute toxicity to people 2353
M.T. Trainor, A.J. Wilday, M. Moonis, A.L. Rowbotham, S.J. Fraser, J.L. Saw & D.A. Bosworth
An advanced model for spreading and evaporation of accidentally released hazardous
liquids on land 2363
I.J.M. Trijssenaar-Buhre, R.P. Sterkenburg & S.I. Wijnant-Timmerman
Influence of safety systems on land use planning around Seveso sites; example of measures
chosen for a fertiliser company located close to a village 2369
C. Fiévez, C. Delvosalle, N. Cornil, L. Servranckx, F. Tambour, B. Yannart & F. Benjelloun
Performance evaluation of manufacturing systems based on dependability management
indicators-case study: Chemical industry 2379
K. Rezaie, M. Dehghanbaghi & V. Ebrahimipour
Protection of chemical industrial installations from intentional adversary acts: Comparison
of the new security challenges with the existing safety practices in Europe 2389
M.D. Christou
Quantitative assessment of domino effect in an extended industrial area 2397
G. Antonioni, G. Spadoni, V. Cozzani, C. Dondi & D. Egidi
Reaction hazard of cumene hydroperoxide with sodium hydroxide by isothermal calorimetry 2405
Y.-P. Chou, S.-H. Wu, C.-M. Shu & H.-Y. Hou
Reliability study of shutdown process through the analysis of decision making in chemical plants.
Case study: South America, Spain and Portugal 2409
L. Amendola, M.A. Artacho & T. Depool
Study of the application of risk management practices in shutdown chemical process 2415
L. Amendola, M.A. Artacho & T. Depool
Thirty years after the first HAZOP guideline publication. Considerations 2421
J. Dunjó, J.A. Vílchez & J. Arnaldos

Civil engineering
Decision tools for risk management support in construction industry 2431
S. Mehicic Eberhardt, S. Moeller, M. Missler-Behr & W. Kalusche
Definition of safety and the existence of ‘‘optimal safety’’ 2441
D. Proske
Failure risk analysis in Water Supply Networks 2447
A. Carrión, A. Debón, E. Cabrera, M.L. Gamiz & H. Solano
Hurricane vulnerability of multi-story residential buildings in Florida 2453
G.L. Pita, J.-P. Pinelli, C.S. Subramanian, K. Gurley & S. Hamid
Risk management system in water-pipe network functioning 2463
B. Tchórzewska-Cieślak

Use of extreme value theory in engineering design 2473
E. Castillo, C. Castillo & R. Mínguez

Critical infrastructures
A model for vulnerability analysis of interdependent infrastructure networks 2491
J. Johansson & H. Jönsson
Exploiting stochastic indicators of interdependent infrastructures: The service availability of
interconnected networks 2501
G. Bonanni, E. Ciancamerla, M. Minichino, R. Clemente, A. Iacomini, A. Scarlatti,
E. Zendri & R. Terruggia
Proactive risk assessment of critical infrastructures 2511
T. Uusitalo, R. Koivisto & W. Schmitz
Seismic assessment of utility systems: Application to water, electric power and transportation
networks 2519
C. Nuti, A. Rasulo & I. Vanzi
Author index 2531

VOLUME 4

Electrical and electronic engineering


Balancing safety and availability for an electronic protection system 2541
S. Wagner, I. Eusgeld, W. Kröger & G. Guaglio
Evaluation of important reliability parameters using VHDL-RTL modelling and information
flow approach 2549
M. Jallouli, C. Diou, F. Monteiro, A. Dandache, H. Belhadaoui, O. Malassé, G. Buchheit,
J.F. Aubry & H. Medromi

Energy production and distribution


Application of Bayesian networks for risk assessment in electricity distribution system
maintenance management 2561
D.E. Nordgård & K. Sand
Incorporation of ageing effects into reliability model for power transmission network 2569
V. Matuzas & J. Augutis
Mathematical simulation of energy supply disturbances 2575
J. Augutis, R. Krikštolaitis, V. Matuzienė & S. Pečiulytė
Risk analysis of the electric power transmission grid 2581
L.M. Pedersen & H.H. Thorstad
Security of gas supply to a gas plant from cave storage using discrete-event simulation 2587
J.D. Amaral Netto, L.F.S. Oliveira & D. Faertes
SES RISK a new framework to support decisions on energy supply 2593
D. Serbanescu & A.L. Vetere Arellano
Specification of reliability benchmarks for offshore wind farms 2601
D. McMillan & G.W. Ault

Health and medicine


Bayesian statistical meta-analysis of epidemiological data for QRA 2609
I. Albert, E. Espié, A. Gallay, H. De Valk, E. Grenier & J.-B. Denis

Cyanotoxins and health risk assessment 2613
J. Kellner, F. Božek, J. Navrátil & J. Dvořák
The estimation of health effect risks based on different sampling intervals of meteorological data 2619
J. Jeong & S. Hoon Han

Information technology and telecommunications


A bi-objective model for routing and wavelength assignment in resilient WDM networks 2627
T. Gomes, J. Craveirinha, C. Simões & J. Clímaco
Formal reasoning regarding error propagation in multi-process software architectures 2635
F. Sætre & R. Winther
Implementation of risk and reliability analysis techniques in ICT 2641
R. Mock, E. Kollmann & E. Bünzli
Information security measures influencing user performance 2649
E. Albrechtsen & J.M. Hagen
Reliable network server assignment using an ant colony approach 2657
S. Kulturel-Konak & A. Konak
Risk and safety as system-theoretic concepts—A formal view on system-theory
by means of petri-nets 2665
J. Rudolf Müller & E. Schnieder

Insurance and finance


Behaviouristic approaches to insurance decisions in the context of natural hazards 2675
M. Bank & M. Gruber
Gaming tool as a method of natural disaster risk education: Educating the relationship
between risk and insurance 2685
T. Unagami, T. Motoyoshi & J. Takai
Reliability-based risk-metric computation for energy trading 2689
R. Mínguez, A.J. Conejo, R. García-Bertrand & E. Castillo

Manufacturing
A decision model for preventing knock-on risk inside industrial plant 2701
M. Grazia Gnoni, G. Lettera & P. Angelo Bragatto
Condition based maintenance optimization under cost and profit criteria for manufacturing
equipment 2707
A. Sánchez, A. Goti & V. Rodríguez
PRA-type study adapted to the multi-crystalline silicon photovoltaic cells manufacture
process 2715
A. Colli, D. Serbanescu & B.J.M. Ale

Mechanical engineering
Developing a new methodology for OHS assessment in small and medium enterprises 2727
C. Pantanali, A. Meneghetti, C. Bianco & M. Lirussi
Optimal Pre-control as a tool to monitor the reliability of a manufacturing system 2735
S. San Matías & V. Giner-Bosch
The respirable crystalline silica in the ceramic industries—Sampling, exposure
and toxicology 2743
E. Monfort, M.J. Ibáñez & A. Escrig

Natural hazards
A framework for the assessment of the industrial risk caused by floods 2749
M. Campedel, G. Antonioni, V. Cozzani & G. Di Baldassarre
A simple method of risk potential analysis for post-earthquake fires 2757
J.L. Su, C.C. Wu, K.S. Fan & J.R. Chen
Applying the SDMS model to manage natural disasters in Mexico 2765
J.R. Santos-Reyes & A.N. Beard
Decision making tools for natural hazard risk management—Examples from Switzerland 2773
M. Bründl, B. Krummenacher & H.M. Merz
How to motivate people to assume responsibility and act upon their own protection from flood
risk in The Netherlands if they think they are perfectly safe? 2781
M. Bočkarjova, A. van der Veen & P.A.T.M. Geurts
Integral risk management of natural hazards—A system analysis of operational application
to rapid mass movements 2789
N. Bischof, H. Romang & M. Bründl
Risk based approach for a long-term solution of coastal flood defences—A Vietnam case 2797
C. Mai Van, P.H.A.J.M. van Gelder & J.K. Vrijling
River system behaviour effects on flood risk 2807
T. Schweckendiek, A.C.W.M. Vrouwenvelder, M.C.L.M. van Mierlo, E.O.F. Calle & W.M.G. Courage
Valuation of flood risk in The Netherlands: Some preliminary results 2817
M. Bočkarjova, P. Rietveld & E.T. Verhoef

Nuclear engineering
An approach to integrate thermal-hydraulic and probabilistic analyses in addressing
safety margins estimation accounting for uncertainties 2827
S. Martorell, Y. Nebot, J.F. Villanueva, S. Carlos, V. Serradell, F. Pelayo & R. Mendizábal
Availability of alternative sources for heat removal in case of failure of the RHRS during
midloop conditions addressed in LPSA 2837
J.F. Villanueva, S. Carlos, S. Martorell, V. Serradell, F. Pelayo & R. Mendizábal
Complexity measures of emergency operating procedures: A comparison study with data
from a simulated computerized procedure experiment 2845
L.Q. Yu, Z.Z. Li, X.L. Dong & S. Xu
Distinction impossible!: Comparing risks between Radioactive Wastes Facilities and Nuclear
Power Stations 2851
S. Kim & S. Cho
Heat-up calculation to screen out the room cooling failure function from a PSA model 2861
M. Hwang, C. Yoon & J.-E. Yang
Investigating the material limits on social construction: Practical reasoning about nuclear
fusion and other technologies 2867
T. Horlick-Jones, A. Prades, C. Oltra, J. Navajas & J. Espluga
Neural networks and order statistics for quantifying nuclear power plants safety margins 2873
E. Zio, F. Di Maio, S. Martorell & Y. Nebot
Probabilistic safety assessment for other modes than power operation 2883
M. Čepin & R. Prosen
Probabilistic safety margins: Definition and calculation 2891
R. Mendizábal

Reliability assessment of the thermal hydraulic phenomena related to a CAREM-like
passive RHR System 2899
G. Lorenzo, P. Zanocco, M. Giménez, M. Marquès, B. Iooss, R. Bolado Lavín, F. Pierro,
G. Galassi, F. D’Auria & L. Burgazzi
Some insights from the observation of nuclear power plant operators’ management of simulated
abnormal situations 2909
M.C. Kim & J. Park
Vital area identification using fire PRA and RI-ISI results in UCN 4 nuclear power plant 2913
K.Y. Kim, Y. Choi & W.S. Jung

Offshore oil and gas


A new approach for follow-up of safety instrumented systems in the oil and gas industry 2921
S. Hauge & M.A. Lundteigen
Consequence based methodology to determine acceptable leakage rate through closed safety
critical valves 2929
W. Røed, K. Haver, H.S. Wiencke & T.E. Nøkland
FAMUS: Applying a new tool for integrating flow assurance and RAM analysis 2937
Ø. Grande, S. Eisinger & S.L. Isaksen
Fuzzy reliability analysis of corroded oil and gas pipes 2945
M. Singh & T. Markeset
Life cycle cost analysis in design of oil and gas production facilities to be used in harsh,
remote and sensitive environments 2955
D. Kayrbekova & T. Markeset
Line pack management for improved regularity in pipeline gas transportation networks 2963
L. Frimannslund & D. Haugland
Optimization of proof test policies for safety instrumented systems using multi-objective
genetic algorithms 2971
A.C. Torres-Echeverria, S. Martorell & H.A. Thompson
Paperwork, management, and safety: Towards a bureaucratization of working life
and a lack of hands-on supervision 2981
G.M. Lamvik, P.C. Næsje, K. Skarholt & H. Torvatn
Preliminary probabilistic study for risk management associated to casing long-term integrity
in the context of CO2 geological sequestration—Recommendations for cement plug geometry 2987
Y. Le Guen, O. Poupard, J.-B. Giraud & M. Loizzo
Risk images in integrated operations 2997
C.K. Tveiten & P.M. Schiefloe

Policy decisions
Dealing with nanotechnology: Do the boundaries matter? 3007
S. Brunet, P. Delvenne, C. Fallon & P. Gillon
Factors influencing the public acceptability of the LILW repository 3015
N. Železnik, M. Polič & D. Kos
Risk futures in Europe: Perspectives for future research and governance. Insights from a EU
funded project 3023
S. Menoni
Risk management strategies under climatic uncertainties 3031
U.S. Brandt

Safety representative and managers: Partners in health and safety? 3039
T. Kvernberg Andersen, H. Torvatn & U. Forseth
Stop in the name of safety—The right of the safety representative to halt dangerous work 3047
U. Forseth, H. Torvatn & T. Kvernberg Andersen
The VDI guideline on requirements for the qualification of reliability engineers—Curriculum
and certification process 3055
U.K. Rakowsky

Public planning
Analysing analyses—An approach to combining several risk and vulnerability analyses 3061
J. Borell & K. Eriksson
Land use planning methodology used in Walloon region (Belgium) for tank farms of gasoline
and diesel oil 3067
F. Tambour, N. Cornil, C. Delvosalle, C. Fiévez, L. Servranckx, B. Yannart & F. Benjelloun

Security and protection


‘‘Protection from half-criminal windows breakers to mass murderers with nuclear weapons’’:
Changes in the Norwegian authorities’ discourses on the terrorism threat 3077
S.H. Jore & O. Njå
A preliminary analysis of volcanic Na-Tech risks in the Vesuvius area 3085
E. Salzano & A. Basco
Are safety and security in industrial systems antagonistic or complementary issues? 3093
G. Deleuze, E. Chatelet, P. Laclemence, J. Piwowar & B. Affeltranger
Assessment of energy supply security indicators for Lithuania 3101
J. Augutis, R. Krikštolaitis, V. Matuziene & S. Pečiulytė
Enforcing application security—Fixing vulnerabilities with aspect oriented programming 3109
J. Wang & J. Bigham
Governmental risk communication: Communication guidelines in the context of terrorism
as a new risk 3117
I. Stevens & G. Verleye
On combination of Safety Integrity Levels (SILs) according to IEC61508 merging rules 3125
Y. Langeron, A. Barros, A. Grall & C. Bérenguer
On the methods to model and analyze attack scenarios with Fault Trees 3135
G. Renda, S. Contini & G.G.M. Cojazzi
Risk management for terrorist actions using geoevents 3143
G. Maschio, M.F. Milazzo, G. Ancione & R. Lisi

Surface transportation (road and train)


A modelling approach to assess the effectiveness of BLEVE prevention measures on LPG tanks 3153
G. Landucci, M. Molag, J. Reinders & V. Cozzani
Availability assessment of ALSTOM’s safety-relevant trainborne odometry sub-system 3163
B.B. Stamenković & P. Dersin
Dynamic maintenance policies for civil infrastructure to minimize cost and manage safety risk 3171
T.G. Yeung & B. Castanier
FAI: Model of business intelligence for projects in metrorailway system 3177
A. Oliveira & J.R. Almeida Jr.

Impact of preventive grinding on maintenance costs and determination of an optimal grinding cycle 3183
C. Meier-Hirmer & Ph. Pouligny
Logistics of dangerous goods: A GLOBAL risk assessment approach 3191
C. Mazri, C. Deust, B. Nedelec, C. Bouissou, J.C. Lecoze & B. Debray
Optimal design of control systems using a dependability criteria and temporal sequences
evaluation—Application to a railroad transportation system 3199
J. Clarhaut, S. Hayat, B. Conrard & V. Cocquempot
RAM assurance programme carried out by the Swiss Federal Railways SA-NBS project 3209
B.B. Stamenković
RAMS specification for an urban transit Maglev system 3217
A. Raffetti, B. Faragona, E. Carfagna & F. Vaccaro
Safety analysis methodology application into two industrial cases: A new mechatronical system
and during the life cycle of a CAF’s high speed train 3223
O. Revilla, A. Arnaiz, L. Susperregui & U. Zubeldia
The ageing of signalling equipment and the impact on maintenance strategies 3231
M. Antoni, N. Zilber, F. Lejette & C. Meier-Hirmer
The development of semi-Markov transportation model 3237
Z. Mateusz & B. Tymoteusz
Valuation of operational architecture dependability using Safe-SADT formalism: Application
to a railway braking system 3245
D. Renaux, L. Cauffriez, M. Bayart & V. Benard

Waterborne transportation
A simulation based risk analysis study of maritime traffic in the Strait of Istanbul 3257
B. Özbaş, I. Or, T. Altiok & O.S. Ulusçu
Analysis of maritime accident data with BBN models 3265
P. Antão, C. Guedes Soares, O. Grande & P. Trucco
Collision risk analyses of waterborne transportation 3275
E. Vanem, R. Skjong & U. Langbecker
Complex model of navigational accident probability assessment based on real time
simulation and manoeuvring cycle concept 3285
L. Gucma
Design of the ship power plant with regard to the operator safety 3289
A. Podsiadlo & W. Tarelko
Human fatigue model at maritime transport 3295
L. Smolarek & J. Soliwoda
Modeling of hazards, consequences and risk for safety assessment of ships in damaged
conditions in operation 3303
M. Gerigk
Numerical and experimental study of a reliability measure for dynamic control of floating vessels 3311
B.J. Leira, P.I.B. Berntsen & O.M. Aamo
Reliability of overtaking maneuvers between ships in restricted area 3319
P. Lizakowski
Risk analysis of ports and harbors—Application of reliability engineering techniques 3323
B.B. Dutta & A.R. Kar

Subjective propulsion risk of a seagoing ship estimation 3331
A. Brandowski, W. Frackowiak, H. Nguyen & A. Podsiadlo
The analysis of SAR action effectiveness parameters with respect to drifting search area model 3337
Z. Smalko & Z. Burciu
The risk analysis of harbour operations 3343
T. Abramowicz-Gerigk
Author index 3351


Preface

This Conference stems from a European initiative merging the ESRA (European Safety and Reliability
Association) and SRA-Europe (Society for Risk Analysis—Europe) annual conferences into the major safety,
reliability and risk analysis conference in Europe during 2008. This is the second joint ESREL (European Safety
and Reliability) and SRA-Europe Conference, after the 2000 event held in Edinburgh, Scotland.
ESREL is an annual conference series promoted by the European Safety and Reliability Association. The
conference dates back to 1989, but was not referred to as an ESREL conference before 1992. The Conference
has become well established in the international community, attracting a good mix of academics and industry
participants that present and discuss subjects of interest and application across various industries in the fields of
Safety and Reliability.
The Society for Risk Analysis—Europe (SRA-E) was founded in 1987, as a section of SRA International
(founded in 1981), to develop a special focus on risk-related issues in Europe. SRA-E aims to bring together
individuals and organisations with an academic interest in risk assessment, risk management and risk commu-
nication in Europe, and emphasises the European dimension in the promotion of interdisciplinary approaches to
risk analysis in science. The annual conferences take place in various countries in Europe in order to enhance
access to SRA-E for both members and other interested parties. Recent conferences have been held in Stockholm,
Paris, Rotterdam, Lisbon, Berlin, Como, Ljubljana and The Hague.
These conferences come to Spain for the first time; the venue is Valencia, situated on the east coast by
the Mediterranean Sea, a meeting point of many cultures. The host of the conference is the
Universidad Politécnica de Valencia.
This year the theme of the Conference is "Safety, Reliability and Risk Analysis: Theory, Methods and
Applications". The Conference covers a number of topics within safety, reliability and risk, and provides a
forum for the presentation and discussion of scientific papers covering theory, methods and applications across a wide
range of sectors and problem areas. Special focus has been placed on strengthening the bonds between the safety,
reliability and risk analysis communities, with the aim of learning from the past to build the future.
The Conferences have been growing with time, and this year the programme of the Joint Conference includes 416
papers by prestigious authors from all over the world. Originally, about 890 abstracts were submitted.
After review of the full papers by the Technical Programme Committee, 416 were selected and included
in these Proceedings. The efforts of the authors and the reviewers guarantee the quality of the work. The initiative and
planning carried out by the Technical Area Coordinators have resulted in a number of interesting sessions covering
a broad spectrum of topics.
Sebastián Martorell
C. Guedes Soares
Julie Barnett
Editors


Organization

Conference Chairman
Dr. Sebastián Martorell Alsina Universidad Politécnica de Valencia, Spain

Conference Co-Chairman
Dr. Blás Galván González University of Las Palmas de Gran Canaria, Spain

Conference Technical Chairs


Prof. Carlos Guedes Soares Technical University of Lisbon—IST, Portugal
Dr. Julie Barnett University of Surrey, United Kingdom

Board of Institution Representatives


Prof. Gumersindo Verdú Vice-Rector for International Actions—
Universidad Politécnica de Valencia, Spain
Dr. Ioannis Papazoglou ESRA Chairman
Dr. Roberto Bubbico SRA-Europe Chairman

Technical Area Coordinators


Aven, Terje—Norway
Bedford, Tim—United Kingdom
Berenguer, Christophe—France
Bubbico, Roberto—Italy
Cepin, Marco—Slovenia
Christou, Michalis—Italy
Colombo, Simone—Italy
Dien, Yves—France
Doménech, Eva—Spain
Eisinger, Siegfried—Norway
Enander, Ann—Sweden
Felici, Massimo—United Kingdom
Finkelstein, Maxim—South Africa
Goossens, Louis—The Netherlands
Hessami, Ali—United Kingdom
Johnson, Chris—United Kingdom
Kirchsteiger, Christian—Luxembourg
Kröger, Wolfgang—Switzerland
Leira, Bert—Norway
Levitin, Gregory—Israel
Merad, Myriam—France
Palanque, Philippe—France
Papazoglou, Ioannis—Greece
Preyssl, Christian—The Netherlands
Rackwitz, Ruediger—Germany
Rosqvist, Tony—Finland
Salvi, Olivier—Germany
Skjong, Rolf—Norway
Spadoni, Gigliola—Italy
Tarantola, Stefano—Italy
Thalmann, Andrea—Germany
Thunem, Atoosa P-J—Norway
Van Gelder, Pieter—The Netherlands
Vrouwenvelder, Ton—The Netherlands

Technical Programme Committee


Ale B, The Netherlands
Alemano A, Luxembourg
Amari S, United States
Andersen H, Denmark
Aneziris O, Greece
Antao P, Portugal
Arnaiz A, Spain
Badia G, Spain
Barros A, France
Bartlett L, United Kingdom
Basnyat S, France
Birkeland G, Norway
Bladh K, Sweden
Boehm G, Norway
Bris R, Czech Republic
Bründl M, Switzerland
Burgherr P, Switzerland
Bye R, Norway
Carlos S, Spain
Castanier B, France
Castillo E, Spain
Cojazzi G, Italy
Contini S, Italy
Cozzani V, Italy
Cha J, Korea
Chozos N, United Kingdom
De Wit S, The Netherlands
Droguett E, Brazil
Drottz-Sjoberg B, Norway
Dutuit Y, France
Escriche I, Spain
Faber M, Switzerland
Fouladirad M, France
Garbatov Y, Portugal
Ginestar D, Spain
Grall A, France
Gucma L, Poland
Hardman G, United Kingdom
Harvey J, United Kingdom
Hokstad P, Norway
Holicky M, Czech Republic
Holloway M, United States
Iooss B, France
Iung B, France
Jonkman B, The Netherlands
Kafka P, Germany
Kahle W, Germany
Kleyner A, United States
Kolowrocki K, Poland
Konak A, United States
Korczak E, Poland
Kortner H, Norway
Kosmowski K, Poland
Kozine I, Denmark
Kulturel-Konak S, United States
Kurowicka D, The Netherlands
Labeau P, Belgium
Le Bot P, France
Limbourg P, Germany
Lisnianski A, Israel
Lucas D, United Kingdom
Luxhoj J, United States
Ma T, United Kingdom
Makin A, Australia
Massaiu S, Norway
Mercier S, France
Navarre D, France
Navarro J, Spain
Nelson W, United States
Newby M, United Kingdom
Nikulin M, France
Nivolianitou Z, Greece
Pérez-Ocón R, Spain
Pesme H, France
Piero B, Italy
Pierson J, France
Podofillini L, Italy
Proske D, Austria
Re A, Italy
Revie M, United Kingdom
Rocco C, Venezuela
Rouhiainen V, Finland
Roussignol M, France
Sadovsky Z, Slovakia
Salzano E, Italy
Sanchez A, Spain
Sanchez-Arcilla A, Spain
Scarf P, United Kingdom
Siegrist M, Switzerland
Sørensen J, Denmark
Storer T, United Kingdom
Sudret B, France
Teixeira A, Portugal
Tian Z, Canada
Tint P, Estonia
Trbojevic V, United Kingdom
Valis D, Czech Republic
Vaurio J, Finland
Yeh W, Taiwan
Zaitseva E, Slovakia
Zio E, Italy

Webpage Administration
Alexandre Janeiro Instituto Superior Técnico, Portugal

Local Organizing Committee


Sofía Carlos Alberola Universidad Politécnica de Valencia
Eva Ma Doménech Antich Universidad Politécnica de Valencia
Antonio José Fernandez Iberinco, Chairman Reliability Committee AEC
Blás Galván González Universidad de Las Palmas de Gran Canaria
Aitor Goti Elordi Universidad de Mondragón
Sebastián Martorell Alsina Universidad Politécnica de Valencia
Rubén Mullor Ibañez Universidad de Alicante

Rafael Pérez Ocón Universidad de Granada
Ana Isabel Sánchez Galdón Universidad Politécnica de Valencia
Vicente Serradell García Universidad Politécnica de Valencia
Gabriel Winter Althaus Universidad de Las Palmas de Gran Canaria

Conference Secretariat and Technical Support at Universidad Politécnica de Valencia


Gemma Cabrelles López
Teresa Casquero García
Luisa Cerezuela Bravo
Fanny Collado López
María Lucía Ferreres Alba
Angeles Garzón Salas
María De Rus Fuentes Manzanero
Beatriz Gómez Martínez
José Luis Pitarch Catalá
Ester Srougi Ramón
Isabel Martón Lluch
Alfredo Moreno Manteca
Maryory Villamizar Leon
José Felipe Villanueva López

Sponsored by
Ajuntament de Valencia
Asociación Española para la Calidad (Comité de Fiabilidad)
CEANI
Generalitat Valenciana
Iberdrola
Ministerio de Educación y Ciencia
PMM Institute for Learning
Tekniker
Universidad de Las Palmas de Gran Canaria
Universidad Politécnica de Valencia

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Acknowledgements

The conference is organized jointly by Universidad Politécnica de Valencia, ESRA (European Safety and
Reliability Association) and SRA-Europe (Society for Risk Analysis—Europe), under the high patronage of
the Ministerio de Educación y Ciencia, Generalitat Valenciana and Ajuntament de Valencia.
Thanks also to the support of our sponsors Iberdrola, PMM Institute for Learning, Tekniker, Asociación
Española para la Calidad (Comité de Fiabilidad), CEANI and Universidad de Las Palmas de Gran Canaria. The
support of all is greatly appreciated.
The work and effort of the peers involved in the Technical Program Committee in helping the authors to
improve their papers are greatly appreciated. Special thanks go to the Technical Area Coordinators and organisers
of the Special Sessions of the Conference, for their initiative and planning which have resulted in a number of
interesting sessions. Thanks to authors as well as reviewers for their contributions in the review process. The
review process has been conducted electronically through the Conference web page. The support to the web
page was provided by the Instituto Superior Técnico.
We would especially like to acknowledge the local organising committee and the conference secretariat and technical
support at the Universidad Politécnica de Valencia for their careful planning of the practical arrangements.
Their many hours of work are greatly appreciated.
These conference proceedings have been partially financed by the Ministerio de Educación y Ciencia
de España (DPI2007-29009-E), the Generalitat Valenciana (AORG/2007/091 and AORG/2008/135) and the
Universidad Politécnica de Valencia (PAID-03-07-2499).


Introduction

The Conference covers a number of topics within safety, reliability and risk, and provides a forum for presentation
and discussion of scientific papers covering theory, methods and applications to a wide range of sectors and
problem areas.

Thematic Areas
• Accident and Incident Investigation
• Crisis and Emergency Management
• Decision Support Systems and Software Tools for Safety and Reliability
• Dynamic Reliability
• Fault Identification and Diagnostics
• Human Factors
• Integrated Risk Management and Risk-Informed Decision-making
• Legislative dimensions of risk management
• Maintenance Modelling and Optimisation
• Monte Carlo Methods in System Safety and Reliability
• Occupational Safety
• Organizational Learning
• Reliability and Safety Data Collection and Analysis
• Risk and Evidence Based Policy Making
• Risk and Hazard Analysis
• Risk Control in Complex Environments
• Risk Perception and Communication
• Safety Culture
• Safety Management Systems
• Software Reliability
• Stakeholder and public involvement in risk governance
• Structural Reliability and Design Codes
• System Reliability Analysis
• Uncertainty and Sensitivity Analysis

Industrial and Service Sectors


• Aeronautics and Aerospace
• Automotive Engineering
• Biotechnology and Food Industry
• Chemical Process Industry
• Civil Engineering
• Critical Infrastructures
• Electrical and Electronic Engineering
• Energy Production and Distribution
• Health and Medicine
• Information Technology and Telecommunications
• Insurance and Finance
• Manufacturing
• Mechanical Engineering
• Natural Hazards

• Nuclear Engineering
• Offshore Oil and Gas
• Policy Decisions
• Public Planning
• Security and Protection
• Surface Transportation (road and train)
• Waterborne Transportation

Thematic areas

Accident and incident investigation



A code for the simulation of human failure events in nuclear power plants: SIMPROC

J. Gil, J. Esperón, L. Gamo, I. Fernández, P. González & J. Moreno
Indizen Technologies S.L., Madrid, Spain

A. Expósito, C. Queral & G. Rodríguez
Universidad Politécnica de Madrid (UPM), Madrid, Spain

J. Hortal
Spanish Nuclear Safety Council (CSN), Madrid, Spain

ABSTRACT: Over the past years, many Nuclear Power Plant (NPP) organizations have performed Probabilistic
Safety Assessments (PSAs) to identify and understand key plant vulnerabilities. As part of enhancing PSA
quality, Human Reliability Analysis (HRA) is key to a realistic evaluation of safety and of the potential
weaknesses of a facility. Moreover, it has to be noted that HRA continues to be a large source of uncertainty in
PSAs. We developed the SIMulator of PROCedures (SIMPROC) as a tool to simulate events related to human
actions and to help the analyst quantify the importance of human actions in the final plant state. Among others,
the main goal of SIMPROC is to check whether Emergency Operating Procedures (EOPs) lead to a safe shutdown
plant state. The first pilot cases have been MBLOCA scenarios simulated with the MAAP4 severe accident code
coupled with SIMPROC.

1 INTRODUCTION

In addition to the traditional methods for verifying procedures, integrated simulations of operator and plant
response may be useful to:

1. Verify that the plant operating procedures can be understood and performed by the operators.
2. Verify that the response based on these procedures leads to the intended results.
3. Identify potential situations where the judgment of operators concerning the appropriate response is
   inconsistent with the procedures.
4. Study the consequences of errors of commission and the possibilities for recovering from such errors
   (CSNI 1998; CSNI-PWG1 and CSNI-PWG5 1997).
5. Study time availability factors related to the execution of procedures.

Indizen Technologies, in cooperation with the Department of Energy Systems of the Technical University of
Madrid and the Spanish Nuclear Safety Council (CSN), has developed over the last two years a tool known as
SIMPROC. This tool, coupled with a plant simulator, is able to incorporate the effect of operator actions in
simulations of plant accident sequences. This software tool is also part of the SCAIS software package
(Izquierdo 2003; Izquierdo et al. 2000; Izquierdo et al. 2008), a simulation system able to generate dynamic
event trees stemming from an initiating event, based on a technique able to efficiently simulate all branches
while taking into account the different factors that may affect the dynamic plant behavior in each sequence.
The development of SIMPROC is described in detail in this paper.

2 WHY SIMPROC?

The attribution of a large number of accidents to human error meant that it had to be considered in risk
assessment. In the context of nuclear safety studies, Rasmussen (1975) concluded that human error contributed
to 65% of the considered accidents, a contribution as great as or even greater than that associated with system
failures. The NRC estimates that human error is directly or indirectly involved in approximately 65% of
abnormal incidents (Trager 1985). According to the data of the Incident Reporting System of the International
Atomic Energy Agency (IAEA), 48% of the registered events are related to failures in human performance. Of those,

63% occurred during power operation and the remaining 37% during shutdown operation. Additionally, analyzing
the events reported using the International Nuclear Event Scale (INES) during the last decade, most of the
major incidents of Level 2 or higher could be attributed to causes related to human performance. Moreover, a
study based on a broad set of PSA data states that between 15 and 80% of Core Damage Frequency is related to
the execution failure of some operator action (NEA 2004). Reason (1990), in a study of a dozen significant
accidents in the 15 years prior to publication, including Three Mile Island (TMI), Chernobyl, Bhopal and the
London Underground fire, concludes that at least 80% of the system failures were caused by humans, especially
by inadequate management of maintenance supervision. In addition, it finds that other aspects are relevant,
mainly technical inaccuracy or incomplete training in operating procedures. This last aspect is of great
importance in many sectors where operation in abnormal or emergency situations relies on strict operating
procedures. In this respect, four of the investigations carried out by the NRC and nearly 20 of the additional
investigations following Three Mile Island (TMI) concluded that intolerable violations of procedures took
place. The TMI accident clearly illustrated how the interaction of technical aspects with human and
organizational factors can drive the progression of events. After the incident, large research and development
efforts focused on the study of human factors in accident management. Accident management comprises the
actions that the operating crew must perform during beyond design basis accidents, with the objective of
maintaining the basic reactor safety functions. Two phases of emergency management are distinguished: the
preventive phase, in which operator actions are centered on avoiding damage to the core and maintaining the
integrity of the installation, and the mitigation phase, in which, once core damage occurs, operator actions
are oriented to reducing the amount of radioactive material that is going to be released. Accident management
is carried out by following EOPs and Severe Accident Management Guides (SAMGs), and its improvement was one of
the activities carried out after the TMI accident. The accident sequence provides the basis for determining the
frequencies and uncertainties of consequences. The essential outcome of a PSA is a quantitative expression of
the overall risks in probabilistic terms. The initial approach to importing human factor concerns into
engineering practices was to use existing PSA methods and extend them to include human actions. We will use
SIMPROC to extend this functionality and include human factors in the plant evolution simulation. This is done
in a dynamic way, instead of the static point of view adopted by PSA studies.

3 BABIECA-SIMPROC ARCHITECTURE

The final objective of the BABIECA-SIMPROC system is to simulate accidental transients in NPPs considering
human actions. For this purpose it is necessary to develop an integrated tool that simulates the dynamics of
the system. To achieve this, we use the BABIECA Simulation Engine to calculate the time evolution of the plant
state, and we model the influence of human operator actions by means of SIMPROC. We have modeled the operators'
influence over the plant state as a separate module to emphasize the significance of operator actions in the
final state of the plant. It is possible to plug or unplug SIMPROC, so that end states can be compared with and
without the operator influence over the simulated plant state. The final goal of the overall BABIECA-SIMPROC
system, integrated in SCAIS, is to simulate Dynamic Event Trees (DET) describing the time evolution of
accidental sequences generated from a trigger event. During this calculation, potential degradations of the
systems must be taken into account, associating them with probabilistic calculations in each sequence.
Additionally, the influence of EOP execution is defined for each plant and each sequence. In order to achieve
this objective the integrated scheme must provide the following features:

1. The calculation framework must be able to integrate other simulation codes (MAAP, TRACE, . . . ). In this
   case, BABIECA-SIMPROC acts as a wrapper for external codes. This allows working with different codes in the
   same time line sequence. In case the simulation reaches core damage conditions, it is possible to unplug the
   best estimate code and plug in a severe accident code to accurately describe the dynamic state of the plant.
2. Be able to automatically generate the DET associated with an initiating event, simulating the dynamic plant
   evolution.
3. Obtain the probability associated with every possible evolution sequence of the plant.

The whole system is being developed in C++ in order to meet the speed and performance requirements of this kind
of simulation. Parallelization was implemented by means of a PVM architecture. The communication with the
PostgreSQL database is carried out through the libpq++ library. All the input deck needed to initialize the
system is written in standard XML.

The main components of the Global System Architecture can be summarized as follows:

1. DENDROS event scheduler. It is in charge of opening branches of the simulation tree depending on the plant
   simulation state. DENDROS allows the modularization and parallelization of the tree
   generation. The calculation of the probability for each branch is based on the truth condition of certain
   logical functions. The scheduler arranges for the opening of a branch whenever certain conditions are met,
   and stops the simulation of any particular branch that has reached an absorbing state. The time when the
   opening of the new branch occurs can be deterministically fixed by the dynamic conditions (setpoint
   crossing) or randomly delayed with respect to the time when the branching conditions are reached. The latter
   option especially applies to operator actions and may include the capability to use several values of the
   delay time within the same dynamic event tree. The scheduler must know the probability of each branch,
   calculated in a separate process called the Probability Wrapper, in order to decide which branch is suitable
   for further development. The applications of a tree-structured computation extend beyond the scope of DETs.
   In fact, the branch opening and cutoff can obey any set of criteria not necessarily given by a probability
   calculation, for instance sensitivity studies or automatic initialization for Accident Management Strategy
   analyses. More details on how dynamic event trees are generated and handled, and their advantages for safety
   analysis applications, are given in another paper presented at this conference (Izquierdo et al. 2008).
2. BABIECA plant simulator. It is adapted to execute the sequence simulation launched by the DENDROS event
   scheduler. As mentioned previously, BABIECA can wrap other nuclear simulation codes (e.g., MAAP). This
   simulation code is able to extend the simulation capacities of BABIECA to the context of severe accidents.
3. SIMPROC procedures simulator. This simulation module interacts with the BABIECA simulation engine to
   implement the EOPs.
4. Probability engine. It calculates the probabilities and delays associated with the set points of the DET.
   The initial implementation is based on PSA calculations, but we have developed a probability wrapper able to
   use calculations from BDD structures in the future.
5. Global database. It is used to save data from the different simulation modules, providing restart capability
   to the whole system and allowing an easier handling of the simulation results.

Figure 1. BABIECA-SIMPROC architecture.

If we focus our attention on the SIMPROC integration, the BABIECA-SIMPROC architecture can be illustrated as in
Fig. 1. BABIECA acts as a master code that encapsulates different simulation codes in order to build a robust
system with a broad range of application and great flexibility. The BABIECA driver has its own topology, named
BABIECA Internal Modules in Fig. 1. These modules allow us to represent relevant plant systems in great detail.
In this publication we focus our attention on the MAAP4 Wrapper, which allows us to connect BABIECA with this
severe accident code. The SIMPROC interaction with the system is illustrated in Fig. 2. The process can be
summarized in the following steps:

1. BABIECA starts the simulation, and SIMPROC is created when an EOP execution is demanded according to a set
   of previously defined conditions on plant variables.
2. SIMPROC is initialized with the plant state variables at that time instant. As a previous step, the
   computerized XML version of the set of EOPs must be introduced in the SIMPROC database.
3. The BABIECA calculation loop starts, and the outcomes of EOP executions are modeled as boundary conditions
   on the system. Each topology block can modify its state according to the specific actions of the SIMPROC EOP
   execution. Once boundary conditions are defined for the current step, the solution for the next step is
   calculated for each topology block. The calculation sequence includes continuous variables, discrete
   variables and event recollection for the current time step. Finally, all variable information is saved in
   the database and, depending on the system configuration, a simulation restart point is set.
4. The procedures simulator SIMPROC does not have its own time step but adapts its execution to the BABIECA
   pace. Additionally, it is possible to set a default communication time between BABIECA and SIMPROC. This
   time represents the average time the operator needs to recognize the state of the plant, which is higher
   than the time step of the simulation.

5. Once BABIECA reaches the defined simulation end time, it sends a message to SIMPROC to quit the procedure
   execution (if it has not already ended) and saves the overall plant state in the database.

Figure 2. BABIECA-SIMPROC calculation flow.

4 SIMPROC-BABIECA-MAAP4 CONNECTION SCHEME

Providing system connectivity between BABIECA, SIMPROC and MAAP4 is one of the primary targets. This makes it
possible to simulate severe accident sequences in NPPs, integrating operator actions simulated by SIMPROC into
the plant model. In order to simulate the operator actions on the plant with MAAP, the user has to define these
actions in the XML input file used as a plant model, considering logical conditions (SCRAM, SI, . . .) to
trigger EOP execution. MAAP4 can simulate operator functions, although in a very rudimentary way. The key
advantages of using the integrated simulator SIMPROC-BABIECA-MAAP4, as opposed to using MAAP4 alone, are:

1. The actions take a certain time to be executed, whereas in MAAP4 they are instantaneous. This delay can be
   modelled by taking into account different parameters. For example, the number of simultaneous actions the
   operator is executing must influence his ability to start new required actions.
2. Actions must follow the sequence established by the active EOPs. In MAAP, operator actions have no order;
   they are executed as logical signals are triggered during plant evolution.
3. Although SIMPROC does not contain an operator model, it allows for the use of pre-defined operator skills,
   such as ''reactor operator'' or ''turbine operator'', each one with specific attributes. This feature allows
   for a better modelling of the distribution of operation duties among the members of the operating team.

Figure 3. SIMPROC-BABIECA-MAAP4 connection.

As shown in Fig. 3, BABIECA implements a block topology to represent control systems that model operator
actions on the plant. SIMPROC has access to those blocks and can change their variables according to the
directives associated with the active EOPs. The block outputs will be the time-dependent boundary conditions
for the MAAP4 calculation during the current time step. It can also be seen in Fig. 3 that BABIECA has another
type of block: SndCode and RcvCode. These blocks were designed to communicate BABIECA with external codes
through a PVM interface. SndCode gathers all the information generated by the control systems simulation and
sends it to the MAAP4 wrapper, which conforms all the data to be MAAP4 readable. Then, MAAP4 calculates the
next simulation time step. When this calculation ends, MAAP4 sends the calculated outputs back to the MAAP4
wrapper, and the outputs reach the RcvCode block of the topology. At this point, any BABIECA component,
especially the control systems, has the MAAP4 calculated variables available. A new BABIECA-SIMPROC
synchronization point occurs, and the BABIECA variable values become the boundary conditions for the SIMPROC
execution.

5 APPLICATION EXAMPLE

The example used to validate the new simulation package simulates the operator actions related to the level
control of the steam generators during an MBLOCA (medium-break LOCA) transient in a PWR Westinghouse design.
These operator actions are included in several EOPs, like the ES-1.2 procedure, Post LOCA cooldown and
depressurization, which is associated with primary cooling depressurization. The first step in this validation is to run

the mentioned transient using MAAP4 alone. After that, the same simulation was run using the BABIECA-SIMPROC
system. Finally, the results from both simulations are compared.

5.1 SG water level control with MAAP

Level control of the steam generator is accomplished by either of two modes of operation, automatic and manual.
In the automatic mode, the main feedwater control system senses the steam flow rate, Ws, and the downcomer
water level, z, and adjusts the feedwater control valves to bring the water level to a programmed (desired)
water level, z0. If the water level is very low, z < 0.9z0, the control valve is assumed to be fully open,
that is,

W = Wmax    (1)

where W is the feedwater flow rate.
If the water level is above 0.9z0, the model applies a limited proportional control in two steps:

1. Proportional control. The resulting feedwater flow rate, W, returns the water level to z0 at a rate
   proportional to the mismatch,

   (W − Ws) = α(z − z0).    (2)

   The coefficient of proportionality, α in eq. (2), is chosen so that the steam generator inventory becomes
   correct after a time interval τ, which is set to 100 s at present.
2. A limited flow rate. The feedwater flow rate is limited to values between 0 (valve closed) and Wmax (valve
   fully opened).

The other control mode is manual. In this mode, the control tries to hold the water level within an envelope
defined by a user-supplied deadband zDEAD. The feedwater flow rate is set as follows:

W = Wmax             if z < z0 − zDEAD/2
W = 0                if z > z0 + zDEAD
W = Wmin             if z0 + zDEAD/2 < z < z0 + zDEAD    (3)
W = W (unchanged)    if z0 − zDEAD/2 < z < z0 + zDEAD/2

where Wmin is the flow rate used on the decreasing part of the cycle; within the dead band the flow rate keeps
its previous value.
Operation in manual mode results in a sawtooth-like level trajectory which oscillates about the desired
level z0.
The parameters used to implement the narrow-band control over the steam generator water level are illustrated
in Fig. 4.

Figure 4. Representation of the key parameters to implement the narrow-band control over SGWL.

To simulate the BABIECA-SIMPROC version of the same transient we must create the XML files needed to define the
input deck of the overall system. The first step is to define the topology file for BABIECA. In this file we
need to set the block structure of the systems we want to simulate, and we need to specify which blocks are
going to be used by SIMPROC to model operator actions on the plant simulation. The EOP has a unique code to
identify it in the database, and some tags to include a description of the actions we are going to execute.
Moreover, it has a load-delay parameter designed to take into account the time the operator needs from the
moment the EOP demand trigger starts until the operator is ready to execute the proper actions.

This simple EOP has only one step, designed to control the Steam Generator Water Level. The main parameters of
this step are:

• Skill. Sets the operator profile we want to execute the action. In this example there is a REACTOR profile.
• Texec. Defines the time the operator is going to be busy each time he executes an action.
• Taskload. Defines a percentage to account for the attention that the operator needs to execute an action
  properly. The sum of all the taskload parameters of the actions the operator is executing must be less than
  100%. In future work, the Texec and Taskload parameters will be obtained from the Probability Engine at
  execution time, according to probability distributions of human actuation times for the different actions
  required by EOPs. These time distributions could be obtained from experimental studies.
• Time window. Sets the temporal interval during which the MONITOR is active.
• Targets. Sentences that define the logical behavior of the MONITOR. We tell the MONITOR what it has to do,
  when and where.

In more complex applications, a great effort must be made in computerizing the specific EOPs of each nuclear
plant under study (Expósito and Queral 2003a; Expósito and Queral 2003b). This topic is beyond the scope of
this paper.
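To make the control law of Section 5.1 concrete, the automatic mode of eqs. (1)-(2) can be sketched in C++ (the language the system is written in). This is a minimal illustration with our own function and variable names, not code from MAAP4 or BABIECA; note that, with the sign convention of eq. (2), a stabilizing gain α is negative, so that a level above z0 reduces the feedwater flow.

```cpp
#include <algorithm>
#include <cassert>

// Automatic-mode feedwater flow, eqs. (1)-(2):
//  - if z < 0.9*z0 the valve is fully open (eq. 1);
//  - otherwise W = Ws + alpha*(z - z0) (eq. 2),
//    limited to the range [0, Wmax] (limited flow rate).
double feedwaterAuto(double z, double z0, double Ws,
                     double alpha, double Wmax) {
    if (z < 0.9 * z0)
        return Wmax;                          // eq. (1): level very low
    double W = Ws + alpha * (z - z0);         // eq. (2): proportional control
    return std::max(0.0, std::min(W, Wmax));  // limited flow rate
}
```

In the MAAP model the gain is chosen so that the steam generator inventory recovers after τ = 100 s; in this sketch α is simply a free parameter.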

Finally, it is necessary to define the XML simulation files for BABIECA and SIMPROC. The main difference from
the previous XML files is that they do not need to be parsed and introduced in the database prior to the
simulation execution; they are parsed and stored in memory during runtime execution of the simulation.

The BABIECA simulation file parameters are:

• Simulation code. Must be unique in the database.
• Start input. Identifies the XML BABIECA topology file linked with the simulation.
• Simulation type. The type of simulation: restart, transient or steady.
• Total time. Final time of the simulation.
• Delta. Time step of the master simulation.
• Save output frequency. Frequency with which the outputs are saved in the database.
• Initial time. Initial time of the simulation.
• Initial topology mode. A topology block can be in multiple operation modes. During a simulation execution
  some triggers can lead to mode changes that modify the calculation loop of a block.
• Save restart frequency. Frequency with which restart points are saved to back up the simulation evolution.
• SIMPROC active. Flag that allows us to switch on the SIMPROC influence over the simulation.

The main parameters described in the XML SIMPROC simulation file are:

• Initial and end time. These parameters can differ from the ones used in the simulation file and define the
  time interval in which SIMPROC works.
• Operator parameters. These are id, skill and slowness. The first two identify the operator and his type, and
  the latter accounts for his speed in executing the required actions. It is known that this parameter depends
  on multiple factors like operator experience and training.
• Initial variables. These are the variables that are monitored continuously to identify the main parameters
  for evaluating the plant state. Each variable has a procedure code to be used in the EOP description, a
  BABIECA code to identify the variable inside the topology, and a set of logical states.
• Variables. These variables are not monitored in a continuous way but have the same structure as Initial
  variables. They are only updated under SIMPROC request.

Once we have defined the XML input files, we have met all the required conditions to run the BABIECA-SIMPROC
simulation.

Figure 5. Steam Generator Water Level comparison (black line: MAAP4 output; dashed red line: BABIECA-SIMPROC
output).

Figure 6. Mass Flow Rate to the Cold Leg (red line: MAAP4 output; dashed black line: BABIECA-SIMPROC output).

The compared results of the MAAP4 and BABIECA-SIMPROC simulations can be seen in Fig. 5 and Fig. 6. As shown in
the figures, the feedwater flow rate is set to 9.9 kg/s when the SGWL is lower than 7.5 m. This situation
occurs in the very first part of the simulation. Then the water level starts to rise, going through the defined
dead band until the level reaches 12.5 m. At this point the FWFR is set to its minimum value (7.36 kg/s). When
the SGWL is higher than 15 m, the flow rate is set to zero.

The agreement between the MAAP4 and BABIECA-SIMPROC results is good in general. The differences can be
explained by the different time steps of the two codes: MAAP4 chooses its time step according to convergence
criteria, while BABIECA-SIMPROC has a fixed time step set in the XML BABIECA simulation file. Additionally,
BABIECA-SIMPROC takes into consideration the time needed by the operator to execute each action, whereas the
MAAP4 implementation of the operator automatically executes the requested actions.
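The set points reported above (9.9 kg/s below 7.5 m, 7.36 kg/s between 12.5 m and 15 m, zero above 15 m) are consistent with eq. (3) for z0 = 10 m and zDEAD = 5 m. The manual-mode hysteresis can be sketched in C++ as follows; this is a minimal illustration with our own class and member names, not the SIMPROC implementation:

```cpp
#include <cassert>

// Manual-mode feedwater control per eq. (3): a narrow-band
// (hysteresis) controller around the programmed level z0.
// Inside the dead band (z0 - zDEAD/2, z0 + zDEAD/2) the flow
// keeps its previous value, which produces the sawtooth level
// trajectory described in Section 5.1.
class ManualLevelControl {
public:
    ManualLevelControl(double z0, double zDead,
                       double Wmax, double Wmin)
        : z0_(z0), zDead_(zDead), Wmax_(Wmax), Wmin_(Wmin),
          W_(Wmax) {}  // start on the rising part of the cycle

    double update(double z) {
        if (z < z0_ - 0.5 * zDead_)      W_ = Wmax_;  // below band: full flow
        else if (z > z0_ + zDead_)       W_ = 0.0;    // far above band: valve closed
        else if (z > z0_ + 0.5 * zDead_) W_ = Wmin_;  // upper band: decreasing part
        // inside the dead band: keep the previous flow (hysteresis)
        return W_;
    }

private:
    double z0_, zDead_, Wmax_, Wmin_, W_;
};
```

With z0 = 10 m, zDEAD = 5 m, Wmax = 9.9 kg/s and Wmin = 7.36 kg/s this reproduces the sequence of set points observed in Figs. 5-6; the deliberate memory inside the dead band is what generates the sawtooth.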

6 CONCLUSIONS

The software tool BABIECA-SIMPROC is being developed. This software package incorporates operator actions in
accident sequence simulations in NPPs. The simulation tool is not intended to evaluate the probability of human
errors, but to incorporate into the plant dynamics the effects of those actions performed by the operators
while following the operating procedures. Nonetheless, human error probabilities calculated by external HRA
models can be taken into account in the generation of dynamic event trees under the control of DENDROS. We have
tested this application with a pilot case related to the steam generator water level control during an MBLOCA
transient in a PWR Westinghouse NPP. The results have been satisfactory, although further testing is needed. At
this stage we are in the process of validation, aiming to simulate a complete set of the EOPs used in a PWR
Westinghouse NPP. Moreover, we are extending the capabilities of the system to incorporate TRACE as an external
code with its corresponding BABIECA wrapper. When this part of the work is completed, a wider simulation scope
will be available. This will allow us to analyze the impact of EOP execution by operators on the final state of
the plant, as well as to evaluate the allowable response times for the manual actions.

ACKNOWLEDGMENTS

The SIMPROC project is partially funded by the Spanish Ministry of Industry (PROFIT Program) and the SCAIS
project by the Spanish Ministry of Education and Science (CYCIT Program). Their support is gratefully
acknowledged.

We also want to show our appreciation to the people who, in one way or another, have contributed to the
accomplishment of this project.

NOMENCLATURE

ISA Integrated Safety Analysis
SCAIS Simulation Codes System for Integrated Safety Assessment
PSA Probabilistic Safety Analysis
HRA Human Reliability Analysis
CSN Spanish Nuclear Safety Council
EOP Emergency Operation Procedure
SAMG Severe Accident Management Guide
LOCA Loss of Coolant Accident
DET Dynamic Event Tree
PVM Parallel Virtual Machine
BDD Binary Decision Diagram
XML Extensible Markup Language
SGWL Steam Generator Water Level
FWFR Feed Water Flow Rate

REFERENCES

CSNI (Ed.) (1998). Proceedings from Specialists Meeting: Human performance in operational events. CSNI.
CSNI-PWG1 and CSNI-PWG5 (1997). Research strategies for human performance. Technical Report 24, CSNI.
Expósito, A. and C. Queral (2003a). Generic questions about the computerization of the Almaraz NPP EOPs.
  Technical report DSE-13/2003, UPM.
Expósito, A. and C. Queral (2003b). PWR EOPs computerization. Technical report DSE-14/2003, UPM.
Izquierdo, J.M. (2003). An integrated PSA approach to independent regulatory evaluations of nuclear safety
  assessment of Spanish nuclear power stations. In EUROSAFE Forum 2003.
Izquierdo, J.M., J. Hortal, M. Sanchez-Perea, E. Meléndez, R. Herrero, J. Gil, L. Gamo, I. Fernández,
  J. Esperón, P. González, C. Queral, A. Expósito, and G. Rodríguez (2008). SCAIS (Simulation Code System for
  Integrated Safety Assessment): Current status and applications. In Proceedings of ESREL 08.
Izquierdo, J.M., C. Queral, R. Herrero, J. Hortal, M. Sanchez-Perea, E. Meléndez, and R. Muñoz (2000). Role of
  fast running TH codes and their coupling with PSA tools, in Advanced Thermal-Hydraulic and Neutronic Codes:
  Current and Future Applications. In NEA/CSNI/R(2001)2, Volume 2.
NEA (2004). Nuclear regulatory challenges related to human performance. ISBN 92-64-02089-6, NEA.
Rasmussen, N.C. (1975). Reactor safety study: An assessment of accident risks in U.S. nuclear power plants.
  NUREG-75/014 (WASH-1400).
Reason, J. (1990). Human Error. Cambridge University Press.
Trager, E.A. (1985). Case study report on loss of safety system function events. Technical Report AEOD/C504,
  Office for Analysis and Evaluation of Operational Data, Nuclear Regulatory Commission (NRC).

9
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

A preliminary analysis of the ‘Tlahuac’ incident by applying the MORT technique

J.R. Santos-Reyes, S. Olmos-Peña & L.M. Hernández-Simón
Safety, Risk & Reliability Group, SEPI-ESIME, IPN, Mexico

ABSTRACT: Crime may be regarded as a major source of social concern in the modern world. Very often
increases in crime rates will be treated as headline news, and many people see the ‘law and order’ issue as
one of the most pressing in modern society. An example of such issues has been highlighted by the ‘‘Tláhuac’’
incident which occurred in Mexico City on 23 November 2004. The fatal incident occurred when an angry
crowd burnt alive two police officers and seriously injured another after mistaking them for child kidnappers.
The third policeman, who was finally rescued by colleagues three and a half hours after the attack began, suffered
serious injuries. The paper presents some preliminary results of the analysis of the above incident by applying the
MORT (Management Oversight and Risk Tree) technique. The MORT technique may be regarded as a structured
checklist in the form of a complex ‘fault tree’ model that is intended to ensure that all aspects of an organization’s
management are looked into when assessing the possible causes of an incident. Some other accident analysis
approaches may be adopted in the future for further analysis. It is hoped that by conducting such analysis lessons
can be learnt so that incidents such as the case of ‘Tláhuac’ can be prevented in the future.

1 INTRODUCTION

Crime and disorder may comprise a ‘‘vast set of events involving behaviour formally deemed against the law and usually committed with ‘evil intent’’’ (Ekblom 2005). These events range from murder to fraud, theft, vandalism, dealing in drugs, kidnappings and terrorist atrocities that threaten public safety. Public safety may be defined as ‘‘a state of existence in which people, individually and collectively, are sufficiently free from a range of real and perceived risks centering on crime and disorder, are sufficiently able to cope with those they nevertheless experience, and where unable to cope unaided, are sufficiently-well protected from the consequences of these risks’’ (Ekblom 2005).

1.1 The nature of crime

1.1.1 Sexual assault and rape
It is well documented that sexual assault and abuse have a profound effect on the victim’s emotional functions, such as grief, fear, self-blame and emotional lability, including cognitive reactions such as flashbacks, intrusive thoughts, blocking of significant details of the assault and difficulties with concentration (Alaggia et al. 2006). The offenders, on the other hand, display problems with empathy, perception of others, and management of negative emotions, interpersonal relationships, and world views (Alalehto 2002). Rape, in turn, is generally defined in the literature as a sexual act imposed on the victim by means of violence or threats of violence (Griffiths 2004, Puglia et al. 2005). It is recognised that the most common motives for rape include power, opportunity, pervasive anger, sexual gratification, and vindictiveness. It is also emphasised that it is important to understand the behavioural characteristics of rapists, and the attitudes towards rape victims.

1.1.2 Murder
Very often murder is a product of conflict between acquaintances or family members, or a by-product of other types of crime such as burglary or robbery (McCabe & Wauchope 2005, Elklit 2002). It is generally recognised that cultural attitudes, reflected in and perpetuated by mass media accounts, have a very significant influence on the public perception of crime.

1.1.3 Organized crime
The phenomenon of organised crime, such as smuggling or trafficking, is a kind of enterprise driven as a business: i.e., the pursuit of profits (Tucker 2004, Klein 2005). Smuggling or trafficking is usually understood as the organised trade of weapons or drugs, women and children, including refugees, but unlike the legal trade, smuggling or trafficking provides goods or services, based on market demand

and the possibilities of realising it, that are illegal or socially unacceptable. Lampe and Johansen (2004) attempt to develop a thorough understanding of organised crime by clarifying and specifying the concept of trust in organised crime networks. They propose four typologies that categorise trust as: (a) individualised trust, (b) trust based on reputation, (c) trust based on generalisations, and (d) abstract trust. They show the existence of these types of trust through the examination of illegal markets.

1.1.4 Battering
Chiffriller et al. (2006) discuss the phenomenon called battering, as well as batterer personality and behavioural characteristics. Based on cluster analysis, they developed five distinct profiles of men who batter women. Based on the behavioural and personality characteristics that define each cluster, they establish five categories: (1) pathological batterers, (2) sexually violent batterers, (3) generally violent batterers, (4) psychologically violent batterers, and (5) family-only batterers. The authors discuss the implications of these categories for intervention.

1.1.5 Bullying
Burgess et al. (2006) argue that bullying has become a major public health issue because of its connection to violent and aggressive behaviours that result in serious injury to self and to others. The authors define bullying as a relationship problem in which power and aggression are inflicted on a vulnerable person to cause distress. They further emphasise that teasing and bullying can turn deadly.

1.1.6 Law enforcement
Finckenauer (2005) emphasizes that a major push for the expansion of higher education in crime and justice studies came particularly from the desire to professionalise the police—with the aim of improving police performance. He suggests new and expanded subject-matter coverage. First, criminal-justice educators must recognise that the face of crime has changed—it has become increasingly international in nature. Examples include cyber-crime, drug trafficking, human trafficking, other forms of trafficking and smuggling, and money laundering. Although global in nature, these sorts of crimes have significant state and local impact. The author argues that those impacts need to be recognised and understood by twenty-first-century criminal-justice professors and students. Increasingly, crime and criminals do not respect national borders. As a result, law enforcement and criminal justice cannot be bound and limited by national borders. Second, he emphasises the need to recognise that the role of science in law enforcement and the administration of justice has become increasingly pervasive and much more sophisticated. This role includes such developments as DNA testing and evidence, the use of less-than-lethal weapons, increasingly sophisticated forms of identification, and crimes such as identity theft and a variety of frauds and scams. The use of scientific evidence and expert witnesses is also a significant issue in the prosecution and adjudication of offenders. Third, he puts emphasis on public security and terrorism. This has shaped, and will continue to shape, criminal justice in a variety of ways.
Ratcliffe (2005) emphasises the practice of intelligence-driven policing as a paradigm in modern law enforcement in various countries (e.g., UK, USA, Australia & New Zealand). The author proposes a conceptual model of intelligence-driven policing; the model essentially entails three stages: (1) law enforcement interprets the criminal environment, (2) law enforcement influences decision-makers, and (3) decision-makers impact on the criminal environment.

1.2 Crime in Mexico City
The International Crime Victim Survey (ICVS) was established in 1989 and has since contributed to the international knowledge of crime trends in several countries. It is believed that since its conception standardized victimization surveys have been conducted in more than 70 countries worldwide. The ICVS has attracted growing interest from both the crime research community and policy makers. Apart from providing internationally standardized indicators of the perception and fear of crime across different socio-economic contexts, it has also contributed an alternative source of data on crime. Similarly, in Mexico, four National Crime Victim Surveys (known as ENSI-1, 2, 3 & 4 respectively) have been conducted since 2001. The surveys are intended to help provide a better knowledge of the levels of crime which affect the safety of Mexican citizens (ICESI 2008). Some key findings of the fourth survey (i.e., ENSI-4) are summarized in Tables 1 & 2.

Table 1. Victimization (ICESI 2008).

Types of crime                                   Percentage (%)
Robbery                                          56.3
Other types of robbery                           25.8
Assault                                           7.2
Theft of items from cars (e.g., accessories)      2.9
Burglary                                          2.4
Theft of cars                                     1.5
Other types of crime                              0.4
Kidnappings                                       2.1
Sexual offences                                   0.8
Other assaults/threats to citizens                0.6

Table 2. Public’s perceptions of crime (ICESI 2008).

Activities that have been given up by the public in Mexico City    Percentage (%)
• going out at night                      49.1
• going to the football stadium           17.1
• going out to dinner                     19.8
• going to the cinema/theatre             21.3
• carrying cash with them                 45.4
• taking public transport                 28.1
• wearing jewellery                       56.0
• taking taxis                            37.0
• carrying credit cards with them         38.0
• visiting friends or relatives           30.5
• other                                    1.6

Overall, the public’s perception of crime shows that 9 out of 10 citizens feel unsafe in Mexico City. More than half of the population believes that crime has affected their quality of life; for example, Table 2 shows that one in two citizens gave up wearing jewellery, going out at night and carrying cash with them.


2 THE ‘TLAHUAC’ INCIDENT

On 23 November 2004 an angry crowd burnt alive two police officers and seriously injured another after mistaking them for child kidnappers. The third police officer was finally rescued by colleagues three and a half hours after the attack began. (The three police officers were members of the branch called ‘‘Federal Preventive Police’’, known as ‘‘PFP’’. They were under the management of the Federal Police Headquarters; here it will be called FPHQ.) The incident occurred at San Juan Ixtayopan, a neighborhood of approximately 35,000 people on Mexico City’s southern outskirts; however, the incident is better known as the ‘Tláhuac’ case. It is believed that the police officers were taking photographs of pupils at a primary school, where two children had recently gone missing. TV reporters reached the scene before police reinforcements, and live cameras caught a mob beating the police officers. However, the head of the FPHQ said that back-up units were unable to get through for more than three and a half hours because of heavy traffic. The main events are thought to be the following (FIA 2004, 2005):

November 23rd 2004
17:55 hrs. Residents caught the PFP-police officers in plain clothes taking photographs of pupils leaving the school in the San Juan Ixtayopan neighbourhood.
18:10 hrs. The crowd was told that the three PFP-police officers were kidnappers.
18:25 hrs. One of the PFP-police officers tried to communicate with his superiors at the Headquarters but failed in the attempt. It is believed that the reason was that a meeting was going on at the time and his superior was unable to take the call. At about the same time, the crowd cheered and chanted obscenities as they attacked the officers.
18:30 hrs. The first live TV coverage was broadcast.
19:30 hrs. One of the PFP-police officers attempted for the second time to communicate with their superiors; it was broadcast live, and he said ‘‘They are not allowing us to get out, come and rescue us’’.
21:20 hrs. The regional police director at the regional police headquarters (here called RPHQ) was informed that two police officers had been burnt alive; he ordered the back-up units to intervene and rescue the officers.
22:30 hrs. After three and a half hours, the bodies of the two PFP-police officers were recovered.


3 THE MANAGEMENT OVERSIGHT RISK TREE (MORT)

3.1 Overview of the MORT method

The Management Oversight and Risk Tree (MORT) is an analytical procedure for determining causes and contributing factors (NRI-1 2002).
In MORT, accidents are defined as ‘‘unplanned events that produce harm or damage, that is, losses’’ (NRI-1 2002). Losses occur when a harmful agent comes into contact with a person or asset. This contact can occur either because of a failure of prevention or as an unfortunate but acceptable outcome of a risk that has been properly assessed and acted on (a so-called ‘‘assumed risk’’). MORT analysis always evaluates the ‘‘failure’’ route before considering the ‘‘assumed risk’’ hypothesis. In MORT analysis, most of the effort is directed at identifying problems in the control of a work/process and deficiencies in the protective barriers associated with it. These problems are then analysed for their origins in planning, design, policy, etc. In order to use MORT, key episodes in the sequence of events should be identified first; each episode can be characterised as: {a} a vulnerable target exposed to; {b} an agent of harm in the; {c} absence of adequate barriers.
MORT analysis can be applied to any one or more of the episodes identified; it is a choice for you to make in the light of the circumstances particular to your investigation. To identify these key episodes, you will need to undertake a barrier analysis (or ‘‘Energy Trace and Barrier Analysis’’, to give it its full title). Barrier analysis allows MORT analysis to be focussed;

it is very difficult to use MORT, even in a superficial way, without it.

Figure 1. Basic MORT structure.

3.2 Barrier analysis

The ‘‘Barrier analysis’’ is intended to produce a clear set of episodes for MORT analysis. It is an essential preparation for MORT analysis. The barrier analysis embraces three key concepts, namely: {a} ‘‘energy’’; {b} ‘‘target’’; and {c} ‘‘barrier’’.
‘‘Energy’’ refers to the harmful agent that threatens or actually damages a ‘‘Target’’ that is exposed to it. ‘‘Targets’’ can be people, things or processes—anything, in fact, that should be protected or would be better undisturbed by the ‘‘Energy’’. In MORT, an incident can result either from exposure to an energy flow without injuries or damage, or the damage of a target with no intrinsic value. The ‘‘Barrier’’ part of the title refers to the means by which ‘‘Targets’’ are kept safe from ‘‘Energies’’.

Table 3. Barrier analysis for the ‘Tlahuac’ incident.

Energy flow: 3-PFP (Police Officers) operations. Target: ‘Tlahuac’ suburb. Barriers: PFP-police officers in plain clothes taking pictures of pupils leaving school; PFP-police officers did not have an official ID or equivalent to demonstrate that they were effectively members of the PFP; no communication between the PFP-police officers and the local & regional authorities about the purpose of their operations in the neighborhood.

Energy flow: Angry crowd. Target: 3-PFP. Barriers: PFP-police officers in plain clothes taking pictures of pupils leaving school; PFP-police officers did not have an official ID or equivalent to demonstrate that they were effectively PFP-police officers; failure by the Police to investigate previous kidnappings.

Energy flow: Crowd attack. Target: 3-PFP. Barriers: PFP-police officers in plain clothes taking pictures of pupils leaving school; PFP-police officers did not have an official ID or equivalent to demonstrate that they were effectively PFP-police officers; no back-up units.

Figure 2. SB1 branch—‘‘Potentially harmful energy flow or condition’’. (Red: problems that contributed to the outcome; Green: judged to have been satisfactory.)
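The episode structure produced by the barrier analysis (an energy flow, the target it threatens, and the barriers that failed or were absent) lends itself to a simple data representation. The sketch below is illustrative only: the class and field names are ours rather than NRI MORT terminology, and the entries abbreviate Table 3.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Episode:
    """One barrier-analysis episode: an energy flow threatening a target."""
    energy_flow: str                 # the agent of harm
    target: str                      # the person or asset exposed to it
    barriers: List[str] = field(default_factory=list)  # barriers that failed or were absent

# The three episodes of Table 3, abbreviated
episodes = [
    Episode("3-PFP (police officers) operations", "'Tlahuac' suburb",
            ["Plain-clothes officers photographing pupils",
             "No official ID carried",
             "No communication with local and regional authorities"]),
    Episode("Angry crowd", "3 PFP-police officers",
            ["Plain-clothes officers photographing pupils",
             "No official ID carried",
             "Previous kidnappings not investigated"]),
    Episode("Crowd attack", "3 PFP-police officers",
            ["Plain-clothes officers photographing pupils",
             "No official ID carried",
             "No back-up units"]),
]

# Each episode can then be carried forward into a separate MORT analysis
for e in episodes:
    print(f"{e.energy_flow} -> {e.target}: {len(e.barriers)} barrier problems")
```

In the present study only the third episode (‘‘crowd attack’’) was carried forward into the MORT chart (see Section 4).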

3.3 MORT structure

Figure 1 shows the basic MORT structure. The top event in MORT is labelled ‘‘Losses’’, beneath which are its two alternative causes, i.e., {1} ‘‘Oversights & Omissions’’ and {2} ‘‘Assumed risks’’. In MORT all the contributing factors in the accident sequence are treated as ‘‘oversights and omissions’’ unless they are transferred to the ‘‘Assumed risk’’ branch. Input to the ‘‘Oversights and Omissions’’ event is through an AND logic gate. This means that problems manifest in the specific control of work activities necessarily involve issues in the management processes that govern them.

Figure 3. SB2 branch—‘‘Vulnerable people’’. (Red: problems that contributed to the outcome; Blue: need more information; Green: judged to have been satisfactory.)

Figure 4. SB3 branch—‘‘Barriers & Controls’’. (Red: problems that contributed to the outcome.)

Figure 5. SC2 branch—‘‘Barriers LTA’’. (Red: problems that contributed to the outcome; Blue: need more information; Green: judged to have been satisfactory.)

Figure 6. SD1 branch—‘‘Technical information LTA’’. (Red: problems that contributed to the outcome; Green: judged to have been satisfactory.)

On the other hand, the ‘‘Specific’’ and ‘‘Management’’ branches are regarded as the two main branches in MORT (see Fig. 2). Specific control factors are broken down into two main classes: {a} those related to the

incident or accident (SA1), and {b} those related to restoring control following an accident (SA2). Both of them are under an OR logic gate because either can be a cause of losses.

Figure 7. SD1 branch—‘‘Communication LTA’’. (Red: problems that contributed to the outcome; Green: judged to have been satisfactory.)

Figure 8. SD1 branch—‘‘Data collection LTA’’. (Red: problems that contributed to the outcome; Blue: need more information; Green: judged to have been satisfactory.)

Figure 9. SD6 branch—‘‘Support of supervision LTA’’. (Red: problems that contributed to the outcome; Green: judged to have been satisfactory.)

Figure 10. MB1 branch—‘‘Hazard analysis process LTA’’. (Red: problems that contributed to the outcome; Blue: need more information; Green: judged to have been satisfactory.)


4 ANALYSIS & RESULTS

A barrier analysis for the ‘‘Tlahuac’’ incident has been conducted and Table 3 shows the three possible phases or episodes. At this stage of the investigation the third energy flow (‘‘crowd attack’’) was considered for the analysis.
The results of the MORT analysis are presented in two ways, i.e., the MORT chart and brief notes on the MORT chart. Both need to be read in conjunction with the NRI MORT User’s Manual (NRI-1 2002).
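The top-level gate logic described in Section 3.3 (an OR gate over the specific control classes SA1 and SA2, an AND gate feeding ‘‘Oversights and Omissions’’, and an OR gate joining the ‘‘Assumed risks’’ branch at the top event) can be illustrated with a minimal Boolean sketch. This is our own illustration, not part of the NRI tooling, and the branch values below are hypothetical stand-ins for the qualitative findings of the analysis.

```python
# Minimal Boolean sketch of the top of the MORT chart (Figure 1).
# True means "less than adequate (LTA) / problem present"; False means satisfactory.

def gate_and(*inputs: bool) -> bool:
    """All inputs must be problems for the output event to occur."""
    return all(inputs)

def gate_or(*inputs: bool) -> bool:
    """Any single problem is enough for the output event to occur."""
    return any(inputs)

# Hypothetical branch assessments (illustrative values only)
incident_control_LTA = True    # SA1: control of the incident
mitigation_LTA = True          # SA2: restoring control after the accident
management_factors_LTA = True  # M: management system factors
assumed_risk = False           # no properly assessed and accepted risk

# SA1 and SA2 sit under an OR gate: either can be a cause of losses
specific_control_LTA = gate_or(incident_control_LTA, mitigation_LTA)

# "Oversights and omissions" is fed through an AND gate: problems in the
# specific control of work necessarily involve the management processes
# that govern them
oversights_and_omissions = gate_and(specific_control_LTA, management_factors_LTA)

# Top event: losses arise from oversights and omissions or from assumed risks
losses = gate_or(oversights_and_omissions, assumed_risk)
print(losses)  # prints True
```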

In practice, to make the process of making brief notes on the MORT chart easier to review, it is customary to colour-code the chart as follows: {a} red, where a problem is found; {b} green, where a relevant issue is judged to have been satisfactory; and {c} blue, to indicate where there is a need to find more information to properly assess it. Figures 2–11 show several branches of the MORT chart for the ‘Tlahuac’ incident.
Table 4 summarizes in brief some of the findings for the third phase identified by the barrier analysis.

Figure 11. MB1 branch—‘‘Management system factors’’. (Red: problems that contributed to the outcome; Blue: need more information; Green: judged to have been satisfactory.)

Table 4. Summary of main failures that were highlighted.

• Poor attitude from management (i.e., Federal Police & Mexico City’s Police organizations)
• Management blamed traffic jams for preventing them from sending reinforcements to rescue the POs
• A hazard assessment was not conducted; even if it was conducted, it did not highlight this scenario
• No one in management was responsible for the POs’ safety
• Management presumed/assumed that the three POs were enough for this operation and did not provide reinforcements
• Neither the POs nor the management learnt from previous incidents
• Lack of training; the POs did not know what to do in such circumstances
• Lack of support from the management
• No internal checks/monitoring
• No independent checks/monitoring
• Lack of coordination between the Federal & Local police organizations
• Lack of an emergency plan for such scenarios; it took four hours to recover the bodies of the two POs and rescue the third alive


5 DISCUSSION

A preliminary analysis of the ‘Tlahuac’ incident has been conducted by applying the MORT technique. Some preliminary conclusions have been highlighted. However, it should be pointed out that some of the branches of the Tree were not relevant or were difficult to interpret in the present case; for example: ‘‘SD2-Operational readiness LTA’’, ‘‘SD3-Inspection LTA’’, ‘‘SD4-Maintenance LTA’’, etc. It may be argued that these branches are designed to identify weaknesses in a socio-technical system; however, it may be the case that they are not suitable for a human activity system such as crime. A constraint of the present study was the difficulty of gathering reliable information on the chain of command within the police force at the time of the incident.
More work is needed to draw final conclusions about this fatal incident. Future work includes: {1} the circles shown in blue in the MORT chart need further investigation; {2} the other two ‘‘energy flows’’ identified by the barrier analysis should be investigated, i.e., ‘‘Angry crowd’’ & ‘‘PFP-police officer operations’’. Other ‘energy flows’ may be identified and investigated. Finally, other accident analysis approaches should be adopted for further analysis; these include, for example, PRISMA (Van der Schaaf 1996), STAMP (Levenson et al. 2003), the SSMS model (Santos-Reyes & Beard 2008), and others that may be relevant. It is hoped that by conducting such analysis lessons can be learnt so that incidents such as the case of ‘Tláhuac’ can be prevented in the future.


ACKNOWLEDGEMENTS

This project was funded by CONACYT & SIP-IPN under the following grants: CONACYT: No-52914 & SIP-IPN: No-20082804.


REFERENCES

Alaggia, R., & Regehr, C. 2006. Perspectives of justice for victims of sexual violence. Victims and Offenders 1: 33–46.
Alalehto, T. 2002. Eastern prostitution from Russia to Sweden and Finland. Journal of Scandinavian Studies in Criminology and Crime Prevention 3: 96–111.

Burgess, A.W., Garbarino, C., & Carlson, M.I. 2006. Pathological teasing and bullying turned deadly: shooters and suicide. Victims and Offenders 1: 1–14.
Chiffriller, S.H., Hennessy, J.J., & Zappone, M. 2006. Understanding a new typology of batterers: implications for treatment. Victims and Offenders 1: 79–97.
Davis, P.K., & Jenkins, B.M. 2004. A systems approach to deterring and influencing terrorists. Conflict Management and Peace Science 21: 3–15.
Ekblom, P. 2005. How to police the future: scanning for scientific and technological innovations which generate potential threats & opportunities in crime, policing & crime reduction (Chapter 2). In M.J. Smith & N. Tilley (eds), Crime Science—new approaches to preventing & detecting crime: 27–55. Willan Publishing.
Elklit, A. 2002. Attitudes toward rape victims—an empirical study of the attitudes of Danish website visitors. Journal of Scandinavian Studies in Criminology and Crime Prevention 3: 73–83.
FIA, 2004. Linchan a agentes de la PFP en Tláhuac [PFP agents lynched in Tláhuac]. Fuerza Informativa Azteca (FIA), http://www.tvazteca.com/noticias (24/11/2004).
FIA, 2005. Linchamiento en Tláhuac parecía celebración [Tláhuac lynching looked like a celebration]. Fuerza Informativa Azteca (FIA), http://www.tvazteca.com/noticias (10/01/2005).
Finckenauer, J.O. 2005. The quest for quality in criminal justice education. Justice Quarterly 22 (4): 413–426.
Griffiths, H. 2004. Smoking guns: European cigarette smuggling in the 1990’s. Global Crime 6 (2): 185–200.
Instituto Ciudadano de Estudios Sobre la Inseguridad (ICESI), online: www.icesi.org.mx.
Klein, J. 2005. Teaching her a lesson: media misses boys’ rage relating to girls in school shootings. Crime Media Culture 1 (1): 90–97.
Lampe, K.V., & Johansen, P.O. 2004. Organized crime and trust: on the conceptualization and empirical relevance of trust in the context of criminal networks. Global Crime 6 (2): 159–184.
Levenson, N.G., Daouk, M., Dulac, N., & Marais, K. 2003. Applying STAMP in accident analysis. Workshop on the Investigation and Reporting of Accidents.
McCabe, M.P., & Wauchope, M. 2005. Behavioural characteristics of rapists. Journal of Sexual Aggression 11 (3): 235–247.
NRI-1. 2002. MORT User’s Manual. For use with the Management Oversight and Risk Tree analytical logic diagram. Generic edition. Noordwijk Risk Initiative Foundation. ISBN 90-77284-01-X.
Puglia, M., Stough, C., Carter, J.D., & Joseph, M. 2005. The emotional intelligence of adult sex offenders: ability based EI assessment. Journal of Sexual Aggression 11 (3): 249–258.
Ratcliffe, J. 2005. The effectiveness of police intelligence management: A New Zealand case study. Police Practice & Research 6 (5): 435–451.
Santos-Reyes, J., & Beard, A.N. 2008. A systemic approach to managing safety. Journal of Loss Prevention in the Process Industries 21 (1): 15–28.
Schmid, A.P. 2005. Root causes of terrorism: some conceptual notes, a set of indicators, and a model. Democracy and Security 1: 127–136.
The International Crime Victim Survey (ICVS). Online: http://www.unicri.it/wwd/analysis/icvs/index.php.
Tucker, J. 2004. How not to explain murder: a sociological critique of Bowling for Columbine. Global Crime 6 (2): 241–249.
Van der Schaaf, T.W. 1996. A risk management tool based on incident analysis. International Workshop on Process Safety Management and Inherently Safer Processes, Proc. Inter. Conf. 8–11 October, Orlando, Florida, USA: 242–251.


Comparing a multi-linear (STEP) and systemic (FRAM) method for accident analysis

I.A. Herrera
Department of Production and Quality Engineering, Norwegian University of Science and Technology,
Trondheim, Norway

R. Woltjer
Department of Computer and Information Science, Cognitive Systems Engineering Lab,
Linköping University, Linköping, Sweden

ABSTRACT: Accident models and analysis methods affect what accident investigators look for, which con-
tributing factors are found, and which recommendations are issued. This paper contrasts the Sequentially Timed
Events Plotting (STEP) method and the Functional Resonance Analysis Method (FRAM) for accident analy-
sis and modelling. The main issues addressed in this paper are comparing the established multi-linear method
(STEP) with the systemic method (FRAM) and evaluating which new insights the latter systemic method pro-
vides for accident analysis in comparison to the former established multi-linear method. Since STEP and FRAM
are based on different understandings of the nature of accidents, the comparison of the methods focuses on
what we can learn from both methods, how, when, and why to apply them. The main finding is that STEP helps
to illustrate what happened, whereas FRAM illustrates the dynamic interactions within socio-technical systems
and lets the analyst understand the how and why by describing non-linear dependencies, performance conditions,
variability, and their resonance across functions.

1 INTRODUCTION

Analysing and attempting to understand accidents is an essential part of the safety management and accident prevention process. Many methods may be used for this; however, they often implicitly reflect a specific view on accidents. Analysis methods—and thus their underlying accident models—affect what investigators look for, which contributing factors are found, and which recommendations are made. Two such methods with underlying models are the Sequentially Timed Events Plotting method (STEP; Hendrick & Benner, 1987) and the Functional Resonance Accident Model with the associated Functional Resonance Analysis Method (FRAM; Hollnagel, 2004, 2008a).
Multi-linear event sequence models and methods (such as STEP) have been used in accident analysis to overcome the limitations of simple linear cause-effect approaches to accident analysis. In STEP, an accident is a special class of process where a perturbation transforms a dynamically stable activity into unintended interacting changes of states with a harmful outcome. In this multi-linear approach, an accident is viewed as several sequences of events and the system is decomposed by its structure.
Researchers have argued that linear approaches fail to represent the complex dynamics and interdependencies commonly observed in socio-technical systems (Amalberti, 2001; Dekker, 2004; Hollnagel, 2004; Leveson, 2001; Rochlin, 1999; Woods & Cook, 2002). Recently, systemic models and methods have been proposed that consider the system as a whole and emphasize the interaction of the functional elements.
FRAM (Hollnagel, 2004) is such a systemic model. FRAM is based on four principles (Hollnagel, Pruchnicki, Woltjer & Etcher, 2008). First, the principle that both successes and failures result from the adaptations that organizations, groups and individuals perform in order to cope with complexity. Success depends on their ability to anticipate, recognise, and manage risk. Failure is due to the absence of that ability (temporarily or permanently), rather than to the inability of a system component (human or technical) to function normally. Second, complex socio-technical systems are by necessity underspecified and only partly predictable. Procedures and tools are adapted to the situation, to meet multiple, possibly conflicting goals, and hence, performance variability is both normal and necessary. The variability of one function is seldom large enough to result in an accident. However,

the third principle states that the variability of multiple functions may combine in unexpected ways, leading to disproportionately large consequences. Normal performance and failure are therefore emergent phenomena that cannot be explained by solely looking at the performance of system components. Fourth, the variability of a number of functions may resonate, causing the variability of some functions to exceed normal limits, the consequence of which may be an accident. FRAM as a model emphasizes the dynamics and non-linearity of this functional resonance, but also its non-randomness. FRAM as a method therefore aims to support the analysis and prediction of functional resonance in order to understand and avoid accidents.


2 RESEARCH QUESTIONS AND APPROACH

The main question addressed in this paper is which new insights this latter systemic method provides for accident analysis in comparison to the former established multi-linear method. Since the accident analysis methods compared in this paper are based on different understandings of the nature of accidents, the comparison of the methods focuses on what we can learn from both methods; how, when, and why to apply them; and which aspects of these methods may need improvement.
The paper compares STEP and FRAM in relation to a specific incident to illustrate the lessons learned

by the co-pilot as ‘‘Pilot-Flying’’ (PF) and the captain as ‘‘Pilot Non-Flying’’ (PNF). Shortly after clearance to 4000 ft, the crew was informed that runway 19R was closed because of sweeping and that the landing should take place on runway 19L. The aircraft was guided by air traffic control to land on 19L. Changing the runway from 19R to 19L resulted in a change in the go-around altitude from 4000 ft at 19R to 3000 ft at 19L. The crew performed a quick briefing for a new final approach.
During the last part of the flight, while the aircraft was established on the localizer (LLZ) and glide slope (GS) for runway 19L, the glide slope signal failed. It took the pilots some time to understand this, as they had not yet switched from the APP frequency to the tower (TWR) frequency after acknowledging the new frequency. Immediately after the glide path signal disappeared the aircraft increased its descent rate to 2200 ft/min while being flown manually towards LLZ minima. The aircraft followed a significantly lower approach than intended and was at its lowest only 460 ft over ground level at DME 4.8. The altitude at this distance from the runway should have been 1100 ft higher. The crew initiated a go-around (GA) because the aircraft was still in dense clouds and had drifted a little from the LLZ at OSL. However, the crew did not notice the below-normal altitude during approach. Later a new normal landing was carried out.
The executive summary of the Norwegian Accident Investigation Board (AIBN) explains that the investigation was focused on the glide slope transmission, its technical status and information significance for
from each method. The starting point of the study the cockpit instrument systems combined with cockpit
is the incident investigation report. A short descrip- human factors. The AIBN understanding of the situ-
tion of STEP and FRAM is included. For a more ation attributes the main cause of the incident to the
comprehensive description, the reader is referred to pilots’ incorrect mental picture of aircraft movements
Hendrick and Benner (1987; STEP) and Hollnagel and position. The report concludes that the in-cockpit
(2004; FRAM). Since different methods invite for glide slope capture representation was inadequate. In
different questions to be asked, it was necessary to addition, the report points to a deficiency in the pro-
interview air traffic controllers, pilots, and accident cedure for transfer of responsibility between approach
investigators to acquire more information. The infor- and tower air traffic control. (AIBN, 2004)
mation in this paper was collected through interviews Five recommendations resulted from the AIBN
and workshops involving a total of 50 people. The anal- investigation. The first recommendation is that the
ysis with STEP and FRAM was an iterative process responsibility between controls centres should be
between researchers and operative personnel. transferred 8 NM before landing or at acceptance
of radar hand-over. The second recommendation is
related to the certification of avionics displays, advis-
ing the verification of the information provided to
3 SUMMARY OF THE INCIDENT pilots, with special attention to glide slope and auto-
pilot status information. Third, training should take
A Norwegian Air Shuttle Boeing 737-36N with call- into account glide slope failures after glide slope cap-
sign NAX541 was en-route from Stavanger Sola air- ture under ILS approach. Fourth, Oslo airport should
port to Oslo Gardermoen airport (OSL). The aircraft consider the possibility of providing radar information
was close to Gardermoen and was controlled by Oslo to the tower controller to be able to identify approach
Approach (APP). The runway in use at Gardermoen paths deviations. The last recommendation is for the
was 19R. The aircraft was cleared to reduce altitude to airline to consider situational awareness aspects in the
4000 ft. The approach and the landing were carried out crew resource management (CRM) training.

20
[Figure: STEP worksheet. Time line (14:42:36 to 14:44:02) along the X-axis; actors (Oslo APP control, Gardermoen TWR control, runway equipment RWY-E, the captain and co-pilot as ‘‘PNF’’ and ‘‘PF’’, and aircraft AC-1) along the Y-axis. Event blocks include: APP requests AC-1 to change to TWR frequency (14:42:36); RWY-E alarm, G/S fail (14:42:55); TWR informs AC-2 of G/S fail (14:42:57); AC-1 nose moves down, A/P disconnect (14:43:27); AC-1 changes frequency to TWR (14:44:02); go-around at 460 ft. Safety problem 1 is marked by a triangle.]

Figure 1. STEP applied to NAX541 incident (simplified example).
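The worksheet structure behind a diagram like Figure 1 can be captured in a small data model: timed events per actor (rows), a time line (columns), linking arrows, and triangles marking safety problems. The sketch below is only our illustration of that structure; the class and field names are our own and are not part of the published STEP method.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    """One STEP building block: one actor performing one action at one time."""
    actor: str   # row in the worksheet
    time: str    # position on the time line (hh:mm:ss sorts correctly as text)
    action: str

@dataclass
class StepWorksheet:
    events: list = field(default_factory=list)
    links: list = field(default_factory=list)     # (earlier, later) arrow pairs
    problems: list = field(default_factory=list)  # "triangles" on the top line

    def add_link(self, earlier: Event, later: Event) -> None:
        # Arrows encode precede/follow and logical relations between events.
        self.links.append((earlier, later))

    def row(self, actor: str) -> list:
        """Row test: the complete picture of one actor's actions over time."""
        return sorted((e for e in self.events if e.actor == actor),
                      key=lambda e: e.time)

# A fragment of the NAX541 incident (times from the incident report).
ws = StepWorksheet()
req = Event("Oslo APP", "14:42:36", "requests AC-1 to change to TWR frequency")
chg = Event("AC-1 crew", "14:44:02", "changes frequency to TWR")
ws.events += [req, chg]
ws.add_link(req, chg)
ws.problems.append("1) no communication between AC-1 and tower")
```

Such a model makes the row and column tests mechanical: the row test inspects one actor's sorted events, while the column test compares event times across actors.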

4 SEQUENTIAL TIMED EVENTS PLOTTING

STEP provides a comprehensive framework for accident investigation, from the description of the accident process, through the identification of safety problems, to the development of safety recommendations. The first key concept in STEP is the multi-linear event sequence, aimed at overcoming the limitations of a single linear description of events. This is implemented in a worksheet with a procedure to construct a flowchart that stores and illustrates the accident process. The STEP worksheet is a simple matrix: the rows are labelled with the names of the actors on the left side, and the columns are labelled with marks across a time line.

Secondly, the description of the accident is performed with universal event building blocks. An event is defined as one actor performing one action. To ensure a clear description, the events are broken down until it is possible to visualize the process and understand its proper control. In addition, it is necessary to compare the actual accident events with what was expected to happen.

A third concept is that the events flow logically in a process. This is achieved by linking arrows to show precede/follow and logical relations between events. The result is a cascading flow of events representing the accident process, from the first unplanned change event to the last connected harmful event on the STEP worksheet.

The organization of the events is developed and visualized as a ‘‘mental motion picture’’. The completeness of the sequence is validated with three tests. The row test verifies that there is a complete picture of each actor's actions through the accident. The column test verifies that the events in the individual actor rows are placed correctly in relation to other actors' actions. The necessary-and-sufficient test verifies that the earlier action was indeed sufficient to produce the later event; otherwise more actions are necessary.

The STEP worksheet is used to provide a link between the recommended actions and the accident. The events represented in STEP are related to normal work and help to predict future risks. Safety problems are identified by analysing the worksheet to find event sets that constitute a safety problem. The identified safety problems are marked as triangles in the worksheet. These problems are evaluated in terms of severity and then assessed as candidates for recommendations. A STEP change analysis procedure is proposed to evaluate recommendations. Five activities constitute this procedure: the identification of countermeasures to safety problems, the ranking of the safety effects, the assessment of the trade-offs involved, the selection of the best recommendations, and a quality check.

5 APPLICATION OF STEP TO NAX541

The incident is illustrated by a STEP diagram. Due to space limitations, Figure 1 illustrates only a small part of the STEP diagram that was created based on the incident report. In Figure 1, the time line is along the X-axis and the actors are on the Y-axis. An event means an actor performing one action. The events are described in event building blocks, for example ‘‘APP requests A/C to change to TWR frequency’’. An arrow is used to link events. Safety problems are illustrated on the top line by triangles in the incident process. Three such problems were identified: 1) no communication between aircraft 1 (NAX541) and tower (triangle 1 in Figure 1); 2) changed roles between PF and PNF not coordinated; and 3) pilots not aware of low altitude (2 and 3 not shown in the simplified figure).

6 FUNCTIONAL RESONANCE ANALYSIS METHOD

FRAM promotes a systemic view for accident analysis. The purpose of the analysis is to understand
the characteristics of system functions. The method takes into account the non-linear propagation of events, based on the concepts of normal performance variability and functional resonance. The analysis consists of four steps (which may be iterated):

Step 1: Identifying essential system functions and characterizing each function by six basic parameters. The functions are described through six aspects: their input (I, that which the function uses or transforms), output (O, that which the function produces), preconditions (P, conditions that must be fulfilled to perform the function), resources (R, that which the function needs or consumes), time (T, that which affects time availability), and control (C, that which supervises or adjusts the function). The functions may be described in a table and subsequently visualized in a hexagonal representation (FRAM module, Figure 2). The main result of this step is a FRAM ‘‘model’’ with all basic functions identified.

Step 2: Characterizing the (context-dependent) potential variability through common performance conditions. Eleven common performance conditions (CPCs) are identified in the FRAM method to elicit the potential variability: 1) availability of personnel and equipment, 2) training, preparation, competence, 3) communication quality, 4) human-machine interaction, operational support, 5) availability of procedures, 6) work conditions, 7) goals, number and conflicts, 8) available time, 9) circadian rhythm, stress, 10) team collaboration, and 11) organizational quality. These CPCs address the combined human, technological, and organizational aspects of each function. After identifying the CPCs, the variability needs to be determined qualitatively in terms of stability, predictability, sufficiency, and boundaries of performance.

Step 3: Defining the functional resonance based on possible dependencies/couplings among functions and the potential for functional variability. The output of the functional description of step 1 is a list of functions, each with its six aspects. Step 3 identifies instantiations, which are sets of couplings among functions for specified time intervals. The instantiations illustrate how different functions are active in a defined context.

The description of the aspects defines the potential links among the functions. For example, the output of one function may be an input to another function, or produce a resource, fulfil a precondition, or enforce a control or time constraint. Depending on the conditions at a given point in time, potential links may become actual links, hence producing an instantiation of the model for those conditions. The potential links among functions may be combined with the results of step 2, the characterization of variability. That is, the links specify where the variability of one function may have an impact, or may propagate. This analysis thus determines how resonance can develop among functions in the system. For example, if the output of a function is unpredictably variable, another function that requires this output as a resource may be performed unpredictably as a consequence. Many such occurrences and propagations of variability may have the effect of resonance: the added variability, under the normal detection threshold, becomes a 'signal' of high risk or vulnerability.

Step 4: Identifying barriers for variability (damping factors) and specifying required performance monitoring. Barriers are hindrances that may either prevent an unwanted event from taking place, or protect against the consequences of an unwanted event. Barriers can be described in terms of barrier systems (the organizational and/or physical structure of the barrier) and barrier functions (the manner by which the barrier achieves its purpose). In FRAM, four categories of barrier systems are identified: 1) physical barrier systems block the movement or transportation of mass, energy, or information; 2) functional barrier systems set up preconditions that need to be met before an action (by human and/or machine) can be undertaken; 3) symbolic barrier systems are indications of constraints on action that are physically present; and 4) incorporeal barrier systems are indications of constraints on action that are not physically present. Besides recommendations for barriers, FRAM aims at specifying recommendations for the monitoring of performance and variability, to be able to detect undesired variability.
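The six-aspect function description of step 1 and the link identification of step 3 can be sketched as a data structure: wherever a state produced as one function's output appears among the input, precondition, resource, time, or control aspects of another function, there is a potential link. This is only an illustrative sketch under our own naming; it is not the published FRAM notation or tooling.

```python
from dataclasses import dataclass, field
from itertools import product

ASPECTS = ("input", "output", "precondition", "resource", "time", "control")

@dataclass
class FramFunction:
    """A FRAM module: one function characterized by its six aspects (I, O, P, R, T, C)."""
    name: str
    input: set = field(default_factory=set)
    output: set = field(default_factory=set)
    precondition: set = field(default_factory=set)
    resource: set = field(default_factory=set)
    time: set = field(default_factory=set)
    control: set = field(default_factory=set)

def potential_links(functions):
    """Yield (producer, state, consumer, aspect) wherever one function's
    output matches a non-output aspect of another function."""
    for src, dst in product(functions, functions):
        if src is dst:
            continue
        for state in src.output:
            for aspect in ASPECTS:
                if aspect != "output" and state in getattr(dst, aspect):
                    yield (src.name, state, dst.name, aspect)

# Two functions from the NAX541 analysis (aspects abridged for illustration).
app = FramFunction("Oslo APP control",
                   output={"transfer to TWR frq requested"})
change = FramFunction("Change APP frq to TWR frq",
                      precondition={"transfer to TWR frq requested"})

links = list(potential_links([app, change]))
```

Enumerating potential links this way corresponds to combining the step 1 function list with the step 3 coupling analysis; characterizing which of these links actually carry variability remains the qualitative work of step 2.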

[Figure: a FRAM module drawn as a hexagon, with the six aspects Time (T), Control (C), Input (I), Output (O), Precondition (P), and Resource (R) around a central Activity/Function.]

Figure 2. A FRAM module.

7 APPLICATION OF FRAM TO NAX541

Step 1 is related to the identification and characterization of functions: a total of 19 essential functions were identified and grouped according to area of operation. There are no specified rules for the 'level of granularity'; instead, functions are included or split up when the explanation of variability requires it. In this particular analysis, some higher-level functions, e.g. 'Oslo APP control', and some lower-level functions, e.g. 'Change frq to TWR control', were identified.

22
The operative areas and functions for this particular incident are:

– Crew operations: Change runway (RWY) to 19L, New final approach briefing, Auto-pilot approach (APP), Change APP frequency (frq) to TWR frq, Manual approach, Go-around, Landing, Approach, Receiving radio communication, Transmitting radio communication
– Avionics functions: Disconnect autopilot (A/P), Electronic Flight Instrument System (EFIS), Ground Proximity Warning System (GPWS)
– Air traffic control: Oslo APP control, RWY sweeping, Glide slope transmission, Gardermoen TWR control
– Aircraft in the vicinity: AC-2 communication, AC-3 communication

The NAX541 incident report contains information that helps to define aspects of functional performance. Essential functions are described with these aspects. Table 1 shows an example of the aspects of the function ‘‘Manual approach’’. Similar tables were developed for the 18 other functions.

Table 1. A FRAM module function description.

Function: Manual approach

Aspect          Description
Input           GPWS alarms; pilot informed of G/S failure
Output          Altitude in accordance with approach path; altitude lower/higher than flight path
Preconditions   A/P disconnected
Resources       Pilot Flying, Pilot Non-Flying
Time            Efficiency-Thoroughness Trade-Off; time available varies
Control         SOPs

In step 2 the potential for variability is described using the list of common performance conditions (CPCs). Table 2 presents an example of CPCs for the function ‘‘Manual approach’’.

Table 2. Manual approach CPCs.

Function: Manual approach

CPC                                  Conditions                     Performance rating
Availability of resources                                           Adequate
(personnel, equipment)
Training, preparation, competence    PF little experience on type   Temporarily inadequate
Communication quality                Delay to contact tower         Inefficient
HMI, operational support             Unclear alerts                 Inadequate
Availability of procedures                                          Adequate
Work conditions                      Interruptions?                 Temporarily inadequate?
Goals, number and conflicts          Overloaded                     More than capacity
Available time                       Task synchronisation           Temporarily inadequate
Circadian rhythm                                                    Adjusted
Team collaboration                   Switched roles                 Inefficient
Organizational quality

The description of variability is based on the information registered in the incident report, combined with a set of questions based on the CPCs. Since little of this information regarding variability was available, it was necessary to interview operational personnel (air traffic controllers, pilots). As an example, for the CPC 'HMI, operational support' one question was how aware pilots are of such EFIS/GPWS discrepancies. A pilot stated: ‘‘Boeing manuals explain which information is displayed; it is normal to have contradictory information. In this case an understanding of the system as a whole is required. Pilots need to judge relevant information for each situation.’’ An additional example of a question, for the function ‘‘Runway change’’, was whether it is normal and correct to request a runway change at such short notice. The interviews identified that there are no formal operational limits for tower air traffic controllers, but for pilots there are. Thus an understanding of performance and variability was obtained.

In step 3, links among functions are identified for certain time intervals. States are identified as valid during specific time intervals, which defines links among the aspects of functions and hence instantiates the model. An example instantiation is presented in Figure 3, where some of the links during the time interval 14:42:37–14:43:27 of the incident are described as an instantiation of the FRAM that resulted from step 1. Many more such instantiations may be generated, but here only one example can be shown.

To understand the events in relation to the links and functions in this instantiation, numbers 1–5 and letters a–d have been used to illustrate two parallel processes. Following the numbers first, the APP controller communicates to the pilot that they should contact TWR on the TWR frequency (1). This is an output of 'Oslo APP control', and an input to 'Receiving radio communication'. This latter function thus has as output the state that transfer is requested to the TWR frequency (2), which matches the preconditions of 'Change APP frq to TWR frq' and 'Transmitting radio communication'. The fulfilment of this precondition triggers the pilots to acknowledge the transfer to TWR to the APP controller (3), an output of the transmitting function,
[Figure: a FRAM instantiation showing the functions 'Oslo APP control', 'Gardermoen TWR control', 'Receiving radio comm', 'Transmitting radio comm', 'Change APP frq to TWR frq', 'Auto-pilot', 'Auto-pilot approach', 'Manual approach', and 'Glide slope transmission', connected by the links 1–6 and a–d described in the text, e.g. 'no G/S signal' (a, 14:42:55), 'A/P disconnected' (c, 14:43:27), 'frequency still set to APP' (4), and a proposed proactive TWR-APP communication link (X): check frequency change.]

Figure 3. A FRAM instantiation during the time interval 14:42:37–14:43:27 with incident data.
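An instantiation such as the one in Figure 3 can be read as a set of states valid during the time interval; a function can only proceed when the states matching its preconditions are present. The toy sketch below is our own illustration of that reading (the state and function names are abridged paraphrases of the incident data, not the published model):

```python
# States valid during 14:42:37-14:43:27 (abridged from the incident data).
states = {
    "transfer to TWR frq requested",  # link (2)
    "frequency still set to APP",     # link (4): pilots have not yet switched
    "no G/S signal",                  # link (a), from 14:42:55
}

# Each function proceeds only if all of its preconditions hold as states.
preconditions = {
    "Receive TWR broadcast of G/S failure": {"on TWR frequency", "no G/S signal"},
    "Auto-pilot approach": {"G/S signal present"},
    "Manual approach": {"A/P disconnected"},
}

def can_act(name: str) -> bool:
    """A function's preconditions must be a subset of the valid states."""
    return preconditions[name] <= states

blocked = [f for f in preconditions if not can_act(f)]
# In this interval all three are blocked: the crew is still on the APP
# frequency (4), the G/S signal is lost (a), and the A/P has not yet
# been disconnected (c, at 14:43:27).
```

Evaluating which preconditions go unfulfilled as states change over time mirrors the walkthrough of links 1–6 and a–d in the text: the delayed frequency change (4) is what excludes the crew from the TWR broadcast of the G/S failure.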

input to 'Oslo APP control'. The pilots, however, do not switch immediately after the transfer is requested; hence the output is that the frequency is still set to APP for a much longer time than intended (indicated by the red 'O'), and the pilots do not contact TWR (6) until much later. This has consequences for the precondition of receiving/transmitting (4), which is being on the same frequency as the control centre that has responsibility for the flight. With the delay in the frequency change, the link by which the pilot is informed of the G/S failure (5) is also delayed.

At about the same time, following the letters in Figure 3, 'Glide slope transmission' changes output to the state that there is no G/S signal at 14:42:55 (a), because of a failure of the G/S transmitting equipment (a resource, R in red). This makes the TWR controller inform pilots on the TWR frequency of the G/S failure (b), excluding the incident aircraft crew because of the precondition left unfulfilled by link (4), delaying the point at which the pilot is informed of the G/S failure (d). Concurrently, the loss of G/S no longer fulfils the precondition of the auto-pilot function, with the resulting output of the A/P being disconnected (c) about half a minute after G/S loss. This in turn no longer fulfils the precondition of an auto-pilot approach and instead matches the precondition for a manual approach. All of this results in variability of the manual approach, e.g. with decreased availability of time, inadequate control because of PF-PNF collaboration problems, and inadequate resources (e.g. unclear display indications of A/P and G/S status), resulting in highly variable performance (output) of the manual approach.

Step 4 addresses barriers to dampen unwanted variability, and performance variability monitoring where variability should not be dampened. AIBN recommendations could be modelled as barrier systems and barrier functions, e.g. ‘‘Responsibility between control centres should be transferred 8 NM before landing, or at acceptance by radar hand-over’’ (AIBN, p. 31, our translation). In FRAM terminology this can be
described as an incorporeal, prescribing barrier. This barrier would have an effect on the variability of the APP and TWR control functions through the aspect of control and the links between input and output in various instantiations describing communication and transfer of responsibility. New suggestions for barriers also result from the FRAM. For example, proactive communication from TWR to APP when a flight does not report on frequency would link their output and input (see link (X) in Figure 3), triggering instantiations of links 1–6 so that control and contact are re-established. This barrier may be implemented in various systems and functions, such as through regulation, training, procedures, checklists, display design, etc. The FRAM also points to the interconnectivity of air traffic control and pilot functions, suggesting joint training of these operators with a wide range of variability in the identified functions. As with any method, FRAM enables the suggestion of barriers (recommendations), which need to be evaluated by domain experts in terms of feasibility, acceptability, and cost effectiveness, among other factors.

The FRAM and the instantiations created here also point to the future development of indicators for matters such as overload and loss of control when the cockpit crew has significant experience differences.

8 COMPARISON

Accident models, whether implicitly underlying an analysis or explicitly modelling an adverse event, influence the elicitation, filtering, and aggregation of information. What, then, can we learn from the applications of STEP and FRAM to this incident?

STEP is relatively simple to understand and provides a clear picture of the course of the events. However, STEP only asks which events happened in the specific sequence of events under analysis. This means that events mapped in STEP are separated from descriptions of the normal functioning of socio-technical systems and their contexts. For example, the STEP diagram illustrates that the PNF's switch to the TWR frequency was delayed, but not why. STEP only looks for failures and safety problems, and highlights sequence and interaction between events. FRAM refrains from looking for human errors and safety problems and instead tries to understand why the incident happened. Since FRAM addresses both normal performance variability and the specifics of an adverse event, FRAM broadens the data collection of the analysis compared to a STEP-driven analysis: the development of the incident is contextualized in a normal socio-technical environment. Through asking questions based on the common performance conditions and linking functions in instantiations, FRAM identified additional factors, and the context of why performance varied becomes apparent. For example, the operational limits for runway changes for different operators were discussed; the question of why the frequency change was delayed is answered based on the normal variability in pilot/first-officer interaction patterns in cases of experience differences; and the pilots' unawareness of the low altitude is understandable with regard to variability related to, e.g., team collaboration and human-machine interface issues.

STEP provides a ‘‘mental motion picture’’ (Hendrick & Benner, 1987, p. 75) illustrating sequences of events and interactions between processes, indicating what happened when. FRAM instead sketches a 'functional slide show' with its illustrations of functions, aspects, and emerging links between them in instantiations, indicating the what and when, and with common performance conditions, variability, and functional resonance indicating the why. FRAM's qualitative descriptions of variability provide more gradations in the description of functions than the bimodal (success/failure) descriptions typical of STEP.

In relation to the question of when each method should be used, the type of incident and system to be analysed needs to be taken into account. STEP is suited to describing tractable systems, where it is possible to describe the system completely, the principles of functioning are known, and there is sufficient knowledge of key parameters. FRAM is better suited to describing tightly coupled, less tractable systems (Hollnagel, 2008b), of which the system described in this paper is an example. Because FRAM focuses not only on weaknesses but also on normal performance variability, it provides a more thorough understanding of the incident in relation to how work is normally performed. Therefore the application of FRAM may lead to a more accurate assessment of the impact of recommendations and the identification of previously unexplored factors that may have a safety impact in the future. While chain-of-events models are suited to component failures, or cases where one or more components failed, they are less adequate for explaining systems accidents (Leveson, 2001). This can be seen in the STEP-FRAM comparison here. The STEP diagram focuses on events and does not describe the systems aspects: the understanding of underlying systemic factors affecting performance is left to experts' interpretation. FRAM enables analysts to model these systemic factors explicitly.

9 CONCLUSIONS AND PRACTICAL IMPLICATIONS

This paper presented two accident analysis methods: the multi-sequential STEP and the systemic FRAM. The question of how to apply these methods was addressed
by discussing the steps of the methods, illustrated by applying them to a missed-approach incident. This paper concluded that, compared to STEP, FRAM provides a different explanation of how events result from the variability of normal performance and functional resonance. The main finding is that STEP helps to illustrate what happened, whereas FRAM covers what happened and also illustrates the dynamic interactions within the socio-technical system, letting the analyst understand the how and why by describing non-linear dependencies, performance conditions and variability, and their resonance across functions. Another important finding is that it was possible to identify additional factors with FRAM. STEP interpretation and analysis depend on investigator experience; FRAM introduces questions about systemic factors and enables the explicit identification of other relevant aspects of the accident. The example also illustrates how unwanted variability propagates, such as the information about the G/S failure, and its undesired resonance with the differences in the pilots' experience. However, several incidents in different contexts would need to be analysed to validate and generalize these findings.

Two practical implications are found. The first is that FRAM provides new ways of understanding failures and successes, which encourages investigators to look beyond the specifics of the failure under analysis into the conditions of normal work. The second is that it models and analyses an intractable socio-technical system within a specific context. While FRAM as a model has been accepted in the majority of discussions with practitioners, and seems to fill a need for understanding intractable systems, FRAM as a method is still young and needs further development. This paper has contributed to the development of the method by outlining a way to illustrate instantiations for a limited time interval. An additional need is the identification of normal and abnormal variability, which this paper has addressed briefly. Remaining challenges include a more structured approach to generating recommendations in terms of barriers and indicators, as well as evaluating how well FRAM is suited as a method to collect and organize data during the early stages of accident investigation.

ACKNOWLEDGEMENTS

This work has benefited greatly from the help and support of several aviation experts and the participants in the 2nd FRAM workshop. We are particularly grateful to the investigators and managers of the Norwegian Accident Investigation Board who commented on a draft of the model. Thanks to Ranveig K. Tinmannsvik, Erik Jersin, Erik Hollnagel, Jørn Vatn, Karl Rollenhagen, Kip Smith, Jan Hovden and the conference reviewers for their comments on our work.

REFERENCES

AIBN. 2004. Rapport etter alvorlig luftfartshendelse ved Oslo Lufthavn Gardermoen 9. februar 2003 med Boeing 737-36N, NAX541, operert av Norwegian Air Shuttle. Aircraft Investigation Board Norway, SL RAP. 20/2004.
Amalberti, R. 2001. The paradoxes of almost totally safe transportation systems. Safety Science, 37, 109–126.
Dekker, S.W.A. 2004. Ten questions about human error: A new view of human factors and system safety. Mahwah, NJ: Lawrence Erlbaum.
Hendrick, K. & Benner, L. 1987. Investigating accidents with STEP. New York: Marcel Dekker.
Hollnagel, E. 2004. Barriers and accident prevention. Aldershot, UK: Ashgate.
Hollnagel, E. 2008a. From FRAM to FRAM. 2nd FRAM Workshop, Sophia Antipolis, France.
Hollnagel, E. 2008b. The changing nature of risks. Ecole des Mines de Paris, Sophia Antipolis, France.
Hollnagel, E., Pruchnicki, S., Woltjer, R. & Etcher, S. 2008. Analysis of Comair flight 5191 with the Functional Resonance Accident Model. Proc. of the 8th Int. Symp. of the Australian Aviation Psychology Association, Sydney, Australia.
Leveson, N. 2001. Evaluating accident models using recent aerospace accidents. Technical Report, MIT Dept. of Aeronautics and Astronautics.
Perrow, C. 1999. Normal accidents: Living with high risk technologies. Princeton: Princeton University Press. (First issued in 1984).
Rochlin, G.I. 1999. Safe operation as a social construct. Ergonomics, 42, 1549–1560.
Woods, D.D. & Cook, R.I. 2002. Nine steps to move forward from error. Cognition, Technology & Work, 4, 137–144.

Development of a database for reporting and analysis of near misses
in the Italian chemical industry

R.V. Gagliardi
ISPESL Department of Industrial Plants and Human Settlements, Rome, Italy

G. Astarita
Federchimica Italian Federation of Chemical Industries, Milan, Italy

ABSTRACT: Near misses are considered an important warning that an accident may occur, and their reporting and analysis can therefore have a significant impact on industrial safety performance, above all in industrial sectors involving major accident hazards. From this perspective, the use of a specific information system, including a database designed ad hoc for near misses, constitutes an appropriate software platform that can support company management in collecting, storing and analyzing data on near misses, and in implementing solutions to prevent future accidents. This paper describes the design and implementation of such a system, developed in the context of a cooperation agreement with the Italian Chemical Industry Federation. It also illustrates the main characteristics and utilities of the system, together with future improvements that will be made.

1 INTRODUCTION

The importance of post-accident investigation is widely recognized in industrial sectors involving the threat of major accident hazards, as defined in the European Directive 96/82/EC ''Seveso II'' (Seveso II, 1997): in fact, by examining the dynamics and potential causes of accidents, lessons can be learned which could be used to identify technical and managerial improvements which need to be implemented in an industrial context. This is in order to prevent the occurrence of accidents and/or mitigate their consequences.

Such investigations are even more significant in the case of near misses, that is, ''those hazardous situations (events or unsafe acts) that could lead to an accident if the sequence of events is not interrupted'' (Jones et al. 1999). Philley et al. (2003) also stated that ''a near miss can be considered as an event in which property loss, human loss, or operational difficulties could have plausibly resulted if circumstances had been slightly different''. Their significance, according to a shared awareness in the scientific world, is due to the fact that near misses are an important warning that a more serious accident may occur. In fact, several studies (CCPS Guidelines, 2003), using the ''pyramid model'', suggest a relationship between the number of near misses and major accidents: for each major accident located at the top of the pyramid there are a large number of minor accidents and even more near misses, representing the lower portion of the pyramid. As a consequence, reducing the near misses should mean reducing the incidence of major accidents. In addition, because near misses frequently occur in large facilities, they represent a large pool of data which is statistically more significant than major accidents. These data could be classified and analyzed in order to extract information which is significant in improving safety.

The reporting of near misses, therefore, takes on an important role in industrial safety, because it is the preliminary step of subsequent investigative activities aimed at identifying the causes of near misses and implementing solutions to prevent future near misses, as well as more serious accidents. However, a real improvement in the industrial safety level as a whole can only be reached if these solutions are disseminated as widely as possible. They must also be in a form that can be easily and quickly retrieved by all personnel involved at any stage of the industrial process. From this perspective, the use of a specific and widely accessible software system, which is designed to facilitate near miss data collection, storage and analysis, constitutes an effective tool for updating process safety requirements.

This paper describes the design and implementation of such a system, carried out in the context of a cooperation agreement with the Italian Chemical Industry Federation. The various steps undertaken for the development of such a system, involved in recording near misses in the Italian chemical industry, are presented below.

2 FRAMEWORK FOR A NATIONWIDE APPROACH

2.1 Legislative background

As a preliminary remark, from a legislative perspective, the recommendation that Member States report near misses to the Commission's Major Accident Reporting System (MARS) on a voluntary basis was introduced by the European Directive 96/82/EC ''Seveso II''. This is in addition to the mandatory requirements of major accident reporting. More specifically, annex VI of the Directive, in which the criteria for the notification of an accident to the Commission are specified, includes a recommendation that near misses of particular technical interest for preventing major accidents and limiting their consequences should be notified to the Commission. This recommendation is included in the Legislative Decree n. 334/99 (Legislative Decree n. 334, 1999), the national law implementing the Seveso II Directive in Italy.

In addition, further clauses regarding near misses are included in the above mentioned decree, referring to the provisions concerning the contents of the Safety Management System in Seveso sites. In fact, the Decree states that one of the issues to be addressed by operators in the Safety Management System is the monitoring of safety performances; this must be reached by taking into consideration, among other things, the analysis of near misses, functional anomalies, and the corrective actions assumed as a consequence of near misses.

2.2 State of the art

Although legislation clearly introduces the request for the reporting of near misses, the present situation in Italy is that this activity is carried out not on a general basis but only by a number of companies, which use their own databases to share learning internally. It should also be noted that similar initiatives in this field have been undertaken by public authorities, in some cases at a regional level. However, a wider, more systematic and nationwide approach for the identification, reporting and analysis of near misses is lacking. To fill this gap a common software platform, not specific to each individual situation, facilitating the dissemination of information between different establishments as well as different industrial sectors, must be implemented. Therefore, based on the assumption that a similar approach can benefit the whole industrial sector involving major accident hazards, the development of an informative system for the collection and analysis of near misses in the chemical process industry has been undertaken at a national level. This is thanks to a strong cooperation agreement between the Italian National Institute for Prevention and Safety at Work (ISPESL) and Federchimica. ISPESL is a technical-scientific institution within the Ministry of Health, which supports the Italian Competent Authorities in the implementation of the Seveso legislation in Italy; in recent decades it has acquired wide expertise in post-major-accident investigations and reporting activities, as well as in the inspection of establishments which are subject to the Seveso legislation. Federchimica is the Italian federation of chemical industries and is comprised of more than 1300 companies, including several establishments which are subject to the Seveso legislation. In Italy, Federchimica leads the ''Responsible Care'' voluntary worldwide programme for the improvement of health, safety and environmental performances in the chemical industry. The authors of this paper hope that a joint effort in the field of near miss reporting and investigation, incorporating the different expertise and perspectives of ISPESL and Federchimica, could result in a real step forward in industrial safety enhancement.

3 SYSTEM DESIGN AND REQUIRED ATTRIBUTES

In the context of the above mentioned cooperation agreement, an informative system has been developed consisting of a web based software and a database designed for near misses. The system is specifically directed towards companies belonging to Federchimica which have scientific knowledge and operating experience in the prevention of accidents and safety control; moreover, ISPESL personnel, who are involved in inspection activities on Seveso sites, are authorized to enter the system in order to acquire all elements useful for enhancing its performances in prevention and safety issues. Two main goals have been taken into account with regard to the project design: first, gather as much information as possible on near miss events occurring in the national chemical process industry, in order to build a reliable and exhaustive database on this issue; second, provide a tool which is effective in the examination of near misses and the extraction of lessons learned, and which aims to improve industrial safety performances.

To meet these requirements a preliminary analysis of specific attributes to be assigned to the system has been carried out in order to define its essential characteristics. The first attribute which has been considered

is accessibility: to allow a wider diffusion, the system has been developed in such a way that it is easy to access when needed. A web version has therefore been built, which assures the security, confidentiality and integrity of the data handled. Secondly, in order to guarantee that the system can be used by both expert and non-expert users, the importance of the system being ''user-friendly'' has been an important consideration in its design. The user must have easy access to the data in the system and be able to draw data from it by means of pull-down menus, which allow several options to be chosen; moreover, ''help icons'' clarify the meaning of each field of the database to facilitate data entry. Third, the system must be designed in such a way as to allow the user to extract information from the database; to this end it has been provided with a search engine. By database queries and the subsequent elaboration of results, the main causes of the near misses and the most important safety measures adopted to avoid a repetition of the anomaly can be singled out. In this way the system receives data on near misses from the user and, in return, provides information to the user regarding the adoption of corrective actions aimed at preventing similar situations and/or mitigating their consequences. Lastly, the system must guarantee confidentiality: this is absolutely necessary, otherwise companies would not be willing to provide sensitive information regarding their activities. In order to ensure confidentiality, the data are archived in the database in an anonymous form. More precisely, when the user consults the database, the only accessible information on the geographical location of the event regards three macro-areas: Northern, Central and Southern Italy respectively; this prevents the geographical data inserted in the database (municipality and province) in the reporting phase from leading to the identification of the establishment in which the near misses occurred. Another factor which assures confidentiality is the use of usernames and passwords to enter the system. These are provided by Federchimica, after credentials have been vetted, to any company of the Federation which has scientific knowledge and operating experience in prevention and safety control. In this way, any company fulfilling the above conditions is permitted to enter the system, in order to load near miss data and consult the database, with a guarantee of data security and confidentiality.

Figure 1. Concept of the software system (flow: near miss → reporting phase → consultation phase → lessons learned).

4 DATABASE STRUCTURE

On the basis of the above defined attributes, which are assigned to the system, the software platform supporting the near miss database has been created; it performs two main functions, as illustrated in Fig. 1: the first is the reporting phase, for the description of the near misses, and the second is the consultation phase, for the extraction of the lessons learned from the database, both described in the following paragraphs.

4.1 Near misses reporting phase

This part contains all data regarding the near miss and is divided into two sections: the first contains all the information about the specific event (time, place, substances involved etc.); the second provides the causes, the post-accident measures put into place, and the lessons learned. It is worth noting that, in order to encourage use of the database, in the selection of fields to be filled in a compromise was reached between the need to collect as much information as possible, so as to enrich the database, and the need to simplify the completion of the form by the user. These forms are, in fact, filled in on a voluntary basis. The contents of the two sections are described in paragraphs 4.1.1 and 4.1.2 respectively. A printout illustrating the fields contained in section 1 is presented in Fig. 2, in which the fields and the explanations are shown in Italian.

4.1.1 Section 1
Geographical location: Municipality and province where the near miss occurred.
Date: The date when the near miss took place.
Time: The time when the near miss took place.
Seveso classification: Upper or lower tier establishment.
Location: The location where the near miss occurred, either inside the production units or in the logistic units, internal or external.

Figure 2. Section 1 of the near misses database.

Area: Area of the establishment in which the near miss took place, including production, storage, services, utilities or transport units.
Unit: Depending on the area selected, the unit directly involved in the near miss.
Accident description: The description of the accident as it was given by the compiler.
Substances: The substance or substances involved in the near miss.

4.1.2 Section 2
Type of event: The types of events which occurred after a near miss; this is a multi-check field allowing the user to select several options simultaneously.
Potential danger: Dangers that could have stemmed from a near miss and that could have resulted in more serious consequences if the circumstances had been slightly different.
Safety measures in place: The kind of safety measures which are in place, if any, specifying if they have been activated and, if this is the case, if they are adequate.
Causes: The analysis of causes, including equipment failure, procedural deficiencies or human factors. Each near miss may be triggered by more than one cause. The operational status of the establishment (normal operation, shut down, restart, maintenance) when the near miss took place must be specified.
Damages: The cost of damages provoked by the near miss.
Corrective actions: Corrective actions that have been put in place after the near miss in order to avoid a repetition of the event; these actions can be immediate or delayed and, if delayed, can be further divided into technical, procedural or training.
Lessons learned: Descriptions of what has been learned from the near miss and what improvements have been introduced as a consequence.
Annex: It is possible to complete the record of any near miss by adding files in pdf, gif or jpeg format.

Once this descriptive part is completed, a classification code is attributed to each near miss, so that its identification is univocal.
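The paper does not specify the format of this classification code, so the following is a purely illustrative sketch: the record fields, the code format and the global counter are assumptions, not taken from the source. One way to make a code univocal is to combine the reporting year, the anonymized macro-area and a progressive number:

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical sketch: the actual code format used by the ISPESL/Federchimica
# database is not described in the paper.
_sequence = count(1)  # progressive number shared by all reports

@dataclass
class NearMiss:
    year: int
    macro_area: str            # "North", "Centre" or "South" (anonymized location)
    description: str
    code: str = field(init=False)

    def __post_init__(self):
        # e.g. "NM-2008-NORTH-0001": univocal as long as the counter is global
        self.code = f"NM-{self.year}-{self.macro_area.upper()}-{next(_sequence):04d}"

nm = NearMiss(2008, "North", "Valve leak detected during restart")
print(nm.code)  # NM-2008-NORTH-0001
```

A real system would of course allocate the progressive number in the database rather than in process memory, so that codes remain unique across users and sessions.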

Figure 3. Near misses search criteria.
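Figure 3 shows the search criteria offered by the query system. As a rough illustration of how such keyword queries over stored reports might behave, the sketch below filters records on one or more criteria; the field names and the substring-matching logic are assumptions for the example, not the actual implementation:

```python
# Illustrative sketch only: the real system is a web application with its own
# query engine; record fields below are assumed for the example.
records = [
    {"code": "NM-0001", "location": "production", "year": 2007,
     "potential_danger": "toxic release", "description": "flange leak"},
    {"code": "NM-0002", "location": "storage", "year": 2008,
     "potential_danger": "fire", "description": "overfilled tank"},
]

def search(records, **criteria):
    """Return records matching every given criterion as a case-insensitive keyword."""
    def matches(rec):
        return all(str(value).lower() in str(rec.get(key, "")).lower()
                   for key, value in criteria.items())
    return [rec for rec in records if matches(rec)]

print([r["code"] for r in search(records, location="storage", year=2008)])
# ['NM-0002']
```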

4.2 Near misses consultation phase

As specified at the beginning of the previous paragraph, the second part of the software platform concerns the extraction of lessons learned by consultation of the database contents. In fact, one of the main objectives of the database is the transfer of technical and managerial corrective actions throughout the chemical process industry. The database has therefore been supplied with a search engine which aids navigation through near miss reports, allowing several options. First, the user is able to visualize all near misses collected in the database in a summary table illustrating the most important fields, that is, respectively: event code, submitter data, event description, damages, immediate or delayed corrective actions, lessons learned, and annex. Second, the details of a specific event can be selected by clicking on the corresponding code; it is also possible to visualize any additional documents, within the record of the specific event, by clicking on ''Annex''. Third, to extract a specific near miss from the database on the basis of one or more search criteria, a query system has been included in the software utilities. This is done by typing a keyword relevant, for example, to the location, unit, year of occurrence, potential danger, etc., as shown in Fig. 3.

5 RESULTS AND DISCUSSION

Thanks to these functions, the analysis of the database contents can provide company management with the information required to identify weaknesses in the safety of the industrial facilities, study corrective actions performed to prevent the occurrence of accidents, prioritize the preventive and/or mitigating measures needed, and better understand the danger of specific situations. Another important function is the transfer of knowledge and expertise to a younger workforce, thus supporting effective training activities.

In this initial phase of the project the first priority has been the collection of as much near miss data as possible. In order to reach this goal, the use of the database has been encouraged among the partners of Federchimica, who are, in fact, the main contributors of the data entered. To publicize the use of the software system, the institutions responsible for the project, ISPESL and Federchimica, have organized a number of workshops in different Italian cities to illustrate the performances and potentialities of the database. The feedback from the industries with respect to the initiative promoted by ISPESL and Federchimica can be considered, on the whole, positive, at least for those bigger companies which already dedicate economic and human resources to industrial safety; further efforts are required to motivate the smaller enterprises to use the database as well.

The system is currently in a ''running in'' phase, which will provide an in-depth evaluation of its performance and identify potential improvements. This phase involves the analysis of the (approximately) 70 near misses which are now stored in the database. A preliminary analysis of the first data collected shows that, for all the near misses reported, all fields contained in the database have been filled in; this fact represents a good starting point for an assessment of the quality of the technical and managerial information gathered, as well as of the reported lessons learned, which will be carried out in the near future. Further distribution to the competent authorities and to other industrial sectors will be considered in a second stage.

6 CONCLUDING REMARKS

A software system for the collection, storage and analysis of near misses in the Italian chemical industry has been developed for multiple purposes. It allows information regarding the different factors involved in a

near miss event to be shared, and information regarding lessons learned to be disseminated among all interested stakeholders in a confidential manner. The system which has been developed attempts to fill a gap which exists in the wide range of databases devoted to process industry accidents. The system can also fulfil the legal requirements for the implementation of the Safety Management System; in fact, it offers industry management the possibility of verifying its safety performances by analyzing near miss results. Its user-friendly character can stimulate the reporting of near misses as well as the sharing of lessons learned, supporting industry management in the enhancement of safety performances. It is the intention of the authors to continue the work undertaken, creating a wider range of functions and keeping the database up to date. From a preliminary analysis carried out on the data collected, we realized that, in the near future, a number of aspects regarding the management of near misses will need to be investigated in more depth. For example, the screening criteria for the selection of near misses must be defined. These criteria are necessary for identifying near misses of particular technical interest for preventing major accidents and limiting their consequences, as required by the Seveso legislation. Another element which requires in-depth analysis is the root causes of near misses, and the corrective measures needed. All the above mentioned issues can benefit from specific training activities implemented in the chemical industry; these should involve all personnel who use the software system. These training activities, to which both ISPESL and Federchimica may vigorously contribute, are expected to lead to an enhancement in the quality of the retrieval of the underlying causes of near misses. They should also lead to an increase in the reliability of the data stored as well as of the lessons learned extracted from the database.

REFERENCES

Center for Chemical Process Safety, 2003. Investigating Chemical Process Incidents. New York: American Institute of Chemical Engineers.
European Council, 1997. Council Directive 96/82/EC on the major accident hazards of certain industrial activities (''Seveso II''). Official Journal of the European Communities. Luxembourg.
Jones, S., Kirchsteiger, C. & Bjerke, W. 1999. The importance of near miss reporting to further improve safety performance. Journal of Loss Prevention in the Process Industries, 12, 59–67.
Legislative Decree 17 August 1999, n. 334 on the control of major accident hazards involving dangerous substances. Gazzetta Ufficiale n. 228, 28 September 1999, Italy.
Philley, J., Pearson, K. & Sepeda, A. 2003. Updated CCPS Investigation Guidelines book. Journal of Hazardous Materials, 104, 137–147.


Development of incident report analysis system based on m-SHEL ontology

Yoshikazu Asada, Taro Kanno & Kazuo Furuta
Department of Quantum Engineering and Systems Science, The University of Tokyo, Japan

ABSTRACT: As systems become more complex, simple mistakes or failures may cause serious accidents. One measure against this situation is to understand the mechanism of accidents and to use that knowledge for accident prevention. At present, however, the analysis of incident reports has not kept up with the pace of their accumulation, and databases of incident reports are insufficiently utilized. In this research, an analysis system for incident reports is developed based on the m-SHEL ontology. This system is able to process incident reports and to obtain knowledge relevant for accident prevention efficiently.

1 INTRODUCTION

Recently, the number of complex systems, such as nuclear systems, air traffic control systems or medical treatment systems, has been increasing with progress in science and technology. This change makes our lives easier, but it also changes the causes of system failures. With advanced technology, simple mistakes or incidents may cause serious accidents. In other words, it is important to understand the mechanism of incidents in order to avoid major accidents.

Small accidents or mistakes that occur behind a huge accident are called incidents. Heinrich found a relationship between huge accidents and incidents (Heinrich, 1980). His study showed that there are 29 small accidents with light injuries and 300 near misses behind one big accident with seriously injured people (Figure 1). Bird also found the importance of near misses or incidents for accident prevention (Bird & George, 1969).

Figure 1. Heinrich's law.

Of course this ratio will change among different situations, but the important point is the fact that there are a lot of unsafe conditions behind a serious problem.

For this reason, reducing the number of incidents also contributes to decreasing the number of major accidents. Nowadays many companies operate an incident reporting system. NUClear Information Archives (NUCIA), a good example of such a system, is a web site operated by the Japan Nuclear Technology Institute [3]. On this site, lots of information such as inspection results or incident reports in nuclear power plants is available to the public.

This is an example of good practice for safety management. However, though incident reports are collected, they are not always analyzed properly. There are several reasons, but the main ones are as follows:

1. Insufficient expertise for incident analysis
2. Inconsistent quality of incident reports

Even though a lot of incident reports are compiled, in most cases they are only stored in the database. It is important to solve this problem by

analyzing incident reports using advanced information
processing technologies.
In this study, a conceptual design of an analysis system for incident reports is proposed, and its verification is conducted experimentally. We adopted the nuclear industry as the particular application domain because of the availability of NUCIA.
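Heinrich's 1 : 29 : 300 ratio cited in the introduction can be turned into a rough back-of-envelope estimate. The function below is purely illustrative; as the authors note, the ratio varies between situations:

```python
# Illustrative use of Heinrich's 1 (major) : 29 (minor) : 300 (near miss) ratio.
def heinrich_estimate(near_misses, ratio=(1, 29, 300)):
    """Estimate the major and minor accidents implied by a near-miss count."""
    major, minor, near = ratio
    return {"major": near_misses * major / near,
            "minor": near_misses * minor / near}

print(heinrich_estimate(600))  # {'major': 2.0, 'minor': 58.0}
```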

2 PROBLEMS OF INCIDENT REPORTS

A number of incident reporting systems are now operated in various areas, but there are some problems. The problems are twofold.

The first problem is how the report is made. There are two types of reporting schemes: multiple choice and freestyle description. Multiple choice is easy to report with, because the reporter only has to choose one or a few items from alternatives already given. It saves reporting time, and for this reason many existing reporting systems use this type. However, some important information may be lost with this reporting scheme. The purpose of incident reporting is to analyze the cause of an incident and to share the obtained knowledge; if the reports are ambiguous, it is hard to know the real causes of an incident.

The second problem concerns errors in the contents. The estimation of the cause and the investigation of preventative measures are the most important parts of the report. However, there are sometimes mistakes in the analysis. These parts are complicated, of course, but an exact description is needed.

It is not easy to get rid of these problems, and in this study it is assumed that no incident reports are free from them. The influence of these problems will be discussed later.

3 ANALYSIS OF HUMAN FACTORS

3.1 m-SHEL model

For analysis of accidents, it is necessary to break the whole view of an incident down into elementary factors, such as troubles in devices, lack of mutual understanding among staff, or misunderstanding by an operator. Some frameworks for accident analysis have been proposed, such as m-SHEL or 4M-4E. They were originally proposed for risk management, and they can support the understanding of incidents.

The m-SHEL model was proposed by Kawano et al. (Kawano, 2002). This model classifies accident factors into five elements of a system: m (management), S (software), H (hardware), E (environment), and L (liveware). In the m-SHEL model, there are two livewares: the central one is the interested person, and the bottom one is the other people in relation with the central person. This model is shown in Figure 2.

Figure 2. Diagram of m-SHEL model.

In Figure 2, there are gaps between elements. This means that mismatches between different elements may cause an accident. For example, misunderstanding of the operation manual may happen in the gap between S and L. To fill the gaps, the interface that connects different elements plays an important role.

3.2 COCOM

There are many incidents caused by human behavior, such as carelessness or rule violation. For this reason, the liveware in the m-SHEL model must be focused on. In this study, the COCOM model is applied to incident analysis.

The COCOM model was proposed by E. Hollnagel (Hollnagel, 1993). Hollnagel categorized the state of human consciousness into the following four classes:

• Strategic control (high attention)
• Tactical control
• Opportunistic control
• Scrambled control (low attention)

According to this model, human consciousness changes with the attention level given to the task. When workers pay high attention to the task, they know well how to do it, and there are no environmental causes of an accident; they are doing the task in a strategic way.

Once some problem happens, however, the situation changes. Workers may lose their composure, and sometimes they become scrambled. In such a situation, it is difficult for them to do the task in a strategic or tactical manner.

By focusing on human consciousness based on COCOM, it is possible to make a short list of incident causes. This is useful both for causal analysis and for risk

prediction of a similar incident by checking specific factors.

3.3 Nuclear Information Archives (NUCIA)

Here, let us look into an example of an incident report database. In NUCIA, much information, such as inspection results or incident reports from nuclear power plants, is available to the public. Disclosure of the information is conducted under the following ideas:

• Various opinions not only from electric power suppliers but also from industry-government-academia communities are useful to solve problems in nuclear safety.
• Increasing transparency to society contributes to obtaining trust from the public.

The collected information, however, is not utilized enough at present. There are several reasons, as follows:

• The amount of data is sufficient, but no effective methods of analysis are available.
• While some reports are detailed, others are not. There is little uniformity in the description method.

Under such circumstances, even though there is a lot of useful information in the database, analysis and application have not kept up with the accumulation of data.

3.4 Ontology

Ontology here means relations between concepts, like synonyms or sub-concepts. Mizoguchi defines ontology as a theory of vocabulary/concepts used for building artificial systems (Mizoguchi, Kozaki, Sano & Kitamura). Figure 3 shows an example of ontology.

Figure 3. An example of ontology.

There are four elements in an ontology. They are:

1. A conceptual set which consists of the elementary concepts of the intended field.
2. A hierarchical structure formed by is-a links between two concepts, like ''a BMW is a car'' or ''a dog is a mammal''.
3. Some relationships between different concepts other than the is-a link, like ''a car has an engine''.
4. Axioms that define or constrain concepts or links.

In this study, an ontology is defined for the analysis of incidents that occurred in nuclear systems as follows:

1. A set of concepts necessary for the analysis of incidents in nuclear power plants.
2. A hierarchical structure. In nuclear engineering, the structure of a nuclear power plant and the organizational framework are described.
3. Relationships. Causality information is defined in this part.
4. Definitions of axioms. Synonyms are also defined.

An ontology-based system is different from knowledge base systems, expert systems or artificial intelligence. These traditional approaches have some problems:

• Little knowledge is usable in common between different fields.
• New information cannot be added to the existing knowledge base easily.

Compared with a knowledge-based system, an ontology is meta-level knowledge. In this research, the incident ontology contains not only detailed knowledge and case examples (which form a traditional knowledge base), but also associations of incidents and causality information (which are meta-level information). For this reason, analysis by an ontology-based system is more multidirectional.

3.5 m-SHEL ontology

There is some previous work on computer-based analysis. Johnson proposed case-based retrieval with a semantic network (Johnson 2003); in his research, the network was made of basic verbal phrases, such as is_a, makes or resolved_by. Cavalcanti & Robertson performed document synthesis with techniques like XML and Semantic Web systems (Cavalcanti & Robertson 2003).

In this study, an m-SHEL ontology is developed using XML (eXtensible Markup Language) (Tim et al. 1998). The m-SHEL ontology is based on incident reports from NUCIA. In this ontology, each element has at least one m-SHEL attribution, such as Software or Environment. All construction processes, such as tagging of attributions and making the hierarchical structure, were done manually. Below is an example of the XML description of a concept contained in the m-SHEL ontology.

35
<concept>
  <label>
    (Information about the concept.
     This label is for readability.)
  </label>
  <ont-id>
    143
  </ont-id>
  <mSHEL>
    H
  </mSHEL>
  <description>
    (Information about what this concept is.
     This is used for searching.)
  </description>
  <cause-id>
    217
    (id of the concept which causes No. 143)
  </cause-id>
  <effect-id>
    314
    (id of the concept which is caused by No. 143)
  </effect-id>
</concept>
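Entries of this form can be read with any standard XML parser. The following sketch (Python, standard library only) parses one concept entry into a dictionary; the element names are taken from the listing above, while the sample label and description texts are invented placeholders, not NUCIA data:

```python
import xml.etree.ElementTree as ET

# A concept entry shaped like the m-SHEL ontology listing above.
# The label/description strings are illustrative placeholders.
CONCEPT_XML = """
<concept>
  <label>Cooling pump inspection</label>
  <ont-id>143</ont-id>
  <mSHEL>H</mSHEL>
  <description>Periodic inspection of the cooling pump</description>
  <cause-id>217</cause-id>
  <effect-id>314</effect-id>
</concept>
"""

def parse_concept(xml_text):
    """Turn one <concept> element into a plain dictionary."""
    elem = ET.fromstring(xml_text)
    return {
        "label": elem.findtext("label").strip(),
        "id": int(elem.findtext("ont-id")),
        "mSHEL": elem.findtext("mSHEL").strip(),      # m, S, H, E or L attribution
        "description": elem.findtext("description").strip(),
        "cause_id": int(elem.findtext("cause-id")),   # concept which causes this one
        "effect_id": int(elem.findtext("effect-id")), # concept caused by this one
    }

concept = parse_concept(CONCEPT_XML)
print(concept["id"], concept["mSHEL"])  # prints: 143 H
```

In a full system the same routine would be applied to every <concept> element of the ontology file to build an id-indexed table of concepts.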

The m-SHEL ontology has not only information on the hierarchical structure of elements, but also on causal associations. Causality is defined by a link orthogonal to the concept hierarchy. Causality is described in XML as follows:

<causality>
  <label>
    (Information about the causality.
     This label is for readability.)
  </label>
  <causality-id>
    C-13
  </causality-id>
  <cause-id>
    147
    (id of the element which causes C-13)
  </cause-id>
  <effect-id>
    258
    (id of the element which is caused by C-13)
  </effect-id>
  <weight>
    0.3 (weighting factor)
  </weight>
</causality>

In this example, the causality labeled C-13 is caused by concept No. 147, and it causes concept No. 258. These concept numbers are the same as the ids of the elements of the ontology.

There is a weighting factor in each causality. It is defined by how many cases there are in the reports which have the same cause. This weighting factor is used to decide which causality leads to the most or the least likely outcome.

Figure 4 shows a part of the m-SHEL ontology.

Figure 4. A part of m-SHEL ontology.

4 ANALYSIS SYSTEM

After having made the m-SHEL ontology, an analysis system was developed. The flow of this system is as follows:

1. Input incident data in text form. This step can be done both manually and by importing XML-style files.
2. Since no spaces are placed between words in Japanese, morphological analysis is applied to the input text. It is explained in Section 4.1.
3. Keywords from the text are selected and mapped onto the ontology. Keywords are the words included in the m-SHEL ontology.
4. The required information of the incident is structured both in the concept hierarchy and in causality relations.

Figure 5 shows an example of a mapping result. For visualization, only specific elements related to the incident are shown here.
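Collected over a whole report set, weighted causality links of this kind form a directed graph over concept ids, and the weighting factor can be used to rank candidate causes of a given concept. A minimal sketch follows (Python; the ids and the 0.3 weight echo the listing above, while the extra links and helper names are illustrative, not part of the authors' system):

```python
from collections import defaultdict

# (causality-id, cause-id, effect-id, weight) tuples as they would be
# read from <causality> entries; C-14 and C-15 are invented examples.
CAUSALITIES = [
    ("C-13", 147, 258, 0.3),
    ("C-14", 151, 258, 0.6),
    ("C-15", 258, 301, 0.8),
]

def build_graph(causalities):
    """Map each effect concept id to its weighted candidate causes."""
    causes_of = defaultdict(list)
    for cid, cause, effect, weight in causalities:
        causes_of[effect].append((weight, cause, cid))
    return causes_of

def most_likely_cause(causes_of, concept_id):
    """Pick the causality with the largest weighting factor, if any."""
    candidates = causes_of.get(concept_id, [])
    if not candidates:
        return None
    weight, cause, cid = max(candidates)
    return cause, cid, weight

graph = build_graph(CAUSALITIES)
print(most_likely_cause(graph, 258))  # prints: (151, 'C-14', 0.6)
```

This mirrors the stated use of the weighting factor: among several causalities pointing at the same concept, the one backed by the most cases in the reports is preferred.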

Figure 5. An example of mapping result.

A visual output function is not implemented in the current system; only output in an XML format is available now. Figure 5 was generated by separate diagram-drawing software.

4.1 Morphological analysis

Japanese is different from English in that no spaces are inserted between words. If an English sentence were written like Japanese, it would be written as follows:

Thisisasampletext. (a)

In the m-SHEL ontology, it is necessary to separate each word before applying text mining methods. For this step, MeCab (Yet Another Part-of-Speech and Morphological Analyzer) is used (MeCab). MeCab is able to convert Japanese-style sentences like (a) into English-style sentences like (b):

This is a sample text. (b)

MeCab has its own dictionary for morphological analysis. The default dictionary does not have many technical terms. Without improvements, analysis by MeCab does not work well for incident reports, which include a lot of technical or uncommon words.

To solve this problem, some changes have been made to the dictionary and to the algorithm of morphological analysis.

Firstly, some terms used in nuclear engineering or human factor analysis were added. This kind of improvement is the simplest and most reliable, but some problems remain. Even though the field of the text being dealt with is specialized (in this case, nuclear engineering), there are very many necessary words, and it is hard to add all of these terms. Moreover, this improvement is required every time the field of application is changed, i.e., important technical words are different in different application fields.

Secondly, the algorithm to extract technical words has been adjusted. In Japanese, many technical terms are compound nouns, which are formed from two or more nouns. From a grammatical viewpoint, two nouns rarely appear continuously. Moreover, Japanese Kanji is often used in those words. From these points, the following rule is added: continuous Kanji nouns are combined to form a single technical word. This is a very simple change, but around 60% of keywords can be found by this rule. Figure 6 shows the percentage of compound nouns detected for ten arbitrary sample cases of incident reports.

Figure 6. Fraction of compound nouns detected.

4.2 Deletion of unnecessary parts

Another change to the algorithm is the deletion of unnecessary parts of reports. Incident reports sometimes contain information additional to causal analysis. An example of such additional information concerns horizontal development, which is to develop lessons learned from an incident horizontally throughout the company or the industry. Additional information may sometimes work as noise for causal analysis; it is therefore eliminated from reports before processing, if the purpose of the analysis does not include horizontal development.

5 VERIFICATION EXPERIMENT

In this chapter, a test analysis of incident reports has been carried out using the proposed system. Sample reports were taken from the NUCIA database, which is open to the public. The verification experiment followed these steps:

1. Before making the m-SHEL ontology, some incident reports are set aside. These reports are used for verification.
2. Some reports were analyzed with the system automatically.
3. An expert analyzed the same reports manually to check the adequacy of the analysis by the system.
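The compound-noun rule described in Section 4.1 (continuous Kanji nouns are combined into a single technical word) can be sketched as a single pass over part-of-speech-tagged tokens. In the Python sketch below, the token list stands in for MeCab output, and the Unicode-range Kanji test and the sample tokens are our assumptions, not the authors' implementation:

```python
def is_kanji_noun(surface, pos):
    """True for noun tokens written entirely in Kanji (CJK Unified Ideographs)."""
    return pos == "noun" and all('\u4e00' <= ch <= '\u9fff' for ch in surface)

def merge_compound_nouns(tokens):
    """Combine runs of consecutive Kanji nouns into single technical words."""
    merged, run = [], []
    for surface, pos in tokens:
        if is_kanji_noun(surface, pos):
            run.append(surface)
            continue
        if run:  # a run of Kanji nouns just ended: emit it as one word
            merged.append(("".join(run), "noun"))
            run = []
        merged.append((surface, pos))
    if run:
        merged.append(("".join(run), "noun"))
    return merged

# "原子炉" (reactor) + "給水" (feed water) are Kanji nouns and get joined;
# "ポンプ" (pump) is written in Katakana and is left alone.
tokens = [("原子炉", "noun"), ("給水", "noun"), ("ポンプ", "noun"), ("の", "particle")]
print(merge_compound_nouns(tokens))
```

A real pipeline would feed MeCab's tokenization into such a pass before matching keywords against the ontology; the rule's reported 60% keyword coverage comes from the paper, not from this sketch.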

4. The two results were compared. At this step, we focused on how many elements and causality relations were extracted automatically.

In this study, seven reports were used for verification. The incidents analyzed are briefly described below:

• automatic trip of the turbine generator due to breakdown of the excitation equipment,
• automatic trip of the reactor due to high neutron flux of the Intermediate Range Monitor (IRM),
• impaired power generation due to switching over of the Reactor Feed Pump (RFP), and
• manual trip of the reactor because of degradation of vacuum in a condenser.

Table 1. Result of verification test.

      Case 1  Case 2  Case 3  Case 4  Case 5  Case 6  Case 7
m     0/0     1/2     0/1     4/7     2/2     0/1     2/4
S     2/6     3/7     5/6     2/4     1/3     1/2     4/7
H     7/18    5/12    6/13    5/12    9/10    0/8     5/7
E     0/1     0/0     0/0     2/2     0/1     3/3     0/1
L     2/4     1/5     2/4     2/5     3/4     3/5     4/3
c*    1/4     0/3     2/4     3/6     4/5     0/2     1/2

* Number of causal associations.

The result of the test is shown in Table 1. In this table, each number shows how many concepts contained in the ontology were identified in the input report. The numbers on the left represent the result obtained by the system and those on the right the result by the expert. For example, 3/7 means three concepts were identified by the system, and seven by the expert.

It should be noted here that it is not the numbers themselves that are important; rather, the coincidence of the two numbers is significant for judging the validity of the analysis by the system. In this research, since the algorithm of the system is quite simply designed, it is reasonable to regard the analysis by the expert as the reference. If the system identified more concepts than the expert, it is probable that the system picked up noise, rather than that it analyzed more accurately than the expert.

6 DISCUSSION

In Table 1, Hardware elements are relatively well extracted. Plenty of information related to parts of a power plant, such as the CR (control rod) or coolant water, is included in incident reports. These words are all categorized as Hardware items, so the number of Hardware items is larger than the others. This is the reason why Hardware has a high ratio of appearance. A similar tendency is shown with Software elements.

Another reason for the high ratio is a trend in incident reporting in the nuclear industry of Japan: incidents are reported from the viewpoint of the failure mechanism of hardware rather than of human factors.

This trend also resulted in the low presence of Environment and management items. Not only the result of the system, but also that of the expert, marked low counts for these items. This means that the low scores are attributable not to the system but to some problems in the original reports.

On the other hand, the results for Liveware and causal association are different from those for Environment and management. The results of automatic analysis for Liveware and causal association also marked low scores. However, the expert's scores were not as low as the system's. It seems this outcome is caused by some defects in the m-SHEL ontology.

7 CONCLUSION

An analysis system of incident reports has been developed for nuclear power plants. The method of analysis adopted is based on the m-SHEL ontology. The result of automatic analysis failed to mark high scores in the assessment, but this is partly because of the contents of the original incident report data. Though there is room for improvement of the m-SHEL ontology, the system is useful for processing incident reports and for obtaining knowledge useful for accident prevention.

REFERENCES

H.W. Heinrich. 1980. Industrial accident prevention: A safety management approach. McGraw-Hill.
Frank E. Bird Jr. & George L. Germain. 1969. Practical Loss Control Leadership. Intl Loss Control Inst.
NUCIA: Nuclear Information Archives, http://www.nucia.jp
R. Kawano. 2002. Medical Human Factor Topics. http://www.medicalsaga.ne.jp/tepsys/MHFT_topics0103.html
E. Hollnagel. 1993. Human reliability analysis: Context and control. Academic Press.
Riichiro Mizoguchi, Kouji Kozaki, Toshinobu Sano & Yoshinobu Kitamura. 2000. Construction and Deployment of a Plant Ontology. Proc. of the 12th International Conference on Knowledge Engineering and Knowledge Management (EKAW2000). 113–128.
C.W. Johnson. 2003. Failure in Safety-Critical Systems: A Handbook of Accident and Incident Reporting. University of Glasgow Press. http://www.dcs.gla.ac.uk/~johnson/book/
J. Cavalcanti & D. Robertson. 2003. Web Site Synthesis based on Computational Logic. Knowledge and Information Systems Journal, 5(3):263–287.
MeCab: Yet Another Part-of-Speech and Morphological Analyzer, http://mecab.sourceforge.net/
Tim Bray, Jean Paoli & C.M. Sperberg-McQueen (eds.). 1998. Extensible Markup Language (XML) 1.0: W3C Recommendation 10-Feb-1998. W3C.


Forklifts overturn incidents and prevention in Taiwan

K.Y. Chen & Sheng-Hung Wu


Graduate School of Engineering Science and Technology, National Yunlin University of Science and Technology,
Douliou, Yunlin, Taiwan, ROC

Chi-Min Shu
Process Safety and Disaster Prevention Laboratory, Department of Safety, Health, and Environmental
Engineering, National Yunlin University of Science and Technology, Douliou, Yunlin, Taiwan, ROC

ABSTRACT: Forklifts are so maneuverable that they can move almost everywhere. With a stack board, forklifts also have the capability of loading, unloading, lifting and transporting materials. Forklifts are not only widely used in various fields and regions, but are also common in industry for materials handling. Because they are used so frequently, any deficiency, such as an incorrect forklift structure, inadequate maintenance, poor working conditions, wrong operation by the forklift operator, and so on, may result in property damage and casualties. Forklifts may, for example, be operated (1) over speed, (2) in reverse or while rotating, (3) overloaded, (4) lifting a worker, (5) on an inclined road, or (6) with obscured vision, which may result in overturning, crushing the operator, hitting pedestrian workers, causing loads to collapse, or lifting a worker high so that he falls. Such events result in labor accidents and the loss of property. According to the significant professional disaster statistical data of the Council of Labor Affairs, Executive Yuan, Taiwan, approximately 10 laborers perish in Taiwan due to forklift accidents annually. This obviously shows that forklift risk is extremely high. If the operational site, the operator, the forklift and the work environment do not meet the safety criteria, a labor accident can occur. As far as loss prevention is concerned, care should be taken to guard material handling, especially forklift operation. This study provides some methods for field applications in order to prevent forklift overturn accidents and any related casualties.

1 INTRODUCTION

According to the significant occupational disaster statistical data of the Council of Labor Affairs (CLA), Executive Yuan, Taiwan, forklifts caused approximately 10 fatalities annually from 1997 to 2007, as listed in Fig. 1 (http://www.iosh.gov.tw/, 2008). The manufacturing industry; the transportation, warehousing and communication industry; and the construction industry are the three sectors with the highest numbers of occupational forklift accidents in Taiwan, as listed in Fig. 2. In addition, the forklift accidents were ascribed to being struck by forklifts, falling off a forklift, being crushed by a forklift, and overturning, in that order. Similarly to cases in the USA, forklift overturns are the leading cause of fatalities involving forklift accidents; they represent about 25% of all forklift-related deaths (NIOSH alert, 2001). Figure 3 reveals that forklift occupational accidents can be classified into five main occupational accident types, explained as follows (Collins et al., 1999; Yang, 2006).

1.1 Being struck by forklifts

The driver was obscured by goods piled too high; drove, reversed or turned the forklift too fast; forgot to use the alarm device, turn signal, head lamp, rear lamp or other signals while driving or reversing; was not noticed by pedestrian workers or cyclists; or was disturbed by the work environment, such as a corner, an exit or entrance, a shortage of illumination, noise, rain, etc., causing laborers to be struck by forklifts.

1.2 Falling from forklifts

The forklift driver lifted a worker high to work on the truck forks, a stack board or other materials, but did not establish a work table or use a safety belt, safeguard equipment and so on, causing the worker to fall off the forklift easily.

1.3 Collapsing and crushed

The material handling was not suitable, so the loads were piled up too high to be stable. The mast

inclined while the forklift operator was driving, so a worker stood on the forklift to assist material handling by holding the goods with his hands and so on, which easily made the goods collapse and crush the assistant worker or pedestrians nearby; other disasters include collisions with workplace materials, resulting in the collapse of the materials and the wounding of operators nearby.

Figure 1. Fatalities of forklift accidents in Taiwan, from 1997 to 2007. (Fatalities per year, 1997–2007: 12, 9, 10, 16, 4, 5, 10, 13, 10, 7, 6.)

Figure 2. Fatalities from forklift accidents distributed over different industries, Taiwan, from 1997 to 2007. (Manufacturing 60; transportation, warehousing and communication 27; construction 10; others 5.)

Figure 3. The number of deaths in forklift accidents divided into different types, Taiwan, from 1997 to 2007. (Struck 37; falling 17; collapsing and crushed 16; overturn 15; stuck and pinned 11; hitting 4; others 2.)

1.4 Overturning

Because of the high speed of the forklift's reverse or rotation, or because of ascending, descending, uneven ground, wet slippery ground, soft ground, overly high lifted truck forks, or overloads, the forklift overturned and crushed the operator.

1.5 Becoming stuck or pinned

Workers got stuck or pinned between the truck forks and mast or the tires while repairing or maintaining the forklifts. In other instances, the operator forgot to turn the forklift off before adjusting the goods on the truck forks, or got off and stood in front of the forklift, next to the forklift's dashboard, to adjust the goods. When the operator then returned to the driver's seat, he or she touched the mast operating lever carelessly, moving the mast backward, and got his or her head or chest stuck between the mast and the overhead guard.

This study focused on reporting forklift overturn accidents and on preventing them from occurring. Two forklift overturn accidents are described below (http://www.iosh.gov.tw/, 2008; http://www.cla.gov.tw/, 2008).

2 OVERTURN CASE REPORT

2.1 Case one

On December 4, 2005, at ca. 1 p.m., a 34-year-old laborer who drove a forklift to discharge mineral water was crushed to death by the forklift. The laborer reversed the forklift on a board into a trailer. The forklift slipped because a wheel deviated from the board. It then turned over, crushing the driver to death, as shown in Fig. 4.

2.2 Case two

Figure 5 shows that on September 27, 2005, at about 11:00 am, an 18-year-old laborer engaged in forklift operation accidentally hit the connecting rod of a shelf while driving; the forklift then overturned, crushing him, and he passed away. The laborer had operated the forklift to lift cargo up to a shelf. After laying the cargo aside 7 meters high, the forklift left the shelf. The forklift moved without the truck forks being lowered. The forklift's mast accidentally hit against the connecting rod and inclined. The operator was shocked by the disaster
and escaped quickly from the driver's seat toward the warehouse entrance. However, the forklift could not be stopped or decelerated, and kept moving toward the warehouse entrance. After the forklift mast hit against the connecting rod, its center of gravity shifted to the driver's seat side. The forklift reversed in the direction of the warehouse entrance. The laborer was hit and killed. A similar accident of a forklift overturning and tipping is pictured in Fig. 6.

Figure 4. Forklift overturns from the ladder.

Figure 5. Forklift overturns due to hitting against the connecting rod of the cargo.

Figure 6. Forklift overturning and tipping due to overloads.

3 RELATED REGULATIONS

The CLA has promulgated the Rule of Labor Safety and Health Facilities (CLA, 2007) and the Machine Tool Protection Standard (CLA, 2001), last modified in 2007 and 2001 respectively, for forklift equipment, measures and maintenance. The CLA also has the Rule of Labor Safety Health Organization Management and Automatic Inspection (CLA, 2002) for forklift inspection. The standard regulates operator training and licensing as well as periodic evaluations of operator performance. The standard also addresses specific training requirements for forklift operations, loading, seat belts, overhead protective structures, alarms, and maintenance. Refresher training is required if the operator is (a) observed operating the truck in an unsafe manner, (b) involved in an accident or mistake, or (c) assigned a different type of forklift.

3.1 Forklift equipment, measures and maintenance

a. The employer should provide safe protective equipment for moving forklifts, and its setup should be carried out according to the provisions of the machine apparatus protection standard.
b. The employer should make sure that forklifts do not carry a laborer on the pallet or skid of goods on the forks, or on any other part of the forklift outside the driver's seat, and the driver or relevant personnel should be responsible for this. Forklifts that have been stopped, or that have equipment or measures to keep laborers from falling, are not subject to this restriction.
c. The pallet or skid used on the forks of a forklift should be able to carry the weight of the goods.
d. The employer should ensure that the forks, etc., are placed on the ground and the power of the forklift is shut off when the forklift operator alights.
e. The employer should not use a forklift without the back rack in place, except where an inclining mast and goods falling off the forklift cannot endanger a laborer.
f. The employer should provide the necessary safety and health equipment and measures when the employee

uses a forklift in a place with dangerous goods.
g. The employer cannot exceed the biggest load that the forklift can bear for the operation, and the transported goods should be kept in a stable state to prevent overturning.

3.2 Forklift operation and inspection

Under national regulations, CLA requirements for forklift operation and inspection are as follows:

a. The employer should assign someone who has gone through special safety and health education and operator training to operate a forklift with over one metric ton of load. The employer ordering personnel to operate a forklift with more than one metric ton of load should ensure that she or he has completed an 18-hour special training course for safety and health.
b. The employer should periodically check the whole machine once every year. The brakes, steering mechanisms, control mechanisms, oil pressure equipment, warning devices, lights, governors, overload devices, guards, back rack and safety devices, lift and tilt mechanisms, articulating axle stops, and frame members should be carefully and regularly inspected and maintained in a safe condition, and checked at least once monthly.
c. Based on the Rule of Labor Safety Health Organization Management and Automatic Inspection (CLA, 2002), the CLA requires that industrial forklifts be inspected before being placed in service. They should not be placed in service if the examination shows any condition adversely affecting the safety of the vehicle. Such an examination should be made at least daily. When defects are found, they should be immediately reported and corrected.
d. Under all travel conditions, the forklift should be operated at a speed that will permit it to be brought safely to a stop.
e. The operator should slow down and sound the horn at cross aisles and other locations where vision is obstructed.
f. Unauthorized personnel should not be permitted to ride on forklifts. A safe place to ride should be provided and authorized.
g. An operator should avoid turning, if possible, and should be alert on grades, ramps, or inclines. Normally, the operator should travel straight up and down.
h. The operator of a forklift should stay with the truck if it tips over. The operator should hold on firmly and lean away from the point of impact.

In addition to the above regulations, employers and workers should follow the operator's manuals, which are supplied by all equipment manufacturers and describe the safe operation and maintenance of forklifts.

4 CONCLUSION AND RECOMMENDATIONS

The statistics on fatalities indicate that the three most common forklift-related fatalities involve overturns, workers being struck or crushed by forklifts, and workers falling from forklifts. The case studies of overturns indicate that the forklift, the operating environment, and the actions of the operator all contribute to fatal incidents related to forklifts. Furthermore, these fatalities reveal that many workers and employers are not using, or even may be unaware of, safety procedures, not to mention the proper use of forklifts to reduce the risk of injury and death. Reducing the risk of forklift incidents requires a safe work environment, a sound forklift, secure work practices, comprehensive worker training, and systematic traffic management. This study recommends that employers and workers comply with national regulations and consensus standards, and take the following measures to prevent injury when operating or working near forklifts, especially to avoid overturn accidents.

4.1 Worker training—employer event

a. Ensure that a worker does not operate a forklift unless she or he has been trained and licensed.
b. Develop, implement, and enforce a comprehensive written safety program that includes worker training, operator licensing, and a timetable for reviewing and revising the program. A comprehensive training program is important for preventing injury and death. Operator training should address factors that affect the stability of a forklift—such as the weight and symmetry of the load, the speed at which the forklift is traveling, the operating surface, tire pressure, driving behavior, and the like.
c. Train operators to handle asymmetrical loads when their work includes such an activity.
d. Inform laborers that when a forklift overturns, jumping out of the cockpit could possibly lead to being crushed by the mast, the roof board, the back rack, the guard rail or the fuselage equipment. In an overturn, the laborer should remain in the cockpit and incline his or her body in the direction opposite to that in which the forklift will turn over.

4.2 Forklift inspection, maintenance and lifting—employer event

a. Establish a vehicle inspection and maintenance program.
b. Ensure that operators use only an approved lifting cage and adhere to general safety practices for

elevating personnel with a forklift. Also, secure the platform to the lifting carriage or forks.
c. Provide a means for personnel on the platform to shut the power off whenever the forklift is equipped with vertical-only or vertical and horizontal controls for lifting personnel.
d. When work is being performed from an elevated platform, a restraining means such as rails, chains, and so on, should be in place, or a safety belt with a lanyard or deceleration device should be worn by the person on the platform.

4.3 Workers near forklifts—employer event

a. Separate forklift traffic from other workers where possible.
b. Limit some aisles to workers either on foot or by forklifts.
c. Restrict the use of forklifts near time clocks, break rooms, cafeterias, and main exits, particularly when the flow of workers on foot is at a peak (such as at the end of a shift or during breaks).
d. Install physical barriers where practical to ensure that workstations are isolated from aisles traveled by forklifts.
e. Evaluate intersections and other blind corners to determine whether overhead dome mirrors could improve the visibility of forklift operators or workers on foot.
f. Make every effort to alert workers when a forklift is nearby. Use horns, audible backup alarms, and flashing lights to warn workers and other forklift operators in the area. Flashing lights are especially important in areas where the ambient noise level is high.

4.4 Work environment—employer event

a. Ensure that workplace safety inspections are routinely conducted by a person who can identify hazards and conditions that are dangerous to workers. Hazards include obstructions in the aisle, blind corners and intersections, and forklifts that come too close to workers on foot. The person who conducts the inspections should have the authority to implement prompt corrective measures.
b. Enforce safe driving practices, such as obeying speed limits, stopping at stop signs, and slowing down and blowing the horn at intersections.
c. Repair and maintain cracks, crumbling edges, and other defects on loading docks, aisles, and other operating surfaces.

4.5 Workers—labor event

a. Do not operate a forklift unless trained and licensed.
b. Use seat belts if they are available.
c. Do not jump from an overturning forklift. Stay there, hold on firmly and lean in the opposite direction of the overturn, if a lateral tip-over occurs.
d. Use extreme caution on grades, ramps, or inclines. In general, the operator should travel only straight up and down.
e. Do not raise or lower the forks while the forklift is moving.
f. Do not handle loads that are heavier than the weight capacity of the forklift.
g. Operate the forklift at a speed that will permit it to be stopped safely.
h. Look toward the path of travel and keep a clear view of it.
i. Do not allow passengers to ride on a forklift unless a seat is provided.
j. When dismounting from a forklift, always set the parking brake, lower the forks, and turn the power off to neutralize the controls.
k. Do not use a forklift to elevate a worker who is standing on the forks.
l. Whenever a truck is used to elevate personnel, secure the elevating platform to the lifting carriage or forks of the forklift.

ACKNOWLEDGMENTS

The authors are deeply grateful to the Institute of Occupational Safety and Health, Council of Labor Affairs, Executive Yuan, Taiwan, for supplying related data.

REFERENCES

http://www.iosh.gov.tw/frame.htm, 2008; http://www.cla.gov.tw, 2008.
U.S. National Institute for Occupational Safety and Health, 2001. "NIOSH alert: preventing injuries and deaths of workers who operate or work near forklifts", NIOSH Publication Number 2001-109.
Collins, J.W., Landen, D.D., Kisner, S.M., Johnston, J.J., Chin, S.F. & Kennedy, R.D., 1999. "Fatal occupational injuries associated with forklifts, United States, 1980–1994", American Journal of Industrial Medicine, Vol. 36, 504–512.
Yang, Z.Z., 2006. Forklift accidents analysis and prevention, Industry Safety Technology, 26–30.
The Rule of Labor Safety and Health Facilities, 2007. Chapter 5, Council of Labor Affairs, Executive Yuan, Taipei, Taiwan, ROC.
The Machine Tool Protection Standard, 2001. Chapter 5, Council of Labor Affairs, Executive Yuan, Taipei, Taiwan, ROC.
The Rule of Labor Safety Health Organization Management and Automatic Inspection, 2002. Chapters 4 and 5, Council of Labor Affairs, Executive Yuan, Taipei, Taiwan, ROC.


Formal modelling of incidents and accidents as a means for enriching


training material for satellite control operations

S. Basnyat, P. Palanque & R. Bernhaupt


IHCS-IRIT, University Paul Sabatier, Toulouse, France

E. Poupart
Products and Ground Systems Directorate/Generic Ground Systems Office, CNES, Toulouse, France

ABSTRACT: This paper presents a model-based approach for improving the training of satellite control room
operators. By identifying hazardous system states and potential scenarios leading to those states, suggestions
are made highlighting the required focus of training material. Our approach is grounded on current knowledge
in the field of interactive systems modeling and barrier analysis. Its application is shown on a satellite control
room incident.

1 INTRODUCTION

Preventing incidents and accidents from recurring is a way of improving the safety and reliability of safety-critical systems. When iterations of the development process can be rapid (as, for instance, for most web applications), the system can be easily modified and redeployed, integrating behavioural changes that would prevent the same incident or accident from recurring. When the development process is more resource-consuming, for instance through the addition of certification phases and the need to abide by standards, the design and implementation of barriers (Hollnagel 2004) is considered. Previous research (Basnyat et al. 2007) proposes the specification and integration of barriers into existing systems in order to prevent undesired consequences. Such barriers are designed so that they can be considered as patches over an already existing and deployed system. These two aforementioned approaches are potentially complementary (typically, one would be preferred to the other depending on the severity of the failures or incidents that occurred), putting the system at the centre of the preoccupations of the developers.

Another, less common possibility is to adopt an operator-centred view and to act on operator behaviour, leaving the system as it is. This approach is contradictory to typical Human-Computer Interaction philosophies (which promote user-centred design approaches aiming to produce systems adapted to the tasks and knowledge of the operators) but may be necessary if the system is hard to modify (too expensive, too time-consuming, unreachable . . .). Norman (Norman 1990) argues that operator training must be supported by the development of 'appropriate' displays. While this is true, an appropriate interface to a safety-critical system may not deter an incident or accident from occurring (as shown in this paper) if operator training does not accentuate critical actions.

The most efficient way to effectively alter operator behaviour is through training. However, the design of the training material faces the same constraints as the design of the system itself, i.e. usability, functionality coverage, task-based structuring, etc. are critical to make training efficient. Indeed, training operators with the wrong information, or designing training material that only offers partial coverage of available system functions, will certainly lead to further failures.

We have defined a formal description technique dedicated to the specification of safety-critical interactive systems. Our initial attempt to deal with incidents and accidents was by means of reducing development time to a minimum, making it possible to iterate quickly and thus modify an existing model of the system in a RAD (Rapid Application Development) way (Navarre et al. 2001). Less formal approaches to informing operator training have been presented in (Shang et al. 2006), in which the authors use 3D technology and scenarios as a means of improving process safety.

As shown in Table 1, certain categories of systems cannot accommodate iterative re-designs and thus require the definition and integration of barriers on top of the existing degraded system. In (Schupp et al. 2006) and (Basnyat and Palanque 2006), we presented how the formal description technique could be applied to the description of system and human

Table 1. Categorization of types of users and systems.

Identification of users | Types of users | Examples of systems | Training availability | Response to failure | Barrier implementation
Anonymous | General public | Walk-up-and-use | None | Software/hardware patch | Technical barrier
Many identifiable | General public | Microsoft Office | Online documentation only—training offered by third parties | Software/hardware + online documentation patch | Technical barrier
Few identifiable | Specialists | Software development tools | Optional—in house and third parties | Software/hardware + documentation patch | Technical barrier
Few identifiable | Pilots | Aircraft cockpits | Trained—in house only | Software/hardware/documentation patch & training | Socio-technical barrier
Very few identifiable | Satellite operators | Satellite control room | Trained—in house only | Software/hardware patch (unlikely), training | Human barrier
barriers, and how the model-level integration allows one to assess, a priori, the adequacy and efficiency of the barrier.

The current paper focuses on a new problem corresponding to the highlighted lower row of Table 1. It corresponds to a research project in collaboration with CNES (the French Space Agency) and deals with command and control systems for spacecraft ground segments. In such systems, modifications to the spacecraft are extremely rare and limited in scope, which significantly changes incident and accident management with respect to systems we have formally worked on, such as Air Traffic Control workstations (Palanque et al. 1997) or aircraft cockpit applications (Barboni et al. 2007), (Navarre et al. 2004). For the space systems we are currently considering, only the ground system is partly modifiable while the embedded system remains mostly unreachable.

To address the issue of managing incidents and accidents in the space domain we promote a formal methods approach focussing on operators. This paper presents an approach that shows how incidents and accidents can be prevented from recurring by defining adequate training material to alter operator behaviour. The paper illustrates how to exploit a graphical formal description technique called Interactive Cooperative Objects (ICOs) (Navarre et al. 2003), based on high-level Petri nets, for describing training material and operators' tasks. Using such a formal method ensures knowledge of complete coverage of system states as a means of informing training material. When we talk about complete coverage of system states, or unambiguous descriptions, we refer to the software part of the global system, which alone can be considered an improvement over what is currently achieved with formal methods for describing the states and events that are at the core of any interactive system. Verification techniques can then be used to confront such models with the system model of the subsection of the ground segment and the model of a subsection of the spacecraft in order to assess their compatibility.

Modifications to an error-prone version of the training material (both the actual document and its ICO model) are performed, so that operator behaviour that led to the incident/accident is substituted by safer behaviour. As for the work on barriers, the formal description technique allows for verifying that the training material covers all the system states, including the ones known to lead to the accident/incident. We will provide examples of models based on an example of an operational procedure in a spacecraft control room. The approach provides a systematic way to deal with the necessary increase of reliability of the operations of safety-critical interactive systems.

The approach uses the following foundations. A system model is produced, using the ICO notation, representing the behaviour and all possible states of the currently unreliable system on which an incident was identified. Using formal analysis techniques, such as marking graphs for Petri nets, an extraction of all possible scenarios leading to the same hazardous state in which the incident occurred is performed. This ensures that not only the actual scenario is identified, but any additional scenarios as well. A barrier analysis is then performed, in order to identify which technical, human and/or socio-technical barriers must be implemented in order to prevent such a scenario from reoccurring. This, however, depends on whether or not the system is reachable and modifiable. In the case of technical barriers, the system model is modified and the same hazardous scenarios are run on the improved system model as a means of verification. The human barriers, based on identification of hazardous states, must be realised via modifications to training material and selective training.

The following section presents related work in the field of model-based training. We then present the case study and an incident in section 3, followed by our training approach to safety in section 4. Section 5 briefly presents the ICO formalism and the Petshop support tool used for the formal description of the interactive system. We present in section 6 system

models relating to the reported incident and discuss how these models can be used to inform training material in section 7. We then conclude in the final section.

2 RELATED WORK ON MODEL-BASED TRAINING

Though there are few industrial methodological model-based training systems, this section is dedicated to theoretical approaches and two established industrial frameworks. Related fields of research that have dedicated annual international conferences are Intelligent Tutoring Systems (ITS), Knowledge Based Systems, Artificial Intelligence, etc.

Such model-based approaches focus on a wide range of models, including cognitive processes, the physical device, user tasks, knowledge, domain, etc. It appears that the system models, a key factor in determining hazardous scenarios and states, are often omitted or under-specified. This will be exemplified in the following paragraphs.

The Volpe Center (National Transportation Systems Centre) has designed and implemented training programs to satisfy many different types of needs, including systems use training, workforce skills training, awareness training, change management training, and mission critical training.

Although modeling operator knowledge does not guarantee the safety of an application (Johnson 1997), Chris Johnson argues that epistemic logics (a form of textual representation of operator knowledge) can be recruited to represent knowledge requirements for safety-critical interfaces. Similarly to our approach, the formalism highlights hazardous situations but does not suggest how to represent this on the interface to inform and support the operator. Interface design, based on usability principles, design guidelines, Fitts' law (Fitts 1954), etc. is a discipline in its own right and should be applied after system analysis.

2.1 Petri-net based methods

In the introduction, Petri nets were advocated as a means of formal graphical representation of system behavior. Within the literature are several Petri net-based methods for training purposes.

Kontogiannis (Kontogiannis 2005) uses coloured Petri nets for integrating task and cognitive models in order to analyse how operators process information, make decisions or cope with suspended tasks and errors. The aim is to enhance safety at work. While the approach is based on formal methods, there is no proposed means of verifying that the identified tasks and workload are supported by the intended system, as no system model is presented.

Figure 1. Example of a training scenario (from (Lin 2001)).

While Kontogiannis targets safety and productivity in process control, Lin (Lin 2001) presents a Petri-net based approach, again focusing on task models, for dealing with online training systems, showing that modelled training tasks and training scenario analysis can facilitate creating a knowledge base of online training systems. Training plans (TP) and Training scenarios (TS) are proposed. The TS-nets are separated into four areas (see Figure 1):

– Simulation area: for simulating the online training environment
– Interface area: for trainee-system interaction
– Evaluation sub-net: for monitoring and recording trainee performance
– Instruction area (the focus of the research)

Elizalde (Elizalde et al. 2006) presents an Intelligent Assistant for Operator's Training (IAOT). The approach, based on Petri nets, takes a derived action plan indicating the correct sequence of actions necessary to reach an optimum operation and translates it into a Petri net. The Petri net is used to follow the operator's actions and provide advice in the case of a deviation from recommended actions. What is different in this approach is that the recommended actions are not user-centred, as with task modeling approaches, but system-centred, as designed by a "decision-theoretic planner".

2.2 Frameworks

Within the Model Based Industrial Training (MOBIT) project, Keith Brown (Brown 1999) developed the Model-Based Adaptive Training framework (MOBAT). The framework consists of a set of specification and realisation methods for model-based intelligent training agents in changing industrial training situations. After a specification of training problem requirements, a detailed analysis splits into three separate areas for: (a) a specification of the tasks a trainee is expected

to learn; (b) the specification of expertise and (c) a specification of the target trainees.

The MOBAT framework includes multiple models, such as the task model, cognitive model, physical model and domain model. The closest representation of a system model in the framework (the physical model, modeled using specific notations) does not provide explicit representation of all the system states or of the human-computer interaction. It does, however, provide the facility for a user to request information relating to past, present and future states (Khan et al. 1998).

A second model-based framework dedicated to safety-critical domains is NASA's Man Machine Design and Analysis System (MIDAS). MIDAS is "a 3D rapid prototyping human performance modeling and simulation environment that facilitates the design, visualization, and computational evaluation of complex man-machine system concepts in simulated operational environments" (NASA 2005). The framework has several embedded models including an anthropometric model of the human figure, visual perception, Updatable World Representation (UWR), decisions, task loading, mission activities. . .

While the related works discussed in this section focus on model-based training, they do not specifically target post-incident improvements. For this reason, we argue that an explicit system model is required as a means of identifying hazardous states, and scenarios leading to those states, for informing training material with the intention of improving operator awareness.

3 CASE STUDY: SENDING A HAZARDOUS TELECOMMAND

Figure 2. Typical application interface for sending telecommands.

Figure 3. CNES Octave application interface for sending telecommands.

Satellites and spacecraft are monitored and controlled via ground segment applications in control centres, with which satellite operators implement operational procedures. A procedure contains instructions such as send telecommand, check telemeasure, wait, etc. Figure 2 and Figure 3 provide illustrations of typical applications for managing and sending telecommands (TC) to spacecraft. To give the reader an idea, the figures have been separated into sections for explanatory purposes. In Figure 2, section 1 indicates the current view type, i.e. the commander, the monitor, the support engineer, etc., as well as the toolbar. Section 2 gives the configuration of the current telecommand and the mode that the operator must be in, in order to send that telecommand. Section 3 shows the history of the current procedure and the current mode (operational or preparation). Section 4 details the active stack (current procedure).

The second example (Figure 3) is of Octave, a Portable, Distributed, Opened Platform for Interoperable Monitoring Services (Cortiade and Cros 2008), (Pipo and Cros 2006), developed by SSII-CS (French Software Engineering Services Company), which is to be used with future satellites such as JASON-2 and MeghaTropique. It has similar interface sections to the previous example: 1) the toolbar, 2) the information and triggering area (detailed information about the TC and a triggering button for TCs), 3) the current procedure and 4) the history.

Certain procedures contain "telecommands that may have potential catastrophic consequences or critical consequences, e.g. loss of mission, negative incidence on the mission (e.g. halt of a maneuver) and/or on the vehicle (e.g. switch-off of an equipment)". These telecommands need an individual confirmation by the telecommand operator. This classification is performed at flight segment level and related commands are adequately flagged ("catastrophic") in the MDB (mission database). These hazardous telecommands have a specific presentation (for example a different color) in the local stack display.
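This flag-driven handling can be sketched as a small lookup: the hazard flag set in the mission database determines how many confirmation steps the ground application must demand (two clicks for a normal TC, three for a hazardous one, as detailed in the next section). The field names and command names below are invented for illustration; they are not taken from the actual MDB.

```python
from dataclasses import dataclass

@dataclass
class Telecommand:
    name: str
    hazardous: bool  # the "catastrophic" flag set at flight segment level

# Hypothetical extract of the MDB (mission database); entries are invented.
MDB = {
    "TC_SWITCH_HEATER": Telecommand("TC_SWITCH_HEATER", hazardous=False),
    "TC_HALT_MANEUVER": Telecommand("TC_HALT_MANEUVER", hazardous=True),
}

def confirmation_clicks(tc_name):
    """Hazardous TCs require an extra individual confirmation click."""
    return 3 if MDB[tc_name].hazardous else 2

print(confirmation_clicks("TC_HALT_MANEUVER"))  # → 3
```

The same flag would also drive the specific presentation (e.g. a different color) of the command in the local stack display.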

Figure 4. Hazardous telecommand icon.

Figure 5. Example of an intercom system.

The sending of a hazardous TC requires 3 mouse clicks (as opposed to 2 for non-hazardous telecommands). "With the exception of the Red Button CAM command, all operator initiated commands involving functions that could lead to catastrophic consequences shall be three step operations with feedback from the function initiator after each step prior to the acceptance of the command." That requirement is extracted from a specification document at CNES, not referenced for confidentiality reasons.

Before presenting the incident, we describe here the correct behaviour for sending a hazardous telecommand, based on the example Monitoring and Control application in Figure 2. The Commanding operator highlights the TC to be sent and clicks on the send button. The operator should note that the icon inline with the TC has a yellow "!" indicating that the TC is classified as hazardous (see lower line of Figure 4). After clicking send, a Telecommand Inspection window appears detailing the TC's parameters. The Commanding operator verifies that the parameters appear to be correct and clicks on the send TC button. Following this 2nd click, a red window called Confirmation appears in the top left-hand corner of the application. At this point, the Commanding operator must use a dedicated channel on the intercom system to request confirmation from the Flight Director before sending the hazardous TC. The communication is similar to "FLIGHT from COMMANDER on OPS EXEC: The next TC is hazardous, do you confirm its sending?" and the reply would be "COMMANDER from FLIGHT on OPS EXEC: GO for the TC which performs x, y, z. . .". After receiving the go-ahead, the Commander can send the TC.

3.1 The incident

The incident described in this section occurred during the test phase of a satellite development. The Commanding operator sent a TC without receiving confirmation from the Flight Director. There was no impact on the satellite involved in the tests.

The circumstances surrounding the incident were the following. Several minutes prior to the incident, 45 hazardous TCs were sent with a global go-ahead from the Flight Director. The Commander was thus clicking numerous times to send the hazardous TCs. The Commander was distracted by a technical conversation regarding a forthcoming operation; his vigilance (see research on situation awareness (Lee 2005 and Eiff 1999)) was therefore reduced. The Commander was involved in this conversation because there were only 2 team operators at that time, as opposed to the 3 or 4 that would have been necessary for the workload. Two additional operators were holding technical discussions at less than a meter from the Commander, resulting in perturbation of the auditory environment.

A parallel can be drawn between the reduced number of operators in the test phase of the case study and the situation in ACC Zurich during the Ueberlingen accident (Johnson 2006). Our approach does not attempt to encompass such aspects as we mainly focus on the computer part of such a complex socio-technical system and, even more restrictively, its software part. As opposed to current research in the field of computer science, we take into account user-related aspects of the computer system including input devices, interaction and visualisation techniques.

4 TRAINING APPROACH TO SAFETY

The introduction and related work sections argued for the use of formal methods, and more particularly model-based design, for supporting the training of operators as a means of encompassing all possible system states and identifying hazardous states. Furthermore, it has been identified (Table 1) that in response to an incident, such a satellite control room application can often only be made safer via training (a human barrier), as making modifications to the ground system can be a lengthy procedure after satellite deployment. We propose, in the following two paragraphs,

two model-based approaches for improving safety
following the hazardous TC incident reported in
section 3.1.

4.1 Changing operator behaviour without system modification
This approach assumes that modifications to the sys-
tem are not possible. Therefore, changes to operator
behavior are necessary. Using a model representing the
behavior of the system, it is possible to clearly iden-
tify the hazardous state within which operators must
be wary and intentionally avoid the known problem.
The system model will also indicate the pre-conditions
(system and/or external events) that must be met in
order to reach a particular hazardous state providing
an indicator of which actions the operator must not
perform.
The advantages of this human-barrier approach lie in the fact that no modifications to the system are necessary (no technical barriers required) and that it is efficient if only several users are involved (as within a control centre). We would argue that this is not an ideal solution to the problem, but that the company would have to make do with it as a last resort if the system is inaccessible.
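The identification of a hazardous state and of the preconditions leading to it, described above, can be sketched for a Petri-net-style system model as a breadth-first search over the marking graph, collecting every firing sequence that reaches a marking containing the hazardous place. The net below is a deliberately tiny, hypothetical stand-in for the real ICO model; all place and transition names are our own illustration.

```python
from collections import deque

# A 1-safe Petri net: transition name -> (input places, output places).
# A marking is the frozenset of currently marked places.
TRANSITIONS = {
    "send_click_1": ({"ready"}, {"inspection"}),
    "send_click_2": ({"inspection"}, {"confirmation"}),
    "request_go_ahead": ({"confirmation"}, {"request_pending"}),
    "send_click_3": ({"confirmation"}, {"tc_sent_unconfirmed"}),  # hazardous path
}

def fireable(marking, t):
    pre, _ = TRANSITIONS[t]
    return pre <= marking  # all input places marked

def fire(marking, t):
    pre, post = TRANSITIONS[t]
    return frozenset((marking - pre) | post)

def scenarios_to(initial, hazardous_place):
    """Breadth-first walk of the marking graph, returning every firing
    sequence that reaches a marking containing the hazardous place."""
    found, seen = [], set()
    queue = deque([(frozenset(initial), ())])
    while queue:
        marking, path = queue.popleft()
        if hazardous_place in marking:
            found.append(path)
            continue
        if marking in seen:
            continue
        seen.add(marking)
        for t in TRANSITIONS:
            if fireable(marking, t):
                queue.append((fire(marking, t), path + (t,)))
    return found

print(scenarios_to({"ready"}, "tc_sent_unconfirmed"))
# → [('send_click_1', 'send_click_2', 'send_click_3')]
```

Each returned sequence is one scenario an operator could follow into the hazardous state; the transitions on those paths indicate the actions that training (or a technical barrier) must address.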

4.2 Changing operator behaviour with system modification

Figure 6. Approach to formal modelling of incidents and accidents as a means for enriching training material.

The second proposed approach assumes that the ground system is modifiable. In the same way as before, the hazardous state is identified using the system model and preconditions are identified after running scenarios on the system model. A barrier analysis and implementation is then performed (Basnyat, Palanque, Schupp, and Wright 2007). This implies localised changes (not spread throughout the entire system) to the system behavior. The existing training material (if any) must then be updated to reflect the change in behaviour due to the technical barrier implementation. Similarly, localized changes to the training material are made from an identified hazardous system state.

While modifications to the ground segment may have an impact on the onboard system, this approach can be considered more reliable, as the system will block the identified hazardous situation. The operators would receive additional differential training (selective training, since the behaviour of the system before and after modifications is known) for the new behavior, but even if they do not remember, or slip into old routines, the improved system will block the actions. The point here is not to force the operator to avoid an action, but to accurately make them learn the new system behavior.

5 INTERACTIVE COOPERATIVE OBJECTS & PETSHOP

The Interactive Cooperative Objects (ICOs) formalism is a formal description technique dedicated to the specification of interactive systems (Bastide et al. 2000). It uses concepts borrowed from the object-oriented approach (dynamic instantiation, classification, encapsulation, inheritance, client/server relationship) to describe the structural or static aspects of systems, and uses high-level Petri nets (Genrich 1991) to describe their dynamic or behavioural aspects.

An ICO specification fully describes the potential interactions that users may have with the application. The specification encompasses both the "input" aspects of the interaction (i.e. how user actions impact

on the inner state of the application, and which actions are enabled at any given time) and its "output" aspects (i.e. when and how the application displays information relevant to the user).

An ICO specification is fully executable, which gives the possibility to prototype and test an application before it is fully implemented (Navarre et al. 2000). The specification can also be validated using analysis and proof tools developed within the Petri nets community. In subsequent sections, we use the following symbols (see Figure 7):

– States of the system are represented by the distribution of tokens into places
– Actions triggered in an autonomous way by the system are called transitions
– Actions triggered by users are represented by half-bordered transitions

Figure 7. An ICO model for procedure management.

6 SYSTEM MODEL OF PROCEDURE BEHAVIOUR

Using the ICO formalism and its CASE tool, Petshop (Navarre, Palanque, and Bastide 2003), a model describing the behavior of a ground segment application for the monitoring and control of a non-realistic procedure, including the sending of a hazardous telecommand, has been designed. Figure 7 illustrates this ICO model. It can be seen that as the complexity of the system increases, so does the Petri net model. This is the reason why the ICO notation that we use involves various communication mechanisms between models, to enable designers to break down complexity into several communicating models. This feature of the ICO notation has not been exploited in this paper, as explanation through one single (even fairly illegible) model is a better choice for explanatory purposes, especially as on paper a model cannot be manipulated. The diagram has been segregated for explanatory purposes. Part A represents the behavior of an available list of instructions contained within the procedure. In this case they are CheckTM (check telemeasure), Switch (an operator choice), Wait, Warning and TC (telecommand). These instructions can be seen globally in part B, read from left to right. Parts C and D (the focus of the incident) describe the behaviour for the sending of a TC (normal and hazardous). For brevity, these are the only sections of the model that will be discussed in detail.

The model is such that the operator may select any instruction before selecting the instruction he wishes to implement. However, the instructions must be carried out in a given order, defined within the procedure and correctly restricted by the application. While the operator is effectuating an instruction (usually on a separate window superimposed on the list of instructions), he cannot click on or select another instruction until his current instruction is terminated. This behavior of instruction order restriction is represented with a place OrderX after the terminating click of each instruction, a transition Next/Ok.

The ICO tool, Petshop, provides a novel feature with respect to common Petri nets, called "virtual places", to increase the "explainability" of models (by reducing the number of crossing arcs in the net).
A virtual place is a partial copy of a normal place. It adopts the display properties (such as markings) but not the arc connections. The arcs can be connected to normal places or to virtual places. The display is therefore modified, allowing easier reorganisation of models (see (Barboni et al. 2006)). In the ICO procedure model, there are 3 uses of the virtual place: Instruction_List, a place containing tokens representing the available instructions; Block_Selection, a place restricting the operator from selecting an instruction while manipulating another; and Past_Instructions, a place containing the tokens representing instructions that have terminated.

Referring back to the incident described, we are particularly interested in the sending of a hazardous telecommand. Once the operator has terminated all of the instructions preceding the telecommand instruction, transition TC becomes fireable (see upper transition of Figure 8) and parts C and D of the model become the focus of the simulation.

Figure 8. Send TC section of complete model (part C of Figure 7).

6.1 Sending a telecommand

When transition Tc is fired, place ReadyToSendTc receives a token. From here, the application allows the operator to click SEND (the red button next to number 2 in Figure 2). This is represented by transition SendTc1. After this external event has been received, place Display_TC_inspection receives a token. In this state, the application displays a pop-up window and the operator can either click SEND or CANCEL. In the model, transitions SendTc2_1, SendTc2_2 and Cancel2 become fireable. The two Send transitions identify the value of the token (x) currently in place Display_TC_inspection, to determine whether it is a normal TC or a hazardous TC. The code is "x.equals("tc")" and "x.equals("tchaza")" in transitions SendTc2_1 and SendTc2_2 respectively. If a normal TC is being treated, only transitions SendTc2_1 and Cancel2 will be fireable. A click on the SendTC button (transition SendTC_1) will terminate the instruction, sending a token to place Past_instructions. If the operator clicks CANCEL (transition Cancel2), the TC is returned to the instruction list ready for reimplementation.

6.2 Sending a hazardous telecommand

If the TC (token) is of type hazardous, only transitions SendTc2_2 and Cancel2 will be fireable. Respecting the requirements, a 3rd confirmation click is required. After transition SendTc2_2 is fired, place Display_Confirmation receives a token. The key to the incident described in section 3.1 lies in this state.

Before clicking for the 3rd time to send the hazardous TC, the operator should verbally request the Go-Ahead from the Flight Director. The application modelled allows the operator to click SEND (or CANCEL) without the Go-Ahead. It is therefore up to the operator to recall the fact that dialogue must be initiated.

Figure 9. Request go-ahead (part D of Figure 7).

Part D in Figure 7 represents the potential scenarios available: if transition DialogWithFD is fired, a token is received in place RequestPending (representing the operator initiating dialogue with the Flight Director). This part of the complete model is shown in Figure 9. For improved legibility, the virtual places representing Block_selection and Past_instructions have been removed from the diagram. For simulation purposes, the transition DialogWithFD is represented using the autonomous type of transition even though the event is external to the system behaviour. The action does

not involve the operator clicking a button on the interface. Rather, this action can be considered as a human cognitive task and subsequently a physical interaction with an independent communication system (see Figure 5).

Tokens in places RequestPending and FlightDirector allow transition RequestGoAheadFD to be fireable. This transition contains the statement y=fd.requestGoAhead(x). It is important to note that the data for the token in place FlightDirector would come from a separate model representing the behaviour of the Flight Director (his interactions with control room members etc). Once place Result receives a token, 4 transitions become fireable: Go, WaitTimeAndGo, WaitDataAndGo and NoGo, representing the possible responses from the Flight Director. Again, these transitions (apart from the one involving time) are represented using autonomous transitions. While they are external events, they are not events existing in the modelled application (i.e. interface buttons); these are verbal communications. In the case of each result, the operator can still click on both SEND and FAIL on the interface, transitions SendTc3_x and Cancel3_x in Figure 9. We provide here the explanation of one scenario. If the token (x) from the FlightDirector place contains the data "WaitTime", then transition WaitTimeAndGo (containing the statement y=="WaitTime") is fireable, taking the value of "y".

Within the scenario in which the operator must wait for data (from an external source) before clicking SEND, the model contains references to an external service (small arrows on places SIP_getData, SOP_getData and SEP_getData, the Service Input, Output and Exception Ports respectively). When in state WaitingForData, a second model, representing the service getData, would interact with the current procedure model.

7 MODEL-BASED SUPPORT FOR TRAINING

The aim of the approach presented in this paper is to use a model-based representation of current system behaviour (in this case after an incident) to support training of operators in order to improve reliability in a satellite control room.

The analysis and modelling of the monitoring and control application, including operator interactions, and the reported incident enable us to identify a hazardous state: after transition SendTc2_2 is fired (2nd confirmation of a hazardous TC), place Display_Confirmation contains a token. The following sub-sections propose improvements to the modelled application assuming system modifications cannot, and subsequently can, be made. Each suggestion includes a barrier classification according to Hollnagel's barrier systems and barrier functions (Hollnagel 1999).

7.1 Without system modifications

If the application cannot be modified, training should focus on operator behaviour between the 2nd and 3rd click for sending a hazardous TC. Section 6.2 identifies the process of establishing dialogue with the Flight Director as being potentially completely excluded from the operational procedure. This is why part D of Figure 7 is intentionally segregated.

Training of operators concerning this behaviour should explicitly highlight the fact that although they can click the SEND button without actually receiving the Go-Ahead from the Flight Director, it is imperative to request the Go-Ahead. A "go-around" and potential support technique would be to use a paper checklist. This suggestion is an immaterial barrier system with barrier functions Monitoring and Prescribing.

7.2 With system modifications

Assuming the application is modifiable, an improvement in system reliability could be achieved using several technical (software) barriers.

– Support or replace the current verbal Go-Ahead with an exchange of data between the two parties. Functional barrier system with hindering barrier function.
– Modify the application so that the Go-Ahead data is requested and sent via interaction with the interface; the packet would contain information necessary for the Flight Director to make a decision. Functional barrier system with soft preventing barrier function.
– Change the 3rd SEND button behaviour, so that it is disabled pending receipt of the Go-Ahead packet (including timer and additional data if necessary). Functional barrier system with soft preventing barrier function.
– Although the application analysed has a very visible red window providing the 3rd SEND button (as demanded in the system requirements), it does not include any form of reminder to the operator that dialogue with the Flight Director must be established. The 3rd pop-up window should thus display text informing the opera-
Table 1 highlights that the only likely response to tor of the data exchange process. Symbolic bar-
a failure in a satellite control room is to modify oper- rier system with regulating and indicating barrier
ator training. The following two subsections provide functions.
8 CONCLUSIONS AND FURTHER WORK

We have presented a model-based approach for identifying human-computer interaction hazards and their related mitigating human (training) and technical (software) barriers, classified according to Hollnagel's barrier systems and barrier functions (Hollnagel 1999).
The approach, applied to a satellite ground segment control room incident, aids in identifying barriers such as symbolic and immaterial barriers, necessary in the case where the degraded system is not modifiable. These barriers may not have been considered within the development process, but have been identified using the system model.
We have used the ICO formal description technique (a Petri net based formalism dedicated to the modelling of interactive systems) to model the monitoring and control of an operational procedure using a satellite ground segment application. The model provides an explicit representation of all possible system states, taking into account human interactions and internal and external events, in order to accurately identify hazardous states and the scenarios leading to those states via human-computer interactions.

ACKNOWLEDGEMENTS

This research was financed by the CNES R&T Tortuga project, R-S08/BS-0003-029.

REFERENCES

Barboni, E., D. Navarre, P. Palanque, and S. Basnyat. 2007. ''A Formal Description Technique for the Behavioural Description of Interactive Applications Compliant with ARINC Specification 661.'' Hotel Costa da Caparica, Lisbon, Portugal.
Barboni, E., D. Navarre, P. Palanque, and S. Basnyat. 2006. ''Addressing Issues Raised by the Exploitation of Formal Specification Techniques for Interactive Cockpit Applications.''
Basnyat, S., and P. Palanque. 2006. ''A Barrier-based Approach for the Design of Safety Critical Interactive Application.'' In Guedes Soares & Zio (eds). Estoril, Portugal: Taylor & Francis Group.
Basnyat, S., P. Palanque, B. Schupp, and P. Wright. 2007. ''Formal socio-technical barrier modelling for safety-critical interactive systems design.'' Safety Science, Special Issue on Safety in Design 45:545–565.
Bastide, R., O. Sy, P. Palanque, and D. Navarre. 2000. ''Formal specification of CORBA services: experience and lessons learned.'' ACM Press.
Brown, Keith. 1999. ''MOBIT: a model-based framework for intelligent training.'' pp. 6/1–6/4, invited paper at the IEEE Colloquium on AI.
Cortiade, E., and P.A. Cros. 2008. ''OCTAVE: a data model-driven Monitoring and Control system in accordance with emerging CCSDS standards such as XTCE and SM&C architecture.'' SpaceOps 2008, 12–16 May 2008, Heidelberg, Germany.
Eiff, G.M. 1999. ''Organizational safety culture.'' In R.S. Jensen (ed.), Tenth International Symposium on Aviation Psychology, pp. 778–783. Columbus, OH: The Ohio State University.
Elizalde, Francisco, Enrique Sucar, and Pablo deBuen. 2006. ''An Intelligent Assistant for Training of Power Plant Operators.'' pp. 205–207 in Proceedings of the Sixth IEEE International Conference on Advanced Learning Technologies. IEEE Computer Society.
Fitts, P.M. 1954. ''The Information Capacity of the Human Motor System in Controlling the Amplitude of Movement.''
Genrich, H.J. 1991. ''Predicate/Transition Nets.'' pp. 3–43 in High-Level Petri Nets: Theory and Application. Springer Verlag.
Hollnagel, E. 1999. ''Accidents and barriers.'' pp. 175–180 in J.M. Hoc, P. Millot, E. Hollnagel & P.C. Cacciabue (eds.), Proceedings CSAPC'99. Villeneuve d'Ascq, France: Presses Universitaires de Valenciennes.
Hollnagel, E. 2004. Barriers and Accident Prevention. Ashgate.
Johnson, C.W. 1997. ''Beyond Belief: Representing Knowledge Requirements For The Operation of Safety-Critical Interfaces.'' pp. 315–322 in Proceedings of the IFIP TC13 International Conference on Human-Computer Interaction. Chapman & Hall, Ltd. http://portal.acm.org/citation.cfm?id=647403.723503 (Accessed February 26, 2008).
Johnson, C.W. 2006. ''Understanding the Interaction Between Safety Management and the 'Can Do' Attitude in Air Traffic Management: Modelling the Causes of the Ueberlingen Mid-Air Collision.'' In F. Reuzeau and K. Corker (eds.), Proceedings of Human-Computer Interaction in Aerospace 2006, Seattle, USA, 20–22 September 2006, pp. 105–113. Cepadues Editions, Toulouse, France. ISBN 285428-748-7.
Khan, Paul, Brown, and Leitch. 1998. ''Model-Based Explanations in Simulation-Based Training.'' Intelligent Tutoring Systems. http://dx.doi.org/10.1007/3-540-68716-5_7 (Accessed February 20, 2008).
Kontogiannis, Tom. 2005. ''Integration of task networks and cognitive user models using coloured Petri nets and its application to job design for safety and productivity.'' Cognition, Technology & Work 7:241–261.
Lee, Jang R., Richard O. Fanjoy, and Brian G. Dillman. ''The Effects of Safety Information on Aeronautical Decision Making.'' Journal of Air Transportation.
Lin, Fuhua. 2001. ''Modeling online instruction knowledge using Petri nets.'' pp. 212–215, vol. 1, in Communications, Computers and Signal Processing, 2001 (PACRIM 2001), IEEE Pacific Rim Conference on.
NASA. 2005. ''Man-machine Integration Design and Analysis System (MIDAS).'' http://human-factors.arc.nasa.gov/dev/www-midas/index.html (Accessed February 19, 2008).
Navarre, D., P. Palanque, and R. Bastide. 2004. ''A Formal Description Technique for the Behavioural Description of Interactive Applications Compliant with ARINC 661 Specification.''
Navarre, D., P. Palanque, and R. Bastide. 2003. ''A Tool-Supported Design Framework for Safety Critical Interactive Systems.'' Interacting with Computers 15/3:309–328.
Navarre, D., P. Palanque, R. Bastide, and O. Sy. 2001. ''A Model-Based Tool for Interactive Prototyping of Highly Interactive Applications.''
Navarre, D., P. Palanque, R. Bastide, and O. Sy. 2000. ''Structuring Interactive Systems Specifications for Executability and Prototypability.'' pp. 97–109 in Lecture Notes in Computer Science, vol. 1946. Springer.
Norman, D.A. 1990. ''The 'Problem' with Automation: Inappropriate Feedback and Interaction, not 'Over-Automation'.'' Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences 327:585–593.
Palanque, P., R. Bastide, and F. Paterno. 1997. ''Formal Specification as a Tool for Objective Assessment of Safety-Critical Interactive Systems.'' Sydney, Australia: Chapman et Hall.
Pipo, C., and C.A. Cros. 2006. ''Octave: A Portable, Distributed, Opened Platform for Interoperable Monitoring Services.'' Rome, Italy.
Schupp, B., S. Basnyat, P. Palanque, and P. Wright. 2006. ''A Barrier-Approach to Inform Model-Based Design of Safety-Critical Interactive Systems.''
Shang, P.W.H., L. Chung, K. Vezzadini, K. Loupos, and W. Hoekstra. 2006. ''Integrating VR and knowledge-based technologies to facilitate the development of operator training systems and scenarios to improve process safety.'' In Guedes Soares & Zio (eds). Estoril, Portugal: Taylor & Francis Group.
Hazard factors analysis in regional traffic records

M. Mlynczak & J. Sipa
Wroclaw University of Technology, Poland
ABSTRACT: Road safety depends mainly on traffic intensity and on the condition of the road infrastructure. More precisely, one may find many more factors influencing safety in road transportation, but the process of their perception, processing and interpretation by the driver while driving is usually difficult and in many cases falls behind reality. The Diagnostic Aided Driving System (DADS) is a concept for a supporting system which provides the driver with selected, most useful information and thereby reduces risk due to the environment, weather conditions and the statistics of past hazardous events on a given road segment. The paper presents the part of the project that aims to describe relations and dependencies among the different traffic conditions influencing the number of road accidents.

1 FOREWORD

Road traffic is defined as the movement of road vehicles, bicycles and pedestrians. The most important feature of traffic is the effective and safe transportation of passengers and goods. Traffic takes place in a complex system consisting of road infrastructure, vehicle operators, vehicles and weather conditions. Each of these factors is complex and undergoes systematic and random changes. Safety is provided in that system by means of soft measures (law, regulations and local rules) and hard measures (traffic signals, barriers, etc.). In this light, an accident is a violation of a certain safety condition. In the MORT framework (Wixson 1992) it might be understood as the release of some energy stored in the transportation system.
Analysis of accident reports, databases and papers creates an image of the necessary and sufficient conditions leading to an undesired road event.

2 ACCIDENT DATABASE REVIEW

2.1 Accident description

The Polish road rescue system is based on the fire service. In an emergency a suitable database record is created. The database described in this paper concerns the south-western region of Poland, Lower Silesia. The period contained in the work-sheets covers the years 1999–2005. From thousands of records, information dealing with road accidents was selected.
The most important element of an accident description is its identification in time and road space. An accident description always includes the same type of information.
Undesired road events are described and stored in reports created by rescue services, consisting mainly of fire brigade and police divisions. Precise accident identification requires recording the following facts:

• exact date and time of the event (day time, week day, month, season),
• place of the event (number of lanes and traffic directions, compact/dispersed development, straight/bend road segment, dangerous turn, dangerous slope, top of the road elevation, road crossing region (of equal importance or subordinated), roundabout, pedestrian crossing, public transport stop, tram crossing, railway sub grade, railway crossing unguarded/guarded, bridge, viaduct, trestle bridge, pavement, pedestrian way, bike lane, shoulder, median strip, road switch-over, property exit),
• severity of consequences (collision, crash, catastrophe),
• collision mode (collision in motion [front, side, rear], run over [a pedestrian, stopped car, tree, lamp-post, gate arm, pot-hole/hump/hole, animal], vehicle turn-over, accident with passenger, other),
• cause/accident culprit (driver, pedestrian, passenger, other reasons, complicity of traffic actors),
• accidents caused by a pedestrian (standing or lying in the road, walking in the wrong lane, crossing on a red light, careless road entrance [in front of a driving car; from behind a car or obstacle], wrong road crossing [stopping, going back, running], crossing at a forbidden place, passing along a railway, jumping into a vehicle in motion, children under 7 years old [playing in the street, running into the street], other),
• cause/vehicle maneuver (not adjusting speed to traffic conditions, not observing regulations, not giving way, wrong [overtaking, passing by, passing,
driving over a pedestrian crossing, turning, stopping, parking, reversing], driving in the wrong lane, driving through a red light, not observing other signs and signals, not maintaining a safe distance between vehicles, rapid braking, driving without the required lights, tiredness, falling asleep, decreased psychomotor efficiency, other),
• driver gender and experience,
• age (age interval: 0–6, 7–14, 15–17, 18–24, 25–39, 40–59, 60 and over),
• culprit under the influence of alcohol (driver, pedestrian, passenger, other, complicity of traffic actors),
• weather conditions (sunshine, clouds, wind, rain/snow, fog/smog),
• type of vehicle (bicycle, moped/scooter, motorbike, car, truck, bus, farm tractor, horse-drawn vehicle, tram, train, other),
• vehicle fault (self damage, damage of the other vehicle),
• event consequences (environment [soil, water, air], road, road infrastructure [road facilities, buildings], cargo [own, foreign]),
• necessary actions (first-aid service, road rescue, chemical neutralization, fire extinguishing, rebuilding).

2.2 Analysis of accident conditions

General analysis of the database is directed at the main factors and attendant circumstances. Major accident consequences are caused proportionally by cars (over 80% of injured and victims), but almost 30% of losses result from accidents involving trucks and road machines (tractors, combine harvesters and the like) (Fig. 1).

Figure 1. Accident consequences fraction caused by different type of vehicle.

The worst consequences of an accident for human life are usually death or bodily harm. Over the entire observation period the average ratio of victims per accident is 0,22 and lies between 0,35 and 0,07. The average number of injured is about 1,8 per accident (Fig. 2).

Figure 2. Ratio of number of people injured or victims in total number of accidents.

The analysis of accidents taking place within day time is shown in Fig. 3. It is seen that over 87% of accidents happen between 6 a.m. and 10 p.m., with the highest frequency between 8 a.m. and 6 p.m. Within the observed period, the frequency and distribution of accidents over the hours of the day is stable and does not change much despite the rising motorization index in Poland. The average rate of traffic flow along the entire analyzed road segment of motorway A-4 is presented in Fig. 4. It is shown that since the year 2001 the volume of traffic has been growing year by year, mainly due to economic development.
A similar analysis of the distribution of accidents in time was done in relation to months of the year (Fig. 5) and week days (Fig. 6).

3 MODELLING OF ROAD EVENT NUMBER

3.1 Regression models

The statistical number of accidents along certain road segments is a random number depending on the many factors described above. Randomness means in this case that the number of expected accidents may vary according to random events and is correlated with some other factors. Regression models allow taking into consideration several factors influencing the accident number. Variation of the accident number is described as pure random and systematic variation depending on some accident factors (Baruya 1999, Evans 2005).
Two main regression models exist, which are generally based on the Poisson distribution (Fricker & Whitford 2004). The Poisson model is adequate for random events (the accident number on a given road segment) provided that randomness is independent for all variables taken into account (Evans 2005, Kutz 2004, Miaou & Lum 1993).
The Poisson regression function of the accident number is:

λ = e^(β0 + β1·x1 + β2·x2 + ··· + βk·xk)    (1)
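Equation (1) can be evaluated directly once the coefficients are known. A minimal sketch (the function name and any coefficient values used with it are illustrative, not estimates from the paper):

```python
import math

# Equation (1): expected event count lambda as a log-linear function of
# the independent variables x_i with regression coefficients beta_i.
def poisson_mean(beta0, betas, xs):
    """Return lambda = exp(beta0 + sum_i beta_i * x_i)."""
    return math.exp(beta0 + sum(b * x for b, x in zip(betas, xs)))

# With all x_i = 0 the expected count reduces to exp(beta0).
```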
where: βi = coefficients of regression, xi = independent variables.
The Poisson distribution assumes that the variance is equal to the mean number of event occurrences, so that the mean and variance may be taken as a measure of statistical confidence. If the variance is calculated as in the formula σ = λ(1 + θ·λ), and θ (the overdispersion parameter) is not significantly different from 0, then regression analysis based on the Poisson model is appropriate. Otherwise the Pascal model (2) should be considered, because it assumes randomness of the λ parameter itself (Michener & Tighe 1999).

λ = e^(β0 + β1·x1 + β2·x2 + ··· + βk·xk + ξ)    (2)

where: ξ is the Gamma distributed error term.

Figure 3. Frequency of accidents in day time.

Figure 4. Traffic volume in analyzed motorway segment in years 1999–2005.

Figure 5. Distribution of accidents and victims in months over the period 1999–2005.

Figure 6. Distribution of accidents in week days over the period 1999–2005.

3.2 Application of regression models to the database

Regression analysis of road events uses the database of the Provincial Command of the State Fire Service. From that database a motorway segment of total length l = 131 km was selected, and from the period 1999–2005 about 1015 accidents were extracted.
The motorway segment is divided into 11 subsegments of given length li = x1 and daily traffic volume tvi = x2. The analysis also takes into account the following causal variables:

– hvi = x3 —heavy vehicle rate in traffic flow [%],
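The model-choice rule above (keep the Poisson model if θ is close to 0, otherwise use the Pascal model) can be illustrated with a crude moment-based estimate obtained by solving σ² = λ(1 + θ·λ) for θ using the sample mean and variance of observed counts. This is a sketch under a simplifying assumption (a common mean across observations), not the significance test used in the paper:

```python
# Moment-based estimate of the overdispersion parameter theta in
# Var = mean * (1 + theta * mean); a rough check of whether a Poisson
# model is adequate for a sample of event counts.

def estimate_theta(counts):
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)  # sample variance
    # Solve var = mean * (1 + theta * mean) for theta.
    return (var - mean) / (mean ** 2)

def poisson_adequate(counts, tol=0.05):
    """Poisson is kept when theta is not materially different from 0."""
    return abs(estimate_theta(counts)) <= tol
```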
– ji = x4 —number of all junctions (exits and approaches of the motorway),
– rji = x5 —number of road junctions of all types,
– rni = x6 —number of road junctions with national roads,
– rci = x7 —number of road junctions with communal roads,
– rli = x8 —number of road junctions with local roads,
– rpi = x9 —number of exits and approaches of parking places,
– rmi = x10 —time fraction of the year with modernization road works on the subsection (value from 0 to 1).

The accident database and construction data obtained from the company managing the motorway allowed for building a general regression model in the following form (3):

λi = li^β1 · tvi^β2 · e^(β0 + Σ(j>2) βj·xji)    (3)

Table 1. Regression models for all road events. Columns give the intercept β0 and the coefficients of: ln l (length of road segment), ln tv (daily traffic volume), hv (heavy vehicle rate), j (number of all junctions), rj (number of road junctions), rn (road junctions with national roads), rc (road junctions with communal roads), rl (road junctions with local roads), rp (exits to parking) and rm (modernization).

Model   β0       ln l   ln tv   hv      j       rj      rn      rc      rl      rp      rm
1       –7,06    1,65   0,56    –       –       –       –       –       –       –       –
2       –7,23    1,55   0,89    –0,09   –       –       –       –       –       –       –
3       –9,22    1,49   0,82    –       0,01*   –       –       –       –       –       –
4       –5,19    2,25   0,60    –0,11   –0,05   –       –       –       –       –       –
5       –6,75    1,68   0,83    –0,09   –       –0,02*  –       –       –       –       –
6       –2,33    2,45   0,22*   –0,09   –       –       0,05*   –0,16   –0,13   –       –
7       –4,56    2,60   0,43*   –0,10   –       –       –0,02*  –0,08*  –0,07   –0,09*  –
8       –9,35    2,16   0,90    –0,05   –0,05   –       –       –       –       –       –0,77
9       –9,94    2,15   0,99    –0,06   –       –0,09   –       –       –       –       –0,86
10      –9,83    2,26   0,95    –0,06   –       –       –0,07   –0,10   –0,10   –       –0,85
11      –11,49   2,36   1,12    –0,06   –       –       –0,12   –0,04   –0,06*  –0,06*  –0,80
12      –9,91    1,59   1,06    –0,05   –       –       –       –       –       –       –0,65

* Insignificant parameters (printed in italics in the original) are marked with an asterisk.

Table 2. Significance analysis of selected regression models (multiple correlation coefficients, model deviance MD with its p-value, and the θ parameter with its p-value).

Model   R2     R2p    R2w    R2wp   R2ft   R2ftp   MD       p      θ       p
8       0,72   0,88   0,69   0,87   0,69   0,87    261,26   10−5   0,034   0,1936
9       0,74   0,90   0,70   0,88   0,70   0,90    267,63   10−5   0,027   0,2617
10      0,76   0,92   0,72   0,91   0,72   0,92    269,76   10−5   0,020   0,3763
11      0,78   0,95   0,75   0,94   0,75   0,96    269,83   10−5   0,011   0,6067
12      0,70   0,85   0,67   0,84   0,67   0,85    248,61   10−5   0,053   0,1054

Table 3. Comparison of regression models for number of road events, accidents, injured and fatalities.

Model                   β0          ln l    ln tv    hv       j        rj   rn       rc       rl       rp       rm
Number of road events   0,0001      2,26    0,953    −0,056   –        –    −0,071   −0,099   −0,1     –        −0,854
Number of accidents     0,0004      2,769   0,681    −0,083   –        –    −0,128   −0,209   −0,152   0,053    −1,005
Number of injured       0,0028      3,759   0,18     −0,065   –        –    −0,104   −0,077   −0,093   −0,114   −1
Number of fatalities    2203,9357   5,885   −1,972   −0,055   −0,209   –    –        –        –        –        −1,718
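As an illustration of how formula (3) is evaluated, the sketch below plugs in the Table 1 coefficients of model 10 (β0 = −9,83; exponents 2,26 for length and 0,95 for traffic volume; exponential terms for hv, rn, rc, rl and rm). The function name and the treatment of hv as a fraction are assumptions made for the sketch:

```python
import math

# Formula (3) with the model 10 coefficients from Table 1:
# lambda = l^2.26 * tv^0.95 *
#          exp(-9.83 - 0.06*hv - 0.07*rn - 0.10*rc - 0.10*rl - 0.85*rm)
def expected_events(l_km, tv, hv=0.0, rn=0.0, rc=0.0, rl=0.0, rm=0.0):
    return (l_km ** 2.26) * (tv ** 0.95) * math.exp(
        -9.83 - 0.06 * hv - 0.07 * rn - 0.10 * rc - 0.10 * rl - 0.85 * rm)
```

Because the rm coefficient is large and negative, a year spent under modernization works (rm = 1) more than halves the predicted event count for the subsegment.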
Figure 7. Regression functions of number of road events, accidents, injured and fatalities in function of traffic volume (all remaining independent variables are constant). The shaded area relates to the range of traffic volume analyzed in the database.

Twelve regression models were created on the basis of formula (3) and the 10 independent variables described above. Table 1 presents the successive models, containing logical combinations of the variables x3 to x10. The regression functions were obtained and selected using the squared multiple correlation coefficient R, the weighted multiple correlation coefficient Rw and the Freeman-Tukey correlation coefficient Rft.
Five models, no. 8–12 (marked in shaded records), are considered models of good statistical significance. In those models the parameters β are estimated properly and the parameter θ is insignificantly different from 0. In the estimated models the number of accidents follows the Poisson distribution. The model significance analysis is shown in Table 2. Regression coefficients denoted with the index p deal with adjusted models that describe the portion of explained systematic randomness.
Probability values given in Table 2 for the parameter θ greater than 0,05 prove that θ is insignificant and certify the correct assumption of the Poisson rather than the Pascal model for the number of accidents.
The described calculation method and regression model gave the basis for evaluating the theoretical number of road events. There are four important events discussed in road safety: road events, accidents with human losses, accidents with injured, and fatalities. Using the known historical data, these events were analyzed and regression models created. In Table 3 the parameters of the regression models are put together to describe the numbers of the mentioned events. It is seen that some independent variables do not influence the calculated number, especially the number of fatalities.
In order to check the behavior of the obtained model, it has been proposed that all independent variables be held constant except traffic volume. This gives the regression function of the number of certain events in relation to traffic volume. It is assumed that: l = 10 km, hv = 0,25, j = 3, rj = 1, rn = 1, rc = 0, rl = 1, rp = 1, rm = 1. Fig. 7 shows the number of road events, number of accidents, number of injured and number of fatalities as a function of daily traffic flow.
These theoretical functions are valid in the interval of traffic flow 9000–24000 veh./24 hours (the shaded area), which is the range observed in the database in the period 1999–2005. The regression functions may be extended over a wider range of traffic flow to predict the number of events of interest in other conditions. The phenomenon of the decreasing number of fatalities with traffic flow may be explained by the fact that within the seven years of observation the average quality of cars has risen. Traffic flow is here strongly correlated with time. ''Better'' cars move faster and generate higher traffic flow, and on the other hand may raise traffic culture and provide a higher safety level through installed passive and active safety means. Estimation of the victim number is burdened with higher randomness, and proper and precise models have not been elaborated yet.

4 CONCLUSIONS

Two main subjects are discussed in the paper: theoretical modeling of traffic events and the elaboration of regression models on the basis of a real accident database. The proposed model consists in developing a prediction function for various road events. There were four types of road events concerned: all road events, accidents, injuries and fatalities. Verification of the assumption that the number of so-called rare events follows a Poisson distribution was done by comparing the elaborated real data with a Poisson model with parameter λ calculated as a regression function of ten independent variables. The conformity of five proposed models was checked by calculating the statistical significance of the parameter θ. Regression models applied to online collected data are seen to be a part of an active
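The Fig. 7 computation can be reproduced from Table 3: evaluate each regression function over a range of daily traffic volumes with the remaining variables held at the fixed values stated above. The sketch below uses the road-events and fatalities rows of Table 3 (β0 is the multiplicative constant as printed there); the function name and dictionary layout are assumptions:

```python
import math

# Table 3 coefficients: multiplicative beta0, exponents of l and tv,
# then the exponential terms for hv, j, rn, rc, rl, rp, rm.
COEFFS = {
    "road_events": dict(b0=0.0001, bl=2.26, btv=0.953, bhv=-0.056, bj=0.0,
                        brn=-0.071, brc=-0.099, brl=-0.10, brp=0.0, brm=-0.854),
    "fatalities": dict(b0=2203.9357, bl=5.885, btv=-1.972, bhv=-0.055,
                       bj=-0.209, brn=0.0, brc=0.0, brl=0.0, brp=0.0, brm=-1.718),
}

def predict(event, tv, l=10.0, hv=0.25, j=3.0, rn=1.0, rc=0.0, rl=1.0,
            rp=1.0, rm=1.0):
    """Expected yearly count for one event type at daily traffic volume tv."""
    c = COEFFS[event]
    return (c["b0"] * l ** c["bl"] * tv ** c["btv"]
            * math.exp(c["bhv"] * hv + c["bj"] * j + c["brn"] * rn
                       + c["brc"] * rc + c["brl"] * rl + c["brp"] * rp
                       + c["brm"] * rm))
```

Evaluating over 9000–24000 veh./24 h reproduces the qualitative shape of Fig. 7: the road-event curve rises with traffic volume (positive ln tv coefficient), while the fatality curve falls (negative ln tv coefficient).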
supporting system for drivers, warning them about rising accident risk on a given road segment in certain environmental conditions. Further development of the models would make it possible to use them in the road design process, especially in risk analysis and assessment.

REFERENCES

Baruya A. 1998. Speed-accident Relationships on European Roads. Transport Research Laboratory, U.K.
Evans A.W. 2003. Estimating Transport Fatality Risk From Past Accident Data. Accident Analysis & Prevention, no. 35.
Fricker J.D. & Whitford R.K. 2004. Fundamentals of Transportation Engineering. A Multimodal System Approach. Pearson, Prentice Hall.
Kutz M. (ed.). 2004. Handbook of Transportation Engineering. McGraw-Hill Co.
Miaou S.P. & Lum H. 1993. Modeling Vehicle Accidents And Highway Geometric Design Relationships. Accident Analysis & Prevention, no. 25.
Michener R. & Tighe C. 1999. A Poisson Regression Model of Highway Fatalities. The American Economic Review, vol. 82, no. 2, pp. 452–456.
Wixson J.R. 1992. The Development Of An Es&H Compliance Action Plan Using Management Oversight, Risk Tree Analysis And Function Analysis System Technique. SAVE Annual Proceedings.
Organizational analysis of availability: What are the lessons for a high risk industrial company?

M. Voirin, S. Pierlot & Y. Dien
EDF R&D—Industrial Risk Management Department, Clamart, France

ABSTRACT: EDF R&D has developed an organisational analysis method. This method, which was designed from in-depth examination of numerous industrial accidents, incidents and crises and from the main scholarly findings in the domain of safety, is to be applied for industrial safety purposes. After some thoughts regarding the (dis)connections between ''Safety'' and ''Availability'', this paper analyses to what extent this method could be used for availability-oriented event analysis.

1 INTRODUCTION

EDF Research & Development has for several years been running a ''sentinel system'' dedicated to the study of worldwide cross-sectorial industrial events. One purpose of this ''sentinel system'' is to define an investigation method allowing analysts to go beyond the direct and immediate causes of events, i.e. to find organizational in-depth causes.
Based upon (i) studies of more than twenty industrial accidents, incidents and crises carried out in the frame of this sentinel system (see Table 1), (ii) knowledge of more than fifty industrial disasters, (iii) analysis of investigation methods used for several events (by ''event'' we mean accident, incident or crisis; e.g. Cullen, 2000; Cullen, 2001; Columbia Accident Investigation Board, 2003; US Chemical Safety and Hazard Investigation Board, 2007) and findings of scholars in the domains of industrial safety and accidents (e.g. Turner, 1978; Perrow, 1984; Roberts, 1993; Sagan, 1993; Sagan, 1994; Reason, 1997; Vaughan, 1996; Vaughan, 1997; Vaughan, 1999; . . .), EDF R&D has drawn up a method for Organisational Analysis. This method is safety oriented and can be used for event investigation as well as for a diagnosis of the organisational safety of an industrial plant.
After a brief description of the main features of the method and some remarks about similarities and differences between ''safety'' and ''availability'', we will examine to what extent the method could also be used in the case of events dealing with availability (versus safety).

Table 1. Examples of events analyzed by EDF R&D.

Event

Texas City Explosion, March 23, 2005
Columbia Accident, February 1, 2003
Airplane Crashes above the Constance Lake, July 1, 2002
Crash of the Avianca Company's Boeing near JFK Airport, New York, January 25, 1990
Failures of NASA's Mars Climate Orbiter and Mars Polar Lander engines, 1999
Explosion of the Longford Gas Plant, Australia, September 25, 1998
Crisis at the Millstone NPP from 1995 to 1999
Corrosion of the Davis Besse's Vessel Head, March 2002
Tokaïmura reprocessing plant's accident, September 30, 1999
Fire in the Mont Blanc tunnel, March 24, 1999
Trains collision in the Paddington railway station (near London), October 5, 1999

2 MAIN FEATURES OF THE ORGANISATIONAL ANALYSIS METHOD

This section gives an overview of the method. More detailed information is available (Dien, 2006; Pierlot et al., 2007).
The method is grounded on an empirical basis (see Figure 1). But it is also based on scholarly findings, and more specifically on Reason's work.
Figure 1. General description of the method building (from Dien & Llory, 2006).

According to Reason (1997), the causes of an event are made up of three levels:

• The person (having carried out the unsafe acts, the errors);
• The workplace (local error-provoking conditions);
• The organization (organizational factors inducing the event).

The development of an event is ‘‘bottom-up’’, i.e. the direction of causality runs from the organizational factors to the person. In event analysis, the direction is the opposite. The starting point of the analysis is the direct and immediate causes of the bad outcome (the event). Then, step by step, the analysis considers, as far as possible, how and when defences failed.

In addition to the results obtained by scholars in the field of organisational studies, the real-event organisational analyses we have carried out allow us to define the three main dimensions of an organisational approach, helping to go from the direct causes to the root organisational causes (see figure 2):

• Historical dimension;
• Organizational network;
• ‘‘Vertical relationships’’ in the organization (from field operators to plant top management).

Figure 2. The three main axes of the organizational analysis.

We have to note that, although these dimensions are introduced independently, they interact, and an analysis has to deal with them in parallel (and in interaction).

2.1 Historical dimension

As Llory states (1998): ‘‘an accident does not start with the triggering of the final accidental sequence; therefore, the analysis requires going back in time [. . .]’’ in order to bring deterioration phenomena to the fore. The analysis has to ‘‘go upstream’’ in the history of the organisations involved in order to highlight significant malfunctioning aspects: what was not appreciated in real time has to ‘‘make sense’’ once the risk was confirmed (i.e. once the event had happened). Vaughan (2005) reminds us: ‘‘The O-ring erosion that caused the loss of Challenger and the foam debris problem that took Columbia out of the sky both had a long history.’’ Early warning signs have to be looked for and detected long before the event occurrence.

Numerous industrial events show that weaknesses of operational feedback can be incriminated in their occurrence, i.e. that previous relevant events were not taken into account, or were poorly treated after their occurrence. Analysts have to pay specific attention to the incidents, faults and malfunctions that occurred prior to the event.

Analysis of the ‘‘historical dimension’’ goes together with a detailed examination of the parameters and context variables which allow an understanding of events. It has to avoid ‘‘retrospective error’’. Fine knowledge of the event scenario (i.e. the sequence of actions and decisions which led to it) allows the actual mid- and long-term effects of each action and decision to be assessed. Analysts have to keep in mind that this evaluation is easier to make after the event than in real time. In other words, analysts have to avoid a blame approach.

2.2 Organizational network

Within an organisation, entities communicate (‘‘entity’’ means a part of the organisation of greater or lesser importance in terms of size and staffing; it can be a small group of people or even an isolated person, for instance a whistle-blower): entities exchange data, they make common decisions (or at least they discuss in order to make a decision), they collaborate, and so on. It is therefore of the first importance to ‘‘draw’’ the organisational network between the entities concerned in the event. This network is not the formal organisation chart of the entities. It is a tool for showing the numerous and complex interactions involved in the occurrence of the event. It is a guideline for carrying out the analysis, and it is built all along the analysis itself.

The organisational network is hardly ever defined once and for all for a given organisation. It is drafted according to the goals of the analysis. Parts of the organisation can be ignored because they were not involved in the event.

The organisational network allows the complexity of the functional relationships between entities to be visualised, and sometimes it highlights the absence of relationships which should have been present.
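As an illustration only (not part of the original method), the organisational network can be thought of as a small directed graph of entities and observed interactions, against which expected-but-absent relationships are flagged. Entity names and links below are hypothetical examples, not taken from any case study:

```python
# Toy sketch: an organisational network as a directed graph.
# All entity names and relationships are invented for illustration.

# Interactions actually observed (e.g. reconstructed from interviews
# and written documents): entity -> set of entities it communicates with
observed = {
    "field_operators": {"shift_manager"},
    "shift_manager": {"plant_management"},
    "engineering_division": {"plant_management"},
    "maintenance_team": set(),
}

# Relationships the formal organisation says should exist
expected = {
    ("maintenance_team", "field_operators"),
    ("engineering_division", "maintenance_team"),
    ("shift_manager", "plant_management"),
}

def missing_links(observed, expected):
    """Return expected relationships never seen in the observed network."""
    return sorted(
        (src, dst) for src, dst in expected
        if dst not in observed.get(src, set())
    )

for src, dst in missing_links(observed, expected):
    print(f"absent relationship: {src} -> {dst}")
```

Such a sketch only mirrors the visual tool the text describes: the analytic work of deciding which links are ‘‘expected’’ remains a matter of judgment.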

Table 2. The seven Pathogenic Organizational Factors (POFs).

Production pressures (rupture of the equilibrium between availability and safety, to the detriment of safety)
Difficulty in implementing experience feedback
Shortcomings of control bodies
Shortcomings in the safety organizational culture
Poor handling of organizational complexity
No re-examining of design hypotheses
Failure in daily safety management

2.3 ‘‘Vertical relationships’’ in the organization

This dimension is a part of the organisational network on which a specific focus is needed: it has to be studied in its own right because it helps to figure out the interactions between different hierarchical layers and between managers and experts.
It covers top-down and bottom-up communications. It is essential to isolate it, since it makes the interactions between the various management levels, experts and ‘‘field operators’’ easier to understand. We have to recall an obvious fact that is often forgotten during event analysis: an organisation is a hierarchical system.
The main interests of this dimension are the modes of relationship, of communication, of information flow and of co-operation between hierarchical layers. Real events show that the deterioration of these modes is a cause of their occurrence.
Lastly, thanks to this dimension, the causes of an event cannot be focused only on field operators.
Some other concepts have to be kept in mind while an organizational analysis is carried out, for instance: the ‘‘incubation period’’ of an event (Turner, 1978), the existence or not of ‘‘weak signals’’ prior to the occurrence of the event and connected to it (Vaughan, 1996), the existence or not of whistle-blower(s), and active and latent failures (Reason, 1997). Regarding these failures, analyses of events have shown that some organizational flaws (latent failures) were present and that they are part of the root causes of the event.

2.4 The Pathogenic Organizational Factors

Analyses have also shown some repetition in the organizational root causes of events. Following Reason's terminology (1997), we have defined these causes as ‘‘Pathogenic Organizational Factors’’ (POFs).
POFs cover a certain number of processes and phenomena in the organization with a negative impact on safety. They may be identified by their various effects within the organisation, which can be assimilated to symptoms. Their effects may be objective, or even quantifiable, or they may be subjective and thus only felt and suggested by employees. POFs represent the formalisation of repetitive, recurrent or exemplary phenomena detected in the occurrence of multiple events. These factors can be described as the hidden factors which lead to an accident.
A list of seven POFs was defined (Pierlot, 2007); Table 2 gives this list. It should not be considered definitive, since new case-study findings could extend it, even though it is relatively stable. The classification of the factors is somewhat arbitrary since, given the multiplicity and complexity of the event-causing phenomena, a factor in one situation may be deemed a consequence in another.
POFs are useful during event analysis, since they define goals and phenomena to look for.
Ultimately, the purpose of an organisational analysis is to understand how the organisation is working: it leads to (trying to) understand weaknesses and vulnerabilities, but also resilience aspects. Its goal is not necessarily to explain the event from an expert point of view, resulting in a list of (more or less numerous) direct causes leading to consequences (with, at the end, the fatal consequence). This approach leads, in addition to the collection and analysis of objective data, to taking employees' accounts into account. In order to collect ‘‘true’’ information, analysts have to adopt an empathic attitude towards the people met during the analysis (collecting ‘‘true’’ information does not mean taking what is said at face value: analysts have to interpret what is said according to the context, the role of the speaker(s), and so on; the information collected has to be cross-examined as well).

2.5 The Organizational Analysis Methodology

As stated above, this methodology has already been fully described (Pierlot et al., 2007; Voirin et al., 2007). Let us just recall here its main points:

• Organizational Analysis relies on background knowledge made up of case studies, scholars' findings and the definition of the Pathogenic Organizational Factors;
• Organizational Analysis explores the three main axes described above;
• Analysts use both written evidence and interviews. The interviews are performed under a comprehensive approach, in which not only the causes are searched for, but also the reasons, the intentions, the motivations and the reasoning of the people involved;
• Analysts have to bring a judgment to the studied situation, so that the ‘‘puzzle’’ of the situation can be rebuilt;

• The results of the Organizational Analysis are built with the help of a ‘‘thick description’’ (Geertz, 1998) and then summarized.

3 SAFETY AND AVAILABILITY: HOW CLOSE OR DIFFERENT ARE THESE TWO CONCEPTS?

Although the method described above was built for safety purposes, considering the potential impacts of the Pathogenic Organizational Factors on the safety of a given complex system, the first applications of this method within EDF were dedicated to availability events.
According to the CEI 60050-191 standard, system availability is the ability of a system to fulfil its expected mission, under given operating conditions, at a given time. Availability must not be confused with reliability, which has the same definition except that the time dimension is not a given instant but a time interval.
Therefore availability addresses two different issues: the system's capacity to avoid a damaging event and, if such an event does occur, the system's capacity to recover from it so that its initial performance is fully restored.
Safety can be described as the system's capacity to avoid catastrophic failures that are potentially hazardous for workers, for the environment or for the public (Llory & Dien 2006–2007).
At first view, safety and availability may then seem to be two unrelated characteristics. But this is only an impression. In fact, availability and safety are not independent.
This is because each piece of equipment in a complex high-risk system can deal only with availability, only with safety, or with both safety and availability (Voirin et al., 2007).
For instance, in a nuclear power plant there are systems dealing only with the plant's availability, such as the turbine-generator group, which produces electricity from the steam coming from the steam generators.
There are also systems dealing only with safety. These systems are not required to produce electricity and are normally not operating: their purpose is only to fulfil a safety action if needed. For example, the Safety Injection System of a nuclear power plant has the mission of maintaining the water inventory within the primary circuit, so that decay heat can still be evacuated even in the case of a water pipe break.
Finally, there are systems dealing with both safety and availability. For example, the steam generators of a nuclear power plant have an availability mission, because they produce the steam that makes the turbine-generator group work; but the steam generators also fulfil a safety mission, because they are part of the second barrier against radioactive material leakage and because their cooling capacity is necessary to deal with some accidental events.
All of these systems, however, can have an impact on both safety and availability. This is clear for the systems dealing with both, but, in case of failure, some of the systems dealing only with availability can clearly have an impact on safety, just as some of the systems dealing only with safety can clearly have an impact on availability (Voirin et al., 2007). For example, if a turbine trip occurs while the plant is operating at full power, the normal way of evacuating the energy produced by the nuclear fuel is no longer available without the help of other systems, and that may endanger safety even though the turbine is dedicated only to availability.
Safety and availability therefore have complex relationships: the fact that short- or medium-term availability is ensured does not mean that long-term availability or safety are ensured as well.
And we shall never forget that, before their accidents, the Chernobyl nuclear power plant, like the Bhopal chemical factory, had records of very good availability, even if, in the Bhopal case, these records were obtained with safety devices that had been out of order for many months (Voirin et al., 2007).

4 CAN AN ORGANIZATIONAL ANALYSIS BE CARRIED OUT FOR AN AVAILABILITY-ORIENTED EVENT?

Given this strong relationship between availability and safety, is it possible to perform an organisational analysis of an availability event in the same way that safety-oriented organisational analyses are performed?
We have already given a positive answer to that question (Voirin et al., 2007), even if there are noticeable differences.
The POFs were established from the study of many safety events (Pierlot et al., 2007); they rely on a strong empirical basis. However, the transposition of such safety-oriented POFs into availability-oriented Pathogenic Organisational Factors was performed only from a theoretical point of view, with the help of our knowledge of some case studies (Voirin et al., 2007).
The main point is that, during an availability organisational analysis, particular attention must be paid to the equilibrium between availability and safety, knowing that one of the safety-oriented POFs (production pressures) leads to the rupture of such a balance (Dien et al., 2006; Pierlot et al., 2007).
It is with such background knowledge that in 2007 we performed the organisational analysis of an event that occurred in a power plant. This event

was originally chosen because of its impact on the availability records of the plant.
We have checked these assumptions by carrying out an Organizational Analysis of a real case that occurred in a power plant. This is discussed in the following sections.

5 MAIN FINDINGS REGARDING . . .

5.1 . . . Organizational analysis method

5.1.1 The three main dimensions
The event we studied was chosen for availability reasons, but it appeared very early in the analysis that it also had a safety dimension. The event involved different technical teams of the power plant where it occurred, but also several engineering divisions.
The Organizational Analysis methodology requires three dimensions to be addressed: a historical one, a cross-functional one, and the ‘‘vertical relationships’’ in the organisation.

• The organizational network
We interviewed people working in each of the entities concerned and involved in the related event. We also succeeded in recovering official and unofficial data (e-mails, PowerPoint presentations) written by members of these entities.
The combination of the interviews and the study of the written documents allowed us to draw the organisational network (Pierlot et al., 2007), describing how all these entities really worked together on the issues related to this event: we were then able to address the cross-functional dimension.
• The historical dimension
It was easily possible to address the historical dimension and thus to obtain not only a chronology of related facts on a time scale, but also to draw a three-dimensional organisational network, made of the usual cross-functional plane and a time dimension, making it possible to bring out the evolution of this cross-functional dimension over the years before the event occurrence. In this sense, we could make ours the title of one chapter of the CAIB report, ‘‘History as Cause’’ (CAIB, 2003).
• The vertical dimension
The study of the third dimension, the vertical one, was more difficult to address. We can give several explanations for this fact.

The first one is the position of the managers. Since the purpose of an organisational analysis is to point out the organisational dysfunctions of a given organisation, and since one of a manager's missions is to make sure that the organisation works smoothly, he or she may have a ‘‘protective’’ reaction, which may explain why most of them were reluctant to give material to the analysts.
It can be pointed out that the few managers who agreed to be interviewed were mainly managing entities which could be seen, at first, in that case, as ‘‘victims’’. In that case, the managers appeared to be open because they thought that the organisational analysis was an opportunity to let the entire organisation know that they could not be blamed for what happened.
This defensive reaction of the managers could have been forecast. It can be overcome, bearing in mind that the analysts have to adopt an empathic attitude during the interviews, but also that they have to bring a judgment to the information given by the interviewees.
The second fact that can be brought out to explain the difficulty, in this case, of addressing the vertical dimension is the analysts' position. In this analysis, the analysts were close to the analysis sponsor. But, as analysts can be stopped in their investigation by implicit stop rules (Hopkins, 2003), and as the organisation culture and/or the analysts' interests can affect the analysis results, the closer to the organisation the analysts are, the more difficult it will be for them to address the vertical dimension.
However, these difficulties in addressing the vertical dimension are not blocking points: in this case study, even if this dimension was not treated to the extent it should have been, it was nevertheless explored sufficiently that, in the end, the analysts were able to give a ‘‘thick description’’ of the event, and then a synthesis, which were accepted by the people who ordered the study.

5.1.2 The thick description
The thick description (Geertz, 1998) appears to be an added value in the analysis process as well as in the restitution phase. It allows the analysts to formulate their assumptions clearly and to make sure that all the pieces of the puzzle fit the overall picture. In this way, the thick description is elaborated by a succession of trial and error, until the general sense of the studied issue is made clear to the analysts.
The first attempts to build this thick description were made not at the end of the analysis but during it. As a result, we can say that the analysis is over when the collection of new data no longer changes the thick description, when new data do not change the picture given in the description.
The thick description is a necessary step in the analysis process, but it is not the final result of the analysis. As Pierlot mentioned (Pierlot, 2006), a synthesis of this thick description needs to be established so that general lessons can be drawn from the analysis.
Going from the thick description to this synthesis requires the analysts to bring a judgment so that the

main conclusions of the analysis can be understood and accepted by the organisation concerned and by its managers.
In this sense, the writing of the synthesis is an important part of the analysis, and it may use, in a certain way, the same narrative techniques as ‘‘storytelling’’ approaches, although the goals pursued are dramatically different: to give a general sense to the collected data, in the case of the organisational analysis; to meet propaganda purposes, in the case of ‘‘pure’’ storytelling (Salmon, 2007).
During the building of the thick description and the writing of the synthesis, specific attention has to be paid to the status given to the collected data. As stated above, the data come from two different paths: written documents and specific information given by interviewees.

5.1.3 The status of the collected data
While the existence of written documents cannot be disputed, there can always be doubts about the quality and reliability of the information collected during interviews. To overcome this difficulty, a written report of each interview was established and submitted to the interviewees for approval. Each piece of specific information was also cross-checked, either against the collected written documents or against information gathered in other interviews.
But there is still the case where a specific piece of information can be neither confirmed nor denied by written documents or by other interviews. What should be done in that case? There is no generic answer. A ‘‘protective’’ choice would be never to consider such information. This would protect the analysts against the criticism of using unverified (and perhaps even false) data. However, such a solution could deprive the analysts of interesting and ‘‘valuable’’ material.
We believe, then, that the use of ‘‘single-source’’ subjective data relies only on the analysts' judgment. If the data fit the overall picture, if they fit the general idea that the analysts have formed of the situation, they can be used. However, if the data do not fit the frame of the analysis results, the best solution is to avoid using them.

5.2 . . . The POFs

We recall that the analysis performed was launched for availability reasons, but that it also had a safety dimension.
Let us also recall that most of the safety POFs (Pierlot et al., 2007) can also be seen, from a theoretical point of view, as availability-oriented Pathogenic Organisational Factors (Voirin et al., 2007).
The analysis performed confirms this theoretical approach: most of the safety-oriented POFs also have an impact on availability. In that way, we can say that we are dealing with POFs that have a potential impact on the overall results of a complex high-risk system, because these factors can lead to a decrease of the results on the safety scale or on the availability scale. This is more specifically the case for the factors ‘‘Poor handling of the organisational complexity’’ and ‘‘failures of the operational feedback system’’.
However, the analysis performed was focused on the occurrence of the event. It did not aim to look at the way the organisation recovered from the situation. If it had, we can make the realistic assumption that the safety-oriented POFs might not have been useful for studying this issue. It is thus possible that there are also pathogenic organisational factors describing the different ways an organisation can fail to recover from an availability (and perhaps safety) event.

6 CONCLUSIONS

The Organisational Analysis performed on the occurrence of an availability event confirms that the method designed for a safety purpose can be used for an event dealing with availability.
The three-main-dimensions approach (historical, cross-functional and vertical), the data collection and checking (written data as well as information given by interviewees) and the use of the thick description make it possible to perform the analysis of an availability event and to give a general sense to all the collected data.
Knowledge of the organisational analyses of many different safety events, and of the different safety-oriented Pathogenic Organisational Factors, is a required background for the analysts, so that they know what must be looked for and where, and which interpretations can be drawn from the collected data.
This case also confirms that the usual difficulties encountered during the organisational analysis of a safety event are also present in the organisational analysis of an availability event: the vertical dimension is more difficult to address, and the question of ‘‘single-source’’ data is a tough issue that deserves deeper thought.
The analysis performed also shows that most of the safety-oriented Pathogenic Organisational Factors can also be seen as availability-oriented Pathogenic Organisational Factors. However, these factors are focused only on the occurrence of the event; they are not intended to deal with the organisation's capacity to recover from the availability event. We believe that this particular point must also be studied more carefully.
The last point is that an availability event organisational analysis must be performed with a

close look at safety issues and, more specifically, at the balance between safety and availability. Here again, we believe that the complex relationships between availability and safety deserve a closer look, and that some research based on field studies should be carried out.
Eventually, this case reinforces what we foresaw, i.e. that if an organization is malfunctioning, the impacts can be either on safety or on availability (or both), and so a method designed and used for a safety purpose can also be used, to a large extent, for an availability purpose.

REFERENCES

Columbia Accident Investigation Board 2003. Columbia Accident Investigation Board. Report Volume 1.
Cullen, W.D. [Lord] 2000. The Ladbroke Grove Rail Inquiry, Part 1 Report. Norwich: HSE Books, Her Majesty's Stationery Office.
Cullen, W.D. [Lord] 2001. The Ladbroke Grove Rail Inquiry, Part 2 Report. Norwich: HSE Books, Her Majesty's Stationery Office.
Dien, Y. 2006. Les facteurs organisationnels des accidents industriels. In: Magne, L. et Vasseur, D. (Ed.), Risques industriels—Complexité, incertitude et décision: une approche interdisciplinaire, 133–174. Paris: Éditions TED & DOC, Lavoisier.
Dien, Y., Llory, M., Pierlot, S. 2006. Sécurité et performance: antagonisme ou harmonie? Ce que nous apprennent les accidents industriels. Proc. Congrès λμ15, Lille, October 2006.
Dien, Y., Llory, M. 2006. Méthode d'analyse et de diagnostic organisationnel de la sûreté. EDF R&D internal report.
Dien, Y., Llory, M. & Pierlot, S. 2007. L'accident à la raffinerie BP de Texas City (23 Mars 2005)—Analyse et première synthèse. EDF R&D internal report.
Geertz, C. 1998. La description épaisse. In Revue Enquête. La Description, Vol. 1, 73–105. Marseille: Editions Parenthèses.
Hollnagel, E., Woods, D.D. & Leveson, N.G. (Ed.) 2006. Resilience Engineering: Concepts and Precepts. Aldershot: Ashgate Publishing Limited.
Hopkins, A. 2003. Lessons from Longford. The Esso Gas Plant Explosion. CCH Australia Limited, Sydney, 7th Edition (1st edition 2000).
Llory, M. 1998. Ce que nous apprennent les accidents industriels. Revue Générale Nucléaire. Vol. 1, janvier–février, 63–68.
Llory, M., Dien, Y. 2006–2007. Les systèmes socio-techniques à risques : une nécessaire distinction entre fiabilité et sécurité. Performances. Issues No. 30 (Sept–Oct 2006); 31 (Nov–Dec 2006); 32 (Janv–Fév 2007).
Perrow, C. (ed.) 1984. Normal Accidents. Living with High-Risk Technology. New York: Basic Books.
Pierlot, S. 2006. Risques industriels et sécurité: les organisations en question. Proc. Premier Séminaire de Saint-André, 26–27 Septembre 2006, 19–35.
Pierlot, S., Dien, Y., Llory, M. 2007. From organizational factors to an organizational diagnosis of the safety. Proceedings, European Safety and Reliability Conference, T. Aven & J.E. Vinnem, Eds., Taylor and Francis Group, London, UK, Vol. 2, 1329–1335.
Reason, J. (Ed.) 1997. Managing the Risks of Organizational Accidents. Aldershot: Ashgate Publishing Limited.
Roberts, K. (Ed.) 1993. New Challenges to Understanding Organizations. New York: Macmillan Publishing Company.
Sagan, S. (Ed.) 1993. The Limits of Safety: Organizations, Accidents and Nuclear Weapons. Princeton: Princeton University Press.
Sagan, S. 1994. Toward a Political Theory of Organizational Reliability. Journal of Contingencies and Crisis Management, Vol. 2, No. 4: 228–240.
Salmon, C. (Ed.) 2007. Storytelling, la machine à fabriquer des histoires et à formater les esprits. Paris: Éditions La Découverte.
Turner, B. (Ed.) 1978. Man-Made Disasters. London: Wykeham Publications.
U.S. Chemical Safety and Hazard Investigation Board 2007. Investigation Report, Refinery Explosion and Fire, BP Texas City, Texas, March 23, 2005, Report No. 2005-04-I-TX.
Vaughan, D. (Ed.) 1996. The Challenger Launch Decision. Risky Technology, Culture, and Deviance at NASA. Chicago: The University of Chicago Press.
Vaughan, D. 1997. The Trickle-Down Effect: Policy Decisions, Risky Work, and the Challenger Tragedy. California Management Review, Vol. 39, No. 2, 80–102.
Vaughan, D. 1999. The Dark Side of Organizations: Mistake, Misconduct, and Disaster. Annual Review of Sociology, Vol. 25, 271–305.
Vaughan, D. 2005. System Effects: On Slippery Slopes, Repeating Negative Patterns, and Learning from Mistake. In: Starbuck, W., Farjoun, M. (Ed.), Organization at the Limit. Lessons from the Columbia Disaster. Oxford: Blackwell Publishing Ltd.
Voirin, M., Pierlot, S. & Llory, M. 2007. Availability organisational analysis: is it hazard for safety? Proc. 33rd ESREDA Seminar—Future Challenges of Accident Investigation, Ispra, 13–14 November 2007.


Thermal explosion analysis of methyl ethyl ketone peroxide by non-isothermal and isothermal calorimetry application

S.H. Wu
Graduate School of Engineering Science and Technology, National Yunlin University of Science and Technology,
Douliou, Yunlin, Taiwan, ROC

J.M. Tseng
Graduate School of Engineering Science and Technology, National Yunlin University of Science
and Technology, Douliou, Yunlin, Taiwan, ROC

C.M. Shu
Department of Safety, Health, and Environmental Engineering, National Yunlin University of Science
and Technology, Douliou, Yunlin, Taiwan, ROC

ABSTRACT: In the past, process accidents caused by Organic Peroxides (OPs), involving near misses, over-pressure, runaway reactions, thermal explosions, and so on, have occurred because of poor training, human error, incorrect kinetic assumptions, insufficient management of change, inadequate chemical knowledge, and so on, in the manufacturing process. Calorimetric techniques are broadly employed to test small-scale samples of organic peroxides in the laboratory because of their thermal hazards, such as exothermic behavior and self-accelerating decomposition. In essence, methyl ethyl ketone peroxide (MEKPO) is highly reactive and has an unstable exothermic character. In recent years it has been involved in many thermal explosions and runaway reaction accidents in the manufacturing process. Differential Scanning Calorimetry (DSC), the Vent Sizing Package 2 (VSP2), and a Thermal Activity Monitor (TAM) were employed to determine thermokinetic parameters and safety indices and to support various auto-alarm equipment, such as over-pressure, over-temperature, and hazardous material leak alarms, during the whole spectrum of operations. Results indicated that MEKPO begins to decompose at low temperature (30–40 °C) and that the decomposition develops exponentially. Time to Maximum Rate (TMR), self-accelerating decomposition temperature (SADT), maximum temperature (Tmax), exothermic onset temperature (T0), heat of decomposition (ΔHd), etc., are necessary and mandatory for industries to discover early-stage runaway reactions effectively.
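The runaway criterion invoked in this paper (Semenov's heat balance: runaway threatens when the heat generation rate exceeds the heat removal rate) can be illustrated with a minimal numerical sketch. All kinetic and cooling parameter values below are invented placeholders for illustration, not measured MEKPO data:

```python
# Illustrative Semenov-type heat balance. All parameter values are
# hypothetical placeholders, NOT measured MEKPO kinetics.
import math

R = 8.314          # gas constant, J/(mol K)
Ea = 1.2e5         # activation energy, J/mol (assumed)
A = 1.0e13         # pre-exponential factor, 1/s (assumed)
dH = 2.0e9         # decomposition energy per unit volume, J/m^3 (assumed)
UA_per_V = 60.0    # volumetric cooling coefficient, W/(m^3 K) (assumed)
T_amb = 298.0      # coolant/ambient temperature, K

def q_gen(T):
    """Volumetric heat generation rate (W/m^3), zero-order Arrhenius kinetics."""
    return dH * A * math.exp(-Ea / (R * T))

def q_rem(T):
    """Volumetric heat removal rate (W/m^3), Newtonian cooling."""
    return UA_per_V * (T - T_amb)

def runaway(T):
    """Semenov criterion: heat generation exceeds heat removal at temperature T."""
    return q_gen(T) > q_rem(T)

for T in (300.0, 320.0, 340.0, 360.0):
    print(f"T = {T:.0f} K: q_gen = {q_gen(T):.2e} W/m^3, "
          f"q_rem = {q_rem(T):.2e} W/m^3, runaway = {runaway(T)}")
```

Because the Arrhenius term grows exponentially with temperature while the cooling term grows only linearly, the balance tips abruptly once the generation curve crosses the removal line, which is the qualitative picture behind indices such as TMR and SADT.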

1 INTRODUCTION

The behavior of thermal explosions and runaway reactions has been widely studied for many years. By Semenov theory, a reactor with an exothermic reaction is susceptible to accumulating energy and temperature when the heat generation rate exceeds the heat removal rate (Semenov, 1984). Unsafe acts and behaviors by operators, such as poor training, human error, incorrect kinetic assumptions, insufficient management of change, inadequate chemical knowledge, etc., lead to runaway reactions, thermal explosions, and releases of toxic chemicals, as have sporadically occurred in industrial processes (Smith, 1982). Methyl ethyl ketone peroxide (MEKPO), cumene hydroperoxide (CHP), di-tert-butyl peroxide (DTBP), tert-butyl peroxide (BPO), hydrogen peroxide (H2O2), lauroyl peroxide (LPO), tert-butyl perbenzoate (TBPBZ), tert-butyl peroxybenzoate (TBPB), etc., are usually applied as initiators and cross-linking agents for polymerization reactions. One reason for accidents is the peroxy group (-O-O-) of organic peroxides, which is thermally unstable and highly sensitive to thermal sources. Calorimetric techniques are applied to evaluate the fundamental exothermic behavior of small-scale reactive materials during the research and development (R&D) stage. Many thermal explosions and runaway reactions have been caused globally by MEKPO, resulting in a large number of injuries and even deaths, as shown in Table 1 (Yeh et al., 2003; Chang et al., 2006; Tseng et al., 2006; Tseng et al., 2007; MHIDAS, 2006). The Major Hazard Incident Data

Table 1. Thermal explosion accidents caused by MEKPO globally.

Year        Nation      Frequency   I     F    Worst case
1953–1978   Japan       14          115   23   Tokyo
1980–2004   China       14          13    14   Honan
1984–2001   Taiwan      5           156   55   Taipei
2000        Korea       1           11    3    Yosu
1973–1986   Australia*  2           0     0    NA
1962        UK*         1           0     0    NA

I: Injuries; F: Fatalities.
* Data from MHIDAS (MHIDAS, 2006).
NA: Not applicable.

Figure 1 (a). The aftermath of the Yung-Hsin explosion, which devastated the entire plant, including all buildings within 100 m. Figure 1 (b). Reactor bursts incurred by thermal runaway.

Major Hazard Incident Data Service (MHIDAS) was employed to investigate three accidents for MEKPO in Australia and the UK (MHIDAS, 2006).

A thermal explosion and runaway reaction of MEKPO occurred at Taoyuan County (the so-called Yung-Hsin explosion) that killed 10 people and injured 47 in Taiwan in 1996. Figures 1 (a) and 1 (b) show the accident damage situation from the Institute of Occupational Safety and Health in Taiwan. Accident development was investigated by a report from the High Court in Taiwan. Unsafe actions (wrong dosing, dosing too rapidly, errors in operation, cooling failure) caused an exothermic reaction and the first thermal explosion of MEKPO.

Simultaneously, a great deal of explosion pressure led to the top of the tank bursting and the hot concrete being broken and shot to the 10-ton hydrogen peroxide (H2O2) storage tank (d = 1, h = 3). Under this circumstance, the 10-ton H2O2 storage tank incurred the second explosion and conflagration that caused 10 people to perish (including employees and fire fighters) and 47 injuries. Many plants near Yung-Hsin Co. were also affected by the conflagration caused by the H2O2 tank. H2O2, dimethyl phthalate (DMP), and methyl ethyl ketone (MEK) were applied to manufacture the MEKPO product. Figure 2 depicts the MEKPO manufacturing process flowchart. To prevent any further such casualties, the aim of this study was to simulate an emergency response process. Differential scanning calorimetry (DSC), vent sizing package 2 (VSP2), and thermal activity monitor (TAM) were employed to integrate thermal hazard development.

Due to MEKPO decomposing at low temperature (30–40°C) (Yuan et al., 2005; Fu et al., 2003) and exploding with exponential development, developing or creating an adequate emergency response procedure is very important. The safety parameters, such as temperature of no return (TNR), time to maximum rate (TMR), self-accelerating decomposition temperature (SADT), maximum temperature (Tmax), etc., are necessary and useful for studying emergency response procedures in terms of industrial applications. In view of loss prevention, an emergency response plan is mandatory and necessary for corporations to cope with reactive chemicals under upset scenarios.

Figure 2. Typical manufacturing process for MEKPO in Taiwan. (Flowchart: methyl ethyl ketone, dimethyl phthalate (DMP) and hydrogen peroxide (H2O2) are fed to a reactor; the product passes through crystallization with H3PO4, a cooling tank held below 10°C, a dehydration tank, and a drying tank, and then to the MEKPO storage tank.)

2 EXPERIMENTAL SETUP

2.1 Fundamental thermal detection by DSC

DSC has been employed widely for evaluating thermal hazards (Ando et al., 1991; ASTM, 1976) in various industries. It is easy to operate, gives quantitative results, and provides information on sensitivity (exothermic onset temperature, T0) and severity (heat of decomposition, ΔHd) at the same time.

Table 2. Thermokinetics and safety parameters of 31 mass% MEKPO and 20 mass% H2O2 by DSC under a heating rate (β) of 4°C min−1.

Sample    Mass (mg)  T1 (°C)  ΔHd1 (J g−1)  T2 (°C)  Tmax (°C)  ΔHd2 (J g−1)  T3 (°C)  Tmax (°C)  ΔHd3 (J g−1)  ΔHtotal (J g−1)
1 MEKPO   4.00       42       41            83       135        304           160      200        768           1,113
2 H2O2    2.47       67       –             –        100        395           –        –          –             395

DSC was applied to detect the fundamental exothermic behavior of 31 mass% MEKPO in DMP that was purchased directly from the Fluka Co. Density was measured and provided directly by the Fluka Co., ca. 1.025 g cm−3. It was, in turn, stored in a refrigerator at 4°C (Liao et al., 2006; Fessas et al., 2005; Miyakel et al., 2005; Marti et al., 2004; Sivapirakasam et al., 2004; Hou et al., 2001). DSC is regarded as a useful tool for the evaluation of thermal hazards and for the investigation of decomposition mechanisms of reactive chemicals if the experiments are carried out carefully.

2.2 Adiabatic tests by VSP2

VSP2, a PC-controlled adiabatic calorimeter manufactured by Fauske & Associates, Inc. (Wang et al., 2001), was applied to obtain thermokinetic and thermal hazard data, such as temperature and pressure traces versus time. The low heat capacity of the cell ensured that almost all the reaction heat that was released remained within the tested sample. Thermokinetic and pressure behavior could usually be tested in the same test cell (112 mL), without any difficult extrapolation to the process scale, due to a low thermal inertia factor (Φ) of about 1.05 and 1.32 (Chervin & Bodman, 2003). The low Φ allows for bench-scale simulation of the worst credible case, such as incorrect dosing, cooling failure, or external fire conditions. In addition, to avoid bursting the test cell and missing all the exothermic data, the VSP2 tests were run with a low concentration or a smaller amount of reactants. VSP2 was used to evaluate the essential thermokinetics for 20 mass% MEKPO and 20 mass% H2O2. The standard operating procedure was repeated in automatic heat-wait-search (H–W–S) mode.

2.3 Isothermal tests by TAM

Isothermal microcalorimetry (TAM) represents a range of products for thermal measurements manufactured by Thermometric in Sweden. Reactions can be investigated within 12–90°C, the working temperature range of this thermostat. The spear-head products are highly sensitive microcalorimeters for stability testing of various types of materials (19). Measurements were conducted isothermally in the temperature range from 60 to 80°C (The Isothermal Calorimetric Manual, 1998).

Figure 3. Heat flow vs. temperature of 31 mass% MEKPO and 20 mass% H2O2 under a 4°C min−1 heating rate by DSC.

3 RESULTS AND DISCUSSION

3.1 Thermal decomposition analysis of 31 mass% MEKPO and 20 mass% H2O2 for DSC

Table 2 summarizes the thermodynamic data by the DSC STARe program for the runaway assessment. MEKPO could decompose slowly at 30–32°C, as disclosed by our previous study for the monomer. We surveyed MEKPO decomposing at 30°C, as shown in Figure 3. We used various scanning rates by DSC to survey the initial decomposition circumstances. As a result, a quick temperature rise may cause violent initial decomposition (the first peak) of MEKPO under external fire conditions. Table 2 shows the emergency index, such as T0, ΔH, and Tmax, of 31 mass% MEKPO by DSC under a heating rate (β) of 4°C min−1. The initial decomposition peak usually releases little thermal energy, so it is often disregarded. The T2 of the main decomposition was about 83°C. The total heat of reaction (ΔHtotal) was about 1,113 J g−1. DSC was applied to evaluate the Ea and frequency factor (A). The Ea under the DSC dynamic test was about 168 kJ mol−1 and A was about 3.5 · 10^19 s−1.

Figure 3 displays the exothermic reaction of 20 mass% H2O2 under 4°C min−1 of β by the DSC. Due to H2O2 being a highly reactive chemical, operators must carefully control flow and temperature. H2O2 was controlled at 10°C when it precipitated the reaction.
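The reported dynamic-test kinetics (Ea ≈ 168 kJ mol−1, A ≈ 3.5 · 10^19 s−1) imply a very steep temperature dependence of the decomposition rate constant, which a short Arrhenius sketch makes concrete. Assuming a first-order rate law here is our simplification for illustration; the order is not stated in the text.

```python
import math

# Arrhenius rate constant k(T) = A * exp(-Ea / (R T)) with the kinetic
# parameters reported from the DSC dynamic test; first-order decomposition
# is assumed for illustration.
R = 8.314    # J mol-1 K-1
EA = 168e3   # J mol-1, from the DSC dynamic test
A = 3.5e19   # s-1, frequency factor from the DSC dynamic test

def k(temp_c):
    t = temp_c + 273.15  # convert to kelvin
    return A * math.exp(-EA / (R * t))

# Rate constant in refrigerated storage (4 C) vs. the ~83 C main
# decomposition region: the ratio shows why cold storage buys time.
ratio = k(83.0) / k(4.0)
print(f"{ratio:.1e}")
```

With this activation energy the rate constant grows by roughly seven orders of magnitude between 4°C and 83°C, consistent with the paper's emphasis on refrigerated storage and on rapid acceleration under external fire.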

Table 3. Thermokinetic parameters of MEKPO and H2O2 from adiabatic tests of VSP2.

Sample    T0 (°C)  Tmax (°C)  Pmax (psi)  (dT dt−1)max (°C min−1)  (dP dt−1)max (psi min−1)  ΔTad (°C)  Ea (kJ mol−1)
1 MEKPO   90       263        530         394.7                    140.0                     178        97.54
2 H2O2    80       158        841         7.8                      100.0                     78         NA

H2O2 exothermic decomposition hazards are shown in Table 2.

3.2 Thermal decomposition analysis for VSP2

Table 3 indicates T0, Tmax, Pmax, (dT dt−1)max, and (dP dt−1)max for 20 mass% MEKPO and H2O2 by VSP2. Figure 4 displays temperature versus time by VSP2 with 20 mass% MEKPO and H2O2. The pressure versus time by VSP2 under 20 mass% MEKPO and H2O2 is displayed in Figure 5. Twenty mass% MEKPO by VSP2 was more dangerous in Figure 4 because it began self-heating at 90°C and Tmax reached 263°C. H2O2 decomposed at 80°C and increased to 158°C. In parallel, H2O2 decomposed suddenly and was transferred to vapor pressure. Pmax was detected from Figure 5. Figures 6 and 7 show the rates of temperature and pressure rise from VSP2 experimental data for 20 mass% MEKPO and H2O2.

Figure 4. Temperature vs. time for thermal decomposition of 20 mass% MEKPO (Tmax = 263°C) and H2O2 (Tmax = 158°C) by VSP2.

Figure 5. Pressure vs. time for thermal decomposition of 20 mass% MEKPO (Pmax = 530 psi) and H2O2 (Pmax = 841 psi) by VSP2.

Figure 6. Dependence of rate of temperature rise on temperature from VSP2 experimental data for 20 mass% MEKPO ((dT dt−1)max = 395°C min−1) and H2O2 ((dT dt−1)max = 7°C min−1).

Figure 7. Dependence of rate of pressure rise on temperature from VSP2 experimental data for 20 mass% MEKPO and H2O2.
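Adiabatic traces such as those in Figures 4 and 5 are typically reduced to an exotherm onset and a maximum self-heat rate. The sketch below performs that reduction on a synthetic temperature trace; the 0.02°C min−1 onset threshold is a common adiabatic-calorimetry convention assumed here, not a value taken from the paper.

```python
# Reduce a temperature-vs-time trace to exotherm onset temperature and
# maximum self-heat rate, as done for the VSP2 traces. The trace below is
# synthetic; the 0.02 C/min onset threshold is an assumed convention.
def analyze_trace(times_min, temps_c, onset_rate=0.02):
    # Finite-difference self-heat rates between consecutive samples.
    rates = [(temps_c[i + 1] - temps_c[i]) / (times_min[i + 1] - times_min[i])
             for i in range(len(temps_c) - 1)]
    # Onset: first point where the rate exceeds the detection threshold.
    onset_idx = next(i for i, r in enumerate(rates) if r > onset_rate)
    return temps_c[onset_idx], max(rates)

# Synthetic trace: flat induction period, then exponential self-heating.
times = list(range(0, 200, 5))
temps = [25.0 if t < 50 else 25.0 + 1.5 ** ((t - 50) / 10) for t in times]
t_onset, max_rate = analyze_trace(times, temps)
print(t_onset, round(max_rate, 2))
```

The same reduction applied to the measured traces yields the T0 and (dT dt−1)max entries of Table 3.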

According to the maximum temperature rise rate ((dT dt−1)max) and the maximum pressure rise rate ((dP dt−1)max) from Figures 6 and 7, MEKPO is more dangerous than H2O2. According to Figure 2, MEKPO and H2O2 were the Ops that may cause a violent reactivity behavior in the batch reactor.

3.3 Thermal decomposition analysis of 31 mass% MEKPO for TAM

TAM was employed to determine the reaction behavior under isothermal circumstances. Table 4 displays the thermal runaway decomposition of 31 mass% MEKPO under various isothermal environments by TAM. Results showed that the TMR for isothermal temperatures of 60, 70, and 80°C was about 200, 80, and 40 hrs, respectively. From the data investigations, as the isothermal temperature rose, the available emergency response time became shorter.

Table 4. Scanning data of the thermal runaway decomposition of 31 mass% MEKPO by TAM.

Isothermal temperature (°C)  Mass (mg)  Reaction time (hr)  TMR (hr)  ΔHd (J g−1)
60                           0.05008    800                 200       784
70                           0.05100    300                 80        915
80                           0.05224    250                 40        1,015

Figure 8. Thermal curves for 31 mass% MEKPO by TAM at 60–80°C isothermal temperature.

4 CONCLUSIONS

According to the DSC experimental data, MEKPO decomposes at 30–40°C. Under external fire circumstances, MEKPO can decompose quickly and cause a runaway reaction and thermal explosion. On the other hand, for 20 mass% MEKPO by VSP2, the maximum temperature (Tmax) and pressure (Pmax) were about 263°C and 34.7 bar, respectively. Under the 20 mass% MEKPO VSP2 test, the (dT dt−1)max and (dP dt−1)max were about 394.7°C min−1 and 140.0 bar min−1, respectively. During storage and transportation, a low concentration (<40 mass%) and a small amount of MEKPO should be controlled. H2O2 was controlled at 10°C when it joined the MEKPO manufacturing reaction. This is very dangerous for the MEKPO manufacturing process, so the reaction was a concern and was controlled at less than 20°C in the reactor.

ACKNOWLEDGMENT

The authors would like to thank Dr. Kuo-Ming Luo for valuable suggestions on experiments and the measurements of a runaway reaction. In addition, the authors are grateful for professional operating techniques and information.

REFERENCES

Ando, T., Fujimoto, Y. & Morisaki, S. 1991. J. of Hazard. Mater., 28: 251–280.
ASTM E537-76, 1976. Thermal Stability of Chemical by Methods of Differential Thermal Analysis.
Chang, R.H., Tseng, J.M., Jehng, J.M., Shu, C.M. & Hou, H.Y. 2006. J. Therm. Anal. Calorim., 83(1): 57–62.
Chervin, S. & Bodman, G.T. 2003. Process Saf. Prog., 22: 241–243.
Fessas, D., Signorelli, M. & Schiraldi, A. 2005. J. Therm. Anal. Cal., 82: 691.
Fu, Z.M., Li, X.R., Koseki, H.K. & Mok, Y.S. 2003. J. Loss Prev. Process Ind., 16: 389–393.
Hou, H.Y., Shu, C.M. & Duh, Y.S. 2001. AIChE J., 47(8): 1893–6.
Liao, C.C., Wu, S.H., Su, T.S., Shyu, M.L. & Shu, C.M. 2006. J. Therm. Anal. Cal., 85: 65.
Marti, E., Kaisersberger, E. & Emmerich, W.D. 2004. J. Therm. Anal. Cal., 77: 905.
MHIDAS, 2006. Major Hazard Incident Data Service, OHS_ROM, Reference Manual.
Miyakel, A., Kimura, A., Ogawa, T., Satoh, Y. & Inano, M. 2005. J. Therm. Anal. Cal., 80: 515.
Semenov, N.N. 1984. Z. Phys. Chem., 48: 571.
Sivapirakasam, S.P., Surianarayanan, M., Chandrasekaran, F. & Swaminathan, G. 2004. J. Therm. Anal. Cal., 78: 799.
Smith, D.W. 1982. ''Runaway Reaction, and Thermal Explosion'', Chemical Engineering, 13: 79–84.
The Isothermal Calorimetric Manual for Thermometric AB, 1998, Jarfalla, Sweden.
Tseng, J.M., Chang, R.H., Horng, J.J., Chang, M.K. & Shu, C.M. 2006. ibid, 85(1): 189–194.
Tseng, J.M., Chang, Y.Y., Su, T.S. & Shu, C.M. 2007. J. Hazard. Mater., 142: 765–770.
Wang, W.Y., Shu, C.M., Duh, Y.S. & Kao, C.S. 2001. Ind. Eng. Chem. Res., 40: 1125.
Yeh, P.Y., Duh, Y.S. & Shu, C.M. 2003. Ind. Eng. Chem. Res., 43: 1–5.
Yuan, M.H., Shu, C.H. & Kossoy, A.A. 2005. Thermochimica Acta, 430: 67–71.

Crisis and emergency management
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

A mathematical model for risk analysis of disaster chains

A. Xuewei Ji
Center for Public Safety Research, Tsinghua University, Beijing, China
School of Aerospace, Tsinghua University, Beijing, China

B. Wenguo Weng & Pan Li


Center for Public Safety Research, Tsinghua University, Beijing, China

ABSTRACT: A disaster often causes a series of derivative disasters. The spreading of disasters can often form disaster chains. A mathematical model for risk analysis of disaster chains is proposed on a causality network, in which each node represents one disaster and each arc represents the interdependence between them. The nodes of the network are classified into two categories, active nodes and passive nodes. The term inoperability risk, expressing the possible consequences of each passive node due to the influences of all other nodes, is cited to assess the risk of disaster chains. An example which may occur in real life is solved to show how to apply the mathematical model. The results show that the model can describe the interaction and interdependence of mutually affecting emergencies, and it can also estimate the risk of disaster chains.

1 INTRODUCTION

Nowadays, on account of increasing congestion in industrial complexes and increasing density of human population around such complexes, no disaster is isolated and static, and many disasters are interactional and interdependent. A disaster often causes a series of derivative disasters. In many cases, derivative disasters can cause greater losses than the original disaster itself. The spreading of disasters can often be described by an interconnected causality chain, which is called a disaster chain here. Disaster chains reflect how one factor or sector of a system affects others. These cascade-like chain reactions, by which a localized event in time and space causes a large-scale disaster, may affect the whole system. A recent example is the snowstorm in several provinces of southern China this year. The rare snowstorm blocked highways, railways, and shipping lines, leading to poor transportation of coal, which brought a strain on electricity in many regions. All of these formed typical disaster chains. Therefore, a great effort is necessary to prevent the chain effect of disasters and to improve the ability of emergency management.

Researchers have made some contributions to understanding the domino effect and chain effect behind disasters. For example, Gitis et al. studied catastrophe chains and hazard assessment, and they suggested a mathematical model describing mutually affecting catastrophes. Menoni et al. analyzed the direct and indirect effects of the Kobe earthquake and suggested substituting the concept of a chain of losses and failures for the simple couple hazardous event-damages. Glickman et al. proposed hazard networks to describe the interdependencies among hazards and determine risk reduction strategies. Studies on risk assessment of domino effects based on physical models and quantitative area risk analysis have made great progress, as for instance Khan et al. and Spadoni et al.

The purpose of this paper is to present a mathematical model on a causality network that allows studying the interaction and interdependence of mutually affecting disasters and conducting risk analysis of disaster chains.

2 MATHEMATICAL MODEL

2.1 Some concepts and assumptions

Disaster chains mean that one critical situation triggers another one and so on, so that the situation worsens even more. A disaster chain is a complex system which consists of interacting disasters. All the disasters form a causality network, in which each node represents one disaster and each arc represents the interdependence between them. In this section, a mathematical model is set up based on the causality network (see Figure 1) for risk analysis of disaster chains.

In the causality network, we classify the nodes of the network into two categories, active nodes and passive nodes. The term active nodes will be used

to refer to the nodes whose consequences (loss or damage) are embodied through passive nodes; for example, the loss from an earthquake comes from the damage it causes to some anthropogenic structures. Likewise, passive nodes will be used to refer to the nodes whose consequences are possible loss events and are embodied by themselves. Taking a power system as an example, the disaster is the consequence of the inability of part of its functions (called inoperability here). Risk is the integration of possibility and consequence. Therefore, the inoperability risk proposed by Haimes et al., which expresses the possible consequences of each passive node due to the influences of other nodes (including passive and active), is cited to assess the risk of disaster chains.

The proposed model is based on the assumption that the state of an active node is binary. That means that an active node has only two states (occurrence or nonoccurrence). The magnitude of an active node (disaster) is not taken into consideration in the model. In contrast to that, the state of a passive node can vary continuously from the normal state to the completely destroyed state.

2.2 The model

Let us define a certain active node with a certain occurrence probability as the start point of disaster chains. Let us index all nodes i = 1, 2, . . . , m, m + 1, m + 2, . . . , n. The nodes are active nodes if i = 1, 2, . . . , m, and passive nodes if i = m + 1, m + 2, . . . , n. We shall denote by aij the probability for any node i to induce directly the active node j, with aij > 0 in case of the existence of an arrow from node i to node j, and aij = 0 otherwise. By definition, we put aii = 0. So we get the matrix:

A = (aij), i = 1, 2, . . . , n, j = 1, 2, . . . , m    (1)

Figure 1. Causality network of disaster chains.

For each active node j = 1, 2, . . . , m, a random variable ξj is used to represent the state, which takes the value 1 or 0 according to whether the node occurs or not. For each passive node k (k = m + 1, m + 2, . . . , n), inoperability is cited to represent its state, which is assumed to be a continuous variable (ζk) evaluated between 0 and 1, with 0 corresponding to a normal state and 1 corresponding to a completely destroyed state. We shall consider the conditional distribution of ξj and postulate the Markov property: this distribution depends only on the influence of the directly connected nodes. We postulate the following expressions for these conditional probabilities:

P(ξj = 1 | ξi = xi, ζk = yk) = πj + Σ_i xi aij + Σ_k yk akj,
i, j = 1, . . . , m, k = m + 1, . . . , n    (2)

where πj denotes the probability of spontaneous occurrence of the active node j. In the above equation, it is supposed that

πj + Σ_{i=1}^{m} xi aij + Σ_{k=m+1}^{n} yk akj ≤ 1    (3)

Equation (2) means that the conditional probability of the event ξj = 1, given ξi = xi and ζk = yk, is a linear combination of the xi plus a constant. Only the nodes i directly connected with the node j are involved in this linear combination. This hypothesis forms the basis of the mathematical theory of the voter model (Malyshev et al.). Let us denote by Pj the unconditional probability P(ξj = 1). It is interpreted as the probability of the occurrence of the active node j. If we take the expectation value of (2), we get the equations:

Pj = πj + Σ_{i=1}^{m} Pi aij + Σ_{k=m+1}^{n} Rk akj    (4)

Pj = min(πj + Σ_{i=1}^{m} Pi aij + Σ_{k=m+1}^{n} Rk akj, 1)    (5)

where Rk denotes the inoperability of the passive node k. Next, we use the quantity bik to represent the inoperability of each passive node k caused by active node i. So we have the inoperability (Ck) of passive node k due to all the active nodes that are directly connected with the passive node:

Ck = Σ_{i=1}^{m} Pi bik    (6)

To assess the risk of disaster chains, we propose a quantity Mij called the direct influence coefficient.
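Equations (4)-(5) define one expectation-update step for the active-node probabilities. A minimal sketch of that step, with made-up toy numbers (not the paper's example):

```python
# One expectation-update step following equations (4)-(5):
# P_j = min(pi_j + sum_i x_i * a_ij, 1), where x concatenates the current
# active probabilities P and passive inoperabilities R. Toy numbers only.
def update_p(pi, a, p, r):
    x = p + r                       # states of all n nodes, active first
    return [min(pi_j + sum(x_i * a[i][j] for i, x_i in enumerate(x)), 1.0)
            for j, pi_j in enumerate(pi)]

# Toy network: m = 1 active node, one passive node (n = 2).
# a[1][0] = 0.5: the passive node induces the active one with prob. 0.5.
pi = [0.2]
a = [[0.0],     # an active node does not induce itself (a_ii = 0)
     [0.5]]     # passive -> active coupling
p_new = update_p(pi, a, p=[0.2], r=[0.4])
print(p_new)  # → [0.4]
```

The `min(..., 1.0)` clip is exactly the constraint that turns equation (4) into equation (5).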


It expresses the influence of inoperability between passive nodes i and j, which are directly connected. We get the direct influence coefficient matrix:

M = (Mij), i, j = m + 1, m + 2, . . . , n    (7)

The expression M^k reflects all influences over k − 1 nodes and k links, i.e. k = 1 corresponds to direct influences, k = 2 to feedback loops with one intermediate node, etc. So, the total influence coefficient matrix (Tij) can be expressed as:

T = M + M^2 + M^3 + · · · + M^k + · · · = Σ_{k=1}^{∞} M^k    (8)

As this converges only when the matrix (I − M)^{-1} exists, we obtain the formula:

T = (I − M)^{-1} − I    (9)

where I denotes the unity matrix. Besides, the total inoperability risk of the passive node k due to the influence of all the nodes (active and passive) can be expressed as:

Rk = Ck + Σ_{i=m+1}^{n} Ci Tik    (10)

Based on the above equations, we can get:

Rk = Ck + Σ_{i=m+1}^{n} Ri Mik    (11)

That is also:

Rk = Σ_{i=1}^{m} Pi bik + Σ_{i=m+1}^{n} Ri Mik    (12)

It is important to notice that, due to the constraint 0 ≤ Rk ≤ 1, equation (12) sometimes does not have a solution. If this is the case, we will have to solve an alternative problem, i.e.

Rk = min(Σ_{i=1}^{m} Pi bik + Σ_{i=m+1}^{n} Ri Mik, 1)    (13)

To sum up, the mathematical model for risk analysis of disaster chains can be summarized as:

Pj = πj + Σ_{i=1}^{m} Pi aij + Σ_{k=m+1}^{n} Rk akj
Rk = Σ_{i=1}^{m} Pi bik + Σ_{i=m+1}^{n} Ri Mik
s.t.
Pj = min(πj + Σ_{i=1}^{m} Pi aij + Σ_{k=m+1}^{n} Rk akj, 1)
Rk = min(Σ_{i=1}^{m} Pi bik + Σ_{i=m+1}^{n} Ri Mik, 1)
for j = 1, 2, . . . , m, k = m + 1, m + 2, . . . , n    (14)

Finally, in the mathematical model, we need to determine three matrices, i.e. the A matrix, the B matrix and the M matrix. For the matrix A, several physical models and probability models have been developed to give the probability that a disaster induces another one (Salzano et al.). Extensive data collecting, data mining and expert knowledge or experience may be required to help determine the M matrix. We can use historical data, empirical data and experience to give the matrix B.

3 EXAMPLE

To show how to apply the mathematical model, we solve the following example. We consider a disaster chain consisting of eight nodes, where the earthquake is considered as spontaneous with a known probability π1. The causality network of the disaster chains is illustrated by Figure 2. We take D1–D4 to be active nodes and D5–D8 to be passive nodes.

In principle, the matrices A, B and M are obtained from expert data and historical data. For simplicity, all the elements of the three matrices are regarded as constant. We have:

A = [ 0    0.4  0.6  0.2 ]
    [ 0    0    0.1  0   ]
    [ 0    0.4  0    0   ]
    [ 0    0    0    0   ]
    [ 0    0    0    0   ]
    [ 0    0    0    0   ]
    [ 0    0    0    0   ]
    [ 0    0.3  0    0   ]

B = [ 0.6  0    0.5  0.3 ]
    [ 0.9  0    0    0   ]
    [ 0    0    0    0   ]
    [ 0.5  0    0.1  0   ]

M = [ 0    0    0    0   ]
    [ 0.3  0    0    0   ]
    [ 0    0.4  0    0.9 ]
    [ 0    0.2  0.5  0   ]

Now, suppose π1 = 0.3. Due to the earthquake, fires in the commercial area, hazardous materials release and social crisis are induced, and the commercial area, public transportation system and power supply system are impacted. We get the results:

(P1, P2, P3, P4) = (0.30, 0.33, 0.21, 0.06)

(R5, R6, R7, R8) = (0.58, 0.23, 0.37, 0.42)
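The fixed-point system (14) for this example can be solved by simple successive substitution, which converges here because the couplings are weak. A sketch (node meanings follow the example above):

```python
# Solve the example's system (14) by successive substitution: repeatedly
# apply the clipped update equations until P and R settle.
A = [[0, 0.4, 0.6, 0.2], [0, 0, 0.1, 0], [0, 0.4, 0, 0], [0, 0, 0, 0],
     [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0.3, 0, 0]]  # 8 x 4
B = [[0.6, 0, 0.5, 0.3], [0.9, 0, 0, 0], [0, 0, 0, 0], [0.5, 0, 0.1, 0]]
M = [[0, 0, 0, 0], [0.3, 0, 0, 0], [0, 0.4, 0, 0.9], [0, 0.2, 0.5, 0]]

def solve(pi1, m=4, n=8, iters=200):
    pi = [pi1, 0.0, 0.0, 0.0]   # only the earthquake is spontaneous
    P, R = [0.0] * m, [0.0] * (n - m)
    for _ in range(iters):
        x = P + R               # states of all n nodes, active first
        P = [min(pi[j] + sum(x[i] * A[i][j] for i in range(n)), 1.0)
             for j in range(m)]
        R = [min(sum(P[i] * B[i][k] for i in range(m))
                 + sum(R[i] * M[i][k] for i in range(n - m)), 1.0)
             for k in range(n - m)]
    return P, R

P, R = solve(0.3)
print([round(p, 2) for p in P], [round(r, 2) for r in R])
# → [0.3, 0.33, 0.21, 0.06] [0.58, 0.23, 0.37, 0.42]
```

The converged values reproduce the (P1, ..., P4) and (R5, ..., R8) quoted above to two decimals.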

Thus, because of the possible influences of the earthquake and the series of disasters (fires and hazardous materials release) induced by it, the commercial area and the power plant lose 58% and 42% of their functionality, the economic situation deteriorates by 23% from its normal state, and the transportation system loses 37% of its operability. We conclude from the results above that the interdependences amplify the influence of the earthquake. From the viewpoint of disaster chains, the interdependences can thus be taken into consideration.

We can also evaluate the impact of an earthquake of varying occurrence probability on the risk distribution. Solving the problem yields the following results. When π1 is 0.52, the value of R5 reaches 1. The value of R8 reaches 1 when π1 increases to 0.72. This means the inoperability of the economic situation and the public transportation system is 0.55 and 0.87 at π1 = 0.72, while the other two systems have been completely destroyed. When π1 increases to 0.97, all of the systems completely fail except the economic situation. The results are shown in Fig. 3.

4 CONCLUSIONS

The mathematical model proposed here can be used to study the interaction and interdependence of mutually affecting disasters and to conduct risk analysis of disaster chains. It can also be used to analyze which disaster is more important compared to other disasters, and to provide a scientific basis for disaster chain emergency management. However, disaster chains are so complex that developing a meaningful mathematical model capable of studying disaster chains is an arduous task. The work here is only a preliminary attempt to contribute to the gigantic efforts.

In order to construct a mathematical model to study disaster chains in a causality network, we think that the following steps are needed:

• Enumerate all possible disasters in the causality network.
• For each node of the network, define a list of other nodes which can be directly induced by it and directly induce it, and define the interdependent relationships between them.
• Define appropriate quantities to describe the risk of disaster chains.

Then the risk analysis process of disaster chains can be described by the model presented above.
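The saturation thresholds quoted in the example (R5 reaching 1 near π1 = 0.52, R8 near 0.72) can be reproduced by sweeping π1 with the same successive-substitution solver; a bisection finds the smallest π1 that drives a given inoperability to 1. This is a compact, self-contained sketch of that sensitivity analysis:

```python
# Find the smallest earthquake probability pi1 at which a passive node's
# inoperability saturates at 1, by bisection over the example's solver.
A = [[0, 0.4, 0.6, 0.2], [0, 0, 0.1, 0], [0, 0.4, 0, 0], [0, 0, 0, 0],
     [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0.3, 0, 0]]
B = [[0.6, 0, 0.5, 0.3], [0.9, 0, 0, 0], [0, 0, 0, 0], [0.5, 0, 0.1, 0]]
M = [[0, 0, 0, 0], [0.3, 0, 0, 0], [0, 0.4, 0, 0.9], [0, 0.2, 0.5, 0]]

def solve_r(pi1, iters=300):
    pi = [pi1, 0.0, 0.0, 0.0]
    P, R = [0.0] * 4, [0.0] * 4
    for _ in range(iters):
        x = P + R
        P = [min(pi[j] + sum(x[i] * A[i][j] for i in range(8)), 1.0)
             for j in range(4)]
        R = [min(sum(P[i] * B[i][k] for i in range(4))
                 + sum(R[i] * M[i][k] for i in range(4)), 1.0)
             for k in range(4)]
    return R

def threshold(k, tol=1e-4):
    # Smallest pi1 with R[k] saturated at 1 (R is monotone in pi1).
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if solve_r(mid)[k] >= 1.0 - 1e-9:
            hi = mid
        else:
            lo = mid
    return hi

# k = 0 is R5 (commercial area), k = 3 is R8 (power plant).
print(round(threshold(0), 2), round(threshold(3), 2))  # → 0.52 0.72
```

Such a sweep over π1 is exactly what underlies the risk curves of Fig. 3.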

Figure 2. Causality network of our example.

Figure 3. Inoperability risk (power plant, commercial area, public transportation, economic situation) as a function of the probability of earthquake.

REFERENCES

Gitis, V.G., Petrova, E.N. & Pirogov, S.A. 1994. Catastrophe chains: hazard assessment, Natural hazards 10: 117–121.
Glickman, T.S. & Khamooshi, H. 2005. Using hazard networks to determine risk reduction strategies, Journal of the operational research society 56: 1265–1272.
Haimes, Y.Y. & Jiang, P. 2001. Leontief-based model of risk in complex interconnected infrastructures, Journal of infrastructure systems 7(1): 1–12.
Khan, F.I. & Abbasi, S.A. 2001. An assessment of the likelihood of occurrence, and the damage potential of domino effect (chain of accidents) in a typical cluster of industries, Journal of loss prevention in the process industries 14: 283–306.
Malyshev, V.A., Ignatyuk, I.A. & Molchanov, S.A. 1989. Momentum closed processes with local interaction, Selecta mathematica Sovietica 8(4): 351–384.
Menoni, S. 2001. Chains of damages and failures in a metropolitan environment: some observations on the Kobe earthquake in 1995, Journal of hazardous materials 86: 101–119.
Salzano, E., Iervolino, I. & Fabbrocino, G. 2003. Seismic risk of atmospheric storage tanks in the framework of quantitative risk analysis, Journal of loss prevention in the process industries 16: 403–409.
Spadoni, G., Egidi, D. & Contini, S. 2000. Through ARIPAR-GIS the quantified area risk analysis supports land-use planning activities, Journal of hazardous materials 71: 423–437.


Effective learning from emergency responses

K. Eriksson
LUCRAM (Lund University Centre for Risk Analysis and Management),
Department of Fire Safety Engineering and Systems Safety, Lund University, Lund, Sweden

J. Borell
LUCRAM (Lund University Centre for Risk Analysis and Management),
Department of Design Sciences, Lund University, Lund, Sweden

ABSTRACT: During recent years a couple of emergencies have affected the city of Malmo and its inhabitants
and forced the city management to initiate emergency responses. These emergencies, as well as other incidents,
are situations with a great potential for learning emergency response effectiveness. There are several methods
available for evaluating responses to emergencies. However, such methods do not always use the full potential
for drawing lessons from emergency situations that have occurred. Constructive use of principles or rules gained
during one experience (in this case an emergency response) in another situation is sometimes referred to as
‘positive transfer’. The objective of this paper is to develop and demonstrate an approach for improving learning
from the evaluation of specific response experiences through strengthening transfer. The essential principle in
the suggested approach is to facilitate transfer through designing evaluation processes so that dimensions of
variation are revealed and given adequate attention.

1 INTRODUCTION During the last four years three larger emergencies


have affected the city of Malmo. In 2004 the Box-
During recent years there has been an increased ing Day Tsunami resulted in an evacuation of people
demand from the public that society should prevent emergencies from occurring or at least minimize their negative outcomes. The demand for a more and more robust
society also enhances the need for the society to learn broke out. Also this incident resulted in an evacuation
from past experience. Experience and evaluations of of Swedish citizens, of whom many were connected
instances of emergency response are two possible to Malmo. Both these incidents resulted in a need
inputs to an emergency management preparedness for the city of Malmo to use their central emergency
process. Other important inputs are risk analyses, response organisation to support the people evacuated
exercises as well as experience from everyday work. from another part of the world. In April 2007 a riot
Ideally the result from an evaluation improves the organisation's ability to handle future incidents. Furthermore, experience of a real emergency situation
commonly creates awareness and willingness to pre- in the affected district. After each of these incidents
pare for future emergencies (Tierney et al. 2001; Boin Malmo has performed evaluations. A central question
et al. 2005). It is of central interest to question whether is: How can organisations be trained to face future
the full potential of an evaluation of an emergency unknown incidents based on known cases?
situation is used. Sometimes there is a tendency to The objective of this paper is to develop and
plan for the current situation (Carley & Harrald 1997). demonstrate an approach for strengthening emer-
For example, Lagadec (2006 p. 489) mentions that it is
essential to ‘‘. . . not prepare to fight the last war’’ but from the evaluation of specific emergency response
for future emergencies. For an overview of possible experiences. An approach has been developed based
barriers against learning from experience concerning on learning theories. The approach has been applied
emergency management, see for example Smith & to evaluations of emergency responses in Malmo.
Elliott (2007). The preliminary findings indicate that the developed

83
approach can improve experience-based learning in organisations.

2 METHOD

From theories of learning a first hypothetical approach for improving learning from evaluations of emergency responses was created. To further examine and refine the approach, it was tested through application on an evaluation of emergency response in the city of Malmo. The evaluation concerned the management of emergency response activities during the Lebanon war in July 2006. In addition, evaluations of the Boxing Day Tsunami in 2004 and the emergency management organisation's handling of a riot that broke out in a district of Malmo during April 2007 are used for further demonstration of the approach. The evaluations are based on analyses of interviews and collected documents. The interviews were carried out mainly with municipal actors involved during the incidents. In total 19 interviews have been completed. The documents analysed were for example notes from the response organisations' information systems, notes and minutes from managerial meetings during the events, written preparedness plans and newspaper articles. The use of the three evaluations can be seen as a first test of the approach for improved learning from evaluations of emergency situations.

3 THEORY

3.1 Organisational learning

For maintaining a response capability in an organisation over time, not only separate individuals but the entire organisation needs to have the necessary knowledge. According to Senge (2006 p. 129) ". . . Organizations learn only through individuals who learn. Individual learning does not guarantee organizational learning. But without it no organizational learning occurs". Argyris & Schön (1996) likewise point out that organisational learning is when the individual members learn for the organisation. Argyris & Schön (1996) also discuss two types of organisational learning: single-loop learning and double-loop learning. Single-loop learning occurs when an organisation modifies its performance due to a difference between expected and obtained outcome, without questioning and changing the underlying program (e.g. changes in values, norms and objectives). If the underlying program that led to the behaviour in the first place is questioned and the organisation modifies it, double-loop learning has taken place.

3.2 Transfer

Constructive use of principles or rules that a person gained during one experience (in this case an emergency response operation) in another situation is sometimes referred to as 'positive transfer' (Reber 1995). Transfer may be quite specific when two situations are similar (positive or negative transfer), but also more general, e.g. 'learning how to learn'. The concept of transfer is also discussed within organisational theory. At an organisational level the concept of transfer involves transfer at an individual level but also transfer between different individuals or organisations. Transfer at an organisational level can be defined as ". . . the process through which one unit (e.g. group, department, or division) is affected by experience of another" (Argote & Ingram 2000 p. 151).

3.3 Variation

One essential principle for facilitating the transfer process, established in the literature on learning, is to design the learning process so that the dimensions of possible variation become visible to the learners (Pang 2003, Marton & Booth 1999). Successful transfer for strengthening future capability demands that critical dimensions of possible variation specific to the domain of interest are considered (Runesson 2006).

When studying an emergency scenario two different kinds of variation are possible: variation of the parameter values and variation of the set of parameters that build up the scenario. The first kind of variation is thus the variation of the values of the specific parameters that build up the scenario. In practice, it is not possible to vary all possible parameter values. A central challenge is how to know which parameters are critical in the particular scenario and thus worth closer examination by simulated variation of their values. The variation of parameter values can be likened to the concept of single-loop learning (Argyris & Schön 1996). When the value of a given parameter in a scenario is altered, that is analogous to when a difference between expected and obtained outcome is detected and a change of behaviour is made.

The second kind of variation is variation of the set of parameters. This kind of variation may be discerned through e.g. discussing similarities as well as dissimilarities of parameter sets between different scenarios. The variation of the set of parameters can be likened to the concept of double-loop learning (Argyris & Schön 1996), wherein the system itself is altered due to an observed difference between expected and obtained outcome. A central question is what the possible sets of parameters in future emergency scenarios are.
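To make the two kinds of variation concrete, a scenario can be treated as a mapping from parameter names to values. The sketch below is purely illustrative: the parameter names and numbers are invented for the example and are not taken from the Malmo evaluations.

```python
# Illustrative sketch: a scenario as a set of named parameters.
# Parameter names and values are invented examples, not data from the paper.

tsunami_2004 = {
    "forewarning_days": 0,
    "people_arriving": 500,
    "need_psychosocial_support": True,
}

lebanon_2006 = {
    "forewarning_days": 2,
    "people_arriving": 800,
    "need_housing_and_food": True,
}

def vary_values(scenario, parameter, values):
    """First kind of variation (cf. single-loop learning): keep the
    parameter set, imagine other values for one parameter ('What if ...?')."""
    return [{**scenario, parameter: v} for v in values]

def compare_parameter_sets(a, b):
    """Second kind of variation (cf. double-loop learning): compare which
    parameters two scenarios share and where their sets differ."""
    shared = sorted(a.keys() & b.keys())
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    return shared, only_a, only_b

# 'What if many more people had arrived?'
worse_cases = vary_values(lebanon_2006, "people_arriving", [2000, 5000])

shared, only_tsunami, only_lebanon = compare_parameter_sets(
    tsunami_2004, lebanon_2006)
```

Varying values leaves the dictionary's keys untouched, while comparing key sets surfaces the structural differences between scenarios; this mirrors the single-loop/double-loop distinction drawn above.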
3.4 Scenario

A specific emergency situation can be described as a scenario, here seen as a description of a series of occurred or future events arranged along a timeline. Scenarios describing past emergencies ". . . answer the question: 'What happened?'" (Alexander 2000 pp. 89–90). An emergency scenario can further be seen as consisting of various parameters. The concept parameter is here defined in very broad terms: every aspect with a potential for variation in a scenario is seen as a parameter. For example the duration of a scenario or the quantity of resources that is needed can be viewed as parameters. Alexander (2000 p. 90) further mentions that when imagining the future the question to ask is "What if . . . ?".

One kind of parameter suitable for imagined variation that is discussed in the literature is the different needs or problems that arise during an emergency situation. Dynes (1994) discusses two different types of needs or problems that require a response during an emergency: the response generated and the agent generated needs. The response generated needs are quite general and often occur during emergencies. They are results of the particular organisational response to the emergency, e.g. the needs for communication and coordination. The agent generated needs (Dynes et al. 1981) are the needs and problems that the emergency in itself creates, for example search, rescue, care of injured and dead as well as protection against continuing threats. These needs tend to differ more between emergencies than the response generated needs do.

4 THE PROPOSED APPROACH

The main goal of this paper is to develop and demonstrate an approach to strengthening emergency response capability through improving learning from the evaluation of response experiences.

4.1 Description of the emergency scenario

The first step in the approach is to construct a description of the emergency scenario, i.e. create and document a description of an occurred emergency situation. The description of the occurred scenario is needed for further discussions on the organisation's ability to handle future emergencies. By describing the series of events that build up the scenario, the most relevant parameters can be identified. From this description it is then possible to vary the possible parameters as well as the set of parameters that build up the scenario, and thus answer Alexander's (2000) question: "What if . . . ?".

4.2 Variation of the values of the parameters

The second step is to vary the values of the parameters that build up the scenario. This may be carried out through imagining variation of the included parameters (that are seen as relevant) within the scenario description. Typical examples of parameters can be the length of forewarning or the number of people involved in an emergency response.

Variation of parameter values makes the parameters themselves as well as the possible variation of their values visible. This can function as a foundation for positive transfer to future emergency situations with similar sets of relevant parameters. This in turn may strengthen the capability to handle future emergencies of the same kind as the one evaluated, but with for example greater impact.

4.3 Variation of the set of parameters

The third step is the variation of the set of parameters. By comparing the current case with other cases, both occurred (e.g. earlier emergencies) and imagined (e.g. results from risk and vulnerability analyses), different sets of parameters can be discussed.

Obviously it is not possible to decide with general validity the ultimate set of parameters to prepare for. There is always a need for adaptation to the situation. Yet it is often possible for an organisation to observe a pattern of similarities in the parameters after a couple of evaluations. Even if two emergency scenarios differ when it comes to physical characteristics, there are often similarities in the managing of the scenarios.

A possible way to vary the set of parameters is to be inspired by Dynes's (1994) different types of needs or problems that arise during an emergency situation. Commonly similarities are found when studying response generated needs and it is thus wise to prepare to handle them. Even if agent generated needs tend to differ more between emergencies, paying them some attention too may be worthwhile.

Greater experience means more opportunities for positive transfer. Furthermore, with increasing experience of thinking in terms of varying the set of parameters and their values, it is probable that the organisation and its employees also develop the ability for more general transfer, through the abstract ability to think of variation of parameters.

4.4 Transferring information and knowledge

A step that is often given inadequate attention is the transferring of information and knowledge obtained during the managing and the evaluation of an emergency to the entire organisation. Thus there is a need for creating organisational learning (Carley & Harrald 1997). This task is not an easy one, and requires serious
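The first three steps of the approach could be sketched in code as follows. The event descriptions, the need labels (following Dynes's (1994) distinction) and the derived parameters are invented illustrations, not the paper's actual evaluation data.

```python
# Illustrative sketch of steps 1-3 of the approach; event texts and
# parameter names are invented examples, not the paper's actual data.
from dataclasses import dataclass

@dataclass
class Event:
    day: int          # position on the scenario timeline
    description: str  # what happened
    need_type: str    # "response generated" or "agent generated" (Dynes 1994)

# Step 1: describe the occurred scenario as events along a timeline.
scenario = [
    Event(0, "evacuees start arriving", "agent generated"),
    Event(0, "staff group formed to coordinate response", "response generated"),
    Event(3, "housing and food requested", "agent generated"),
]

# Step 2: derive parameters worth 'What if ...?' variation of their values.
parameters = {
    "duration_days": max(e.day for e in scenario),
    "agent_generated_needs": sorted(
        e.description for e in scenario if e.need_type == "agent generated"),
    "response_generated_needs": sorted(
        e.description for e in scenario if e.need_type == "response generated"),
}

# Step 3: varying the set of parameters, e.g. adding or dropping needs,
# corresponds to comparing this scenario with other occurred or imagined ones.
```

The point of the sketch is only that a documented timeline makes the parameters explicit, so that both their values and the set itself become available for what-if discussion.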
resources. Therefore it is essential for organisations to create a planned structure or process for this task. This step also includes activities such as education and exercises. In the end it is essential that the findings are carried by the individuals as well as codified in suitable artefacts of the organisation.

In addition, to create better transfer and organisational learning it is recommendable to work in groups throughout all steps of the approach. One reason for this is that more people can be potential messengers to the rest of the organisation.

5 DEMONSTRATING THE APPROACH

The main goal of this paper was to develop and demonstrate an approach for strengthening emergency response capability through improving learning from the evaluation of specific response experiences. From theories of learning a hypothetical approach has been constructed. Below the constructed approach will be demonstrated through application on the evaluation of the city of Malmo's response to the Lebanon war. In addition, the evaluations of Malmo's managing of the tsunami and the riot will be used in the demonstration.

5.1 Description of the emergency scenario

The test of the approach started from a construction of the emergency scenario. During this work critical parameters for the management of the response to the Lebanon war were observed.

5.2 Variation of the values of the parameters

During the evaluation of the Lebanon war two parameters were identified as especially critical: the staffing of the central staff group and the spreading of information within the operative organisation.

During the emergency situation the strained staffing situation was a problem for the people working in the central staff group. There was no plan for long-term staffing. A problem was that the crisis happened during the summer period, when most of the people that usually work in the organisation were on vacation. In addition, there seems to have been a hesitation to bring in more than a minimum of staff. As a result, some individuals on duty were overloaded with tasks. After a week these individuals were exhausted. This was an obvious threat to the organisation's ability to continue operations. Critical questions to ask are: What if the situation had been even worse, would Malmo have managed it? What if it had been even more difficult to staff the response organisation?

The spreading of information within the operative central organisation was primarily done by direct contact between people, either by telephone or mail. Malmo also had an administrative diary system that was supposed to be used as a way to spread information within the organisation. During the managing of the Lebanon war this system was not used as intended. Among other things, this was because some people had problems using it. The critical question to ask is thus: If the situation had been worse, how would they then have disseminated information within the organisation? In a worse situation, with many more people involved, it is probably not manageable to rely only on direct contact. Would Malmo have managed such a situation?

By asking what-if questions concerning these two critical aspects or parameters (as well as others), and in that way varying the parameters, the organisation and its employees might gain an improved understanding of their capability to manage future emergencies. When Malmo used the proposed approach and asked those critical questions, several problems became visible and were discussed. Due to this, Malmo has later developed new routines for staffing their emergency response organisation.

5.3 Variation of the set of parameters

The next step was to vary the set of parameters, e.g. by comparing the evaluated scenario to other scenarios.

During the interviews many people compared Malmo's managing of the Lebanon war with the managing of the tsunami. During both emergencies the Swedish government evacuated Swedish people to Sweden from other countries. Individuals arriving in Malmo caused Malmo to initiate emergency responses. Malmo's work during both situations aimed at helping the evacuated. In both situations it is possible to identify more or less the same response generated needs. Both situations required e.g. communication within the organisation and the creation of a staff group to coordinate the municipal response.

The agent generated needs were also similar in the two situations. Some of the affected individuals arriving in Sweden needed support and help from the authorities. But the type of support needed varied. For example, after the tsunami there were great needs for psychosocial support, while the people returning from Lebanon instead needed housing and food.

After the riot in 2007 a comparison with the managing of the Malmo consequences of the Lebanon war was carried out. This comparison showed many more dissimilarities than the comparison mentioned above, especially when discussing in terms of agent generated needs. For example, during the riot efforts were concentrated on informing the inhabitants of what the city was doing and on providing constructive activities for youths. The response generated needs, e.g. a need to initiate some form of emergency response organisation and communication with media, were the same
as during Malmo's managing of the Lebanon related events.

Really interesting questions to ask are: What will happen in the future? What if Malmo is hit by a storm or a terrorist attack? How would that influence the city? Who will need help? Will there be any need for social support or just a need for technical help? Will people die or get hurt? Obviously, this is a never ending discussion. The main idea is not to find all possible future scenarios, but to create a capability for this way of thinking. There is a need to always expect the unexpected.

5.4 Transferring information and knowledge

To support transfer of the result throughout the organisation, the evaluation of the Lebanon war resulted in seminars for different groups of people within the organisation, e.g. the preparedness planners and the persons responsible for information during an emergency. These seminars resulted in thorough discussions in the organisation on emergency management capability. In addition, an evaluation report was made and distributed throughout the organisation. The discussions during and after the evaluation of the Lebanon war also led to changes in Malmo's emergency management planning and plans. Some of these changes can be considered examples of double-loop learning, with altering of parameters, expected to improve future emergency responses.

6 DISCUSSION

This paper has focused on the construction of an approach for strengthening emergency response capability through improving learning from the evaluation of specific response experiences.

We have described how imaginary variation of sets of parameters and parameter values can be used in evaluation processes around scenarios. Discussing in terms of such variation has proven useful. This holds for written reports as well as presentations. Similar views are discussed in the literature. For example Weick & Sutcliffe (2001) discuss how to manage the unexpected. They describe certain attitudes within an organisation, e.g. that the organisation is preoccupied with failures and reluctant to simplify interpretations, that help the organisation to create mindfulness. A mindful organisation continues to question and reconsider conceptualisations and models and thus increases the reliability of its operations. Likewise Clarke (2005) discusses the need for playing with scenarios and imagining different possible futures. "It is sometimes said that playing with hypothetical scenarios and concentrating on consequences is unproductive. But if we can do those things in a reasonably disciplined way, we can be smarter and more imaginative" (Clarke 2005 p. 84).

The application of the approach during the evaluation of the managing of the Lebanon war consequences appears to have strengthened the learning in the organisation. This statement is partly based on opinions expressed by the individuals in the organisation involved in the discussions during and after the evaluation. During the reception of the written report as well as during seminars and presentations of the subject, we found that the organisation understood and made use of the way of thinking generated by using the proposed approach. Subsequently the organisation used the findings from this way of thinking in the revision of their emergency management plan.

The new way of thinking seems to have provided the organisation with a more effective way of identifying critical aspects. Consequently, it comes down to being sensitive to the critical dimensions of variation of these parameters. There is still a need to further study how an organisation knows which the critical dimensions are. The approach also needs to be further evaluated and refined in other organisations and for other forms of emergencies.

7 CONCLUSION

Seeing scenarios as sets of parameters, and elaborating on the variation of parameter values as well as the set of parameters, seems to offer possibilities for strengthening transfer. It may thus support emergency response organisations in developing rich and many-sided emergency management capabilities based on evaluations of occurred emergency events.

REFERENCES

Alexander, D. 2000. Scenario methodology for teaching principles of emergency management. Disaster Prevention and Management: An International Journal 9(2): 89–97.
Argote, L. & Ingram, P. 2000. Knowledge Transfer: A Basis for Competitive Advantage in Firms. Organizational Behavior and Human Decision Processes 82(1): 150–169.
Argyris, C. & Schön, D.A. 1996. Organizational Learning II: Theory, Method, and Practice. Reading, Massachusetts: Addison-Wesley Publishing Company.
Boin, A., 't Hart, P., Stern, E. & Sundelius, B. 2005. The Politics of Crisis Management: Public Leadership Under Pressure. Cambridge: Cambridge University Press.
Carley, K.M. & Harrald, J.R. 1997. Organizational Learning Under Fire. The American Behavioral Scientist 40(3): 310–332.
Clarke, L. 2005. Worst cases: terror and catastrophe in the popular imagination. Chicago: University of Chicago Press.
Dynes, R.R. 1994. Community Emergency Planning: False Assumptions and Inappropriate Analogies. International Journal of Mass Emergencies and Disasters 12(2): 141–158.
Dynes, R.R., Quarantelli, E.L. & Kreps, G.A. 1981. A perspective on disaster planning, 3rd edition (DRC Research Notes/Report Series No. 11). Delaware: University of Delaware, Disaster Research Center.
Lagadec, P. 2006. Crisis Management in the Twenty-First Century: "Unthinkable" Events in "Inconceivable" Contexts. In H. Rodriguez, E.L. Quarantelli & R. Dynes (eds), Handbook of Disaster Research: 489–507. New York: Springer.
Marton, F. & Booth, S. 1999. Learning and Awareness. Mahwah, NJ: Erlbaum.
Pang, M.F. 2003. Two Faces of Variation: on continuity in the phenomenographic movement. Scandinavian Journal of Educational Research 47(2): 145–156.
Reber, A.S. 1995. The Penguin dictionary of psychology. London: Penguin Books.
Runesson, U. 2006. What is it Possible to Learn? On Variation as a Necessary Condition for Learning. Scandinavian Journal of Educational Research 50(4): 397–410.
Senge, P.M. 2006. The Fifth Discipline: The Art and Practice of the Learning Organization. London: Random House Business.
Smith, D. & Elliott, D. 2007. Exploring the Barriers to Learning from Crisis: Organizational Learning and Crisis. Management Learning 38(5): 519–538.
Tierney, K.J., Lindell, M.K. & Perry, R.W. 2001. Facing the unexpected: Disaster preparedness and response in the United States. Washington, D.C.: Joseph Henry Press.
Weick, K.E. & Sutcliffe, K.M. 2001. Managing the Unexpected: Assuring High Performance in an Age of Complexity. San Francisco: Jossey-Bass.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

On the constructive role of multi-criteria analysis in complex decision-making: An application in radiological emergency management

C. Turcanu, B. Carlé, J. Paridaens & F. Hardeman
SCK·CEN Belgian Nuclear Research Centre, Inst. Environment, Health and Safety, Mol, Belgium

ABSTRACT: In this paper we discuss the use of multi-criteria analysis in complex societal problems and we illustrate our findings from a case study concerning the management of radioactively contaminated milk. We show that application of multi-criteria analysis as an iterative process can benefit not only the decision-making process in the crisis management phase, but also the activities associated with planning and preparedness. New areas of investigation (e.g. zoning of affected areas or public acceptance of countermeasures) are tackled in order to gain more insight into the factors contributing to a successful implementation of protective actions and the stakeholders' values coming into play. We follow the structured approach of multi-criteria analysis and we point out some practical implications for the decision-making process.

1 INTRODUCTION

Management of nuclear or radiological events leading to contamination in the environment is a complex societal problem, as proven by a number of events ranging from nuclear power plant accidents to loss of radioactive sources. In addition to the radiological consequences and the technical feasibility, an effective response strategy must also take into account public acceptability, communication needs, ethical and environmental aspects, the spatial variation of contamination and the socio-demographic background of the affected people. This has been highlighted in the international projects of the 1990s (French et al. 1992, Allen et al. 1996) and reinforced in recent European projects—e.g. STRATEGY, for site restoration (Howard et al. 2005) and FARMING, for extending the stakeholders' involvement in the management of contaminated food production systems (Nisbet et al. 2005)—as well as in reports by the International Atomic Energy Agency (IAEA 2006, p. 86).

In this context, multi-criteria decision aid (MCDA) appears as a promising paradigm since it can deal with multiple, often conflicting, factors and value systems. It also helps overcome the shortcomings of traditional Cost-Benefit Analysis, especially when dealing with values that cannot be easily quantified or, even less, translated into monetary terms due to their intangible nature.

A number of nuclear emergency management decision workshops (e.g. French 1996, Hämäläinen et al. 2000a, Geldermann et al. 2005) substantiate the role of MCDA as a useful tool in stimulating discussions, structuring the decision process and achieving a common understanding of the decision problem and the values at stake.

Using MCDA in an iterative fashion (as recommended e.g. by Dodgson et al. 2005) also brings about an important learning dimension, especially for an application field such as nuclear or radiological emergency management, where planning and preparedness are two essential factors contributing to an effective response. An elaborated exercise policy will serve as a cycle addressing all the steps of the emergency management process: accident scenario definition, emergency planning, training and communication, exercising and response evaluation, fine-tuning of plans, etc. (see Fig. 1).

[Figure 1. The emergency management cycle: risk assessment and accident scenarios; emergency planning and response scenarios; training, communication and exercises; simulated response; evaluations and lessons learned; response, interventions and crisis management; post-crisis interventions, recovery and rehabilitation.]
It is therefore of interest to extend the study of of MAVT software in decision conferences can be
potential applications of MCDA in this particular field more successful provided there is a careful planning in
and to try to integrate it into the socio-political con- advance, particularly for use in emergency exercises.
text specific to each country. Drawing on these ideas, In what concerns the aggregation function used, the
the research reported in this paper aimed at adding usual choice is an additive aggregation. The use of
the knowledge and experience from key stakeholders a utility function of exponential form had been ear-
in Belgium to the multi-criteria analysis tools devel- lier proposed by Papamichail & French (2000), but
oped for decision-support, and at exploring various it proved non-operational due to missing information
MCDA methods. The management of contaminated on the variance of actions’ scores. It is interesting to
milk after a radioactive release to the environment has notice that although risk aversion could seem as a nat-
been chosen as case study. ural attitude, Hämäläinen et al. (2000b) report a case
In Section 2 we discuss about the use of multi- study where half of the participants were risk averse
criteria analysis in nuclear emergency management. and half were risk seeking, possibly due to a different
In Section 3 we outline the Belgian decision-making perception of the decision problem. Related to this,
context. This will set the framework for Section 4 in the authors of this case study refer to the hypothesis
which we elaborate on the results from a stakeholder of Kahneman & Tversky (1979) that people are often
process carried out in Belgium in order to set up a risk seeking when it comes to losses (e.g. lives lost) and
theoretical and operational framework for an MCDA risk averse when it comes to gains (e.g. lives saved).
model for the management of contaminated milk. The Another type of MCDA approach, addressing at the
constructive role of multi-criteria analysis, the conclu- same time the definition of countermeasure strategies
sions and lessons learnt are summarised in the final as potential actions, as well as the exploration of the
section. efficient ones is due to Perny & Vanderpooten (1998).
In their interactive MCDA model, the affected area is
assumed to be a priori divided in a number of zones,
2 MCDA IN NUCLEAR EMERGENCY whereas the treatment of food products in each zone
MANAGEMENT can be again divided among various individual coun-
termeasures. The practical implementation makes use
MCDA provides a better approach than CBA for of a multi-objective linear programming model having
nuclear emergency management and site restoration, as objective functions cost, averted collective dose and
first and foremost because the consequences of poten- public acceptability. This modelling, although ques-
tial decisions are of a heterogeneous nature, which tionable as far as public acceptability—and qualitative
impedes coding them on a unique, e.g. monetary, factors in general—is concerned, allows a great deal
scale. In an overwhelming majority (e.g. French 1996, of flexibility at the level of defining constraints and
Zeevaert et al. 2001, Geldermann et al. 2005, Panov exploration of efficient, feasible strategies.
et al. 2006, Gallego et al. 2000), the MCDA methods The various degrees of acceptance of such decision
used up to now in connection with the management aid tools in the different countries suggest however that
of radiological contaminations mainly draw from multi-attribute value/utility theory (MAU/VT). Research in this application field for MCDA started in the early 1990s in the aftermath of the Chernobyl accident (French et al. 1992) and was subsequently linked to the European decision-support system RODOS (Ehrhardt & Weiss 2000), as reported e.g. in French (1996), Hämäläinen et al. (1998) and, more recently, Geldermann et al. (2006).
The reported use of MAU/VT for comparing and ranking countermeasure strategies suggests that it facilitates the identification of the most important attributes, thus contributing to a shared understanding of the problem among the different stakeholders concerned by the decision-making process. Among the drawbacks revealed was the informational burden of setting criteria weights, especially for socio-psychological attributes (e.g. Hämäläinen et al. 1998), or even of evaluating the impact of potential actions on such non-quantifiable factors. Some recent studies (Mustajoki et al. 2007) suggest that the application

the processes envisaged must be tuned to fully accommodate the stakeholders' needs (Carter & French 2005). At the same time, the role of such analyses in a real crisis is still a subject of debate (Hämäläinen et al. 2000a, Mustajoki et al. 2007).
Early stakeholder involvement in the design of models developed for decision support is needed in order to accommodate their use to the national context, e.g. to ensure that the hypotheses assumed, the models and the type of reasoning used reflect the actual needs of the decision-making process. From an operational viewpoint, the MCDA process can be used to check whether the existing models fit e.g. the databases and decision practices in place and, at the same time, whether they are sufficiently flexible at all levels: from definition of potential actions and evaluation criteria up to modelling of comprehensive preferences.
Concerning the general use of MCDA methods, some authors (Belton & Stewart 2002) suggest that value function methods are well suited ''within workshop settings, facilitating the construction of

preferences by working groups who mainly represent stakeholder interests''. Outranking methods and what/if experiments with value functions might instead be better ''fitted for informing political decision-makers regarding the consequences of a particular course of action''.
The exploration of an outranking methodology is motivated by some particularities of our decision problem (Turcanu 2007). Firstly, the units of the evaluation criteria (e.g. averted dose, cost, and public acceptance) are heterogeneous, and coding them into one common scale appears difficult and not entirely natural. Secondly, the compensation issues between gains on some criteria and losses on other criteria are not readily quantifiable. In general, Kottemann & Davis (1991) suggest that the degree to which the preference elicitation technique employed requires explicit trade-off judgments influences the ''decisional conflict'' that can negatively affect the overall perception of a multi-criteria decision support system. Thirdly, the process of weighting and judging seems in general more qualitative than quantitative. When it comes to factors such as social or psychological impact, the process of giving weights to these non-quantifiable factors may appear questionable. Furthermore, outranking methods relax some of the key assumptions of the MAUT approach, for example comparability between any two actions and transitivity of preferences and indifferences.
In the following sections, the focus is laid on the management of contaminated milk. The continuous production of milk indeed requires urgent decisions due to the limited storage facilities. Moreover, dairy products are an important element of the diet, especially for children, who constitute the most radiosensitive population group. Finally, for certain radionuclides with high radiotoxicity, such as radiocaesium and radioiodine, maximum levels of activity concentration in milk are reached within a few days after the ground deposition of the radioactive material (Nisbet, 2002).


3 NUCLEAR AND RADIOLOGICAL EMERGENCY MANAGEMENT IN BELGIUM

In the context of the Belgian radiological emergency plan (Royal Decree, 2003 and subsequent amendments), the decision maker, also called the ''Emergency Director'', is the Minister of the Interior or his representative. He is chairman of the decision cell, i.e. the Federal Co-ordination Committee. The decision cell is advised by the radiological evaluation cell, mainly in charge of the technical and feasibility aspects, and by the socio-economic evaluation cell.
Various other organisations and stakeholders have a direct or indirect impact on the efficiency of the chosen strategies. The primary goal of food and agricultural countermeasures in case of a radioactive contamination in the food chain is to reduce the radiological risk for people consuming contaminated foodstuff. For such cases, maximum permitted radioactivity levels for food products, also called the European Council Food Intervention Levels—CFIL—(CEC, 1989), have been laid down and are adopted by Belgian legislation as well.
The experience from the European project FARMING (Vandecasteele et al. 2005) clearly showed, however, that the characteristics of the agriculture and the political environment, as well as past experiences of food chain contamination crises, are also important factors to be taken into account. This explains why some countermeasures aiming at reducing the contamination in food products were considered hardly acceptable by the stakeholders, if at all, even when supported by scientific and technical arguments.
Decisions taken have potentially far-reaching consequences, yet they often have to be made under time pressure and conditions of uncertainty. Also in the case of food countermeasures, time may become an important issue. For example, in the case of milk, the large amounts produced daily vs. the limited storage facilities of dairies make it necessary to take a decision rapidly for the management of contaminated milk (processing, disposal, etc.).


4 MCDA MODEL FRAMEWORK

In this section we discuss the multi-criteria decision aid framework developed for the management of contaminated milk. We describe the stakeholder process and the key elements of the MCDA model. Additional details on the stakeholder process and its results can be found in Turcanu et al. (2006); in the following we point out the new areas of investigation opened by this research and we underline the constructive dimension brought by the use of MCDA.

4.1 The stakeholder process

In the stakeholder process (Turcanu et al. 2006), we carried out 18 individual interviews with governmental and non-governmental key actors for the management of contaminated milk: decision-makers, experts and practitioners. This included various areas of activity: decision making (Ministry of the Interior); radiation protection and emergency planning; radioecology; safety of the food chain; public health; dairy industry; farmers' association; communication; local decision making; social science; and management of radioactive waste.
We decided to make use of individual interviews rather than focus group discussions or questionnaires in order to allow sufficient time for discussions,

to cover in a consistent way a range of stakeholders as complete as possible, and to facilitate free expression of opinions. The discussion protocol used followed the main steps described below.

4.2 Potential actions

Several individual or combined countermeasures can be employed for the management of contaminated milk (Howard et al. 2005), for instance disposal of contaminated milk; prevention or reduction of activity in milk (clean feeding, feed additives); and/or storage and processing to dairy products with low radioactivity retention factors. The need for a more flexible way to generate potential actions, a requirement which came out of the stakeholder process, led to the development of a prototype tool allowing the integration of various types of data: data from the food agency (e.g. location of dairy farms and local production), modelling and measurement data (e.g. ground depositions of radioactive material at the location of dairy farms), and other data such as administrative boundaries, etc. (see Fig. 2).
Simulated deposition data, for example for iodine and caesium, were generated in discrete points of a grid using dispersion and deposition models, e.g. as existing in the RODOS decision-support system (Ehrhardt and Weiss 2000), and introduced into a geographical information system (GIS). Here, the data were converted into continuous deposition data through a simple natural neighbour algorithm. Next, the deposition data were evaluated at the locations of dairy farms and combined with production data, both provided by the federal food agency. The GIS allows overlaying these data with other spatial information such as administrative boundaries, topological maps, selected zones, population data, etc.
A data fusion tool was subsequently implemented in spreadsheet form. This tool can be used for a fast calculation of e.g. the amount of production in the selected zone, the implementation costs for the selected countermeasure, and the collective doses and maximal individual doses due to ingestion of contaminated foodstuff (the dose is an objective measure of the detriment to health), helping the decision makers or advisers in choosing the set of potential actions to be further evaluated.

Figure 2. Integration of various types of data (GIS data, Food Agency data, model and measurement data) and selection of areas and countermeasures.

4.3 Evaluation criteria

The evaluation criteria were built through a process combining a top-down and a bottom-up approach (Turcanu et al. 2006).
The stakeholders interviewed were first asked to identify all the relevant effects, attributes and consequences of potential actions. Subsequently, they commented on a list of evaluation criteria derived from the literature and amended it, if felt necessary. Based on the resulting list, a number of evaluation criteria was built taking account, as much as possible, of the properties of exhaustiveness, cohesiveness and non-redundancy (see Roy 1996 for a description of these concepts), in order to arrive at a consistent set of evaluation criteria. The example discussed at the end of this section illustrates the list of criteria proposed.
Here we should mention that one of the evaluation criteria highlighted by all stakeholders as very important is public acceptance. To have a better assessment of it, we included in a public survey in Belgium (Turcanu et al. 2007) a number of issues relevant for the decision-making process: i) public acceptance of various countermeasures; ii) consumer's behaviour. The results showed that clean feeding of dairy cattle and disposal of contaminated milk are the preferred options in case of contaminations above legal norms. For contaminations below legal norms, normal consumption of milk seemed better accepted than disposal. Nonetheless, the expressed consumer's behaviour revealed a precautionary tendency: the presence of radioactivity at some step in the food chain could lead to avoiding purchasing products from affected areas. Finally, public trust building was revealed as a key element of a successful countermeasure strategy.
The resulting distributions of acceptance degrees (from ''strong disagreement'' to ''strong agreement'') on the sample of respondents can be compared in several ways (Turcanu et al. 2007). To make use of all information available, an outranking relation S (with the meaning ''at least as good as'') can be defined on the set of individual countermeasures, e.g. based on stochastic dominance:

aSb ⇔ Σ(j≤i) aj ≤ Σ(j≤i) bj + θ, ∀ i = 1, ..., 5
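The outranking relation S lends itself to a direct implementation; aj, bj and θ are as defined in the text that follows the formula. A minimal sketch, using invented percentage distributions purely for illustration:

```python
def outranks(a, b, theta=5.0):
    """Test a S b ("a at least as good as b") via the cumulative
    (stochastic-dominance) condition over the five qualitative labels,
    ordered from "strong disagreement" (j = 1) to "strong agreement"
    (j = 5). a and b are percentage distributions over the labels;
    theta absorbs the uncertainty in the estimated percentages."""
    cum_a = cum_b = 0.0
    for a_j, b_j in zip(a, b):        # partial sums for i = 1, ..., 5
        cum_a += a_j
        cum_b += b_j
        if cum_a > cum_b + theta:     # condition violated for this i
            return False
    return True

# Invented distributions for two countermeasures (not survey data):
widely_accepted = [5, 10, 15, 40, 30]
widely_rejected = [30, 25, 20, 15, 10]
print(outranks(widely_accepted, widely_rejected))  # True
print(outranks(widely_rejected, widely_accepted))  # False
```

A smaller θ makes the test stricter; θ = 0 reduces it to plain first-order stochastic dominance.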

where aj, bj are the percentages of respondents using the j-th qualitative label to evaluate countermeasures a and b, respectively, and θ is a parameter linked to the uncertainty in the evaluation of aj and bj. A simpler approach, but with loss of information, is to derive a countermeasure's public acceptance score as e.g. the percentage of respondents agreeing with the countermeasure.
Despite the inherent uncertainty connected to assessing the public acceptance of countermeasures in ''peace time'', we consider that such a study is useful for emergency planning purposes, especially for situations when there is a time constraint, as is the case for the management of contaminated milk.

4.4 Formal modelling of evaluation criteria

The preferences with respect to each criterion were modelled with the ''double threshold'' model (Vincke 1992). Each criterion was thus represented by a real-valued positive function associated with two types of discrimination thresholds: an indifference threshold q(·) and a preference threshold p(·). For a criterion g to be maximised and two potential actions a and b, we can define the following relations:

a I b (a indifferent to b) ⇔ g(a) ≤ g(b) + q(g(b)) and g(b) ≤ g(a) + q(g(a));
a P b (a strictly preferred to b) ⇔ g(a) > g(b) + p(g(b));
a Q b (a weakly preferred to b) ⇔ g(b) + q(g(b)) < g(a) and g(a) ≤ g(b) + p(g(b)).

Under certain consistency conditions for p and q, this criterion model corresponds to what is called a pseudo-criterion (see Roy 1996).
The choice of the double threshold model is motivated by the fact that for certain criteria (e.g. economic cost) it might not be possible to conclude a strict preference between two actions scoring similar values, e.g. due to the uncertainties involved, while an intermediary zone exists between indifference and strict preference. The double threshold model is a general one, easy to particularise for other types of criteria. For example, by setting both thresholds to zero, one obtains the traditional, no-threshold model.
In order to account in an intuitive way for both situations, when either a fixed or a variable threshold could be better fitted for a given criterion gi, the following discrimination thresholds were chosen:

qi(gi(a)) = max{qi^ref, q0i · gi(a)} and
pi(gi(a)) = p0 · qi(gi(a)),

with p0 > 1 a fixed value, and qi^ref ≥ 0 and 0 ≤ q0i < 1 parameters depending on the criterion gi.

4.5 Comprehensive preferences

In order to derive comprehensive preferences, we discussed the relative importance of the evaluation criteria with the stakeholders interviewed. This notion can be interpreted differently (Roy & Mousseau 1996), depending on the type of preference aggregation method used.
We investigated the adequacy for our application of four types of inter-criteria information: i) substitution rates (trade-offs) between criteria; ii) criteria weights as intrinsic importance coefficients; iii) criteria ranking with possible ties; iv) a partial ranking of subfamilies of criteria.
For each of these, we asked the stakeholders interviewed whether such a way of expressing priorities is suitable and, most importantly, whether they would be willing to provide or receive such information. Our discussions revealed a higher acceptance of the qualitative approaches, which indicates that outranking methods might be better suited. The concept of weights as intrinsic importance coefficients proved hard to understand, but encountered a smaller number of opponents than weights associated with substitution rates. The main argument against the latter can be seen in ethical motivations, e.g. the difficulty of arguing for a value trade-off between the doses received and the economic cost. Methods of outranking type that can exploit a qualitative expression of inter-criteria information are, for instance, the MELCHIOR method (Leclercq 1984) or ELECTRE IV (Roy 1996).
Let us now consider the case when the inter-criteria information is incomplete, because the decision-maker is not able or not willing to give this information. Let us suppose that the information about the relative importance of criteria is available in the form of a function assumed irreflexive and asymmetric:

ι : G × G → {0, 1}, such that ι(gm, gp) = 1 ⇔ criterion gm is ''more important than'' criterion gp,

where G is the complete set of evaluation criteria.
The function ι, comparing the relative importance of individual criteria, can be extended to subsets of

criteria in a manner inspired by the MELCHIOR method: we test whether the favourable criteria are more important than the unfavourable criteria.
Formally stated, we define recursively a mapping ι* : ℘(G) × ℘(G) → {0, 1} as:

ι*(F, ∅) = 1, ∀ ∅ ≠ F ⊂ G,
ι*(∅, F) = 0, ∀ F ⊂ G,
ι*({gm}, {gp}) = 1 ⇔ ι(gm, gp) = 1,
ι*({gm} ∪ F, H) = 1, with {gm} ∪ F ⊂ G and H ⊂ G ⇔ ι*(F, H) = 1 or ∃ gp ∈ H: ι(gm, gp) = 1 and ι*(F, H\{gp}) = 1.

We further define a binary relation R representing comprehensive preferences on the set of potential actions A as follows:

∀ a, b ∈ A, R(a, b) = ι*(F, H), where F = {gi ∈ G | a Pi b}, H = {gi ∈ G | b (Pi ∪ Qi) a},

and (Pi, Qi, Ii) is the preference structure associated with criterion gi.

4.6 An illustrative example

In this section we discuss a hypothetical (limited-scale) 131I milk contamination. The potential actions (suitable decision alternatives) are described in Table 1. Table 2 gives the complete list of evaluation criteria, and Table 3 summarises the impact of potential actions with respect to these criteria.

Table 1. Potential actions: Example.

Action  Description
A1      Do nothing
A2      Clean feed in area defined by sector (100°, 119°, 25 km)
A3      Clean feed in area where deposited activity >4000 Bq/m2
A4      Clean feed in area where deposited activity >4000 Bq/m2, extended to full administrative zones
A5      Storage for 32 days in area where deposited activity >4000 Bq/m2
A6      Storage for 32 days in area where deposited activity >4000 Bq/m2, extended to full administrative zones

Table 2. Evaluation criteria: Example.*

                                        Variable indiff.   Minimal indiff.      Optimisation
Criterion                               threshold (q0i)    threshold (qi^ref)   direction
C1  Residual collective effective
    dose (person·Sv)                    10%                10 person·Sv         min
C2  Maximal individual dose (mSv)       5%                 0.5 mSv (thyroid)    min
C3  Implementation cost (k€)            10%                20 k€                min
C4  Waste (tonnes)                      10%                1 t                  min
C5  Public acceptance                   0                  0                    max
C6  (Geographical) feasibility          0                  0                    max
C7  Dairy industry's acceptance         0                  0                    max
C8  Uncertainty of outcome              0                  0                    min
C9  Farmers' acceptance                 0                  0                    max
C10 Environmental impact                0                  0                    min
C11 Reversibility                       0                  0                    max

* p0 = 2 for all cases.

Table 3. Impact of potential actions.*

        C1           C2     C3     C4   C5  C6  C7  C8
Action  (person·Sv)  (mSv)  (k€)   (t)  –   –   –   –
A1      4            100    0      0    0   1   0   3
A2      0.1          3.6    240    16   3   1   2   1
A3      0.3          4.6    17     16   3   1   2   1
A4      0.3          4.6    27     16   3   2   2   1
A5      0.8          20     1.3    0    2   1   1   2
A6      0.8          20     2.5    0    2   2   1   2

* On criteria C9–C11 all actions score the same; therefore they are not mentioned in the table.

When inter-criteria information is not available, the comprehensive preferences resulting from the aggregation method given above are illustrated in Fig. 3. For instance, the arrow A4 → A3 means that action A4 is globally preferred to action A3. We can see that there are also actions which are incomparable, for instance actions A4 and A6; the situation changes, however, when some information is given concerning the relative importance of criteria.
Let us assume that the decision-maker states that the maximal individual dose is more important than any other criterion, and that public acceptance is more important than the cost of implementation and the geographical feasibility. We then obtain the results presented in Fig. 4, highlighting both actions A4 and A2 as possibly interesting choices.
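The aggregation just illustrated can be checked mechanically. The following sketch (threshold parameters and impacts transcribed from Tables 2 and 3, with p0 = 2; minimised criteria are handled by mirroring the maximisation definitions, which is our reading of the model) recomputes the comprehensive preferences of Fig. 3 for the case without inter-criteria information:

```python
# Pseudo-criterion comparison (section 4.4) and iota*/R aggregation
# (section 4.5) applied to Tables 2-3. C9-C11 are omitted because all
# actions score the same on them.
P0 = 2.0

# name: (q0, q_ref, maximise)
CRITERIA = {
    "C1": (0.10, 10.0, False), "C2": (0.05, 0.5, False),
    "C3": (0.10, 20.0, False), "C4": (0.10, 1.0, False),
    "C5": (0.0, 0.0, True),    "C6": (0.0, 0.0, True),
    "C7": (0.0, 0.0, True),    "C8": (0.0, 0.0, False),
}

IMPACT = {  # rows of Table 3, columns C1..C8
    "A1": [4, 100, 0, 0, 0, 1, 0, 3],
    "A2": [0.1, 3.6, 240, 16, 3, 1, 2, 1],
    "A3": [0.3, 4.6, 17, 16, 3, 1, 2, 1],
    "A4": [0.3, 4.6, 27, 16, 3, 2, 2, 1],
    "A5": [0.8, 20, 1.3, 0, 2, 1, 1, 2],
    "A6": [0.8, 20, 2.5, 0, 2, 2, 1, 2],
}

def compare(ga, gb, q0, q_ref, maximise):
    """Classify a pair on one criterion: 'aPb', 'bPa', 'aQb', 'bQa' or 'I'."""
    qa = max(q_ref, q0 * ga)          # q_i(g); p_i(g) = P0 * q_i(g)
    qb = max(q_ref, q0 * gb)
    if not maximise:                  # smaller is better: mirror values
        ga, gb = -ga, -gb
    if ga > gb + P0 * qb: return "aPb"    # strict preference
    if gb > ga + P0 * qa: return "bPa"
    if ga > gb + qb: return "aQb"         # weak preference
    if gb > ga + qa: return "bQa"
    return "I"                            # indifference

def rel_sets(a, b):
    """F = criteria where a P b; H = criteria where b (P or Q) a."""
    F, H = set(), set()
    for k, (name, (q0, q_ref, mx)) in enumerate(CRITERIA.items()):
        c = compare(IMPACT[a][k], IMPACT[b][k], q0, q_ref, mx)
        if c == "aPb":
            F.add(name)
        elif c in ("bPa", "bQa"):
            H.add(name)
    return F, H

def iota_star(F, H, iota):
    """Recursive extension of iota to subsets of criteria (MELCHIOR-like):
    can the favourable criteria F outweigh the unfavourable criteria H?"""
    if not F: return False            # iota*(empty, H) = 0
    if not H: return True             # iota*(F, empty) = 1
    for gm in F:
        for gp in H:
            if (gm, gp) in iota:      # gm covers gp ...
                rest = H - {gp}
                if not rest or iota_star(F - {gm}, rest, iota):
                    return True
        if len(F) > 1 and iota_star(F - {gm}, H, iota):
            return True               # ... or gm is simply unused
    return False

def R(a, b, iota=frozenset()):
    """Comprehensive preference: R(a, b) = iota*(F, H)."""
    F, H = rel_sets(a, b)
    return iota_star(F, H, iota)

print(R("A4", "A3"), R("A3", "A4"))   # True False  -> arrow A4 -> A3 (Fig. 3)
print(R("A4", "A6"), R("A6", "A4"))   # False False -> A4, A6 incomparable
```

Passing a non-empty `iota` (e.g. pairs stating that C2 dominates every other criterion) reproduces the kind of refinement discussed for Fig. 4.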

Figure 3. Comprehensive preferences without inter-criteria information.

Figure 4. Comprehensive preferences with inter-criteria information.

Naturally, results such as those presented above must be subjected to a detailed robustness analysis (Dias 2006), as they depend on the specific values chosen for the parameters used in the model, i.e. the discrimination thresholds. For instance, if q2^ref = 1 mSv, instead of 0.5 mSv as set initially (see Table 2), while the rest of the parameters remain at their initial values, both actions A4 and A3 would be globally preferred to action A2.


5 CONCLUSIONS

In this paper we have discussed the application of multi-criteria analysis for a case study dealing with the management of contaminated milk. Our main findings are summarised in Table 4.

Table 4. Results from the multi-criteria analysis.

Multi-criteria analysis   Stakeholder process results
Decision context          Learn about the different viewpoints.
Potential actions         New modelling tools connecting decision-support tools with existing databases.
Evaluation criteria       Set of evaluation criteria that can serve as a basis in future exercises. Better assessment of public acceptance (public opinion survey).
Preference aggregation    Revealed tendency towards qualitative inter-criteria information. Outranking methods highlighted as potentially useful for future studies.

One important conclusion is that consultation with concerned stakeholders is a key factor that can lead to a more pragmatic decision aid approach and, presumably, to an increased acceptance of the resulting models. The stakeholder process, involving decision-makers and decision advisers, as well as practitioners in the field (e.g. from the dairy industry or farmers' union), contributed to a better understanding of many aspects of the problem considered. For our case study, this process triggered further research in two directions: flexible tools for generating potential actions, and social research in the field of public acceptance of food chain countermeasures.
MCDA can thus be viewed as a bridge between various sciences—decision science, radiation protection, radioecological modelling and social science—and a useful tool in all emergency management phases.
The research presented here represents one step in an iterative cycle. Further feedback from exercises and workshops will contribute to improving the proposed methodology.


REFERENCES

Allen, P., Archangelskaya, G., Belayev, S., Demin, V., Drotz-Sjöberg, B.-M., Hedemann-Jensen, P., Morrey, M., Prilipko, V., Ramsaev, P., Rumyantseva, G., Savkin, M., Sharp, C. & Skryabin, A. 1996. Optimisation of health protection of the public following a major nuclear accident: interaction between radiation protection and social and psychological factors. Health Physics 71(5): 763–765.
Belton, V. & Stewart, T.J. 2002. Multiple Criteria Decision Analysis: An integrated approach. Kluwer: Dordrecht.
Carter, E. & French, S. 2005. Nuclear Emergency Management in Europe: A Review of Approaches to Decision Making. Proc. 2nd Int. Conf. on Information Systems for Crisis Response and Management, 18–20 April, Brussels, Belgium, ISBN 9076971099, pp. 247–259.
Dias, L.C. 2006. A note on the role of robustness analysis in decision aiding processes. Working paper, Institute of Systems Engineering and Computers INESC-Coimbra, Portugal. www.inesc.pt.

Dodgson, J., Spackman, M., Pearman, A. & Phillips, L. 2000. Multi-Criteria Analysis: A Manual. London: Department of the Environment, Transport and the Regions, www.communities.gov.uk.
Ehrhardt, J. & Weiss, A. 2000. RODOS: Decision Support for Off-Site Nuclear Emergency Management in Europe. EUR19144EN. European Community: Luxembourg.
French, S. 1996. Multi-attribute decision support in the event of a nuclear accident. J Multi-Crit Decis Anal 5: 39–57.
French, S., Kelly, G.N. & Morrey, M. 1992. Decision conferencing and the International Chernobyl Project. J Radiol Prot 12: 17–28.
Gallego, E., Brittain, J., Håkanson, L., Heling, R., Hofman, D. & Monte, L. 2000. MOIRA: A Computerised Decision Support System for the Restoration of Radionuclide Contaminated Freshwater Ecosystems. Proc. 10th Int. IRPA Congress, May 14–19, Hiroshima, Japan. (http://www2000.irpa.net/irpa10/cdrom/00393.pdf).
Geldermann, J., Treitz, M., Bertsch, V. & Rentz, O. 2005. Moderated Decision Support and Countermeasure Planning for Off-site Emergency Management. In: Loulou, R., Waaub, J.-P. & Zaccour, G. (eds) Energy and Environment: Modelling and Analysis. Kluwer: Dordrecht, pp. 63–81.
Geldermann, J., Bertsch, V., Treitz, M., French, S., Papamichail, K.N. & Hämäläinen, R.P. 2006. Multi-criteria decision-support and evaluation of strategies for nuclear remediation management. OMEGA—The International Journal of Management Science (in press). Also downloadable at www.sal.hut.fi/Publications/pdf-files/MGEL05a.doc.
Hämäläinen, R.P., Sinkko, K., Lindstedt, M.R.K., Amman, M. & Salo, A. 1998. RODOS and decision conferencing on early phase protective actions in Finland. STUK-A 159 Report, STUK—Radiation and Nuclear Safety Authority, Helsinki, Finland, ISBN 951-712-238-7.
Hämäläinen, R.P., Lindstedt, M.R.K. & Sinkko, K. 2000a. Multi-attribute risk analysis in nuclear emergency management. Risk Analysis 20(4): 455–468.
Hämäläinen, R.P., Sinkko, K., Lindstedt, M.R.K., Amman, M. & Salo, A. 2000b. Decision analysis interviews on protective actions in Finland supported by the RODOS system. STUK-A 173 Report, STUK—Radiation and Nuclear Safety Authority, Helsinki, Finland, ISBN 951-712-361-2.
Howard, B.J., Beresford, N.A., Nisbet, A., Cox, G., Oughton, D.H., Hunt, J., Alvarez, B., Andersson, K.G., Liland, A. & Voigt, G. 2005. The STRATEGY project: decision tools to aid sustainable restoration and long-term management of contaminated agricultural ecosystems. J Env Rad 83: 275–295.
IAEA. 2006. Environmental consequences of the Chernobyl accident and their remediation: twenty years of experience. Report of the Chernobyl Forum Expert Group 'Environment'. STI/PUB/1239, International Atomic Energy Agency. Vienna: Austria.
Kahneman, D. & Tversky, A. 1979. Prospect Theory: An Analysis of Decision under Risk. Econometrica 47(2): 263–291.
Kottemann, J.E. & Davis, F.D. 1991. Decisional conflict and user acceptance of multi-criteria decision-making aids. Dec Sci 22(4): 918–927.
Leclercq, J.P. 1984. Propositions d'extension de la notion de dominance en présence de relations d'ordre sur les pseudo-critères: MELCHIOR. Revue Belge de Recherche Opérationnelle, de Statistique et d'Informatique 24(1): 32–46.
Mustajoki, J., Hämäläinen, R.P. & Sinkko, K. 2007. Interactive computer support in decision conferencing: Two cases on off-site nuclear emergency management. Decision Support Systems 42: 2247–2260.
Nisbet, A.F., Mercer, J.A., Rantavaara, A., Hanninen, R., Vandecasteele, C., Carlé, B., Hardeman, F., Ioannides, K.G., Papachristodoulou, C., Tzialla, C., Ollagnon, H., Jullien, T. & Pupin, V. 2005. Achievements, difficulties and future challenges for the FARMING network. J Env Rad 83: 263–274.
Panov, A.V., Fesenko, S.V. & Alexakhin, R.M. 2006. Methodology for assessing the effectiveness of countermeasures in rural settlements in the long term after the Chernobyl accident on the multi-attribute analysis basis. Proc. 2nd Eur. IRPA Congress, Paris, 15–19 May, France. www.irpa2006europe.com.
Papamichail, K.N. & French, S. 2000. Decision support in nuclear emergencies. J. Hazard. Mater. 71: 321–342.
Perny, P. & Vanderpooten, D. 1998. An interactive multiobjective procedure for selecting medium-term countermeasures after nuclear accidents. J. Multi-Crit. Decis. Anal. 7: 48–60.
Roy, B. 1996. Multicriteria Methodology for Decision Aiding. Kluwer: Dordrecht.
Roy, B. & Mousseau, V. 1996. A theoretical framework for analysing the notion of relative importance of criteria. J Multi-Crit Decis Anal 5: 145–159.
Royal Decree, 2003. Plan d'Urgence Nucléaire et Radiologique pour le Territoire Belge. Moniteur Belge, 20.11.2003.
Turcanu, C. 2007. Multi-criteria decision aiding model for the evaluation of agricultural countermeasures after an accidental release of radionuclides to the environment. PhD Thesis. Université Libre de Bruxelles: Belgium.
Turcanu, C., Carlé, B. & Hardeman, F. 2006. Agricultural countermeasures in nuclear emergency management: a stakeholders' survey for multi-criteria model development. J Oper Res Soc. DOI 10.1057/palgrave.jors.2602337.
Turcanu, C., Carlé, B., Hardeman, F., Bombaerts, G. & Van Aeken, K. 2007. Food safety and acceptance of management options after radiological contaminations of the food chain. Food Qual Pref 18(8): 1085–1095.
Vandecasteele, C., Hardeman, F., Pauwels, O., Bernaerts, M., Carlé, B. & Sombré, L. 2005. Attitude of a group of Belgian stakeholders towards proposed agricultural countermeasures after a radioactive contamination: synthesis of the discussions within the Belgian EC-FARMING group. J. Env. Rad. 83: 319–332.
Vincke, P. 1992. Multicriteria decision aid. John Wiley & Sons, Chichester.
Zeevaert, T., Bousher, A., Brendler, V., Hedemann Jensen, P. & Nordlinder, S. 2001. Evaluation and ranking of restoration strategies for radioactively contaminated sites. J. Env. Rad. 56: 33–50.

Decision support systems and software tools for safety and reliability

Complex, expert based multi-role assessment system for small


and medium enterprises

S.G. Kovacs & M. Costescu


INCDPM ‘‘Alexandru Darabont’’, Bucharest, Romania

ABSTRACT: The paper presents a complex assessment system developed for Small and Medium Enterprises (SMEs). It assesses quality, safety and environment on three layers, ranging from basic to complete assessment (including management and organizational culture). It presents the most interesting attributes of this system, together with some of the results obtained by testing it on a statistical sample of 250 Romanian SMEs.

1 GENERAL ASPECTS

In Romania, Small and Medium Enterprises (SMEs) have been, for both objective and subjective reasons, the Cinderella of safety activities. Complex safety assessment procedures, like HAZOP, are too complicated for SMEs. On the other hand, simpler assessment systems are considered too biased. Moreover, SMEs need to assess not just safety, but quality and environment as well.
The paper presents a three-layered multi-role assessment system, based on expert decision-assisting structures (Kovacs 2003b), developed especially for SMEs and known as MRAS.
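The layered organisation just introduced can be sketched as a small data model. This is purely illustrative: the layer names and the mapping of assessed domains to layers are our assumptions, since the text only states that the three levels range from basic to extra and that the third level adds a dual safety-quality assessment and a safety culture analysis:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layer:
    name: str
    assesses: tuple            # assumed domain coverage per layer
    culture_analysis: bool = False

# Assumed mapping; only the third level's extras are stated in the text.
MRAS_LAYERS = (
    Layer("basic", ("safety",)),
    Layer("medium", ("safety", "quality", "environment")),
    Layer("extra", ("safety", "quality", "environment"), True),
)

def layers_assessing(domain):
    """Names of the layers that cover a given assessment domain."""
    return [layer.name for layer in MRAS_LAYERS if domain in layer.assesses]

print(layers_assessing("quality"))  # ['medium', 'extra']
```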
The system is conceived on three levels of complexity, from basic to extra. In this way, all the requests regarding safety assessment are satisfied. Moreover, on the third level, a dual safety-quality assessment is performed, together with a safety culture analysis.
The system is integrated into the Integrated Safety Management Unit (ISMU), which is the management center of all the safety solutions for an SME. In this respect it optimises the informational flows, assuring an efficient usage of all these tools in order to prevent risks or mitigate them.
The ISMU schema is presented in Figure 1. The general structure of the system is shown in Figure 2.

Figure 1. ISMU structure.

Figure 2. MRAS structure.


2 VULNERABILITY ANALYSIS

In our assessment system, vulnerability analysis is the first step towards understanding how exposed SMEs are to operational (safety) threats that could lead to loss and accidents (Kovacs 2004). Vulnerability analysis could be defined as a process that defines, identifies and classifies the vulnerabilities (security holes)

in an enterprise infrastructure. In adition vulnerabil- Table 1. Vulnerability scores.
ity analysis can forecast the effectiveness of proposed
prevention measures and evaluate their actual effec- Mark Semnification
tiveness after they are put into use. Our vulnerability
0 Non-vulnerable
analysis is performed in the basic safety assessment
1 Minimal
layer and consist mainly on the next steps: 2 Medium
1. Definition and analysis of the existing (and avail- 3 Severe vulnerability-loss
able) human and material resources;
2. Assignment of relative levels of importance to the resources;
3. Identification of potential safety threats for each resource;
4. Identification of the potential impact of the threat on the specific resource (to be used later in scenario analysis);
5. Development of a strategy for solving the threats hierarchically, from the most important to the least important;
6. Definition of ways to minimise the consequences if a threat is acting (TechDir 2006).

We have considered a mirrored vulnerability (Kovacs 2006b): the analysis of the vulnerability is oriented both inwards and outwards of the assessed workplace. Figure 3 shows this aspect.

Figure 3. Vulnerability analysis schema.

The main attributes for which vulnerability is assessed are:

1. the human operator;
2. the operation;
3. the process.

The mirrored vulnerability analysis follows not just how vulnerable these three elements are, but also how much they affect the vulnerability of the whole SME.

Vulnerability is estimated in our system on a 0 to 5 scale, as in Table 1.

Table 1. Vulnerability scale (extract).
4 | Severe vulnerability: accidents
5 | Extreme vulnerability

We also perform a cross-over analysis taking into account the incident statistics of the past five years. The Operational Vulnerability is thus computed as the Assessed Vulnerability multiplied by a Past Incident Coefficient:

Ov = Av * Pik    (1)

where:

Pik = 0 if there were no previous incidents;
Pik = 1.5 if there were near misses, loss incidents and/or minor accidents;
Pik = 2 if there were severe accidents before.

3 PRE-AUDIT

The pre-audit phase of the basic safety assessment plays many roles (Mahinda 2006). The most important of these are connected with the need to perform regularly a preliminary safety assessment (Kovacs 2001a) which:

– identifies the most serious hazards;
– maps the weak spots of the SME, where these hazards could act more often and more severely;
– estimates the impact of the hazards' action upon man and property;
– verifies basic conformity with safety law and basic safety provisions;
– assures consistency with the subsequent assessments and recordings, and also the management of previous assessments.

Pre-audit is centered around two attributes of the workplace:

Pre-audit of the human operator (Kovacs 2001b): its findings lead to two main measures, the improvement of training or a change of workplace for an operator whose performance could endanger safety at the workplace.

Pre-audit of machines and the technological process: the findings of this part of the safety pre-audit could lead to machine/process improvement, to machine/process revision, to improved machine maintenance or, if necessary, to the elimination of the machine (if the machine is beyond repair).
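The cross-over analysis of Eq. (1), Ov = Av * Pik, can be sketched in code as follows. This is only an illustration: the class, method and enum names are ours, while the coefficient values (0, 1.5, 2) and the 0 to 5 assessed-vulnerability scale are those given above.

```java
// Sketch of the Operational Vulnerability computation Ov = Av * Pik (Eq. 1).
// Names are illustrative; only the formula and coefficients come from the text.
public class OperationalVulnerability {

    // Incident history over the past five years, used to pick Pik.
    enum IncidentHistory { NONE, MINOR, SEVERE }

    // Pik = 0 for no previous incidents, 1.5 for near misses / loss incidents /
    // minor accidents, 2 for previous severe accidents.
    static double pastIncidentCoefficient(IncidentHistory h) {
        switch (h) {
            case NONE:  return 0.0;
            case MINOR: return 1.5;
            default:    return 2.0;
        }
    }

    // Ov = Av * Pik, with Av the assessed vulnerability on the 0..5 scale.
    static double operationalVulnerability(double assessed, IncidentHistory h) {
        return assessed * pastIncidentCoefficient(h);
    }

    public static void main(String[] args) {
        // A workplace assessed at vulnerability 3 with past minor incidents.
        System.out.println(operationalVulnerability(3.0, IncidentHistory.MINOR)); // 4.5
    }
}
```

Note that with no incident history the operational vulnerability collapses to 0, which is exactly what Pik = 0 in the text implies.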
The pre-audit assessment system uses the rating presented in Table 2. The main pre-audit guideline is represented by ISO 9001. The main pre-audit instruments are checklists; a very short example of such a checklist is presented in Table 3.

Table 2. Pre-audit scores.

Pre-audit score | Minimum corresponding level of safety | Pre-audit ranges | Level of transition to move to next level
0 | Nothing in place | 0–1 | Identification of basic safety needs, some activities performed
1 | Demonstrates a basic knowledge and willingness to implement safety policies/procedures/guidelines | 1–2 | Local safety procedures developed and implemented
2 | Responsibility and accountability identified for most safety related tasks | 2–3 | Responsibility documented and communicated for the main safety tasks
3 | OHS policies/procedures and guidelines implemented | 3–4 | Significant level of implementation through audit site
4 | Comprehensive level of implementation across audit site | 4–5 | Complete implementation across audit site
5 | Pre-audit results used to review and improve the safety system; demonstrates willingness for continual improvement | – | Industry best practice

4 SAFETY SCENARIO ANALYSIS

As SMEs often perform routine activities, it is possible to develop and use safety scenarios as a safety improvement tool, in order to be aware of what could happen if risks act (TUV 2008). A schema of our scenario analysis module is given in Figure 4.

The previous safety assessment data is first used to build a scenario plan. For example, in a specific area of the SME, frequent near misses were recorded in a very short period of time. The previous assessment data is used to estimate the trend of components and their attributes. One such component is the Human Operator
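Table 2 can be read as a simple lookup from the current pre-audit score to the guideline for moving to the next level. The sketch below encodes that mapping; the class and method names are ours, not the paper's.

```java
// Hypothetical lookup over Table 2: pre-audit score (0..5) -> transition
// guideline toward the next level. The strings paraphrase the table rows.
public class PreAuditRating {

    static final String[] TRANSITION = {
        "identification of basic safety needs, some activities performed", // 0-1
        "local safety procedures developed and implemented",               // 1-2
        "responsibility documented and communicated for the main safety tasks", // 2-3
        "significant level of implementation through audit site",          // 3-4
        "complete implementation across audit site",                       // 4-5
        "industry best practice"                                           // 5
    };

    static String nextLevelGuideline(int score) {
        if (score < 0 || score > 5) {
            throw new IllegalArgumentException("pre-audit score must be 0..5");
        }
        return TRANSITION[score];
    }

    public static void main(String[] args) {
        // An SME currently at score 2 should document and communicate
        // responsibility for the main safety tasks to reach level 3.
        System.out.println(nextLevelGuideline(2));
    }
}
```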
Table 3. Pre-audit checklist example.

Please estimate the truth of the following affirmations, taking into account the actual situation at the workplace.

Nr. | Question | Yes | No
1 | We have a comprehensive ''Health and Safety Policy'' manual plan in place | |
2 | All our employees are familiar with local and national safety regulations and we abide by them | |
3 | We have developed a comprehensive safety inspection process | |
4 | We perform this safety inspection process regularly, taking into account all the necessary details (our inspection is real, not merely formal) | |

Figure 4. Safety scenario module.
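The paper does not state how the checklist answers are aggregated into a score. Purely as a hypothetical illustration, the sketch below computes the share of ''yes'' answers to the Table 3 items, which an assessor might use as a rough conformity indicator.

```java
// Hypothetical aggregation of yes/no checklist answers; the paper gives the
// checklist but no scoring rule, so this fraction is our own illustration.
public class ChecklistSketch {

    // Fraction of affirmations estimated as true at the workplace.
    static double yesShare(boolean[] answers) {
        int yes = 0;
        for (boolean a : answers) {
            if (a) yes++;
        }
        return (double) yes / answers.length;
    }

    public static void main(String[] args) {
        // Answers to the four example questions: policy in place, regulations
        // known, inspection process developed, inspection performed regularly.
        boolean[] answers = {true, true, false, false};
        System.out.println(yesShare(answers)); // 0.5
    }
}
```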
and one of its attributes would be the preparedness in emergency situations. In our example we could consider the sudden outbreak of a fire at one of the control panels at the workplace. If the preparedness-in-emergency-situations attribute is set on in the scenario, the operator is able to put out the fire and no other loss occurs. If the attribute is set off, the operator is frightened and runs away; the fire propagates and there is an accidental spill, because some valves got out of control.

Our scenario analysis instrument follows the rule of the 3 P's (Kovacs 2006b):

• Plan:
  ◦ Establish goals and objectives to follow;
  ◦ Develop procedures to follow;
• Predict:
  ◦ Predict specific risk action;
  ◦ Predict component failure;
  ◦ Predict soft spots where risks could materialize more often;
• Prevent:
  ◦ Foresee the necessary prevention measures; some of these measures could be taken immediately; others can be postponed to a more favourable period (for example the acquisition of an expensive prevention means), while others are simply improvements of the existing situation (for example a training that should include all the workers at the workplace, not just the supervisors).

The actual development of the scenario is performed in a best case, medium case, worst case framework. This framework can be seen in Figure 5.

Figure 5. Worst case-best case scenario framework.

Mainly, three essential components are considered:

• The Human Operator;
• The Machine(s), including facilities;
• The Working Environment;

together with a derived component which includes the specific work task and also the interaction between components.

All these components, and their interaction, are analysed in the best case-worst case framework, considering also an intermediary state, the medium case. The medium case is (if not mentioned otherwise) the actual situation in the SME. One point of attention is the transition between the cases, in order to analyse what happens if, for example, safety resources allocation is postponed in order to allocate resources to hotter places.

5 OBTAINED RESULTS

We have tested our system on a statistically significant lot of 250 Romanian SMEs (Kovacs 2006a) from all the economic activities, over a half-year period, taking into account the following assessments:

• General management assessment;
• Safety management assessment;
• Incident rate after implementation (compared with the incidents of a 5 year period);
• Assessment of the ground floor work teams.

The SME test showed that such a system, which is not very complicated (so that it can be used by SMEs alone), is a very efficient tool to:

– assess safety more objectively, together with quality and environment;
– perform an efficient assessment of the SME management regarding quality, safety and environment;
– offer a realistic image to the SME management, in order to convince it of the necessity of improvement, not just for safety but also for the environment, as quality improvement is a must for every SME in order to remain in the market;
– show the SME manager that quality must not be considered alone but only in connection with safety and environment;
– offer a valuable assessment instrument for external assessment firms, inspection authorities and risk assurance companies (Kovacs 2003a).

Figure 6 shows the vulnerability analysis results for the statistical lot. It is possible to see that, taking into account vulnerability alone, the majority of the statistical lot has a vulnerability over 3, showing a severe vulnerability at risk action.
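The on/off attribute mechanics of the best/medium/worst-case framework described above can be sketched as follows. The class, method and attribute names are invented; the two attributes (operator preparedness, valve control) come from the control-panel fire example.

```java
// Minimal sketch of the best/medium/worst-case scenario framework; all names
// are ours. Attributes are switched on/off and the case is derived from them.
public class ScenarioAnalysis {

    enum Case { BEST, MEDIUM, WORST }

    // Best case: the operator is prepared and the valves stay under control.
    // Worst case: neither holds (fire propagates, accidental spill).
    // The medium case here stands for the intermediary, actual situation.
    static Case evaluate(boolean operatorPrepared, boolean valvesUnderControl) {
        if (operatorPrepared && valvesUnderControl) {
            return Case.BEST;
        }
        if (!operatorPrepared && !valvesUnderControl) {
            return Case.WORST;
        }
        return Case.MEDIUM;
    }

    public static void main(String[] args) {
        // The fire example with the preparedness attribute set off: the operator
        // runs away and some valves get out of control.
        System.out.println(evaluate(false, false)); // WORST
    }
}
```

Transitions between cases can then be analysed by toggling one attribute at a time, mirroring the "what happens if a safety resource allocation is postponed" question in the text.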
Figure 6. Vulnerability analysis results (share of the statistical lot per vulnerability level: 0: 14%; 1: 14%; 2: 24%; 3: 20%; 4: 16%; 5: 12%).

6 CONCLUSIONS

MRAS (Multi-Role Assessment System) was developed initially as a self-audit tool which could give SMEs an objective and realistic image regarding their efforts to assure the continuous improvement of quality and to maintain a decent standard of safety and environment protection. However, during the development period we have seen serious interest from control bodies (like the Work Inspection), so we have developed the system so that it can be used either for self-audit or by an external auditor. The control teams are interested in having a quick referential in order to check quickly and optimally the safety state inside an SME; in this respect our system proved optimal.

Romanian SMEs are progressing towards full integration into the European market. In this respect they need to abide by the EU provisions, especially regarding safety and environment. Considering this, the system assures not just full conformity with European laws but also an interactive forecast-plan-act-improve instrument. The compliance audits included in the system, together with the management and culture audits, are opening Romanian SMEs to the European Union world.

REFERENCES

Kovacs S, Human operator assessment: basis for a safe workplace in the Process Industry, in Proceedings of the Hazards XVI Symposium: Analysing the past, planning the future, Manchester, 6–8 November 2001, Symposium Series No. 148, ISBN 0-85295-441-7, pp. 819–833.
Kovacs S, Safety as a need, environment protection as a deed, in Proceedings of the Hazards XVI Symposium: Analysing the past, planning the future, Manchester, 6–8 November 2001, Symposium Series No. 148, ISBN 0-85295-441-7, pp. 867–881.
Kovacs S, Major risk avoidance, fulfilling our responsibilities: costs and benefits of Romania's safety integration into the European Union, in Proceedings of the Hazards XVII Symposium: Fulfilling our responsibilities, Manchester, 25–27 March 2003, Symposium Series No. 149, ISBN 0-85295-459-X, pp. 619–633.
Kovacs S, Developing best practice safety procedures through IT systems, in Proceedings of the Hazards XVII Symposium: Fulfilling our responsibilities, Manchester, 25–27 March 2003, Symposium Series No. 149, ISBN 0-85295-459-X, pp. 793–807.
Kovacs S, To do no harm: measuring safety performance in a transition economy, in Proceedings of the Hazards XVIII Symposium: Process safety, sharing best practice, Manchester, 23–25 November 2004, Symposium Series No. 150, ISBN 0-85295-460-3, pp. 151–171.
Kovacs S, Process safety trends in a transition economy: a case study, in Proceedings of the Hazards XIX Symposium: Process Safety and Environmental Protection, Manchester, 25–27 March 2006, Symposium Series No. 151, ISBN 0-85295-492-8, pp. 152–166.
Kovacs S, Meme based cognitive models regarding risk and loss in small and medium enterprises, in Proceedings of the Seventh International Conference on Cognitive Modelling, ICCM06, Trieste, Edizioni Goliardiche, ISBN 88-7873-031-9, pp. 377–379.
Mahinda Seneviratne and Wai On Phoon, Exposure assessment in SMEs: a low cost approach to bring OHS services to SMEs, Industrial Health 2006, 44, pp. 27–30.
TechDir 2006, SMP 12 Safety Case and Safety Case Report.
TUV Rheinland: SEM 28 Safety Assessment service.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5
DETECT: A novel framework for the detection of attacks to critical infrastructures

F. Flammini
ANSALDO STS, Ansaldo Segnalamento Ferroviario S.p.A., Naples, Italy
Università di Napoli ''Federico II'', Dipartimento di Informatica e Sistemistica, Naples, Italy

A. Gaglione & N. Mazzocca
Università di Napoli ''Federico II'', Dipartimento di Informatica e Sistemistica, Naples, Italy

C. Pragliola
ANSALDO STS, Ansaldo Segnalamento Ferroviario S.p.A., Naples, Italy
ABSTRACT: Critical Infrastructure Protection (CIP) against potential threats has become a major issue in modern society. CIP involves a set of multidisciplinary activities and requires the adoption of proper protection mechanisms, usually supervised by centralized monitoring systems. This paper presents the motivation, the working principles and the software architecture of DETECT (DEcision Triggering Event Composer & Tracker), a new framework aimed at the automatic and early detection of threats against critical infrastructures. The framework is based on the fact that non-trivial attack scenarios are made up of a set of basic steps which have to be executed in a predictable sequence (with possible variants). Such scenarios are identified during Vulnerability Assessment, which is a fundamental phase of the Risk Analysis for critical infrastructures. DETECT operates by performing a model-based logical, spatial and temporal correlation of the basic events detected by the sensorial subsystem (possibly including intelligent video-surveillance, wireless sensor networks, etc.). In order to achieve this aim, DETECT is based on a detection engine which is able to reason about heterogeneous data, implementing a centralized application of ''data fusion''. The framework can be interfaced with or integrated in existing monitoring systems as a decision support tool, or even used to automatically trigger adequate countermeasures.
1 INTRODUCTION & BACKGROUND

Critical Infrastructure Protection (CIP) against terrorism and any form of criminality has become a major issue in modern society. CIP involves a set of multidisciplinary activities, including Risk Assessment and Management, together with the adoption of proper protection mechanisms, usually supervised by specifically designed Security Management Systems (SMS) (see e.g. LENEL 2008); in some cases, these are integrated in the traditional SCADA (Supervisory Control & Data Acquisition) systems.

Among the best ways to prevent attacks and disruptions is to stop any perpetrators before they strike. This paper presents the motivation, the working principles and the software architecture of DETECT (DEcision Triggering Event Composer & Tracker), a new framework aimed at the automatic detection of threats against critical infrastructures, possibly before they evolve to disastrous consequences. In fact, non-trivial attack scenarios are made up of a set of basic steps which have to be executed in a predictable sequence (with possible variants). Such scenarios must be precisely identified during Vulnerability Assessment, which is a fundamental aspect of Risk Analysis for critical infrastructures (Lewis 2006). DETECT operates by performing a model-based logical, spatial and temporal correlation of the basic events detected by intelligent video-surveillance and/or sensor networks, in order to ''sniff'' sequences of events which indicate (as early as possible) the likelihood of threats. In order to achieve this aim, DETECT is based on a detection engine which is able to reason about heterogeneous data, implementing a centralized application of ''data fusion'' (a well-known concept in the research field of cognitive/intelligent autonomous systems (Tzafestas 1999)). The framework can be interfaced with or integrated in existing SMS/SCADA systems in order to automatically trigger adequate countermeasures.
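The core idea above, that a non-trivial attack scenario is a predictable sequence of basic steps, can be illustrated with a deliberately simplified sketch: confirm scenario steps, in order, against a stream of detected events. All event names and the matching rule are our own inventions; DETECT's actual engine uses EDL-based detection models, as described later in the paper.

```java
import java.util.List;

// Toy in-order matcher for an attack scenario seen as a sequence of basic
// steps; unrelated events may be interleaved in the detected stream.
public class ScenarioMatcher {

    // Returns how many scenario steps have been confirmed so far, scanning
    // the detected event stream in order.
    static int confirmedSteps(List<String> steps, List<String> detected) {
        int next = 0;
        for (String ev : detected) {
            if (next < steps.size() && ev.equals(steps.get(next))) {
                next++;    // one more scenario step confirmed
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Hypothetical subway scenario: intrusion into the tunnel, then an
        // object dropped on the tracks (event names are invented).
        List<String> scenario = List.of("tunnel-intrusion", "object-dropped");
        List<String> stream = List.of("gate-opened", "tunnel-intrusion",
                                      "camera-obscured", "object-dropped");
        int done = confirmedSteps(scenario, stream);
        // A linear progress indicator over the scenario, akin to the alarm
        // level described below for deterministic detection.
        System.out.println(done + "/" + scenario.size()); // 2/2
    }
}
```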
With respect to traditional approaches of infrastructure surveillance, DETECT allows for:

• A quick, focused and fully automatic response to emergencies, possibly independent from human supervision and intervention (though manual confirmation of detected alarms remains an option). In fact, human management of critical situations, possibly involving many simultaneous events, is a very delicate task, which can be error prone as well as subject to forced inhibition.
• An early warning of complex attack scenarios from their first evolution steps, using the knowledge base provided by experts during the qualitative risk analysis process. This allows for preventive reactions which are very unlikely to be performed by human operators, given the limitation both in their knowledge base and in their vigilance level. Therefore, a greater situational awareness can be achieved.
• An increase in the Probability Of Detection (POD) while minimizing the False Alarm Rate (FAR), due to the possibility of logical as well as temporal correlation of events. While some SMS/SCADA software offers basic forms of logical correlation of alarms, to the best of our knowledge temporal correlation is not implemented in any of today's systems (though some vendors provide basic options of on-site configurable ''sequence'' correlation embedded in their multi-technology sensors).

The output of DETECT consists of:

• The identifier(s) of the detected/suspected scenario(s).
• An alarm level, associated with the scenario evolution (only used in deterministic detection as a linear progress indicator; otherwise, it can be set to 100%).
• A likelihood of attack, expressed in terms of probability (only used as a threshold in heuristic detection; otherwise, it can be set to 100%).

DETECT can be used as an on-line decision support system, by alerting SMS operators in advance about the likelihood and nature of the threat, as well as an autonomous reasoning engine, by automatically activating responsive actions, including audio and visual alarms, emergency calls to first responders, air conditioning flow inversion, activation of sprinklers, etc.

The main application domain of DETECT is homeland security, but its architecture is suited to other application fields, like environmental monitoring and control, as well. The framework will be tested in railway transportation systems, which have been demonstrated by the recent terrorist strikes to be among the most attractive and vulnerable targets. Example attack scenarios include intrusion and drop of explosive in subway tunnels, spread of chemical or radiological material in underground stations, combined attacks with simultaneous multiple train halting and railway bridge bombing, etc. DETECT has proven to be particularly suited for the detection of such articulated scenarios using a modern SMS infrastructure based on an extended network of cameras and sensing devices. With regard to the underlying security infrastructure, a set of interesting technological and research issues can also be addressed, ranging from object tracking algorithms to wireless sensor network integration; however, these (mainly application specific) aspects are not in the scope of this work.

DETECT is a collaborative project carried out by the Business Innovation Unit of Ansaldo STS Italy and the Department of Computer and System Science of the University of Naples ''Federico II''.

The paper is organized as follows: Section 2 presents a brief summary of related works; Section 3 introduces the reference software architecture of the framework; Section 4 presents the language used to describe the composite events; Section 5 describes the implementation of the model-based detection engine; Section 6 contains a simple case-study application; Section 7 draws conclusions and provides some hints about future developments.

2 RELATED WORKS

Composite event detection plays an important role in the active database research community, which has long been investigating the application of the Event Condition Action (ECA) paradigm through the use of triggers, generally associated with update, insert or delete operations. An event algebra was first defined in the HiPAC active database project (Dayal et al. 1988).

Our approach to composite event detection follows the semantics of the Snoop event algebra (Chakravarthy & Mishra 1994). Snoop has been developed at the University of Florida and its concepts have been implemented in a prototype called Sentinel (Chakravarthy et al. 1994, Krishnaprasad 1994). Event trees are used for each composite event and these are merged to form an event graph for detecting a set of composite events. An important aspect of this work lies in the notion of parameter contexts, which augment the semantics of composite events for computing their parameters (parameters indicate ''component events''). CEDMOS (Cassandra et al. 1999) refers to the Snoop model in order to encompass the heterogeneity problems which often appear under the heading of sensor fusion. In (Alferes & Tagni 2006), the implementation of an event detection engine is presented that detects composite events specified by expressions of an illustrative sublanguage of the Snoop event algebra. The engine has been implemented as a Web Service, so it can also be
used by other services and frameworks, provided that the markup for the communication of results is respected.

Different approaches to composite event detection are taken in Ode (Gehani et al. 1992a, b) and Samos (Gatziu et al. 1994, Gatziu et al. 2003). Ode uses extended Finite Automata for composite event detection, while Samos defines a mechanism based on Petri Nets for the modeling and detection of composite events for an Object Oriented Data-Base Management System (OODBMS).

DETECT transfers to physical security the concept of the Intrusion Detection System (IDS), which is nowadays widespread in computer (or ''logical'') security, also borrowing the principles of Misuse Detection, which is applied when an attack pattern is known a priori, and Anomaly Detection, indicating the possibility of detecting unknown attacks by observing a significant statistical deviation from normality (Jones & Sielken 2000). The latter aspect is strictly related to the field of Artificial Intelligence and related classification methods.

Intelligent video-surveillance exploits Artificial Vision algorithms in order to automatically track object movements in the scene, detecting several types of events, including virtual line crossing, unattended objects, aggressions, etc. (Remagnino et al. 2007). Sensing devices include microwave/infrared/ultrasound volumetric detectors/barriers, magnetic detectors, vibration detectors, explosive detectors, and advanced Nuclear Bacteriologic Chemical Radiological (NBCR) sensors (Garcia 2001). They can be connected using both wired and wireless networks, including ad-hoc Wireless Sensor Networks (WSN) (Lewis 2004, Roman et al. 2007).

3 THE SOFTWARE ARCHITECTURE

The framework is made up of the following main modules (see Figure 1):

• Event History database, containing the list of basic events detected by sensors or cameras, tagged with a set of relevant attributes including detection time, event type, sensor id, sensor type, sensor group, object id, etc. (some of which can be optional, e.g. ''object id'' is only needed when video-surveillance supports inter-camera object tracking).
• Attack Scenario Repository, providing a database of known attack scenarios as predicted in Risk Analysis sessions and expressed by means of an Event Description Language (EDL) including logical as well as temporal operators (derived from Chakravarthy et al. 1994).
• Detection Engine, supporting both deterministic (e.g. Event Trees, Event Graphs) and heuristic (e.g. Artificial Neural Networks, Bayesian Networks) models, sharing the primary requirement of real-time solvability (which excludes e.g. Petri Nets from the list of candidate formalisms).
• Model Generator, which has the aim of building the detection model(s) (structure and parameters) starting from the Attack Scenario Repository by parsing all the EDL files.
• Model Manager, constituted by four sub-modules (grey-shaded boxes in Figure 1):
  ◦ Model Feeder (one for each model), which instantiates the inputs of the detection engine according to the nature of the models, by cyclically performing proper queries and data filtering on the Event History (e.g. selecting sensor typologies and zones, excluding temporally distant events, etc.).
  ◦ Model Executor (one for each model), which triggers the execution of the model, once it has been instantiated, by activating the related (external) solver. An execution is usually needed at each new event detection.
  ◦ Model Updater (one for each model), which is used for on-line modification of the model
Figure 1. The software architecture of DETECT.
(e.g. update of a threshold parameter), without regenerating the whole model (whenever supported by the modeling formalism).
  ◦ Output Manager (single), which stores the output of the model(s) and/or passes it to the interface modules.
• Model Solver, that is, the existing or specifically developed tool used to execute the model.

Model Generator and Model Manager are dependent on the formalisms used to express the models constituting the Detection Engine. In particular, the Model Generator and Model Feeder are synergic in implementing the detection of the events specified in the EDL files: in fact, while the Detection Engine undoubtedly plays a central role in the framework, many important aspects are delegated to the way the query on the database is performed (i.e. the selection of proper events). As an example, in case the Detection Engine is based on Event Trees (a combinatorial formalism), the Model Feeder should be able to pick the set of the last N consecutive events fulfilling some temporal properties (e.g. total time elapsed since the first event of the sequence < T), as defined in the EDL file. In case of Event Graphs (a state-based formalism), instead, the model must be fed a single event at a time.

Besides these main modules, there are others which are also needed to complete the framework with useful, though not always essential, features (some of which can also be implemented by external tools or in the SMS):

• Scenario GUI (Graphical User Interface), used to draw attack scenarios using an intuitive formalism and a user-friendly interface (e.g. specifically tagged UML Sequence Diagrams stored in the standard XMI (XML Metadata Interchange) format (Object Management Group UML 2008)).
• EDL File Generator, translating GUI output into EDL files.
• Event Log, in which to store information about composite events, including detection time, scenario type, alarm level and likelihood of attack (whenever applicable).
• Countermeasure Repository, associating to each detected event or event class a set of operations to be automatically performed by the SMS.
• Specific drivers and adapters needed to interface external software modules, possibly including anti-intrusion and video-surveillance subsystems.
• Standard communication protocols (OPC (OLE for Process Communication), ODBC (Open Data-Base Connectivity), Web-Services, etc.) needed to interoperate with open databases, SMS/SCADA, or any other client/server security subsystems which are compliant with such standards.

The last two points are necessary to provide DETECT with an open, customizable and easily upgradeable architecture. For instance, by adopting a standard communication protocol like OPC, an existing SMS supporting this protocol could integrate DETECT as if it were just a further sensing device.

At the current development state of DETECT:

• A GUI has been developed to edit scenarios and generate EDL files starting from the Event Tree graphical formalism.
• A Detection Engine based on Event Graphs (Buss 1996) is already available and fully working, using a specifically developed Model Solver.
• A Model Generator has been developed in order to generate Event Graphs starting from the EDL files in the Scenario Repository.
• A Web Services based interface has been developed to interoperate with external SMS.
• The issues related to the use of ANN (Jain et al. 1996) for heuristic detection have been addressed, and the related modules are under development and experimentation.

4 THE EVENT DESCRIPTION LANGUAGE

The Detection Engine needs to recognize combinations of events, bound to each other with appropriate operators in order to form composite events of any complexity. Generally speaking, an event is a happening that occurs in the system, at some location and at some point in time. In our context, events are related to sensor data variables (i.e. variable x greater than a fixed threshold, variable y in a fixed range, etc.). Events are classified as primitive events and composite events.

A primitive event is a condition on a specific sensor which is associated with some parameters (i.e. event identifier, time of occurrence, etc.). Event parameters can be used in the evaluation of conditions. Each entry stored in the Event History is a quadruple <IDev, IDs, IDg, tp>, where:

• IDev is the event identifier;
• IDs is the sensor identifier;
• IDg is the sensor group identifier (needed for geographical correlation);
• tp is the event occurrence time, which should be a sensor timestamp (when a global clock is available for synchronization) or the Event History machine clock.

Since the message transportation time is not instantaneous, the event occurrence time can be different from the registration time. Several research works have addressed the issue of clock synchronization in distributed systems. Here we assume that a proper solution (e.g. time shifting) has been adopted at a lower level.

A composite event is a combination of primitive events defined by means of proper operators. The EDL of DETECT is derived from the Snoop event algebra (Chakravarthy & Mishra 1994). Every composite event instance is a triple <IDec, parcont, te>, where:

• IDec is the composite event identifier;
• parcont is the parameter context, stating which occurrences of primitive events need to be considered during the composite event detection (as described below);
• te is the temporal value related to the occurrence of the composite event (corresponding to the tp of the last component event).

Formally, an event E (either primitive or composite) is a function from the time domain onto the boolean values True and False:

E: T → {True, False}, given by:
E(t) = True if E occurs at time t, and False otherwise.

The basic assumption of considering a boolean function is quite general, since different events can be associated to a continuous sensor output according to a set of specified thresholds. Furthermore, negated conditions (!E) can be used when there is the need to check that an event is no longer occurring. This allows considering both instantaneous (''occurs'' = ''has occurred'') and continuous (''occurs'' = ''is occurring'') events. However, in order to simplify the EDL syntax, negated conditions on events can be substituted by complementary events. An event Ec is complementary to E when:

Ec ⇒ !E

Each event is denoted by an event expression, whose complexity grows with the number of involved events. Given the expressions E1, E2, . . . , En, every application of any operator to them is still an expression. In the following, we briefly describe the semantics of these operators. For a formal specification of these semantics, the reader can refer to (Chakravarthy et al. 1994).

OR. Disjunction of two events E1 and E2, denoted (E1 OR E2). It occurs when at least one of its components occurs.

AND. Conjunction of two events E1 and E2, denoted (E1 AND E2). It occurs when both E1 and E2 occur (the temporal sequence is ignored).

ANY. A composite event, denoted ANY(m, E1, E2, . . . , En), where m ≤ n. It occurs when m out of the n distinct events specified in the expression occur (the temporal sequence is ignored).

SEQ. Sequence of two events E1 and E2, denoted (E1 SEQ E2). It occurs when E2 occurs, provided that E1 has already occurred. This means that the time of occurrence of E1 has to be less than the time of occurrence of E2.

The sequence operator is used to define composite events when the order of the component events is relevant. Another way to perform a time correlation on events is by exploiting temporal constraints. The logical correlation could lose meaningfulness when the time interval between component events exceeds a certain threshold. Temporal constraints can be defined on primitive events with the aim of defining a validity interval for the composite event. Such constraints can be added to any operator in the formal expression used for event description.

For instance, let us assume that in the composite event E = (E1 AND E2) the time interval between the occurrences of the primitive events E1 and E2 must be at most T. The formal expression is modified by adding the temporal constraint [T] as follows:

(E1 AND E2)[T] = True
⇔
∃ t1 ≤ t | (E1(t) ∧ E2(t1) ∨ E1(t1) ∧ E2(t)) ∧ |t − t1| ≤ T

5 THE SOFTWARE IMPLEMENTATION

This section describes some implementation details of DETECT, referring to the current development state of the core modules of the framework, including the Detection Engine. The modules have been fully implemented using the Java programming language. JGraph has been employed for the graphical construction of the Event Trees used in the Scenario GUI. Algorithms have been developed for detecting composite events in all parameter contexts.

Attack scenarios are currently described by Event Trees, where leaves represent primitive events while

109
• Chronicle: the (initiator, terminator) pair is unique.
The oldest initiator is paired with the oldest termi-
nator.
• Continuous: each initiator starts the detection of the
event.
• Cumulative: all occurrences of primitive events are
accumulated until the composite event is detected.
The effect of EDL operators is then conditioned
by the specific context, which is implemented in the
Event Dispatcher. Theoretically, in the construction of
the model a different node should be defined for each
context. Whilst a context could be associated to each
operator, currently a single context is associated to
each detection model. Furthermore, a different node
object for each context has been implemented.
Figure 2. Event tree for composite event ((E1 OR E2) AND In the current implementation, Event Graphs are
(E2 SEQ (E4 AND E6))). used to detect the scenarios defined by Event Trees,
which are only used as a descriptive formalism. In
fact, scenarios represented by more Event Trees can
internal nodes (including the root) represent EDL lan- be detected by a single Event Graph produced by the
guage operators. Figure 2 shows an example Event Model Generator. When an Event Detector receives
Tree representing a composite event. a message indicating that an instance of a primitive
After the user has sketched the Event Tree, the Sce- event Ei has occurred, it stores the information in the
nario GUI module parses the graph and provides the node associated with Ei . The detection of compos-
EDL expression to be added to the EDL Repository. ite events follows a bottom-up process that starts from
The parsing process starts from the leaf nodes rep- primitive event instances and flows up to the root node.
resenting the primitive events and ends at the root So the composite event is detected when the condition
node. Starting from the content of the EDL Repository, related to the root node operator is verified. The propa-
the Model Generator module builds and instantiates gation of the events is determined by the user specified
as many Event Detector objects as many composite context. After the detection of a composite event, an
events stored in the database. The detection algorithm object of a special class (Event Detected) is instanti-
implemented by such objects is based on Event Graphs ated with its relevant information (identifier, context,
and the objects include the functionalities of both the component event occurrences, initiator, terminator).
Model Solver and the Detection Engine.
In the current prototype, after the insertion of attack
scenarios, the user can start the detection process on 6 AN EXAMPLE SCENARIO
the Event History using a stub front-end (simulating
the Model Executor and the Output Manager mod- In this section we provide an application of DETECT
ules). A primitive event is accessed from the database to the case-study of a subway station. We consider
by a specific Model Feeder module, implemented by a a composite event corresponding to a terrorist threat.
single Event Dispatcher object which sends primitive The classification of attack scenarios is performed by
event instances to all Event Detectors responsible for security risk analysts in the vulnerability assessment
the detection process. process.
The Event Dispatcher requires considering only The attack scenario consists of an intrusion and drop
some event occurrences, depending on a specific of an explosive device in a subway tunnel. Let us sup-
policy defined by the parameter context. The policy pose that the dynamic of the scenario follows the steps
is used to define which events represent the begin- reported below:
ning (initiator) and the end (terminator) of the sce-
nario. The parameter context states which component 1. The attacker stays on the platform for the time
event occurrences play an active part in the detec- needed to prepare the attack, missing one or more
tion process. Four contexts for event detection can be trains.
defined: 2. The attacker goes down the tracks by crossing the
limit of the platform and moves inside the tunnel
portal.
• Recent: only the most recent occurrence of the 3. The attacker drops the bag containing the explosive
initiator is considered. device inside the tunnel and leaves the station.

110
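The four parameter contexts listed above govern which stored initiator occurrence is paired with a terminator. As a minimal illustration — in Java, the language the DETECT prototype is implemented in, but with names and structure that are ours, not the framework's — the Recent and Chronicle policies could be sketched as follows:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of parameter contexts (not DETECT code): the context
// decides which initiator occurrence is consumed when a terminator arrives.
enum ParameterContext { RECENT, CHRONICLE, CONTINUOUS, CUMULATIVE }

class InitiatorBuffer {
    private final Deque<Long> initiators = new ArrayDeque<>();

    void onInitiator(long t) {
        initiators.addLast(t);
    }

    // Returns the initiator occurrence paired with the incoming terminator,
    // or null if no initiator has been observed.
    Long onTerminator(ParameterContext ctx) {
        if (initiators.isEmpty()) return null;
        switch (ctx) {
            case RECENT: {
                // Only the most recent initiator occurrence is considered.
                Long last = initiators.peekLast();
                initiators.clear();
                return last;
            }
            case CHRONICLE:
                // The oldest initiator is paired with the oldest terminator.
                return initiators.pollFirst();
            default:
                // CONTINUOUS and CUMULATIVE keep every occurrence alive and
                // would need richer bookkeeping than this sketch provides.
                return initiators.peekFirst();
        }
    }
}
```

The Continuous and Cumulative policies are only stubbed here, since they require keeping every active detection alive rather than selecting a single initiator occurrence.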
Obviously, it is possible to think of several variants of this scenario. For instance, only one between step 1 and step 2 could happen. Please note that step 1 (a person not taking the train) would be very difficult for a human operator to detect in a crowded station, due to the people getting on and off the train.

Let us suppose that the station is equipped with a security system including intelligent cameras (S1), active infrared barriers (S2) and explosive sniffers (S3) for tunnel portal protection. The formal description of the attack scenario consists of a sequence of events which should be detected by the appropriate sensors and combined in order to form the composite event. The formal specification of the primitive events constituting the scenario is provided in the following:

a. extended presence on the platform (E1 by S1);
b. train passing (E2 by S1);
c. platform line crossing (E3 by S1);
d. tunnel intrusion (E4 by S2);
e. explosive detection (E5 by S3).

For the sake of brevity, further steps are omitted. The composite event drop of explosive in tunnel can be specified in EDL as follows:

(E1 AND E2) OR E3 SEQ (E4 AND E5)

Figure 3 provides a GUI screenshot showing the Event Tree for the composite event specified above. The user chooses the parameter context and builds the tree (including primitive events, operators and interconnection edges) through the user-friendly interface. If a node represents a primitive event, the user has to specify the event (Ex) and sensor (Sx) identifiers. If a node is an operator, the user can optionally specify other parameters such as a temporal constraint, the partial alarm level and the m parameter (ANY operator). Also, the user can activate/deactivate the composite events stored in the repository, carrying out the detection process.

Figure 3. Insertion of the composite event using the GUI.

A partial alarm can be associated to the scenario evolution after step 1 (the left AND in the EDL expression), in order to warn the operator of a suspect abnormal behavior.

In order to activate the detection process, a simulated Event History has been created ad hoc. An on-line integration with a real working SMS will be performed in the near future for experimentation purposes.

7 CONCLUSIONS & FUTURE WORKS

In this paper we have introduced the working principles and the software architecture of DETECT, an expert system allowing for early warnings in security critical domains.

DETECT can be used as a module of a more complex hierarchical system, possibly involving several infrastructures. In fact, most critical infrastructures are organized in a multi-level fashion: local sites, grouped into regions and then monitored centrally by a national control room, where all the (aggregated) events coming from lower levels are routed. When the entire system is available, each site at each level can benefit from the knowledge of significant events happening in other sites. When some communication links are unavailable, it is still possible to activate countermeasures based on the local knowledge.

We are evaluating the possibility of using a single automatically trained multi-layered ANN to complement deterministic detection by: 1) classification of suspect scenarios, with a low FAR; 2) automatic detection of abnormal behaviors, by observing deviations from normality; 3) on-line update of knowledge, triggered by the user when a new anomaly has been detected. The ANN model can be trained to understand normality by observing the normal use of the infrastructure, possibly for long periods of time. The Model Feeder for the ANN operates in a way which is similar to the Event Tree example provided above. An ANN-specific Model Updater allows for an on-line learning facility. Future developments will be aimed at a more cohesive integration between deterministic and heuristic detection, by making the models interact with each other.
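As a closing illustration of the Section 6 example, the composite expression (E1 AND E2) OR E3 SEQ (E4 AND E5) can be evaluated over a toy Event History. The sketch below is illustrative Java, not the actual Detection Engine; it assumes each primitive event occurs at most once and represents a detection by its completion time:

```java
import java.util.Map;

// Toy evaluation of (E1 AND E2) OR E3 SEQ (E4 AND E5) over a simulated
// Event History mapping each primitive event to its (single) occurrence
// time, with null meaning "not occurred". Illustrative only.
class ScenarioSketch {
    // (A AND B) completes at the later of the two occurrences.
    static Long and(Long a, Long b) {
        return (a == null || b == null) ? null : Math.max(a, b);
    }

    // (A OR B) completes at the earliest available occurrence.
    static Long or(Long a, Long b) {
        if (a == null) return b;
        if (b == null) return a;
        return Math.min(a, b);
    }

    // (A SEQ B): B must complete strictly after A.
    static Long seq(Long a, Long b) {
        return (a != null && b != null && a < b) ? b : null;
    }

    // Detection time of "drop of explosive in tunnel", or null if undetected.
    static Long dropOfExplosive(Map<String, Long> history) {
        Long intrusion = or(and(history.get("E1"), history.get("E2")),
                            history.get("E3"));
        return seq(intrusion, and(history.get("E4"), history.get("E5")));
    }
}
```

With E1–E5 occurring at times 10, 30, 40, 50 and 55, the intrusion part is detected at time 30 and the whole scenario at time 55, the later of the E4/E5 occurrences; without E5, no detection is reported.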

Methodology and software platform for multi-layer causal modeling

K.M. Groth, C. Wang, D. Zhu & A. Mosleh


Center for Risk and Reliability, University of Maryland, College Park, Maryland, USA

ABSTRACT: This paper introduces an integrated framework and software platform that uses a three layer
approach to modeling complex systems. The multi-layer PRA approach implemented in IRIS (Integrated Risk
Information System) combines the power of Event Sequence Diagrams and Fault Trees for modeling risk
scenarios and system risks and hazards, with the flexibility of Bayesian Belief Networks for modeling non-
deterministic system components (e.g. human, organizational). The three types of models combined in the IRIS
integrated framework form a Hybrid Causal Logic (HCL) model that addresses deterministic and probabilistic
elements of systems and quantitatively integrates system dependencies. This paper will describe the HCL
algorithm and its implementation in IRIS by use of an example from aviation risk assessment (a risk scenario
model of aircraft taking off from the wrong runway).

1 INTRODUCTION

Conventional Probabilistic Risk Assessment (PRA) methods model deterministic relations between basic events that combine to form a risk scenario. This is accomplished by using Boolean logic methods, such as Fault Trees (FTs) and Event Trees (ETs) or Event Sequence Diagrams (ESDs). However, since human and organizational failures are among the most important roots of many accidents and incidents, there is an increased interest in expanding causal models to incorporate non-deterministic causal links encountered in human reliability and organizational theory. Bayesian Belief Networks (BBNs) have the capability to model these soft relationships.

This paper describes a new risk methodology known as Hybrid Causal Logic (HCL) that combines the Boolean logic-based PRA methods (ESDs, FTs) with BBNs. The methodology is implemented in a software package called the Integrated Risk Information System (IRIS). The HCL computational engine of IRIS can also be used as a standalone console application. The functionality of this computational engine can be accessed by other applications through an API. IRIS was designed for the United States Federal Aviation Administration. Additional information on IRIS can be found in the references (Groth, Zhu & Mosleh 2008; Zhu et al. 2008; Groth 2007).

In this modeling framework risk scenarios are modeled in the top layer using Event Sequence Diagrams. In the second layer, Fault Trees are used to model the factors contributing to the properties and behaviors of the physical system (hardware, software, and environmental factors). Bayesian Belief Networks comprise the third layer to extend the causal chain of events to potential human, organizational, and socio-technical roots.

This approach can be used as the foundation for addressing many of the issues that are commonly encountered in risk and safety analysis and hazard identification. As a causal model, the methodology provides a vehicle for identification and analysis of cause-effect relationships across many different modeling domains, including human, software, hardware, and environment.

The IRIS framework can be used to identify all risk scenarios and contributing events and calculate associated probabilities; to identify the risk value of specific changes and the risk importance of certain elements; and to monitor system risk indicators by considering the frequency of observation and the risk significance over a period of time. Highlighting, trace, and drill down functions are provided to facilitate hazard identification and navigation through the models.

All of the features in IRIS can be implemented with respect to one risk scenario or multiple scenarios, e.g. all of the scenarios leading to a particular category or type of end state. IRIS can be used to build a single one-layer model, or a network of multi-layer models.

IRIS was developed as part of an international research effort sponsored by the FAA System Approach for Safety Oversight (SASO) office. Other parts of this research involved the creation of ESDs, FTs, and BBNs by teams of aviation experts from the United States and Europe. IRIS integrates the different models into a standard framework, and the HCL algorithm combines quantitative information from the models to calculate total risk.
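The layering just described can be sketched as a minimal data flow (hypothetical names, not the IRIS API): an ESD pivotal event takes its probability from a linked FT top gate, whose basic events can in turn take theirs from BBN nodes. For simplicity this sketch assumes independent basic events under an OR gate:

```java
import java.util.List;
import java.util.function.Supplier;

// Rough sketch of the three HCL layers (hypothetical names, not the IRIS
// API): a BBN node supplies a probability to a FT basic event, a FT gate
// aggregates basic events, and an ESD pivotal event reads the gate's value.
class HclLayersSketch {
    // Bottom layer: a BBN node exposing the probability of one of its states.
    record BbnNode(String name, double probability) {}

    // Middle layer: an OR gate over basic events, assumed independent here
    // (IRIS itself quantifies via BDDs rather than this closed formula).
    static double orGate(List<Supplier<Double>> basicEvents) {
        double pNone = 1.0;
        for (Supplier<Double> e : basicEvents) pNone *= 1.0 - e.get();
        return 1.0 - pNone; // P(at least one basic event occurs)
    }

    // Top layer: a linked ESD pivotal event simply reads the FT top gate.
    static double pivotalEventProbability(Supplier<Double> linkedTopGate) {
        return linkedTopGate.get();
    }
}
```

The point of the indirection through Supplier is that a linked event has no probability of its own: it always reflects the current value of the layer below, which is the behavior the text describes for linked ESD events and FT basic events.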
The Dutch National Aerospace Laboratory (NLR) used the NLR air safety database and aviation experts to create a hierarchical set of 31 generic ESDs representing the possible accident scenarios from takeoff to landing (Roelen et al. 2002).

Another layer of the aviation safety model was created by Hi-Tec Systems. Hi-Tec created a comprehensive model for the quality of air carrier maintenance (Eghbali 2006) and the flight operations (Mandelapu 2006). NLR has also created FTs for specific accident scenarios (Roelen & Wever 2004a, b).

The NLR and Hi-Tec models were built and analyzed in IRIS. One set of models pertains to the use of the incorrect runway during takeoff. These models became especially pertinent after the August 2006 fatal Comair Flight 5191 crash in Lexington, Kentucky. The pilot of flight 5191 taxied onto the wrong runway during an early morning takeoff due to a combination of human and airport factors. The incorrect runway was shorter than the minimum distance required for the aircraft to take off. The aircraft was less than 300 ft from the end of the runway before the pilots realized the error and attempted to take off at below-optimal speed. The attempted takeoff resulted in a runway overrun and the death of 49 of the 50 people onboard.

The NTSB (2007) cited human actions by the crew and air traffic control (ATC) as contributing to the accident. The crew violated cockpit policy by engaging in non-pertinent conversation during taxiing and by completing an abbreviated taxi briefing. Signs indicating the runway number and cockpit displays indicating the direction of takeoff were not mentioned by either pilot during the takeoff. During takeoff the flight crew noted that there were no lights on the runway as expected, but did not double-check their position, as the copilot had observed numerous lights out on the correct runway the previous day. Pre-flight paperwork also indicated that the centerline lights on the proper runway were out. The flight crew did not use the available cues to reconsider takeoff.

At the time of the accident only one of two required air traffic controllers was on duty. According to post-accident statements, the controller on duty at the time of the accident was also responsible for monitoring radar and was not aware that the aircraft had stopped short of the desired runway before he issued takeoff clearance. After issuing takeoff clearance the controller turned around to perform administrative tasks during take-off and was not engaged in monitoring the progress of the flight. Fatigue likely contributed to the performance of the controller, as he had only slept for 2 hours in the 24 hours before the accident.

Impaired decision making and inappropriate task prioritization by both crew members and ATC were major contributing factors to this accident. The reduced lighting on both the correct and incorrect runways at the airport contributed to the decision errors made by the crew, and fatigue and workload contributed to the decision errors made by ATC. The details of flight 5191 and the group of models for use of the incorrect runway during takeoff will be used throughout this paper to show how the HCL methodology can be applied to a real example.

2 OVERVIEW OF HCL METHODOLOGY

2.1 Overview of the HCL modeling layers

The hybrid causal logic methodology extends conventional deterministic risk analysis techniques to include "soft" factors, including the organizational and regulatory environment of the physical system. The HCL methodology employs a model-based approach to system analysis; this approach can be used as the foundation for addressing many of the issues that are commonly encountered in system safety assessment, hazard identification analysis, and risk analysis. The integrated framework is presented in Figure 1.

Figure 1. Illustration of a three-layered IRIS model.

ESDs form the top layer of the three-layer model, FTs form the second layer, and BBNs form the bottom layer. An ESD is used to model temporal sequences of events. ESDs are similar to event trees and flowcharts; an ESD models the possible paths to outcomes, each of which could result from the same initiating event. ESDs contain decision nodes where the paths diverge based on the state of a system element. As part of the hybrid causal analysis, the ESDs define the context or base scenarios for the hazards, sources of risk, and safety issues.

The ESD shown in Figure 2 models the probability of an aircraft taking off safely, stopping on the runway, or overrunning the runway. As can be seen in the model, the crew must reject the takeoff and the speed of the aircraft must be lower than the critical speed beyond which the aircraft cannot stop before the
end of the runway. By the time the Flight 5191 crew realized the mistake, the plane was above the critical speed and the runway overrun was inevitable.

Figure 2. Case study top layer—ESD for an aircraft using the wrong runway (Roelen et al. 2002).

The initiating event of the ESD, ATC event, is directly linked to the top gate of the FT in Figure 3. This FT provides three reasons an aircraft could be placed in this situation: loss of separation with traffic, takeoff from the incorrect runway, or a bird strike.

Figure 3. Case study middle layer—FT for air traffic control events (Roelen and Wever 2004a).

FTs use logical relationships (AND, OR, NOT, etc.) to model the physical behaviors of the system. In an HCL model, the top event of a FT can be connected to any event in the ESD. This essentially decomposes the ESD event into a set of physical elements affecting the state of the event, with the node in the ESD taking its probability value from the FT.

BBNs have been added as the third layer of the model. A BBN is a directed acyclic graph, i.e. it cannot contain feedback loops. Directed arcs form paths of influence between variables (nodes). The addition of BBNs to the traditional PRA modeling techniques extends conventional risk analysis by capturing the diversity and complexity of hazards in modern systems. BBNs can be used to model non-deterministic causal factors such as human, environmental and organizational factors.

BBNs offer the capability to deal with sequential dependency and uncertain knowledge. BBNs can be connected to events in ESDs and FTs. The connections between the BBNs and the logic models are formed by binary variables in the BBN; the probability of the linked BBN node is then assigned to the ESD or FT event.

Figure 4. Partial NLR model for takeoff from wrong runway. The flight plan node is fed by Figure 5, and the crew decision/action error is fed by additional human factors.

The wrong runway event in the center of the FT is the root cause of the accident. Factors that contribute to this root cause are modeled in the BBNs in Figures 4 and 5. Figure 4 is part of the wrong runway BBN developed by NLR (Roelen and Wever 2007); the wrong runway FT event is linked to the output node
of this BBN. The flight plan node in Figure 5 feeds into the wrong runway node in Figure 4. Figure 5 is fed by the Hi-Tec air carrier maintenance model (Eghbali 2006; not pictured), with the end node of the maintenance model feeding information into the fleet availability node at the top of the flight operations model.

Figure 5. Case study bottom layer—BBN of flight operations (Mandelapu 2006).

Since many of the causal factors in BBNs may have widespread influence, BBN nodes may impact multiple events within ESDs and FTs. The details of the HCL quantification procedure can be found in the references (Groen & Mosleh 2008, Wang 2007).

2.2 Overview of HCL algorithm quantitative capabilities

An ESD event can be quantified directly by inputting a probability value for the event, or indirectly by linking it to a FT or a node in a BBN. Linked ESD events take on the probability value of the FT or node attached to them. This allows the analyst to set a variable probability for ESD events based on contributing factors from lower layers of the model. Likewise, FT basic events can be quantified directly or linked to any node in the BBN.

BBN nodes are quantified in conditional probability tables. The size of the conditional probability table for each node depends on the number of parent nodes leading into it. The conditional probability table requires the analyst to provide a probability value for each state of the child node based on every possible combination of the states of the parent nodes. The default number of states for a BBN node is 2, although additional states can be added as long as the probabilities of all states sum to 1. Assuming the child and its n parent nodes all have 2 states, this requires 2^n probability values.

In order to quantify the hybrid model it is necessary to convert the three types of diagrams into a set of models that can communicate mathematically. This is accomplished by converting the ESDs and FTs into Reduced Ordered Binary Decision Diagrams (BDDs). The reduced ordered BDDs for a model are all unique, and the order of variables along each path from the root node to an end node is identical. Details on the algorithms used to convert ESDs and FTs into BDDs have been described extensively (Bryant 1992, Brace et al. 1990, Rauzy 1993, Andrews & Dunnett 2000, Groen et al. 2005).
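The conditional probability table bookkeeping described above can be sketched as follows (a generic illustration, not IRIS code): a binary child with n binary parents stores one P(child = true | parent states) entry per combination of parent states, i.e. 2^n entries, with the complementary state implied:

```java
// Generic sketch of a CPT for a binary BBN node with binary parents (not
// IRIS code): one P(child = true | parent states) entry per combination of
// parent states, i.e. 2^n entries for n parents.
class BinaryCpt {
    private final int numParents;
    private final double[] pTrue; // indexed by the parent-state bit pattern

    BinaryCpt(int numParents) {
        this.numParents = numParents;
        this.pTrue = new double[1 << numParents]; // 2^n rows
    }

    int size() {
        return pTrue.length;
    }

    // Encode parent states as bits: parents[i] == true sets bit i.
    private int row(boolean[] parents) {
        int idx = 0;
        for (int i = 0; i < numParents; i++) if (parents[i]) idx |= 1 << i;
        return idx;
    }

    void set(boolean[] parents, double p) {
        pTrue[row(parents)] = p;
    }

    double pChildTrue(boolean[] parents) {
        return pTrue[row(parents)];
    }

    // The child's states sum to 1, so P(child = false) need not be stored.
    double pChildFalse(boolean[] parents) {
        return 1.0 - pChildTrue(parents);
    }
}
```

For nodes with more than two states the table grows accordingly, which is why the number of parents, not the number of nodes, dominates the elicitation burden.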
BBNs are not converted into BDDs; instead, a hybrid BDD/BBN is created. In this hybrid structure, the probability of one or more of the BDD variables is provided by a linked node in the BBN. Additional details about the BDD/BBN link can be found in Groen & Mosleh (2008).

3 HCL-BASED RISK MANAGEMENT METRICS

In addition to providing probability values for each ESD scenario, each FT and each BBN node, the HCL methodology provides functions for tracking risks over time and for determining the elements that contribute most to scenario risk. HCL also provides the minimal cut-sets for each ESD scenario, allowing the user to rank the risk scenarios quantitatively. Specific functions are described in more detail below. For additional technical details see (Mosleh et al. 2007).

Figure 6 displays scenario results for the wrong runway scenario. It is clear that the most probable outcome of using the wrong runway is a continued takeoff with no consequences. The bottom half of the figure displays the cut-sets only for the scenarios that end with a runway overrun. As can be seen in the figure, the most likely series of events leading to an overrun is the combination of using the incorrect runway, attempting a rejected takeoff, and having speed in excess of the critical stopping speed (V1). This is the pattern displayed by flight 5191.

Figure 6. Probability values and cut sets for the base wrong runway scenario.

Figure 7. Importance measure results for the runway overrun model.

3.1 Importance measures

Importance measures are used to identify the most significant contributors to a risk scenario. They provide a quantitative way to identify the most important system hazards and to understand which model elements most affect system risk. Importance measures can be used to calculate the amount of additional safety resulting from a system modification, which allows analysts to examine the benefits of different modifications before implementation. Analysts can also use
importance measures to identify the elements that most contribute to a risk scenario and then target system changes to maximize the safety impact.

There are numerous ways to calculate importance measures for Boolean models. However, due to the dependencies in HCL models introduced by the inclusion of BBNs, the methods cannot be applied in their original form. Four conventional importance measures have been modified and implemented in HCL: Risk Achievement Worth (RAW), Risk Reduction Worth (RRW), Birnbaum, and Vesely-Fussel (VF).

The standard Vesely-Fussel importance measure (Eq. 1) calculates the probability that event e has occurred given that ESD end state S has occurred (Fussel 1975).

p(e|S) = p(e · S) / p(S)    (1)

For hybrid models, event e is a given state of a model element, e.g. a FT event is failed or a BBN node is "degraded" instead of "fully functional" or "failed." By addressing a particular state, it is possible to extend importance measures to all layers of the hybrid model.

Importance measures must be calculated with respect to an ESD end state. To ensure independence in ESDs with multiple paths, it is necessary to treat the end state S as the sum of the Si mutually exclusive paths leading to it. The importance measure can then be calculated by using Equation 2.

p(e|S) = Σi p(e · Si) / Σi p(Si) = Σi p(Si|e) p(e) / Σi p(Si)    (2)

For a set of scenarios belonging to two or more ESDs the probability can be calculated as a function of the results from each ESD or by use of the mean upper bound approximation. Additional details on HCL importance measure calculations can be found in Zhu (2008).

Figure 7 provides importance measure results for the runway overrun model. The importance measures in the figure are arranged by the Vesely-Fussel importance measure. The items in the components column are FT events and some selected BBN nodes. The BBN nodes selected reflect the factors that contributed to the runway overrun of Flight 5191. From the figure it is obvious that take-off from the incorrect runway is the most important contributing factor to the runway overrun end state.

3.2 Risk indicators

Risk indicators are used to identify potential system risks by considering both the risk significance and the frequency of the event. They are also used to monitor system risks and to track changes in risk over time. Risk significance is calculated with respect to selected ESD scenarios or end states. It can be calculated for any BBN node, FT gate or event, or ESD pivotal event.

The risk indicator is calculated by Equation 3, where R is the total risk and φ is the frequency of the event.

R = Pr(S|f) · φ    (3)

Pr(S|f) is the risk weight of a BBN node or FT event or gate (f) and S is the selected ESD end state or group of end states. If S consists of an end state category or multiple end states in the same ESD, Equation 3 is modified using the same logic explained for modifying Equation 1. For multiple end states in different ESDs the risk indicator value can be calculated using the upper bound approximation. The procedure for performing precursor analysis and hazard ranking follows directly from the risk indicator procedure.

Figure 8. Sample risk indicators implemented in IRIS.

Figure 8 displays individual risk indicators and total risk for several BBN nodes from the example model. Frequency values are to be provided by the analyst. In the example case, the frequency values were selected to show how IRIS could be used to monitor the risks before the accident; these are not values from data. The top graph in Figure 8 shows the changing risk values for each of the three selected indicators. The bottom graph shows the aggregated risk value over time. Based on the risk values obtained from the models and the hypothetical frequency data, it becomes apparent that the risk associated with airport adequacy increased sharply between May and July. The hypothetical risks associated with the adequacy of the airport could have been identified in July and steps
could have been taken to reduce these risks before the serious consequences occurred.

3.3 Risk impact

Analysts can use IRIS to visualize the change in system risk based on observed or postulated conditions. This can be achieved by using the set evidence function to make assumptions about the state of one or more BBN nodes. Once assumptions are made, the model is updated to reflect the new information, providing new probability values for all nodes subsequently affected by the changes.

When the BBN is linked to an ESD or FT, the new ESD and FT models will also display new probability values. The set evidence function allows users to see the impact of soft factors on risk scenarios. The result is a more tangible link between the actions of humans/organizations and specific system outcomes.

Setting evidence will provide users with a better understanding of how low-level problems propagate through the system and combine to form risk scenarios. Figure 9 displays updated scenario results for the flight 5191 overrun. In this scenario, evidence was set for three nodes in the BBN (Fig. 5). Human actions was set to the state unsafe because of errors made by the flight 5191 flight crew. Airport adequacy was set to the state inadequate because of the lack of proper lighting on both runways. The takeoff plan was deemed substandard.

By comparing the results of the base case, Figure 6, to the case updated with scenario evidence, Figure 9, it is possible to quantify the change in risk accompanying certain behaviors. The updated probability of a runway overrun based on human actions, airport conditions, and the takeoff plan is an order of magnitude greater than the probability of the base scenario. Again, the series of events leading to the flight 5191 crash is the most probable sequence leading to an overrun in the model.

It is evident from Figure 10 that the three BBN nodes strongly impact the probability of taking off from the incorrect runway. This probability increases by almost a factor of 2 when the model is updated with the scenario evidence.

Figure 9. Updated scenario results for the runway overrun with information about flight 5191 specified.

Figure 10. Fault tree results showing the probability of taking off from the wrong runway for the base case (top) and the case reflecting flight 5191 factors (bottom).

4 CONCLUSION

This paper provides an overview of the hybrid causal logic (HCL) methodology for Probabilistic Risk
Assessment and the IRIS software package developed to use the HCL methodology for comprehensive risk analyses of complex systems. The HCL methodology and the associated computational engine were designed to be portable, and thus there is no specific HCL GUI. The computational engine can read models from files and can be accessed through use of an API.

The flexible nature of the HCL framework allows a wide range of GUIs to be developed for many industries. The IRIS package is designed to be used by PRA experts and systems analysts. Additional GUIs can be added to allow users outside of the PRA community to use IRIS without in-depth knowledge of the modeling concepts and analysis tools.

Two FAA-specific GUIs were designed with two different target users in mind. Target users provided information about what information they needed from IRIS and how they would like to see it presented. The GUIs were linked to specific IRIS analysis tools, but enabled the results to be presented in a more qualitative (e.g. high/medium/low) way. The GUIs were designed to allow target users to operate the software immediately. Users are also able to view underlying models and see full quantitative results if desired.

The HCL framework was applied to the flight 5191 runway overrun event from 2006, and the event was analyzed based on information obtained about the conditions contributing to the accident.

The three-layer HCL framework allows different modeling techniques to be used for different aspects of a system. The hybrid framework goes beyond typical PRA methods to permit the inclusion of soft causal factors introduced by human and organizational aspects of a system. The hybrid models and IRIS software package provide a framework for unifying multiple aspects of complex socio-technological systems to perform system safety analysis, hazard analysis and risk analysis.

The methodology can be used to identify the most important system elements that contribute to specific outcomes, and provides decision makers with a quantitative basis for allocating resources and making changes to any part of a system.

ACKNOWLEDGEMENT

The work described in this paper was supported by the US Federal Aviation Administration. The authors are indebted to John Lapointe and Jennelle Derrickson (FAA, William J. Hughes Technical Center) for overall monitoring and coordination. The opinions expressed in this paper are those of the authors and do not reflect any official position by the FAA.

REFERENCES

Andrews, J.D. & Dunnett, S.J. 2000. Event Tree Analysis using Binary Decision Diagrams. IEEE Transactions on Reliability 49(2): 230–239.
Brace, K., Rudell, R. & Bryant, R. 1990. Efficient Implementation of a BDD Package. The 27th ACM/IEEE Design Automation Conference, IEEE 0738.
Bryant, R. 1992. Symbolic Boolean Manipulation with Ordered Binary Decision Diagrams. ACM Computing Surveys 24(3): 293–318.
Eghbali, G.H. 2006. Causal Model for Air Carrier Maintenance. Report Prepared for Federal Aviation Administration. Atlantic City, NJ: Hi-Tec Systems.
Fussell, J.B. 1975. How to Hand Calculate System Reliability and Safety Characteristics. IEEE Transactions on Reliability R-24(3): 169–174.
Groen, F. & Mosleh, A. 2008 (In Press). The Quantification of Hybrid Causal Models. Submitted to Reliability Engineering and System Safety.
Groen, F., Smidts, C. & Mosleh, A. 2006. QRAS—the quantitative risk assessment system. Reliability Engineering and System Safety 91(3): 292–304.
Groth, K. 2007. Integrated Risk Information System Volume 1: User Guide. College Park, MD: University of Maryland.
Groth, K., Zhu, D. & Mosleh, A. 2008. Hybrid Methodology and Software Platform for Probabilistic Risk Assessment. The 54th Annual Reliability and Maintainability Symposium, Las Vegas, NV.
Mandelapu, S. 2006. Causal Model for Air Carrier Maintenance. Report Prepared for Federal Aviation Administration. Atlantic City, NJ: Hi-Tec Systems.
Mosleh, A. et al. 2004. An Integrated Framework for Identification, Classification and Assessment of Aviation Systems Hazards. The 9th International Probabilistic Safety Assessment and Management Conference. Berlin, Germany.
Mosleh, A., Wang, C. & Groen, F. 2007. Integrated Methodology For Identification, Classification and Assessment of Aviation Systems Hazards and Risks Volume 1: Framework and Computational Algorithms. College Park, MD: University of Maryland.
National Transportation Safety Board (NTSB) 2007. Aircraft Accident Report NTSB/AAR-07/05.
Rauzy, A. 1993. New Algorithms for Fault Trees Analysis. Reliability Engineering and System Safety 40: 203–211.
Roelen, A.L.C. & Wever, R. 2004a. A Causal Model of Engine Failure, NLR-CR-2004-038. Amsterdam: National Aerospace Laboratory NLR.
Roelen, A.L.C. & Wever, R. 2004b. A Causal Model of A Rejected Take-Off. NLR-CR-2004-039. Amsterdam: National Aerospace Laboratory NLR.
Roelen, A.L.C. et al. 2002. Causal Modeling of Air Safety. Amsterdam: National Aerospace Laboratory NLR.
Wang, C. 2007. Hybrid Causal Methodology for Risk Assessment. PhD dissertation. College Park, MD: University of Maryland.
Zhu, D. et al. 2008. A PRA Software Platform for Hybrid Causal Logic Risk Models. The 9th International Probabilistic Safety Assessment and Management Conference. Hong Kong, China.

SCAIS (Simulation Code System for Integrated Safety Assessment):
Current status and applications

J.M. Izquierdo, J. Hortal, M. Sánchez, E. Meléndez & R. Herrero
Consejo de Seguridad Nuclear (CSN), Madrid, Spain

J. Gil, L. Gamo, I. Fernández, J. Esperón & P. González
Indizen Technologies S.L., Madrid, Spain

C. Queral, A. Expósito & G. Rodríguez
Universidad Politécnica de Madrid (UPM), Spain
ABSTRACT: This paper surveys the current status of the work by the Spanish Nuclear Safety Council (CSN) to establish an Integrated Safety Analysis (ISA) methodology, supported by a simulation framework called SCAIS, to independently check the validity and consistency of many assumptions used by licensees in their safety assessments. This diagnostic method builds on advanced dynamic reliability techniques on top of classical Probabilistic Safety Analysis (PSA) and deterministic tools, and allows many aspects of the safety assessments to be checked at once, making effective use of regulatory resources. Apart from the theoretical approach that is at the basis of the method, application of ISA requires a set of computational tools. Development of ISA started with a suitable software package called SCAIS, which makes intensive use of code coupling techniques to join typical thermal-hydraulic analysis, severe accident and probability calculation codes. The final goal is to dynamically generate the event tree that stems from an initiating event, improving on the conventional static PSA approach.

1 NEED OF RISK-BASED DIAGNOSTIC TOOLS

Most often, in defending their safety cases within the licensing process, industry safety analysts have to rely on computational tools, including simulation of transients and accidents and probabilistic safety assessments. Such an assessment capability, even if reduced to its analytical aspects, is a huge effort requiring considerable resources.

The increasing trend towards Risk Informed Regulation (RIR) and the recent interest in methods that are independent of the diversity of existing nuclear technologies motivate an even greater demand for computerized safety case analysis. It has been further fostered by:

• new nuclear power plant designs;
• the large time span and evolution of the old generic safety analyses, which require confirmation of their present applicability; and
• the need to extend the life of existing plants, with associated challenges to the potential reduction in their safety margins.

Important examples are, for instance, the analyses justifying the PSA success criteria and operating technical specifications, which are often based on potentially outdated base calculations made in older times, in a different context and with another spectrum of applications in mind.

This complex situation generates a parallel need in regulatory bodies that makes it mandatory to increase their technical expertise and capabilities in this area. Technical Support Organizations (TSO) have become an essential element of the regulatory process¹, providing a substantial portion of its technical and scientific basis via computerized safety analysis supported on available knowledge and analytical methods/tools.

TSO tasks cannot have the same scope as their industry counterparts, nor is it reasonable to expect

¹ Examples are GRS, IRSN, PSI and Studsvik, TSOs supporting the regulatory bodies of Germany, France, Switzerland and Sweden, respectively.
the same level of resources. Instead, in providing their technical expertise, they shall:

• review and approve methods and results of licensees, and
• perform their own analyses/calculations to verify the quality, consistency, and conclusions of day-to-day industry assessments.

The latter is a difficult, different and very special regulatory task, requiring specific TSO diagnostic tools to independently check the validity and consistency of the many assumptions used and conclusions obtained by the licensees in their safety assessments.

The approach and the tools shall include a sound combination of deterministic and probabilistic single checks, pieces however of an ISA methodology, that together constitute a comprehensive sample verifying all relevant decision-making risk factors and ensuring that the decision ingredients are properly and consistently weighted.

2 ISSUES OF PARTICULAR RELEVANCE

In recent years, efforts have been devoted to clarifying the relative roles of the deterministic and probabilistic types of analysis with a view towards their harmonization, in order to benefit from their strengths and to get rid of identified shortcomings, normally related to interface aspects, such as the interaction between the evolution of process variables and its influence on probabilities. Different organizations, (Hofer 2002) and (Izquierdo 2003a), have undertaken initiatives in different contexts with claims such as the need for "an integration of probabilistic safety analysis in the safety assessment, up to the approach of a risk-informed decision making process" as well as for "proposals of verification methods for application that are in compliance with the state of the art in science and technology".

These initiatives should progressively evolve into a sound and efficient interpretation of the regulations that may be confirmed via computerized analysis. It is not so much a question of new regulations from the risk assessment viewpoint, but of ensuring compliance with existing ones in the new context by verifying the consistency of individual plant assessment results through a comprehensive set of checks. Its development can then be considered a key and novel topic for research within nuclear regulatory agencies/TSOs.

More precisely, issues that require an integrated approach arise when considering:

• the process by which the insights from these complementary safety analyses are combined, and
• their relationships when addressing high-level requirements such as defense in depth and safety margins.

Analysis of PSA success criteria and operating technical specifications, and their mutual consistency, are important chapters of the optimization of the protection system design, encompassing, for instance, problems like:

• Ensuring that the protection system is able to cope with all accident scenarios and not only with a predetermined set. This umbrella character is hard to prove, particularly within an atmosphere of reduced safety margins (SMAP Task Group 2007). It requires careful regulatory attention to the historic evolution of the deterministic assessments, and it is a source of potential conflicts when risk techniques are combined.
• Ensuring the adequacy of success criteria, which become critical and sensitive (Siu & Hilsmeier 2006). Many studies demonstrating the umbrella condition are old and perhaps unsuitable under these more restrictive circumstances. Extension of the automatic protection design to operator actions is one such source of potential inconsistencies, with complex aspects like available times for operator action. Emergency procedures verification is also worth mentioning, because procedures imply longer accident time scales than the automatic design, and important uncertainties in the timing of interventions, both potentially altering the umbrella conditions of the deterministic design.
• Another important extension of the scope of the analysis is the need to consider degraded-core situations to ensure acceptable residual risks (IAEA 2007). Again, consistency issues appear that require regulatory checks.

These consistency checks call for an appropriate extension of the probabilistic safety metrics currently used. Different exceedance frequency limits for different barrier safety limit indicators have been extensively discussed and may be used as a sound risk-domain interpretation of the existing regulations. For instance, frequency limits for partial core damage or a few core fuel failures may correctly interpret the present deterministic rules for safety margins in a way consistent with a probabilistic approach.

3 DEVELOPMENTS FOR AN INTEGRATED PSA FOR INDEPENDENT VERIFICATION OF SAFETY CASES

The CSN branch of Modelling and Simulation (MOSI) has developed its own ISA methodology for the above referred purposes. This diagnostic method has been
designed as a regulatory tool, able to compute the frequency of PSA sequences and the exceedance frequency of specified levels of damage, in order to check in an independent way the results and assumptions of the industry PSAs, including their RIR extensions/applications. This approach harmonizes the probabilistic and deterministic safety assessment aspects via a consistent and unified computer framework.

Apart from the theoretical approach that is at the basis of the method, application of ISA requires a set of computational tools. A suitable software package called SCAIS (Figure 1) couples at each time step (Izquierdo 2003a):

• simulation of nuclear accident sequences, resulting from exploring potential equipment degradations following an initiating event (i.e. simulation of thermal hydraulics, severe accident phenomenology and fission product transport);
• simulation of operating procedures and severe accident management guidelines;
• automatic delineation (with no a priori assumptions) of event and phenomena trees;
• probabilistic quantification of fault trees and sequences; and
• integration and statistical treatment of risk metrics.

Since the final goal is to dynamically generate the event tree that stems from an initiating event, improving the conventional PSA static approach, this simulation technique is called tree simulation. The activation of branching conditions is referred to as stimuli activation. The Stimulus Driven Theory of Probabilistic Dynamics (SDTPD), (Hofer 2002) and (Izquierdo 2003a), is the underlying mathematical risk theory (basic concepts, principles and theoretical framework) on which it is inspired and supported. The massive use of coupled codes has led to the definition and development of a Standard Software Platform that allows a given code to be incorporated quickly into the overall system and to overcome difficulties derived from particular models and computational methods.

3.1 Software package description

A consolidation and modernization program (Izquierdo 2003b) is currently being executed to enhance the capabilities and functionality of the software package, which will also facilitate easier maintenance and extensibility.

The current SCAIS development includes as main elements the Event Scheduler, the Probability Calculator, the Simulation Driver (BABIECA) and the Plant Models (Figure 2):

1. Event scheduler (DENDROS), which drives the dynamic simulation of the different incidental sequences. Its design guarantees modularity of the overall system and the parallelization of the event tree generation (Muñoz 1999).
It is designed as a separate process that controls the branch opening and coordinates the different processes that play a role in the generation of the Dynamic Event Tree (DET). The idea is to use the full capabilities of a distributed computational environment, allowing the maximum number of processors to be active. To this end, it manages the communications among the different processes that intervene in the event tree development, namely the distributed plant simulator, the probability calculator, and an output processor.
The scheduler arranges for the opening of a branch whenever certain conditions are met, and stops the simulation of any particular branch that has reached an absorbing state. The scheduler must know the probability of each branch in order to decide which branch is suitable for further development. Each new branch is started in a separate process, spawning a new transient simulator process and initializing it to the transient conditions that were in effect when the branching occurred.

Figure 1. SCAIS overview schema.

Figure 2. Global system architecture.
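The branch opening and cutoff logic just described can be sketched in a few lines of Python. This is an illustrative toy only: the class, the header names, the failure probabilities and the cutoff value are all invented for the example, and in DENDROS each branch is a separate simulator process rather than an in-process recursion.

```python
from dataclasses import dataclass

# Toy sketch of dynamic-event-tree branch management (illustrative only).
CUTOFF = 1e-8   # branches below this cumulative probability are not developed


@dataclass(frozen=True)
class Branch:
    history: tuple   # sequence of (system, outcome) pairs so far
    prob: float      # cumulative probability of this branch


def unfold(headers, p_fail, branch=Branch((), 1.0)):
    """Depth-first unfolding: fork success/failure branches at each header."""
    if not headers:
        return [branch]
    system, rest = headers[0], headers[1:]
    sequences = []
    for outcome, p in (("success", 1.0 - p_fail[system]),
                       ("failure", p_fail[system])):
        child = Branch(branch.history + ((system, outcome),), branch.prob * p)
        if child.prob >= CUTOFF:   # probability cutoff stops development
            sequences.extend(unfold(rest, p_fail, child))
    return sequences


# Hypothetical per-demand failure probabilities for two headers
seqs = unfold(["HPSI", "LPSI"], {"HPSI": 1e-3, "LPSI": 1e-3})
for s in seqs:
    print(s.history, s.prob)
```

Because common history is carried in each forked branch rather than recomputed, the sketch mirrors, in miniature, the saving discussed next.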
This saves a substantial amount of computation time, since common parts of the sequences are not recomputed. The applications of a tree-structured computation extend beyond the scope of the DETs. In fact, the branch opening and cutoff can obey any set of criteria, not necessarily given by a probability calculation, as, for instance, sensitivity studies or automatic initialization for Accident Management Strategy analysis. Task distribution among the different processors is managed by the Parallel Virtual Machine (PVM) interface.

2. Probability calculator module, which incrementally performs the boolean product of the fault trees corresponding to each system that intervenes in the sequence, additionally computing its probability. The fault trees used for the probability calculations are those of PSA studies. This imposes a strong computational demand that is optimized by preprocessing the header fault trees as much as possible. The current approach is to use fast on-line probability computation based on the representation of fault trees with the Binary Decision Diagram (BDD) formalism, fed from the industry models.

3. BABIECA, the consolidated simulation driver, which solves step by step topologies of block diagrams. A standardized linkage method has also been defined and implemented to incorporate other single-application oriented codes as block-modules, using parallel techniques. The BABIECA driver also allows changing the simulation codes at any time to fit the model to the instantaneous conditions, depending on the needs of the given simulation. Two coupling approaches, namely by boundary and initial conditions, have been implemented (Herrero 2003):

• Initial Conditions Coupling. This type of coupling is used when a certain code is reaching its validity and applicability limits, and new models and codes are necessary for further analysis. The typical example of this type of coupling is the transition from conventional Thermal Hydraulic (TH) codes to severe accident codes. BABIECA accomplishes this by allowing parts of the model (modules) to remain active or inactive depending on a specific variable (called simulation mode).
• Boundary Conditions Coupling. Some of the output variables obtained at the time advancement in one of the codes are sent to the computation of the model boundary conditions in the other code. This type of coupling is used to build a wide-scope code starting from several codes with a more limited scope.

The synchronization points are external to the solution advancement; this allows the connection scheme to be applied to any set of codes, regardless of the particular codes. This feature increases the power of calculation, since large codes can be split and executed on different machines. It also allows BABIECA to be coupled with itself, enabling easier modeling and simulation management of large model topologies.

4. Plant Models. Sequences obtained in ISA very often involve a wide range of phenomena not covered by a single simulation code. On the other hand, Emergency Operation Procedures (EOPs) are a very complex set of prescriptions, essential to the sequence development and branching, and hard to represent in detailed codes. A plant model suitable for ISA purposes does not then comprise a single input deck for just one code, but several inputs for several codes, each one being responsible for the simulation of certain phenomena. The coupling entails different although interrelated interfaces (see Figure 2).
Codes such as MELCOR, MAAP, RELAP and TRACE can be adapted to perform tree simulations under control of the scheduler. At present, the Modular Accident Analysis Program (MAAP4) is coupled to BABIECA to build up a distributed plant simulation. MAAP4 performs the calculation of the plant model when the transient reaches severe accident conditions, being then initialized with the appropriate transient conditions. Some parts of the simulation (especially the operator actions, but also control systems) may still be performed by the original code and the appropriate signals transferred as boundary conditions.

Consolidated versions of both the BABIECA driver simulator and the DENDROS scheduler have been designed with an object-oriented architecture and implemented in C++. They have been developed using OpenSource standards (Linux, XercesC, libpq++). The code system has been equipped with a Structured Query Language (SQL) relational database (PostgreSQL), used as a repository for model input and output. The input file has also been modernized using XML standards, since it is easy to read and understand, tags can be created as they are needed, and it facilitates treating the document as an object, using object-oriented paradigms.

3.2 Features of SCAIS coupled codes and future extensions

In order to connect BABIECA with external codes, a wrapper communication code based on PVM has been developed. BABIECA solves a step-by-step coupled-blocks topology. The solution follows a Standard Computational Scheme with well defined synchronization points that allow easy coupling with other
codes following the Standard. Two blocks are in charge of BABIECA communication with coupled codes:

• SndCode, which supplies the boundary conditions; and
• RcvCode, which receives messages from every spawned code at each time step.

For any desired code to be coupled, a specific wrapper, consistent with the Standard Computational Scheme, must be developed and implemented. It needs the message passing points defined by BABIECA at each synchronization point, allowing the communication between both codes. Three wrappers have been developed and are being tested at present:

• BABIECA-Wrapper allows BABIECA to be connected to itself, and is used to split big topologies into smaller ones in order to save computation time by parallelizing the processes.
• Probability-Wrapper allows BABIECA and DENDROS connections with the probability calculator module.
• MAAP4-Wrapper allows BABIECA-MAAP4 connections. This wrapper, implemented in FORTRAN77 like MAAP4, gives the advantage of analyzing MAAP4 sequences with the data processing tools developed for BABIECA. The interface module has the following functions:
1. To initialize the MAAP4 simulation using input data values provided by the BABIECA model.
2. To initialize MAAP4 from a restart image data.
3. To accomplish the exchange of dynamic variables between MAAP4 and other codes.
4. To drive the MAAP4 time-advancement calculation.
5. To save results and restart image data in the BABIECA database.

Recent extensions under development are oriented to:

• Developing the Paths and Risk Assessment modules of the SCAIS system. They implement a dynamic reliability method based on the Theory of Stimulated Dynamics (TSD), which is a variant of the more general SDTPD in its path and sequence version, (Izquierdo 2006) and (Queral 2006).
• Development and application of advanced probabilistic quantification techniques based on Binary Decision Diagrams (BDDs) that would remove some traditional drawbacks (e.g. truncation and the rare event approximation) of the currently used quantification approaches.
• New modeling algorithms to simulate standard PSA, correcting for dynamic effects.
• A new wrapper module allowing BABIECA-TRACE connections.
• New developments in sequence dynamics. A generalization of the transfer function concepts for sequences of events, with potential for PSA application as generalized dynamic release factors, is under investigation.
• New developments on classical PSA aspects. These include rigorous definitions for concepts like available time for operations or plant damage states.

4 TEST CASE: MBLOCA SEQUENCES

A full-scale test application of this integrated software package to a Medium Break Loss of Coolant Accident (MBLOCA) initiating event of a Spanish PWR plant is currently under development.

The specific objective of this analysis is to demonstrate the methodology and to check the tool, focusing on an independent verification of the event tree delineation and assessment of realistic EOPs for LOCA plant recovery. SCAIS provides a huge amount of results for the analyst when unfolding the Dynamic Event Tree (DET), even though the methodology reduces the number of sequences to a manageable level. The following sections outline some conclusions of the assessment, including sequence evolution and operator actions analysis.

4.1 Sequence analysis: MBLOCA sequence with accumulators failure

In a first step, MBLOCA sequences are simulated with the MAAP code. Results show that for breaks up to 6 inches no action to achieve the primary depressurization is necessary, and LPSI starts automatically at about 3000 seconds. However, for 2-inch breaks and lower, the primary system is stabilized at a pressure higher than the low pressure safety injection pump head, the operator action being necessary to achieve low-pressure controlled conditions, Figure 3.

The main operator actions, corresponding to EOP E-1 (loss of reactor or secondary coolant) and EOP ES-1.2 (cooling and decrease of pressure after a LOCA), are the following:
1. Check the need of reactor coolant pump trip (E-1, step 1).
2. Check the steam generator (SG) levels (E-1, step 3).
3. Check pressure decrease and cooling in the Reactor Coolant System (RCS) (E-1, step 11). If RCS pressure is higher than 15 kg/cm² a transition to EOP ES-1.2 is required.
4. Check SG levels (ES-1.2, step 5).
5. Start the RCS cooling until cold shutdown conditions (ES-1.2, step 6).
6. Check if the sub-cooling margin is greater than 0°C.
7. Check if the Safety Injection System (SIS) is active.
8. Disconnect the pressurizer (PZR) heaters.
9. Decrease RCS pressure in order to recover PZR level.

These operator actions have been implemented in MAAP. Of particular relevance is the primary cool-down process at 55°C/h by SGRV control, Figure 4.

As a second phase of the study, manual actions have been modelled by using the available BABIECA general modules, transferring the corresponding actions to MAAP4 by using SndCode and RcvCode blocks, Figure 5. Simulations with MAAP4 stand-alone and BABIECA coupled with MAAP4 give the same results, Figure 6.

In the near future, EOP instructions will be computerized with SIMPROC, which will be coupled with BABIECA and thermal hydraulic codes like MAAP or TRACE, (Esperón 2008).

Figure 3. MAAP4 simulation. Comparison with/without operator actions. Primary System Pressure (MBLOCA 2").

Figure 4. MAAP4 simulation. Primary System Water Temperature. Comparison with/without operator actions (MBLOCA 2").

Figure 5. Simplified topology scheme of the BABIECA-MAAP simulation.

Figure 6. Primary System Water Temperature. MAAP stand-alone and BABIECA-MAAP comparison (MBLOCA 2").

4.2 Dynamic event tree simulation for MBLOCA

As a third step of the study, the DET of a 6" MBLOCA is unfolded by incorporating DENDROS into the previous model. Each time a safety system set-point is reached, DENDROS unfolds the tree by simulating two different branches (namely the system success and failure branches), Figure 7. In the study, the following headers have been considered:

• HPSI: High Pressure Safety Injection.
• ACCUM: Safety Injection Accumulator.
• LPSI: Low Pressure Safety Injection.
• RECIR: Safety Injection Recirculation Mode.
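As a rough illustration of the bookkeeping behind such an event tree, the sketch below enumerates all 2⁴ success/failure combinations of the four headers under invented, independent per-demand failure probabilities. The numbers are placeholders, not values from the study, and the real DET opens branches dynamically at set-points rather than by exhaustive enumeration.

```python
from itertools import product

# Invented per-demand failure probabilities for the four headers (placeholders).
P_FAIL = {"HPSI": 1e-3, "ACCUM": 5e-4, "LPSI": 1e-3, "RECIR": 2e-3}


def sequence_prob(outcomes):
    """Probability of one event-tree sequence, assuming independent headers."""
    p = 1.0
    for header, ok in zip(P_FAIL, outcomes):
        p *= (1.0 - P_FAIL[header]) if ok else P_FAIL[header]
    return p


# Enumerate every sequence (True = header success, False = header failure)
tree = {outcomes: sequence_prob(outcomes)
        for outcomes in product([True, False], repeat=len(P_FAIL))}

# The probabilities over the full tree sum to one; the all-success
# sequence dominates, as expected for small failure probabilities.
best = max(tree, key=tree.get)
print(best, tree[best])
```

In SCAIS the per-branch probabilities come from the PSA fault trees evaluated by the probability calculator, and low-probability branches are cut off instead of being enumerated.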
Figure 7. MAAP4 Event Tree of MBLOCA 6″. Automatically drawn by DENDROS simulation (branch opening time in seconds).

Figure 8. Safety Injection Flows for different DET sequences (MBLOCA 6″).

The results for some sequences show the impact of different system failures and/or operator actions, Figure 8.

5 CONCLUSIONS

We have presented an overview of CSN methods and simulation packages to perform independent safety assessments to judge nuclear industry PSA-related safety cases. It has been shown that this approach is adequate for an objective regulatory technical support to CSN decision-making.
This paper also argues the need to develop diagnosis tools and methods for TSOs, so that they can perform their own computerized analyses to verify the quality, consistency and conclusions of day-to-day individual industry safety assessments, in a way that asserts the consistency of probabilistic and deterministic aspects. It is our opinion that this need could be framed in an international cooperative effort among national TSOs.
We have also shown that SCAIS is a powerful tool that carries out the CSN Integrated Safety Analysis. Its versatility and extensibility make it an appropriate set of tools that may be used either together or separately to perform different regulatory checks.

ACKNOWLEDGMENTS

This work is partially funded by the Spanish Ministry of Education & Science through the STIM project (ENE2006-12931/CON).

NOMENCLATURE

ISA Integrated Safety Analysis
SCAIS Simulation Codes System for Integrated Safety Assessment
PSA Probabilistic Safety Analysis
CSN Spanish Nuclear Safety Council
RIR Risk Informed Regulation
EOP Emergency Operation Procedure
SAMG Severe Accident Management Guide
DET Dynamic Event Tree
PVM Parallel Virtual Machine
BDD Binary Decision Diagram
TSO Technical Support Organization
MAAP Modular Accident Analysis Program
SDTPD Stimulus Driven Theory of Probabilistic Dynamics
RCS Reactor Coolant System
SIS Safety Injection System
PSRV Pressure Safety Relief Valve
TWPS Temperature of Water in the Primary System
XML Extensible Markup Language
MBLOCA Medium Break Loss of Coolant Accident
HRA Human Reliability Analysis
TH Thermal Hydraulic
TSD Theory of Stimulated Dynamics
PWR Pressurized Water Reactor
SQL Structured Query Language

REFERENCES

Esperón, J. & Expósito, A. (2008). SIMPROC: Procedures simulator for operator actions in NPPs. Proceedings of ESREL 08.
Herrero, R. (2003). Standardization of code coupling for integrated safety assessment purposes. Technical meeting on progress in development and use of coupled codes for accident analysis. IAEA, Vienna, 26–28 November 2003.

Hofer, E. (2002). Dynamic Event Trees for Probabilistic Safety Analysis. EUROSAFE Forum, Berlin.
IAEA (2007, September). Proposal for a Technology-Neutral Safety Approach for New Reactor Designs. Technical Report TECDOC 1570, International Atomic Energy Agency, http://wwwpub.iaea.org/mtcd/publications/pdf/te_1570web.pdf.
Izquierdo, J.M. & Cañamón, I. (2006). Status report on dynamic reliability: SDTPD path and sequence TSD developments. Application to the WP5.3 benchmark Level 2 PSA exercise. DSR/SAGR/FT 2004.074, SARNET PSA2 D73 [rev1].
Izquierdo, J. (2003a). An integrated PSA Approach to Independent Regulatory Evaluations of Nuclear Safety Assessments of Spanish Nuclear Power Stations. EUROSAFE Forum, Paris.
Izquierdo, J. (2003b). Consolidation plan for the Integrated Sequence Assessment (ISA) CSN Code system SCAIS. CSN internal report.
Muñoz, R. (1999). DENDROS: A second generation scheduler for dynamic event trees. M & C 99 Conference, Madrid.
Queral, C. (2006). Incorporation of stimulus-driven theory of probabilistic dynamics into ISA (STIM). Joint project of UPM, CSN and ULB funded by Spanish Ministry of Education & Science (ENE2006-12931/CON), Madrid.
Siu, N. & Hilsmeier, T. (2006). Planning for future probabilistic risk assessment research and development. In G.S. Zio (Ed.), Safety and Reliability for Managing Risk, ISBN 0-415-41620-5. Taylor & Francis Group, London.
SMAP Task Group (2007). Safety Margins Action Plan. Final Report. Technical Report NEA/CSNI/R(2007)9, Nuclear Energy Agency, Committee on the Safety of Nuclear Installations, http://www.nea.fr/html/nsd/docs/2007/csnir2007-9.pdf.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Using GIS and multivariate analyses to visualize risk levels and spatial
patterns of severe accidents in the energy sector

P. Burgherr
Paul Scherrer Institut (PSI), Laboratory for Energy Systems Analysis, Villigen PSI, Switzerland

ABSTRACT: Accident risks of different energy chains are analyzed by comparative risk assessment, based on
the comprehensive database ENSAD established by the Paul Scherrer Institut. Geographic Information Systems
(GIS) and multivariate statistical analyses are then used to investigate the spatial variability of selected risk
indicators, to visualize the impacts of severe accidents, and to assign them to specific geographical areas.
This paper demonstrates by selected case studies how geo-referenced accident data can be coupled with other
socio-economic, ecological and geophysical contextual parameters, leading to interesting new insights. Such an
approach can facilitate the interpretation of results and complex interrelationships, enabling policy makers to
gain a quick overview of the essential scientific findings by means of summarized information.

1 INTRODUCTION

Categorizing countries by their risk levels due to natural hazards has become a standard approach to assess, prioritize and mitigate the adverse effects of natural disasters. Recent examples of coordinated international efforts include studies like the "World Disasters Report" (IFRC 2007), "Natural Disaster Hotspots" (Dilley et al. 2005), and "Reducing Disaster Risk" (UNDP 2004), as well as studies by the world's leading reinsurance companies (Berz 2005; Munich Re 2003; Swiss Re 2003). Applications to the impacts of man-made (anthropogenic) activities range from ecological risk assessment of metals and organic pollutants (e.g. Critto et al. 2005; Zhou et al. 2007) to risk assessment of contaminated industrial sites (Carlon et al. 2001), conflicts over international water resources (Yoffe et al. 2003), climate risks (Anemüller et al. 2006), and environmental sustainability assessment (Bastianoni et al. 2008).

Accidents in the energy sector rank second (after transportation) among all man-made accidents (Hirschberg et al. 1998). Since the 1990s, data on energy-related accidents have been systematically collected, harmonized and merged within the integrative Energy-related Severe Accident Database (ENSAD) developed and maintained at the Paul Scherrer Institut (Burgherr et al. 2004; Hirschberg et al. 1998). Results of comparative risk assessment for the different energy chains are commonly expressed in a quantitative manner such as aggregated indicators (e.g. damage rates), cumulative risk curves in a frequency-consequence (F-N) diagram, or external cost estimates to provide an economic valuation of severe accidents (Burgherr & Hirschberg 2008; Burgherr & Hirschberg, in press; Hirschberg et al. 2004a).

Recently, ENSAD was extended to enable geo-referencing of individual accident records at different spatial scales, including regional classification schemes (e.g. IPCC world regions; sub-regions of oceans and seas; international organization participation), sub-national administrative divisions (ADMIN1, e.g. state or province; ADMIN2, e.g. county), statistical regions of Europe (NUTS) by Eurostat, location name (PPL, populated place), latitude and longitude (in degrees, minutes and seconds), or global grid systems (e.g. Marsden Squares, Maidenhead Locator System). The coupling of ENSAD with Geographic Information Systems (GIS) and geo-statistical methods allows one to visualize accident risks and to assign them to specific geographical areas, as well as to calculate new "risk layers", i.e. to produce illustrative maps and contour plots based on scattered observed accident data.

In the course of decision and planning processes, policy makers and authorities often rely on summarized information to gain a quick overview of a thematic issue. However, complex scientific results are often very difficult to communicate without appropriate visualization. Methods for global, regional or local mapping using GIS methods are particularly useful because they reveal spatial distribution patterns and link them to administrative entities, which allows planning and implementation of preventive measures, legal intervention or mitigation at the appropriate administrative level.
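The multi-scale geo-referencing described above can be sketched as a minimal record type plus a 10°-grid helper. Field names and the grid indexing below are illustrative assumptions, not the actual ENSAD schema, and the official Marsden Square numbering scheme is not reproduced.

```python
from dataclasses import dataclass

@dataclass
class AccidentRecord:
    """Minimal geo-referenced accident record (illustrative field names)."""
    country: str   # ADMIN0 level
    admin1: str    # first level below country, e.g. state or province
    lat: float     # latitude in decimal degrees
    lon: float     # longitude in decimal degrees

def grid_cell(lat, lon, size=10.0):
    """Row/column of the 10 x 10 degree cell containing a point.

    Marsden-style gridding for aggregation; this is not the official
    Marsden Square numbering.
    """
    row = int((lat + 90.0) // size)
    col = int((lon + 180.0) // size)
    return row, col

rec = AccidentRecord(country="Spain", admin1="Galicia", lat=42.9, lon=-9.3)
print(grid_cell(rec.lat, rec.lon))  # (13, 17)
```

Aggregating records by such cells is one simple way to turn scattered point events into the gridded counts that spatial interpolation methods work from.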

The aim of this paper is to present and discuss results of comparative risk assessment by means of so-called disaster maps. The selected case studies address different spatial scales and resolutions, as well as different analytical techniques to calculate aggregated risk indicators, based on summary statistics, multivariate statistical analyses to produce a single index, and spatial interpolation techniques.

2 METHODS

ENSAD concentrates on comprehensively covering severe, energy-related accidents, although other man-made accidents and natural catastrophes are also reviewed, but somewhat less explicitly. The database allows the user to make conclusive analyses and calculate technical risks, including frequencies and the expected extent of damages.
The compilation of a severe accident database is a complex, tedious and extremely time-consuming task. Therefore, it was decided in the beginning that ENSAD should build upon existing information sources. Its unique feature is that it integrates data from a large variety of primary data sources, i.e. information is merged, harmonized and verified.
The actual process of database building and implementation has been described in detail elsewhere (Burgherr et al. 2004; Hirschberg et al. 1998), thus only a brief summary of the essential steps is given here:

– Selection and survey of a large number of information sources.
– Collection of raw information that is subsequently merged, harmonized and verified.
– Information for individual accidents is assigned to standardized data fields and fed into ENSAD.
– The final ENSAD data is subjected to several cross-checks to keep data errors at the lowest feasible level.
– Comparative evaluations are then carried out, based on customized ENSAD queries.

2.1 Severe accident definition

In the literature there is no unique definition of a severe accident. Differences concern the actual damage types considered (e.g. fatalities, injured persons, evacuees or economic costs), the use of loose categories such as "people affected", and differences in damage thresholds to distinguish severe from smaller accidents. Within the framework of PSI's database ENSAD, an accident is considered to be severe if it is characterized by one or more of the following consequences (Burgherr et al. 2004; Hirschberg et al. 1998):

– 5 or more fatalities
– 10 or more injuries
– 200 or more evacuations
– a far-reaching ban on the consumption of food
– a release of at least 10000 tonnes of hydrocarbons
– the cleanup of a land or water surface of 25 km² or more
– economic damages of at least 5 million USD (year 2000 exchange rates).

2.2 Geo-referencing of accident data

The plotting of accident data on maps and the analysis of spatial distribution patterns require specific information about the position of an event. The geographic coordinates of accident locations were determined using detailed accident information stored in ENSAD as well as other resources such as the World Gazetteer (http://world-gazetteer.com/) or Google Earth (http://earth.google.com/). Additionally, accidents were assigned to a hierarchical organization of political geographic locations, referring to the country (ADMIN0) and the first level below (ADMIN1; e.g. state or province). All maps were produced using ArcGIS 9.2 software (ESRI 2006b). The basic GIS layer data for country boundaries were based on ESRI world data and maps (ESRI 2006a), data from the Geographic Information System of the European Commission (GISCO) (Eurostat 2005), and data from the U.S. Geological Survey (USGS 2004).

2.3 Evaluation period

The time period 1970–2005 was selected for the analysis of accident risks to ensure an adequate representation of the historical experience. In the case of the Chinese coal chain, only the years 1994–1999 were considered because annual editions of the China Coal Industry Yearbook were only available for these years. It could be shown that without access to this information source, data are subject to substantial underreporting (Burgherr & Hirschberg 2007).

2.4 Data analysis

This section describes the methodological details of the four selected examples that are presented in this publication. In the remainder of this paper they are referred to as case studies 1–4.
Case Study 1: The total fatalities per country of severe accidents (≥5 fatalities) in fossil energy chains (coal, oil, natural gas and Liquefied Petroleum Gas (LPG)) are plotted on a world map. Additionally, the top ten countries in terms of number of accidents, and the ten most deadly accidents, are also indicated.
Case Study 2: Geo-referenced data for all fatal oil chain accidents stored in ENSAD are mapped for the European Union (EU 27), candidate countries, and

the European Free Trade Association (EFTA). A multivariate risk score was then calculated for individual countries to analyze their differences in susceptibility to accident risks. The proposed risk score consists of four indicators:

– total number of accidents (accident proneness)
– total number of fatalities (accident gravity)
– maximum consequences (most deadly accident)
– fatality rate (expectation value expressed in fatalities per GWe·yr (Gigawatt-electric-year), using a generic load factor of 0.35)

Each variable was scaled between 0 and 1 using the equation x_ij = (z_ij − min_j)/(max_j − min_j), where z_ij is the value of the jth variable for the ith country, and min_j and max_j are the minimum and maximum values of the jth variable. The resulting matrix was then analyzed by means of Principal Components Analysis (PCA) (Ferguson 1998; Jolliffe 2002). Non-centred PCA was used because the first component (PC1) of such an analysis is always unipolar (Noy-Meir 1973), and can thus be used as a multivariate risk score (a lower value indicates a lower accident risk).
Case Study 3: Tanker oil spills in the European Atlantic, the Mediterranean Sea (including the entrance to the Suez Canal) and the Black Sea were analyzed because these maritime areas pertain to the geographical region addressed in Case Study 2. Additionally, large parts of the Northeast Atlantic (including the North Sea and the Baltic Sea), the Canary Current and the Mediterranean Sea belong to the so-called Large Marine Ecosystems (LME) that are considered ecologically sensitive areas (Hempel & Sherman 2003). Distribution patterns of oil spills (≥700 tonnes) were analyzed using the kriging technique, which is a geo-statistical method for spatial interpolation (Krige 1951; Matheron 1963). In a preliminary step, locations were assigned to a Marsden Square Chart, which divides the world into grids of 10 degrees latitude by 10 degrees longitude, because in some cases only approximate coordinates were available. Afterwards, the number of spills per Marsden Square was calculated, and coordinates were set to the center of each grid cell. Ordinary point kriging was then applied to compute a prediction surface for the number of spills to be expected in a particular region, i.e. to evaluate regional differences in susceptibility to accidental tanker spills. For methodological details on the applied kriging procedure see Burgherr (2007) and Burgherr & Hirschberg (2008).
Case Study 4: Severe (≥5 fatalities) accidents in the Chinese coal chain were analyzed at the province level (ADMIN1). Fatality rates were calculated for large state-owned and local mines, and for small township and village mines, respectively. The influence of mechanization levels in mining, as well as the number of major state-owned mines in different provinces, was investigated.

3 RESULTS AND DISCUSSION

The ENSAD database currently contains 21549 accident records, of which 90.2% occurred in the years 1970–2005. Within this period, 7621 accidents resulted in at least five fatalities, of which 31.1% were man-made, energy-related accidents. Table 1 summarizes the number of accidents and fatalities that occurred in fossil energy chains of different country groups. Results are separated for countries of the Organisation for Economic Co-operation and Development (OECD), the European Union (EU 27), and states that are not OECD members (non-OECD), due to the large differences in technological development, institutional and regulatory frameworks, and general safety culture of OECD and EU 27 versus non-OECD. Calculations were complemented by separate values for the Chinese coal chain, which has a significantly worse performance than all other non-OECD countries (Burgherr & Hirschberg 2007).

Table 1. Numbers of accidents (Acc) and fatalities (Fat) of severe (≥5 fatalities) accidents in fossil energy chains are given for OECD, EU 27, and non-OECD countries (1970–2005). For the coal chain, non-OECD w/o China (first line) and China alone (second line) are given separately.

                   Coal     Oil   Natural Gas   LPG   Total
OECD        Acc      81     174       103        59     417
            Fat    2123    3388      1204      1875    8590
EU 27       Acc      41      64        33        20     158
            Fat     942    1236       337       559    3074
Non-OECD    Acc     144     308        61        61     574
                   1363
            Fat    5360   17990      1366      2610   27326
                  24456
World total Acc    1588     482       164       120    2354
            Fat   31939   21378      2570      4485   60372

3.1 Main findings and insights of case studies

Case Study 1: Figure 1 shows the worldwide distribution of total fatalities per country of severe (≥5 fatalities) accidents in the period 1970–2005. The top ten countries in terms of total numbers of accidents are also indicated in Figure 1, as well as the ten most deadly accidents.
China was the most accident-prone country with 25930 fatalities, of which almost 95% occurred in 1363 accidents attributable to the coal chain. However, only 15 of these resulted in 100 or more fatalities.
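The min–max scaling and non-centred PCA used for the country risk score (Section 2.4, Case Study 2) can be sketched with NumPy. The indicator matrix below is hypothetical, purely to show the mechanics: rows are countries, columns are the four indicators in the order listed above.

```python
import numpy as np

# Hypothetical indicator matrix (rows: countries; columns: total accidents,
# total fatalities, maximum consequences, fatality rate per GWe*yr).
Z = np.array([
    [120.0, 900.0, 167.0, 1.20],
    [ 45.0, 300.0,  80.0, 0.60],
    [ 10.0,  50.0,  12.0, 0.15],
])

# Scale each indicator to [0, 1]: x_ij = (z_ij - min_j) / (max_j - min_j).
X = (Z - Z.min(axis=0)) / (Z.max(axis=0) - Z.min(axis=0))

# Non-centred PCA: take the SVD of X *without* subtracting column means,
# so the first component (PC1) is unipolar and usable as a risk score.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
risk_score = X @ Vt[0]        # projection of each country onto PC1
if risk_score.sum() < 0.0:    # fix the arbitrary sign of the SVD axis
    risk_score = -risk_score

print(np.round(risk_score, 3))  # lower value indicates lower accident risk
```

Because the data are non-negative and uncentred, all PC1 loadings share one sign, so after the sign fix the score orders countries from least to most accident-prone along a single axis.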
Figure 1. Individual countries are shaded according to their total numbers of severe accident fatalities in fossil energy chains for the period 1970–2005. Pie charts designate the ten countries with most accidents, whereas bars indicate the ten most deadly accidents. Country boundaries: © ESRI Data & Maps (ESRI 2006a).

Figure 2. Locations of all fatal oil chain accidents and country-specific risk scores of EU 27, accession candidate and EFTA countries for the period 1970–2005. Country boundaries: © EuroGeographics for the administrative boundaries (Eurostat 2005).

Figure 3. Individual geo-referenced oil spills for the period 1970–2005 are represented by different-sized circles corresponding to the number of tonnes released. Regional differences in susceptibility to accidents were analyzed by ordinary kriging, resulting in a prediction map of filled contours. The boundaries of the Large Marine Ecosystems (LME) are also shown. Country boundaries: © EuroGeographics for the administrative boundaries (Eurostat 2005).

In contrast, the cumulated fatalities of the non-OECD countries Philippines, Afghanistan, Nigeria, Russia, Egypt and India were strongly influenced by a few very large accidents that contributed a substantial share of the total (see Figure 1 for examples). Among the ten countries with most fatalities, only three belonged to the OECD, of which the USA and Turkey have been founding members since 1961, whereas Mexico gained membership more than three decades later, in 1994. The LPG chain contributed almost 50% of total fatalities in Mexico because of one very large accident in 1984 that resulted in 498 fatalities. In Turkey the coal chain had the highest share with 531 fatalities, or about 55%, of which 272 fatalities are due to a single accident, i.e. a methane explosion in a coal mine in north-western Turkey in 1992. Finally, the USA exhibited a distinctly different pattern compared to the other countries, with no extremely large accidents (only three out of 148 with more than 50 fatalities), and over 75% of accidents and 70% of fatalities taking place in the oil and gas chains.
Case Study 2: The map of Figure 2 shows the locations of all fatal oil chain accidents contained in ENSAD for EU 27, accession candidate and EFTA countries in the period 1970–2005. The calculated risk score is a measure of the heterogeneous accident
risk patterns among countries. Four countries had risk scores greater than 0.40, namely the United Kingdom, Italy, Norway and Turkey. This is largely due to the substantially higher values for total fatalities and maximum consequences, and to a lesser extent also for total accidents, compared to the respective average values for all countries. In contrast, fatality rates (i.e. fatalities per GWe·yr) of these four countries were clearly below the overall average. These findings support the notion that besides fatality rates other performance measures should be evaluated, particularly in the context of multi-criteria decision analysis (MCDA), because preference profiles may differ among stakeholders (e.g. Hirschberg et al. 2004b).

Figure 4. For individual provinces in China, average fatalities per Mt produced coal are given for severe (≥5 fatalities) accidents in large and small mines for the period 1994–1999. Provinces were assigned to three distinct levels of mechanization as indicated by their shading. Locations of major state-owned coal mines are also indicated on the map. Administrative boundaries and coal mine locations: © U.S. Geological Survey (USGS 2004).

Case Study 3: Geographic locations were available for 128 out of a total of 133 tanker accidents in the years 1970–2005 that occurred in the European Atlantic, the Mediterranean Sea and the Black Sea, each of which resulted in a spill of at least 700 t. The severity of the spills is divided into different spill size classes, based on the amount of oil spilled.
Figure 3 also provides results of ordinary kriging that are based on a spherical model for the fitted semi-variogram (SV) function. Cross-validation indicated that the model and the subsequently generated prediction map provided a reasonably accurate prediction (details on the methodology are given in Burgherr and Hirschberg (2008) and Matheron (1963)). The prediction map based on spatial interpolation by ordinary kriging provides useful information for the identification and assessment of regional differences in susceptibility to oil spills from tankers. Such an interpolated surface layer is also superior to a simple point representation because it enables estimates for areas with few or no sample points. The map clearly identifies several maritime regions that are particularly prone to accidental oil spills, namely the English Channel, the North Sea, the coast of Galicia, and the Eastern Mediterranean Sea. Finally, those spills were identified that occurred in ecologically sensitive areas because they are located within the boundaries of the Large Marine Ecosystems (LME) of the world. In
total, 82.8% of all mapped spills were located within LME boundaries. However, tankers can generally not avoid passing LMEs, because shipping routes are internationally regulated, and because LMEs also include large fractions of coastal areas that ships have to pass when approaching their port of destination. Therefore, recent efforts to reduce this share have focused on technical measures, for example improved navigation and the replacement of single hull tankers by modern double hull designs. The substantial decrease in total spill volume (also in these ecologically sensitive areas) since the 1990s reflects the achieved improvements of this ongoing process (Burgherr 2007).
Case Study 4: The Chinese coal chain is a worrisome case with more than 6000 fatalities every year, a fatality rate about ten times higher than in other non-OECD countries, and even about 40 times higher than in OECD countries (Burgherr & Hirschberg 2007). Average values of fatalities per million tonnes (Mt) of coal output for the years 1994–1999 showed a distinct variation among provinces (Figure 4). The figure also demonstrates how fatality rates are influenced by the level of mechanization in mining. When results of individual provinces were assigned to three groups according to differences in mechanization levels, average fatality rates were inversely related to mechanization, indicating that higher levels of mechanization in mining do contribute to overall mine safety. Furthermore, average fatality rates for large state-owned and local mines were significantly lower than for small township and village mines with very low safety standards. Finally, major state-owned mines exhibit predominantly high and medium levels of mechanization.

4 CONCLUSIONS

The coupling of ENSAD with a GIS-based approach has proven to be useful in determining spatial distribution patterns of accident risks for selected energy chains and country groups at global and regional scales. Multivariate statistical methods (PCA) and spatial interpolation techniques (kriging) have been successfully used to extract additional value from accident data and to produce illustrative maps of risk scores and contour plots. The potential benefits of such an approach are threefold: First, it allows the identification of spatial patterns including accident hotspots and extreme events. Second, powerful, flexible and understandable visualizations facilitate the interpretation of complex risk assessment results. Third, the results summarized by means of GIS provide a simple and comprehensive set of instruments to support policy makers in the decision and planning process.

ACKNOWLEDGEMENTS

The author thanks Drs. Stefan Hirschberg and Warren Schenler for their valuable comments on an earlier version of this manuscript. This study was partially performed within the Integrated Project NEEDS (New Energy Externalities Development for Sustainability, Contract No. 502687) of the 6th Framework Programme of the European Community.

REFERENCES

Anemüller, S., Monreal, S. & Bals, C. 2006. Global Climate Risk Index 2006. Weather-related loss events and their impacts on countries in 2004 and in a long-term comparison, Bonn/Berlin, Germany: Germanwatch e.V.
Bastianoni, S., Pulselli, F.M., Focardi, S., Tiezzi, E.B.P. & Gramatica, P. 2008. Correlations and complementarities in data and methods through Principal Components Analysis (PCA) applied to the results of the SPIn-Eco Project. Journal of Environmental Management 86: 419–426.
Berz, G. 2005. Windstorm and storm surges in Europe: loss trends and possible counter-actions from the viewpoint of an international reinsurer. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 363(1831): 1431–1440.
Burgherr, P. 2007. In-depth analysis of accidental oil spills from tankers in the context of global spill trends from all sources. Journal of Hazardous Materials 140: 245–256.
Burgherr, P. & Hirschberg, S. 2007. Assessment of severe accident risks in the Chinese coal chain. International Journal of Risk Assessment and Management 7(8): 1157–1175.
Burgherr, P. & Hirschberg, S. 2008. Severe accident risks in fossil energy chains: a comparative analysis. Energy 33(4): 538–553.
Burgherr, P. & Hirschberg, S. in press. A comparative analysis of accident risks in fossil, hydro and nuclear energy chains. Human and Ecological Risk Assessment.
Burgherr, P., Hirschberg, S., Hunt, A. & Ortiz, R.A. 2004. Severe accidents in the energy sector. Final Report to the European Commission of the EU 5th Framework Programme "New Elements for the Assessment of External Costs from Energy Technologies" (NewExt), Brussels, Belgium: DG Research, Technological Development and Demonstration (RTD).
Carlon, C., Critto, A., Marcomini, A. & Nathanail, P. 2001. Risk based characterisation of contaminated industrial site using multivariate and geostatistical tools. Environmental Pollution 111: 417–427.
Critto, A., Carlon, C. & Marcomini, A. 2005. Screening ecological risk assessment for the benthic community in the Venice Lagoon (Italy). Environment International 31: 1094–1100.
Dilley, M., Chen, R.S., Deichmann, U., Lerner-Lam, A.L., Arnold, M., Agwe, J., Buys, P., Kjekstad, O., Lyon, B. & Yetman, G. 2005. Natural disaster hotspots. A global risk analysis. Disaster Risk Management Series No. 5, Washington D.C., USA: The World Bank, Hazard Management Unit.

ESRI 2006a. ArcGIS 9: ESRI Data & Maps, Redlands (CA), USA: Environmental Systems Research Institute Inc. (ESRI).
ESRI 2006b. ArcGIS 9: using ArcGIS Desktop, Redlands (CA), USA: Environmental Systems Research Institute Inc. (ESRI).
Eurostat 2005. Guidelines for geographic data intended for the GISCO Reference Database, Witney, UK: Lovell Johns Ltd.
Ferguson, C.C. 1998. Techniques for handling uncertainty and variability in risk assessment models, Berlin, Germany: Umweltbundesamt.
Hempel, G. & Sherman, K. (Eds.) 2003. Large Marine Ecosystems of the world: trends in exploitation, protection, and research, Amsterdam: Elsevier B.V.
Hirschberg, S., Burgherr, P., Spiekerman, G. & Dones, R. 2004a. Severe accidents in the energy sector: comparative perspective. Journal of Hazardous Materials 111(1–3): 57–65.
Hirschberg, S., Dones, R., Heck, T., Burgherr, P., Schenler, W. & Bauer, C. 2004b. Sustainability of electricity supply technologies under German conditions: a comparative evaluation. PSI Report No. 04-15, Villigen PSI, Switzerland: Paul Scherrer Institut.
Hirschberg, S., Spiekerman, G. & Dones, R. 1998. Severe accidents in the energy sector: first edition. PSI Report No. 98-16, Villigen PSI, Switzerland: Paul Scherrer Institut.
IFRC (International Federation of the Red Cross and Red Crescent Societies) 2007. World Disasters Report 2007, Bloomfield, USA: Kumarian Press Inc.
Jolliffe, I.T. 2002. Principal Component Analysis. Springer Series in Statistics, 2nd ed., New York, USA: Springer.
Krige, D.G. 1951. A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South Africa 52: 119–139.
Matheron, G. 1963. Traité de Géostatistique Appliquée. Tome II: Le krigeage, Paris, France: Bureau de Recherches Géologiques et Minières.
Munich Re 2003. NatCatSERVICE: a guide to the Munich Re database for natural catastrophes, Munich, Germany: Munich Re.
Noy-Meir, I. 1973. Data transformations in ecological ordination. Some advantages of non-centering. Journal of Ecology 61: 329–341.
Swiss Re 2003. Natural catastrophes and reinsurance, Zurich, Switzerland: Swiss Reinsurance Company.
UNDP (Bureau for Crisis Prevention and Recovery) 2004. Reducing disaster risk. A challenge for development. New York: UNDP, Bureau for Crisis Prevention and Recovery.
USGS 2004. Open-File Report 01-318: Coal geology, land use, and human health in the People's Republic of China (http://pubs.usgs.gov/of/2001/ofr-01-318/), Reston (VA), USA: U.S. Geological Survey.
Yoffe, S., Wolf, A.T. & Giordano, M. 2003. Conflict and cooperation over international freshwater resources: Indicators of basins at risk. Journal of the American Water Resources Association 39(5): 1109–1126.
Zhou, F., Huaicheng, G. & Hao, Z. 2007. Spatial distribution of heavy metals in Hong Kong's marine sediments and their human impacts: a GIS-based chemometric approach. Marine Pollution Bulletin 54: 1372–1384.


Weak signals of potential accidents at ‘‘Seveso’’ establishments

Paolo A. Bragatto, Patrizia Agnello, Silvia Ansaldi & Paolo Pittiglio


ISPESL, Research Centre via Fontana Candida, Monteporzio Catone (Rome), Italy

ABSTRACT: At industrial facilities where the legislation on major accident hazards is enforced, near misses, failures and deviations, even those without consequences, should be recorded and analyzed for an early identification of factors that could precede accidents. In order to provide duty-holders with tools for capturing and managing these weak signals coming from operations, a software prototype, named NOCE, has been developed. A client-server architecture has been adopted, in which a palmtop computer is connected to an "experience" database at the central server. Operators record any non-conformance via the palmtop, so as to build a "plant operational experience" database. Non-conformances are matched with the safety system in order to find breaches in the safety system and eventually remove them. A digital representation of the plant and its safety systems is developed, step by step, by exploiting data and documents that are already required by the legislation for the control of major accident hazards, so no extra work is required of the duty holder. The plant safety digital representation is used for analysing non-conformances. The safety documents, including the safety management system, safety procedures, the safety report and the inspection program, may be reviewed according to the "plant operational experience".

1 INTRODUCTION

At industrial facilities where a major accident hazard is posed, for every accident with fatalities or severe injuries there are tens of near-misses with consequences only for equipment, and hundreds of failures that cause only a small loss of production, as well as procedures not well understood or applied, with minor consequences. Furthermore, over the life of an establishment, thousands of non-conformances are usually reported, both for equipment and for procedures. The benefit of a good program for analysing and managing near-misses, failures and non-conformances has been widely demonstrated in many papers, including Hurst et al. (1996), Ashford (1997), Jones et al. (1999), Phimister et al. (2003), Basso et al. (2004), Uth et al. (2004) and Sonnemans et al. (2006). There are far more failures, trivial incidents and near-misses than severe accidents. Furthermore, failures and deviations are easy to identify, understand and control. They are small in scale and relatively simple to analyze and resolve. In near-misses, unlike accidents with fatalities, the actual accident causes are not hidden to avoid a potential prosecution, as demonstrated by Bragatto et al. (2007) and Agnello et al. (2007). At major accident hazard installations, near-misses, failures and deviations give ''weak signals'', which have the potential to warn the operator before an accident happens. Therefore, by addressing these precursors effectively, major accidents may be prevented, and the consequent fatalities, severe injuries, asset damage and production losses may be avoided.
An effective system for reporting any non-conformance, even one without consequences, is essential for efficient safety management. Furthermore, active monitoring of equipment and procedures should be planned to find warning signals before an incident or a failure occurs. This monitoring strategy should include inspections of safety-critical plants, equipment and instrumentation, as well as assessment of compliance with training, instructions and safe working practices.
The events potentially useful for detecting early signals of something going wrong in the safety system include deviations, anomalies, failures, near-misses and incidents. These events should be recorded in a database. Attention to near-misses and minor incidents should include investigation, analysis and follow-up, to ensure that the lessons learnt are applied to future operation. For failures, deviations and minor incidents, corrective and preventive measures have to be ensured.
A rough classification of non-conformances is useful for addressing the follow-up. In the case of a human or organizational failure, audits on the safety management system should be intensified and operational procedures reviewed. In the case of a mechanical failure, the inspection plan should be reviewed, to prevent

equipment deterioration, and the potential consequences of the failure should be considered.


2 WEAK SIGNALS FOR POTENTIAL ACCIDENTS

Many operators suppose that non-conformances have to be recorded only in the case of a loss of hazardous materials, a production stop or damaged equipment; on the contrary, every little anomaly, defect, deviation or minor failure should be taken into account, as even a trivial event could be the very first precursor of, or a potential contributor to, an accident. Personnel should be encouraged not to discriminate between significant and trivial events, as any event is potentially useful for finding latent conditions which might lead to an accident much later. That could also motivate personnel to take responsibility and stewardship.
The inspiration for this research comes from the widespread use of personal wireless telecommunication equipment, such as palmtops, even on industrial premises. Palmtops can be exploited for immediately recording any non-conformance detected in the plant, by writing notes, descriptions or comments, but also by capturing the event in pictures. The analysis of recorded non-conformances is very useful for the early detection of conditions which might lead to an accident. For this purpose it is essential to have sound methodologies, supported by adequate tools, for understanding whether a single non-conformance could open a breach in the safety barriers, and for finding weak points of the safety system which could be improved. As the non-conformance perturbs the safety system, it has to be notified quickly, looking for a solution.
For the follow-up of near-misses and anomalies, many models have been proposed in the literature. The systemic approach is quite new for this matter. An adequate model of industrial safety systems may be found in just a few recent papers, including Beard & Santos-Reyes (2003), Santos-Reyes & Beard (2003), Beard (2005) and Santos-Reyes & Beard (2008). The development of a general systemic methodology for finding latent accident precursors would indeed be an overwhelming task. As the scope of the research is restricted to facilities where major accident legislation is enforced, and the goal is a demonstrative prototype, the effort is instead affordable. The European legislation on major accident hazards (the ''Seveso'' legislation) defines a framework which structures the safety system along the lifecycle of the plant, including hazard identification and risk analysis, safety policy and management, operational procedures, emergency management and periodic safety audits. Furthermore, in most plants a scrutiny of the equipment is performed, according to the best-known hazard identification and analysis methods, such as the IEC 61882 HAZOP method (IEC 2001) or possibly the Mond Fire, Explosion and Toxicity index (Lewis 1979), which is less probing, but also less difficult. In other words, at a Seveso facility a structured and documented safety system is always present, and it is definitively ruled by regulations and standards.
At a ''Seveso'' establishment there is already a quite complex system to manage risks, with its tools, documents and data, which are described in many papers, including Fabbri et al. (2005) and OECD (2006). For that reason, the first objective of the research has been to organize and better exploit the information already present, minimizing the effort for the operator. No new models have to be implemented; rather, the items already present in the safety system have to be exploited. The results of non-conformance and deviation analysis should be completely transferred into the safety procedures, the SMS manual and the safety report, in order to improve the overall risk management of the establishment.


3 THE PROPOSED APPROACH

A new approach is proposed here to fill the gap between safety documents and operational experience. According to the standards and regulations enforced at major hazard facilities, the safety system is well defined and may be digitally represented. The basic idea is to exploit this representation for analyzing non-conformances and improving the safety system. The model is suitable for unscholarly operators, including depot workers.
The backbone of the proposed system is the digital representation of the equipment, as used in operation. In the present paper the equipment representation has been integrated with a digital representation of the safety system, tailored expressly for small-sized ''Seveso'' plants.
The potential of digital models coming from computer-aided design systems for supporting hazard analysis has been demonstrated by Venkatasubramanian et al. (2000), Chung et al. (2001), Zhao et al. (2005) and Bragatto et al. (2007). In our approach, the scrutiny of the plant required by the HAZID and HAZAN methods is exploited to build, step by step, a plausible representation of the plant. The hierarchical structure has basically five nested levels: Facility, Unit, Assembly, Component, and Accessory or Instrument.
Furthermore, as a result of the hazard and consequence analysis, critical components and accessories are discriminated. Components and accessories are considered critical if their failures or anomalies are in a chain of single events which could lead to a major accident. They are tagged and linked to the single event, present in the top events list, as handled in
the SR. In this way the top events list is embedded in the digital representation of the plant.
At the end, the emergency plan may be included in the net, too. For each major event found by the hazard analysis, the plan should have an action or a sequence of actions. In this way any emergency action is linked to an event which is in the top events list. Furthermore, the actions usually require operations to be done on accessories (e.g. valves), which have to be included in the plant digital representation. It should anyway be stressed again that this whole representation of the plant does not require any new duty from the plant holder.
Just the SR and the safety manual, which are mandatory according to the Seveso legislation, have been used. The pieces of information, usually lost after the job of preparing the mandatory safety documents, are stored in a quite structured way, in order to have a complete representation of the whole system, including the equipment, the hazards ranked according to the Mond index, the list of the top events, the sequences of failures that lead to a top event, the sequence of actions in the emergency plan, the SMS and the operating manual.
In this structured digital environment, the non-conformances shall be recorded. The backward path from the single event (failure, near-miss or accident) to the safety system is supported by the complete digital representation of the plant. When a failure or near-miss is reported, it is connected to a piece of equipment (component or accessory) present in the plant digital representation.
There are many links between equipment and safety documents which may be used. Namely, any single item of the Mond checklist is linked to one or more components or assemblies, any event in a chain leading to a top event is linked to a component, and any action of a safety procedure is linked to an accessory or a component. A few components could be found without a direct link to safety documents; in this case, the parent assembly or unit will be considered. In other words, the path of the equipment tree will be taken backward, in order to retrieve the pieces of information about the assembly or the unit affected by the failure. If the failed item is linked to a chain which could lead to a top event, it shall be noticed.
If the non-conforming item corresponds to a credit factor in the Mond index method, the credit will be suspended until the non-conformance is removed.
The basic result of this continuous practice of reporting any non-conformance that happens somewhere in the plant, and of walking it through the plant digital safety representation, is to have, after a short time, the safety documents (basically the Mond checklist, the top event sequences and the safety procedures) annotated with a large number of notes, which should be used for the periodic reviewing.

Figure 1. The proposed digital model for the management of non conformances.


4 IMPLEMENTATION OF THE PROPOSED MODEL

In order to spread the proposed methodology, a software prototype named NOCE (NO-Conformance Events analysis) has been developed. A ''client-server'' architecture has been adopted, in order to have a palmtop computer in the field connected to an ''experience'' database on the central server.
The basic objectives of the software prototype are:

• to keep track of the present event that happened in the plant, by collecting all feasible information and taking into account all the events that occurred in the past in the same context;
• to access the Safety Management System directly for updating the documents, with the possibility to properly look at the risk analysis results;
• to evaluate the new event in the appropriate environment, for instance in safety management or in risk analysis.

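To make the backward path concrete, the five-level hierarchy and its document links can be sketched as a simple tree. This is an illustrative assumption, not the actual NOCE data model; all names (`PlantItem`, `doc_links`, the example items) are hypothetical.

```python
# Illustrative sketch (not the actual NOCE data model): a nested plant
# tree whose nodes may carry links to safety-document items. Walking the
# tree backward from a failed item returns the links of the item itself
# or, failing that, of its nearest ancestor (parent assembly or unit).

class PlantItem:
    def __init__(self, name, level, parent=None):
        self.name = name
        self.level = level        # Facility, Unit, Assembly, Component, Accessory
        self.parent = parent
        self.doc_links = []       # e.g. Mond checklist items, top-event chains

def linked_documents(item):
    """Walk the equipment tree backward until linked documents are found."""
    node = item
    while node is not None:
        if node.doc_links:
            return node.doc_links
        node = node.parent        # fall back to the parent assembly or unit
    return []

# Example: a valve with no direct link inherits the links of its unit.
facility = PlantItem("Depot", "Facility")
unit = PlantItem("LPG storage", "Unit", facility)
unit.doc_links = ["Mond checklist item 12", "Top event TE-03 chain"]
assembly = PlantItem("Pump skid", "Assembly", unit)
valve = PlantItem("V-101", "Accessory", assembly)

print(linked_documents(valve))
# -> ['Mond checklist item 12', 'Top event TE-03 chain']
```

If the returned links include a top-event chain, the non-conformance would be flagged for deeper analysis, mirroring the procedure described above.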
In order to define an adequate architecture for the software, the workflow has been outlined. In the workflow two phases have been defined. The first phase is basically the event recording, and it is supposed to be performed by the single worker who detects the failure or the non-conformance. The worker, or eventually the squad leader, is enabled to record the event directly by means of a palmtop system. Immediate corrective actions should be recorded too. The event is associated with a component, which is supposed to be present in the plant database. The worker may verify whether other failures have been recorded in the past. The very basic data are introduced by the worker; the follow-up of this information is instead managed on the desktop, by a safety supervisor. As the safety manual addresses basic and general issues, it always has to be verified, as the very first follow-up. This check is required for any type of non-conformance, but it has to be deepened in the case of a human or organizational failure, which could require intensifying audits on the safety management system or reviewing safety procedures. In the case of a mechanical failure, active asset monitoring should be intensified to prevent equipment deterioration. As the last step, the safety report has to be considered, as well as the study about hazard identification and risk assessment, including the checklist and the HAZOP when present. The workflow described above is summarized by the ''flow-chart'' shown in figure 2.

4.1 Software architecture

The architecture of the proposed system is basically composed of an application for workers, who have to record events, and an application for safety managers, who have to analyze events and update the safety documents. The application for the worker side is supposed to run on a palmtop computer featuring Windows CE, while the application for the safety manager side runs on a desktop computer featuring Windows XP. The palmtop application is aimed at recording the non-conformances, retrieving past events from the database and uploading new events into the database. The desktop application has all the functionalities needed to analyze newly recorded events and to access the databases, which contain the equipment digital representation, the structured safety documents in digital format and the recorded non-conformances. The following paragraphs describe the modules in more detail.

4.2 Palmtop application

4.2.1 Event information
As explained above, an event is considered something that happened in the facility: any weak signal coming from operational experience, such as failures, incidents and near-misses, which may involve specific components, a logic or process unit, but also a procedure of the safety management system. For this reason the event has to be associated with an item of the establishment, which may be an equipment component or a procedure.
The worker is required to input the basic information about the event. The event is characterized by some identifiers (e.g. worker's name, progressive id number, date and time). Required information includes:

• a very synthetic description;
• the type of cause, which has to be selected from a predefined list;
• the type of consequence, which has to be selected from a predefined list;
• the immediate corrective action (of course the planned preventive actions have to be approved in the next phase by the safety manager).

Before recording any event, the worker is required to connect to the experience database and to look for precedents, in order to find analogies with the present case. The form is shown in figure 3. As an equipment database is supposed to be already present in the establishments, the browsing is driven by the non-conformance identifiers, which are linked to the equipment component identifier or, eventually, to a procedure identifier.

Figure 2. The basic architecture of the NOCE prototype.

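The event record described above can be sketched as a small data structure. The field names, the predefined lists and the validation are illustrative assumptions, not the prototype's actual schema.

```python
# Hypothetical sketch of the palmtop event record; field names and the
# predefined cause/consequence lists are assumed for illustration only.
from dataclasses import dataclass
from datetime import datetime

CAUSE_TYPES = {"mechanical", "human", "organizational"}
CONSEQUENCE_TYPES = {"none", "loss_of_product", "equipment_damage"}

@dataclass
class EventRecord:
    worker: str               # identifiers: worker's name, ...
    event_id: int             # ... progressive id number, ...
    timestamp: datetime       # ... date and time
    item_id: str              # equipment component or procedure identifier
    description: str          # very synthetic description
    cause: str                # selected from a predefined list
    consequence: str          # selected from a predefined list
    corrective_action: str    # immediate action; preventive ones need approval

    def __post_init__(self):
        # The prototype requires all the fields to be filled in.
        if not all([self.worker, self.item_id, self.description,
                    self.corrective_action]):
            raise ValueError("all fields must be filled in")
        if self.cause not in CAUSE_TYPES or self.consequence not in CONSEQUENCE_TYPES:
            raise ValueError("cause/consequence must come from the predefined lists")
```

A record that leaves the description empty, or picks a cause outside the predefined list, is rejected before it reaches the experience database.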
Figure 3. Consulting precedents related to a non-conforming component.

Figure 4. Recording a non conformance.

4.2.2 Event recording form
The recording phase corresponds to filling out the event form, introducing the information which characterizes the event, as shown in figure 4. The prototype requires all the fields to be filled in, in order to have a complete description. After a new event has been sent to the database, it will be analyzed.

4.3 Desktop application

An analysis of the event (accident or near-miss) is already contained in the recording format, which is provided within the safety management system. The key of these methods is basically the reconstruction of the event: the analysis follows the steps that allowed the event to continue, did not prevent it from evolving, or even contributed to it.
In order to better learn the lessons from the events, it is essential to transfer into the safety documentation the new information or knowledge coming from unexpected events. The proposed extension is basically a discussion of the event in comparison with the relevant points in the safety documentation (SR, SMS, active monitoring), just where the equipment or components involved in the event are cited.

4.3.1 Safety management system
In the analysis phase, the first step is to verify the safety management system manual. The access to this module is mandatory for the next steps. It reflects the action required by the Seveso legislation for updating the documents. The first action required is to identify the paragraph in the SMS manual for introducing the link with the event that occurred, and eventually to select the procedure or the operational instruction involved. In this phase both general safety procedures and operational procedures have to be checked. Every operational procedure is linked to a few pieces of equipment (e.g. valves, pumps or engines). If, in the plant digital model, the piece of equipment is linked to a procedure, it is possible to go back from the failure to the procedure through the equipment component.

In this way a detected non-conformance or failure may generate a procedure revision.

4.3.2 Safety report
The following steps are the reviewing of the safety report and its related documents. These two modules are aimed at discussing the follow-up of the event. They exploit a hierarchical representation of the establishment, where equipment components, with their accessories, are nested in installations, which are nested in units, which are nested in plants. In the safety report, the point related to the failed components, or to the affected plant unit, will be highlighted in order to be analyzed. In many cases it will be enough to better understand the risk documents; in a few cases a revision could be required. A tag will be added in the safety report. This tag will be considered when the safety report is reviewed or revised.

4.3.3 Active monitoring intensification
In the case of a failure, the inspection plan may be reviewed, in order to intensify active monitoring and prevent equipment deterioration. Affected accessories, components and units are found in the component database and proposed for maintenance. In the case of unexpected failures of mechanical parts, the database may be used to find in the plant all the parts which comply with a few criteria (e.g. same type or same constructor). Extraordinary inspections may be considered, as well as a reduced inspection period. The potential consequences of the failure may also be considered for intensifying inspections.

4.3.4 Session closing
A discussion session may be suspended and resumed many times. At the end, the discussion may be definitively closed. After the discussion is closed, the lessons learnt are put in the DB and may be retrieved both from the palmtop and from the desktop.

4.4 A case study
The NOCE prototype has been tested at a small-to-medium sized facility, which had good ''historical'' collections of anomalies and near-misses. These data have been used to build an adequate experience database. The following documents have been found in the safety reports:

• index method computation;
• hazard identification;
• list of top events;
• fault tree and event tree for potential major accidents.

They have been used to develop the plant safety digital model. Both technical and procedural failures have been studied, using the prototype. The analysis determined many annotations in the safety report and in the related documents, as well as in the procedures of the safety management system. In figure 5 the list of non-conformances is shown, as presented in the NOCE user interface.
In figure 6 a sample of a NOCE window is shown. In the window each single pane is marked by a letter. The order of the letters from A to F follows the logical flow of the analysis. In A the digital representation of the plant is shown; the affected component

Figure 5. The list of near misses, failures and deviation, as presented in the NOCE graphical user interface.

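The search for similar parts after a mechanical failure (section 4.3.3) amounts to a simple query over the component database. The sketch below uses an assumed in-memory inventory and field names; it is not the NOCE database.

```python
# Illustrative query (assumed data, not the NOCE component database):
# after a mechanical failure, find all plant parts matching a few
# criteria (e.g. same type or same constructor) and propose them for
# extraordinary inspection.

parts = [
    {"id": "P-01", "type": "centrifugal pump", "constructor": "ACME"},
    {"id": "P-02", "type": "centrifugal pump", "constructor": "ACME"},
    {"id": "V-07", "type": "ball valve",       "constructor": "ACME"},
]

def similar_parts(failed, inventory, keys=("type", "constructor")):
    """Return parts sharing at least one criterion with the failed part."""
    return [p for p in inventory
            if p["id"] != failed["id"]
            and any(p[k] == failed[k] for k in keys)]

print([p["id"] for p in similar_parts(parts[0], parts)])
# -> ['P-02', 'V-07']
```

Here P-02 matches by type and constructor, while V-07 matches by constructor only; both would be candidates for an intensified inspection schedule.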
Figure 6. A. The plant tree representation. B. The event data, as recorded via palmtop. C. The computation of the Mond FE&T index. D. Basic representation of the top event. E. The paragraph of the Safety Report that has been tagged. F. The lesson learnt from the non-conformance.

is highlighted. In B the event data, as recorded via palmtop, are shown. The computation of the Mond FE&T index is shown in pane C. The basic representation of the event chain that leads to the top event is shown in D. From left to right there are three sub-panes, with the initiating event, the top event and the event sequence which has the potential to give a top event. The paragraph of the Safety Report where the component is mentioned is shown in pane E. The paragraph has been tagged for a future review. The lessons learnt from the non-conformance are shown in F.


5 CONCLUSIONS

The software presented in this paper exploits the ''plant safety'' digital representation for reporting and analyzing anomalies and non-conformances, as recorded in the operational experience, as well as for updating the safety report, the safety management system and related documents. Building the digital representation requires no extra duties, but only a smarter exploitation of the documents that are compelled by the major accident hazard legislation.
The case study is quite simple. In a more complex establishment, such as a process plant, the number of failures and non-conformances is huge, and many workers have to be involved in the job of recording the failures and non-conformances as soon as possible. Their cooperation is essential for a successful implementation of the proposed system. Furthermore, advanced mobile IT technologies could be exploited to achieve good results even in a complex industrial environment.


ACKNOWLEDGEMENTS

The authors wish to thank Mr Andrea Tosti for his professional support in implementing the palmtop software.


REFERENCES

Lewis, D.J. (1979). The Mond fire, explosion and toxicity index—a development of the Dow Index. AIChE Loss Prevention, New York.

Hurst, N.W., Young, S., Donald, I., Gibson, H. & Muyselaar, A. (1996). Measures of safety management performance and attitudes to safety at major hazard sites. J. of Loss Prevention in the Process Industries 9(2), 161–172.
Ashford, N.A. (1997). Industrial safety: The neglected issue in industrial ecology. Journal of Cleaner Production 5(1–2), 115–121.
Jones, S., Kirchsteiger, C. & Bjerke, W. (1999). The importance of near miss reporting to further improve safety performance. J. of Loss Prevention in the Process Industries 12(1), 59–67.
Venkatasubramanian, V., Zhao, J. & Viswanathan, V. (2000). Intelligent systems for HAZOP analysis of complex process plants. Computers & Chemical Engineering 24, 2291–2302.
Chung, P.W.H. & McCoy, S.A. (2001). Trial of the ''HAZID'' tool for computer-based HAZOP emulation on a medium-sized industrial plant. HAZARDS XVI: Analysing the past, planning the future. Institution of Chemical Engineers, IChemE Symposium Series 148, 391–404.
IEC (2001). Hazard and Operability Studies (HAZOP Studies)—Application Guide, IEC 61882, 1st ed., Geneva.
Phimister, J.R., Oktem, U., Kleindorfer, P.R. & Kunreuther, H. (2003). Near-Miss Incident Management in the Chemical Process Industry. Risk Analysis 23(3), 445–459.
Beard, A.N. & Santos-Reyes, J.A. (2003). Safety Management System Model with Application to Fire Safety Offshore. The Geneva Papers on Risk and Insurance 28(3), 413–425.
Santos-Reyes, J.A. & Beard, A.N. (2003). A systemic approach to safety management on the British Railway system. Civil Eng. and Env. Syst. 20, 1–21.
Basso, B., Carpegna, C., Dibitonto, C., Gaido, G. & Robotto, A. (2004). Reviewing the safety management system by incident investigation and performance indicators. J. of Loss Prevention in the Process Industry 17(3), 225–231.
Uth, H.J. & Wiese, N. (2004). Central collecting and evaluating of major accidents and near-miss-events in the Federal Republic of Germany—results, experiences, perspectives. J. of Hazardous Materials 111(1–3), 139–145.
Beard, A.N. (2005). Requirements for acceptable model use. Fire Safety Journal 40(5), 477–484.
Zhao, C., Bhushan, M. & Venkatasubramanian, V. (2005). PHASuite: An automated HAZOP analysis tool for chemical processes. Process Safety and Environmental Protection 83(6B), 509–548.
Fabbri, L., Struckl, M. & Wood, M. (Eds.) (2005). Guidance on the Preparation of a Safety Report to meet the requirements of Dir 96/82/EC as amended by Dir 03/105/EC. EUR 22113, Luxembourg: EC.
Sonnemans, P.J.M. & Korvers, P.M.W. (2006). Accidents in the chemical industry: are they foreseeable? J. of Loss Prevention in the Process Industries 19, 1–12.
OECD (2006). Working Group on Chemical Accidents. Survey on the use of safety documents in the control of major accident hazards. ENV/JM/ACC 6, Paris.
Agnello, P., Ansaldi, S., Bragatto, P. & Pittiglio, P. (2007). The operational experience and the continuous updating of the safety report at Seveso establishments. In Dechy, N. & Cojazzi, G.M. (eds), Future Challenges of Accident Investigation, 33rd ESReDA seminar, EC-JRC-IPS.
Bragatto, P., Monti, M., Giannini, F. & Ansaldi, S. (2007). Exploiting process plant digital representation for risk analysis. J. of Loss Prevention in the Process Industry 20, 69–78.
Santos-Reyes, J.A. & Beard, A.N. (2008). A systemic approach to managing safety. J. of Loss Prevention in the Process Industries 21(1), 15–28.

Dynamic reliability

A dynamic fault classification scheme

Bernhard Fechner
FernUniversität in Hagen, Hagen, Germany

ABSTRACT: In this paper, we introduce a novel and simple fault-rate classification scheme in hardware. It is based on the well-known threshold scheme, counting ticks between faults. The innovation is to introduce variable threshold values for the classification of fault rates and a fixed threshold for permanent faults. In combination with field data obtained from 9728 processors of an SGI Altix 4700 computing system, a proposal for the frequency-over-time behavior of faults results, experimentally justifying the assumption of dynamic and fixed threshold values. A pattern matching classifies the fault-rate behavior over time. From the behavior a prediction is made. Software simulations show that fault rates can be forecast with 98% accuracy. The scheme is able to adapt to and diagnose sudden changes of the fault rate, e.g. a spacecraft passing a radiation-emitting celestial body. By using this scheme, fault coverage and performance can be dynamically adjusted during runtime. For validation, the scheme is implemented using different design styles, namely Field Programmable Gate Arrays (FPGAs) and standard cells. Different design styles were chosen to cover different economic demands. From the implementation, characteristics like the length of the critical path, capacity and area consumption result.
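As a rough illustration of the underlying idea only — not the History Voting implementation described in this paper — a classifier can count ticks between detected faults and compare each gap against a fixed threshold for suspected permanent faults and adjustable thresholds for the rate classes. All threshold values and labels below are assumptions.

```python
# Minimal sketch of the tick-counting idea, not the paper's scheme:
# very short inter-fault gaps at or below a fixed threshold suggest a
# permanent fault, while variable thresholds grade the fault rate.

PERMANENT_GAP = 2          # fixed threshold in ticks (assumed value)

def classify_gaps(gaps, low=100, high=1000):
    """Classify inter-fault gaps; `low`/`high` are adjustable thresholds."""
    labels = []
    for g in gaps:
        if g <= PERMANENT_GAP:
            labels.append("permanent?")
        elif g < low:
            labels.append("high rate")
        elif g < high:
            labels.append("medium rate")
        else:
            labels.append("low rate")
    return labels

print(classify_gaps([1, 50, 500, 5000]))
# -> ['permanent?', 'high rate', 'medium rate', 'low rate']
```

Making `low` and `high` adapt to the recent history, rather than keeping them constant, is what distinguishes a dynamic scheme from the classic fixed-threshold approach.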

1 INTRODUCTION

The performance requirements of modern microprocessors have increased proportionally to their growing number of applications. This has led to an increase of clock frequencies up to 4.7 GHz (2007) and to an integration density of less than 45 nm (2007). The Semiconductor Industry Association roadmap forecasts a minimum feature size of 14 nm [6] by 2020. Below 90 nm a serious issue occurs at sea level, previously known only from aerospace applications [1]: the increasing probability that neutrons cause Single-Event Upsets in memory elements [1, 3]. The measurable radiation at sea level consists of up to 92% neutrons from outer space [2]. The peak value is 14400 neutrons/cm²/h [4]. Figure 1 shows the number of neutron impacts per hour per square centimeter for Kiel, Germany (data from [5]).

Figure 1. Number of neutron impacts per hour for Kiel, Germany (2007).

This work deals with the handling of faults after they have been detected (fault diagnosis). We do not use the fault rate for the classification of fault types such as transient, intermittent or permanent faults. We classify the current fault rate and forecast its development. On this basis the performance and fault coverage can be dynamically adjusted during runtime. We call this scheme History Voting.
The rest of this work is organized as follows: in Section 2, we present and discuss observations on fault rates in real-life systems. In Section 3, we discuss related work. Section 4 introduces History Voting. We seamlessly extend the scheme to support multiple operating units used in multicore or multithreaded systems. The scheme was modeled in software and its behavior simulated under the influence of faults. Section 5 presents experimental results and resource demands from standard-cell and FPGA (Field Programmable Gate Array) implementations. Section 6 concludes the paper.

2 OBSERVATIONS

Figure 2 shows the number of transient single-bit errors (x-axis) for 193 systems and the number of systems which show faulty behavior (y-axis) over 16 months [7]. For many systems the number of faults is small. Few systems encounter an increased fault rate. From this, intermittent or permanent faults can be concluded.

Figure 3 shows the daily number of transient single-bit memory errors for a single system [7]. The faults within the first seven months were identified as transient. The first burst of faults appears at the beginning of month eleven, leading to the assessment of intermittent faults.

Actual field data is depicted in Figure 4. Here, the number of detected faults (y-axis) in the main memory and caches of all 19 partitions of the SGI Altix 4700 installation at the Leibniz computing center (Technical University of Munich) [8] is shown. The data was gained from the system abstraction layer (salinfo). In the observation interval (24.07.–31.08.2007, x-axis), two permanent faults, in partitions 3 and 13, can be recognized by a massive increase of errors (y-axis). If permanent faults were concluded from a history, they would be recognized too late, leading to a massive occurrence of side effects. For the detection of permanent faults, the fault rate in the observation interval alone is relevant. In partitions 2, 5, 9, 10 and 14, an increase of the fault rate before a massive fault appearance can be seen. Besides sudden growth, a decrease of the fault rate can also be observed.

From [7], further properties of intermittent faults can be derived which help to classify them: a repeated manifestation at the same location, and occurrence in bursts. Naturally, intermittent and permanent faults can only be recovered from by replacing the faulty component.
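The kind of log screening described above can be sketched as a simple baseline comparison: a partition becomes suspect of a permanent fault when its daily error count massively exceeds its recent average. This is our illustrative sketch, not code from [7] or [8]; the burst factor and window size are assumed values.

```python
# Sketch (not from the paper): flag a likely permanent fault from daily
# error counts, in the spirit of the SGI Altix observations -- a massive
# increase over the recent baseline marks a day as suspect.
# BURST (10x) and window (7 days) are illustrative choices.

BURST = 10  # hypothetical ratio of current count to baseline

def suspect_permanent(daily_counts, window=7):
    """Return indices of days whose error count exceeds BURST times
    the mean of the preceding `window` days."""
    flagged = []
    for day in range(window, len(daily_counts)):
        baseline = sum(daily_counts[day - window:day]) / window
        if daily_counts[day] > BURST * max(baseline, 1.0):
            flagged.append(day)
    return flagged

counts = [2, 1, 0, 3, 1, 2, 1, 2, 180, 240]  # synthetic log: burst at day 8
print(suspect_permanent(counts))  # -> [8]
```

Note that once the burst itself enters the baseline (day 9 above), the ratio test no longer fires, which mirrors the paper's point that a pure history would recognize permanent faults too late.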
Figure 2. Number of transient errors for 193 systems.

Figure 3. Number of daily transient single-bit errors of a single system.

3 RELATED WORK

A state comparator detects faults in a duplex system. For three or more results or states, a majority voter is used for the selection of a correct state, being able to mask a fault. Besides majority voting, many other selection criteria exist [17]. The classification of faults from their frequency was done in early IBM mainframes. One example is the automated fault diagnosis in the IBM 3081 [14]. In [10], an analysis of faults over time is used to forecast and separate permanent from intermittent faults. The ES/9000 Series, Model 900 [15] implements a retry and threshold mechanism to tolerate transient faults and classify fault types. In [12], the fault rate is used to construct groups of faults. Detailed logs from IBM 3081 and CYBER systems help to detect similarities and permanent faults. In [16], the history of detected faults within an NMR system is used to identify the faultiest module. The offline dispersion frame mechanism by Lin and Siewiorek [11] diagnoses faults in a Unix-type filesystem. The heuristic is based on the observation of faults over time. Different rules are derived, e.g. the two-in-one rule, which generates a warning if two faults occur within an hour. Similar is the approach of Mongardi [13]. Here, two errors in two consecutive cycles within a unit lead to the interpretation of a permanent fault. In [9], the α-count mechanism is developed. It permits modeling the above-named and additional variants. Faults are weighted, with a greater weight assigned to recent faults. Additionally, a unit is weighted with the value α. Faults are classified, but intermittent and permanent faults are not differentiated. Latif-Shabgahi and

Figure 4. Field data for the SGI Altix 4700.

Bennett [16] develop and analyze a flexible majority voter for TMR systems. From the history of faults, the most reliable module is determined. In [18], the processor that will probably fail in the future is determined from the list of processors which took part in a redundant execution scheme. For this purpose, processors are assigned weights. Like [9], we separate between techniques which incur interference from outside and mechanisms based on algorithms. The latter are used in this work.

4 DIAGNOSIS AND PREDICTION

From the cited work and the observations, we can conclude that a classification of faults is possible by measuring the time between them. The limits for a classification are fluid, since the application, technology and circuit will determine the frequency of faults. History Voting for redundant (structural and temporal) systems classifies fault rates during runtime. It predicts whether the fault rate will increase. Based on this forecast, the performance of a system can be dynamically adjusted if the prediction has a certain quality.

We hereby exclude the possibility that the system performance is decreased unreasonably due to a false prediction. A trust value γ is assigned to units. These can be, e.g., the components of an NMR system, the cores in a multicore system, or threading units in a multithreaded system. If the trust is zero, a faulty unit can be identified. An additional innovation is to dynamically adjust the threshold values to the fault scenario. With this we can exclude a false detection of a permanent fault on an unexpectedly high fault rate, e.g. during the flyby of a space probe past a radiation-emitting celestial body.

History Voting consists of two parts:

1. Adjustment of threshold values to the operating environment
2. Prediction of the fault rate and calculation of trust and prediction quality.

4.1 Adjustment of threshold values

All related works from above use fixed threshold limits for the classification of fault types. This assumption only matches an environment where the fault rate is known. To flexibly classify the fault rate, we introduce three threshold variables: ϕι, ϕτ and ϕπ. These represent the upper borders for the rate of permanent (ϕπ), intermittent (ϕι) and transient faults (ϕτ). After a reset, the variables are set to known maximum values of the expected fault scenario. For safety and from the observations, ϕπ is set to a fixed value and is not dynamically adjusted. A counter Δ is needed to measure the time (in cycles) between faults. In the fault-free case, Δ is increased every cycle; otherwise the trend of the fault rate is classified by the function i(a, b), defined by:

i : N × N → {0, 1}

i(a, b) := 0 if ϕι < Δ(a, b) ≤ ϕτ, and 1 if ϕπ < Δ(a, b) ≤ ϕι,

Δ : N × N → N with Δ(a, b) := |a − b|.   (1)

Table 1 shows the coding of i(a, b) and its consequence on the trust γ. γ ≫ represents the bitwise shifting of the value γ to the right, γ++ an increase of the value γ. If a fault cannot be tolerated, the trust is decremented until the minimum (zero) is reached.

If Δ is substantially greater than ϕτ, this could mean that the diagnosis unit does not respond. In this case, tests should be carried out to prevent further faulty behavior.

Table 1. Coding of fault rates.

Coding i(a, b)   Fault rate   Consequence
0                Normal       γ++
1                Increase     γ ≫

4.2 Forecast of fault rates, calculation of trust and prediction quality

A further observation is that a fault-tolerant system which does not know its past cannot express whether a transient or an intermittent (e.g. through frequent usage of a faulty component) fault occurred. Thus, it cannot predict these faults and cannot derive the system behavior over time. Therefore, we include a small memory (the history). The history holds the last three fault rates as interpreted by i(a, b).

For a prediction, neural nets or methods like branch prediction can be used. However, these methods are costly regarding time and area. We use a simple pattern matching, depicted in Table 2. Here, the predicted fault rate and a symbolic representation are shown. For a time-efficient implementation, the history is limited in size; three recent fault rates are considered. If the prediction matches the current fault rate development, the prediction quality η is incremented, else decremented. Here, also the field data from Figure 4 was taken into account. The symbols are summarized in Table 3.

Table 2. History and forecasted fault rates.

H[1] H[2] H[3]   Fault rate prediction
0    0    0      0
0    0    1      0
0    1    0      0
0    1    1      1
1    0    0      0
1    0    1      1
1    1    0      0
1    1    1      1

Table 3. Symbols (History Voting).

Symbol       Description
γi           Trust of unit i
ϕτ           Upper threshold: normal fault rate
ϕι           Mid threshold: increased fault rate
ϕπ           Lower (fixed) threshold: permanent fault
η            Quality of prediction
υ            If η > υ, trust can be adjusted (quality threshold)
Δi           Value of the cycle counter when detecting fault i
H[i]         Entry i in the history
Entries      Maximal number of entries in the history
Prediction   Prediction of the fault rate from the history (see Figure 5)
Predict      Pattern matching to forecast the fault rate (see Figure 5)
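The classification of Eq. (1), the trust coding of Table 1 and the pattern matcher of Table 2 can be sketched as follows. This is a minimal software model under assumed values: the thresholds PHI_* and the quality threshold UPSILON are hypothetical, whereas in the scheme they are initialized from the expected fault scenario and adapted at runtime.

```python
# Minimal sketch of Eq. (1), Table 1 and Table 2 (assumed constants).

PHI_PI, PHI_IOTA, PHI_TAU = 100, 1_000, 10_000  # assumed cycle counts
UPSILON = 2                                     # assumed quality threshold

def classify(a: int, b: int) -> int:
    """i(a, b) from Eq. (1): 0 = normal fault rate, 1 = increased rate.
    delta = |a - b| is the distance in cycles between two faults."""
    delta = abs(a - b)
    if PHI_IOTA < delta <= PHI_TAU:
        return 0
    if PHI_PI < delta <= PHI_IOTA:
        return 1
    raise ValueError("delta outside the classified range")

def update_trust(gamma: int, code: int) -> int:
    """Table 1: a normal rate increments the trust, an increased rate
    halves it (bitwise right shift)."""
    return gamma + 1 if code == 0 else gamma >> 1

def predict(history) -> int:
    """Table 2: forecast an increased rate (1) exactly for the history
    patterns 011, 101 and 111, i.e. when the newest entry is 1 and at
    least one of the two older entries is also 1."""
    h1, h2, h3 = history
    return 1 if h3 == 1 and (h1 == 1 or h2 == 1) else 0

def step(history, eta, new_rate):
    """Score the previous forecast against the newly classified rate,
    then shift the new rate into the 3-entry history."""
    eta = eta + 1 if predict(history) == new_rate else max(eta - 1, 0)
    return history[1:] + [new_rate], eta

history, eta, gamma = [0, 0, 0], 0, 8
for delta in [5000, 800, 400, 300]:          # synthetic fault distances
    code = classify(0, delta)
    gamma = update_trust(gamma, code)
    history, eta = step(history, eta, code)
print(gamma, eta, eta > UPSILON)  # -> 1 1 False
```

In this synthetic run, three increasingly dense faults halve the trust three times, while the prediction quality stays below υ, so the trust would not yet be allowed to steer the system behavior.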

Figure 5. Calculation of trust and prediction.

Figure 5 shows the algorithm for the calculation of trust, the prediction and its quality.

First, we test whether an irreparable internal (INTFAULT) or external fault (EXTFAULT) was signaled. If so, the fail-safe mode must be initiated. If no such fault occurred, the cycle counter Δ is increased. If a fault is detected (INT-/EXTFAULT are excluded here), the classified fault rate i(a, b) is entered into the history H. Via Predict, a prediction (Prediction) is made. The last fault rate is compared with the current one. If the prediction is correct, η is incremented; else it is decremented until the minimum is reached. Only if η > υ can the prediction modify the trust γ and thus the system behavior. The more densely faults occur in time, the less trust a unit gets. The greater the trust, the higher the probability of a correct execution. A slow increase or decrease of the fault rate signals a change within the operating environment, and the threshold values are modified. Δi−1, the last distance of faults in time, is compared with the actual Δi. If the rise or fall is over 50% (Δi > (Δi−1 ≫ 1), respectively Δi ≤ (Δi−1 ≫ 1)), we have a sudden change of the fault rate and the threshold values are not adjusted.

Two possibilities to signal permanent internal faults exist:

• The trust γi in unit i is less than the value pt (slow degradation; not shown in Figure 5)
• Δi is less than the threshold ϕπ (sudden increase)

Hereby, we assume that no unit permanently locks resources, since then the trust of the other units would decrease disproportionately.

5 EXPERIMENTAL RESULTS

To judge the mechanism, it was modeled in software. Figure 6 shows the successful adjustment of threshold

151
values (fault rate λ = 10⁻⁵). The distance of faults in time, the threshold values and the accuracy are shown (from top to bottom: ϕτ, ϕι, ϕπ and the accuracy in %). We purposely chose nearly equal (difference 100 cycles) starting values for ϕι and ϕτ, since we wanted to show the flexibility of the mechanism. We see how the fault rate is framed by the threshold values. Threshold ϕπ is set to a value where a permanent fault can be ascertained (100 cycles).

In the beginning, the accuracy is low due to the initial values of ϕτ and ϕι, but it evens out at about 98%. If these values were correctly initialized, the accuracy would have been 100%. Table 4 shows the resource demands for History Voting (FPGA, Xilinx Virtex-E XCV1000). Table 5 shows the resource demands of History Voting for a standard-cell design using a 130 nm, 6 metal layer CMOS technology.

Figure 6. Successful adjustment of threshold values.

Table 4. Resource demands (FPGA).

Place and route
  Critical path (ns)              9.962
Energy consumption (at 200 MHz)
                                  mA      mW
  1.8 V voltage supply            6.88    12.39
Area
  Slices                          188
  Slice-FFs                       200
  4-input LUTs                    233
  IOBs                            4
  Gate count                      3661

Table 5. Resource demands (standard-cell).

Place and route
  Critical path (ps)              3308
  Area (λ²)                       1175 × 1200
  Transistors                     5784
  Capacity (pF)                   8.8

6 CONCLUSION

In this work we presented a novel fault classification scheme in hardware. In contrast to other schemes, the developed History Voting classifies the fault rate behavior over time. From this, a prediction is made. Only if the prediction quality exceeds a known value can the trust in units be adjusted. The implementations serve as a proof of concept. From the results, we see that the scheme is relatively slow. Since faults (apart from permanent ones) occur seldom in time, this does not argue against the scheme. It will easily fit even on small FPGAs. Many application areas exist for the scheme. Depending on the size of the final implementation, it could be implemented as an additional unit on a space probe, adjusting the system behavior during the flight. Another application area is large-scale systems equipped with application-specific FPGAs, e.g. to boost the performance of cryptographic applications. Here, the scheme could be implemented to adjust the performance of a node, e.g. to identify faulty cores using their trust, or to automatically generate warnings if a certain fault rate is reached.

REFERENCES

[1] E. Normand. Single-Event Upset at Ground Level. IEEE Trans. on Nuclear Science, vol. 43, no. 6, part 1, pp. 2742–2750, 1996.
[2] T. Karnik et al. Characterization of Soft Errors caused by Single-Event Upsets in CMOS Processes. IEEE Trans. on Dependable and Secure Computing, vol. 1, no. 2, pp. 128–143, 2004.
[3] R. Baumann, E. Smith. Neutron-induced boron fission as a major source of soft errors in deep submicron SRAM devices. In Proc. of the 38th Int'l Reliability Physics Symp., pp. 152–157, 2000.
[4] F.L. Kastensmidt, L. Carro, R. Reis. Fault-tolerance Techniques for SRAM-based FPGAs. Springer-Verlag, ISBN 0-387-31068-1, 2006.

[5] National Geophysical Data Center (NGDC). Cosmic Ray Neutron Monitor (Kiel). ftp://ftp.ngdc.noaa.gov/STP/SOLAR_DATA/COSMIC_RAYS/kiel.07. Revision 12/2007, cited 21.01.2008.
[6] The International Technology Roadmap for Semiconductors (ITRS). Front End Processes. http://www.itrs.net/Links/2005ITRS/FEP2005.pdf, 2005 Edition, cited 18.01.2008.
[7] C. Constantinescu. Trends and Challenges in VLSI Circuit Reliability. IEEE Micro, vol. 23, no. 4, pp. 14–19, 2003.
[8] M. Brehm, R. Bader, R. Ebner. Leibniz-Rechenzentrum (LRZ) der Bayerischen Akademie der Wissenschaften, Hardware Description of HLRB II. http://www.lrz-muenchen.de/services/compute/hlrb/hardware. Revision 29.03.2007, cited 30.11.2007.
[9] A. Bondavalli et al. Threshold-Based Mechanisms to Discriminate Transient from Intermittent Faults. IEEE Trans. on Computers, vol. 49, no. 3, pp. 230–245, 2000.
[10] M.M. Tsao, D.P. Siewiorek. Trend Analysis on System Error Files. In Proc. of the 13th Int'l Symp. on Fault-Tolerant Computing (FTCS-13), pp. 116–119, 1983.
[11] T.-T.Y. Lin, D.P. Siewiorek. Error Log Analysis: Statistical Modeling and Heuristic Trend Analysis. IEEE Trans. on Reliability, vol. 39, pp. 419–432, 1990.
[12] R.K. Iyer et al. Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data. IEEE Trans. on Computers, vol. 39, no. 4, pp. 525–537, 1990.
[13] G. Mongardi. Dependable Computing for Railway Control Systems. In Proc. of the 3rd Dependable Computing for Critical Applications (DCCA-3), pp. 255–277, 1993.
[14] N.N. Tendolkar, R.L. Swann. Automated Diagnostic Methodology for the IBM 3081 Processor Complex. IBM Journal of Research and Development, vol. 26, pp. 78–88, 1982.
[15] L. Spainhower et al. Design for Fault-Tolerance in System ES/9000 Model 900. In Proc. of the 22nd Int'l Symp. on Fault-Tolerant Computing (FTCS-22), pp. 38–47, 1992.
[16] G. Latif-Shabgahi, P. Bennett. Adaptive Majority Voter: A Novel Voting Algorithm for Real-Time Fault-Tolerant Control Systems. In Proc. of the 25th Euromicro Conference, vol. 2, p. 2113ff, 1999.
[17] J.M. Bass, G. Latif-Shabgahi, P. Bennett. Experimental Comparison of Voting Algorithms in Cases of Disagreement. In Proc. of the 23rd Euromicro Conference, pp. 516–523, 1997.
[18] P. Agrawal. Fault Tolerance in Multiprocessor Systems without Dedicated Redundancy. IEEE Trans. on Computers, vol. 37, no. 3, pp. 358–362, 1988.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Importance factors in dynamic reliability

Robert Eymard, Sophie Mercier & Michel Roussignol


Université Paris-Est, Laboratoire d’Analyse et de Mathématiques Appliquées (CNRS UMR 8050), France

ABSTRACT: In dynamic reliability, the evolution of a system is governed by a piecewise deterministic Markov
process, which is characterized by different input data. Assuming such data to depend on some parameter p ∈ P,
our aim is to compute the first-order derivative with respect to each p ∈ P of some functionals of the process,
which may help to rank input data according to their relative importance, in view of sensitivity analysis. The
functionals of interest are expected values of some function of the process, cumulated on some finite time interval
[0, t], and their asymptotic values per unit time. Typical quantities of interest hence are cumulated (production)
availability, or mean number of failures on some finite time interval and similar asymptotic quantities. The
computation of the first-order derivative with respect to p ∈ P is made through a probabilistic counterpart of the
adjoint point method, from the numerical analysis field. Examples are provided, showing the good efficiency of
this method, especially in case of large P.

1 INTRODUCTION

In reliability, one of the most common models used in an industrial context for the time evolution of a system is a pure jump Markov process with finite state space. This means that the transition rates between states (typically failure rates, repair rates) are assumed to be constant and independent of the possible evolution of the environment (temperature, pressure, . . .). However, the influence of the environment can clearly not always be neglected: for instance, the failure rate of some electronic component may be much higher in case of high temperature. Similarly, the state of the system may influence the evolution of the environmental condition: think for instance of a heater which may be on or off, leading to an increasing or decreasing temperature. Such observations have led to the development of new models taking such interactions into account. In this way, Jacques Devooght introduced in the 90's what he called dynamic reliability, with models issued at the beginning from the domain of nuclear safety, see (Devooght 1997) and the references therein. In the probability vocabulary, such models correspond to piecewise deterministic Markov processes (PDMP), introduced by (Davis 1984). Such processes are denoted by (It, Xt)t≥0 in the following. They are hybrid processes, in the sense that both components are not of the same type: the first one, It, is discrete, with values in a finite state space E. Typically, it indicates the state (up/down) of each component of the system at time t, just as for a usual pure jump Markov process. The second component, Xt, takes its values in a Borel set V ⊂ Rd and stands for the environmental conditions (temperature, pressure, . . .). The process (It, Xt)t≥0 jumps at countably many random times and both components interact with each other, as required for models from dynamic reliability: for a jump from (It−, Xt−) = (i, x) to (It, Xt) = (j, y) (with (i, x), (j, y) ∈ E × V), the transition rate between the discrete states i and j depends on the environmental condition x just before the jump and is a function x → a(i, j, x). Similarly, the environmental condition Xt just after the jump is distributed according to some distribution μ(i,j,x)(dy), which depends on both components (i, x) just before the jump and on the after-jump discrete state j. Between jumps, the discrete component It is constant, whereas the evolution of the environmental condition Xt is deterministic, solution of a set of differential equations which depends on the fixed discrete state: given that It(ω) = i for all t ∈ [a, b], we have (d/dt)Xt(ω) = v(i, Xt(ω)) for all t ∈ [a, b], where v is a mapping from E × V to V. Contrary to the general model from (Davis 1984), we do not take into account here jumps of (It, Xt)t≥0 possibly entailed by the reaching of the frontier of V.

Given such a PDMP (It, Xt)t≥0, we are interested in different quantities linked to this process, which may be written as cumulated expectations on some time interval [0, t] of some bounded measurable function h of the process:

Rρ0(t) = Eρ0[ ∫0^t h(Is, Xs) ds ]

where ρ0 is the initial distribution of the process. Such quantities include e.g. the cumulative availability or production availability on some time interval [0, t], the mean number of failures on [0, t], the mean time spent by (Xs)0≤s≤t on [0, t] between two given bounds, . . .

For such types of quantity, our aim is to study their sensitivity with respect to different parameters p ∈ P, on which may depend both the function h and the input data of the process (It, Xt)t≥0. More specifically, the point is to study the influence of variations of p ∈ P on Rρ0(t), through the computation of the first-order derivative of Rρ0(t) with respect to each p ∈ P. In view of comparing the results for different p ∈ P, we prefer to normalize such derivatives, and we are actually interested in computing the dimensionless first-order logarithmic derivative of Rρ0(t) with respect to p:

IFp(t) = (p / Rρ0(t)) ∂Rρ0(t)/∂p

which we call the importance factor of parameter p in Rρ0(t). In view of long time analysis, we also want to compute its limit IFp(∞), with

IFp(∞) = lim t→+∞ (p / (Rρ0(t)/t)) ∂(Rρ0(t)/t)/∂p

Noting that IFp(t) and IFp(∞) only make sense when considering a never vanishing parameter p, we consequently assume p to be positive.

This kind of sensitivity analysis was already studied in (Gandini 1990) and in (Cao and Chen 1997) for pure jump Markov processes with countable state space, and extended to PDMP in (Mercier and Roussignol 2007), with a more restrictive model than in the present paper however.

Since the marginal distributions of the process (It, Xt)t≥0 are, in some sense, the weak solution of linear first-order hyperbolic equations (Cocozza-Thivent, Eymard, Mercier, and Roussignol 2006), the expressions for the derivatives of the mathematical expectations can be obtained by solving the dual problem (adjoint point method), as suggested in (Lions 1968) for a wide class of partial differential equations. We show here that the resolution of the dual problem provides an efficient numerical method, when the marginal distributions of the PDMP are approximated using a finite volume method.

Due to the reduced size of the present paper, all proofs are omitted and will be provided in a forthcoming paper.

2 ASSUMPTIONS

The jump rates a(i, j, x), the jump distribution μ(i,j,x), the velocity field v(i, x) and the function h(i, x) are assumed to depend on some parameter p, where p belongs to an open set O ⊂ R or Rk. All the results are written in the case where O ⊂ R, but the extension to the case O ⊂ Rk is straightforward. We add the exponent (p) to each quantity depending on p, such as h(p) or Rρ0(p)(t). We denote by ρt(p)(i, dx) the distribution of the process (It(p), Xt(p))t≥0 at time t with initial distribution ρ0 (independent of p). We then have:

Rρ0(p)(t) = ∫0^t ρs(p) h(p) ds = Σi∈E ∫0^t ∫V h(p)(i, x) ρs(p)(i, dx) ds

In order to prove existence and to calculate derivatives of the functional Rρ0(p), we shall need the following assumptions (H1): for each p in O, there is some neighborhood N(p) of p in O such that, for all (i, j) ∈ E × E:

• the function (x, p) → a(p)(i, j, x) is bounded on V × N(p) and belongs to C2(V × O) (twice continuously differentiable on V × O), with all partial derivatives uniformly bounded on V × N(p),
• for all functions f(p)(x) ∈ C2(V × O) with all partial derivatives uniformly bounded on V × N(p), the function (x, p) → ∫ f(p)(y) μ(p)(i,j,x)(dy) belongs to C2(V × O), with all partial derivatives uniformly bounded on V × N(p),
• the function (x, p) → v(p)(i, x) is bounded on V × N(p) and belongs to C2(V × O), with all partial derivatives uniformly bounded on V × N(p),
• the function (x, p) → h(p)(i, x) is bounded on V × N(p) and almost surely (a.s.) twice continuously differentiable on V × O with a.s. uniformly bounded partial derivatives on V × N(p), where a.s. means with respect to the Lebesgue measure in x.

In all the paper, under assumptions H1, for each p in O we shall refer to an N(p) fulfilling the four points of the assumption without any further notice. We recall that under assumptions H1 (and actually under much milder assumptions), the process (It, Xt)t≥0 is a Markov process, see e.g. (Davis 1984). Its transition probability distribution is denoted by Pt(p)(i, x, j, dy).

3 TRANSITORY RESULTS

We first introduce the infinitesimal generators of both Markov processes (It, Xt)t≥0 and (It, Xt, t)t≥0:

Definition 1.1 Let DH0 be the set of functions f(i, x) from E × V to R such that, for all i ∈ E, the function x → f(i, x) is bounded and continuously differentiable on V, and such that the function x → v(p)(i, x) · ∇f(i, x) is bounded on V. For f ∈ DH0, we define

H0(p) f(i, x) = Σj∈E a(p)(i, j, x) ∫ f(j, y) μ(p)(i,j,x)(dy) + v(p)(i, x) · ∇f(i, x)

where we set a(p)(i, i, x) = −Σj≠i a(p)(i, j, x) and μ(p)(i,i,x) = δx.

Let DH be the set of functions f(i, x, s) from E × V × R+ to R such that, for all i ∈ E, the function (x, s) → f(i, x, s) is bounded and continuously differentiable on V × R+, and such that the function x → ∂f/∂s(i, x, s) + v(p)(i, x) · ∇f(i, x, s) is bounded on V × R+. For f ∈ DH, we define

H(p) f(i, x, s) = Σj∈E a(p)(i, j, x) ∫ f(j, y, s) μ(p)(i,j,x)(dy) + ∂f/∂s(i, x, s) + v(p)(i, x) · ∇f(i, x, s)   (1)

We now introduce what we call importance functions:

Proposition 1 Let t > 0 and let us assume H1 to be true. Let us define the function ϕt(p) by, for all (i, x) ∈ E × V:

ϕt(p)(i, x, s) = −∫0^{t−s} (Pu(p) h(p))(i, x) du if 0 ≤ s ≤ t   (2)

and ϕt(p)(i, x, s) = 0 otherwise. The function ϕt(p) then is the single function element of DH solution of the partial differential equation

H(p) ϕt(p)(i, x, s) = h(p)(i, x)

for all (i, x, s) ∈ E × V × [0, t], with initial condition ϕt(p)(i, x, t) = 0 for all (i, x) in E × V.

The function ϕt(p) belongs to C2(V × O) and is bounded, with all partial derivatives uniformly bounded on V × N(p) for all p ∈ O.

The function ϕt(p) is called the importance function associated to the function h(p) and to t.

The following theorem provides an extension to PDMP of the results from (Gandini 1990).

Theorem 1.1 Let t > 0 be fixed. Under assumptions H1, the function p → Rρ0(p)(t) is differentiable with respect to p on N(p) and we have:

∂Rρ0(p)/∂p (t) = ∫0^t ρs(p) (∂h(p)/∂p) ds − ∫0^t ρs(p) (∂H(p)/∂p) ϕt(p)(·, ·, s) ds   (3)

where we set:

(∂H(p)/∂p) ϕ(i, x, s) = Σj∈E (∂a(p)/∂p)(i, j, x) ∫ ϕ(j, y, s) μ(p)(i,j,x)(dy) + Σj∈E a(p)(i, j, x) (∂/∂p)( ∫ ϕ(j, y, s) μ(p)(i,j,x)(dy) ) + (∂v(p)/∂p)(i, x) · ∇ϕ(i, x, s)

for all ϕ ∈ DH and all (i, x, s) ∈ E × V × R+.

Formula (3) is given for one single p ∈ R*+. In case Rρ0(t) depends on a family of parameters P = (pl)l∈L, we then have:

∂Rρ0(P)/∂pl (t) = ∫0^t ρs(P) (∂h(P)/∂pl) ds − ∫0^t ρs(P) (∂H(P)/∂pl) ϕt(P)(·, ·, s) ds   (4)

for all l ∈ L. The numerical assessment of ∂Rρ0(P)/∂pl(t) hence requires the computation of both ρs(P)(i, dx) and ϕt(P)(i, x) (independent of l ∈ L). This may be done through two different methods: first, one may use Monte Carlo simulation to evaluate ρs(P)(i, dx) and the transition probability distributions Pt(P)(i, x, j, dy), from which the importance function ϕt(P)(i, x) may be derived using (2). Secondly, one may use the finite volume scheme from (Eymard, Mercier, and Prignet 2008), which provides an approximation for ρs(P)(i, dx). The function ϕt(P)(i, x) may then be proved to be the solution of a dual finite volume scheme, see (Eymard, Mercier, Prignet, and Roussignol 2008). This is the method used in the present paper for the numerical examples provided further on. By this method, the computation of ∂Rρ0(P)/∂pl(t) for all l ∈ L requires the solving of two dual finite volume schemes, as well as some summation for each l ∈ L involving the

data ∂h(P)/∂pl and ∂H(P)/∂pl (see (4)), which is done simultaneously with the solving.

This has to be compared with the usual finite differences method, for which the evaluation of ∂Rρ0(P)/∂pl(t) for one single pl requires the computation of Rρ0(P) for two different families of parameters (P, and P with pl substituted by some pl + ε). The computation of ∂Rρ0(P)/∂pl(t) for all l ∈ L by finite differences hence requires 1 + card(L) computations. When the number of parameters card(L) is big, the advantage clearly is to the present method.

4 ASYMPTOTIC RESULTS

We are now interested in asymptotic results and we need to assume the process (It, Xt)t≥0 to be uniformly ergodic, according to the following assumptions H2:

• the process (It, Xt)t≥0 is positive Harris-recurrent with π(p) as unique stationary distribution,
• for each p ∈ O, there exists a function f(p) such that ∫0^{+∞} f(p)(u) du < +∞, ∫0^{+∞} u f(p)(u) du < +∞, lim u→+∞ f(p)(u) = 0 and

|(Pu(p) h(p))(i, x) − π(p) h(p)| ≤ f(p)(u)   (5)

for all (i, x) ∈ E × V and all u ≥ 0.

In order not to give too technical details, we constrain our asymptotic study to the special case where only the jump rates a(p)(i, j, x) and the function h(p)(i, x) depend on the parameter p. Assumptions H1 are then substituted by assumptions H1′, where the conditions on μ(i,j,x) and on v(i, x) (now independent of p) are removed.

We may now introduce what we call potential functions.

Proposition 2 Let us assume μ(i,j,x) and v(i, x) to be independent of p and assumptions H1′, H2 to be true. Then, the function defined by:

Uh(p)(i, x) := ∫0^{+∞} ((Pu(p) h(p))(i, x) − π(p) h(p)) du   (6)

exists for all (i, x) ∈ E × V. Besides, the function Uh(p) is an element of DH0 and it is the solution to the ordinary differential equation:

H0(p) Uh(p)(i, x) = π(p) h(p) − h(p)(i, x)   (7)

for all (i, x) ∈ E × V. Any other element of DH0 solution of (7) is of the shape Uh(p) + C, where C is a constant.

The function Uh(p) is called the potential function associated to h(p).

The following theorem provides an extension to PDMP of the results from (Cao and Chen 1997).

Theorem 1.2 Let us assume μ(i,j,x) and v(i, x) to be independent of p and H1′, H2 to be true. Then, the following limit exists and we have:

lim t→+∞ (1/t) ∂Rρ0(p)/∂p (t) = π(p) (∂h(p)/∂p) + π(p) (∂H0(p)/∂p) Uh(p)   (8)

where we set:

(∂H0(p)/∂p) ϕ0(i, x) := Σj∈E (∂a(p)/∂p)(i, j, x) ∫ ϕ0(j, y) μ(i,j,x)(dy)

for all ϕ0 ∈ DH0 and all (i, x) ∈ E × V.

Just as for the transitory results, the asymptotic derivative requires, for its numerical assessment, the computation of two different quantities: the asymptotic distribution π(p)(i, dx) and the potential function Uh(p)(i, x). Here again, such computations may be done either by finite volume schemes (using (7) for Uh(p)) or by Monte Carlo simulation (using (6) for Uh(p)). Also, in case of a whole set of parameters P = (pl)l∈L, such computations have to be done only once for all l ∈ L, which here again gives the advantage to the present method against finite differences, in case of a large P.

5 A FIRST EXAMPLE

5.1 Presentation—theoretical results

A single component is considered, which is perfectly and instantaneously repaired at each failure. The distribution of the life length of the component (T1) is absolutely continuous with respect to the Lebesgue measure, with E(T1) > 0. The successive life lengths make a renewal process. The time evolution of the component is described by the process (Xt)t≥0 where Xt
stands for the time elapsed at time t since the last instantaneous repair (the backward recurrence time). There is one single discrete state, so that the component It is here unnecessary. The failure rate for the component at time t is λ(Xt), where λ(·) is some non-negative function. The process (Xt)t≥0 is ''renewed'' after each repair, so that μ(x)(dy) = δ0(dy), and (Xt)t≥0 evolves between renewals with speed v(x) = 1.

We are interested in the rate of renewals on [0, t], namely in the quantity Q(t) such that:

Q(t) = R(t)/t = (1/t) E0[ ∫0^t λ(Xs) ds ]

where R(t) is the renewal function associated to the underlying renewal process.

The function λ(x) is assumed to depend on some parameter p > 0.

Assuming λ(x) to meet the H1′ requirement, the results from Section 3 here write:

∂Q(p)(t)/∂p = (1/t) ∫0^t ∫0^s ρs(p)(dx) (∂λ(p)/∂p)(x) × (1 − ϕt(p)(0, s) + ϕt(p)(x, s)) ds

where ϕt is the solution of

λ(x)(ϕt(0, s) − ϕt(x, s)) + (∂/∂s)ϕt(x, s) + (∂/∂x)ϕt(x, s) = λ(x)

for all s ∈ [0, t], and ϕt(x, t) = 0 for all x ∈ [0, t].

Assuming E(T1) < +∞, the process is known to have a single stationary distribution π(p), which has the following probability density function (p.d.f.):

fπ(p)(x) = P(T1(p) > x)/E(T1(p)) = e^{−∫0^x λ(p)(u)du}/E(T1(p))   (9)

Using a result from (Konstantopoulos and Last 1999), one may then prove the following proposition, which ensures the process to be uniformly ergodic, meeting H2:

Proposition 3 Let us assume that E(e^{δT1}) < +∞ for some 0 < δ < 1 and that T1 is new better than

Under assumption H2, we get the following closed form for ∂Q(∞)/∂p:

∂Q(∞)/∂p = (1/E0(T1)) ∫0^{+∞} (∂λ/∂p)(x) × (1 − Q(∞) ∫0^x e^{−∫0^v λ(u)du} dv) dx

5.2 Numerical results

We assume that T1 is distributed according to some Weibull distribution, which is slightly modified to meet our assumptions:

λ(α,β)(x) = αβx^{β−1} if x < x0;  Pα,β,x0(x) if x0 ≤ x < x0 + 2;  αβ(x0 + 1)^{β−1} = constant if x0 + 2 ≤ x

where (α, β) ∈ O = [0, +∞] × [2, +∞], x0 is chosen such that T1 > x0 is a rare event (P0(T1 > x0) = e^{−αx0^β} small) and Pα,β,x0(x) is some smoothing function which makes x → λ(α,β)(x) continuous on R+. For such a failure rate, it is then easy to check that assumptions H1′ and H2 are true, using Proposition 6.

Taking (α, β) = (10⁻⁵, 4) and x0 = 100 (which ensures P0(T1 > x0) ≈ 5 × 10⁻⁴³⁵), we are now able to compute IFα(t) and IFβ(t) for t ≤ ∞. In order to validate our results, we also compute such quantities by finite differences (FD), using:

∂Q(t)/∂p ≈ (1/ε)(Q(p+ε)(t) − Q(p)(t))

for small ε and t ≤ ∞. For the transitory results, we use the algorithm from (Mercier 2007), which provides an estimate for the renewal function R(p)(t) and hence for Q(p)(t) = R(p)(t)/t, to compute Q(p)(t) and Q(p+ε)(t). For the asymptotic results, we use the exact formula Q(p)(∞) = 1/E0(T1(p)), a direct consequence of the key renewal theorem, to compute such quantities.

The results are gathered in Table 1 for the asymptotic importance factors IFp(∞).
Proposition 3 Let us assume that E(eδT1 ) < +∞ The results are very stable for IFβ (∞) by FD choos-
for some 0 < δ < 1 and that T1 is new better than ing different values for ε and FD give very similar
used (NBU: for all x, t ≥ 0, we have F̄(x + t) ≤ results as EMR. The approximation for IFα (∞) by FD
F̄(x)F̄(t), where F̄ is the survival function F̄(t) = requires smaller ε to give similar results as EMR. Sim-
P(T1 > t)). Then, there are some C < +∞ and ilar remarks are valid for the transitory results, which
0 < ρ < 1 such that: are plotted in Figures 1 and 2 for t ∈ [0, 50] and differ-
ent values of ε. This clearly validates the method. As
(p)
|Pt h(p) (x) − π (p) h(p) | ≤ Cρ t for the results, we may note that, for a Weibull distri-
bution, the shape parameter β is much more influent
for all x ∈ R+ . on the rate of renewals than the scale parameter α.

159
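The finite-difference validation above can be reproduced on a toy version of this model. The sketch below is an illustration only, not the paper's EMR implementation: it assumes a plain Weibull failure rate λ(x) = αβx^{β−1} (without the truncation and smoothing used in the paper), computes Q(∞) = 1/E_0(T_1) by numerical integration of the survival function, and compares FD estimates of ∂Q(∞)/∂α for a coarse and a fine ε against the analytic derivative, showing the same "smaller ε needed" behaviour reported for IF_α(∞):

```python
import math

def mean_lifetime(alpha, beta, x_max=10.0, n=20000):
    """E(T1) = integral of the Weibull survival function exp(-alpha x^beta),
    computed by the trapezoid rule (truncated at x_max, negligible tail here)."""
    h = x_max / n
    s = 0.0
    for k in range(n + 1):
        x = k * h
        w = 0.5 if k in (0, n) else 1.0
        s += w * math.exp(-alpha * x ** beta)
    return s * h

def q_inf(alpha, beta):
    # asymptotic rate of renewals, Q(inf) = 1 / E(T1) (key renewal theorem)
    return 1.0 / mean_lifetime(alpha, beta)

def fd_sensitivity(alpha, beta, eps):
    # one-sided finite-difference estimate of dQ(inf)/dalpha
    return (q_inf(alpha + eps, beta) - q_inf(alpha, beta)) / eps

alpha, beta = 1.0, 2.0
# analytic derivative: Q(inf) = alpha^(1/beta) / Gamma(1 + 1/beta)
exact = (1.0 / beta) * alpha ** (1.0 / beta - 1.0) / math.gamma(1.0 + 1.0 / beta)
fd_coarse = fd_sensitivity(alpha, beta, 1e-1)   # large eps: visible FD bias
fd_fine = fd_sensitivity(alpha, beta, 1e-5)     # small eps: close to exact
```

As in the paper's Table 1, the coarse step carries a noticeable bias while the fine step essentially recovers the exact sensitivity.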
Table 1. IF_α(∞) and IF_β(∞) by finite differences (FD) and by the present method (EMR).

       ε         IF_α(∞)          IF_β(∞)
FD     10^−2     4.625 × 10^−3    2.824
FD     10^−4     8.212 × 10^−2    2.821
FD     10^−6     2.411 × 10^−1    2.821
FD     10^−8     2.499 × 10^−1    2.821
FD     10^−10    2.500 × 10^−1    2.821
EMR    –         2.500 × 10^−1    2.821

Figures 1 and 2. IF_α(t) and IF_β(t) by finite differences and by the present method (EMR).

6 A SECOND EXAMPLE

6.1 Presentation—theoretical results

The following example is very similar to that from (Boxma, Kaspi, Kella, and Perry 2005). The main difference is that we here assume X_t to remain bounded (X_t ∈ [0, R]), whereas, in the quoted paper, X_t takes its values in R_+.

A tank is considered, which may be filled or emptied using a pump. This pump may be in two different states: ''in'' (state 0) or ''out'' (state 1). The level of liquid in the tank goes from 0 up to R. The state of the ''tank-pump'' system at time t is (I_t, X_t), where I_t is the discrete state of the pump (I_t ∈ {0, 1}) and X_t is the continuous level in the tank (X_t ∈ [0, R]). The transition rate from state 0 (resp. 1) to state 1 (resp. 0) at time t is λ_0(X_t) (resp. λ_1(X_t)). The speed of variation of the liquid level in state 0 is v_0(x) = r_0(x), with r_0(x) > 0 for all x ∈ [0, R) and r_0(R) = 0: the level increases in state 0 and tends towards R. Similarly, the speed in state 1 is v_1(x) = −r_1(x), with r_1(x) > 0 for all x ∈ (0, R] and r_1(0) = 0: the level of liquid decreases in state 1 and tends towards 0. For i = 0, 1, the function λ_i (respectively r_i) is assumed to be continuous (respectively Lipschitz continuous) and consequently bounded on [0, R]. The level in the tank is continuous, so that μ(i, 1−i, x)(dy) = δ_x(dy) for i ∈ {0, 1} and all x ∈ [0, R]. In order to ensure the process to be positive Harris recurrent, we also make the following additional assumptions: λ_1(0) > 0, λ_0(R) > 0, and

∫_x^R (1/r_0(u)) du = +∞,   ∫_0^y (1/r_1(u)) du = +∞

for all x, y ∈ [0, R]. We get the following result:

Proposition 4 Under the previous assumptions, the process (I_t, X_t)_{t≥0} is positive Harris recurrent with single invariant distribution π given by:

π(i, dx) = f_i(x) dx

for i = 0, 1, and

f_0(x) = (K_π / r_0(x)) e^{∫_{R/2}^x (λ_1(u)/r_1(u) − λ_0(u)/r_0(u)) du}   (10)

f_1(x) = (K_π / r_1(x)) e^{∫_{R/2}^x (λ_1(u)/r_1(u) − λ_0(u)/r_0(u)) du}   (11)

where K_π > 0 is a normalization constant. Besides, assumptions H2 are true, namely, the process (I_t, X_t)_{t≥0} is uniformly ergodic.

6.2 Quantities of interest

We are interested in two quantities: first, the proportion of time spent by the level in the tank between two fixed bounds R/2 − a and R/2 + b, with 0 < a, b < R/2, and
we set:

Q_1(t) = (1/t) E_{ρ_0} ( ∫_0^t 1_{{R/2−a ≤ X_s ≤ R/2+b}} ds )
       = (1/t) ∫_0^t Σ_{i=0}^{1} ∫_{R/2−a}^{R/2+b} ρ_s(i, dx) ds
       = (1/t) ∫_0^t ρ_s h_1 ds   (12)

with h_1(i, x) = 1_{[R/2−a, R/2+b]}(x).

The second quantity of interest is the mean number of times the pump is turned off, namely turned from state ''in'' (0) to state ''out'' (1), by unit time, namely:

Q_2(t) = (1/t) E_{ρ_0} ( Σ_{0<s≤t} 1_{{I_{s−}=0 and I_s=1}} )
       = (1/t) E_{ρ_0} ( ∫_0^t λ_0(X_s) 1_{{I_s=0}} ds )
       = (1/t) ∫_0^t ∫_0^R λ_0(x) ρ_s(0, dx) ds
       = (1/t) ∫_0^t ρ_s h_2 ds   (13)

with h_2(i, x) = 1_{{i=0}} λ_0(x).

For i = 0, 1, the function λ_i(x) is assumed to depend on some parameter α_i (but no other data depends on the same parameter). Similarly, the function r_i(x) depends on some ρ_i for i = 0, 1. By definition, the function h_1 also depends on the parameters a and b.

We want to compute the importance factors with respect to p for p ∈ {α_0, α_1, ρ_0, ρ_1, a, b}, both in Q_1 and Q_2, except for the parameters a and b, which intervene only in Q_1.

As told at the end of Section 3, we have to compute the marginal distributions (ρ_s(i, dx))_{i=0,1} for 0 ≤ s ≤ t and the importance function associated to h_{i_0} and t, for i_0 = 1, 2. This is done through solving two dual implicit finite volume schemes. A simple summation associated to each p, which is done simultaneously to the solving, then provides the result through (4).

As for the asymptotic results, the potential functions Uh_{i_0} are here solutions of

v_i(x) (d/dx)(Uh_{i_0}(i, x)) + λ_i(x)(Uh_{i_0}(1 − i, x) − Uh_{i_0}(i, x)) = Q_{i_0}(∞) − h_{i_0}(i, x)

for i_0 = 1, 2, which may be solved analytically. A closed form is hence available for ∂Q_{i_0}(∞)/∂p using (10–11) and (8).

6.3 Numerical example

We assume the system to be initially in state (I_0, X_0) = (0, R/2). Besides, we take:

λ_0(x) = x^{α_0};   r_0(x) = (R − x)^{ρ_0};   λ_1(x) = (R − x)^{α_1};   r_1(x) = x^{ρ_1}

for x ∈ [0, R], with α_i > 1 and ρ_i > 1. All conditions for irreducibility are here achieved. We take the following numerical values:

α_0 = 1.05;   ρ_0 = 1.2;   α_1 = 1.10;   ρ_1 = 1.1;   R = 1;   a = 0.2;   b = 0.2.

Similarly to the first example, we test our results using finite differences (FD). The results are here rather stable when choosing different values of ε, and the results are provided for ε = 10^−2 in case p ∈ {α_0, α_1, ρ_0, ρ_1} and for ε = 10^−3 in case p ∈ {a, b}. The asymptotic results are given in Tables 2 and 3, and the transitory ones are given in Tables 4 and 5 for t = 2.

Table 2. IF_p^(1)(∞) by the present method (EMR) and by finite differences (FD).

p      FD               EMR              Relative error
α_0    −3.59 × 10^−2    −3.57 × 10^−2    5.40 × 10^−3
α_1    −4.45 × 10^−2    −4.43 × 10^−2    3.65 × 10^−3
ρ_0    3.19 × 10^−1     3.17 × 10^−1     6.95 × 10^−3
ρ_1    2.80 × 10^−1     2.78 × 10^−1     7.19 × 10^−3
a      4.98 × 10^−1     4.98 × 10^−1     1.06 × 10^−7
b      5.09 × 10^−1     5.09 × 10^−1     1.53 × 10^−7

Table 3. IF_p^(2)(∞) by the present method (EMR) and by finite differences (FD).

p      FD               EMR              Relative error
α_0    −1.81 × 10^−1    −1.81 × 10^−1    1.67 × 10^−4
α_1    −1.71 × 10^−1    −1.71 × 10^−1    1.30 × 10^−4
ρ_0    −6.22 × 10^−2    −6.19 × 10^−2    5.21 × 10^−3
ρ_1    −6.05 × 10^−2    −6.01 × 10^−2    5.58 × 10^−3

Table 4. IF_p^(1)(t) for t = 2 by the present method (EMR) and by finite differences (FD).

p      FD               EMR              Relative error
α_0    −8.83 × 10^−2    −8.82 × 10^−2    1.08 × 10^−3
α_1    −9.10 × 10^−3    −9.05 × 10^−3    5.29 × 10^−3
ρ_0    4.89 × 10^−1     4.85 × 10^−1     7.51 × 10^−3
ρ_1    1.97 × 10^−1     1.97 × 10^−1     4.04 × 10^−3
a      2.48 × 10^−1     2.48 × 10^−1     4.89 × 10^−4
b      7.11 × 10^−1     7.11 × 10^−1     7.77 × 10^−6

Table 5. IF_p^(2)(t) for t = 2 by the present method (EMR) and by finite differences (FD).

p      FD               EMR              Relative error
α_0    −2.06 × 10^−1    −2.06 × 10^−1    9.12 × 10^−4
α_1    −6.80 × 10^−2    −6.79 × 10^−2    2.12 × 10^−3
ρ_0    −1.25 × 10^−1    −1.24 × 10^−1    4.27 × 10^−3
ρ_1    −4.11 × 10^−3    −4.03 × 10^−3    2.00 × 10^−2

The results are very similar by FD and EMR, both for the asymptotic and the transitory quantities, which clearly validates the method. Note that the asymptotic results coincide by both methods, even in the case where the velocity field v(i, x) depends on the parameter (here ρ_i), which however does not fit our technical assumptions from Section 4. Due to that (and to other examples where the same remark is valid), one may conjecture that the results from Section 4 are valid under less restrictive assumptions than those given in that section.

As for the results, one may note that the importance factors at t = 2 of α_0 and ρ_0 in Q_i (i = 1, 2) are clearly higher than the importance factors of α_1 and ρ_1 in Q_i (i = 1, 2). This must be due to the fact that the system starts from state 0, so that on [0, 2] the system spends more time in state 0 than in state 1. The parameters linked to state 0 are hence more important than the ones linked to state 1. Similarly, the level is increasing in state 0, so that the upper bound b is more important than the lower one a.

In the long-time run, the importance factors of α_0 and α_1 in Q_i (i = 1, 2) are comparable. The same remark is valid for ρ_0 and ρ_1, as well as for a and b.

Finally, the parameters ρ_0 and ρ_1 are more important than the parameters α_0 and α_1 in Q_1, conversely to what happens in Q_2. This seems coherent with the fact that quantity Q_1 is linked to the level in the tank, and consequently to its evolution, controlled by ρ_0 and ρ_1, whereas quantity Q_2 is linked to the transition rates, and consequently to α_0 and α_1.

ACKNOWLEDGEMENT

The authors would like to thank Anne Barros, Christophe Bérenguer, Laurence Dieulle and Antoine Grall from Troyes Technological University (Université Technologique de Troyes) for having drawn their attention to the present subject.

REFERENCES

Boxma, O., H. Kaspi, O. Kella, and D. Perry (2005). On/off storage systems with state-dependent input, output, and switching rates. Probab. Engrg. Inform. Sci. 19(1), 1–14.
Cao, X.-R. and H.-F. Chen (1997). Perturbation realization, potentials, and sensitivity analysis of Markov processes. IEEE Trans. Automat. Contr. 42(10), 1382–1393.
Cocozza-Thivent, C., R. Eymard, S. Mercier, and M. Roussignol (2006). Characterization of the marginal distributions of Markov processes used in dynamic reliability. J. Appl. Math. Stoch. Anal., Art. ID 92156, 18 pp.
Davis, M.H.A. (1984). Piecewise-deterministic Markov processes: a general class of nondiffusion stochastic models. J. Roy. Statist. Soc. Ser. B 46(3), 353–388. With discussion.
Devooght, J. (1997). Dynamic reliability. Advances in Nuclear Science and Technology 25, 215–278.
Eymard, R., S. Mercier, and A. Prignet (2008). An implicit finite volume scheme for a scalar hyperbolic problem with measure data related to piecewise deterministic Markov processes. J. Comput. Appl. Math., available online 1 Nov. 2007.
Eymard, R., S. Mercier, A. Prignet, and M. Roussignol (2008, June). A finite volume scheme for sensitivity analysis in dynamic reliability. In Finite Volumes for Complex Applications V, Aussois, France.
Gandini, A. (1990). Importance & sensitivity analysis in assessing system reliability. IEEE Trans. Reliab. 39(1), 61–70.
Konstantopoulos, T. and G. Last (1999). On the use of Lyapunov function methods in renewal theory. Stochastic Process. Appl. 79(1), 165–178.
Lions, J.-L. (1968). Contrôle optimal de systèmes gouvernés par des équations aux dérivées partielles. Paris: Dunod.
Mercier, S. (2007). Discrete random bounds for general random variables and applications to reliability. European J. Oper. Res. 177(1), 378–405.
Mercier, S. and M. Roussignol (2007). Sensitivity estimates in dynamic reliability. In Proceedings of MMR 2007 (Fifth International Conference on Mathematical Methods in Reliability), Glasgow, Scotland.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

TSD, a SCAIS suitable variant of the SDTPD

J.M. Izquierdo
Consejo de Seguridad Nuclear, Madrid, Spain

I. Cañamón
Universidad Politécnica de Madrid, Madrid, Spain

ABSTRACT: The present status of the developments of both the theoretical basis and the computer implementation of the risk and path assessment modules in the SCAIS system is presented. These modules are supplementary tools to the classical probabilistic (i.e. fault tree, event tree and accident progression, PSA-ET/FT/APET) and deterministic (dynamic accident analysis) tools, able to compute the frequency of exceedance in a harmonized approach, based on a path and sequences version (TSD, theory of stimulated dynamics) of the general stimulus driven theory of probabilistic dynamics. The contribution examines the relation of the approach with classical PSA and accident dynamic analysis, showing how the engineering effort already made in the nuclear facilities may be used again, adding to it an assessment of the damage associated to the transients and a computation of the exceedance frequency. Many aspects of the classical safety assessments, going from technical specifications to success criteria and event tree delineation/evaluation, may be verified at once, making effective use of regulatory and technical support organization resources.

1 INTRODUCTION

This paper deals with the development of new capabilities related to risk-informed licensing policies and their regulatory verification (RIR, Risk Informed Regulation), which require a more extended use of risk assessment techniques, with a significant need to further extend PSA scope and quality. The computerized approach (see the ESREL8 companion paper on the SCAIS system) is inspired by a coherent protection theory that is implemented with the help of a mathematical formalism, the stimulus driven theory of probabilistic dynamics (SDTPD, ref. 1), which is an improved version of the TPD (ref. 3) whose restrictive assumptions are released.

Computer implementation of the SDTPD makes it necessary to design specific variants that, starting from the original SDTPD formal specification as formulated in ref. 1, go down the line towards generating a set of hierarchical computing algorithms, most of them being the same as those in the classical safety assessments. They aim to compute the exceedance frequency and the frequency of activating a given stimulus as the main figures of merit in RIR defense-in-depth regulation requirements, in particular the adequate maintenance of safety margins (ref. 2).

2 SDTPD PATH AND SEQUENCES APPROACH

2.1 Background: semi-Markov path and sequences

As is well known, the differential semi-Markov equations for the probability π_j(t) of staying in state j at time t are

(d/dt) π_j(t) = −π_j(t) · Σ_{k≠j} p_{j→k}(t) + φ_j(t)

φ_j ≡ Σ_{k≠j} p_{k→j}(t) · π_k(t)   (1)

where the transition rates p are allowed to be functions of time. A closed solution is

π_j(t) = Σ_k [exp( ∫_τ^t ds A(s) )]_{jk} π_k(τ)   (2)

[A(s)]_{jk} ≡ p_{j→k}(s), k ≠ j;   [A(s)]_{jj} ≡ −λ_j(s) ≡ −Σ_{k≠j} p_{j→k}(s), k = j
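The closed solution (2) can be checked numerically on the smallest possible case. The sketch below is an illustration under assumed constant rates (a for 0→1, b for 1→0, values arbitrary), written in the standard forward-equation convention dπ/dt = Aπ with the columns of A summing to zero; the paper's own index convention in eq. (2) may differ. It evaluates the matrix exponential by a truncated Taylor series and compares π_0(t) with the well-known two-state closed form:

```python
import math

def expm(A, t, terms=40):
    """exp(A t) for a small square matrix, by truncated Taylor series."""
    n = len(A)
    result = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    term = [row[:] for row in result]  # current term (A t)^k / k!, starts at I
    for k in range(1, terms):
        term = [[sum(term[r][m] * A[m][c] * t / k for m in range(n))
                 for c in range(n)] for r in range(n)]
        result = [[result[r][c] + term[r][c] for c in range(n)] for r in range(n)]
    return result

a, b = 0.7, 0.3          # assumed constant transition rates 0->1 and 1->0
A = [[-a, b], [a, -b]]   # generator acting on the probability column vector
t = 2.0
P = expm(A, t)
pi0 = [1.0, 0.0]         # start in state 0 with probability one
pi_t = [sum(P[r][c] * pi0[c] for c in range(2)) for r in range(2)]
# two-state closed form: pi_0(t) = b/(a+b) + (pi_0(0) - b/(a+b)) e^{-(a+b)t}
pi0_closed = b / (a + b) + (1.0 - b / (a + b)) * math.exp(-(a + b) * t)
```

The series solution reproduces the closed form and preserves total probability, which is exactly the content of eq. (6b) below for this two-state case.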
They may also be written as integral equations representing the solution of the semi-Markov system for the frequency of entering state j (ingoing density), φ_j(t), and its probability π_j(t), given by

φ_j(t) = ∫_0^t dτ Σ_{k≠j} [π_k(τ) δ(τ) + φ_k(τ)] q_{jk}(t, τ)

π_j(t) = ∫_0^t dτ [π_j(τ) δ(τ) + φ_j(τ)] e^{−∫_τ^t ds Σ_{k≠j} p_{j→k}(s)}   (3)

where δ stands for the Dirac delta, and q_{jk} dt is the probability that, being in state k at time τ, event jk takes place at exactly time t; it is then given by

q_{jk}(t, τ) = p_{k→j}(t) e^{−∫_τ^t Σ_{l≠k} p_{k→l}(s) ds}   (4)

The solution (eq. 2) may also be found in terms of the iterative equations

φ_j(t) = Σ_{n=1}^{∞} φ_j^n(t)

φ_j^n(t) = Σ_{k≠j} ∫_0^t dτ φ_k^{n−1}(τ) q_{jk}(t, τ)

φ_j^1(t) = Σ_{k≠j} π_k(0) q_{jk}(t, 0)   (5)

where n stands for the number of experienced events. Note that the q's satisfy a strong closure relation for all τ_1 < τ_2 < t,

q_{k,j}(t, τ_2) (1 − Σ_{l≠j} ∫_{τ_1}^{τ_2} dν q_{l,j}(ν, τ_1)) = q_{k,j}(t, τ_1)   (6a)

that ensures that at all times

Σ_j π_j(t) = 1   (6b)

If we replace the iterations in eq. 5, we may express φ_j(t) in terms of an aggregate of so-called ''path frequencies'' built upon the products

Q_j^{seq j}(t/τ⃗_n) = q_{j,j_n}(t, τ_n) q_{j_n,j_{n−1}}(τ_n, τ_{n−1}) ··· q_{j_2,j_1}(τ_2, τ_1),   τ_1 < ··· < τ_{n−1} < τ_n < t,   τ⃗_n ≡ (τ_1, ..., τ_n)   (7)

that is

φ_j(t) = Σ_{seq j} φ_j^{seq j}(t) ≡ Σ_j φ_{j_1}^1(0) ∫_{V_{n,j}(τ⃗_n < t)} dτ⃗_n Q_j^{seq j}(t/τ⃗_n)   (8)

where the first term is the initiating event frequency; Q_j^{seq j}(t, τ⃗_n) will be called the path Q-kernel.

Vector j is called ''a sequence'' and the couple of vectors (j, τ⃗) ''a path''. This type of solution will then be named the ''path and sequence'' approach. In static PSA, each component of j is a header of an event tree and there is no distinction in principle between paths and sequences.

We notice that in a semi-Markov system like the one above, the factors q of the Q^{seq} product are all the same functions, irrespective of n. An extension of this approach may be made, however, when these functions depend on n and j, as will be the case with the TPD and SDTPD theories. However, the normalization conditions of equations (6a) and (6b) above should always be respected for all n. It is obvious that the path and sequence approach loses interest when the number of paths is unmanageable. Below we show cases where this is not the case.

2.2 Extension to TPD dynamic reliability

In TPD, the transition rates p are also allowed to be functions of process variables (temperatures, pressures, etc.) that evolve with time along a dynamic trajectory. The transitions come together with a change in the trajectory (a change in dynamics, or dynamic event).

Let x be the vector of process variables describing the dynamic behavior of the system. We denote by i the group of system configurations in which the dynamic evolution is given by the equivalent explicit form

x = x(t, i) = g^i(t, x_0),   x_0 = g^i(0, x_0)   (9)

We will assume in the following that:

• the system starts at a steady point state u_0, j_0 with given probability. A random instantaneous initiating event with a known frequency triggers the time evolution.
• process variable x can be reached from the initial point state only after one sequence of dynamic events through a compound path defined by
x = x(t, j) = g_{j_n}(t − τ_n^+, û_n),   τ_n > t > τ_{n−1}
û_n = g_{n−1}(τ_n^− − τ_{n−1}^+, û_{n−1})   (10)

where τ_n^{−,+} account for the 'natural' discontinuity in slopes when the dynamic events take place.

Under the given assumptions, once the possible dynamics and initial point state are defined, the possible paths may be determined, including their timing. We will call a sequence the set of paths {j, τ_{j_n}} with the same j but differing in the timing. Then the general path and sequences approach may be applied, such that we associate the dynamic paths with the paths of equations (6), (7) and (8) above. For every path, we now have a deterministic transient that may be simulated with deterministic simulators, providing then the functions q via

q_{j,j_n}^{seq j}(t, τ) = p_{j_n→j}(t, x(t, j)) e^{−∫_τ^t Σ_{l≠j_n} p_{j_n→l}(s, x(s, j)) ds}   (11)

2.3 Extension to stimuli: modeling the plant states

SDTPD includes an additional extension of the state space: stimulus activation states are considered through a stimulus label indicating the activated (I_G = +) or deactivated (I_G = −) state of stimulus G. The notation I will be reserved to label these state configurations when considering all stimuli together. A so-called stimulus is either an order for action (in many cases given by an electronic device or corresponding to an operator diagnosis), or the fulfillment of conditions triggering a stochastic phenomenon. Yet it can take many different forms, such as the crossing of a setpoint or the entry into a region of the process variables space. In general, the term stimulus covers any situation which potentially causes, after a given time delay, an event to occur and subsequently a branching to take place. Thus, if G_n denotes the stimulus that leads to transition j_n → j_{n+1}, it should be in the activated state for the transition to take place.

SDTPD models the activation/deactivation of stimuli as events with activation/deactivation transition rates λ_F(t, x)/μ_F(t, x) for stimulus F. The total system state in the course of a transient is thus defined by the couple (j_n, I), which accounts at the same time for the current dynamics in which the system evolves and for the current status of the different activations.

3 THE TSD METHODOLOGY

3.1 Modeling the probability of the next event along a dynamic path with stimuli events

The TSD method uses a specific type of path and sequence approach to solving the SDTPD, where the sub-states j_n of (j_n, I), associated with the different dynamics, define the dynamic paths as in the no-stimuli case, but the stimuli state is kept in Markov vector format.

Let Q_{i,j}^{I,J}(t, τ) dt be the probability of the transition (j, J) → (i, I) in a time interval dt around t, after entering dynamic substate j at time τ. In TSD terms, Q may be written in the following way

Q_{j_n,j_{n−1}}^{J_n,J_{n−1},seq j}(τ_n, τ_{n−1}) ≡ q_{j_n,j_{n−1}}^{seq j}(τ_n, τ_{n−1}) × π(J_n, τ_n^− / J_{n−1}, τ_{n−1})   (12)

The last factor means the stimuli probability vector at time t = τ_n^−, conditioned to be π^{J_{n−1}}(τ_{n−1}) at τ_{n−1}^−. To compute this stimuli probability is then a typical problem of the classical PSA binary λ, μ Markov type, with the stimuli matrix ST_n associated with the (+ activated, − deactivated) state of each stimulus (i.e. A ≡ ST_n in eq. 2) and the compound-stimuli state vector I, once a dynamic path is selected; i.e., the solution is

π^J(t) = Σ_{K_n} [exp( ∫_{τ_n}^t ST_n(u) du )]_{J,K_n} π^{K_n}(τ_n^+)   (13)

with initial conditions at τ_n^+ given by the probability vector

π^{K_n}(τ_n^+) = Σ_{J_n} [+_n]_{K_n,J_n} π^{J_n}(τ_n^−)   (14)

Matrix [+_n] models the potential reshuffling of the state of the stimuli that may take place as a result of the occurrence of dynamic event n, i.e. some activate, some deactivate, some stay as they were. It also incorporates the essential feature of SDTPD that stimulus G_n, responsible for the nth dynamic transition, should be activated for the transition to take place. Then the equivalent to eq. (8) may be applied with:

Q_{j,seq j}^{J}(t/τ⃗) = Σ_{J_1,J_2,...,J_n} Q_{j,j_n}^{J,J_n,seq j}(t, τ_n) Q_{j_n,j_{n−1}}^{J_n,J_{n−1},seq j}(τ_n, τ_{n−1}) ··· Q_{j_2,j_1}^{J_2,J_1,seq j}(τ_2, τ_1) π^{J_1}(0^+),   τ_1 < τ_{n−1} < τ_n < t   (15)

with

Q_{j_n,j_{n−1}}^{J_n,J_{n−1},seq j}(t, τ_n) ≡ q_{j_n,j_{n−1}}^{seq j}(τ_n, τ_{n−1}) × Σ_{K_n} [exp( ∫_{τ_n}^t ST_n(u) du )]_{J_n,K_n} [+_n]_{K_n,J_{n−1}}   (16)
becomes the extended next event frequency of the TSD approach. This situation is very common and covers the cases of setpoint-related stimuli activation events, as well as those independent of the dynamics that may be uncoupled, as extensively discussed¹ in ref. 3. Once again, this approach is not useful if the number of paths is too large.

¹ This uncoupling allows to factor out the traditional fault trees when the basic events depend on the pre-accident period time scales.

3.2 Example

As an academic example, assume that all stimuli, initially deactivated, only activate as a result of the initiating event, and only deactivate again when their corresponding dynamic event does occur. For a sequence of N headers, we label 1 to N the first N header-stimuli, i.e. stimuli whose activation triggers the dynamic transition. Then all λ_F(t, x), μ_F(t, x) are zero and the following results:

Define the compound state J_G^±, such that it differs from compound state J only in the state of stimulus G, being it activated or deactivated. Also define

δ_{G_n^+ ∈ I} = 1 if G_n ∈ I is activated; 0 if G_n ∈ I is deactivated   (17)

Then

ST_n = 0 ⇒ exp( ∫_{τ_n}^t ST_n(u) du ) = Identity matrix   (18)

π^I(τ_n^+ < t < τ_{n+1}) = π^I(τ_n^+) = Σ_J [+_n]_{I,J} π^J(τ_n^−) = Σ_J δ_{G_n^+ ∈ J} δ_{I,J_{G_n}^−} π^J(τ_n^−)   (19)

Because after the initial event all stimuli become activated,

π^{J_1}(0^+) = 1 for J_1 = (+, +, ···, +_N) (all stimuli activated); 0 for all other states   (20)

and

π^I(τ_1^+ < t < τ_2^−) = π^I(τ_1^+) = Σ_J δ_{G_1^+ ∈ J} δ_{I,J_{G_1}^−} π^J(τ_1^−) = 1 for I = J_{G_1}^− = (−, +, ···, +_N); 0 for all other states   (21)

Repeating the process

Q_{j,seq j}^{I}(t/τ⃗_n) = δ_{G_1^+ ∈ J} δ_{G_2^+ ∈ J_{G_1}^−} ··· δ_{G_n^+ ∈ J_{G_1···G_{n−1}}^{−···−}} × δ_{I, J_{G_1···G_n}^{−···−}} × Q_j^{seq j}(t/τ⃗_n)   (22)

where Q_j^{seq j}(t/τ⃗_n) is given in eq. (7). For n > N all stimuli would be deactivated, and then the [+_n]_{I,J} contribution would be zero, as well as in any sequence with repeated headers.

In other words, the result reduces to a standard semi-Markov case, but in the solution equation (8) only sequences of non-repeated, up to N events are allowed. This academic case illustrates the reduction in the number of sequences generated by the stimuli activation condition, which limits the sequence explosion problem of the TPD in a way consistent with explicit and implicit PSA customary practice.

Results of this example in the case N = 2 are shown in sections 4–5 below, mainly as a way of checking the method for integration over the dynamic times, which is the subject of the next section.

3.3 Damage exceedance frequency

In TSD, additional stimuli are also defined, including for instance the stimulus ''Damage'' that, when activated, corresponds to reaching unsafe conditions. Then, the damage exceedance frequency is given by

φ_j^{damage}(t) = Σ_{j,J_N} φ_{j_1}^{J_1}(0) ∫_{V_{n,j}(τ⃗ < t)} dτ⃗_N Q_{j,seq j}^{J,damage}(t/τ⃗)   (23)

To obtain the damage exceedance frequency, only the aggregate of those paths that exceed a given amount of damage is of interest. Because they are expected to be rare, it is of paramount importance to first find the damage domains, i.e. the time combinations for which damage may occur. This is the main purpose of the searching approach that follows.

The next sections provide algorithms to perform this aggregation and show results as applied to a simple example. An air/steam gas mixture where hydrogen is injected as initiator event and its rate suddenly changed as a header event (SIS) has been considered. See Table 1 and section 5. In addition to those in Table 1, a combustion header stimulus is activated if the mixture becomes flammable, and (section 4 only) a combustion
dynamic event (COMB) is included, with an associated peak-pressure damage stimulus².

² The stimuli activations for headers SIS and CHRS (spray) follow the rules of the example in section 3.2.

Table 1. Dynamic/stimuli events used in the verification tests.

Event  Description
1      Initiating event: H2 injection into containment air/steam mixture—result of cold leg break with core degradation.
2      H2 injection rate increase—result of core oxidation increase after random initiation of the Safety Injection System (SIS) in the primary system.
3      Spray cold water injection—result of random initiation of the Containment Heating Removal System (CHRS).
4      Gas mixture becomes flammable (damage stimulus activated).

4 INTEGRATION OF THE PATH TSD FORMULA. THE SEARCH FOR DAMAGE DOMAINS

4.1 Incremental volume of one path

We can graphically represent the integration domain in eq. (23) by points in an integration volume of dimension N. Each point corresponds to a specific path, and we distinguish three types of paths: damage paths, safe paths and impossible paths (those that do not activate the header stimulus on time). For instance, in the example below a sequence with two headers has dimension 2, and impossible paths are those where the flammability stimulus activates after the combustion event.

Numerical integration of the exceedance frequency eq. (23) will then consist of computing the Q-kernel on the points inside the damage domain and adding up all the contributions, multiplied by the 'incremental volume' (ΔV) associated to each sampling point. The way the sampling strategy is performed and the accurate computation of the incremental volume associated to each path are key points for the correct computation of the exceedance frequency.

Figure 1a shows the domain of a 2-dimension (2D) sequence containing stochastic events 1 and 2, where X_1 and X_2 are the sampling variables of the occurrence times of events 1 and 2, respectively. The time limits of the sequence are t_ini (initiating event occurrence time) and t_AD (accident duration time), and X_1, X_2 ∈ [t_ini, t_AD]. Therefore, the whole squared domain will contain two different sequences, depending on which event occurs first. The shadowed domain corresponds to the sequence [2 1] to be integrated, where any point fulfills t_2 ≤ t_1 (the white domain being that of the opposite sequence [1 2]).

Figure 1. Sampling domains (shadowed) corresponding to: a) the 2D sequence [2 1]; b) the 3D sequence [3 2 1].

The area A of any of those domains is given by the following expression:

A(X_i ≤ X_j) = (t_AD − t_ini)² / 2   (i, j = 1, 2; i ≠ j)

Similarly, Figure 1b shows the sampling domain of a 3-dimension (3D) sequence, where the shadowed domain corresponds to the reduced sequence [3 2 1], and the rest of the cubic domain to combinations of those three stochastic events in different order. Here the numbers refer to the events, not to the j states, as in Table 1.

The equation of the volume V for those 3D sequences will be:

V(X_i ≤ X_j ≤ X_k) = (t_AD − t_ini)³ / 6   (i, j, k = 1, 2, 3; i ≠ j ≠ k)

and the generalization of this equation to N dimensions:

V(X_{i_1} ≤ X_{i_2} ≤ ··· ≤ X_{i_N}) = (t_AD − t_ini)^N / N!   (i_1, i_2, ..., i_N = 1, 2, ..., N; i_1 ≠ i_2 ≠ ··· ≠ i_N)   (24)

We can associate to any damage point in the N-dimensional sampling domain an incremental volume of the form ΔV = Π_{i=1}^{N} Δt_i, where the incremental times on each axis have to be determined.
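The 1/N! volume fractions of eq. (24) are easy to check with a small mesh-grid count, in the spirit of the mesh-grid sampling discussed in the next subsection. The sketch below is an illustration only (it normalizes t_ini = 0 and t_AD = 1, and the grid sizes are arbitrary): it counts the cell-centred grid points of the hypercube that satisfy one fixed ordering of the event times:

```python
from itertools import product

def ordered_fraction(n_dim, n_grid=40):
    """Mesh-grid estimate of the fraction of the unit hypercube occupied by
    one ordering X_1 <= X_2 <= ... <= X_N; eq. (24) gives 1/N! exactly."""
    # cell-centred grid points avoid ties between coordinates
    pts = [(k + 0.5) / n_grid for k in range(n_grid)]
    total = hits = 0
    for p in product(pts, repeat=n_dim):
        total += 1
        if all(p[i] <= p[i + 1] for i in range(n_dim - 1)):
            hits += 1
    return hits / total

f2 = ordered_fraction(2)             # should approach 1/2! = 0.5
f3 = ordered_fraction(3, n_grid=30)  # should approach 1/3! ~ 0.1667
```

Refining the grid drives the counts towards 1/2 and 1/6, matching the area and volume formulas above; the residual discrepancy is the discretization error that the adaptive refinement of section 4.3 is designed to control near the damage-domain boundaries.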
4.2 Paths sampling strategy: Search of the damage domain

As stated before, we are only interested in the damage paths, as they are the only ones contributing to the damage exceedance frequency. Therefore, a sampling strategy has been designed to:

– Minimize the number of non-damage paths (safe and impossible paths) being sampled and analyzed;
– Search and define the shape and size of the damage domain within the sampling domain.

Three main sampling strategies may be considered to do that:

1. Non-uniform Monte Carlo (MC). This sampling strategy uses the probability density functions of each event to perform the sampling, leading to a non-uniform sampling distribution. This approach cannot be used here, as the probability density functions are already taken into account in the TSD formulae for the computation of the probability Q-kernels.
2. Uniform Monte Carlo (MC). Points are sampled uniformly within the sequence time interval (N random numbers between tini and tAD sorted in increasing order). However, even with a uniform sampling, the damage zone inside the sampling domain may be irregularly shaped, leading again to a heterogeneous sampling density of the damage points along the sampling domain (see Figure 2a). We could partition the sampling domain into smaller subdomains where a homogeneous distribution of the sampling points could be assumed, and approximate the differential volume of each point as the subdomain volume divided by the number of points lying inside.
3. Mesh-grid sampling. In this sampling strategy, sampling points are obtained from a Cartesian mesh-grid partition of the sampling domain (see Figure 2b). In this case, the volume associated to each sampling point (damage, safe or impossible path-point) is perfectly determined by equation (21), being Δti the sampling step along the axis direction i.

Remark from Figure 2 that:

– For the same number of sampling points (N = 136), the damage domain is better determined with the mesh-grid sampling strategy.
– The incremental volume associated to each damage path is much easier and more accurate to quantify in the mesh-grid sampling strategy.
– For refining purposes, the neighboring points to each sampled path are better defined in the mesh-grid sampling strategy.

We therefore adopt the mesh-grid sampling strategy.

Figure 2. 2D sequence ('SIS' and 'COMB' are events 1 and 2 respectively) analyzed with two different sampling strategies: a. Uniform MC sampling strategy; b. Mesh-grid sampling strategy. Cross points are damage paths and dot points are impossible paths. N = 136 paths.

4.3 Adaptive search algorithm

The ultimate goal is therefore to be able to precisely define the size of the damage domain within the sequence sampling domain, in order to multiply it by the 'weighting factors' given by the Q-kernels of each path. An adaptive search algorithm has been designed, based on the mesh-grid sampling strategy, to refine the search where damage paths are detected. The basic idea is to analyze the neighbors of each damage path, and to sample on a finer mesh grid around them until non-damage paths (the limits of the damage domain) are found.

The algorithm is formed by one initial stage and an adaptive search stage divided in three parts: a refining stage, a seeding stage and a growing stage. The adaptive search stage is repeated at successively higher scales (more refined mesh grids) until convergence of the damage exceedance frequency is reached. We describe the different stages as follows:

• Initial stage: An initial mesh grid is defined, and points within that grid are sampled and analyzed via the simplified dynamic models to determine whether they are a damage, success or impossible path. In order to assure that the subsequent refinement of the mesh grid does not lie at points contradicting the dynamic-model time step, the time step of this initial stage, dt, is chosen following a dyadic scale of that one (and equal for all the axes). Additionally, information about the neighbors of each sampled path is registered in an array, assigning a value of 1 when the neighbor is a damage path, and a value of 0 in any other case (success or impossible path). Neighbors are defined here as only those paths at ±dt in each axis direction (see the highlighted points around the cross damage path in Figure 3), instead of all the surrounding points (8 in 2D, 26 in 3D, etc.). The latter option would lead to an unaffordable number of new sampled points as the sequence dimension increases. Figure 4a shows an example of this initial stage for the 'simplified benchmark' sequence [1 2 4], where 4 means the damage stimuli activation event. This stage is defined here as scale 1.

Figure 3. Neighbors of a given damage path.

• Adaptive search stage: A loop is performed with successively higher scales 2, 3, 4, etc. until a stopping criterion is reached. This stage has the following parts:
◦ Refining stage: When entering a higher scale, the new time step for the sampling process is half the previous one, dt/2. Then, the algorithm samples new points along the borders of the existing damage domain, i.e. if a neighbor of an existing damage point is not a damage one, then a new point is sampled between them and evaluated. Figure 4b shows the result of this stage for the same example sequence [1 2 4].
◦ Seeding stage: A random seeding of new sampling points along the whole domain has been included here, in order to discover new damage zones disjoint from the previous ones. Several parameters control the stopping criteria in this stage. In particular, we stop when a number of seeding points proportional to the refinement of the initial sampling mesh grid has been reached. Figure 4c shows the result of this stage for the actual example.
◦ Growing stage: At this stage, the algorithm extends the sampling through the whole interior zone of the damage domain. With the combination of the refining and growing stages, we optimize the number of new points being sampled while refining everything inside the damage domain. Figure 4d shows the result of this stage in the example sequence.

Figure 4. Stages of the adaptive search algorithm in the sequence [1 2 4]: a. Initial stage; b. Refining stage; c. Seeding stage; d. Growing stage.

5 VERIFICATION TESTS

To verify the integration routines we took the simple case of the section 3.2 example, already used in section 4, with two header events and constant, given values of the two events' transition rates, p. The initiating event frequency has been set equal to 1. The conditioned-to-the-initiating-event path probability is computed for each possible sequence built with those two header events, and the TSD integral of each analyzed sequence is then computed by aggregation of the path probabilities multiplied by the incremental volume ΔV as in section 4.

The dynamics of this section 4 and 5 example are based on a benchmark exercise performed in the framework of the SARNET research network. It analyzes the risk of containment failure in a nuclear power plant due to overpressurization caused by hydrogen combustion (see ref. 4 for detailed specifications of the benchmark). In the results presented in this section, combustion has not been modeled. Instead, entering the flammability region for the H2 concentration has been selected as the damage stimulus.

The events involved are given in Table 1. The details about the dynamics of the transients are not within the scope of this article, and can be consulted in ref. 4. The cumulative probabilities q for the single dynamic events 2 and 3 are 0.75 and 0.5 respectively, during the accident time interval. Transition rates have been extracted from them by integrating eq. (4) (with only one event) within that interval and forcing q to take the given values. In the analytical case compared below, single-event q's were additionally approximated by functions linear with time.

As the transient ends when the damage stimulus is activated, only five sequences are possible.

Table 2. State probabilities at t = tAD. Analytical computation.

Sequence   Probability
1          0.1250
13         0.1250
12         0.3750
132        0.1875
123        0.1875

Table 3. State probabilities at t = tAD. TSD computation, stopping criterion πi − πi−1 ≤ 5%.

Sequence   # Total paths   Probability
1          1               0.1250
13         16              0.1231
12         16              0.3717
132        136             0.1632
123        136             0.2032

Table 4. State probabilities at t = tAD. TSD computation, stopping criterion πi − πi−1 ≤ 1%.

Sequence   # Total paths   Probability
1          1               0.1250
13         64              0.1251
12         32              0.3742
132        2080            0.1672
123        2080            0.2091
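The computation sketched above — path probability (Q-kernel) multiplied by the incremental volume ΔV and summed over the damage points of a mesh grid — can be outlined as follows. The damage classifier and kernel below are toy stand-ins, not the simplified dynamic models or Q-kernels of the paper:

```python
import itertools

def mesh_grid_exceedance(t_ini, t_ad, dt, is_damage_path, q_kernel, dims=2):
    """Sketch of the mesh-grid strategy: each grid point carries the
    incremental volume dV = dt**dims; damage points contribute that
    volume weighted by the path's Q-kernel value."""
    axis = [t_ini + k * dt for k in range(int((t_ad - t_ini) / dt) + 1)]
    total = 0.0
    for point in itertools.product(axis, repeat=dims):
        if is_damage_path(point):
            total += q_kernel(point) * dt ** dims
    return total

# Toy stand-ins (NOT the paper's dynamic model): damage when event 2
# occurs within 1000 s after event 1, with a uniform kernel weight.
freq = mesh_grid_exceedance(
    6000.0, 14000.0, 500.0,
    is_damage_path=lambda p: p[0] <= p[1] <= p[0] + 1000.0,
    q_kernel=lambda p: 1.0e-8,
)
```

Refining the grid (smaller dt) makes the sum converge to the integral of the kernel over the damage domain, which is what the adaptive algorithm of section 4.3 accelerates.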

5.1 Numerical results


The first part of the verification tests computes the probability of being in a specific state, πj(t). Here no damage stimulus is considered, and only the probabilities of the possible sequences obtained with the header events 2 and 3 and the initiating event 1 are computed. Table 2 presents the "linear q's" analytical results. Tables 3 and 4 show the results obtained without q-approximations but integrating the time domains with the 'SCAIS_PathQKernel.m' module, which implements the TSD, and compare them with the analytical case.

Firstly, the difference between linear q functions in eq. 4 and uniform transition rates is only significant in sequences with more than one header event (apart from the initiating event). This is because in sequences with only one event, the cumulative survival probabilities only include that event and are computed up to tAD = 14220 s (accident duration time), and those values (0.75 for event 2 and 0.5 for event 3) are imposed the same in the two computations. Secondly, the consistency of the SCAIS module results for different stopping criteria (convergence of the computed πj probability) can be observed, as well as the fulfilment of the normalization condition given by eq. (6b), further confirmed in Fig. 5, which shows a bar diagram with the probabilities at all times.

Figure 5. Bar diagram of the πj probabilities.
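The stopping criterion behind Tables 3 and 4 — repeat the adaptive search at a higher scale until the computed probability changes by no more than a tolerance — can be sketched generically. Whether the criterion is meant in absolute or relative terms is not detailed here, so the relative form below is an assumption, and the toy estimator stands in for a real TSD computation:

```python
def converge_state_probability(estimate_at_scale, tol=0.05, max_scale=10):
    """Refine the mesh grid (higher scale) until the estimated state
    probability changes by no more than `tol` (relative change assumed)."""
    prev = estimate_at_scale(1)
    for scale in range(2, max_scale + 1):
        cur = estimate_at_scale(scale)
        if abs(cur - prev) <= tol * abs(prev):
            return cur, scale
        prev = cur
    return prev, max_scale

# Toy estimator: converges geometrically to 0.375 as the scale grows.
value, scale = converge_state_probability(lambda s: 0.375 + 0.25 ** s)
```

With the 5% tolerance the loop stops after a few scales; tightening it to 1% forces more refinement, which mirrors the growth of the path counts between Tables 3 and 4.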
Table 5. Damage exceedance relative frequency: TSD computation of the verification example; stopping criterion ϕi^exc − ϕi−1^exc ≤ 5%.

Sequence   # Damage paths   # Total paths   Exceed. freq.
1          0                1               0.0000
13         9                15              0.1718
12         14               15              0.5188
132        3155             3308            0.0700
123        1178             1314            0.0350

The second part of the verification tests computes the relative exceedance frequency ϕ^exc/ϕ(0) of the damage stimulus 4 (entering the flammability region for H2 combustion) for the same example. Table 5 shows the values of ϕ^exc/ϕ(0) for each possible sequence; the total relative exceedance frequency is the sum of them, totalling 0.7956.

6 CONCLUSIONS

We have shown a new mathematical description of the SDTPD which is suitable for its SCAIS implementation. The new TSD methodology based on it is feasible and compatible with existing PSA and accident dynamics tools. It is able to find the damage exceedance frequency, which is the key figure of merit in any strategy for safety margins assessment. We have discussed some issues associated with the development of its computer implementation, and given results of preliminary verification tests.

REFERENCES

Labeau P.E., Izquierdo J.M., 2004. Modeling PSA problems. I: The stimulus driven theory of probabilistic dynamics. II: A cell-to-cell transport theory approach. NSE 150, pp. 115–154 (2005).
NEA/CSNI/R(2007)9, Task Group on Safety Margins Action Plan (SMAP): Safety Margins Action Plan. Final Report. See also NEA/SEN/SIN/LOSMA(2008)1, Summary Record of the First Plenary Meeting of the CSNI Task Group on LOCA Safety Margin Assessment (LOSMA).
Izquierdo J.M., Melendez E., Devooght J., 1996. Relationship between probabilistic dynamics and event trees. Rel. Eng. Syst. Safety 52, pp. 197–209.
Izquierdo J.M., Cañamón I. Status Report on SDTPD path and sequence TSD developments. Deliverable D-73, DSR/SAGR/FT 2004.074, SARNET PSA2 D73 [rev1] TSD.
Fault identification and diagnostics
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Application of a vapour compression chiller lumped model


for fault detection

J. Navarro-Esbrí & A. Real


Department of Mechanical Engineering and Construction, Universitat Jaume I, Castellón, Spain

D. Ginestar
Department of Applied Mathematics, Universidad Politécnica de Valencia, Valencia, Spain

S. Martorell
Department of Chemical and Nuclear Engineering, Universidad Politécnica de Valencia, Valencia, Spain

ABSTRACT: Faults in Heating, Ventilating and Air-Conditioning (HVAC) systems can significantly harm the system in terms of energy efficiency loss, performance degradation, and even environmental implications. The chiller being one of the most important components of an HVAC system, the present work is focused on it. A lumped model is proposed to predict the chiller fault-free performance using data easily obtained from an industrial facility. This model predicts the chilled water temperature, the operating pressures and the system's overall energy performance. The fault detection methodology is based on comparing actual and fault-free performances, using the proposed model and thresholds on the operating variables. The technique has been successfully applied for fault detection in a real installation under different faulty conditions: refrigerant leakage, and water flow reduction in the condenser and in the secondary circuit.

1 INTRODUCTION

Air conditioning and refrigerating facilities based on vapour compression systems account for a very significant portion of energy consumption in the industrial and commercial sectors, Buzelina et al. (2005). A system that automatically detects an incipient failure condition can help to improve compliance with installation objectives, to reduce energy use, to optimize facility maintenance and to minimize undesirable environmental effects such as those associated with ozone depletion or the greenhouse effect of refrigerant leakage.

The application of fault detection and diagnosis (FDD) techniques to refrigerating vapour compression systems is relatively recent, even though a large body of literature on FDD for critical processes, such as chemical process plants, Himmelblau D.M. (1978), and aerospace related research, Potter & Suman (1978), has existed since the 1970s. The interest in applying fault detection methodology to refrigerating systems results from different studies focused on energy saving in air conditioning and refrigerating systems, and from the expected benefits of applying FDD techniques, including less expensive repairs and maintenance, and shorter downtimes.

The FDD techniques involve three main steps: fault detection, the indication that a fault truly exists; fault isolation or diagnosis, the location of the fault; and fault identification, called fault evaluation according to Isermann (1984), which consists of determining the magnitude of the fault. The methods for fault detection and diagnosis may be classified into two groups: those which do not use a model of the system (model-free methods), Navarro-Esbrí et al. (2006), and those which do (model-based methods). Here we will focus on the model-based methods. These methods rely on the concept of analytical redundancy, comparing the model's expected performance with the actual measured performance to analyze the presence of faults in the system.

The literature related to FDD applied to refrigerating vapour compression systems is relatively limited. Several works analyze the most common faults and their impacts, Breuker & Braun (1998) in rooftop air conditioners or McKellar (1987) in domestic refrigerators. Stallard (1989) develops an expert system for automated fault detection in refrigerators. Along the same lines, Rossi & Braun (1997) present a rule-based diagnosis for vapour compression air conditioners.

Among the FDD model-based methods applied to chillers, we will concentrate on those methods
based on physical models. These physical models are based on first principles involving mass, energy and momentum balances and mechanical characteristics. The ability of a model-based FDD technique to detect faults during chiller operation depends on the model performance. It is desirable that the model presents a good prediction capability in any operating condition and that it can be obtained in a fully automatic way, combining simplicity with a low degree of data requirement. In this way, we will present a simplified physical model for a chiller facility and we will use it to develop a fault detection methodology for three typical faults in this kind of facility: faults in the condenser circuit, faults in the evaporator circuit and leakage in the refrigerant circuit.

The rest of the paper is organized as follows. Section 2 is devoted to the description of the experimental facility used to develop and to test the fault detection methodology. In section 3, the proposed fault detection methodology is presented and a simple model for the vapour compression chiller is briefly described. The commonest failures of this kind of installation are studied in section 4, including a sensitivity analysis of the variables most affected by these failures. In section 5, the fault detection methodology is tested with different typical faulty operation modes. The main conclusions of the paper are summarized in section 6.

2 EXPERIMENTAL TESTS FACILITY

For the development of the fault detection methodology, a monitored vapour-compression chiller, which develops a simple compression cycle, has been used. This facility consists of the four basic components: an open-type compressor driven by a variable speed electric motor; an isolated shell-and-tube (1–2) evaporator, where the refrigerant flows inside the tubes, using a brine (water-glycol mixture, 70/30% by volume) as secondary fluid; an isolated shell-and-tube (1–2) condenser, with the refrigerant flowing along the shell, where water is used inside the tubes as secondary fluid; and a thermostatic expansion valve. In (Fig. 1) a scheme of the installation is shown.

In order to introduce modifications in the chiller operating conditions, we use the secondary fluid loops, which help to simulate the evaporator and condenser conditions in chillers. The condenser water loop consists of a closed-type cooling system (Fig. 2), which allows controlling the temperature of the water and its mass flow rate. The cooling load system (Fig. 3) also regulates the secondary coolant temperature and mass flow rate using a set of immersed electrical resistances and a variable speed pump.

Figure 1. Scheme of the installation.

Figure 2. Condenser water loop.

The thermodynamic state of the refrigerant and secondary fluids at each point is determined using the measurements of 14 K-type thermocouples and 8 piezoelectric pressure transducers. The pressure and temperature sensors are calibrated in our own laboratory using certified references, obtaining an uncertainty of 0.3 K in the thermocouples and a precision of 10 kPa in the pressure sensors. The refrigerant mass flow rate is measured by a Coriolis-effect mass flow meter with a certified accuracy within ±0.22% of the reading, and the secondary fluid mass flow rates are measured with electromagnetic flow meters, introducing a maximum error of ±0.25%. Furthermore, a capacitive sensor is installed to obtain the compressor rotation speed, with a maximum error after calibration of ±1%. The electrical power consumption of the compressor is evaluated using a digital Wattmeter (with a
calibration specified uncertainty of ±0.5%). Finally, all these measurements are gathered by a National Instruments PC-based data acquisition system (Fig. 4), based on LABVIEW and REFPROP subroutines.

Figure 3. Load simulation system (evaporator's circuit).

Figure 4. Scheme of the data acquisition system.

3 FAULT DETECTION METHODOLOGY

The fault detection is accomplished by evaluating the residuals obtained from the comparison between the actual performance of the installation, determined from experimental measurements, and an expectation of performance obtained by using a physical model of the system (Fig. 5). If the residuals exceed a given threshold, then a fault is indicated.

It has to be noted that the selected output variable, xk, must be proven to be representative of the system operation, and its selection depends on the kind of fault to be detected. For this reason, a previous analysis of how a specific fault affects the main chiller operating variables is needed, the most sensitive and "inexpensive" ones being selected as output variables for the fault detection technique.

Figure 5. Fault detection technique scheme.

3.1 Physical model

The proposed model for the vapour compression chiller has a general structure as the one presented in (Fig. 6), where it can be seen that the model inputs are the secondary fluid input variables and the compressor speed. Using these inputs and the main characteristics of the compressor and heat exchangers, the model predicts the fault-free performance: operating pressures, secondary fluid output variables (operating pressure and temperatures) and energy performance (including power consumption and energy efficiencies).

The model computes the refrigerant properties using dynamic libraries of Refprop, Lemmon et al. (2002), while the secondary fluids' thermo-physical properties are evaluated by means of interpolating polynomials calculated from the ASHRAE Handbook (2001).

The kernel of the model consists of a set of five equations based on physical laws, which model the main parts of the system, shown schematically in (Fig. 7).

Comparing the measured variable with the calculated one, a residual is obtained. While this residual remains between two confidence thresholds, the operation will be considered fault-free. Some consecutive values of the residual out of these thresholds will generate an alarm signal, and the identification of a possible fault will proceed.
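The residual-evaluation step of Fig. 5 — an alarm only after several consecutive residuals leave the confidence band — can be sketched as follows; the window of three consecutive samples is an illustrative choice, not a value from the paper:

```python
def alarm_signal(measured, expected, threshold, consecutive=3):
    """Residual evaluation in the spirit of Fig. 5: compare actual and
    model-predicted values, and raise the alarm only after `consecutive`
    residuals in a row exceed the confidence threshold (this filters out
    isolated noisy samples)."""
    run = 0
    alarms = []
    for m, e in zip(measured, expected):
        run = run + 1 if abs(m - e) > threshold else 0
        alarms.append(run >= consecutive)
    return alarms

# A single outlier does not trigger the alarm; a sustained deviation does.
flags = alarm_signal(
    measured=[1.0, 1.0, 2.0, 2.0, 2.0, 2.0],
    expected=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    threshold=0.5,
)
```

Requiring a run of out-of-band residuals is what trades detection delay against false-alarm rate, the compromise discussed in section 4 for the threshold width.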
Figure 6. Model scheme.

Figure 7. Schematic structure of the model kernel.

Special attention must be paid to the assignment of those confidence thresholds. On the one hand, too narrow thresholds will increase the possibility of false alarms by confusing a noisy signal with a deviation due to a fault. On the other hand, too wide thresholds will increase the possibility of ignoring a fault.

4 COMMON FAILURES AND SENSITIVITY ANALYSIS

The most common failures to be detected in a vapour compression chiller are: refrigerant leakage, a fault in the condenser circuit and a fault in the evaporator circuit. A refrigerant leakage may be difficult to detect since it only affects the system operation after a large loss of refrigerant. When it occurs, refrigerant in vapour phase may appear at the outlet of the condenser. This will interfere with the normal function of the expansion device; it will increase the quality of the refrigerant at the evaporator inlet and thus the efficiency of the system will be reduced.

A typical fault in both the evaporator and the condenser circuits may be the reduction of the mass flow rate of the secondary fluids through the heat exchangers. These reductions can deteriorate the compliance with installation objectives and, on the other hand, can cause an energy efficiency reduction, since a mass flow rate reduction of the secondary fluids at the condenser and evaporator produces an increase of the compression rate.

Thus, in the following, we will analyze these three kinds of failures. The first step for the correct identification of each type of fault is to analyze the variations produced by each kind of faulty operation, followed by the selection of a reduced group of variables that have a significant variation during the fault and can be used to detect it. In order to identify these variables, some variations have been forced in the facility in such a way that they can simulate those failures. Then, the correlation of each variable with each fault has been calculated both in the short term and in the medium term, or quasi-steady state, after it.

4.1 Fault in the condenser's circuit

In order to simulate a fault in the condenser circuit, a valve has been partially shut down to reduce the water mass flow rate. In a real facility this fault may be caused by an obstruction in a filter of the circuit, or a failure in the impulsion pump. Table 1 shows the variation of each one of the main operating variables of the facility after the shut down of the valve.

The variable that shows the best conditions to be monitored is the condenser pressure, Pk. The evolution of this variable is shown in (Fig. 8). Some temperatures also show deviations, such as the condenser outlet temperature, but they are not only smaller but also slower, as the variation can only be seen after the transient state.

Table 1. Deviations of the main operating variables of the facility after a fault in the condenser's circuit (short term, ST, and medium term, MT).

Variable   Initial   ST var (%)   MT var (%)
Pk (bar)   15.18     1.46         1.96
Pe (bar)   4.00      0.22         −0.22
Tki (K)    0.068     0.08         −0.73
Tko (K)    347.63    0.02         0.14
Toi (K)    308.61    0.04         0.23
T∞ (K)     265.09    0.02         −0.01
Twi (K)    266.51    0.00         0.49
Two (K)    291.00    0.01         −0.05
Tbi (K)    297.68    0.11         0.15
Tbo (K)    286.89    0.02         0.00
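The ST and MT columns of Tables 1–3 are relative deviations from the pre-fault value of each variable. A trivial helper for reproducing such entries (illustrative only; the numbers in the usage line are arbitrary, not taken from the tables):

```python
def percent_deviation(initial, value):
    """Relative change from the pre-fault (initial) value, in percent,
    as tabulated in the ST/MT columns of the sensitivity tables."""
    return 100.0 * (value - initial) / initial

# Example with arbitrary values: a pressure rising from 15.0 to 15.3 bar.
print(percent_deviation(15.0, 15.3))
```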
4.2 Fault in the evaporator circuit

The simulation of this failure has been carried out in a similar way to the one for the condenser's circuit. A valve at the secondary fluid loop has been partially shut down in order to reduce its mass flow rate. An obstruction in a filter or an impulsion pump failure may be the real causes of this kind of failure. (Table 2) shows the deviations of the main operating variables of the facility.

Figure 8. Response of the condenser pressure when a fault is forced in the condenser's circuit.

Table 2. Deviations of the installation's main variables after a fault in the evaporator's circuit.

Variable   Initial   ST var (%)   MT var (%)
Pk (bar)   16.28     −1.02        −1.20
Pe (bar)   3.60      −3.77        −3.38
Tki (K)    352.97    0.00         0.05
Tko (K)    311.26    −0.04        −0.16
Toi (K)    263.45    −0.27        −0.34
T∞ (K)     266.52    0.09         −0.55
Twi (K)    295.89    0.03         −0.03
Two (K)    302.03    −0.03        −0.10
Tbi (K)    284.44    0.01         0.06
Tbo (K)    352.97    0.00         0.05

As shown in (Table 2), both the pressure at the condenser and the pressure at the evaporator are likely candidates to be monitored. Furthermore, both have a clear reaction in the transient state and after it; however, the evaporator pressure shows an approximately three times higher deviation than the pressure at the condenser, and thus will be a more reliable indicator for the failure detection. The response of the pressure in the evaporator is shown in (Fig. 9).

Figure 9. Response of the evaporator pressure after a fault forced in the evaporator's circuit.

4.3 Refrigerant leakage

To simulate a refrigerant leakage in the chiller, a small valve in the refrigerant circuit has been temporarily opened, thus allowing the refrigerant to flow out. The response of the system variables is shown in (Table 3).

The analysis of the variations presented in (Table 3) shows that the superheat (difference between the evaporator outlet temperature, Too, and the saturation temperature, Toi) and the pressure at the evaporator are the most sensitive parameters to a refrigerant leakage fault. Of these, the pressure at the evaporator is the most sensitive variable, with the advantage of being a directly measured one. So, the simplest approach consists of using the pressure at the evaporator for the fault detection (Fig. 10).

Table 3. Deviations of the main operating variables in a refrigerant leakage.

Variable   Initial   Var (%)
Pk (bar)   15.21     −1.35
Pe (bar)   3.89      −4.60
Tki (K)    347.57    0.00
Tko (K)    308.48    −0.02
Toi (K)    264.91    −0.15
T∞ (K)     266.41    0.21
Twi (K)    291.05    −0.01
Two (K)    297.66    −0.05
Tbi (K)    287.07    0.00
Tbo (K)    347.57    0.00

Figure 10. Response of the evaporator pressure in a refrigerant leakage.
5 RESULTS 0.8

+3
In this section the methodology is validated, using 0.6

the model to predict the fault free performance and


0.4
comparing measured and predicted variables. As the -3
evaporator and condenser pressures are the most sen- 0.2
sitive variables to the chiller faults, the residuals of
these two variables are calculated. 0.0
To follow the behaviour of the fault-sensitive vari-
ables a threshold must be set, to decide which deviation -0.2
of the residual must be considered as a fault and which 3300 3500 3700 3900 4100 4300 4500 4700
one must be considered as noise of the signal.
The selection of those thresholds must be studied
Figure 11. Residuals of the condenser pressure due to a
separately in each case, as it will define the sensibil- fault in the condenser’s circuit.
ity and robustness of the fault detection methodology.
In the following experiments, the criteria followed to
set the thresholds is to use a 3σ limit, where σ repre- 0.20
sents the standard deviation of the error between the
0.10
measured and the expected value of the monitored vari-
able. The value of σ is computed using an initialization 0.00
fixed time, tinit , as:
-0.10

max [r (t0 ) , . . . , r (tinit )] + 3σ


α= (1) -0.20
σ [r (t0 ) , . . . , r (tinit )]
-0.30 - 3σ
where r(t) is the value of the residual at time t. -0.40
Thus the behaviour of the error will be considered 5000 5500 6000 6500
valid while it remains between +3σ and −3σ , and the
alarm will take place as the error runs out of these
limits. Figure 12. Evaporating pressure residuals evolution when
The following experiments have been carried out in order to test the functioning of the fault detection methodology. For that, some deviations have been forced to simulate the faults analyzed in the paper.

5.1 Fault in the condenser's circuit

As has been exposed above, the best variable to predict a fault in the condenser's circuit is the condenser pressure. In order to verify whether the fault would be predicted when it takes place, we force this fault in the facility by partially shutting down the valve in the condenser circuit. The evolution of the residuals associated with the condenser pressure is shown in Fig. 11.

As can be seen in this Figure, the residuals of the condenser pressure remain inside the thresholds until the fault takes place. This shows that the selected variable and the proposed methodology would detect the fault without causing false alarms during fault-free operation.

5.2 Fault in the evaporator's circuit

In a similar way, a fault in the evaporator's circuit is simulated by partially shutting down a valve in the circuit. This action causes a reduction of the secondary fluid mass flow rate, and thus a fault in the evaporator's circuit is introduced.

As has been shown in section 4, the most sensitive variable for this fault is the evaporator pressure, so this will be the monitored variable. The evolution of the evaporating pressure residuals is shown in Fig. 12. As shown in this Figure, the fault is clearly detected, since the residuals are clearly out of the thresholds when it occurs. Furthermore, the residuals remain inside the thresholds during fault-free operation, showing the ability of the methodology to avoid false alarms.

5.3 Refrigerant leakage

To carry out the simulation of this fault, a continuous small leakage from the refrigerant circuit is introduced during steady-state operation.

As has been presented above, the evaporator pressure is the most sensitive variable to the refrigerant leakage, so the residuals between expected and actual evaporator pressure are followed. Fig. 13 shows the behaviour of the residuals between the measured and expected values of the evaporator pressure.

[Figure 13: residuals (about +0.10 to −0.20) vs. time (8000–8800 s), with the +3σ and −3σ threshold lines marked]

Figure 13. Evaporator pressure residuals in the refrigerant leakage.

As can be observed in this Figure, the residuals suffer a sharp deviation out of the thresholds, showing the capability of the proposed methodology to detect this kind of fault.

6 CONCLUSIONS

The aim of this article has been to present a methodology to detect common faults in a vapour-compression chiller.

Two kinds of experiments have been carried out: a set of experimental tests, used to analyze the variations of the operating variables during the faults in order to select the most fault-sensitive variables, and another set of experiments, or validation tests, used to test the performance of the fault detection methodology.

In the methodology, a simple steady-state model has been used to obtain the fault-free expected values of the fault-sensitive variables. These values have been used to obtain the residuals between the expected and measured variables. Finally, a methodology based on static thresholds has been applied. The experimental results have shown the capability of the technique to detect the faults.

In future work, the behaviour of the fault detection methodology should be tested with transients associated with the normal operation of the installation, such as heat load variations, in order to see whether it is possible to distinguish this kind of transient from those associated with a fault in the chiller. Once a fault is detected, an interesting subject is to couple the fault detection technique with a classification methodology, to be able to make a diagnosis of the type of failure that is taking place.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Automatic source code analysis of failure modes causing error propagation

S. Sarshar
OECD Halden Reactor Project/Institute for energy technology, Halden, Norway

R. Winther
Østfold University College, Halden, Norway

ABSTRACT: It is inevitable that software systems contain faults. If one part of a system fails, this can affect
other parts and result in partial or even total system failure. This is why critical systems utilize fault tolerance
and isolation of critical functions. However, there are situations where several parts of a system need to interact
with each other. With today’s fast computers, many software processes run simultaneously and share the same
resources. This motivates the following problem: Can we, through an automated source code analysis, determine
whether a non-critical process can cause a critical process to fail when they both run on the same computer?

1 INTRODUCTION

In this paper we report on the results from the practical application of a method that was first presented at ESREL 2007 (Sarshar et al. 2007). The presented method identified failure modes that could cause error propagation through the usage of the system call interface of Linux. For each failure mode, its characteristics in code were determined so it could be detected when analyzing a given source code. Existing static analysis tools mainly detect language specific failures (Koenig 1988), buffer overflows, race conditions, security vulnerabilities and resource leaks, and they make use of a variety of analysis methods such as control and data flow analysis. An examination of some of the available tools showed that no existing tool can detect all the identified failure modes.

The purpose of this paper is to present and evaluate a conceptual model for a tool that allows automatic analysis of source code for failure modes related to error propagation. Focus is on the challenges of applying such analysis automatically. An algorithm is proposed, and a prototype implementation is used to assess this algorithm. In contrast to existing tools, our checks specifically focus on failure modes that can cause error propagation between processes during runtime. As we show by applying the tool on a case, most of the failure modes related to error propagation can be detected automatically.

The paper is structured as follows: Section 2 describes the background and previous work. Section 3 describes the analysis method (Sarshar et al. 2007). In Section 4, existing tools are considered and in Section 5 a method for automatically assessing source code is proposed. Section 6 discusses the proposed approach and concludes our work.

2 BACKGROUND

We performed a study of Linux to understand how processes can interact with the operating system and other processes. This quickly turned our attention to the system call interface, which provides functions to interact with the operating system and other processes. We analyzed these system calls in detail and identified how these functions could fail and become mechanisms for error propagation. A system call may take some arguments, return a value and have some system-wide limitation related to it. To identify failure modes we applied FMEA on each of the function components. The failure effects were determined using fault injection testing. Then, we determined the characteristics of the failure modes in code for detection when analyzing an application's source code. Thus, the proposed methodology enabled criteria to be defined that could make it possible to automatically analyze a given source code for error propagation issues.

In (Sarshar 2007), the methodology was applied on a subset of SYSTEM V APIs related to shared memory. This target was found to be suitable because it involved an intended channel for communication between processes through a shared resource; the memory. We also performed FMEA on other system calls to evaluate whether the method was applicable to a wider class of functions and not restricted to those related to shared memory.

The errors identified in this approach were erroneous values in the variables passed to the system call interface, and errors caused when return, or modified, pointer variables were not handled properly. From the analysis we know not only which functions behave non-robustly, but also the specific input that results in errors and exceptions being thrown by the operating system. This simplifies identification of the characteristics a failure mode has in source code.

Our proposed approach of analyzing error propagation between processes concerns how the process of interest can interact with and affect the environment (the operating system and other processes). A complementary approach could be to analyze how a process can be affected by its (execution) environment. In (Johansson et al. 2007), the authors inject faults in the interface between drivers and the operating system, and then monitor the effect of these faults in the application layer. This is an example where processes in the application layer are affected by their execution environment. Comparing this method to our approach, it is clear that both methods make use of fault injection to determine different types of failure effects on user programs. However, the examination in (Johansson et al. 2007) only concerns incorrect values passed from the driver interface to the operating system. Passing of incorrect values from one component to another is a mechanism for error propagation and relates to problems for intended communication channels. Fault injection is just one method to evaluate a process's robustness with regard to incorrect values in arguments. In (Sarshar 2007), the failure effects of several mechanisms were examined: passing of arguments and return value, usage of return value, system-wide limitations, and sequential issues. These methods complement each other. Brendan Murphy, co-author of (Johansson et al. 2007), from Microsoft Research¹ pointed out his worries at ISSRE 2007: ''The driver developers do not use the system calls correctly. They do for instance not use the return values from the system calls. It is nothing wrong with the API, it is the developer that does not have knowledge about how to use the system calls.''

Understanding the failure and error propagation mechanisms in software-based systems (Fredriksen & Winther 2006) (Fredriksen & Winther 2007) will provide the knowledge to develop defences and avoid such mechanisms in software. It is therefore important to be aware of the limitations of the proposed approach. This analysis only identifies failure modes related to the use of system calls in source code. Other mechanisms for error propagation that do not involve usage of the system call interface will not be covered by this approach. Eternal loop structures in code are an example of a failure mode that does not make use of system calls. This failure mode can cause error propagation because it uses a lot of CPU time.

3 ANALYSIS

In (Sarshar 2007) we analyzed several system calls and identified almost 200 failure modes related to their usage. Because these have their manifestation in the code, it is expected that they also have characteristics that can make them detectable when analyzing source code. This section describes general issues regarding the characteristics the identified failure modes will have in code.

The identified failure modes can be categorized depending on the object they apply to:

• Arguments (syntax analysis)
• Return variable (logical analysis)
• The function's sequential issues and limitations (logical analysis)

Determination of whether arguments contain a specific failure mode can be done with syntax analysis. Syntax analysis can identify the passed argument's variable type and value and check it against the list of failure modes for that specific argument. For the return variable, and for the function's sequential issues and limits, logical analysis is required. This involves tracking of variables and analysis of the control flow of the code.

Failure modes related to passing of arguments can be divided into two groups: (1) variables and (2) pointers. A passed argument variable can have failure modes related to its type, which will be common for all arguments of that given type. In addition it may have failure modes related to the context in which it is used, that is, the context of the system call. Pointer arguments can in a similar way have failure modes related to their type, but will in addition have failure modes related to their usage in the code. Pointer variables or structures are often used to return data from a function when used as an argument.

The following failure modes may occur for a return variable (applies to all system calls with a return variable):

A. The value is not used in the code
B. The value is stored in a variable of wrong type
C. The external variable errno is not used if the return value indicates an error

To determine whether failure mode A occurs in a given source code: (1) the return variable must be retrieved, that is, not ignored with e.g. usage of void;

¹ Cambridge, UK.

Table 1. Some failure mode characteristics in code for shmget().

Ref. | Failure mode | Detectability in code | Check
F.29.1.D | Parameter key is less than type key_t | Check that value is not below zero | Value
F.29.1.E | Parameter key is greater than type key_t | Check that value is not higher than type key_t | Value
F.29.1.F | Parameter key is of wrong type | Check the type of the passed variable to be type key_t | Type
F.29.1.G | IPC_PRIVATE specified as key when it should not | Logical error that can be difficult to identify without program documentation | –
F.29.2.F | Parameter size is of wrong type | Check the type of the passed variable to be type size_t | Type
F.29.3.B | Parameter shmflg is not legal | Calculate the values of legal flags (documented in man page) and verify that the passed value is legal | Value
F.29.3.C | Parameter shmflg is of wrong type | Check the type of the passed variable to be type int | Type
F.29.3.D | Permission mode is not set for parameter shmflg | Check the least 9 bits of shmflg to get hold of the permission mode for user, group and all. If none of these are equal to 5, 6 or 7, permission mode is not set correctly | Value
F.29.3.E | Access permission is given to all users instead of user only | Can be detected using the same approach as for F.29.3.D, but whether it is intended is difficult to predict without documentation | Value
F.29.3.F | Permission mode is write when it should have been read | Mode should have value 5 but has value 7; the intention is difficult to predict without documentation | Value
F.29.3.G | Permission mode is read when it should have been write | Mode should have value 7 but has value 5; not an issue unless a write is performed on the segment later in the code | Value
F.29.4.A | Return value is not used | Track the variable to the end of scope and check whether it is used in the code | Variable usage

(2) the variable must be traced to the end of its scope to determine whether it has been used in some way. Failure mode B can be checked with syntax analysis by verifying that the variable storing the return value is of the same type as described in the documentation for that specific function. In order to determine whether failure mode C is present in the code: (1) the return value must be used in a statement to check whether it indicates an error; (2) the errno variable must be checked in a statement before the end of the function block.

Detection of sequential issues in the use of system calls requires logical analysis. One must keep track of return variables that are used as arguments for other system calls, to determine which calls are related. This information can then be used to determine whether they are run in the correct order. Use of control flow graphs can be very helpful when analyzing source code for such failure modes.

Failure modes related to functional and system-wide limits can be difficult to determine when analyzing source code only. If we do not know the limitations of the underlying operating system, e.g. the maximum number of files one process can have open, it is difficult to predict such a failure mode. However, with use of logical analysis and control flow graphs, one may be able to determine whether the code opens many files and warn the user that such a failure mode may be present in the code.

Though not all failure modes can be detected when analyzing source code, a warning can be given when a given failure mode might be of concern. For instance, if it cannot be determined whether a return variable is used, a warning can be issued to the programmer.

Table 1 lists an excerpt of failure modes with their characteristics in source code for the shmget()² system call, and Table 2 lists failure modes and their characteristics in source code related to sequential issues when using shared memory services.

² Allocates a shared memory segment.

Table 2. Failure mode characteristics in code related to sequential issues for shared memory.

Ref. | Function | Failure mode | Detectability in code | Check
F.shm.A | shmdt() | Target segment is not attached | Check line nr of shmdt() and shmat() to verify that shmat() is run prior to shmdt() for a specific segment | Line numbers
F.shm.B | shmat() | Segment to attach is not allocated | Check line nr of shmat() and shmget() to verify that shmget() is run prior to shmat() for a specific segment | Line numbers
F.shm.C | shmctl() | Target segment is not identified | Check line nr of shmctl() and shmat() to verify that shmat() is run prior to shmctl() for a specific segment | Line numbers
F.shm.D | shmat() | Segment is not detached prior to end of scope | Check line nr of shmat(), shmdt() and end of scope (return), to verify that shmdt() is run prior to return for a specific segment from shmat() | Line numbers
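The line-number checks of Table 2 can be approximated with standard shell utilities, in the spirit of the paper's bash-based prototype. The following is our sketch of the F.shm.B check only (file name and demo program are ours):

```shell
# Sketch (ours) of the F.shm.B check from Table 2: verify that shmget()
# appears on an earlier line than the first shmat() call.
cat > /tmp/seq_demo.c <<'EOF'
int main() {
    int id = shmget(100, 1024, 0600);
    char *p = shmat(id, 0, 0);
    shmdt(p);
    return 0;
}
EOF

src=/tmp/seq_demo.c
get_line=$(grep -n 'shmget' "$src" | head -n1 | cut -d: -f1)
att_line=$(grep -n 'shmat'  "$src" | head -n1 | cut -d: -f1)

if [ -n "$get_line" ] && [ -n "$att_line" ] && [ "$get_line" -lt "$att_line" ]; then
    echo "OK: shmget (line $get_line) precedes shmat (line $att_line)"
else
    echo "WARNING F.shm.B: segment may be attached before it is allocated"
fi
```

A real implementation would also have to match the specific segment identifier, since several segments can be in use at once.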

4 EXISTING TOOLS

A tool search was made on the internet in order to find descriptions of existing analysis tools. We identified 27 tools related to analysis of C code based on this search³. These can be grouped into commercial tools and academic research tools. Although cost prevented actual testing of the commercial tools, the available documentation has been examined to determine whether they can detect failure modes that can cause error propagation. The academic and research tools that are open source or free have been tested with a test program to determine their capabilities regarding the error propagation issues.

These tools were examined in (Sarshar 2007) based on the available documentation. Table 3 lists the availability and relevance of some of the tools: whether they might detect the kind/category of failure modes we have unveiled in our analysis. The relevance of a tool is divided into the following categories:

• Unknown (conclusion based on available documentation)
• No (does not detect any of our failure modes)
• Some (detects some of our failure modes)
• All (detects all of our failure modes)

As expected, we found no tools that could detect all of our failure mode categories.

Table 3. Existing analysis tools for C.

Tool | Rel. | Checks
Coverity | Some | Passing of arguments and return values
Klocwork | Some | Ignored return values
PolySpace | No | –
Purify | Some | Passing of arguments
Flexelint | Some | Passing of arguments and return values
LintPlus | Unkn. | –
CodeSonar | Some | Ignored return values
Safer C toolset | Some | Passing of arguments and sequential issues
DoubleCheck | Some | Sequential issues
Sotoarc | No | –
Astree | Some | –
Mygcc | Some | User-defined checks
Splint (LC-Lint) | Some | Passing of arguments, ignored return values and sequential issues
RATS | Unkn. | –
Sparse | No | –

5 ASSESSMENT OF SOURCE CODE AND PROTOTYPE TOOL

This section describes the development of a prototype tool aimed at evaluating to what extent the process of detecting the characteristics of the identified failure modes can be automated. The aim is to identify the challenges involved in automating such a code analysis process, not to develop a complete tool capable of detecting all failure modes.

For this prototype we chose to use bash scripting, which makes use of the available utilities provided by the Linux operating system.

³ Available online: http://www.spinroot.com/static/, http://www.testingfaqs.org/t-static.html, and http://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis.

Figure 1 illustrates the logic and architecture of the tool, where the arrows indicate control flow and the curved arrows indicate reporting a message to the user without interfering with the normal control flow. The algorithm for analyzing the code is shaded.

The prototype tool is designed in modules to allow easy updates. The tool itself performs the preliminary part, the analysis part and the report part, but the checks that are performed to determine each failure mode are placed in external modules. This will ease the work of adding additional failure modes and system call checks. When a specific parameter is to be checked, the prototype tool identifies its type and value and sends this information to the check-module that holds the check for that specific system call. The module then performs all available checks for the specified parameter and reports the results of the analysis. The return value is checked similarly. We have also included a general module which contains checks that apply to all system calls. This is necessary to avoid duplication of modules containing the same checks.

The tool performs the following steps:

1. Preliminary part:
   • Compile the target source code (using gcc), exit if any error(s)
   • Create a control flow graph (CFG) of the code (using gcc -fdump-tree-fixupcfg)

2. Analysis part:
   • Search for system call usage in the CFG (using the standard C library invoke method)
   • For each system call:
     • Identify usage of parameters and determine their variable type and value by backtracking, then send data for each parameter to the check-module relevant for the system call being analyzed.
     • Check whether the return variable is used, determine its type and name and send this data to both the general check-module and to the check-module for the system call being analyzed.
     • Store line number and system call info for return value and parameters for sequential analysis.
   • Perform sequential analysis
Figure 1. The logic of the prototype tool.

3. Report part
   • This part is integrated with the analysis part in this prototype. Printed messages are in different colors depending on their contents. Warnings are printed in orange and errors are printed in red.

As a test program, consider the code of shm.c in Listing 1.

Listing 1: The code of shm.c

 1  #include <stdio.h>
 2  #include <sys/types.h>
 3  #include <sys/ipc.h>
 4  #include <sys/shm.h>
 5  extern void perror();
 6  int main() {
 7    // key to be passed to shmget
 8    int key = 100;
 9    // shmflg to be passed to shmget
10    int shmflg = 00001000;
11    // return value from shmget
12    int shmid;
13    // size to be passed to shmget
14    int size = 1024;
15    // shmaddr to be passed to shmat
16    char *shmaddr = 00000000;
17    // returned working address
18    const char *workaddr;
19    int ret;
20
21    // Create a new shared memory segment
22    if ((shmid = shmget(key, size, shmflg)) == -1) {
23      perror("shmget failed");
24      return 1;
25    } else {
26      (void) fprintf(stdout, "shmget returned %d\n", shmid);
27    }
28
29    // Make the detach call and report the results
30    ret = shmdt(workaddr);
31    perror("shmdt");
32
33    // Make the attach call and report the results
34    workaddr = shmat(shmid, shmaddr, size);
35    if (workaddr == (char *)(-1)) {
36      perror("shmat failed");
37      return 1;
38    } else {
39      (void) fprintf(stdout, "shmat returned succesfully");
40    }
41    return 0;
42  }

The test program contains selected failure modes, and we will use the program to show how the prototype tool can detect failure modes. The injected failure modes are:

A. In line 22: passing of argument key with type different than key_t (failure mode F.29.1.F)
B. In line 22: passing of argument size with type different than size_t (failure mode F.29.2.F)
C. In line 22: permission mode not set for user when passing argument shmflg (failure mode F.29.3.D)
D. In line 30: no use of the return value from shmdt() (failure mode F.67.2.A)
E. In line 30: passing of argument workaddr which is not attached (failure mode F.shm.A)
F. In lines 37 and 41: segment referred to by workaddr is not released prior to return (failure mode F.shm.D)

The control flow graph of this code is created using:

gcc -fdump-tree-fixupcfg shm.c

An excerpt of the output is illustrated in Listing 2 (empty lines are removed). The notation is GIMPLE⁴, an intermediate representation of the program in which complex expressions are split into three-address code using temporary variables. GIMPLE retains much of the structure of the parse trees: lexical scopes are represented as containers, rather than markers. However, expressions are broken down into a three-address form, using temporary variables to hold intermediate values. Also, control structures are lowered to gotos. Figure 2 illustrates the nodes of this control flow graphically. The three nodes ''<bb 0>'', ''<L0>'' and ''<L1>'' are marked in the figure.

Listing 2: Excerpt of the control flow graph for shm.c generated by gcc

<bb 0>:
  key = 100;
  shmflg = 512;
  size = 1024;
  shmaddr = 0B;
  size.0 = (size_t) size;
  D.2038 = shmget (key, size.0, shmflg);
  shmid = D.2038;
  if (shmid == -1) goto <L0>; else goto <L1>;
<L0>:;
  perror (&"shmget failed"[0]);
  D.2039 = 1;
  goto <bb 5> (<L4>);
<L1>:;
  stdout.1 = stdout;
  fprintf (stdout.1, &"shmget returned %d\n"[0], shmid);
  D.2041 = shmdt (workaddr);
  ret = D.2041;
  perror (&"shmdt: "[0]);
  D.2042 = shmat (shmid, shmaddr, size);
  workaddr = (const char *) D.2042;
  if (workaddr == 4294967295B) goto <L2>; else goto <L3>;

⁴ This representation was inspired by the SIMPLE representation proposed in the McCAT compiler project at McGill University for simplifying the analysis and optimization of imperative programs.

Figure 2. Control flow graph of shm.c.

The tool then performs the described analysis. Figure 3 illustrates the output of running the prototype tool with shm.c as argument.

The tool identified failure modes related to the shmget() system call and failure modes related to sequential issues. Check modules for the shmat() and shmdt() functions were not implemented, and a warning was therefore given by the tool.

Furthermore, the tool identified all of the injected failure modes. It must, however, be validated on several more examples before final conclusions can be drawn about its ability to identify failure modes. This is left to future work. The intention in this paper was to demonstrate that it is possible to detect these failure modes when analyzing source code with an automated tool.

Figure 3. Running the prototype tool.

6 DISCUSSION

In the examination of the system call interface we identified failure modes that could cause error propagation. For each failure mode, a check was described to determine its presence in code. The focus was on detection of these when analyzing a given source code. The analysis process was then to be automated using static analysis. Furthermore, we created a prototype tool that managed to detect these when analyzing source code.

Static analysis tools look for a fixed set of patterns, or compliance/non-compliance with respect to rules, in the code. Although more advanced tools allow new rules to be added over time, a tool will never detect a particular problem if a suitable rule has not been written. In addition, the output of a static analysis tool still requires human evaluation. It is difficult for a tool to know exactly which problems are more or less important automatically. However, a tool can determine the presence of known errors, faults or failure modes, and can determine whether for instance error propagation is a concern for a given program.

Another limitation of using static analysis in our approach concerns whether we can check the passed arguments. In general, we can only control and check the argument values that are statically set in the source code. Dynamic values that are received from elsewhere, e.g. files or terminal input, can not be checked statically. For such cases, dynamic analysis and testing must be applied to the program. However, static analysis can detect whether the variables are of correct type, and within legal range, before they are passed to a library function. To reduce such problems at runtime, one can develop wrapper functions which control parameters against the identified failure modes before these are sent to a system call.
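A minimal wrapper of this kind might look like the following (our sketch; the paper gives no code, and only two of the Table 1 checks are shown):

```c
/* Sketch (ours) of a wrapper that screens shmget() arguments against
   identified failure modes before the real system call is made. */
#include <errno.h>
#include <sys/ipc.h>
#include <sys/shm.h>

static int digit_ok(int m) { return m == 5 || m == 6 || m == 7; }

int checked_shmget(key_t key, size_t size, int shmflg) {
    if (key < 0) {                  /* F.29.1.D: key below the legal range */
        errno = EINVAL;
        return -1;
    }
    /* F.29.3.D: inspect the least 9 bits of shmflg; at least one of the
       user, group or other permission digits should be 5, 6 or 7 */
    if (!digit_ok((shmflg >> 6) & 07) && !digit_ok((shmflg >> 3) & 07)
        && !digit_ok(shmflg & 07)) {
        errno = EINVAL;
        return -1;
    }
    return shmget(key, size, shmflg);
}
```

Checks like these run before any kernel state is touched, so an illegal call can be rejected deterministically even when the argument values only become known at runtime.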

Such wrapper functions will provide safety checks for programs that use dynamic input as parameters for system calls.

We developed a conceptual model for a static analysis tool which we implemented as a prototype. The tool managed to detect many of the failure modes causing error propagation automatically, but not all. It is difficult, if not impossible, to control and check dynamic variables that are passed to system services when performing static analysis.

The implemented prototype tool is not restricted to only checking for failure modes that can cause error propagation. In our analysis of system calls we identified numerous failure modes which could cause other types of failures than error propagation. Using the same principles as for automatic identification of error-propagation-related failure modes, it is possible to extend the tool to also identify these other types of failure modes. Thus, the tool can determine whether error propagation is a concern when using system calls; the tool can indicate which part of the software code is erroneous, and can pin-point the parts of the software that should be focused on during testing (one should also test the component(s) this process affects); the tool can be used by driver and software developers who make use of system calls, to verify that these calls are used correctly (in regards to the documentation); and the tool can be used to identify failure modes which need special attention and exception handling routines.

REFERENCES

Fredriksen R. & Winther R. 2006. Error Propagation: Principles and Methods. Halden Internal Report HWR-775, OECD Halden Reactor Project, Norway.

Fredriksen R. & Winther R. 2007. Challenges Related to Error Propagation in Software Systems. In Safety and Reliability Conference, Stavanger, Norway, June 25–27. Taylor & Francis, pp. 83–90.

Johansson A., Suri N. & Murphy B. 2007. On the Impact of Injection Triggers for OS Robustness Evaluation. In Proceedings of the 18th IEEE International Symposium on Software Reliability Engineering (ISSRE 2007), November 2007, pp. 127–136.

Koenig A. 1988. C Traps and Pitfalls. Reading, Mass., Addison-Wesley.

Sarshar S. 2007. Analysing Error Propagation between Software Processes in Source Code. Master's thesis, Østfold University College, Norway.

Sarshar S., Simensen J.E., Winther R. & Fredriksen R. 2007. Analysis of Error Propagation Mechanisms between Software Processes. In Safety and Reliability Conference, Stavanger, Norway, June 25–27. Taylor & Francis, pp. 91–98.


Development of a prognostic tool to perform reliability analysis

Mohamed El-Koujok, Rafael Gouriveau & Noureddine Zerhouni


FEMTO-ST Institute, UMR CNRS 6174–UFC/ENSMM/UTBM Automatic Control
and Micro-Mechatronic Systems Department, Besançon, France

ABSTRACT: In the maintenance field, prognostic is recognized as a key feature, as the estimation of the remaining useful life of an equipment allows avoiding inopportune maintenance spending. However, it can be difficult to define and implement an adequate and efficient prognostic tool that includes the inherent uncertainty of the prognostic process. Within this frame, neuro-fuzzy systems are well suited for practical problems where it is easier to gather data (online) than to formalize the behavior of the system being studied. In this context, and according to real implementation restrictions, the paper deals with the definition of an evolutionary fuzzy prognostic system for which no assumption on its structure is necessary. The proposed approach outperforms classical models and is well fitted to perform a priori reliability analysis and thereby optimize maintenance policies. An illustration of its performances is given by making a comparative study with another neuro-fuzzy system that emerges from the literature.

1 INTRODUCTION

The growth of reliability, availability and safety of a system is a determining factor with regard to the effectiveness of industrial performance. As a consequence, the high costs of maintaining complex equipment make it necessary to enhance maintenance support systems, and traditional concepts like preventive and corrective strategies are progressively complemented by new ones like predictive and proactive maintenance (Muller et al. 2008; Iung et al. 2003). Thereby, prognostic is considered a key feature in maintenance strategies, as the estimation of the provisional reliability of an equipment as well as its remaining useful life allows avoiding inopportune spending.

From the research point of view, many developments exist to support the prognostic activity (Byington et al. 2002; Jardine et al. 2006; Vachtsevanos et al. 2006). However, in practice, choosing an efficient technique depends on classical constraints that limit the applicability of the tools: available data, knowledge and experience, the dynamics and complexity of the system, implementation requirements (precision, computation time, etc.), available monitoring devices, and so on. Moreover, implementing an adequate tool can be a non-trivial task, as it can be difficult to provide effective models of dynamic systems that include the inherent uncertainty of prognostic. That said, the developments of this paper are founded on the following two complementary assumptions. 1) On one hand, real systems increase in complexity and their behavior is often non-linear, which makes a modeling step harder, even impossible. Intelligent Maintenance Systems must however take this into account. 2) On the other hand, in many cases, it is not too costly to equip dynamic systems with sensors, which allows gathering real data online. Furthermore, monitoring systems evolve in this way.

According to all this, neuro-fuzzy (NF) systems appear to be very promising prognostic tools: NFs learn from examples and attempt to capture the subtle relationships among the data. Thereby, NFs are well suited for practical problems, where it is easier to gather data than to formalize the behavior of the system being studied. Actual developments confirm the interest of using NFs in forecasting applications (Wang et al. 2004; Yam et al. 2001; Zhang et al. 1998). In this context, the paper deals with the definition of an evolutionary fuzzy prognostic system for which no assumption on its structure is necessary. This model is well adapted to perform a priori reliability analysis and thereby optimize maintenance policies.

The paper is organized in three main parts. In the first part, prognostic is briefly defined and positioned within maintenance strategies, and the relationship between prognostic, prediction and online reliability is explained. Following that, the use of Takagi-Sugeno neuro-fuzzy systems in prognostic applications is justified and the ways of building such models are discussed. Thereby, a NF model for prognostic is proposed. In the third part, an illustration of its performances is given by making a comparative study with another NF system that emerges from the literature.

2 PROGNOSTIC AND RELIABILITY

2.1 From maintenance to prognostic

Maintenance activity combines different methods, tools and techniques to reduce maintenance costs while increasing the reliability, availability and security of equipment. Thus, one usually speaks about fault detection, failure diagnosis, and response development (choice and scheduling of preventive and/or corrective actions). Briefly, these steps correspond to the need, firstly, of "perceiving" phenomena, secondly, of "understanding" them, and finally, of "acting" consequently. However, rather than understanding a phenomenon which has just appeared, like a failure (a posteriori comprehension), it seems convenient to "anticipate" its manifestation in order to take adequate actions as soon as possible. This is what can be defined as the "prognostic process", which is the object of this paper. Prognostic reveals itself to be a very promising maintenance activity, and industry shows a growing interest in this thematic, which is becoming a major research framework; see recent papers dedicated to condition-based maintenance (CBM) (Jardine et al. 2006; Ciarapica and Giacchetta 2006). The relative positioning of detection, diagnosis, prognostic and decision/scheduling can be schematized as proposed in Fig. 1. In practice, prognostic is usually performed after a detection step: the monitoring system detects that the equipment overpasses an alarm limit, which activates the prognostic process.

Figure 1. Prognostic within maintenance activity.

2.2 From prognostic to prediction

Although there are some divergences in the literature, prognostic can be defined as proposed by the International Organization for Standardization: "prognostic is the estimation of time to failure and risk for one or more existing and future failure modes" (ISO 13381-1 2004). In this acceptation, prognostic is also called the "prediction of a system's lifetime", as it is a process whose objective is to predict the remaining useful life (RUL) before a failure occurs, given the current machine condition and past operation profile (Jardine et al. 2006). Thereby, two salient characteristics of prognostic appear:

– prognostic is mostly assimilated to a prediction process (a future situation must be caught),
– prognostic is based on the failure notion, which implies a degree of acceptability.

A central problem can be pointed out from this: the accuracy of a prognostic system is related to its ability to approximate and predict the degradation of equipment. In other words, starting from a "current situation", a prognostic tool must be able to forecast the "future possible situations", and the prediction phase is thereby a critical one. The next section of this paper emphasizes this step of prognostic.

2.3 From prediction to reliability

As mentioned earlier, an important task of prognostic is to predict the degradation of equipment. Following that, prognostic can also be seen as a process that allows a priori reliability modeling.

Reliability (R(t)) is defined as the probability that a failure does not occur before time t. If the random variable ϑ denotes the time to failure, with a cumulative distribution function Fϑ(t) = Prob(ϑ ≤ t), then:

R(t) = 1 − Fϑ(t)    (1)

Let us assume now that the failure is not characterized by a random variable but by the fact that a degradation signal (y) overpasses a degradation limit (ylim), and that this degradation signal can be predicted (ŷ) with a degree of uncertainty (Fig. 2). At any time t, the failure probability can be predicted as follows:

F(t) = Pr[ŷ(t) ≥ ylim]    (2)

Figure 2. Prediction and reliability modeling.

Let us note g(ŷ/t) the probability distribution function that denotes the prediction at time t. Thereby, by analogy with reliability theory, the reliability modeling can

be expressed as follows:

R(t) = 1 − Pr[ŷ(t) ≥ ylim] = 1 − ∫_{ylim}^{∞} g(ŷ/t) · dy    (3)

The remaining useful life (RUL) of the system can finally be expressed as the remaining time between the time at which the prediction is made (tp) and the time at which a reliability limit (Rlim), fixed by the practitioner, is underpassed (see Fig. 2).

These explanations can be generalized to a multi-dimensional degradation signal. See (Chinnam and Pundarikaksha 2004) or (Wang and Coit 2004) for more details. Finally, the a priori reliability analysis can be performed if an accurate prognostic tool is used to approximate and predict the degradation of an equipment. This is the purpose of the next sections of this paper.

3 FUZZY MODELS FOR PREDICTION

3.1 Takagi-Sugeno system: A fitted prediction tool

Various prognostic approaches have been developed, ranging in fidelity from simple historical failure rate models to high-fidelity physics-based models (Vachtsevanos et al. 2006; Byington et al. 2002). Similarly to diagnosis, these methods can be associated with one of the following two approaches, namely model-based and data-driven. That said, the aim of this part is not to draw an exhaustive overview of prediction techniques but to explain the orientation of the works undertaken.

Real systems are complex and their behavior is often non-linear and non-stationary. These considerations make a modeling step harder, even impossible. Yet, a computational prediction tool must deal with this. Moreover, monitoring systems have evolved and it is now quite easy to gather data online. According to all this, data-driven approaches have been increasingly applied to machine prognostic. More precisely, works have been led to develop systems that can perform nonlinear modeling without a priori knowledge, and that are able to learn complex relationships among "inputs and outputs" (universal approximators). Indeed, artificial neural networks (ANNs) have been used to support the prediction process (Zhang et al. 1998), and research works emphasize the interest of using them. Nevertheless, some authors remain skeptical, as ANNs are "black-boxes", which implies that there is no explicit form to explain and analyze the relationships between inputs and outputs. According to these considerations, recent works focus on the interest of hybrid systems: many investigations aim at overcoming the major ANN drawback (lack of knowledge explanation) while preserving their learning capability. In this way, neuro-fuzzy systems are well adapted. More precisely, first order Takagi-Sugeno (TS) fuzzy models have shown improved performances over ANNs and conventional approaches (Wang et al. 2004). Thereby, they can perform the degradation modeling step of prognostic.

3.2 Takagi-Sugeno models: Principles

a) The inference principle
A first order TS model provides an efficient and computationally attractive solution to approximate a nonlinear input-output transfer function. TS is based on the fuzzy decomposition of the input space. For each part of the state space, a fuzzy rule can be constructed to make a linear approximation of the input. The global output approximation is a combination of the whole set of rules: a TS model can be seen as a multi-model structure consisting of linear models that are not necessarily independent (Angelov and Filev 2004).

Consider Fig. 3 to explain the first order TS model. In this illustration, two input variables are considered, two fuzzy membership functions (antecedent fuzzy sets) are assigned to each one of them, and the TS model is finally composed of two fuzzy rules. That said, a TS model can be generalized to the case of n inputs and N rules (see hereafter).

Figure 3. First order TS model.

The rules perform a linear approximation of the inputs as follows:

Ri: if x1 is Ai1 and . . . and xn is Ain THEN yi = ai0 + ai1 x1 + · · · + ain xn    (4)

where Ri is the ith fuzzy rule, N is the number of fuzzy rules, X = [x1, x2, . . . , xn]T is the input vector, Aij denotes the antecedent fuzzy sets, j = [1, n], yi is the output of the ith linear subsystem, and aiq are its parameters, q = [0, n].

Let us assume Gaussian antecedent fuzzy sets (this choice is justified by their generalization capabilities and because they cover the whole domain of the variables)

to define the regions of the fuzzy rules in which the local linear sub-models are valid:

μij = exp(−4 ‖xj − x∗ij‖² / (σij)²)    (5)

where (σij)² is the spread of the membership function, and x∗i is the focal point (center) of the ith rule antecedent.

The firing level of each rule can be obtained by the product fuzzy T-norm:

τi = μi1(x1) × · · · × μin(xn)    (6)

The normalized firing level of the ith rule is:

λi = τi / Σ_{j=1}^{N} τj    (7)

The TS model output is calculated by weighted averaging of the individual rules' contributions:

y = Σ_{i=1}^{N} λi yi = Σ_{i=1}^{N} λi xeT πi    (8)

where πi = [ai0 ai1 ai2 . . . ain]T is the parameter vector of the ith sub-model, and xe = [1 XT]T is the expanded data vector.

A TS model has two types of parameters. The non-linear parameters are those of the membership functions (a Gaussian membership like in equation 5 has two parameters: its center x∗ and its spread deviation σ). This kind of parameter is referred to as premise or antecedent parameters. The second type of parameters are the linear ones that form the consequent part of each rule (aiq in equation 4).

b) Identification of TS fuzzy models
Assuming that a TS model can approximate an input-output function (previous section), in practice this kind of model must be tuned to fit the studied problem. This implies two tasks to be performed:

– the design of the structure (number and type of membership functions, number of rules),
– the optimization of the model's parameters.

For that purpose, different approaches can be used to identify a TS model. In all cases, the consequent parameters of the system are tuned by using a least squares approach.

Mosaic or table lookup scheme. It is the simplest method to construct a TS fuzzy system, as the user defines himself the architecture of the model and the antecedent parameter values (Espinosa et al. 2004).

Gradient Descent (GD). The principle of the GD algorithm is to calculate the premise parameters by the standard back-propagation algorithm. GD has been implemented in a special neuro-fuzzy system: the ANFIS model (Adaptive Neuro-Fuzzy Inference System) proposed by (Jang and Sun 1995).

Genetic Algorithms (GAs). GAs are well known for their optimization capabilities. GAs are used by coding the problem into chromosomes and setting up a fitness function. Since the consequent part of a TS model can be calculated by using a least squares method, only the premise part of the model is coded into chromosomes and optimized by the GAs.

Clustering Methods (CMs). The basic idea behind fuzzy clustering is to divide a set of objects into self-similar groups (clusters). The main interest of this type of method is that the user needs to define neither the number of membership functions nor the number of rules: CMs adapt the structure of the TS model through the learning phase.

Evolving algorithms. These algorithms are based on CMs and therefore do not require the user to define the structure of the TS model. In opposition to all previous approaches, they do not need a complete learning data set to start the identification process of the TS model (they start from scratch): they are online algorithms with a self-constructing structure. These approaches were recently introduced (Angelov and Filev 2003; Kasabov and Song 2002).

3.3 Discussion: exTS for prognostic application

The selection of an identification approach for a TS model obviously depends on the prediction context. According to the degradation modeling problem, a prediction technique for prognostic purposes should not be tuned by an expert, as it can be too difficult to catch the behavior of the monitored equipment. Thereby, the first approach for identification (table lookup scheme) should be left aside.

Gradient descent and genetic algorithm approaches allow updating the parameters by a learning process but are based on a fixed structure of the model, which supposes that an expert is able to indicate the adequate architecture to be chosen. However, the accuracy of predictions is fully dependent on this, and such identification techniques suffer from the same problems as ANNs. Yet, the ANFIS model is known as a fitted tool for time-series prediction and has been used for prognostic purposes (Goebel and Bonissone 2005; Wang et al. 2004).

In opposition, clustering approaches require less a priori structure information as they automatically

determine the number of membership functions and of rules. However, in practical applications, the learning process is effective only if sufficient data are available. In addition, once trained, such a TS model is fixed. Thereby, if the behavior of the monitored system changes significantly (like in a degradation phase), predictions can suffer from the lack of representative learning data.

Considering the applicative restrictions that the implementation of a prognostic tool supposes, evolving TS models appear to be the most promising for prognostic applications. Firstly, they are able to update the parameters without the intervention of an expert (evolving systems with regard to the parameters). Secondly, they can be trained in online mode, as they have a flexible structure that evolves with the data gathered from the system: data are collected continuously, which enables new rules to be formed or existing ones to be modified. This second characteristic is very useful to take into account the non-stationary aspect of degradation.

According to all this, an accurate TS prediction technique for online reliability modeling is the evolving one. A particular model is the one proposed by (Angelov and Zhou 2006): the "evolving eXtended Takagi-Sugeno" system (exTS). The way of learning this type of model is presented in the next section and the interest of using it is illustrated in section 4.2.

3.4 Learning procedure of exTS

The learning procedure of exTS is composed of two phases:

– Phase A: an unsupervised data clustering technique is used to adjust the antecedent parameters,
– Phase B: the supervised Recursive Least Squares (RLS) learning method is used to update the consequent parameters.

a) Clustering phase: Partitioning the data space
The exTS clustering phase processes on the global input-output data space: z = [xT, yT]T; z ∈ Rn+m, where n + m defines the dimensionality of the input/output data space. Each one of the sub-models of exTS operates in a sub-area of z. This TS model is based on the calculus of a "potential" (see below), which is the capability of a data point to form a cluster (antecedent of a rule).

The clustering procedure starts from scratch, assuming that the first available data point is the center of a cluster: the coordinates of the first cluster center are those of the first data point (z1∗ ← z1). The potential of the first data point is set to the ideal value: P1(z1) → 1. Four steps are then performed for each new data point gathered in real-time.

Step 1. Starting from k = 2, the potential Pk of the data point zk is recursively calculated at time k:

Pk(zk) = (k − 1) / [(k − 1) + Σ_{j=1}^{n+m} Σ_{i=1}^{k−1} (zij − zkj)²]    (9)

Step 2. The potential of the cluster/rule centers is recursively updated:

Pk(z∗) = (k − 1) Pk−1(z∗) / [k − 2 + Pk−1(z∗) + Pk−1(z∗) Σ_{j=1}^{n+m} (zkj − zk−1j)²]    (10)

Step 3. The potential of the data point (step 1) is compared to boundaries issued from the potentials of the cluster centers (step 2):

P ≤ Pk(zk) ≤ P̄    (11)

where P̄ = max_{i=1}^{N} {Pk(zi∗)} is the highest density/potential, P = min_{i=1}^{N} {Pk(zi∗)} is the lowest density/potential, and N is the number of cluster centers (xi∗, i = [1, N]) formed at time k.

Step 4. If the new data point has a potential in between the boundaries (11), no modification of the rules is necessary. Else, there are two possibilities:

1. if the new data point is close to an old center (min_{i=1}^{N} ‖xk − xi∗‖ < σij/2), then the new data point (zk) replaces this center (zj∗ := zk),
2. else, the new data point is added as a new center and a new rule is formed (N := N + 1; xN∗ := xk).

Note that the exTS learning algorithm presents an adaptive calculation of the radius of the clusters (σij). See (Angelov and Zhou 2006) for more details.

b) RLS phase: update of the consequent parameters
The exTS model is used for on-line prediction. In this case, equation (8) can be expressed as follows:

ŷk+1 = Σ_{i=1}^{N} λi yi = Σ_{i=1}^{N} λi xeT πi = ψkT θ̂k    (12)

where ψk = [λ1 xeT, λ2 xeT, . . . , λN xeT]kT is a vector of the inputs, weighted by the normalized firing levels (λ) of the rules, and θ̂k = [π1T, π2T, . . . , πNT]kT are the parameters of the sub-models.

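As a rough illustration of the four clustering steps described above, the following sketch computes the potential of each new sample against the history (equation 9 in its direct, non-recursive form) and either replaces a nearby center or creates a new rule. This is only a schematic reading of the procedure, not the authors' implementation: real exTS uses recursive accumulators and adapts the cluster radii, whereas the radius `sigma` is an illustrative fixed value here.

```python
import numpy as np

def potential(z_k, history):
    """Potential of a data point (direct form of eq. 9):
    close to 1 when z_k lies near the previously seen data."""
    k = len(history) + 1
    sq_dist = sum(float(np.sum((z_i - z_k) ** 2)) for z_i in history)
    return (k - 1) / ((k - 1) + sq_dist)

def cluster_step(z_k, history, centers, sigma=0.5):
    """One pass of steps 1-4: compute the new point's potential,
    compare it to the centers' potentials, then either replace a
    close center or create a new one (i.e. a new fuzzy rule)."""
    p_new = potential(z_k, history)                       # step 1
    p_centers = [potential(c, history) for c in centers]  # step 2 (direct form)
    p_hi, p_lo = max(p_centers), min(p_centers)
    if not (p_lo <= p_new <= p_hi):                       # step 3: outside the band
        d = [float(np.linalg.norm(z_k - c)) for c in centers]
        j = int(np.argmin(d))
        if d[j] < sigma / 2:                              # step 4.1: replace center
            centers[j] = z_k
        else:                                             # step 4.2: new center/rule
            centers.append(z_k)
    history.append(z_k)
    return centers
```

Fed a stream of samples, the structure grows only when a point's potential leaves the band spanned by the existing centers, which mirrors the "evolve only if relevant" behavior discussed in section 3.3.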
The following RLS procedure is applied:

θ̂k = θ̂k−1 + Ck ψk (yk − ψkT θ̂k−1); k = 2, 3, . . .    (13)

Ck = Ck−1 − (Ck−1 ψk ψkT Ck−1) / (1 + ψkT Ck−1 ψk)    (14)

with initial conditions:

θ̂1 = [π̂1T, π̂2T, . . . , π̂NT]T = 0, C1 = Ω I    (15)

where Ω is a large positive number, C1 is an R(n + 1) × R(n + 1) co-variance matrix, and θ̂k is the estimation of the parameters based on k data samples.

Figure 4. Architecture of an ANFIS with 4 inputs.
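The recursive least squares pass of equations (13)–(15) can be sketched as below. This is a generic RLS step over the weighted regressor ψ, not the authors' code; the dimension and the value Ω = 10⁵ for the "large positive number" are illustrative assumptions.

```python
import numpy as np

def rls_update(theta, C, psi, y):
    """One recursive least squares step.
    theta: parameter estimate, C: covariance matrix,
    psi: regressor vector, y: observed output."""
    psi = np.asarray(psi, dtype=float).reshape(-1, 1)
    denom = float(1.0 + psi.T @ C @ psi)
    C = C - (C @ psi) @ (psi.T @ C) / denom               # eq. (14): covariance update
    theta = theta + C @ psi * (y - float(psi.T @ theta))  # eq. (13): parameter update
    return theta, C

# Initialization as in eq. (15): theta_1 = 0, C_1 = Omega * I
dim = 3
Omega = 1e5                     # assumed value for the large positive number
theta0 = np.zeros((dim, 1))
C0 = Omega * np.eye(dim)
```

Because every quantity is updated recursively from the previous step, the estimate can be refined sample by sample, which is what makes the on-line use of exTS possible.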

4 COMPARATIVE STUDY OF ANFIS AND EXTS

4.1 exTS versus ANFIS

To illustrate the performances of exTS, this model is compared to the ANFIS model (Adaptive Neuro-Fuzzy Inference System) proposed by (Jang and Sun 1995), which emerges from the literature: ANFIS has shown improved performance over conventional methods, and Wang demonstrates that it is a robust machine health condition predictor (Wang et al. 2004).

The whole learning algorithm of ANFIS cannot be fully presented here (see the referenced authors for more explanation). In a few words, ANFIS uses a hybrid algorithm which is the combination of gradient descent (that enables the updating of the antecedent parameters) and of the least squares estimate (that optimizes the consequent parameters).

A specificity of ANFIS can be pointed out: ANFIS is fully connected, which implies that if M membership functions are assigned to each one of the n input variables, then the ANFIS is composed of N = M^n rules (see Fig. 4 for an example). As a consequence, many parameters must be updated but, when well trained, ANFIS may perform good predictions.

4.2 Experimental data sets

Two real experimental data sets have been used to test the prediction performances of ANFIS and exTS. In both cases, the aim of these predictions is to approximate a physical phenomenon by learning data gathered from the system. That can be assimilated to the prediction step of the prognostic process.

Industrial dryer data set. The first data set is issued from an industrial dryer. It has been contributed by Jan Maciejowski1 from Cambridge University. This data set contains 876 samples. Three variables (fuel flow rate, hot gas exhaust fan speed, rate of flow of raw material) are linked with an output one (dry bulb temperature).

For the simulations, both ANFIS and exTS have been used with five input variables. Predictions were made at different horizons h. Assuming that t denotes the current time, the TS models were built as follows:

– input 1: x1(t)—fuel flow rate,
– input 2: x2(t)—hot gas exhaust fan speed,
– input 3: x3(t)—rate of flow of raw material,
– input 4: x4(t)—dry bulb temperature,
– input 5: x5(t − 1)—dry bulb temperature,
– output 1: ŷ(t + h)—predicted dry bulb temperature.

Air temperature in a mechanical system. The second data set is issued from a hair dryer. It has been contributed by W. Favoreel2 from the KULeuven University. This data set contains 1000 samples. The air temperature of the dryer is linked to the voltage of the heating device.

For the simulations, both ANFIS and exTS have been used with five input variables. Predictions concern the air temperature, and the TS models were built as follows:

– inputs 1 to 4: air temperature at times (t − 3) to (t),
– input 5: x5(t)—voltage of the heating device,
– output 1: ŷ(t + h)—predicted air temperature.

4.3 Simulations and results

In order to extract more solid conclusions from the comparison results, the same training and testing data

1 ftp://ftp.esat.kuleuven.ac.be/sista/data/process_industry
2 ftp://ftp.esat.kuleuven.ac.be/sista/data/mechanical
Table 1. Simulation results.

                 ANFIS      exTS
Industrial dryer
t + 1    Rules   32         18
         RMSE    0.12944    0.01569
         MASE    16.0558    2.16361
t + 5    Rules   32         17
         RMSE    0.84404    0.05281
         MASE    114.524    7.38258
t + 10   Rules   32         17
         RMSE    1.8850     0.18669
         MASE    260.140    27.2177
Air temperature
t + 1    Rules   32         4
         RMSE    0.01560    0.01560
         MASE    0.4650     0.47768
t + 5    Rules   32         6
         RMSE    0.13312    0.12816
         MASE    2.01818    1.97647
t + 10   Rules   32         6
         RMSE    0.23355    0.22997
         MASE    3.66431    3.66373

Figure 5. Predictions—industrial dryer, t + 1.
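For reference, the two error measures reported in Table 1 can be computed as below. The MASE follows the definition of Hyndman and Koehler (2006), scaling the mean absolute error of the forecast by the in-sample mean absolute error of the naive one-step forecast; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mase(y_true, y_pred, y_train):
    """Mean Absolute Scaled Error (Hyndman & Koehler 2006):
    MAE of the forecast divided by the MAE of the naive
    previous-value forecast over the training series."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    y_train = np.asarray(y_train, float)
    scale = np.mean(np.abs(np.diff(y_train)))  # naive forecast error in-sample
    return float(np.mean(np.abs(y_true - y_pred)) / scale)
```

A MASE below 1 thus means the model beats the naive forecast on average, which makes scores comparable across series of different scales.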

sets were used to train and test both models. Predictions were made at (t + 1), (t + 5) and (t + 10) in order to measure the stability of the results in time. The prediction performance was assessed by using the root mean square error criterion (RMSE), which is the most popular prediction error measure, and the Mean Absolute Scaled Error (MASE) which, according to (Hyndman and Koehler 2006), is the more adequate way of comparing prediction accuracies.

For both data sets, the learning phase was stopped after 500 samples and the remaining data served to test the models. Results are shown in Table 1.

4.4 Discussion

a) Accuracy of predictions
According to the results of Table 1, exTS performs better predictions than the ANFIS model. Indeed, for the industrial dryer (data set 1), both RMSE and MASE are lower with exTS than with ANFIS. An illustration of this is given in Fig. 5.

However, in the case of the air temperature data set, exTS does not provide better results than ANFIS (RMSE and MASE are quite the same). Moreover, as shown in Fig. 6, the error spreadings of both models are very similar. Yet, one can point out that exTS only needs 6 fuzzy rules to catch the behavior of the studied phenomenon (against 32 for the ANFIS model). This leads us to consider the complexity of the structure of both prediction systems.

Figure 6. Pdf error—air temperature, t + 10.

Table 2. Complexity of the prediction systems. Structural properties for the air temperature benchmark at t + 10.

Criteria                ANFIS              exTS
nb inputs               5                  5
nb rules                32                 6
type of mf              Gaussian           Gaussian
antecedent parameters
mf/input                2                  = nb rules = 6
tot. nb of mf           2 × 5              6 × 5
parameters/mf           2                  2
ant. parameters         2 × 2 × 5 = 20     2 × 6 × 5 = 60
consequent parameters
parameters/rule         6 (5 inputs + 1)   6
cons. parameters        6 × 32 = 192       6 × 6 = 36
parameters              20 + 192 = 212     60 + 36 = 96

b) Complexity of the prediction systems
Let us take the example of the last line of Table 1 to compare the structures of the ANFIS and exTS models. The number of parameters for both systems is detailed in Table 2.

As there are 5 inputs for the Air Temperature application (see 4.2), and assuming Gaussian membership
functions for the antecedent fuzzy sets, the ANFIS model is composed of 212 parameters. Following that, with a more complex application than that of the benchmark studied in the paper, an ANFIS system can quickly be limited by the number of inputs (because the number of parameters to be updated increases). In addition, classically, one says that the number of learning samples for the ANFIS model must be more than five times the number of parameters, which can be critical for industrial practitioners.

In opposition, exTS evolves only if there are significant modifications of the input-output variables, as it has an on-line learning process: exTS starts from scratch with a single rule, and modifications or additions of rules are made only if relevant. As a consequence, for the same prediction purpose, an exTS system can have the same prediction accuracy as an ANFIS model but with fewer rules (6 vs 32 in the case considered in Table 2). This complexity reduction of the prediction system can also be pointed out by considering the total number of parameters (96 vs 212).

c) Computation efficiency
Finally, although it cannot be fully developed in the paper, exTS is much more computationally effective than the ANFIS system. This can be explained from two complementary points of view. Firstly, as stated before, an exTS system can perform predictions with a slighter structure than the ANFIS, which implies that fewer parameters have to be updated. Secondly, when using an exTS system, all learning algorithms are recursive, which allows the on-line use of the system and ensures the rapidity of treatments.

5 CONCLUSION

In the maintenance field, prognostic is recognized as a key feature, as the estimation of the remaining useful life of an equipment allows avoiding inopportune maintenance spending. However, it can be difficult to define and implement an adequate and efficient prognostic tool that includes the inherent uncertainty of the prognostic process. Indeed, an important task of prognostic is that of prediction. Following that, prognostic can also be seen as a process that allows reliability modeling. In this context, the purpose of the work reported in this paper is to point out an accurate prediction technique to perform the approximation and prediction of the degradation of an equipment.

According to real implementation restrictions, neuro-fuzzy systems appear to be well suited for practical problems where it is easier to gather data (online) than to formalize the behavior of the system being studied. More precisely, the paper points out the accuracy of the exTS model in prediction. The exTS model has a high level of adaptation to the environment and to the changing data. It is thereby an efficient tool for complex modeling and prediction. Moreover, no assumption on the structure of exTS is necessary, which is an interesting characteristic for practical problems in industry. exTS is finally a promising tool for reliability modeling in prognostic applications.

Developments are at present being extended in order to characterize the error of prediction at any time and thereby provide confidence intervals to practitioners. The way of ensuring a confidence level is also studied. This work is led with the objective of being integrated into an e-maintenance platform at a French industrial partner (em@systec).

REFERENCES

Angelov, P. and D. Filev (2003). On-line design of Takagi-Sugeno models. Springer-Verlag Berlin Heidelberg: IFSA, 576–584.
Angelov, P. and D. Filev (2004). An approach to online identification of Takagi-Sugeno fuzzy models. IEEE Trans. on Syst. Man and Cybern.—Part B: Cybernetics 34, 484–498.
Angelov, P. and X. Zhou (2006). Evolving fuzzy systems from data streams in real-time. In Proceedings of the Int. Symposium on Evolving Fuzzy Systems, UK, pp. 26–32. IEEE Press.
Byington, C., M. Roemer, G. Kacprzynski, and T. Galie (2002). Prognostic enhancements to diagnostic systems for improved condition-based maintenance. In 2002 IEEE Aerospace Conference, Big Sky, USA.
Chinnam, R. and B. Pundarikaksha (2004). A neurofuzzy approach for estimating mean residual life in condition-based maintenance systems. Int. J. Materials and Product Technology 20:1–3, 166–179.
Ciarapica, F. and G. Giacchetta (2006). Managing the condition-based maintenance of a combined-cycle power plant: an approach using soft computing techniques. Journal of Loss Prevention in the Process Industries 19, 316–325.
Espinosa, J., J. Vandewalle, and V. Wertz (2004). Fuzzy Logic, Identification and Predictive Control (Advances in Industrial Control). N.Y., Springer-Verlag.
Goebel, K. and P. Bonissone (2005). Prognostic information fusion for constant load systems. In Proceedings of 7th Annual Conference on Fusion, Volume 2, pp. 1247–1255.
Hyndman, R. and A. Koehler (2006). Another look at measures of forecast accuracy. International Journal of Forecasting 22–4, 679–688.
ISO 13381-1 (2004). Condition monitoring and diagnostics of machines—prognostics—Part 1: General guidelines. Int. Standard, ISO.
Iung, B., G. Morel, and J.B. Léger (2003). Proactive maintenance strategy for harbour crane operation improvement. Robotica 21, 313–324.
Jang, J. and C. Sun (1995). Neuro-fuzzy modeling and control. In IEEE Proc., Volume 83, pp. 378–406.
Jardine, A., D. Lin, and D. Banjevic (2006). A review on machinery diagnostics and prognostics implementing condition-based maintenance. Mech. Syst. and Sign. Proc. 20, 1483–1510.

198
Kasabov, N. and Q. Song (2002). Denfis: Dynamic evolvinf Wang, P. and D. Coit (2004). Reliability prediction based on
neural-fuzzy inference system and its application for time- degradation modeling for systems with multiple degra-
series prediction. IEEE Transaction on Fuzzy Systems dation measures. In Proc. of Reliab. and Maintain. Ann.
10–2, 144–154. Symp.—RAMS, pp. 302–307.
Muller, A., M.C. Suhner, and B. Iung (2008). Formalisation Wang, W., M.F. Goldnaraghi, and F. Ismail (2004). Prognosis
of a new prognosis model for supporting proactive main- of machine health condition using neurofuzzy systems.
tenance implementation on industrial system. Reliability Mech. Syst. and Sig. Process. 18, 813–831.
Engineering and System Safety 93, 234–253. Zhang, G., B.E. Patuwo, and M.Y. Hu (1998). Forecasting
Vachtsevanos, G., F.L. Lewis, M. Roemer, A. Hess, and with artificial neural networks: the state of the art. Int.
B. Wu (2006). Intelligent Fault Diagnosis and Prognosis Journal of Forecasting 14, 35–62.
for Engineering Systems. New Jersey, Hoboken: Wiley &
Sons.

199
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Fault detection and diagnosis in monitoring a hot dip galvanizing line using multivariate statistical process control

J.C. García-Díaz
Applied Statistics, Operations Research and Quality Department, Polytechnic University of Valencia, Valencia, Spain

ABSTRACT: Fault detection and diagnosis is an important problem in continuous hot dip galvanizing, and the increasingly stringent quality requirements in the automotive industry have also demanded ongoing efforts in process control to make the process more robust. Multivariate monitoring and diagnosis techniques have the power to detect unusual events while their impact is too small to cause a significant deviation in any single process variable. Robust methods for outlier detection in process control are a tool for the comprehensive monitoring of the performance of a manufacturing process. The present paper reports a comparative evaluation of robust multivariate statistical process control techniques for process fault detection and diagnosis in the zinc-pot section of a hot dip galvanizing line.

1 INTRODUCTION

A fault is defined as abnormal process behaviour, whether associated with equipment failure, sensor degradation, set-point change or process disturbances. Fault detection is the task of determining whether a fault has occurred. Fault diagnosis is the task of determining which fault has occurred. Fault detection and diagnosis is an important problem in process engineering. The ''normal state'' can be associated with a set of desired trajectories that can be labelled as normal evolutions. According to Himmelblau (1978), the term fault is generally defined as a departure from an acceptable range of an observed variable or a calculated parameter associated with a process. A diagnosis problem is characterised by a set of observations to be explained. When these observations are different from the expected behaviour, we define a faulty state. The behaviour of a process is monitored through its sensor outputs and actuator inputs. When faults occur, they change the relationships among these observed variables. Multivariate monitoring and diagnosis techniques have the power to detect unusual events while their impact is too small to cause a significant deviation in any single process variable. This is an important advantage because this trend towards abnormal operation may be the start of a serious failure in the process. Robust methods for outlier detection in process control are a tool for the comprehensive monitoring of the performance of a manufacturing process. Outliers were identified based on the robust distance. We then remove all detected outliers and repeat the process until a homogeneous set of observations is obtained. The final set of data is the in-control set: the reference data. The robust estimates of location and scatter were obtained by the Minimum Covariance Determinant (MCD) method of Rousseeuw.

1.1 Industrial process under study

In a continuous hot dip galvanizing line, the steel strip is coated by passing it through a pot of molten zinc, normally between 450 and 480°C (average bath temperature of 460°C), and withdrawing it from the pot through a pair of air-wiping jets to remove the excess liquid zinc, as shown schematically in Figure 1. Galvannealing is an in-line process during which the zinc layer of the steel strip is transformed into a ZnFe layer by diffusion. Depending on the annealing temperature, annealing time, steel grade, aluminium content in the Zn bath, and other parameters, different intermetallic ZnFe phases are formed in the coating, which influence the formability of the galvanized material (Tang 1999).

1.2 Data process

Data for 26 batches were available. Six variables were selected for monitoring the process: the steel strip velocity, four bath temperatures and the bath level. The data are arranged in a 26 × 6 data matrix X. The data can be analyzed to determine whether or not a fault has occurred in the process.

Figure 1. Galvannealing section of a hot dip galvanizing line.

Figure 2. Robust-Mahalanobis distances plot.

2 MULTIVARIATE DETECTION: THE ROBUST DISTANCE

Outlier identification with robust techniques is obtained by considering two distances for each observation (Verboven and Hubert, 2005). Robust estimators of location and scale for univariate data include the median, the median absolute deviation, and M-estimators. In the multivariate location and scatter setting we assume that the data are arranged in an n × p data matrix X consisting of n observations on p variables. When the number of variables p is smaller than the data size n, robust estimates of the centre μ and the scatter matrix Σ of X can be obtained by the Minimum Covariance Determinant (MCD) estimator proposed by Rousseeuw and Van Driessen (1999). The aim of the MCD estimator is to find a subset of data objects with the smallest determinant of the covariance matrix. This location and covariance estimator is very popular because of its high resistance towards outliers. The robust distance of an observation is used to detect whether it is an outlier or not. This robust distance is the robust extension of the Mahalanobis distance. The outliers are those observations having a robust distance larger than the cutoff value c, where c² is the value of the 0.975 quantile of the χ²-distribution with p degrees of freedom.
Using the robust distances and the Mahalanobis distances (Verboven and Hubert, 2005), many outlying observations can be identified. In this plot, we can again identify two groups of outliers. Looking at Figure 2 we clearly see four outlying observations: 7, 17, 18 and 19. Both a classical and a robust analysis would identify these observations, as they exceed the horizontal and vertical cutoff lines. Observation 17 is away from the data majority, whereas observations 18 and 19 are outlying to a smaller extent.

3 FAULT DETECTION AND FAULT DIAGNOSIS

3.1 Principal components analysis

Under normal operating conditions, sensor measurements are highly correlated. Researchers have used principal components analysis (PCA) to exploit the correlation among variables under normal operating conditions (in control) for process monitoring (Nomikos and MacGregor, 1995) and process fault diagnosis (Dunia and Qin, 1998a, b). PCA captures the correlation of historical data when the process is in control. The violation of the variable correlations indicates an unusual situation. In PCA, the matrix X can be decomposed into a score matrix T and a loading matrix P.
The score plot of the first two score vectors shows outliers and other strong patterns in the data. The score plot in Figure 3 shows that batch 17 is above the 99% confidence limit, while the rest of the batches are below it. Batch 17 is abnormal and the rest of the batches are consistent (Nomikos and MacGregor, 1995).
Hotelling's T² and Squared Prediction Error (SPE) control charts for the reference process data are carried out for monitoring the multivariate process. The SPE is a statistic that measures the lack of fit of a model to data (Jackson, 1991).

3.2 Fault detection

Normal operations can be characterized by employing Hotelling's T² statistic. A value of the T² statistic greater than the threshold given by the control limit indicates that a fault has occurred. Figure 4 shows two complementary multivariate control charts for batch 17. The first is a Hotelling's T² chart on the
Figure 3. Batch unfolding PCA score plot for all variables and all batches with 90%, 95% and 99% confidence limits.

Figure 4. On-line Hotelling T²-statistic and SPE control charts with 95% and 99% control limits for batch 17.

first two components from a PCA fit model of the process. This chart checks whether a new observation vector of measurements on six process variables projects on the principal component plane within the limits determined by the reference data. The second chart is an SPE chart. The SPE chart will detect the occurrence of any new events that cause the process to move away from the plane defined by the reference model with two principal components. The limits for good operation on the control charts are defined based on the good reference set (training data). If the points remain below the limits, the process is in control. Batch 17 is out of control.

3.3 Fault diagnosis

Once a fault has been detected, one can calculate the contribution to the SPE statistic for each variable, which can be used to help classify the cause of the faults. Contribution plots are used for fault diagnosis. T²-statistic and SPE control charts produce an out-of-control signal when a fault occurs, but they do not provide any information about the fault and its cause. Figure 5 shows the SPE statistic contribution chart for batch 17. Variables 5 and 6 (bath temperatures) seem to be the major contributors to the SPE.

Figure 5. Batch unfolding PCA SPE statistic chart for batch 17.

Figure 6 shows the T²-statistic and SPE control charts for batches 7, 18 and 19 with 95% and 99% control limits. These charts show that the PCA model adequately describes the reference database: the process is assumed to be out of control if three consecutive points are out of the 99% control limit (no batch is above the 99% control limit).

Figure 6. On-line Hotelling T²-statistic and SPE control charts with 95% and 99% control limits for (a) batch 7, (b) batch 18, and (c) batch 19. The symbol (•) shows a signal of possible fault.
The SPE control charts show that the process itself evolves over time as it is exposed to various disturbances, such as temperature changes; after the process instability occurred, the projected process data returned to the confidence limits. The disturbances are discussed below using the contribution plots.
Contribution plots of the variables at the time when the SPE limit is violated are generated, and the variables with an abnormally high contribution to the SPE are identified. Figure 7 illustrates the results of fault classification in batches 7, 18 and 19. In order to identify the variables that caused this abnormal situation, contribution plots of the variables contributing to the SPE at the time when the SPE violated the limits are generated. Variables 2, 5 and 6 (bath temperatures at strategic points of the pot) seem to be the major contributors to the SPE.

Figure 7. Contribution plots of the variables contributing to the SPE for batch 7, batch 18, and batch 19.

4 CONCLUSIONS

Multivariate monitoring and diagnosis techniques have the power to detect unusual events while their impact is too small to cause a significant deviation in any single process variable. This is an important advantage because this trend towards abnormal operation may be the start of a serious failure in the process. This paper shows the performance monitoring potential of MSPC and the predictive capability of robust statistical control by application to an industrial process. The paper has provided an overview of an industrial application of multivariate statistical process control based performance monitoring through robust analysis techniques.
Outliers were identified based on the robust distance. We then removed all detected outliers and repeated the process until a homogeneous set of observations was obtained. The final set of data is the in-control set, or reference data. The robust estimates of location and scatter were obtained by the MCD method of Rousseeuw.
Hotelling's T² and SPE control charts for the reference process data are carried out for monitoring the multivariate process. Contributions from each fault detected using a PCA model are used in a fault identification approach to identify the variables contributing most to the abnormal situation.

REFERENCES

Dunia, R. and Qin, S.J. 1998a. Subspace approach to multidimensional fault identification and reconstruction. The American Institute of Chemical Engineering Journal 44, 8: 1813–1831.
Dunia, R. and Qin, S.J. 1998b. A unified geometric approach to process and sensor fault identification. Computers and Chemical Engineering 22: 927–943.
Himmelblau, D.M. 1978. Fault Detection and Diagnosis in Chemical and Petrochemical Processes. Elsevier, Amsterdam.
Jackson, J.E. 1991. A user's guide to principal components. Wiley-Interscience, New York.
Nomikos, P. and MacGregor, J.F. 1995. Multivariate SPC charts for monitoring batch processes. Technometrics 37, 1: 41–59.
Rousseeuw, P.J. and Van Driessen, K. 1999. A fast algorithm for the Minimum Covariance Determinant estimator. Technometrics 41: 212–223.
Tang, N.-Y. 1999. Characteristics of continuous galvanizing baths. Metallurgical and Materials Transactions B 30: 144–148.
Verboven, S. and Hubert, M. 2005. LIBRA: a Matlab library for robust analysis. Chemometrics and Intelligent Laboratory Systems 75: 127–136.

Fault identification, diagnosis and compensation of flatness errors in hard turning of tool steels

F. Veiga, J. Fernández, E. Viles, M. Arizmendi & A. Gil
Tecnun-Universidad de Navarra, San Sebastian, Spain

M.L. Penalva
Fatronik, San Sebastian, Spain

ABSTRACT: The motivation of this paper is to minimize the flatness errors in hard turning (facing) operations of hardened tool steel F-5211 (AISI D2) discs using Polycrystalline Cubic Boron Nitride (PCBN) tools with finishing cutting conditions. To achieve this, two strategies have been developed. The first is an on-line Conditional Preventive Maintenance (CPM) system based on monitoring a parameter that correlates well with the geometrical error, namely the passive component of the force exerted by the workpiece on the tool. The second strategy is more sophisticated and consists of the development of an on-line Error Compensation System that uses the error value estimated by the first strategy to modify the tool trajectory in such a way that flatness errors are kept within tolerances. Moreover, this procedure allows the life of the tools to be extended beyond the limit established by the first CPM system, and also reduces part scrap and tool purchasing costs.

1 INTRODUCTION

The process of cutting high-hardness materials with PCBN tools is known as hard turning. This process has been increasingly used in finishing operations and has gained position against common grinding in recent decades. The reason is that it presents two advantages over grinding: a) lower process costs, and b) lower environmental impact (König, 1993).
But it has been recognized that, before becoming a consolidated alternative to grinding, some process outputs have to be improved, such as part surface integrity (absence of white layer) (Schwach, 2006; Rech, 2002), surface roughness (Grzesik, 2006; Özel, 2005; Penalva, 2002) and geometrical errors.
So far, many papers have dealt with the problem of workpiece distortions (Zhou, 2004) induced by the cutting and clamping forces and by the heat flowing into the workpiece from the cutting zone, the motors, the workplace temperature and so on (Sukaylo, 2005). But the effect of tool expansion on error (Klocke, 1998) has yet to receive attention.
However, in hard turning, tool expansion should be considered because, when tools become worn, the heat power in the cutting zone is several times higher than that of conventional turning. In this paper, tool expansion will be addressed for the development of a Conditional Preventive Maintenance (CPM) system in order to determine when worn tools should be replaced by fresh ones so that workpiece errors do not exceed a certain value (Özel 2005, Scheffer 2003). CPM systems (Santos, 2006; Luce, 1999) have the advantage, over common Preventive Maintenance systems, of not requiring any knowledge about the evolution of the variable of interest, in this particular case the workpiece error. However, they need a parameter that is highly sensitive to that variable and easy to follow up on, so that by monitoring the parameter the variable of interest can be estimated.
In this paper, flatness errors in a hard facing operation will be studied, and a CPM that identifies when the tool wear level is such that the tool can no longer machine acceptable workpieces will be proposed.
Once the CPM system has been developed, an on-line Error Compensation System will also be developed so that the tool trajectory can be modified on-line in such a way that errors can be compensated for. This system will allow the further use of tools that have been classified as worn out by the CPM system.

2 FLATNESS ERROR

In Figure 1 an orthogonal model for hard turning can be observed. In zone I the workpiece material is plastically deformed by shear. In zone II sticking and sliding occurs between the chip and the rake face of the tool, and in zone III the workpiece transitory surface slides
Figure 1. Heat generation zones in an orthogonal cutting model.

Figure 2. Typical deviation of tool tip from programed path.

Figure 3. Flatness error profile on a hard turned disc of 150 mm external and 50 mm internal diameters.

on the clearance face. All these zones generate heat that strongly increases with tool wear. But compared to the turning of materials of conventional hardness, the overall heat power generated in hard turning is very high as a result of the exceptional hardness of the workpiece (around 60 HRC).
Most of this heat is removed by the chip itself, but since the tool material in hard turning (CBN) has a thermal conductivity that is substantially greater than that of the chip, the tool temperature increases considerably. This rise in temperature is still greater in hard turning due to the fact that no cutting lubricant is used. Therefore, the tool tip volumetric expansion is high compared to that of conventional turning. Of that volumetric expansion, the component in the radial direction (perpendicular to the workpiece flank wear contact) forces the tool tip to penetrate into the newly formed surface; thus a deviation from the programed tool tip path (Figure 2) is produced and, as a consequence, the local flatness error ef(d) (Figure 3) increases from the tool entrance point to the tool exit point, at which it reaches a maximum value Ef. As this error is further aggravated by the amount of tool wear, its value has to be checked at regular times so the tool can be removed when the total flatness error Ef approaches the tolerance value. In this paper, a Conditional Preventive Maintenance System (CPM) will be developed to assess the Ef value.
This CPM requires a parameter that a) is easy to record, and b) correlates well with the flatness error. The passive component of the force exerted by the workpiece on the tool has been found to fulfill both conditions, and the experimental work leading to the parameter selection is described in the next section.

3 TEST CONDITIONS

To achieve statistically sound results, 6 tools of low CBN content (40–50%), typical for continuous turning, were used, and each tool made 20 facing passes on a disc of the hardened tool steel F-5211 (AISI D2) with 150 mm external and 50 mm internal diameters. All the passes started at the external diameter and no cutting fluid was used.
During the passes, the force signals were measured by a three-component piezoelectric dynamometer. Also, after each pass the resulting face profile was captured by an LVDT (Linear Voltage Displacement Transducer) sensor attached to the tool turret, without removing the workpiece from the machine. A stereoscopic microscope, equipped with a digital camera, was used to measure the maximum flank wear value VBmax of the tool after each pass. All this data is summarized in Table 1.

4 FAULT IDENTIFICATION AND DIAGNOSIS OF FACING ERROR

Two parameters were considered for the estimation of the total facing error Ef (see Figure 3): a) the length cut by the tool, and b) the passive force exerted on the tool. Since both of them are expected to be sensitive
Table 1. Experimental work done for the analysis of flatness error.

Material: UNE F-5211 (62 HRC hardness).
Cutting passes: 20 facing passes on each of 6 different tools.
Workpiece: discs of 150 mm external diameter and 50 mm internal diameter.
CBN tool content: 40–50%.
Operation: finishing.
Cutting conditions: feed rate = 0.1 mm/rev, depth of cut = 0.1 mm, cutting speed = 180 m/min, dry cutting.
Force acquisition: acquisition, during the cutting, of the three force components (cutting, feed and passive).
Geometrical error: radius contour acquired by a contact LVDT sensor.
Tool wear: flank tool wear measured (VBmax).

Figure 4. Evolution of total flatness error Ef with the tool life for 6 different tools.

Figure 5. Flank wear for two different tools after having cut the same length.

to tool wear, they should also be good facing error estimators. First, the sensitivity of the cut length estimator will be studied in section 4.1, and later that of the passive force in section 4.2.

4.1 Cut length as a flatness error estimator

As was expected, tests have shown that the flatness error actually increases with tool wear and therefore with cut length and, since the cut length is very easy and inexpensive to record, it is appealing as an error estimation parameter. The flatness error was recorded for 6 tools, and in Figure 4 it can be observed that the error values have a large scatter. For example, for a cut length of about 2,700 m the scatter is 22 μm (between 18 and 40 μm).
This is a consequence of the dispersion of the VBmax values, which is also high, observed for tools with the same cut length due to the chipping phenomena that sometimes appear. To illustrate this, Figure 5 shows two tools that, after having cut the same length, present very different VBmax values, as a result of the chipping that can be observed in the tool on the right. The total error values Ef of workpieces machined by these 2 tools have been measured and, as expected, the one corresponding to the chipped tool presented a much larger Ef value.
Apart from the dispersion in Ef values, Figure 4 shows that for all of the 6 tools the total flatness error Ef sometimes goes down with respect to that of the previous pass. For example, tool number 5 produces a smaller Ef error in pass number 5 than in the previous pass number 4. This is unexpected, since Ef should have been larger in pass 5 because a larger cut length should have generated a greater tool wear value in pass number 5 compared to pass number 4. An explanation for this could be that the Ef error is a very complex phenomenon that depends not only on the maximum flank wear value VBmax, but also on other parameters like the flank wear topography. It is easy to understand that tools with the same VBmax values but different flank topographies will also create different contact conditions that generate different amounts of heat power and therefore different error values.
Therefore, as a result of the high dispersion of the Ef values, the cut length cannot be considered a suitable estimator of the facing error, so the feasibility of the other parameter, the passive force, will now be considered.

4.2 Passive force as a flatness error estimator

As was commented in section 2, the heat power generated during the cutting process brings about the expansion δ of the tool tip and, as can be seen in Figure 6, the workpiece reacts to this with a passive force Fp. When facing materials of conventional hardness, Fp is very small, but for hard materials this component is even
Figure 6. Diagram of tool tip expansion and force generated due to heat power.

Figure 7. Flatness error (above) and passive force (below) during the facing of a hard steel disc with V = 180 m/min, feed and depth of cut of 0.1 mm, with a PCBN tool with VBmax = 0.15 mm.

Figure 8. Regression line of local flatness error as a function of the passive force variation for 6 different CBN tools.

larger than the cutting and feeding forces and, as a result, could be a facing error estimator, although so far it has not received much attention in the literature (Lazoglu, 2006).
Figure 7 depicts the evolution of the local flatness error ef(d) at the top and the Fp force at the bottom for a randomly selected pass, and a remarkable similarity in their shapes can be observed. This suggests that the correlation between the two may be high.
The correlation coefficient between ef(d) and (Fp(150) − Fp(d)) was calculated for all the tools, and the results show that it is greater than 0.95 for 90% of all the cases. Fp(d) is the passive force at diameter d, and Fp(150) is the force at the external diameter of the disc, where all passes begin. The remaining 10% (8 passes), with a value below 0.95, corresponded to fresh tools (cut length below 500 m); but since fresh tools always give acceptable errors, it can be concluded that, where the tool wear level is high enough to produce unacceptable errors, the correlation of the two curves is good and therefore the ef error can be estimated by recording Fp.
Once it is demonstrated that a linear relationship exists between ef(d) and Fp, the next step is to find its arithmetic expression. To do this, 7 diameters were randomly selected and the values of ef(d) were plotted against the corresponding passive force variations (Fp(150) − Fp(d)); the resulting point cloud fits the following linear expression:

ef(d) = 5.52 + 0.23 (Fp(150) − Fp(d))    (1)

with a linear regression value of 77.6% that is statistically significant.
Since experiments have shown that for any pass the maximum value of the error variation is always located at the minimum diameter (50 mm), equation (1), particularized for this point, can identify the pass at which the total error Ef has reached such a critical value that the operator should proceed to replace the worn-out tool. Therefore, a Conditional Preventive Maintenance (CPM) strategy has been developed. It indicates when a tool should be replaced to keep errors below a critical level.

5 FLATNESS ERROR COMPENSATION STRATEGY

One step further in the direction of tool use optimization is the development of an Error Compensation System that allows the ''on-line'' correction of the tool tip trajectory as it deviates during the facing pass as a result of the tool tip expansion.
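Equation (1) and the replacement rule built on it can be sketched as follows. The coefficients 5.52 and 0.23 come from the paper; everything else (the force values, the tolerance, and the Z-offset helper suggesting how a compensation system could use the estimate) is a hypothetical illustration, not the authors' implementation.

```python
def estimate_flatness_error(fp_150, fp_d, a=5.52, b=0.23):
    """Equation (1): local flatness error ef(d) in microns from the
    passive force variation Fp(150) - Fp(d) in newtons."""
    return a + b * (fp_150 - fp_d)

def cpm_tool_worn(fp_150, fp_50, tolerance_um):
    """CPM rule: the error peaks at the minimum diameter (50 mm), so flag
    the tool for replacement once the estimate there reaches tolerance."""
    return estimate_flatness_error(fp_150, fp_50) >= tolerance_um

def z_offset_mm(fp_150, fp_d):
    """Compensation sketch (hypothetical convention): offset the tool Z
    coordinate by the estimated local error, converted from microns to mm,
    to oppose the tool tip expansion."""
    return -estimate_flatness_error(fp_150, fp_d) / 1000.0

# Hypothetical pass with a 90 N passive force variation at some diameter d:
ef = estimate_flatness_error(250.0, 160.0)   # 5.52 + 0.23 * 90 = 26.22 microns
print(ef, cpm_tool_worn(250.0, 160.0, tolerance_um=25.0), z_offset_mm(250.0, 160.0))
```

With a 25 μm tolerance the hypothetical tool above would be flagged for replacement; with a 30 μm tolerance it could keep cutting, or carry on under the compensation scheme of section 5.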
Figure 9. ''On-line'' compensation strategy diagram.

This is possible by implementing equation (1) in a machine tool with an open CNC control. The system (Figure 9) performs the following tasks:
1 Data acquisition and analysis
2 PC-CNC communication
3 New tool position generation

5.1 Data acquisition and analysis

In this module, the passive force Fp(d) signal is first obtained by a dynamometer mounted on the tool turret; it is then captured by a National Instruments DAQ board and analyzed by a PC using a specific software application developed in LabVIEW. In the PC, the (Fp(150) − Fp(d)) values are calculated for a number of positions, which depends on the required accuracy, and finally, through equation (1), the corresponding ef(d) values are estimated.

5.2 PC-CNC communication

Once ef(d) has been calculated in 5.1, an analog signal with the appropriate compensation value is generated in the LabVIEW application and sent to the CNC unit of the machine via a DAQ board analog output.

5.3 New tool position generation

The machine PLC (Programmable Logic Controller), which periodically analyses all the machine parameters, registers the compensation value of the signal input received by the CNC and writes it in the tool compensation parameter of the machine, so that the Z drive motor can modify the Z coordinate of the tool tip to compensate for the flatness error. The correction period can be selected by the operator, but it will always be higher than the Z axis cycle time (a few centiseconds). For common hard facing cutting speeds and feeds, periods between 0.1 and 1 seconds can give decent compensation results.

6 CONCLUSIONS

A Conditional Preventive Maintenance (CPM) System has been developed for the estimation of the part error value in a hard turning (facing) operation. It is based on the passive component of the force exerted by the workpiece on the tool. This system presents two advantages over the well-known Systematic Preventive Maintenance Systems: an important reduction in workpiece scrap and a more efficient use of the tools.
An Error Compensation System (ECS) has also been developed. It employs the error value that has been estimated with the CPM mentioned above to compensate the tool trajectory in such a way that flatness errors are kept within tolerances, even using tools with a wear level that had been rejected by the CPM system. Compared to the CPM system, the ECS gives better part quality and extends the life of the tools.

ACKNOWLEDGEMENTS

This work has been made under the CIC marGUNE framework, and the authors would like to thank the Basque Government for its financial support.

REFERENCES

Grzesik W., Wanat T. 2006. Surface finish generated in hard turning of quenched alloy steel parts using conventional and wiper ceramic inserts. Int. Journal of Machine Tools & Manufacture.
König W.A., Berktold A., Kich K.F. 1993. Turning vs grinding—A comparison of surface integrity aspects and attainable accuracies. CIRP Annals 42/1: 39–43.
Lazoglu I., Buyukhatipoglu K., Kratz H., Klocke F. 2006. Forces and temperatures in hard turning. Machining Science and Technology 10/2: 157–179.
Luce S. 1999. Choice criteria in conditional preventive maintenance. Mechanical Systems and Signal Processing 13/1: 163–168.
Özel T., Karpat Y. 2005. Predictive modelling of surface roughness and tool wear in hard turning using regression and neural networks. Int. Journal of Machine Tools & Manufacture 45/4–5: 467–479.
Penalva M.L., Arizmendi M., Díaz F., Fernández J. 2002. Effect of tool wear on roughness in hard turning. Annals of the CIRP 51/1: 57–60.
Rech J., Lech M., Richon J. 2002. Surface integrity in finish hard turning of gears. Metal cutting and high speed machining, Ed. Kluwer, pp. 211–220.
Santos J., Wysk R.A., Torres J.M. 2006. Improving production with lean thinking. John Wiley & Sons, Inc.
Scheffer C., Kratz H., Heyns P.S., Klocke F. 2003. Development of a tool wear-monitoring system for hard turning. International Journal of Machine Tools & Manufacture 43: 973–985.
Schwach D.W., Guo Y.B. 2006. A fundamental study on the impact of surface integrity by hard turning on rolling contact fatigue. Int. Journal of Fatigue 28/12.
Zhou J.M., Andersson M., Ståhl J.E. 2004. Identification
pp. 1838–1844. of cutting errors in precision hard turning process’’, Jour-
Sukaylo V., Kaldos A., Pieper H.-J., Bana V., Sobczyk nal of Materials Processing Technology. Vol. 153–154,
M. 2005, ‘‘Numerical simulation of thermally induced pp. 746–750.
workpiece deformation in turning when using various cut-
ting fluid applications’’, Journal of Materials Processing
Technology, Vol. 167, pp. 408–414.

210
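The correction loop of sections 5.1–5.3 (acquire Fp(d), estimate ef(d), write the tool compensation parameter) can be sketched roughly as below. Since equation (1) is not reproduced in this excerpt, `err_model` is only a stand-in for it, and all function names, the sign convention and the numeric values are our own illustrative assumptions, not the authors' implementation.

```python
import time

def compensation_loop(read_passive_force, write_tool_offset, err_model,
                      period_s=0.5, cycles=3):
    """Periodic flatness-error compensation, in the spirit of sections 5.1-5.3.

    read_passive_force(): returns the current passive force Fp(d) [N]
    write_tool_offset(v): writes v into the CNC tool compensation parameter
    err_model(delta_fp):  stand-in for equation (1), mapping the force
                          difference (Fp(150) - Fp(d)) to a flatness error ef(d)
    period_s: operator-selected correction period; the text suggests
              0.1-1 s for common hard facing speeds and feeds
    """
    fp_ref = 150.0  # reference passive force Fp(150) [N], illustrative
    for _ in range(cycles):
        fp = read_passive_force()        # 5.1: data acquisition
        ef = err_model(fp_ref - fp)      # 5.1: estimate flatness error ef(d)
        write_tool_offset(-ef)           # 5.2/5.3: send compensation to the CNC
        time.sleep(period_s)

# Toy usage with a hypothetical linear error model and fake I/O
offsets = []
compensation_loop(lambda: 180.0, offsets.append,
                  lambda dfp: 0.001 * dfp, period_s=0.0, cycles=2)
```

In the real system the PLC, not a PC loop, writes the value into the machine's tool compensation parameter each period; the sketch only shows the data flow.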
From diagnosis to prognosis: A maintenance experience for an electric locomotive

O. Borgia, F. De Carlo & M. Tucci
''S. Stecco'' Energetics Department, University of Florence, Florence, Italy

ABSTRACT: Maintenance policies are driven by specific needs such as availability, time and cost reduction. In order to achieve these targets, the recent maintenance approach is based on a mix of different policies such as corrective, planned and curative ones. Predictive activities are becoming more and more important for the new challenges concerning machine and plant health management.
In the present paper we describe our experience in the development of a rule-based expert system for the electric locomotive E402B used in the Italian railway system, in order to carry out an automated prognostic process. The goal of the project was to develop an approach able to improve the maintenance performance. In particular, we wanted to develop an advanced prognostic tool able to deliver the work orders. This specific issue has been identified and requested by the maintenance operators as the best and only solution that could ensure concrete results.
1 INTRODUCTION

Predictive and condition-based maintenance policies are intended to assess the state of health of the devices to which they are applied. This objective is achieved by running continuous (sometimes on-line) monitoring of the equipment. The ultimate purpose is to carry out the best maintenance operations at the most appropriate time, obviously before the system's failure, in order to minimize the overall costs.

Two important phases of the predictive maintenance process are the incipient fault identification and the subsequent life forecasting, respectively called the diagnostic and the prognostic process.

Nowadays, in industrial applications the diagnostic process is very common and often well executed, thanks also to the recent evolution of sensor technologies, data acquisition and analysis. Conversely, the prognostic approach is not so common, and there are no specific computer-based methodologies or widely spread software tools developed for this second stage.

This paper describes an experience in the railway maintenance environment, aiming at developing a prognostic approach starting from an existing diagnostic system. The case study considered is the diagnostic system of an entire train fleet, composed of 70 locomotives operating on the Italian railway network.

The first step of our work, as explained in paragraph 2, was the data collection action, including extra diagnostic data, additional maintenance reports and information about the commercial services of the locomotive. The user of the locomotive fleet, Trenitalia S.p.A., potentially has a lot of information, but it is fragmented in many different databases. Therefore, as shown in paragraph 3, this data and knowledge is exploited in a very unproductive and inefficient way.

In paragraph 4 the analysis of the diagnostic data is presented. Their great number could not be easily managed; hence, in order to keep only the relevant data carrying some useful information, we adopted a customized statistical approach. In this manner we could remove many of the useless records, for example the ones describing secondary events, and achieve our aim of fault identification through the analysis of only the trustworthy diagnostic records.

The prognostic skill, based on rule logic, was gained by matching maintenance activities and diagnostic records over a restricted time interval and around a specific fault event.

As illustrated in paragraph 5, in the development of the latter task we faced many problems, such as the quantity and reliability of the data, the ambiguity of the information, and the weak know-how about both the product and the maintenance activities.

In order to take advantage of all the available data and information, in paragraph 6 we suggest a new configuration for the predictive maintenance process. As a final point, in the last paragraph some considerations are proposed regarding the future perspectives
of the true integration of the prognostic approach within the maintenance strategies.

2 DATA, INFORMATION AND KNOWLEDGE SOURCES

Let us consider one of the electric locomotives of the investigated fleet as a binary system with two possible states: operational and under maintenance. All the data and information collected in the first phase of the project can be ascribed to one of these two states. They have been classified in two clusters, and the criterion for the classification has been the state of the rolling stock at the time when the data were generated.

2.1 Operational data & information

The first set is made up of the information coming from the operational phase. It is summarized in Table 1, with a brief description and with the name of the database each item belongs to.

Table 1. Operational data & information.

• Event code (DR View): the locomotive is equipped with a control system able to give a specific alarm when a set of operation parameters exceed their standard values. The alarms are available on line for the drivers while the mission is going on, and they are also stored for maintenance purposes.
• Diagnostic code (DR View): the locomotive is equipped with a control system able to give a diagnostic message when a set of abnormal parameters or event codes have been recorded. The diagnostic codes are stored during the mission and are available for the maintenance operators.
• Commercial service (GEST-RAZ): the daily travels, with distance information, the main stations and the number of trains that have been carried out by a locomotive.

The first two data types are initially stored in the memory of the train's data logger. Then, every five minutes, they are sent by means of a GPS transmission to a ground-fixed server. A specific web application, named DRView, allows the data to be visualized and quickly filtered by advanced database queries. Thanks to the GPS technology, each data record is also provided with some information about the geographic position of the place where it was generated. At the same time, some essential information (event code) is shown directly on the on-board screen of the driver's display, in order to help manage the train while the mission is going on.

The last data type is the set of daily missions carried out by the locomotive, enriched with some related information. These data come from a fleet-management software which, at the present time, is not integrated with the DRView tool.

The first two kinds of information just mentioned are not available for the whole fleet, since the data logger and the transmission apparatus are installed on only two vehicles (namely numbers 107 and 119). The data from the other locomotives are never downloaded into the databases. The data available on the fixed ground server have been recorded within a time interval starting from June 2006 until now. In the meanwhile, Trenitalia is carrying out an ongoing project for the extension of the on-board GPS transmitter to another 10 locomotives, in order to make a wider amount of data available.

2.2 Maintenance data & information

The second set of data comes from the maintenance phase, because it is collected by the maintenance crew and managed through a dedicated CMMS. It is summarized in Table 2.

Table 2. Maintenance data & information.

• Maintenance note (SAP PM): a maintenance note contains the information about a problem that could be solved by a maintenance operation.
• Work order (SAP PM): work orders are generated originating from a maintenance note. A work order contains the information for carrying out the maintenance operation (task, skills, spare parts, …). After fulfilling the job, the record is completed with the description of the accomplished maintenance activity.
• Expert opinion (server worksheet): every important fault or breakdown with important consequences in the operational phase is analyzed and commented on by a group of experts.

Maintenance notes and work orders are available for all components of the fleet, but the quality of the included information is often very poor.
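To make the structure of these sources concrete, the records described in Tables 1 and 2 can be modeled roughly as below. The field names are our own illustrative choice; they are not the actual DRView or SAP PM schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DiagnosticRecord:
    """One diagnostic-code row as stored on the ground server (illustrative)."""
    code: str                      # diagnostic code identifier
    start_time: datetime           # appearance of the abnormal condition
    end_time: Optional[datetime]   # None while the condition is still ongoing
    latitude: float                # GPS position where the record was generated
    longitude: float

@dataclass
class WorkOrder:
    """One CMMS work order (illustrative)."""
    note_ids: list                 # maintenance notes gathered into this order
    activity: str                  # description of the accomplished activity

# Example: an abnormal condition that lasted 40 seconds
rec = DiagnosticRecord("F123", datetime(2007, 3, 1, 10, 15, 0),
                       datetime(2007, 3, 1, 10, 15, 40), 43.77, 11.25)
duration = (rec.end_time - rec.start_time).total_seconds()
```

The start/end time pair is the attribute that section 5.2 later uses to classify codes as impulsive, enduring or ongoing.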
A maintenance note can have three different origins, each one with its specific identification code:

• ZA, if it comes from a note on the trip book of the drivers. These are advices collected during the mission and generally point out some abnormal conditions;
• ZB, if it comes from the inspection of the locomotive when it enters the maintenance workshop;
• ZC, a code provided for a future automated monitoring system that will be able to diagnose the state of the vehicles.

The opinion of the experts is simply recorded in a worksheet. Nevertheless, the included information is very useful to understand what has really happened to a locomotive in case of an alarming failure. Sometimes some suggestions for the development of a correct approach to manage the fault are also reported, together with some indications about when it would be better to intervene.

After reporting the list of data & information available in the three different non-integrated databases, the next paragraph will explain how they are managed and used to perform the diagnosis of the degradation state of a locomotive and the subsequent maintenance activities.

3 THE CURRENT MAINTENANCE MANAGEMENT PROCESS—''AS IS''

A diagnostic system has the main goal of supporting an on-condition maintenance policy, in order to prevent failures and, when faults eventually occur, to help in precisely identifying the root causes.

At the present time, the maintenance process of the locomotives' owner provides two different procedures for carrying out a work order. Both of them generate either corrective or planned maintenance operations. Usually a locomotive goes to a maintenance workshop only if a deadline of the maintenance plan has expired, or if there is a heavy damage that causes some delays or a reserve. The fact remains that the suggestions of the on-board diagnostic system are currently ignored. The only way to look at the operational data is the on-board driver's logbook. In such a document the abnormal conditions, episodes or events (event codes) are reported as long as the on-board operator considers them important for the future maintenance activities.

3.1 Maintenance note and work order

As mentioned in paragraph 2, maintenance warnings can be generated in the operational phase, even from the traffic control room. It happens, for instance, when the locomotive has a problem during a mission and the driver signals a specific problem. Moreover, maintenance notes can be generated by a maintenance worker when the locomotive is under maintenance, either for a planned task or for a problem detected in the general inspection performed when it enters the workshop. This general initial inspection obviously includes the analysis of the travel book.

This approach doesn't allow a clear correspondence between maintenance notes and work orders because, predictably, very often a work order contains the reports of a few maintenance notes. Sometimes it can also happen that a single work order contains at the same time tasks coming from the maintenance plan and tasks coming from abnormal conditions signaled in the operational state. Figure 1 represents the different processes that can lead to a definite maintenance activity on the vehicles.

Figure 1. Work order generation process.

3.2 What happens to a locomotive?

The maintenance process just described provokes a strong lack of balance in the maintenance frequencies. For a long period (three or four weeks) a vehicle is not maintained, and it is recalled to the workshop only if a severe fault occurs or if a planned task must be accomplished. This approach generates a big accumulation of stress, due to the natural decay process of the locomotive's main equipment. This behavior usually leads to a series of serious damages, a debacle for the mission of the train.

3.3 Information & data quality

The most significant problem of the collected data and information is their intrinsic reliability. For example, all the locomotives' documentation, such as manuals, schemes, etc., has not been updated
for a long time, and sometimes it is very difficult to restore the right knowledge. Since the locomotive's manufacturers are no longer traceable, while the relations with the diagnostic system's manufacturer are still good, a lot of helpful information has been lost.

On the other hand, the maintenance data are very often extremely poor in terms of information. This is mainly due to the maintenance operators' approach, which doesn't give the right value to the collection of maintenance data. Hence the description of the maintenance activities doesn't allow an accurate understanding of which tasks have actually been accomplished.

Although our locomotives belong to a previous generation, without any chance of improvement, the average age of the vehicles is very low. For this reason the fleet hopefully still has many years of operation ahead, justifying the efforts, in terms of human resources (experts) and technological investments (diagnostic system), aimed at improving the maintenance strategy.

In the next paragraphs we will report about the analysis and studies performed to carry out a diagnostic and prognostic approach able to take more advantage of the existing data.

4 EXPERT SYSTEM DEVELOPMENT

The available data and their attributes suggested to us that a rule-based system would be the most suitable approach in order to improve the maintenance performance. Considering Figure 2, on the one side we have the data coming from the operational phase; they are very unmanageable, as we will show in the next paragraph, but, if properly treated, they represent very important information and a signal of the incipient fault.

Figure 2. Scheme of the available data.

On the other side of the picture we can see the maintenance data. Unfortunately, their poor quality doesn't allow us to use them to drive the rule construction process, as would be expected in an ideal system configuration. At the moment the maintenance information can barely be used to receive a confirmation after a diagnosis carried out from the diagnostic codes of the on-board monitoring system.

A diagnostic rule is a proposition composed of two arguments, the hypothesis and the thesis. The hypotheses (XX, YY) are the events that must happen in order to generate an event (ZZ) and therefore verify the thesis. If the diagnostic rule is maintenance oriented, instead of the cause the rule can contain the specific maintenance activity able to interrupt the degradation, or to resolve a breakdown if the fault has already occurred.

In the case of advanced diagnostic systems, the rule can be prognostics oriented. This means that the thesis statement is a prediction of the remaining life to failure of the monitored component. In the case of on-condition based maintenance, the thesis statement contains the prediction of the deadline for performing the appropriate maintenance activities. These actions should be able to cut off, or at least manage, the incoming component failure. Figure 3 is meant to explain these different rule methods.

Figure 3. Rule construction approach.

In this project, one of the goals is to gain a prognostic approach from the diagnostic system. Possibly it should also be biased toward conducting an on-condition maintenance. So the thesis declaration will be generated by matching the user's manual indications, the experts' suggestions and the work order activity descriptions. Since the hypotheses are the most critical element of the rule, they will be extracted from the on-board diagnostic database by a significant statistical treatment. We used the techniques developed for process control in order to manage a huge number of data.
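The hypothesis → thesis structure just described can be sketched as follows. The rule content (trigger codes, threshold, activity, deadline) is purely illustrative and not taken from the actual E402B expert system.

```python
from dataclasses import dataclass

@dataclass
class PrognosticRule:
    """If the triggering codes occur often enough (hypothesis),
    schedule a maintenance activity within a deadline (thesis)."""
    trigger_codes: set      # diagnostic codes forming the hypothesis
    min_occurrences: int    # occurrence threshold that fires the rule
    activity: str           # maintenance activity (thesis)
    deadline_days: int      # time allowed to perform it

    def fires(self, observed_codes: list) -> bool:
        # count how many observed codes belong to the hypothesis set
        hits = sum(1 for c in observed_codes if c in self.trigger_codes)
        return hits >= self.min_occurrences

# Illustrative rule: codes "XX" and "YY" announce failure event "ZZ"
rule = PrognosticRule({"XX", "YY"}, 3, "inspect HVAC compressor", 5)
print(rule.fires(["XX", "YY", "XX", "AA"]))  # three matching codes -> True
```

The occurrence-threshold form of the hypothesis anticipates the control-chart limits derived in section 6, where the threshold becomes the UWL or UCL of a u-chart.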
5 DATA ANALYSIS AND TREATMENT

As mentioned in paragraph 2, at the moment the available data of the on-board diagnostic system are only those of locomotives number (#) 107 and 119. As reported in Table 3, the data comparison shows significant differences between the two vehicles. Locomotive #107 has nearly twice the number of codes per kilometer of #119. Conversely, the maintenance activities are distributed with the opposite ratio: #119 has roughly 33% more maintenance operations than #107. The data used in the analysis were collected over five months, from January to May 2007.

Table 3. Data comparison.

Data                Loco n.107    Loco n.119
Event codes         0.53 km−1     0.28 km−1
Diagnostic codes    1.2 km−1      0.52 km−1
Maintenance notes   88            129
Work orders         23            44

Although the vehicles are identical in design, in all the subsystems' characteristics and in the operational and maintenance management, each rolling stock has its own specific behavior. So the diagnostic approach, being based on statistical methods, can't be developed without considering the specific vehicle. Aging, obviously, causes a natural and unavoidable drift of the functional parameters. This phenomenon is strongly amplified in complex systems, such as a modern electric locomotive, where all the subsystems are dependent on each other. Accordingly, the time dependence of the reference parameters has to be taken into account in order to guarantee a dynamic diagnostic process.

5.1 Filtering

First of all, the most important action of the data treatment was the filtering, which allowed the management of a suitable amount of information. The initial number of codes was very high: the average production is 1.8 codes·km−1 and 0.8 codes·km−1 for #107 and #119 respectively. The progressive filtering procedure and its results in terms of code identification are reported in the following table; for each step, the reason of the choice is also reported.

The first step of the filtering process is based on the indications reported in the user's manuals, where it is shown which codes are useful to support the driver in the management of abnormal conditions during the mission. The next removal hits the codes whose meaning is completely unknown and those codes whose related information is not enough. This unpleasant circumstance is due to a bad management of the fleet's documentation; many events, such as diagnostic system uploads or modifications, are not mapped by the user, so now and then important knowledge is lost. The final number is achieved through the cutting of the many life messages and of some codes considered not reliable after a technical discussion with some maintenance experts.

Figure 4. Filtering process.

In Figure 4 the data of locomotive #119 are shown. From this point of the paper until the end, we will show the study results for these data. The same process was carried out for the other vehicle but, for brevity's sake, it won't be described, since the results are significantly similar.

5.2 Pareto analysis and codes classification

The diagnostic system is based on about two hundred different codes. Their great number led us to perform a Pareto analysis in order to identify the most important ones. The results in terms of code occurrences are reported in Figure 5.

Figure 5. Pareto analysis of diagnostic code.

As clearly visible, the eight most frequent codes correspond to 80% of the occurrences. Although they
could be easily clustered in two different sets, referring to two subsystems of the locomotive (engineer's cab HVAC and AC/DC power transformation), we will treat them separately, without considering any mutual interaction.

Another important element is the logic of code generation by the on-board diagnostic system. As mentioned in paragraph 2, a diagnostic code is generated when a set of specific physical parameters reach their threshold values, meaning that abnormal operating conditions have been reached. Each code record has a start time and an end time field, representing respectively the appearance and the disappearance of the anomalous conditions. The values of these attributes allow us to suggest a classification of codes based on the duration of the abnormal conditions:

• Impulsive signal: code with the same value for the start time and the end time. It means that the duration of the abnormal condition is less than one second;
• Enduring signal: code with different values for the start time and the end time. The duration has a high variability (from some seconds up to some hours);
• Ongoing signal: code characterized by a value for the start time but without an end time. It means an abnormal condition that still persists.

Many codes are generated by different arrangements of signals. Nevertheless, they represent an alteration of the equipment's state, so they are able to describe the states of the locomotive's subsystems.

6 THE PROPOSED METHODOLOGY

6.1 Statistical process control

Production processes will often operate under control, producing acceptable products for quite long periods. Occasionally some assignable cause will occur, resulting in a shift to an out-of-control state, where a large proportion of the process output does not conform anymore to the user's requirements. The goal of statistical process control is to quickly detect the occurrence of such causes or process shifts, so that investigation of the process and corrective actions may be undertaken before many nonconforming units are manufactured.

The control chart is an online process-monitoring technique widely used for this purpose. It is a tool useful for describing what is exactly meant by statistical control. Sample data are collected and used to construct the control chart; if the sampled values fall within the control limits and do not exhibit any systematic pattern, we can say that the process is in control at the level indicated by the chart.

Control charts may be classified into two general types:

• variables control charts (Shewhart, CUSUM, EWMA), when the variable is measurable and its distribution has a central tendency;
• attributes control charts (p-charts, np-charts, c-charts, u-charts), when the variable is not measurable and its occurrence is characterized by a Poisson distribution.

Traditional control charts are univariate, i.e. they monitor an individual variable. This implies the assumption that the variables or the attributes used for describing the system are independent from each other. Using multivariate control charts, all the meaningful variables or attributes are used together. The information residing in their correlation structure is extracted, and it allows a more efficient tracking of the process over time for identifying anomalous process points. Although multivariate control charts seem more useful for describing complex processes, they are used less than the univariate ones. This is due to two reasons: the utilization of univariate charts is simpler and more efficient; moreover, a successful implementation of a multivariate control chart often requires further statistical studies, such as principal components analysis.

6.2 Control chart application

A diagnostic code represents an attribute describing the state of its corresponding subsystem. We have applied the u-chart, a control chart dedicated to attributes, to the daily code occurrences. Our resulting control chart differs from the standard one for some important parameter differentiations. The parameter expressions used are the following:

Central line, CL = (1/m) Σ_{i=1..m} u_i = ū   (1)

Upper control limit, UCL = ū + 3·√(ū/n_i)   (2)

Lower control limit, LCL = max{ū − 3·√(ū/n_i), 0}   (3)

Upper warning limit, UWL = ū + 1.5·√(ū/n_i)   (4)

where u_i is the daily frequency per kilometer of the code, m is the number of days during which the process is considered under control, and n_i is the daily distance covered by the vehicle.

The central line of the control chart has been calculated over a time interval (m days) where the considered locomotive is under statistical control, so the recorded diagnostic codes can be regarded just as noise, because they are generated by non-identifiable stochastic events. The choice of the daily frequency per km as the chart variable forces us to introduce a variance of the control limits depending on the daily covered distance.

As reported above, the lower control limit is equal to the standard expression only when the latter is greater than zero; otherwise it is zero. This solution is adopted because the definition of a negative threshold for an event occurrence has no meaning. Moreover, we have introduced a further limit, the upper warning limit, whose offset from the central line is half that of the UCL.

In Figure 6 an example of the application of the control chart to a diagnostic code is reported.

Figure 6. u-chart for a diagnostic code.

6.3 The prognostic approach

In paragraph 4 we have underlined how the expert system should be based on a rule approach with a prognostic focus. The data analysis process identifies the hypotheses of the rules as thresholds on the occurrence of a diagnostic code. On the other hand, the thesis of the rule is a maintenance note for the subsystem, including a deadline warning for accomplishing the appropriate maintenance activity. Hence the rule model sounds like the following assertion: if a diagnostic code's occurrence rises above a specific value, a maintenance action must be performed within a fixed time.

The definition of the control chart parameters was performed looking at the problem from this point of view; consequently, two different thresholds have been defined: UWL and UCL. Both of them correspond to a specific deadline:

• when a code occurrence gets over the UWL, the maintenance activity should be performed within five days;
• when a code occurrence gets over the UCL, the maintenance activity should be performed as soon as possible.

The execution of the control charts as a monitoring tool of the reliability performance of the train introduces another important element: the trend analysis, i.e. matching the trend occurrence with the experts' experience. When a code occurrence falls three consecutive times between the CL and the UWL, a degradation process must be in progress, so a maintenance activity should be scheduled within the following ten days.

7 FINAL CONSIDERATIONS

7.1 The next step of the analysis

There are a lot of opportunities for the extension of this project, following various directions. First of all, we could investigate the possibility of detecting specific thresholds and maintenance deadlines for each diagnostic code. A second step could be the investigation of the daily code frequency in terms of run hours, by a different attribute control chart (c-chart). After that, we could try to carry out a multivariate statistical analysis, trying to cluster the codes belonging to the same subsystems. As foreseeable, they are conditioned by each other, but in this first phase of the analysis we didn't have enough time to study their correlations. Another important issue is the opportunity to investigate the daily permanence time of diagnostic codes, using control charts for variables. Probably this data is more helpful in terms of information than the code occurrence.

7.2 Suggestions for the maintenance process

The project has highlighted some criticalities in the maintenance process management. On the other side, a wide set of opportunities and possible improvements has been outlined. First of all, in order to obtain some helpful maintenance data, an important quality improvement should be carried out. This can be obtained by two main actions:

• a strong simplification of the data entry for the maintenance operators. A computerized check list of the possible entries for each field of the documents is
the only solution suggested by the maintenance experts' experience. A fixed number of choices per field can decrease the richness of the information but can improve data reliability and availability;
• an effective awareness campaign for the operators, who should be properly motivated by a specific training course and, eventually, by financial rewards.

We then suggest the implementation of a computer-based platform able to:

• integrate all the data, coming from different sources, that are helpful for the prognostic approach;
• carry out the analysis process that has been developed in this paper.

Another important feature of the platform could be the ability to automatically generate maintenance notes whenever a rule of the expert system based on statistical process control is verified. Only this type of integration and automation can be helpful to implement a proper predictive maintenance strategy.
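The expert-system rule quoted earlier (a diagnostic code whose occurrence rises above a specific value triggers a maintenance action within a fixed time) could be sketched as follows. The code names, thresholds, and deadlines below are illustrative assumptions, not values from the project:

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical rule table: diagnostic code -> (occurrence threshold, days
# allowed for the maintenance action). All names and values are illustrative.
RULES = {
    "E101": (5, 7),   # more than 5 occurrences -> act within 7 days
    "E205": (3, 2),   # more than 3 occurrences -> act within 2 days
}

def generate_notes(event_codes, today):
    """Count diagnostic codes in the collected events and emit a
    maintenance note, with a due date, for every code whose occurrence
    rises above its threshold."""
    counts = Counter(event_codes)
    notes = []
    for code, n in counts.items():
        rule = RULES.get(code)
        if rule and n > rule[0]:
            notes.append({"code": code, "occurrences": n,
                          "due": today + timedelta(days=rule[1])})
    return notes
```

In an integrated platform, such a routine would run against the cleaned maintenance records and feed the generated notes directly into the work-order system.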
7.3 Overview

This paper reports the partial results of a project that is still in progress but already faces the most relevant and typical problems of the maintenance field.
As foreseeable, nowadays the available technology guarantees any data and information without storage constraints or communication limits. So the current challenges are, on the one side, data management, in terms of integration and treatment by the expert system, and, on the other side, human resource management, in terms of expert experience formalization and field operator activities.
Addressing these issues is necessary to implement a maintenance strategy based on a prognostic approach that is able to meet the requirements of a competitive market.
Human factors
A study on the validity of R-TACOM measure by comparing operator response time data

Jinkyun Park & Wondea Jung


Integrated Risk Assessment Division, Korea Atomic Energy Research Institute, Daejeon, Korea

ABSTRACT: In this study, in order to validate the appropriateness of the R-TACOM measure, which quantifies the complexity of tasks included in procedures, an operator's response time (the elapsed time to accomplish a given task) was compared with the associated R-TACOM score. To this end, operator response time data were extracted under simulated Steam Generator Tube Rupture (SGTR) conditions of two reference nuclear power plants. As a result, it was observed that operator response time data are soundly correlated with the associated R-TACOM scores. Therefore, it is expected that the R-TACOM measure will be useful for quantifying the complexity of tasks stipulated in procedures.

1 INTRODUCTION

For several decades, many studies have shown that the reliable performance of human operators is one of the determinants for securing the safety of any human-involved process industry and/or system, such as nuclear industries, aviation industries, petrochemical industries, automobiles, marine transportation, medicine manufacturing systems, and so on (Yoshikawa 2005, Hollnagel 2005, Ghosh & Apostolakis 2005). In addition, related studies have commonly pointed out that the degradation of human performance is largely attributable to complicated tasks (Topi et al. 2005, Guimaraes et al. 1999, Melamed et al. 2001). Therefore, managing the complexity of tasks is one of the prerequisites for the safety of human-involved process industries and/or systems.
For this reason, Park et al. developed a task complexity measure called TACOM that consists of five complexity factors pertaining to the performance of emergency tasks stipulated in the emergency operating procedures (EOPs) of nuclear power plants (NPPs) (Park and Jung 2007a). In the course of validating the TACOM measure, however, a potential problem related to the interdependency of the five complexity factors was identified. Subsequently, in order to resolve this problem, the revised TACOM (R-TACOM) measure was developed based on a theory of task complexity (Park and Jung 2007c).
In this study, in order to validate the appropriateness of the R-TACOM measure, operator response time data that were extracted under simulated steam generator tube rupture (SGTR) conditions of two reference nuclear power plants were compared with the associated R-TACOM scores. As a result, it was observed that operator response time data are soundly correlated with the associated R-TACOM scores. Therefore, it is expected that the R-TACOM measure will be useful for quantifying the complexity of tasks stipulated in procedures.

2 BACKGROUND

2.1 The necessity of quantifying task complexity

As stated in the foregoing section, human performance related problems have been regarded as one of the radical determinants for the safety of any human-involved system, and it is natural that a great deal of work has been performed to unravel them. As a result, it was revealed that the use of procedures is one of the most effective countermeasures (AIChE 1994, IAEA 1985, O'Hara et al. 2000). In other words, procedures are very effective in helping human operators accomplish the required tasks because operators can use detailed instructions describing what is to be done and how to do it. However, the use of procedures is a double-edged sword. That is, since procedures strongly govern the physical as well as the cognitive behavior of human operators, it is expected that the performance of human operators is largely attributable to the complexity of procedures. Actually, the existing literature supports this expectation, since the performance of human operators seems to be predictable when they use procedures (Stassen et al. 1990, Johannsen et al. 1994).
Figure 1. The overall structure of TACOM measure.

In this regard, Park et al. developed a measure called TACOM that can quantify the complexity of emergency tasks stipulated in the EOPs of NPPs (Park and Jung 2007a). Fig. 1 briefly depicts the overall structure of the TACOM measure.
As shown in Fig. 1, the TACOM measure consists of five sub-measures that represent five kinds of factors making the performance of procedures complicated. Detailed explanations about these complexity factors are provided in the references (Jung et al. 1999).

2.2 Validating TACOM measure

If the complexity of tasks stipulated in procedures can be properly quantified by the TACOM measure, then it is natural to assume that ''the performance of human operators can be soundly explained by the associated TACOM scores.'' Accordingly, in order to validate the appropriateness of the TACOM measure, it is crucial to elucidate what kinds of human performance data should be compared with TACOM scores.
In this regard, two kinds of human performance data, response time data and OPAS (Operator Performance Assessment System) scores, were compared with the associated TACOM scores. As a result, response time data as well as OPAS scores showed significant correlations with the associated TACOM scores (Park and Jung 2007a, Park and Jung 2007b).

2.3 The revision of TACOM measure

As illustrated in Fig. 1, the TACOM measure quantifies the complexity of a given task using the Euclidean norm of five sub-measures. This is based on the assumption that ''all five complexity factors are mutually independent.'' Unfortunately, in the course of comparing response time data with the associated TACOM scores, a clue indicating that the independence assumption could be doubtful was observed from the analysis of correlation coefficients among the five sub-measures. That is, three variables (SSC, AHC and EDC) seem to be mutually dependent because they have relatively strong correlations (Park and Jung 2007c). In statistics, this problem is known as multicollinearity (Cohen et al. 2003).
Accordingly, it is indispensable to guard against the possibility of a multicollinearity problem among the five sub-measures. In order to unravel this problem, creating a multi-item scale was considered, because this approach has been frequently applied to the treatment of psychological and sociological data. In creating a new scale, the scores of two or more variables are summed and/or averaged to form a single scale that represents the characteristics of the included variables. However, without a systematic framework that can provide a theoretical basis, it is likely that an appropriate scale cannot be created. For this reason, the theory of task complexity is revisited.
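As a rough numerical sketch of the two quantification structures (the flat Euclidean norm and the regrouped variant introduced in the next section), the following may help. Note that SSC, AHC and EDC are named in the text, but the remaining acronyms, the scores, and the grouping into dimensions are all assumptions made for this sketch, not values from the paper:

```python
import math

# Illustrative scores for the five sub-measures of one task. SSC, AHC and
# EDC are mentioned in the text; the other names and all values are invented.
s = {"SIC": 2.1, "SLC": 1.7, "SSC": 3.0, "AHC": 2.4, "EDC": 2.8}

# TACOM: Euclidean norm of the five sub-measures with equal weights
# (alpha = beta = gamma = epsilon = delta = 0.2, as used in Section 4).
tacom = math.sqrt(sum(0.2 * v ** 2 for v in s.values()))

# R-TACOM: the same sub-measures regrouped into three dimensions
# (outer weights 1/3, inner weights 0.5, as quoted for Fig. 3). The
# pairing below is an assumed grouping, not the paper's actual mapping.
dim_a = math.sqrt(0.5 * s["SSC"] ** 2 + 0.5 * s["AHC"] ** 2)
dim_b = math.sqrt(0.5 * s["SLC"] ** 2 + 0.5 * s["EDC"] ** 2)
dim_c = s["SIC"]
r_tacom = math.sqrt((dim_a ** 2 + dim_b ** 2 + dim_c ** 2) / 3.0)
```

Because strongly correlated sub-measures are averaged inside a dimension before the outer norm is taken, their shared variance is counted once rather than several times, which is how the regrouping mitigates the multicollinearity problem.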
3 R-TACOM MEASURE

On the basis of the existing literature, Harvey suggested a generalized task complexity model that consists of three complexity dimensions, as depicted in Fig. 2 (Harvey & Koubek 2000).

Figure 2. Generalized task complexity model by Harvey.

In Fig. 2, it should be emphasized that this complexity model provides three orthogonal dimensions that affect the complexity of tasks. In other words, although many researchers have identified various kinds of dominant factors that can make the performance of tasks complicated, a model that provides the overall structure as well as the dependency among task complexity factors (e.g., the three orthogonal dimensions) seems to be very rare. In this regard, it is expected that this complexity model can be used as a technical basis to resolve the multicollinearity problem of the TACOM measure.
In the light of this expectation, the five sub-measures were reorganized along the definition of the three dimensions included in the generalized task complexity model. Fig. 3 shows the reorganized structure of the TACOM measure (i.e., the R-TACOM measure) (Park and Jung 2007c).

Figure 3. The overall structure of R-TACOM measure.

4 COMPARING RESPONSE TIME DATA WITH THE ASSOCIATED R-TACOM SCORE

In order to investigate the appropriateness of the R-TACOM measure, two sets of response time data collected from nuclear power plants (NPPs) were compared with the associated R-TACOM scores. In the case of emergency tasks included in the emergency operating procedures (EOPs) of NPPs, a task performance time can be defined as the elapsed time from the commencement of a given task to its accomplishment. Regarding this, averaged task performance time data about 18 emergency tasks were extracted from the emergency training sessions of the reference NPP (plant 1) (Park et al. 2005). In total, 23 simulations were conducted under steam generator tube rupture (SGTR) conditions.
Similarly, averaged task performance time data about 12 emergency tasks under SGTR conditions were extracted from six emergency training sessions of another reference NPP (plant 2). It is to be noted that, although the nature of the simulated scenario is very similar, the emergency operating procedures of the two NPPs are quite different.

Figure 4. Comparing two sets of response time data with the associated R-TACOM scores.

Fig. 4 represents the result of comparisons between averaged task performance time data and the associated TACOM as well as R-TACOM scores. For the sake of convenience, equal weights were used to quantify complexity scores (α = β = γ = ε = δ = 0.2
in Fig. 1, and α = β = γ = 1/3, α1 = α2 = β1 = β2 = 0.5 in Fig. 3).
In Fig. 4, the capability of the R-TACOM measure in evaluating the complexity of tasks is improved, because the value of R² for the R-TACOM measure is higher than that for the TACOM measure. This strongly implies that the R-TACOM measure can be used to standardize the performance of human operators who carry out emergency tasks. If we adopt this expectation, then the R-TACOM measure can be regarded as a good starting point to explain the difference of human performance due to operating culture. Actually, it is expected that the difference of task performance time data in Fig. 4 (i.e., Plant 1 data are higher with similar R-TACOM scores) may come from a different operating culture.

5 GENERAL CONCLUSION

In this study, the appropriateness of the R-TACOM measure, based on the generalized task complexity model, was investigated by comparing two sets of averaged task performance time data with the associated R-TACOM scores. As a result, it was observed that response time data obtained when human operators accomplished their tasks using different procedures consistently increase in proportion to the increase of R-TACOM scores. In other words, even though human operators used different procedures, it is expected that the performance of human operators would be similar if the complexity of the tasks they are faced with is similar.
Therefore, although more detailed studies are indispensable to confirm the appropriateness of the R-TACOM measure, the following conclusion can be drawn based on the foregoing discussions: ''the R-TACOM measure is a proper measure in quantifying the complexity of tasks prescribed in procedures.''

REFERENCES

AIChE. 1994. Guidelines for preventing human error in process safety. Center for Chemical Process Safety of the American Institute of Chemical Engineers.
Cohen, J., Cohen, P., West, S.G. and Aiken, L.S. 2003. Applied multiple regression/correlation analysis for the behavioral sciences. Lawrence Erlbaum Associates, Third Edition.
Ghosh, S.T. and Apostolakis, G.E. 2005. Organizational contributions to nuclear power plant safety. Nuclear Engineering and Technology, vol. 37, no. 3, pp. 207–220.
Guimaraes, T., Martensson, N., Stahre, J. and Igbaria, M. 1999. Empirically testing the impact of manufacturing system complexity on performance. International Journal of Operations and Production Management, vol. 19, no. 12, pp. 1254–1269.
Harvey, C.M. and Koubek, R.J. 2000. Cognitive, social, and environmental attributes of distributed engineering collaboration: A review and proposed model of collaboration. Human Factors and Ergonomics in Manufacturing, vol. 10, no. 4, pp. 369–393.
Hollnagel, E. 2005. Human reliability assessment in context. Nuclear Engineering and Technology, vol. 37, no. 2, pp. 159–166.
IAEA. 1985. Developments in the preparation of operating procedures for emergency conditions of nuclear power plants. International Atomic Energy Agency, IAEA-TECDOC-341.
Johannsen, G., Levis, A.H. and Stassen, H.G. 1994. Theoretical problems in man-machine systems and their experimental validation. Automatica, vol. 30, no. 2, pp. 217–231.
Jung, W., Kim, J., Ha, J. and Yoon, W. 1999. Comparative evaluation of three cognitive error analysis methods through an application to accident management tasks in NPPs. Journal of the Korean Nuclear Society, vol. 31, no. 6, pp. 8–22.
Melamed, S., Fried, Y. and Froom, P. 2001. The interactive effect of chronic exposure to noise and job complexity on changes in blood pressure and job satisfaction: A longitudinal study of industrial employees. Journal of Occupational Health and Psychology, vol. 5, no. 3, pp. 182–195.
O'Hara, J.M., Higgins, J.C., Stubler, W.F. and Kramer, J. 2000. Computer-based Procedure Systems: Technical Basis and Human Factors Review Guidance. US Nuclear Regulatory Commission, NUREG/CR-6634.
Park, J. and Jung, W. 2007a. A study on the development of a task complexity measure for emergency operating procedures of nuclear power plants. Reliability Engineering and System Safety, vol. 92, no. 8, pp. 1102–1116.
Park, J. and Jung, W. 2007b. The appropriateness of TACOM for a task complexity measure for emergency operating procedures of nuclear power plants—a comparison with OPAS scores. Annals of Nuclear Energy, vol. 34, no. 8, pp. 670–678.
Park, J. and Jung, W. 2007c. A study on the revision of the TACOM measure. IEEE Transactions on Nuclear Science, vol. 54, no. 6, pp. 2666–2676.
Park, J., Jung, W., Kim, J. and Ha, J. 2005. Analysis of human performance observed under simulated emergencies of nuclear power plants. Korea Atomic Energy Research Institute, KAERI/TR-2895/2005.
Stassen, H.G., Johannsen, G. and Moray, N. 1990. Internal representation, internal model, human performance model and mental workload. Automatica, vol. 26, no. 4, pp. 811–820.
Topi, H., Valacich, J.S. and Hoffer, J.A. 2005. The effects of task complexity and time availability limitations on human performance in database query tasks. International Journal of Human-Computer Studies, vol. 62, pp. 349–379.
Yoshikawa, H. 2005. Human-machine interaction in nuclear power plants. Nuclear Engineering and Technology, vol. 37, no. 2, pp. 151–158.
An evaluation of the Enhanced Bayesian THERP method using simulator data

Kent Bladh
Vattenfall Power Consultant, Malmö, Sweden

Jan-Erik Holmberg
VTT (Technical Research Centre of Finland), Espoo, Finland

Pekka Pyy
Teollisuuden Voima Oy, Helsinki, Finland

ABSTRACT: The Enhanced Bayesian THERP (Technique for Human Reliability Analysis) method has been successfully used in real PSA studies at Finnish and Swedish NPPs. The method offers a systematic approach to qualitatively and quantitatively analyze operator actions. In order to better understand its characteristics from a more international perspective, it has been evaluated within the framework of the ''HRA Methods Empirical Study Using Simulator Data.'' This paper gives a brief overview of the method together with major findings from the evaluation work, including identified strengths and potential weaknesses of the method. A number of possible improvement areas have been identified and will be considered in future development of the method.

1 INTRODUCTION

1.1 HRA as part of PSA

The modeling and quantification of human interactions is widely acknowledged as a challenging task of probabilistic safety assessment (PSA). Methods for human reliability analysis (HRA) are based on a systematic task analysis combined with a human error probability quantification method. The quantification typically relies on expert judgments, which have rarely been validated by statistical data.

1.2 International study

In order to compare different HRA methods, an international study, ''HRA Methods Empirical Study Using Simulator Data,'' has been initiated using actual simulator data as reference for the comparison (Lois et al. 2007, Dang et al. 2008). The overall goal of the international HRA method evaluation study is to develop an empirically-based understanding of the performance, strengths, and weaknesses of the HRA methods. It is expected that the results of this work will provide the technical basis for the development of improved HRA guidance and, if necessary, improved HRA methods. As a first step in the overall HRA method evaluation study, a pilot study was conducted to obtain initial data and help establish a methodology for assessing HRA methods using simulator data.

1.3 Scope

This paper presents an evaluation of the enhanced Bayesian THERP in the pilot phase of the ''HRA Methods Empirical Study Using Simulator Data.'' Major findings from this evaluation are summarized. Presentation of the outcomes of the international study is outside the scope of this paper.

2 PILOT STUDY SET-UP

2.1 Overview

The pilot study is based on a set of simulator experiments run in the Halden Reactor Project's HAMMLAB (HAlden huMan-Machine LABoratory) simulator facility. Fourteen operating crews from an operating nuclear power plant (a pressurized water reactor) participated in a series of performance shaping factor/masking experiments. Without knowledge of the crews' performances, several HRA analysis teams from different countries, using different methods, performed predictive analyses of the scenarios.
2.2 Scenarios

In the pilot study, two variants of a steam generator tube rupture (SGTR) scenario were analyzed: 1) a basic case, i.e., a familiar/routinely practiced case, and 2) a more challenging, so-called complex case. In the complex case, the SGTR was masked by a simultaneous steamline break and a failure of all secondary radiation indications/alarms. It could be expected that operators have difficulties in diagnosing the SGTR. The event sequence involves several operator actions, but this paper is restricted to the first significant operator action of the scenarios, i.e., isolation of the ruptured steam generator (SG).

2.3 HRA analysis teams

In order to facilitate the human performance predictions, the organizers of the experiment prepared an extensive information package for the HRA analysis teams, including descriptions of the scenarios, a description of the simulator and its man-machine interface, differences between the simulator and the home plant of the crews, procedures used in the simulator, and a characterization of the crews, their work practices and training. The task of the HRA analysis teams was to predict failure probabilities of the defined operator actions, e.g., isolation of the ruptured steam generator, and to qualitatively assess which PSFs affect positively or negatively the success or failure of the crew. The members of the Enhanced Bayesian THERP team included the authors of this paper.

2.4 Time criterion

On the empirical side, time was used as the criterion for defining success/failure of the crew performance. In the SG isolation case, the available time for the operator action was considered from the radiological consequence point of view, not from a core damage point of view. In order to avoid opening of a SG pressure relief valve, the crew should isolate the SG before overfilling it.

2.5 Performance shaping factors

The empirical identification of PSFs was based on a detailed analysis of simulator performances. Analysts viewed the video and transcribed key communications and events, and also used additional data sources, such as crew interviews, a crew PSF questionnaire, and observer comments. Finally, the analyst summarized the observed episode in the form of an operational story, highlighting performance characteristics, drivers, and key problems. A specific method was used to rate the PSFs (Lois et al. 2007).

3 ENHANCED BAYESIAN THERP

The Enhanced Bayesian THERP (Technique for Human Reliability Analysis) method is based on the use of the time-reliability curve introduced in Swain's human reliability analysis (HRA) handbook (Swain & Guttmann 1983) and on the adjustment of the time-dependent human error probabilities with performance shaping factors (PSFs) (Pyy & Himanen 1996). The method is divided into a qualitative and a quantitative analysis part.

3.1 Qualitative analysis

The qualitative analysis consists of a modelling of the scenario with a block diagram and a description of the basic information of each operator action. The purpose of the block diagram is to define the operator actions in relation to relevant process events. The block diagram representation is close to a PSA event tree but is usually a somewhat more detailed model than an event tree. The block diagram can also be used to present the dependencies between operator actions belonging to the same scenario. The purpose of the description of the basic information of each operator action is to consistently characterize the main aspects of an operator action, e.g., initiating event, scenario, time windows, support from procedures and MMI, practical maneuvers needed in the action, and other noteworthy information.
The blocks used in the diagram should have an exact correspondence with functional events (in event trees or system fault trees) of the PSA model. This is important in cases where operator action basic events are modeled in system fault trees, so that the link to event tree branches is not obvious. In this study, the operator actions were given, so the construction of the block diagrams did not serve to define the operator action basic events.

3.2 Quantitative analysis

The human error probability is derived using the time-dependent human error probability model as follows:

p(t) = min{1, p0(t) · K1 · K2 · K3 · K4 · K5},   (1)

where p0(t) is the basic human error probability taken from Swain & Guttmann 1983 (see Figure 2), t is the time available for identification and decision making, and K1, . . . , K5 are the performance shaping factors. The min-function ensures that the final probability stays within the range 0 to 1.
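Formula (1) can be sketched directly in code. The base probability below is illustrative; in the method, p0(t) is read from the Swain time-reliability curve in Figure 2:

```python
import math

def human_error_probability(p0, psfs):
    """Formula (1): p(t) = min(1, p0(t) * K1 * ... * K5), where p0 is the
    basic HEP read from the time-reliability curve for the available time t
    and psfs holds the five performance shaping factors K1..K5."""
    return min(1.0, p0 * math.prod(psfs))

# A nominal situation (all K = 1) leaves the base probability unchanged,
# while strongly complicating factors are clamped to 1.0 by the min-function.
p_nominal = human_error_probability(7.3e-2, [1, 1, 1, 1, 1])
p_clamped = human_error_probability(0.5, [5, 5, 1, 1, 1])
```

The clamping matters in practice: as footnote 3 of Table 1 shows, the 95% percentile of the complex-case posterior reaches the 1.0 bound.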
Figure 1. Block diagram of the beginning of the SGTR basic scenario (timing annotations in the diagram: 5 min; 8–10 min to go through E-0 and enter E-3; 5 min for step 3 of E-3). Light grey boxes are operator actions, white boxes process events, and dark gray boxes end states of the event sequences. E-0 and E-3 are emergency operating procedures. Possibility of technical failures is not considered in this analysis (P = 0).

1E+0

1E-1

1E-2
Probability of failure

Swain 95%
1E-3
Swain Median
Swain 5%
1E-4 Base probability

1E-5

1E-6

1E-7
1 10 100 1000 10000
Time for identification and decision making t [min]

Figure 2. Time-dependent human error probability curve.

The time available for identification and decision making is shorter than the total time available for the operator action, ttot, which is assumed to be composed of three parts as follows (Pyy & Himanen 1996):

ttot = tind + t + tact,   (2)

where tind is the time for the first indication, t the time for identification and decision making, and tact the time for the action. The following performance shaping factors are used:

K1: Quality and relevance of procedures
K2: Quality and relevance of training
K3 : Quality and relevance of feedback from pro- 4 PREDICTIONS ACHIEVED BY USING
cess (MMI) THE METHOD
K4 : Mental load in the situation
K5 : Need for coordination and communication. 4.1 Human error probabilities
Table 1 summarizes numerical results from the HRA
Each performance shaping factor can receive a made using the Enhanced Bayesian THERP method.
value 1/5, 1/2, 1, 2 or 5. A level above 1 means that Time available for identification and decision making
the action has a complicating character compared to was taken from the information package submitted by
a ‘‘nominal’’ situation. Consequently, a level below 1 Halden. The prior human error probability is derived
means that the action is easier than the nominal case. from the Swain’s curve, see Figure 2. Four experts
Level ‘‘1’’ means that the factor plays no major role or assessed independently the performance shaping fac-
that this factor is in a nominal level. tors and the assessments were aggregated using the
The meaning of each value for each PSF is Bayesian procedure. Mean values (i.e., posterior mean
explained qualitatively in the method. For instance, values) are shown in Table 1.
regarding ‘‘Quality and relevance of procedures,’’ According to this analysis failure probability is in
K1 = 1/5 is interpreted as ‘‘Very good instructions, the base case 0,03 and in the complex case much higher
operators should not make any mistake,’’ K1 = 1/2 as 0,2. In the simulator experiments, 1 out of 14 crews
‘‘Good instructions, applicable for the situation and failed to isolate the SG within the critical time window
they support well the selection of correct actions,’’ in the base case, and in the complex case 7 out of
K1 = 1 as ‘‘Instructions play no major role in the sit- 14 crews failed. Numerically, the predictions and the
uation,’’ K1 = 2 as ‘‘Instructions are important but outcome are well in balance.
they are imperfect,’’ and K1 = 5 as ‘‘No instruc-
tions or misleading instructions, instructions would
be needed.’’ Explanations for the other PSFs are 4.2 Performance shaping factors
analogous. The values of the performance shaping factors can
The performance shaping factors will be given be interpreted so that in the base scenario the crew
independently by a number of experts, and these judg- should get good support from procedures, training
ments are consolidated with a Bayesian approach. and process feedback to identify the situation and to
In this approach, the performance shaping factors make correct decision in time. Mental load is some-
are assumed to be random variables following a multinomial probability distribution,

P(Ki = j | qj) = qj,   j = 1/5, 1/2, 1, 2, 5,        (3)

q1/5 + ... + q5 = 1.

The prior distribution for the parameters of the multinomial distribution is assumed to be a Dirichlet distribution. The convenient feature of the Dirichlet distribution is that, if we assume the expert judgments to be independent observations from a multinomial distribution, the posterior distribution is also Dirichlet and can be easily derived. The prior distribution is chosen by maximizing the entropy function; this distribution can be interpreted as representing maximal uncertainty. The mathematical procedure is presented in Holmberg & Pyy (2000).
Four experts participated in this exercise and made their assessments independently of each other, based on material obtained from Halden and processed by VTT (see e.g. the block diagrams and definitions for the operator actions). It should be observed that expert panels normally also include members from the operating crews at the actual plant, which was not possible in this experiment.

what higher than in a normal case, and there are also some coordination and communication needs related to the action. The experts commented in free text that ''good instructions, often training, clear indications, radiation alarm gives a clear indication of SGTR, scram and shortage of time are likely to increase mental load.'' These judgments were in accordance with the empirical PSF ratings, except maybe for the procedural guidance, where some difficulties were found empirically.
In the complex scenario, procedures are rated to provide good support, but the situation is now considered unfamiliar from the training point of view, and feedback from the process is considered poor or misleading. Mental stress is considered higher than in the base case. The experts commented in free text that ''good instructions, situation is unfamiliar, less trained, normal feedback missing, and mental load is high for various reasons.'' Empirically, it was judged that the procedures do not provide good support. Otherwise, predictions and observations were in line. The difference in the PSF judgments can be seen as an expert opinion issue, not as an HRA method issue.
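The conjugate Dirichlet–multinomial update behind equation (3) can be sketched in a few lines. This is a minimal illustration only: the uniform (maximum-entropy) Dirichlet prior with unit pseudo-counts and the four expert ratings are assumptions for the example, not values taken from the study.

```python
from collections import Counter

# PSF rating categories used in Eq. (3): j = 1/5, 1/2, 1, 2, 5.
CATEGORIES = (0.2, 0.5, 1.0, 2.0, 5.0)

def dirichlet_posterior(ratings, prior_alpha=1.0):
    """Update a Dirichlet prior with independent expert ratings.

    Each rating is one multinomial observation, so by conjugacy the
    posterior is Dirichlet with parameters equal to the prior
    pseudo-counts plus the observed category counts.
    """
    counts = Counter(ratings)  # missing categories count as zero
    return [prior_alpha + counts[c] for c in CATEGORIES]

def posterior_mean(alpha):
    """Posterior mean of each category probability q_j."""
    total = sum(alpha)
    return [a / total for a in alpha]

# Hypothetical ratings of one PSF by four independent experts.
alpha = dirichlet_posterior([0.5, 0.5, 1.0, 2.0])
print(posterior_mean(alpha))
```

With unit pseudo-counts the posterior mean of each category is simply (1 + count) / (5 + number of experts), which is why the update is described in the text as easily derived.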
Table 1. Predictions for operator failure to isolate the ruptured steam generator in time.

                                                                 Base      Complex
Time available for identification and decision making (1)        12 min    15 min
Prior human error probability                                    7.3E-2    4.9E-2
Performance shaping factors, mean values of the expert judgments (2)
  Quality and relevance of procedures, scale 0.2–5               0.6       0.7
  Quality and relevance of training, scale 0.2–5                 0.5       1.4
  Quality and relevance of feedback from process, scale 0.5–5    0.4       2.4
  Mental load in the situation, scale 1–5                        1.8       2.7
  Need for coordination and communication, scale 0.5–5           1.4       1.2
Posterior human error probability                                2.6E-2    1.7E-1
Uncertainty distribution (posterior)
  5%                                                             5.8E-4    9.9E-4
  50%                                                            7.3E-3    4.0E-2
  95%                                                            9.1E-2    1.0E+0 (3)

(1) The difference in time window between the base case and the complex case is that, in the complex case, the plant trip is actuated immediately at the beginning of the scenario, while in the base scenario an alarm is received first and the first indication of increasing level in the ruptured SG is received 3 min after the alarm.
(2) Interpretation of the scale: 0.2 = very good condition, 0.5 = good condition, 1 = normal, 2 = poor condition, 5 = very poor condition. Note that the full scale is not used for all PSFs.
(3) Due to the min-function in the human error probability model, see formula (1).
5 EVALUATION FINDINGS

5.1 Strengths

The enhanced Bayesian THERP method seems to be a cost-effective approach for this type of standard ''PSA operator action'', when the aim is to model and quantify operator actions for a PSA model.
As mentioned above, the predictions of the method are well in line with the outcomes of the simulator experiments. This is true for the quantification as well as for most of the PSFs.

5.2 Potential weaknesses

Assessment of the time window is critical in this method. It seems, however, that the time reliability curve is quite reasonable, at least for operator actions with a time window between 5–30 min. It should be noted that the time window in this case can be defined quite accurately, since the operator action is the first one of the event sequence. There is much more variability in the time windows for subsequent actions, which is a challenge for this type of HRA model.
Another critical point in the method is the interpretation of the performance shaping factors and their numerical rating. It is obvious that different experts will always interpret differently the explanations given for the scaling. As long as an expert is consistent in his/her judgments, values given for different operator actions can be compared. From the absolute level point of view, some calibration may be needed. So far, results from this pilot study did not clarify the actual need for calibration.

5.3 Possible improvement areas

A significant empirical observation was the variability between crews with regard to the affecting performance shaping factors, which means that PSFs are not only action dependent but also crew dependent. This variability is not explicitly accounted for in the enhanced Bayesian THERP method, even though the method produces a probability distribution for each PSF. These probability distributions, however, reflect the variability of the expert judgements, not the variability of the crews. Method development may be needed to account for the variability of the crews.
Experts should also be urged to justify their ratings. This is an essential way to collect insights, e.g. for improvements of the human factors.
Another finding was that the method could be complemented with a discussion phase after the expert judgements, where the experts could jointly comment on the results and draw conclusions from the assessments. This would facilitate the interpretation of the results, which is now based purely on interpretation of the numbers.

6 CONCLUSIONS

The experiment shows that the Enhanced Bayesian THERP method gives results in close match with
simulator data, at least within the experimental limitations.
Concerning the quantitative results, no significant deviations were identified.
For the negative PSFs, there was a difference common to both scenarios. While the Enhanced Bayesian THERP method predicted mental load/stress and deficient feedback as the important factors, the simulation focused more on procedural feedback and task complexity. The reasons behind this might be method related, but could also depend on limitations in the expert selection and/or differences in stress level between real operation and simulator runs.
The comparison of empirical observations with predictions was found to be a useful exercise to identify areas of improvement in the HRA method. An aspect not covered by the method is the variability between the crews with regard to the importance of different PSFs. Also, the explanations of the numerical scales for the PSFs could be improved to harmonize the way experts interpret the scales. In this way, empirical tests are necessary to validate an HRA method.
Otherwise, the evaluation gives confidence that the time reliability curve is a feasible and cost-effective method to estimate human error probabilities, at least when the time window is well defined and relatively short.

REFERENCES

Lois, E. et al. 2007. International HRA Empirical Study—Description of Overall Approach and First Pilot Results from Comparing HRA Methods to Simulator Data. Report HWR-844, OECD Halden Reactor Project, draft, limited distribution.
Dang, V.N. et al. 2008. Benchmarking HRA Methods Against Simulator Data—Design and Organization of the Halden Empirical Study. In: Proc. of the 9th International Conference on Probabilistic Safety Assessment and Management (PSAM 9), Hong Kong, China.
Swain, A.D. & Guttmann, H.E. 1983. Handbook of Human Reliability Analysis with Emphasis on Nuclear Power Plant Applications. NUREG/CR-1278, Sandia National Laboratories, Albuquerque, USA, 554 p.
Pyy, P. & Himanen, R. 1996. A Praxis Oriented Approach for Plant Specific Human Reliability Analysis—Finnish Experience from Olkiluoto NPP. In: Cacciabue, P.C. & Papazoglou, I.A. (eds.), Proc. of the Probabilistic Safety Assessment and Management '96 ESREL'96—PSAM III Conference, Crete. Springer Verlag, London, pp. 882–887.
Holmberg, J. & Pyy, P. 2000. An expert judgement based method for human reliability analysis of Forsmark 1 and 2 probabilistic safety assessment. In: Kondo, S. & Furuta, K. (eds.), Proc. of the 5th International Conference on Probabilistic Safety Assessment and Management (PSAM 5), Osaka, JP. Vol. 2/4. Universal Academy Press, Tokyo, pp. 797–802.
Comparing CESA-Q human reliability analysis with evidence from simulator: A first attempt

L. Podofillini
Paul Scherrer Institut—PSI, Villigen PSI, Switzerland

B. Reer
Swiss Federal Nuclear Safety Inspectorate—HSK, Villigen HSK, Switzerland
(until July 2007 with the Paul Scherrer Institut)
ABSTRACT: The International Human Reliability Analysis (HRA) Empirical Study started in late 2006, with the objective of assessing HRA methods by comparing their results with data. The focus of the initial phase is to establish the methodology in a pilot study. In the study, the outcomes predicted in the analyses of HRA teams are compared with the findings obtained in a specific set of simulator studies. This paper presents the results of one HRA analysis team and discusses how the predicted analysis compares to the observed outcomes from the simulator facility. The HRA method used is the quantification module of the Commission Errors Search and Assessment method (CESA-Q), developed within the HRA research project at the Paul Scherrer Institut (PSI). In this pilot phase, the main focus of the comparison is on the qualitative results: the method predictions, the scenario features and performance factors that would most contribute to failure (or support success). The CESA-Q predictions compare well with the simulator outcomes. This result, although preliminary, is encouraging since it gives a first indication of the soundness of the method and of its capability to produce well-founded insights for error reduction. Also, the comparison with empirical data provided input to improve the method, regarding the treatment of the time factor and of knowledge- and training-based decisions. The next phases of the HRA Empirical Study will also address the quantitative aspects of the HRA. It is planned to use further insights from the next phases to a) refine the CESA-Q guidance and b) evaluate the method to see whether additional factors need to be included.
1 INTRODUCTION

The aim of the Human Reliability Analysis (HRA) empirical study is to assess the quality of HRA methods based on simulator data (Dang et al., 2007; Lois et al., 2008).
Fourteen operating crews from a nuclear power plant participated in a series of scenarios in the Halden Reactor Project's HAMMLAB (HAlden huMan-Machine LABoratory) simulator facility in Halden, Norway. Without knowledge of the crews' performances, HRA analysis teams performed predictive analyses of the scenarios.
The first pilot phase of the project ran until the beginning of 2008, with the aim of establishing the methodology for comparing the predictions of the HRA teams with the experimental outcomes of the simulator runs. The comparison addressed two Human Failure Events (HFEs), out of the nine on which the HRA predictive analyses were made. The focus in this phase of the project is on the qualitative results of the HRA, i.e. their predictions: the identification of the scenario features and performance factors that would most contribute to the failures of these actions (or support success). The results of this first pilot phase will be published in a forthcoming report (Lois et al., 2008).
The next phases of the project will address the remaining HFEs (as well as additional HFEs) and the quantitative results of the HRA, the values of the human error probabilities (HEPs).
This paper presents the results of one HRA analysis team and discusses how the predicted analysis compares to the observed outcomes from the simulator facility. The focus is on the qualitative HRA results.
The HRA method used is the quantification module (CESA-Q) (Reer, 2006a) of the Commission Errors Search and Assessment method (CESA) (Reer et al., 2004; Reer & Dang, 2007).
The application of CESA-Q in this study has been exploratory. The method's development and previous applications have focused on errors of commission, while this study addresses errors of omission, i.e. the
non-performance of required actions. CESA-Q is still under development. Although preliminary, the evaluation of the method against empirical data has been very informative at this stage and provided insights (minor, but worthwhile) for further improvement.
There are still methodological aspects to be resolved on how to compare an HRA analysis with evidence from the simulator. Together with the HRA teams, the steering committee of the international HRA empirical study is currently working on resolving these.
The paper is organized as follows. Section 2 gives more details on the HRA empirical study. Section 3 presents CESA-Q as it has been applied in the study, describing also how the method was adjusted to address the EOOs. Section 4 describes the HFEs addressed in the comparison. The CESA-Q HRA of the considered HFE is presented in Section 5. Section 6 presents the comparison between the method predictions and the outcome from the simulator runs. Derived insights for improvement of CESA-Q are discussed in Section 7. Conclusions are given at the close.

2 THE HRA EMPIRICAL STUDY

The motivations for the Empirical Study are the differences in the scope, approach, and models underlying the diversity of established and more recent HRA methods (Dang et al., 2007; Lois et al., 2008). These differences have led to a significant interest in assessing the performance of HRA methods. As an initial step in this direction, this international study has been organized to examine the methods in light of data, aiming to develop an empirically-based understanding of their performance, strengths, and weaknesses. The focus of the study is to compare the findings obtained in a specific set of simulator studies with the outcomes predicted in HRA analyses.
Hosted by the OECD Halden Reactor Project, the Empirical Study has three major elements:

– predictive analyses where HRA methods are applied to analyze the human actions in a set of defined scenarios,
– the collection and analysis of data on the performance of a set of operator crews responding to these scenarios in a simulator facility (the Hammlab experimental simulator in Halden),
– and the comparison of the HRA results on predicted difficulties and driving factors with the difficulties and factors found in the observed performances.

The tasks performed in 2007 aimed a) to establish the methodology for the comparison, e.g. the protocols for interacting with the HRA analyst teams, the information exchanged, and the methods for the data analysis and comparison; and b) to test the comparison methodology with expert teams submitting predictive HRA analyses for evaluation against the data. In this way, initial results concerning the HRA methods as well as feedback on the comparison methodology itself were obtained.
In this phase of the Empirical Study, two Steam Generator Tube Rupture scenarios were defined, specifically a straightforward or ''base'' scenario and a more difficult or ''complex'' scenario. The base scenario includes four main operator actions while the complex scenario includes five operator actions.
The HRA analysis teams performed predictive analyses with their chosen HRA method on the basis of ''reference'' inputs (an information package) prepared by the assessment group. Further details on the overall study methodology are presented in (Dang et al., 2007; Lois et al., 2008) and in other related papers from this conference.
In the pilot phase, the HRA analysis teams analyzed the failure of the nine actions or ''human failure events''. The qualitative results of their analyses identify the scenario features and performance factors that would most contribute to the failures of these actions (or support success); these make up the team's predicted outcomes. Their quantitative results are the estimated human error probabilities. On the empirical side, the Halden staff and the Study's assessment group analyzed the data collected on the crews' performance in these scenarios to identify the scenario features and factors that were observed or inferred to cause difficulties for the operators, leading to delays in completing the actions or to failures to complete the actions in time. At the qualitative level, the predicted features and factors from each HRA analysis of an action are compared with the features and factors observed in the data for that action.
In view of the ambitious schedule and the desire to obtain feedback from the HRA analyst teams early in the study, the comparison in this first pilot phase was limited to two actions defined in the scenarios: the identification and isolation of the faulted steam generator in the base and complex scenarios respectively. For each action, the prediction results are compared with the results from the observations.
The simulator studies with the operator crews were carried out in late 2006. To avoid biasing the comparison, the assessment group and HRA teams were not provided information of any kind on the simulator observations until after the review of the HRA team submissions was completed.
The second pilot phase, planned for 2008, will address the remaining actions in the two SGTR scenarios and include the comparison in quantitative terms; in other words, it will address how well the HEPs estimated by an HRA method correlate with the level of difficulty observed in the empirical data for these actions.
3 THE CESA-Q METHOD AS APPLIED IN THE HRA EMPIRICAL STUDY

The initial development of PSI's CESA method focused on the identification and prioritization of aggravating operator actions in post-initiator scenarios, i.e. errors of commission (EOCs). CESA is described in (Dang et al., 2002; Reer et al., 2004; Reer & Dang, 2007). Subsequently, work was started to address quantification, again focusing on EOCs, resulting in an outline of a method for EOC quantification: CESA-Q (Reer, 2006a). The quantification module CESA-Q was applied in the HRA empirical study. Identification and prioritization, addressed by CESA, were not relevant for the empirical study since the HFEs were given as part of the study design.
Some features of CESA-Q are as follows (refer to Reer (2006a) for a complete description of the method). The EOC is analyzed in terms of the plant- and scenario-specific factors that may motivate inappropriate decisions. The focus of CESA-Q is therefore on the analysis of inappropriate decision-making. As discussed later, this focus makes the application of CESA-Q to EOOs plausible. Two groups of factors are introduced: situational factors, which identify EOC-motivating contexts, and adjustment factors, which refine the analysis of EOCs to estimate how strong the motivating context is. A reliability index is introduced which represents the overall belief of the analyst regarding the positive or negative effects on the EOC probability (defined from 0 for strongly ''error-forcing'' contexts to 9 for contexts with very low EOC probabilities). Quantification is done by comparing the pattern of the factor evaluations with the patterns of catalogued reference EOCs (identified from 26 operational events, previously analyzed qualitatively and quantitatively in Reer and Dang (2006) and Reer (2006b)).
The HFEs addressed in the HRA empirical study are EOOs, while the development of CESA-Q and its previous applications have focused on EOCs. Therefore, the application of CESA-Q has been explorative. In particular, the EOOs (i.e. the non-performance of the required action) were interpreted as the final result of inappropriate actions or decisions made by the crew while proceeding along their path to the required action, with the result that the required action is not performed. Therefore, CESA-Q could be used to analyze these inappropriate decisions.
The HFE analysis by CESA-Q was based on the identification of the decision points encountered by the operators while proceeding with their response. Then, the decision points were analyzed to determine if inappropriate decisions made by the crew at the decision points could result in the required action not being performed within the required time window (i.e. in the EOO). Table 1 gives details on all steps of the CESA-Q analysis.
For example, the decision by the crew to transfer to an inappropriate procedure (where the required action is not instructed) may result in the required action not being performed or, if the decision is revised too late, not being performed in the available time.

Table 1. Steps of the CESA-Q analysis as applied in the HRA empirical study.

Step 1: List the decision points that introduce options for deviating from the path (or set of paths) leading to the appropriate response, and select relevant deviations contributing to the human failure event (HFE) in question.
Step 2: For each decision point, evaluate whether a situational factor (Reer, 2006a) motivates the inappropriate response (deviation from the success path). If not, proceed with step 3; if yes, proceed with step 4.
Step 3: If no situational factor applies, estimate a reliability index (i) in the range from 5 to 9. Guidance is provided in (Reer, 2007). Proceed with step 6.
Step 4: For the EFC case, evaluate the adjustment factors which mediate the impact. Refer to the guidance presented in (Reer, 2006a).
Step 5: For the EFC case, estimate a reliability index (i) in the range from 0 to 4. Refer to the reference cases (from operational events) summarized in (Reer, 2006a).
Step 6: a) Assign an HEP (pF|i value) to each decision point (CESA-Q associates each reliability index with an HEP value, Reer 2006a), and b) determine the overall HEP (as the Boolean sum of the individual HEPs).
Step 7: Evaluate recovery (the option for returning to the correct response path) for the most likely error assigned in step 6a. Apply the recovery HEP assessment guidance in Reer (2007).

4 HFE SCENARIO DESCRIPTION AND SUCCESS CRITERION

The results reported in this paper relate to one of the two HFEs addressed in the first pilot phase: HFE #1B, the failure to identify and isolate the ruptured SG in the complex SGTR scenario.
The complex SGTR scenario is designed such that important cues for the diagnosis of the SGTR event, which is needed for HFE #1B, are missing. In the HAMMLAB simulator, this was accomplished by defining hardware failures additional to the SGTR. In particular, a main steam line break, SLB, precedes the
SGTR and leads to closure of the main steam isolation valves, with the consequence (in the reference plant) that most of the secondary radiation indications read 'normal'. The remaining available secondary radiation indication is also failed as part of the scenario design.
The Emergency Operating Procedures (EOPs) guiding the crews are of the Westinghouse type. The immediate response is guided by procedure E-0. The response to the SGTR event is guided by procedure E-3. There are a number of opportunities to enter E-3:

– step 19 of E-0, with instructions based on secondary radiation indications;
– first transfer to ES-1.1 (''Safety injection termination'') at step 21 of E-0 and then transfer to E-3 based on the ES-1.1 fold-out page, i.e. on the check in the fold-out of secondary radiation conditions and of whether the level in any 'intact SG' is increasing in an uncontrolled manner;
– step 24b in E-0, based on whether the level in SG #1 cannot be desirably controlled; or
– step 25a in E-0, based on secondary radiation indications.

The challenge connected with HFE #1B is the lack of secondary radiation cues, which would be expected in case of a SGTR and on which a number of the procedural transfers to the applicable EOP E-3 are based. Indeed, the applicable EOPs for the reference plant rely strongly on radiation indications. It should be noted that this may not be the case in other plants. Therefore the operators have to diagnose the SGTR event based only on the level indications in the ruptured SG.
Success in performing #1B requires that the crew:

– enters procedure E-3 (based on the various opportunities to enter E-3), and
– has closed/isolated all steam outlet paths from the ruptured SG (SG #1), and
– has stopped all feed to the ruptured SG as long as the ruptured SG level is at least 10% as indicated on the narrow range SG level indications (to ensure the SG U-tubes will remain covered), and
– performs the above within 25 minutes of the steamline break occurring (which is the start of the event); a slower response than expected/desired constitutes ''failure''.

Table 2. Critical decision points (selected) identified in the CESA-Q analysis of HFE #1B.

Decision point #1B.1 (E-0, step 1).
Inappropriate decision: Operators erroneously transfer from E-0 step 1 to procedure FR-S.1 ''Response to nuclear power generation/ATWS''.
Consequence: Potential delay (not within 25 min) to accomplish #1B. If the error is not recovered, operators could go as far as to inject boron, but this action has consequences on the ''safe side''.

Decision point #1B.4 (E-0, step 16).
Inappropriate decision: Operators erroneously transfer from E-0, step 16 to E-1 ''Loss of reactor or secondary coolant''.
Consequence: Most likely consequence is potential delay (not within 25 min) to accomplish #1B. In addition, depending on how far operators go into E-1, they may risk overfeeding the RCS or the SG.

Decision point #1B.6 (E-0, step 18).
Inappropriate decision: Operators erroneously transfer from E-0, step 18 to E-2 ''Isolation of faulted SG''.
Consequence: Potential delay (not within 25 min) to accomplish #1B.

Decision point #1B.7 (E-0, step 21, and fold-out page of ES-1.1).
Inappropriate decision: Operators erroneously do not transfer to E-3 (transferring from E-0, step 21 to ES-1.1 first and then from the fold-out page of ES-1.1 to E-3).
Consequence: Operators stay in ES-1.1 and become involved in the steps required to control SI. Potential delay to accomplish #1B, i.e. not within 25 min.

Decision point #1B.8 (E-3, steps 2 and 3).
Inappropriate decision: Operators fail to identify and isolate the ruptured SG (E-3, step 3).
Consequence: Primary to secondary leak is not controlled.

Decision point #1B.12 (E-3, step 3b).
Inappropriate decision: Operators isolate the PORV of the ruptured SG when the SG pressure is erroneously perceived to be below 70.7 bar.
Consequence: Isolation of the SG A PORV may lead to challenge of the SG A safety valves, which may stick open and thus disable isolation of SG A. It is assumed here that the SG safety valves are not qualified for water discharge and thus certainly fail under SG fill-up conditions.

5 THE CESA-Q ANALYSIS OF HFE #1B

This Section reports the qualitative aspects of the HRA. As mentioned, these are the aspects on which the
current phase of the HRA empirical study is focused. The steps of the analysis related to quantification have been skipped in the following description.

5.1.1 Step 1—list decision points
Procedures were analyzed in order to identify the decision points that may contribute to HFE #1B. 13 decision points were found, identified as #1B.1, #1B.2, . . . , #1B.13. Selected decision points are reported in Table 2. The consequence of the inappropriate decisions can be a delay in performing #1B (compared to the time window of 25 minutes) as well as an aggravation of the plant condition.

5.1.2 Step 2—evaluate the situational factors
The 13 decision points were analyzed in terms of the four CESA-Q situational factors (Reer, 2006a), which may motivate an inappropriate response. Potential error-forcing conditions (EFCs) were identified at two decision points, #1B.6 and #1B.7 (Table 2), the analyses of which are reported in Table 3 and Table 4, respectively.

Table 3. Analysis of the situational factors for decision point #1B.6 at step 18 of E-0—Operators erroneously transfer to procedure E-2 (instructing isolation of faulted SG).

Misleading indication or instruction (MI): No. Indications are not misleading: pressure in the SG is not decreasing in an uncontrollable manner as required in step 18b for transfer to E-2. Procedures do not support transfer to E-2 in this scenario.
Adverse exception (AE): No. Transfer to E-2 is inadequate for this scenario. It is not made inadequate by an exceptional condition.
Adverse distraction (AD): Yes. There is an initial drop in all SG pressures due to the main SLB. Although pressure recovers very fast upon main SL isolation, operators may fix on this initial cue, quickly reach step 18 and enter E-2.
Risky incentive (RI): No. There is no credible reason to transfer to E-2 in order to follow conflicting goals.

Table 4. Analysis of the situational factors for decision point #1B.7—Operators do not transfer to E-3 (first from E-0, step 21 to ES-1.1 and then from the fold-out page of ES-1.1 to E-3).

Misleading indication or instruction (MI): Yes. Cues on high secondary radiation levels are missing. This is a consequence of the exceptional condition of the SLB and SGTR combined.
Adverse exception (AE): Yes. The exceptional condition of the combined SLB and SGTR results in the loss of important cues: high secondary radiation levels. These are the first cues mentioned in E-0 as indications of a SGTR (E-0, step 19). Cues related to differences in the SG levels come up later than expected for a ''normal'' SGTR. Under this error-forcing condition, the operators are expected to enter ES-1.1. At this point it is possible that they get involved in terminating SI and overlook the transfer to E-3 in the fold-out page of ES-1.1.
Adverse distraction (AD): Yes. The lack of relevant cues (high radiation levels in the secondary) is a distraction for the operators.
Risky incentive (RI): No. Operators would not miss the transfer to E-3 in order to follow conflicting goals.

5.1.3 Step 4—evaluate adjustment factors
The decision points #1B.6 and #1B.7, for which error-forcing conditions were found, were further analyzed in terms of the adjustment factors (Reer, 2006a). Table 5 reports the analysis for decision point #1B.7, which, according to the analysis, dominates #1B.6. Note that dominating decision points result from the quantitative analysis, which is not reported in this paper.

5.1.4 Step 7—evaluate recovery
Recovery analysis is carried out for the most likely decision errors identified in the previous step 6, i.e. #1B.6 (E-0, step 18) and #1B.7 (E-3, step 2, step 3); see Table 6.

6 COMPARISON OF THE CESA-Q ANALYSIS TO OPERATING CREW DATA

The CESA-Q analysis of HFE #1B was submitted by the HRA team to the assessment and comparison group. The assessment and comparison group compared the HRA predictions to the simulator outcomes. This Section highlights some points of the comparison, which were used to derive the insights discussed in Section 7. The detailed comparison will be published in the forthcoming study report (Lois et al., 2008).
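The quantification that feeds such comparisons (steps 6a–6b of Table 1) assigns an HEP to each decision point via its reliability index and combines them as a Boolean sum. A minimal sketch follows, assuming independence between decision points; the index-to-HEP mapping used here is purely illustrative and is not the calibrated CESA-Q mapping from Reer (2006a).

```python
# Illustrative mapping from reliability index i (0 = strongly
# error-forcing ... 9 = very low error probability) to an HEP value
# pF|i; the calibrated CESA-Q values are given in Reer (2006a).
HEP_BY_INDEX = {i: 10 ** (-(i + 2) / 2) for i in range(10)}

def overall_hep(indices):
    """Overall HEP as the Boolean (OR) sum of the decision-point HEPs:
    the probability that at least one decision error occurs, assuming
    the decision points fail independently."""
    p_no_error = 1.0
    for i in indices:
        p_no_error *= 1.0 - HEP_BY_INDEX[i]
    return 1.0 - p_no_error

# Hypothetical analysis: two error-forcing decision points (indices 3
# and 4) dominate two benign ones (indices 7 and 8).
print(overall_hep([3, 4, 7, 8]))
```

For small individual HEPs the Boolean sum is close to the arithmetic sum, which is why, in an analysis like the one above, the one or two error-forcing decision points dominate the result.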
Table 5. Analysis of the adjustment factors for decision point #1B.7—Operators erroneously do not transfer to E-3 (transferring from E-0, step 21 to ES-1.1 first and then from the fold-out page of ES-1.1 to E-3).

Verification hint: 0.8 (slightly error-forcing). The fold-out page of ES-1.1 gives indication that the adequate response is to transfer to E-3. The rate of 0.8 has been given since it is not known how frequently the operators consult the fold-out page.
Verification means: 1 (success-forcing). Level indications are available and clearly visible.
Verification difficulty: 0.8 (slightly error-forcing). The cognitive requirement has been slightly increased, since the lack of radiation indications represents a deviation from the base case of trained rule application. In addition, the level indications are available with some delay compared to the expectation.
Verification effort: 1 (success-forcing). Negligible physical effort required for verification.
Time pressure: 0.5 (moderately forcing). The time for taking the decision error is around 10 minutes.
Benefit prospect: 1 (success-forcing). No particular benefit to stay in ES-1.1 in this scenario.
Damage potential: 0 (not success-forcing). No particular damage potential is implied by staying in ES-1.1.

Table 6. Recovery analysis for decision point #1B.7—Operators do not transfer to E-3 (first from E-0, step 21 to ES-1.1 and then from the fold-out page of ES-1.1 to E-3).

RTP—Recovery Timely Possible: Yes. A time reserve of 5 min is defined as a boundary condition for the HRA. This reserve is deemed as sufficient for returning to the appropriate path.
RCA—Recovery Cue Available: Yes. Step 24b in E-0 provides an alternative path from E-0 to E-3, with cues based on the SG level.
ST—Shortage of Time: Yes. The time reserve of 5 min (HRA boundary condition) indicates shortage of time.
MC—Masked Cue: Probably. Although the main indication is masked, the difference in the SG levels becomes progressively evident.

According to the CESA-Q predictions, the way the human failure would most likely develop is as follows. As the result of the error-forcing condition of missing radiation indications and (probably) delayed SG level indications, the operators are expected to pass the EOP step transferring to the SGTR EOP (E-0, step 19) and enter ES-1.1, as instructed by later steps of E-0 (step 21). At this point it is possible that they get involved in performing the necessary steps to terminate safety injection as instructed by ES-1.1 and overlook the transfer to E-3 in the fold-out page of ES-1.1. The decision to transfer to E-3 in this case is not straightforward due to the EFCs. However, it is expected that they would at some point transfer to E-3, realizing that the increasing level in one SG is a cue for SGTR (procedures have additional later transfers to E-3).
As a general statement, this prediction was well in line with the observations. Indeed, the dominant crew behaviors were as follows.

– Six crews entered ES-1.1, thus passing the transfer to the SGTR procedure at E-0 step 19. All crews eventually transferred to E-3, based on the instructions in the fold-out page in ES-1.1, or on their knowledge that increasing level in one SG provides a cue for SGTR (although half of the crews did not do so in the required time of 25 minutes).
– Five crews transferred directly from E-0 to E-3, without passing through ES-1.1. This transfer was knowledge-based as well.

The CESA-Q analysis predicted that the crews would have eventually managed to transfer to E-3 using the fold-out page of ES-1.1. Actually, in the observations, 2 crews did so, while many of the crews that entered ES-1.1 decided to transfer to E-3 from knowledge-based diagnosis. Indeed, as predicted by the CESA-Q analysis, there was no particular benefit to stay in ES-1.1 too long in this scenario.
The CESA-Q analysis predicted that some shortage of time may be experienced by the crews if they eventually enter ES-1.1 and become involved in the ES-1.1 procedural steps. This would result in a difficulty for the crews to meet the time requirement of 25 minutes. Indeed, the observations from the simulator confirmed that the time of 25 minutes available for the response was limited and this had an important impact on the crews' performance. Although all of the 14 crews managed to enter the SGTR procedure E-3 and

isolate the ruptured SG, 7 out of the 14 crews did not do so within the 25 minutes (but 13 crews did it within 35 minutes and the last one did it after 45 minutes). Compared to the simulator outcome, the CESA-Q analysis did not recognize that the situation would slow the crews' response to the point that as many as half of the crews would not meet the time requirement of 25 minutes (see success criteria in Section 4). The implication of this is discussed in Section 7.1.

However, it must be noted that the 25-minute criterion is not a typical PSA success criterion for identification and isolation of the ruptured SG in a SGTR event. A typical PSA success criterion would be to respond in time to avoid SG overfill or to avoid damage of the SG safety valves due to flooding of the steam line. Although 25 minutes is about the time after which it is expected that the SG level will reach 100% on the wide-range indicators, the operators are aware that some time is still left before the SG overfills. Indeed, in their response, the goal the operators have in mind is to avoid or limit overfill rather than to respond within 25 minutes (indeed, all the crews except one were late by at most 10 minutes).

7 INSIGHTS FOR IMPROVEMENT OF CESA-Q FROM THE COMPARISON

7.1 Influence of the time available on the HEP

CESA-Q treats the time factor by focusing on the effect of time pressure, intended as urgency to act, on the quality of decision-making, and accounts for shortage of time in decision error recovery. Time pressure impacts the ability of the operators to think straight about alternative decisions and to possibly revise an inappropriate decision. In CESA-Q, these aspects enter into the evaluations of the situational factors ‘‘Adverse distraction’’ and ‘‘Risky incentive’’, in the adjustment factor ‘‘Time pressure’’, and in the evaluation of the time available for recovery.

The evidence from the simulator showed that the failure in HFE #1B was influenced by the adequacy of time, among other factors, rather than by time pressure. The time frame of 25 minutes turned out to be too short to reach consensus (or for the crew leader to take enough initiative) to enter E-3 in the complex scenario. But there was no evidence that the crews' performance was negatively influenced by time pressure.

Therefore, it seems that the method, in its current version, does not give proper credit to the effect of ‘‘running out of time’’ while making correct decisions as guided by the procedure, which seems to be one of the drivers for HFE 1B. Indeed, CESA-Q was developed based on the analysis of catalogued reference EOCs (identified from 26 operational events in Reer & Dang, 2006), where the success of the operators entailed re-establishing safety functions and ‘‘running out of time’’ was not a concern.

It is planned to use the successive phases of the HRA empirical study to gather additional insights on how CESA-Q should address the adequacy-of-time factor and to provide guidance to the users as necessary.

7.2 Focus of CESA-Q on aspects of knowledge-based behavior required for EOC quantification

CESA-Q accounts for operator behavior based on knowledge (or training) in the following factors:

• in the evaluation of EFCs represented by the situational factors AD and RI. For example, concerning AD: the distraction caused by an indication not referred to in the nominal path through the procedures may suggest a response (inappropriate), followed by knowledge or training; and
• in the evaluation of adjustment factors (Table 5) and recovery (Table 6), e.g. regardless of the guidance in the procedure, an abnormal SG level indication may be credited as a hint to verify if an evaluation of operator training concludes that the SG level is in the focus of the operator's attention.

A thorough analysis of operational events involving EOCs has shown that account of these aspects of knowledge-based behavior is required for EOC quantification (Reer & Dang, 2006).

However, some of the experimental results from this HRA empirical study suggest that additional consideration of knowledge-based behavior may be required, especially for EOO quantification. The observation of the crews' behaviors has shown that, especially when the EOP guidance is not optimal, as in the HFE #1B case, knowledge-based as well as training-based decisions become important drivers for successful performance, and therefore for not committing the EOC or the EOO. Indeed, for many of the crews, the decision to transfer to E-3 was made based on their knowledge of the SGTR symptoms, after realizing that the EOPs were not conducting them to E-3.

Guidance may therefore be helpful on when the CESA-Q analysis should extend the consideration of knowledge-based and training-based decisions in the scope of the analysis. Indeed, it should take into account that how much the crews adhere to the procedures, or integrate them with knowledge-based and training-based decisions, may vary depending on plant-specific or country-specific work processes. Consequently, guidance should also be included on when the analysis should consider multiple success paths, for example based on the likelihood of the crews taking the different paths, and on the identification of the single, or possibly multiple, dominant path.
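The timing observations in Section 7.1 can be summarized numerically: for any candidate time criterion, the observed failure fraction is simply the share of crews exceeding it. The sketch below uses synthetic response times, chosen only to be consistent with the counts reported above (7 of 14 crews within 25 minutes, 13 within 35 minutes, the slowest at 45 minutes); the actual crew times are not published in this paper.

```python
# Synthetic crew response times (minutes), consistent with the reported
# counts: 7 crews within 25 min, 13 within 35 min, the slowest at 45 min.
crew_times = [18, 20, 21, 22, 23, 24, 25, 27, 28, 30, 32, 33, 35, 45]

def failure_fraction(times, criterion):
    """Fraction of crews exceeding a time criterion (a crude empirical failure rate)."""
    return sum(t > criterion for t in times) / len(times)

for criterion in (25, 35, 45):
    print(criterion, failure_fraction(crew_times, criterion))
```

At the 25-minute criterion this reproduces the reported failure fraction of one half; relaxing the criterion to 35 minutes drops it to 1/14, which illustrates why the choice of PSA success criterion matters so much for the resulting HEP.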

8 CONCLUSIONS

This paper has presented the results of the HRA performed by the CESA-Q team on one of the nine HFEs addressed by the HRA empirical study. The paper has discussed how the predicted analysis compares to the observed outcomes from the simulator facility.

It should be emphasized that the application of CESA-Q in this study is explorative: the method's development and previous applications have focused on errors of commission, while this study addresses errors of omission.

PSI's CESA-Q method performed well on the qualitative aspects of the exercise, i.e. how well the method predicted which elements of the actions may be challenging. This result, although preliminary, is encouraging since it gives a first indication of the solidity of the method and of its capability of producing founded insights for error reduction. These qualitative aspects were the main emphasis in this phase of the study; currently, the assessment group is planning to treat the more quantitative aspects in the next phase.

The empirical data, consisting of systematic observations of the performances of multiple crews on the same scenarios, have been useful in deriving insights on potential improvements of the CESA-Q method. For instance, in treating time, CESA-Q focuses on the effect of time pressure on the quality of decision-making and accounts for shortage of time in decision error recovery. It seems that the method, in its current version, does not give proper credit to the effect of ‘‘running out of time’’ while making correct decisions as guided by the procedure.

It should also be investigated whether guidance should be added to base the analysis on multiple expected response paths and to consider knowledge-based and training-based decisions in the definition of the expected response paths and of the critical decision points.

An aspect that makes CESA-Q well suited for comparison against simulator data is that it produces detailed descriptions of crews' behaviors, in the form of paths of response actions and critical decisions taken along the response. These paths and decisions could indeed be observed in the simulators. CESA and CESA-Q share this characteristic with some other recent HRA methods like EDF's MERMOS and US NRC's ATHEANA.

Finally, the study provides an opportunity not only to compare CESA's predictions with empirical data but also to compare HRA methods and their resulting analyses on the same set of actions. In particular, when performed in the context of empirical data, a method comparison has the added value that there is a shared basis (the data) for understanding the scope of each factor considered by a method and how the method treats these in detail.

The next steps planned for the HRA empirical study are the comparison of the remaining HFEs in the Steam Generator Tube Rupture scenarios and another comparative analysis on a Loss of Feedwater scenario. Generally, it is planned to use the insights to a) refine the CESA-Q guidance and b) evaluate the method to see whether additional factors need to be included. In this regard, empirical data provide invaluable input. Although an empirical model of performance needs to be based on far more than one scenario (two variants in this case), this data contributes to such a model. This should lead to improvements in CESA-Q as well as other HRA methods.

ACKNOWLEDGEMENTS

This work is funded by the Swiss Nuclear Safety Inspectorate (HSK), under DIS-Vertrag Nr. 82610. The views expressed in this article are solely those of the authors.

B. Reer contributed to this work mostly while he was with the Paul Scherrer Institut (since July 2007, he is with the HSK).

LIST OF ACRONYMS

CESA—Commission Errors Search and Assessment
CESA-Q—Quantification module of CESA
EFC—Error Forcing Condition
EOP—Emergency Operating Procedures
HAMMLAB—Halden huMan-Machine LABoratory
HFE—Human Failure Event
HRA—Human Reliability Analysis
MSLB—Main Steam Line Break
PSF—Performance Shaping Factor
PSA—Probabilistic Safety Assessment
PSI—Paul Scherrer Institut
SGTR—Steam Generator Tube Rupture

REFERENCES

Dang, V.N., Bye, A. 2007. Evaluating HRA Methods in Light of Simulator Findings: Study Overview and Issues for an Empirical Test. Proc. Man-Technology Organization Sessions, Enlarged Halden Programme Group (EHPG) Meeting, HPR-367, Vol. 1, paper C2.1, 11–16 March 2007, Storefjell, Norway.

Dang, V.N., Bye, A., Lois, E., Forester, J., Kolaczkowski, A.M., Braarud, P.Ø. 2007. An Empirical Study of HRA Methods—Overall Design and Issues. Proc. 2007 8th IEEE Conference on Human Factors and Power Plants (8th HFPP), Monterey, CA, USA, 26–31 Aug 2007, CD-ROM (ISBN: 978-1-4244-0306-6).

Dang, V.N., Reer, B., Hirschberg, S. 2002. Analyzing Errors of Commission: Identification and First Assessment for a

Swiss Plant. Building the New HRA: Errors of Commission—from Research to Application, NEA OECD report NEA/CSNI/R(2002)3, 105–116.

Forester, J., Kolaczkowski, A., Dang, V.N., Lois, E. 2007. Human Reliability Analysis (HRA) in the Context of HRA Testing with Empirical Data. Proc. 2007 8th IEEE Conference on Human Factors and Power Plants (8th HFPP), Monterey, CA, USA, 26–31 Aug 2007, CD-ROM (ISBN: 978-1-4244-0306-6).

Lois, E., Dang, V.N., Forester, J., Broberg, H., Massaiu, S., Hildebrandt, M., Braarud, P.Ø., Parry, G., Julius, J., Boring, R., Männistö, I., Bye, A. 2008. International HRA empirical study—description of overall approach and first pilot results from comparing HRA methods to simulator data. HWR-844, OECD Halden Reactor Project, Norway (forthcoming also as US NRC report).

Reer, B. 2006a. Outline of a Method for Quantifying Errors of Commission. Paul Scherrer Institute, Villigen PSI, Switzerland, Draft.

Reer, B. 2006b. An Approach for Ranking EOC Situations Based on Situational Factors. Paul Scherrer Institute, Villigen PSI, Switzerland, Draft.

Reer, B. 2007. Notes on the Application of the CESA Quantification Method for Scenarios Simulated in the Halden Reactor Experiments. Paul Scherrer Institute, Villigen PSI, Switzerland, Draft.

Reer, B., Dang, V.N. 2006. Situational Features of Errors of Commission Identified from Operating Experience. Paul Scherrer Institute, Villigen PSI, Switzerland, Draft.

Reer, B., Dang, V.N. 2007. The Commission Errors Search and Assessment (CESA) Method. PSI Report Nr. 07-03, ISSN 1019-0643, Paul Scherrer Institut, Switzerland.

Reer, B., Dang, V.N., Hirschberg, S. 2004. The CESA method and its application in a plant-specific pilot study on errors of commission. Reliability Engineering and System Safety 83: 187–205.


Exploratory and confirmatory analysis of the relationship between social norms and safety behavior

C. Fugas & S.A. da Silva
ISCTE/CIS, Lisbon, Portugal

J.L. Melià
University of Valencia, Valencia, Spain

ABSTRACT: Although several researchers have argued that social norms strongly affect health behaviors, the measurement of health and safety norms has received very little attention. In this paper, we report the results of a study designed to: 1) test the reliability and construct validity of a questionnaire devoted to the measurement of social influences on safety behavior; 2) assess the predictive validity of supervisors' and coworkers' descriptive and injunctive safety norms on safety behavior; and 3) test a Four-Factor model of social influence on safety behavior through confirmatory factor analysis (CFA). The questionnaire comprises four 11-item scales using a 7-point Likert-type response format; a self-report scale of safety behavior was also included. A sample (N = 250) of operational team workers from a Portuguese company participated voluntarily and anonymously in the study. Overall results from this study (EFA and CFA) confirmed the questionnaire structure and provided support for a correlated, Four-Factor model of Safety Group Norms. Furthermore, the study demonstrated that coworkers' descriptive and injunctive safety norms were strong predictors of safety behavior.

1 INTRODUCTION

Work accidents are still a problem in the world, far from being solved, even in some developed countries. According to Hämäläinen, Takala & Saarela (2006), the fatality and accident rates per 100 000 workers in Canada, United States, Ireland, Italy, Portugal and Spain are higher than average.

The understanding of the antecedents and determinants of safety behavior can be an important contribution to the prevention of work accidents. Consistent with the common focus of safety research at the individual level of analysis, the vast majority of behaviorally based safety interventions (e.g. safety training, feedback) have focused on individual factors. However, there is a widespread view that group norms represent an important contextual influence on health and safety behaviors (e.g. Conner, Smith & McMillan, 2003; Linnan et al., 2005), and interventions aimed at changing behaviors by reshaping people's normative beliefs may be promising. Thus, understanding the ‘‘pressure points’’ whereby group and normative variables shape workers' health and safety may contribute to maximizing the effectiveness of health promotion programs.

Evidence from workplace safety (safety climate) research reveals that groups emerge as a critical unit of analysis in safety task accomplishment and safety performance (e.g. Zohar, 2002), and it is suggested that they will become more important in the evolving modern workplace (Tesluk & Quigley, 2003). Nevertheless, the role of group and team factors in safety, relative to other criteria, has not been extensively studied.

On the other hand, the measurement of health and safety norms has also received little attention. Some recent applications of the theory of planned behavior use single-item or one-dimensional measures (see for an exception Fekadu & Kraft, 2002; Conner & McMillan, 1999) and have ascribed only crossover effects. These findings reflect the lower importance given to normative factors in the study of behaviors.

A meta-analytic review of the efficacy of the theory of planned behavior (Armitage & Conner, 2001) found that more reliable multiple-item measures of subjective norm and normative beliefs had significantly stronger correlations with intention than any of the other measures.

Thus, multidimensional measures, that might be consistently applied to quantify and characterize the

structure of norms, can be valuable to the understanding of the contextual normative influences on workers' health behaviors.

2 OVERVIEW OF THEORETICAL FRAMEWORK

The proposed conceptual framework in this paper relies on the workplace safety literature (e.g. Zohar & Luria, 2005) and on social cognitive theories, namely the theory of planned behavior (e.g. Ajzen, 2005) and the focus theory of normative conduct (e.g. Cialdini et al., 1990, 1991), to understand how group norms impact safety behavior.

According to Cialdini & Trost (1998), group norms are guidelines for acceptable and unacceptable behavior that develop through interactions among group members and are informally agreed upon by group members. They may be transmitted actively/verbally (e.g. explicit statements) or passively/non-verbally (e.g. modeling). Any social punishments for not complying with norms come from social networks and not from formal systems established by the organization. Following these assumptions, it is our view that safety group norms are internalized informal safety rules that work groups adopt to regulate and regularize group members' behavior.

Reviews of previous research indicate that, typically, normative influence on behavior has been studied in terms of subjective norm, or a person's perceptions of whether specific salient others think he/she should engage in the behavior, and the motivation to comply with such pressure (cf. Ajzen, 1991, 2005). For instance, an application of the theory of planned behavior to the prediction of safe behavior (see Johnson & Hall, 2005) used three items to measure subjective norms (e.g., ‘‘Most people who are important to me would strongly encourage/discourage me to lift materials within my strike zone’’).

But there is an important distinction in the literature on social influence between injunctive norms and descriptive norms. Cialdini and colleagues (e.g. Cialdini et al., 2006) call the subjective norms in the usual applications of the theory of planned behavior injunctive social norms, as they concern others' social approval or disapproval. They argue that ‘‘when considering normative influence on behavior, it is crucial to discriminate between the is (descriptive) and the ought (injunctive), because each refers to a separate source of human motivation (Deutsch & Gerard, 1955)’’ (cit. Cialdini et al., 1990, p. 1015). In other words, whereas descriptive norms provide information about what is normal, what most people do, and motivate human action by providing evidence of what is likely to be effective and adaptive action, injunctive norms provide information about what ought to be done and motivate by promising social rewards and punishments.

It is usually assumed that descriptive and injunctive norms are mutually congruent. According to Cialdini et al. (1990), ‘‘Because what is approved is often actually what is done, it is easy to confuse these two meanings of norms . . . it is important for a proper understanding of normative influence to keep them separate, especially in situations where both are acting simultaneously’’ (p. 1015).

Although the discriminant and convergent validity of the descriptive and injunctive norms (or subjective norms) constructs have been supported by the literature (see for a review, Rivis & Sheeran, 2003), it remains unclear, in the specific domain of safety, how both normative components predict safety behaviors.

Finally, in the present study, the definition of safety behavior comprises not only compliance behaviors, such as properly using personal protective equipment and engaging in work practices that reduce risk, but also more proactive safety behaviors or safety citizenship behaviors, such as helping teach safety procedures to new crew members, assisting others to make sure they perform their work safely, and making safety-related recommendations about work activities (see Burke, Sarpy, Tesluk, & Smith-Crowe, 2002 and Hofmann, Gerras & Morgeson, 2003).

3 METHOD

3.1 Participants

Participants in this study were operational workers, members of work teams employed in a Portuguese company (a passenger transportation company with high safety standards). The sample consisted of 250 workers, who provided anonymous ratings of the descriptive and injunctive safety norms of their coworkers and supervisors and of their own individual safety behavior. All participants were male, and most respondents (83.2%) were not supervisors and had heterogeneous jobs in the company. Almost half of the participants (48.8%) were between 31 and 40 years old, and the other half (42.5%) were over 41 years old. Finally, 38.4% had tenure on the job between 6 and 15 years, and 41.6% had tenure superior to 15 years.

3.2 Material

In this study, data was collected from individual group members using a quantitative methodology (a survey administered in the form of a questionnaire).

Four 11-item scales measured descriptive and injunctive safety group norms, and we also considered the referent implied in the survey items. Therefore, descriptive and injunctive norms were assessed in

relation to coworkers and supervisors. The scale items (descriptive vs. injunctive) were based on the assumptions of the focus theory of normative conduct (e.g. Cialdini et al., 2006) concerning the measurement of social norms, as in prior applications to the study of health behaviors in other contexts (e.g. Conner & McMillan, 1999). We should note that the four scales included generic items that can be applicable to many types of jobs involving safety issues. To begin scale development, we departed from Zohar and Luria's (2005) group-level safety climate scale and created a pool of 11 items for the individual's perceptions of the supervisor's descriptive safety norms (e.g. ‘‘My direct supervisor checks to see if we are obeying safety regulations’’). Then, each item was adapted to the individual's perceptions of the descriptive safety norms of their coworkers (e.g. ‘‘My team members check to see if other team members obey the safety rules’’), and we followed the same procedure for injunctive safety norms (e.g. ‘‘My direct supervisor thinks that compliance to safety rules should be checked’’ and ‘‘My team members think that compliance to safety rules should be checked out’’). This procedure resulted in an initial 44-item ‘‘Safety Group Norms’’ questionnaire. Respondents rated the frequency (using a 7-point scale ranging from never to always) with which coworkers and direct supervisors performed, or think should be performed, each of the 11 indicated behaviors.

Safety behavior was measured by self-report, using a revised and updated version of the original General Safety-Performance Scale (Burke et al., 2002) and the Safety Citizenship Role Definitions and Behavior Items (Hofmann et al., 2003). Because our interest was in overall safety behavior, we combined these two measures, resulting in a 12-item Safety Behavior Scale (e.g. ‘‘I properly performed my work while wearing personal protective equipment’’ and ‘‘I have made suggestions to improve safety’’). We assume that safety behaviors can be assessed with respect to the frequency with which employees engage in the behaviors, using a 7-point Likert-type scale ranging from never (1) to always (7), with high scores representing more positive safety behaviors.

Given the limitations associated with the use of self-reports of behavioral safety, the present study also used microaccidents as an outcome criterion of behavioral safety. Participants were asked to self-report the number of microaccidents they were involved in during the last 6 months. In this study, microaccidents were conceptualized according to Zohar (2000) as on-job, behavior-dependent minor injuries requiring medical attention but not incurring any lost workdays.

4 RESULTS

This two-step approach comprised exploratory analysis and confirmatory factor analysis.

In the first step, the exploratory analysis included factor analysis, reliability analysis and predictive validity analysis. Data was analyzed using the Statistical Package for the Social Sciences (SPSS). The method of factor extraction was principal components (varimax rotation).

In the second step, a confirmatory factor analysis was performed with Structural Equation Modeling (SEM) using the Maximum Likelihood Estimation (MLE) method. All estimations were made with AMOS 7.0 software. This multivariate technique was used to test (confirm) the measurement construct validity of a Four-Factor CFA model of safety group norms, and to compare this model with One-Factor and Two-Factor CFA models.

4.1 Exploratory factor analysis

Exploratory Factor Analysis (EFA) played an important complementary role in evaluating the proposed dimensionality of the constructs. The initial 44-item Safety Group Norms Questionnaire was submitted to two independent principal components analyses (varimax rotation) with the purpose of identifying the underlying structure of the variables and also of creating a much smaller subset of variables representative of the factors (four 5-item scales) for inclusion in the subsequent analysis (confirmatory factor analysis). The overall results of the reliability analysis of this smaller version are presented in Table 1. Four dimensions, each reflected by a separate factor, were identified (SDSN—Supervisor's Descriptive Safety Norms; SISN—Supervisor's Injunctive Safety Norms; CDSN—Coworker's Descriptive Safety Norms; CISN—Coworker's Injunctive Safety Norms). All factor loadings were above .65, all Cronbach's alpha coefficients exceeded .80, corrected item-to-total correlations were all above .60, and inter-item correlations were greater than .40. In brief, the exploratory factor analysis results confirm the proposed multidimensionality of the constructs and the appropriateness of the variables.

The 12-item Safety Behavior Scale was also submitted to factorial analysis (principal components analysis) and two factors were extracted. Only 9 items were retained for further analysis. The first factor was denominated proactive safety practices (PASB, 5 items) and explained 52.60% of the variance; the second factor was designated compliance safety practices (CPSB, 4 items) and explained 11.48% of the total variance. Both scales had Cronbach's alphas higher than .70, coefficients indicative of good internal consistency.
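The extraction step described above (principal components with varimax rotation, run in SPSS) can be sketched with NumPy. The data below are random placeholders standing in for the 250 × 11 item matrix, and `varimax` is a generic SVD-based implementation of the rotation, not the SPSS routine.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a p x k loading matrix (standard SVD update)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion, solved via SVD.
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3
                          - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0)))
        )
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 11))         # placeholder for 250 respondents x 11 items
R = np.corrcoef(X, rowvar=False)       # correlation matrix of the items
eigvals, eigvecs = np.linalg.eigh(R)
top = np.argsort(eigvals)[::-1][:2]    # retain the two largest components
loadings = eigvecs[:, top] * np.sqrt(eigvals[top])
rotated = varimax(loadings)            # rotated loading matrix, same communalities
```

Because the rotation is orthogonal, each item's communality (row sum of squared loadings) is unchanged; only the distribution of loading across factors becomes simpler to interpret.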

Table 1. Exploratory factor analysis results.

1A. Supervisor's safety norms.
Items   1    2
P1g   .84  .38
P1h   .87  .38
P1i   .86  .38
P1j   .83  .45
P1k   .79  .35
P2b   .42  .79
P2c   .42  .82
P2d   .35  .84
P2e   .37  .85
P2f   .37  .86
Alpha .96  .95
Total variance: 84.87%

1B. Coworker's safety norms.
Items   1    2
P3a   .22  .81
P3c   .27  .81
P3d   .15  .75
P3e   .24  .78
P3k   .39  .65
P4f   .81  .26
P4g   .87  .26
P4h   .90  .24
P4i   .83  .30
P4j   .87  .24
Alpha .86  .94
Total variance: 72.54%

Table 2. Multiple regression analysis to predict safety behavior (criterion variables).

        PASB                  CPSB
        B      SE B   β       B      SE B   β
SDSN  −0.02   0.06  −.03    −0.05   0.05  −.09
SISN   0.05   0.06   .07     0.08   0.05   .13
CDSN   0.27   0.07   .29**   0.28   0.06   .37**
CISN   0.30   0.07   .29**   0.20   0.06   .23*

Note: Adjusted R² = .28 (p < .0001) for Proactive Safety Behavior; Adjusted R² = .32 (p < .0001) for Compliance Safety Behavior. ** p < .0001, * p < .001.

Regression analysis was used to assess the predictive validity of safety group norms on safety behavior (the criterion variables). The principal results of this multiple regression are reported in Table 2.

The statistical test of the regression model's ability to predict proactive safety behaviors reveals that the model is statistically significant (F = 24.201; p < .0001); supervisors' and coworkers' descriptive and injunctive safety norms together accounted for 28.1% of the variation in proactive safety practices (Adjusted R² = .28). Nevertheless, proactive safety practices were only significantly associated with coworkers' descriptive safety norms (Beta = .29, p < .0001) and coworkers' injunctive safety norms (Beta = .29, p < .0001).

The model for the prediction of compliance safety behaviors was also confirmed (F = 28.819; p < .0001); supervisors' and coworkers' descriptive and injunctive norms accounted for 31.7% of the variation in compliance safety behavior (Adjusted R² = .32). However, as found previously, compliance safety behavior was only significantly associated with coworkers' descriptive safety norms (Beta = .37, p < .0001) and coworkers' injunctive safety norms (Beta = .23, p < .0001).

Microaccidents did not predict safety behavior (Adjusted R² = .02, p > .05).

4.2 Confirmatory factor analysis

With the purpose of confirming the dimensionality of the safety group norm constructs, a Four-Factor CFA model was tested (Model 1). This model was compared with a Two-Factor CFA model (Model 2) and a One-Factor CFA model (Model 3). Model fit was assessed considering the Chi-square, overall goodness-of-fit statistics, an analysis of residuals and the magnitudes of the item factor loadings (see Table 3 and Fig. 1).

Table 3. Summary of model fit statistics.

Model  χ²/DF  GFI  CFI  RMSEA  AIC      ECVI
1      2.16   .90  .97  .07    287.8    1.19
2*     3.75   .93  .98  .11    105.5    .44
3      14.7   .48  .63  .24    1595.95  6.62

* This value refers to the supervisor's scale.

[Figure 1 shows a path diagram: items p1g–p1j load on SDSN, p2b–p2f on SISN, p3a–p3k on CDSN and p4g–p4j on CISN, with error terms on each item and correlations among the four factors.]

Figure 1. Standardized maximum likelihood solution for the hypothesized four-factor CFA model (all ps < .0001).
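The fit statistics in Table 3 can be cross-checked internally: with N = 250, one common form of the RMSEA formula, RMSEA = sqrt(max(χ² − df, 0)/(df·(N − 1))), depends only on the χ²/df ratio and the sample size. Applied to the three models, it reproduces the reported .07 and .11 exactly after rounding; Model 3 comes out at ≈.23 against the reported .24, a gap small enough to stem from rounding in the published χ²/df ratio.

```python
from math import sqrt

def rmsea_from_ratio(chi2_df_ratio, n):
    """RMSEA from the chi-square/df ratio and sample size n (point estimate)."""
    return sqrt(max(chi2_df_ratio - 1.0, 0.0) / (n - 1))

# chi-square/df ratios from Table 3 (Models 1, 2 and 3), N = 250 respondents.
for ratio in (2.16, 3.75, 14.7):
    print(round(rmsea_from_ratio(ratio, 250), 2))  # prints 0.07, 0.11, 0.23
```

This kind of consistency check is a cheap way to verify transcribed fit indices when the raw χ² and df are not reported.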

The whole set of fit statistics confirms the Four-Factor CFA model (according to the Hu and Bentler, 1999 criteria for fit indexes). The examination of the first set of model fit statistics shows that the Chi-square/degrees-of-freedom ratio for the Four-Factor CFA model (2.16) is adequate; the GFI index (.90), the CFI (.97) and the RMSEA (.07) are consistent in suggesting that the hypothesized model represents an adequate fit to the data; finally, the AIC and the ECVI (criteria used in the comparison of two or more models) present smaller values than in the One-Factor CFA model, which indicates a better fit of the hypothesized Four-Factor CFA model.

The standardized path coefficients are portrayed in Figure 1. The parameters relating items to factors ranged between .59 and .95 (p < .0001). Correlations between the four factors were also significant (p < .0001) and ranged between .49 and .79.

5 CONCLUSIONS

The results of the current study add to the growing body of literature on the valid measurement of safety group norms. Results of EFA and CFA confirm the questionnaire structure. Supervisor's and coworker's descriptive and injunctive safety norms are not isomorphic constructs; they refer to separate dimensions of social influence on safety behavior. Overall, the results provide support for a correlated, four-factor model.

In addition, this study further examines the predictive power of safety group norms on safety behavior. In line with expectations, multiple regression results have demonstrated that coworker's descriptive and injunctive safety norms are strong predictors of proactive and compliance safety behavior. Surprisingly, in this company, supervisor's safety norms did not predict individual safety behavior. These results suggest that coworkers' safety behaviors are mostly influenced by their peers, and less by their supervisors, and that coworkers can mediate the effect of supervisor's safety norms on individual safety behavior at work. This is an important finding, given that very little research has examined how peers can impact safety behavior. These results also suggest that organizational safety initiatives should be aware of the important role of fellow team members in individual attitudes and safety behaviors at work. New research should be conducted in order to replicate the factorial structure and to test whether the pattern of safety influences is an idiosyncratic characteristic of the structure of teamwork and supervision in this company. Also, the validation of this model for predicting behavior will further include the test of a more complete socio-cognitive model with SEM, ideally using a multilevel approach.

ACKNOWLEDGEMENTS

This research was supported by Fundação para a Ciência e Tecnologia, Portugal (SFRH/BDE/15635/2006) and Metropolitano de Lisboa.

REFERENCES

Ajzen, I. 1991. The theory of planned behavior. Organizational Behavior and Human Decision Processes, 50: 179–211.
Ajzen, I. 2005. Laws of human behavior: Symmetry, compatibility and attitude-behavior correspondence. In A. Beauducel, B. Biehl, M. Bosniak, W. Conrad, G. Schönberger & D. Wagener (eds.), Multivariate Research Strategies: 3–19. Aachen, Germany: Shaker Verlag.
Armitage, C.J. & Conner, M. 2001. Efficacy of the theory of planned behaviour: a meta-analytic review. British Journal of Social Psychology, 40: 471–499.
Burke, M.J., Sarpy, S.A., Tesluk, P.E. & Smith-Crowe, K. 2002. General safety performance: A test of a grounded theoretical model. Personnel Psychology, 55.
Cialdini, R.B., Kallgren, C.A. & Reno, R. 1991. A focus theory of normative conduct. Advances in Experimental Social Psychology, 24: 201–234.
Cialdini, R.B., Reno, R. & Kallgren, C.A. 1990. A focus theory of normative conduct: Recycling the concept of norms to reduce littering in public places. Journal of Personality and Social Psychology, 58(6): 1015–1026.
Cialdini, R.B., Sagarin, B.J., Barrett, D.W., Rhodes, K. & Winter, P.L. 2006. Managing social norms for persuasive impact. Social Influence, 1(1): 3–15.
Cialdini, R.B. & Trost, M.R. 1998. Social influence: Social norms, conformity and compliance. In D.T. Gilbert, S.T. Fiske & G. Lindzey (eds.), The Handbook of Social Psychology (4th ed., Vol. 2): 151–192. New York: McGraw-Hill.
Conner, M. & McMillan, B. 1999. Interaction effects in the theory of planned behavior: Studying cannabis use. British Journal of Social Psychology, 38: 195–222.
Conner, M., Smith, N. & McMillan, B. 2003. Examining normative pressure in the theory of planned behaviour: Impact of gender and passengers on intentions to break the speed limit. Current Psychology: Developmental, Learning, Personality, Social, 22(3): 252–263.
Deutsch, M. & Gerard, H.B. 1955. A study of normative and informational social influences upon individual judgment. Journal of Abnormal and Social Psychology, 51: 629–636.
Fekadu, Z. & Kraft, P. 2002. Expanding the Theory of Planned Behavior: The role of social norms and group identification. Journal of Health Psychology, 7(1): 33–43.
Hämäläinen, P., Takala, J. & Saarela, K.L. 2006. Global estimates of occupational accidents. Safety Science, 44: 137–156.
Hofmann, D.A., Morgeson, F.P. & Gerras, S.J. 2003. Climate as a moderator of the relationship between LMX and content specific citizenship behavior: Safety climate as an exemplar. Journal of Applied Psychology, 88(1): 170–178.
Hu, L.-T. & Bentler, P.M. 1999. Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6: 1–55.
Johnson, S. & Hall, A. 2005. The prediction of safe lifting behavior: An application of the theory of planned behavior. Journal of Safety Research, 36: 63–73.
Linnan, L., Montagne, A., Stoddard, A., Emmons, K.M. & Sorensen, G. 2005. Norms and their relationship to behavior in worksite settings: An application of the Jackson Return Potential Model. American Journal of Health Behavior, 29(3): 258–268.
Rivis, A. & Sheeran, P. 2003. Descriptive norms as an additional predictor in the Theory of Planned Behaviour: A meta-analysis. Current Psychology: Developmental, Learning, Personality, Social, 22(3): 218–233.
Tesluk, P. & Quigley, N.R. 2003. Group and normative influences on health and safety: perspectives from taking a broad view on team effectiveness. In D.A. Hofmann & L.E. Tetrick (eds.), Health and Safety in Organizations: A multilevel perspective: 131–172. San Francisco: John Wiley & Sons.
Zohar, D. 2000. A group-level model of safety climate: Testing the effect of group climate on microaccidents in manufacturing jobs. Journal of Applied Psychology, 85: 587–596.
Zohar, D. 2002. The effects of leadership dimensions, safety climate, and assigned priorities on minor injuries in work groups. Journal of Organizational Behavior, 23: 75–92.
Zohar, D. & Luria, G. 2005. A multilevel model of safety climate: Cross-level relationships between organization and group-level climates. Journal of Applied Psychology, 90(4): 616–628.

Functional safety and layer of protection analysis with regard to human factors

K.T. Kosmowski
Gdansk University of Technology, Gdansk, Poland

ABSTRACT: This paper is devoted to selected aspects of knowledge-based layer of protection analysis (LOPA) of industrial hazardous systems with regard to human and organizational factors. The issue is discussed in the context of functional safety analysis of the control and protection systems to be designed and operated according to the international standards IEC 61508 and IEC 61511. The layers of protection can include, for instance, the basic process control system (BPCS), the human-operator (HO) and the safety instrumented system (SIS). Such layers should be independent; however, due to some of the factors involved, dependencies can occur, which may increase the risk level of the accident scenarios identified in the process of risk analysis. The method is illustrated with the example of the turbine control system (TCS) and turbine protection system (TPS) of a turbo-generator set, which has to be shut down in some situations of internal or external disturbances. The required risk reduction is distributed between the TCS and TPS, which are to be designed for an appropriate safety integrity level (SIL).

1 INTRODUCTION

The intensive development of advanced technologies is observed in many domains, such as manufacturing, the electric power sector, the process industry, transport, shipping etc. However, the organisational and management structures in these domains are often traditional. Some researchers express the opinion that "second generation management is applied to fifth generation technology" (Rasmussen & Svedung 2000).

Research works concerning the causes of industrial accidents indicate that broadly understood human errors, resulting from organisational inefficiencies and management inadequacies, are determining factors in 70–90% of cases, depending on the industrial sector and the system category. Because several defences against foreseen faults and potential accidents are usually used to protect people and installations, multiple faults have contributed to many accidents.

It has often been emphasized that disasters arise from a combination of latent and active human errors committed during design, operation and maintenance (Embrey 1992). The characteristic of latent errors is that they do not immediately degrade the functions of systems, but in combination with other events, such as random equipment failure, external disturbance or active human errors, they can contribute to a catastrophic failure or major accident with serious consequences.

Embrey (1992) distinguishes two categories of latent errors: operational and organizational. Typical operational latent errors include maintenance errors, which may make a critical system unavailable or leave it in a vulnerable state. Organizational latent errors include design errors, which give rise to intrinsically unsafe systems, and management or safety policy errors, which create conditions that induce active human errors. Some categorizations of human errors have been proposed, e.g. by Swain & Guttmann (1983) and Reason (1990).

Traditionally, human and organisational factors are incorporated into the probabilistic models through the failure events and their probabilities, evaluated using relevant methods of human reliability analysis (HRA) (Gertman & Blackman 1994). In some HRA methods it is also possible to introduce the organisational factors, related in principle to safety culture aspects. Correct evaluation of human behaviour and potential errors is a prerequisite of correct risk assessment and rational safety-related decision making in the safety management process (Swain & Guttmann 1983, Dougherty & Fragola 1988, COA 1998, Hollnagel 2005).

Lately, some approaches have been proposed by Carey (2001), Hickling et al. (2006) and Kosmowski (2007) on how to deal with the issues of human factors in functional safety management. Human actions and related errors can be committed in the entire life cycle of the system, from the design stage, installation and commissioning, through operation, to decommissioning. During the operation time the human-operator interventions include the control actions during transients,
disturbances, faults and accidents, as well as the diagnostic activities, the functionality and safety integrity tests, and maintenance actions and repairs after faults.

The operators supervise the process and make decisions using alarm panels within the operator support system (OSS), which should be designed carefully for abnormal situations and accidents, also for cases of partial faults and dangerous failures within the electric, electronic and programmable electronic (E/E/PE) systems (IEC 61508) or the safety instrumented systems (SIS) (IEC 61511). The OSS, when properly designed, will contribute to reducing the human error probability and lowering the risk of potential accidents.

The paper outlines the concept of using a knowledge-based method for the layer of protection analysis (LOPA) of industrial hazardous systems with regard to the influence of human and organizational factors (H&OF). Various layers of protection can be distinguished in the context of identified accident scenarios, including e.g. the basic process control system (BPCS), the human-operator (HO) and the safety instrumented system (SIS), designed according to the requirements and probabilistic criteria given in the functional safety standards IEC 61508 and IEC 61511. The protection layers should be independent. However, due to some of the factors involved, dependencies can occur. This can result in a significant increase of the risk level of the accident scenarios identified in the risk evaluation process. The problem should be carefully analyzed at the system design stage in order to select relevant safety-related functions of appropriate SILs. To reduce the human error probability (HEP), an advanced OSS should be designed. For dynamic processes with a short permitted HO reaction time, the HEP can be high, close to 1. The paper emphasizes the importance of context-oriented human reliability analysis (HRA) within functional safety management and the necessity to incorporate the more important influencing factors in a systematic way.

2 FUNCTIONAL SAFETY ANALYSIS AND HUMAN FACTORS

2.1 Safety integrity levels and probabilistic criteria

Modern industrial systems are extensively computerised and equipped with complex programmable control and protection systems. In the design of control and protection systems, a functional safety concept (IEC 61508) is more and more widely implemented in various industrial sectors, e.g. the process industry (IEC 61511) and the machine industry (IEC 62061).

The aim of functional safety management is to reduce the risk of a hazardous system to an acceptable or tolerable level by introducing a set of safety-related functions (SRFs) to be implemented using the control and/or protection systems. The human-operator (HO) contributes to the realization of an SRF according to the technical specification. There are still methodological challenges concerning functional safety management in the life cycle (Kosmowski 2006).

An important term related to the functional safety concept is safety integrity (IEC 61508), understood as the probability that a given safety-related system will satisfactorily perform the required SRF under all stated conditions within a given period of time. The safety integrity level (SIL) is a discrete level (1 ÷ 4) for specifying the safety integrity requirements of a given safety function to be allocated using the E/E/PE system or SIS (IEC 61511). Safety integrity level 4 (SIL4) is the highest level, which requires a redundant architecture of the E/E/PE system with diagnosing and testing of subsystems.

For consecutive SILs two probabilistic criteria are defined in IEC 61508, namely:

– the average probability of failure to perform the safety-related function on demand (PFDavg) for a system operating in a low demand mode, and
– the probability of a dangerous failure per hour, PFH (the frequency), for a system operating in a high demand or continuous mode of operation.

The interval probabilistic criteria for the safety-related functions to be implemented using E/E/PE systems are presented in Table 1. Similar interval criteria are also used in assessments of SIS (IEC 61511) and of Safety-related Electrical, Electronic and Programmable Electronic Control Systems for Machinery (SRECS) (IEC 62061).

The SIL for a given SRF is determined in the risk assessment process for a defined risk matrix, which includes areas for distinguished categories of risk, e.g. unacceptable, moderate and acceptable (IEC 61508). Verifying the SIL for a given safety-related function to be implemented using the E/E/PE or SIS system is usually a difficult task, due to the lack of reliability data and other data used as parameters in probabilistic models of the system being designed. In such a situation, a qualitative method for crude verification of the SIL is permitted in IEC 61508 for the system architectures considered at the design stage.

Table 1. Probabilistic criteria for safety functions to be allocated using E/E/PE systems.

SIL   PFDavg              PFH [h−1]
4     [10−5, 10−4)        [10−9, 10−8)
3     [10−4, 10−3)        [10−8, 10−7)
2     [10−3, 10−2)        [10−7, 10−6)
1     [10−2, 10−1)        [10−6, 10−5)
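As a quick numerical illustration of Table 1, the interval criteria can be encoded as a lookup. The sketch below is ours, not part of the standard; the function names are hypothetical.

```python
# Map a probabilistic measure to a SIL using the interval criteria of Table 1
# (IEC 61508). Intervals are closed on the left, open on the right.

def sil_from_pfd_avg(pfd_avg):
    """Return SIL 1-4 for a low-demand function, or None if outside the bands."""
    bands = {4: (1e-5, 1e-4), 3: (1e-4, 1e-3), 2: (1e-3, 1e-2), 1: (1e-2, 1e-1)}
    for sil, (lo, hi) in bands.items():
        if lo <= pfd_avg < hi:
            return sil
    return None

def sil_from_pfh(pfh):
    """Return SIL 1-4 for a high-demand/continuous function, or None."""
    bands = {4: (1e-9, 1e-8), 3: (1e-8, 1e-7), 2: (1e-7, 1e-6), 1: (1e-6, 1e-5)}
    for sil, (lo, hi) in bands.items():
        if lo <= pfh < hi:
            return sil
    return None

print(sil_from_pfd_avg(5e-4))  # → 3
print(sil_from_pfh(2e-8))      # → 3
```

Note that a claimed SIL also depends on architectural constraints and systematic capability, which a purely numerical check like this one does not capture.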
However, the rules proposed in this standard for qualitative evaluation of the E/E/PE block diagrams should be used with caution, because they do not fully cover cases of potential dependent failures of subsystems. The problem is more challenging if it is necessary to include potential human errors, which are related to the human and organisational factors (Kosmowski 2007).

2.2 Functional safety management with regard to human factors

Lately, a framework was proposed (Carey 2001) for addressing human factors in IEC 61508, with consideration of a range of applications of the E/E/PE systems in safety-related applications. It has highlighted the diversity of ways in which human factors requirements map onto various E/E/PE systems in different industries and implementation contexts. Some conclusions were formulated as follows:

– Determination of the safety integrity level (SIL) for an E/E/PE system requires careful consideration not only of the direct risk reduction functions it is providing, but also of those risk reduction functions performed by staff that interact with it; this requires addressing in the Hazard and Risk Analysis step of the IEC 61508 lifecycle;
– Having determined the safety integrity of the E/E/PE system, it was suggested that the effort that needs to be placed into operations and maintenance in relation to human factors should be greater as the SIL level increases, especially for solutions of SIL3 and SIL4;
– The types of human factors issues that need to be addressed vary between the classes of systems discussed; therefore, the framework is not specific in terms of the technology or other aspects related to human factors.

Some general remarks were made for addressing human factors (HFs) within IEC 61508 that include:

– Incorporation of human tasks and errors into the Hazard and Risk Assessment process;
– Use of tables to define the human factors requirements for a given safety integrity level.

In the paper by Hickling et al. (2006) the publication of a Guidance for Users of IEC 61508 was announced. The guidance is designed to respond to requirements laid down in this standard. These fall into two broad categories, associated with:

1. the hazard and risk analysis, and
2. the interface between the human operator and the technology (process).

The hazard and risk analysis has to include (Gertman & Blackman 1994, Hickling et al. 2006, IEC 61508):

– All relevant human and organizational factors issues,
– Procedural faults and human errors,
– Abnormal and infrequent modes of operation,
– Reasonably foreseeable misuse,
– Claims on operational constraints and interventions.

It is emphasized that the operator interface analysis should:

– Be covered in the safety requirements,
– Take account of human capabilities and limitations,
– Follow good HF practice,
– Be appropriate for the level of training and awareness of potential users,
– Be tolerant of mistakes (see the classification of unsafe acts and human errors by Reason 1990).

Thus, the scope of analyses should include the relevant human and organizational factors (H&OFs) aspects traditionally included in the HRA methods used in Probabilistic Safety Analysis (PSA) (Swain & Guttmann 1983, Humphreys 1988, COA 1998).

2.3 Human reliability analysis in probabilistic modeling of safety-related systems

The human reliability analysis (HRA) methods are used for assessing the risks from potential human errors, and for reducing the vulnerability of the system operating in a given environment. However, some basic assumptions in the HRA methods used within probabilistic safety analysis of hazardous systems are subjects of dispute between researchers (Hollnagel 2005).

Practically all HRA methods assume that it is meaningful to use the concept of human errors and that it is justified to estimate their probabilities. Such a point of view is sometimes questioned, due to not fully verified assumptions about human behavior and potential errors. Hollnagel (2005) concludes that HRA results are of limited value as an input for PSA, mainly because of an oversimplified conception of human performance and human error.

In spite of this criticism, while waiting for the next generation of HRA methods, human factor analysts use several existing HRA methods for PSA. Below, selected HRA methods which can be applied for HRA within functional safety analysis are shortly characterized. Rough human reliability assessments based on qualitative information with regard to human factors can be especially useful for initial decision making at the design stage of the safety-related functions
and systems (Kosmowski 2007). It will be demonstrated that a functional safety analysis framework gives additional insights in HRA.

Several traditional HRA methods have been used in PSA practice, including the THERP method (Swain & Guttmann 1983), developed for the nuclear industry but applied also in various industrial sectors. Other HRA methods more frequently used in practice are: the Accident Sequence Evaluation Procedure-Human Reliability Analysis Procedure (ASEP-HRA), the Human Error Assessment and Reduction Technique (HEART), and the Success Likelihood Index Method (SLIM); see the description and characterization of HRA methods in Humphreys (1988) and COA (1998).

The first two methods mentioned (THERP and ASEP-HRA) are decomposition methods, based on a set of data and rules for evaluating the human error probabilities (HEPs). HEART consists of generic probabilistic data and a set of influence factors for correcting the nominal human error probability. SLIM enables the definition of a set of influence factors, but requires data for calibrating the probabilistic model.

In the publication by Byers et al. (2000) five HRA methods were selected for comparison, on the basis of either relatively widespread usage or recognized contribution as a newer contemporary technique:

– Technique for Human Error Rate Prediction (THERP) (Swain & Guttmann 1983);
– Accident Sequence Evaluation Program (ASEP) (Swain 1987);
– Cognitive Reliability and Error Analysis Method (CREAM) (Hollnagel 1998);
– Human Error Assessment and Reduction Technique (HEART) (Williams 1988);
– A Technique for Human Event Analysis (ATHEANA) (USNRC 1998).

In addition to these methods, other sources of information were also examined to provide insights concerning the treatment and evaluation of human errors. Comparisons were also made with regard to the SPAR-H HRA method (Gertman et al. 2005).

The team members were asked to construct more detailed summaries of their methods and sources, including specific lists of: 1) error types; 2) any base rates associated with the error types; 3) performance shaping factors (PSFs); 4) PSF weights; and 5) dependency factors. Obviously, not all methods and sources contained all of this information, and certainly not all used the same terminology of error types and PSFs (Byers et al. 2000).

In summary, the authors conclude that, as a result of the comparison of the ASP HRA methodology to other methods and sources, enhancements were made as regards: error type names, error type definitions, PSFs, PSF weights, PSF definitions, dependency conditions and dependency definitions. No changes were made to the base error weights. Each task no longer has to be rated on both processing (diagnosis) and response (action) components; only if the task contains diagnosis does diagnosis get rated, and similarly for action.

Changes were also made to the worksheet to enhance usability and to gather more information when non-nominal ratings are made. The overall range of possible HEPs has been expanded.

The final conclusion is that the enhanced SPAR-H methodology is useful as an easy-to-use, broadly applicable HRA screening tool. The comparisons and enhancements allow the SPAR-H methodology to maintain the strengths of the original ASP HRA (1994) methodology, while taking advantage of the information available from user feedback and from other HRA methods and sources.

HEP is evaluated when the human failure event is placed into the structure of the probabilistic model of the system. In the HRA within PSA only the more important human failure events are considered (Kosmowski 2004). Then the context-related PSFs are specified and determined according to the rules of the given HRA method. As the result, the particular value of HEP is calculated. Different methods are used for evaluating HEP with regard to PSFs, e.g. assuming a linear relationship for each identified PSF_k and its weight w_k, with a constant C for the model calibration

HEP = HEP_nominal · Σ_k (w_k · PSF_k) + C    (1)

or a nonlinear relationship, as in the SPAR-H methodology (Gertman et al. 2005)

HEP = (NHEP · PSF_composite) / [NHEP · (PSF_composite − 1) + 1]    (2)

where NHEP is the nominal HEP; NHEP equals 0.01 for diagnosis and 0.001 for action.
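The two adjustment schemes of equations (1) and (2) can be sketched numerically. Only the SPAR-H NHEP defaults (0.01 for diagnosis, 0.001 for action) come from the text; the PSF values fed in below are invented for illustration.

```python
# Sketch of the linear (1) and SPAR-H composite (2) HEP adjustment formulas.

def hep_linear(hep_nominal, psf_weights, psf_values, c=0.0):
    """Eq. (1): HEP = HEP_nominal * sum_k(w_k * PSF_k) + C."""
    return hep_nominal * sum(w * p for w, p in zip(psf_weights, psf_values)) + c

def hep_spar_h(nhep, psf_composite):
    """Eq. (2): HEP = NHEP*PSF_comp / (NHEP*(PSF_comp - 1) + 1)."""
    return nhep * psf_composite / (nhep * (psf_composite - 1.0) + 1.0)

# SPAR-H diagnosis task (NHEP = 0.01) with an assumed composite PSF of 10:
print(hep_spar_h(0.01, 10.0))  # ≈ 0.0917
```

The nonlinear form of equation (2) keeps the adjusted HEP below 1 even for very large composite PSF values, which the linear form of equation (1) does not guarantee.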
2.4 Including human failure events in functional safety analysis and probabilistic modeling

In probabilistic modeling of the E/E/PE safety-related system, the human failure events and their probabilities are elements of a subsystem model, as explained below. For instance, PFD_avg of an E/E/PE subsystem (SUB) operating in the low demand mode is calculated from the formula (Kosmowski et al. 2006):

PFD_avg^SUB ≅ PFD_avg^FT + PFD_avg^AT + PFD_HE    (3)

where PFD_avg^FT is the average probability of subsystem failure on demand, detected in the periodical
functional test (FT); PFD_avg^AT is the probability of subsystem failure on demand, detected in automatic test (AT); and PFD_HE is the probability of failure on demand due to human error (HE). Depending on the subsystem and situation considered, the human error can be a design error (hardware or software related) or an operator error (activities of the operator in the control room or as a member of a maintenance group).

The E/E/PE safety-related system being designed consists of the subsystems: sensors/transducers/converters (STC), programmable logic controllers (PLC) and equipment under control (EUC). Each of these subsystems can generally be treated as a KooN architecture, which is determined during the design. Each PLC comprises the central processing unit (CPU), input modules (digital or analog) and output modules (digital or analog). The average probability of failure on demand PFD_avg of the E/E/PE safety-related system (SYS) is evaluated as the sum of the probabilities for these subsystems (assuming small values of the probabilities) from the formula

PFD_avg^SYS ≅ PFD_avg^STC + PFD_avg^PLC + PFD_avg^EUC    (4)

The PFDs in formula (4) can be evaluated qualitatively (indicating the SIL with the relevant interval of PFD_avg; see Table 1) or quantitatively (as results of the probabilistic modeling process) with regard to potential dependent failures and relevant human errors. For complex KooN architectures suitable methods of probabilistic modeling are used (Kosmowski 2007).
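The additive approximations of equations (3) and (4) can be illustrated numerically. The subsystem figures below are assumed values chosen for illustration, not data from the paper.

```python
# With small probabilities, subsystem PFDs add approximately (eqs. 3 and 4).

def pfd_subsystem(pfd_ft, pfd_at, pfd_he):
    """Eq. (3): PFD_avg^SUB ~= PFD_avg^FT + PFD_avg^AT + PFD_HE."""
    return pfd_ft + pfd_at + pfd_he

def pfd_system(pfd_stc, pfd_plc, pfd_euc):
    """Eq. (4): PFD_avg^SYS ~= PFD_avg^STC + PFD_avg^PLC + PFD_avg^EUC."""
    return pfd_stc + pfd_plc + pfd_euc

# Illustrative subsystem contributions; the STC term includes a human-error part.
stc = pfd_subsystem(4e-4, 1e-4, 5e-4)
plc = pfd_subsystem(1e-4, 5e-5, 0.0)
euc = pfd_subsystem(6e-4, 2e-4, 0.0)
print(pfd_system(stc, plc, euc))  # about 1.95e-3, i.e. in the SIL2 interval of Table 1
```

Note that the simple sum is only a first-order approximation; it ignores dependent failures between subsystems, which the text flags as the harder part of the analysis.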

3 HUMAN FACTORS AND LAYERS OF PROTECTION ANALYSIS

3.1 Defining layers of protection

Industrial hazardous plants are nowadays designed according to the concept of defense in depth, distinguishing several protection layers. The design of safety-related systems is based on the risk assessment for determination of the required SIL and its verification in the process of probabilistic modeling (GSAChP 1993, IEC 61511).

Figure 1 shows typical protection layers of a hazardous industrial plant. An interesting methodology for preliminary risk analysis and safety-related decision making is the Layer of Protection Analysis (LOPA) methodology (LOPA 2001).

Figure 1. Typical protection layers of a hazardous installation: 6. Plant/community emergency response; 5. Relief devices/physical protection; 4. Safety instrumented systems (SISs); 3. Critical alarms and operator interventions; 2. Control and monitoring (BPCS); 1. Installation/process.

The layers shown in Figure 1 can be dependent. A layer (a device, system, or action) to be considered as an Independent Protection Layer (IPL) should be (LOPA 2001):

– effective in preventing the consequence when it functions as designed,
– independent of the initiating event and the components of any other IPL already claimed for the same scenario,
– auditable, i.e. the assumed effectiveness in terms of consequence prevention and PFD must be capable of validation (by documentation, review, testing, etc.).

An active IPL generally comprises the following components:

– A: a sensor of some type (instrument, mechanical, or human),
– B: a decision-making element (logic solver, relay, spring, human, etc.),
– C: an action (automatic, mechanical, or human).

Such an IPL can be designed as a Basic Process Control System (BPCS) or a SIS (see layers 2 and 4 in Fig. 1). These systems should be functionally and structurally independent; however, this is not always possible in practice. Figure 2 illustrates the functional relationships of the three protection layers 2, 3 and 4 shown in Figure 1. An important part of such a complex system is the man-machine interface (MMI) (GSAChP 1993, Gertman & Blackman 1994). Its functionality and quality is often included as an important PSF in HRA (Kosmowski 2007).

Figure 2. Components of functional safety systems for control and protection of a hazardous plant (the SIS and BPCS with their STC and EUC subsystems, state control, testing and supervision links, and the information/alarm interface from the plant to the human operators and their decisions).
Figure 3. Protection layers for reducing the frequency of accident scenarios: PL1 (BPCS), PL2 (HO), PL3 (SIS).

Figure 4. Protection layers for reducing the frequency of an accident scenario: PL1 (TCS), PL2 (HO), PL3 (TPS).

3.2 Dependency issue of protection layers

Protection layers (PLs) 2, 3 and 4 from Figure 1 are shown in Figure 3. They include:

– PL1: the basic process control system (BPCS),
– PL2: the human-operator (HO),
– PL3: the safety instrumented system (SIS).

If these layers are treated as independent, the frequency of the i-th accident scenario F_i^IPLs is calculated from the formula

F_i^IPLs = F_i^I · PFD_i^PL1 · PFD_i^PL2 · PFD_i^PL3    (5)

where F_i^I is the frequency of the initiating event (I), and PFD_i^PL1, PFD_i^PL2, PFD_i^PL3 are the probabilities of failure on demand of the protection layers shown in Figure 3, treated as independent layers.

If these protection layers are dependent, the frequency of the i-th accident scenario F_i^PLs is calculated taking into account the conditional probabilities, and a higher value of the frequency will be obtained:

F_i^PLs = F_i^I · PFD_i^PL1 · PFD_i^PL2 · PFD_i^PL3 = d · F_i^IPLs    (6)

where the PFDs are now understood as conditional probabilities. The value of the coefficient d can be much higher than 1 (d >> 1). Its value, depending on the situation analyzed, can be higher even by an order of magnitude. The value of d is significantly influenced by the assumptions concerning the probabilistic modeling of dependent failures due to equipment failures and/or human errors (Kosmowski 2007).

An important problem is to model the dependencies at the level of the accident scenario (between protection layers). It is especially challenging for layer 2 (HO), which is usually significantly dependent on layer 1 (BPCS). These cases are considered below with regard to the potential influence of human and organizational factors.

3.3 Treating dependent failure events in human reliability analysis

As mentioned, the international standard IEC 61508 emphasizes in many places the significance of human failures and their potential contribution to the probabilistic evaluations; however, there is no direct indication of how this problem should be solved. Some general proposals are mentioned in the report by Carey (2001). One of the methods for treating dependent failures is proposed in the THERP technique (Swain & Guttmann 1983).

THERP offers a dependency model for potential human failure events to be considered in complex situations, distinguishing: ZD (zero dependence), LD (low dependence), MD (moderate dependence), HD (high dependence), and CD (complete dependence). This model assumes fixed levels of the dependency factors, equivalent to beta-H factors: βH = 0 for ZD, βH = 0.05 for LD, βH = 0.14 for MD, βH = 0.5 for HD and βH = 1 for CD.

This dependency model is illustrated on an example of two dependent events of potential human errors: A (previous) and B (consecutive). The probability of making error A and error B (potentially dependent) is evaluated as follows:

P(A · B) = P(A) · P(B|A) = (1 − βH) · QA · QB + βH · QA    (7)

where P(A) = QA and P(B) = QB are the probabilities of the relevant failure events. For βH = 0 (independence of the events/errors) the result is P(A · B) = QA · QB, but for βH = 1 (complete dependence of the errors) P(A · B) = QA = P(A). When QA = QB << 1, a simplified version of formula (7) can be used: P(A · B) ≅ βH · QA. The value of βH depends on various factors to be considered in the probabilistic modeling of the given safety-related situation (Swain & Guttmann 1983).

FIPLs = FiI PFDiPL1 PFDiPL2 PFDiPL3 = d · FiIPLs (6) 3.4 An example of protection layers analysis
including human failure event
The value of coefficient d can be much higher than Two safety-related systems are considered below: a
1 (d >> 1). Its value, depending on situation ana- turbine control system (TCS) and a turbine protection
lyzed, can be higher even an order of magnitude. The system (TPS). These systems perform an important
value of d is significantly influenced by assumptions function to switch off the relevant valves and shut down
concerning the probabilistic modeling of dependent a turbine set of the electrical power unit in situations
failures due to equipment failures and/or human errors of internal or external disturbances. Failures of these
(Kosmowski 2007). systems can lead to very serious consequences.
An important problem is to model the dependencies The TSC operates in a continuous mode of opera-
on the level of accident scenario (between protection tion and TPS is the system operating in a low demand
layers). It is especially challenging for the layer 2 (HO) mode (see item 2.1). These systems are designed
which is usually significantly dependent on the layer 1 with regard to required SIL, determined in the pro-
(BPCS). These cases are considered below with regard cess of risk assessment. Based on these assessments it
to potential influence of human and organizational was evaluated that the risk measure should be low-
factors. ered by factor of 10−4 (equivalent to SIL4) using
TCS and TPS, treated as the first and third barriers (see Fig. 3). Between these layers there is a second barrier: the human-operator (HO) (see Fig. 4).

Preliminary results of the functional safety analysis of TCS and TPS indicate that, with the assumed architectures, the TCS ensures safety integrity level SIL1. This low level is mainly due to the safety integrity of subsystem C (a control valve, V-TCS), evaluated at the level of SIL1 (Fig. 5). Subsystem A (sensors, transducers and converters) contributes less significantly to the result obtained (SIL2), and the contribution to PFDavg of subsystem B (the E/E/PE system) is very low because of SIL3.

The TPS ensures safety integrity level 2 (SIL2), as can be seen in Figure 6. This is mainly due to the SIL of subsystem C (a control valve, V-TPS), evaluated as SIL2. Subsystem A (sensors, transducers and converters) contributed less significantly (SIL2) and the contribution of subsystem B (E/E/PE system) to the PFDavg is low thanks to SIL3.

The resulting safety integrity levels of the protection layers are shown in Figure 7. Due to the significant dynamics of the system for most initiating events and the high dependency of human actions, the human error probability was evaluated as very high (HEP ≈ 1). Therefore, the risk reduction by TCS and TPS is on the level of 10−3 (SIL1 + SIL2).

Thus, the required risk reduction at the level of 10−4 is not met and design improvements have to be considered. The analysis has shown that it is not possible at reasonable cost to achieve a higher SIL for TCS (from SIL1 to SIL2). Therefore another design option was taken into account: to achieve a higher SIL for TPS (from SIL2 to SIL3—see the right block in Fig. 7).

Figure 5. Subsystems of the turbine control system (TCS): A—STC (SIL2), B—E/E/PES (SIL3), C—V-TCS (SIL1), EUC.

Figure 6. Subsystems of the turbine protection system (TPS): A—STC (SIL2), B—E/E/PES (SIL3), C—V-TPS (SIL2), EUC.

Figure 7. Final safety integrity levels in the protection layers of the turbine shut-down system: PL1—TCS (SIL1), PL2—HO (HEP ≈ 1), PL3—TPS (SIL2 → SIL3).

It is proposed to increase the SIL of TPS by increasing the SIL of subsystem A from SIL2 (see Fig. 6) to SIL3 and of subsystem C from SIL2 to SIL3. In subsystem A the 2oo3 architecture is proposed to lower PFDavg and to lower the frequency of TPS spurious operation. For increasing the SIL of subsystem C it is not justified to implement a 1oo2 structure due to very high costs. Therefore, the regulatory aspects (the selection of a high-quality valve) and the testing strategy of the shut-down valve must be adapted to meet the requirements for SIL3 (see the interval criteria in Table 1).

4 CHALLENGES IN SAFETY MANAGEMENT REGARDING HUMAN FACTORS

4.1 Advanced solutions influencing human-operator reliability

As has been mentioned, hazardous industrial plants require an integrated safety management system oriented not only towards technical aspects, but first of all towards human and organizational factors. Below, two aspects are emphasized, related to: (1) the dependability of safety-related systems including human-operator reliability, and (2) the quality and safety aspects of computerized systems used for monitoring, control and protection.

Nowadays more and more designers and safety managers are convinced that knowledge concerning human performance capabilities as well as human and organizational factors should be a basis for the development of advanced technology for monitoring, control and protection of hazardous systems. There are new ideas on how to design an advanced control room concentrating on human factors.

From the safety point of view of hazardous plants it is crucial to support the operators using advanced tools for safety-related decision making oriented towards reducing the probabilities of human errors. The idea of an intelligent operator support system (OSS) is usually of interest, integrated with an on-line diagnostic system and an on-line dynamic risk assessment system. Such an OSS could also be used in situations of partial faults and dangerous failures within E/E/PE safety-related systems or SISs, contributing to reducing the probability of human errors and the risks of abnormal states. The OSS should provide intelligent advice during accidents based on advanced procedures that combine the symptom-oriented procedures and the dynamic risk-oriented procedures.

4.2 Towards integrated knowledge-based approach in managing safety including human factors

Current approaches used for the safety management of programmable systems are not coherent.
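The LOPA calculation of section 3 can be illustrated numerically. The short Python sketch below evaluates formulas (5)–(7) for a three-layer scenario (BPCS/TCS, HO, SIS/TPS); the initiating-event frequency, PFD and HEP values, and the assumed high dependence (HD) of the operator on the control system are illustrative assumptions, not figures from the paper.

```python
# Illustrative sketch of formulas (5)-(7): accident-scenario frequency with
# independent vs. dependent protection layers. All numeric values below are
# assumptions for illustration only, not data from the paper.

# THERP dependency levels and their equivalent beta-H factors
# (Swain & Guttmann 1983)
BETA_H = {"ZD": 0.0, "LD": 0.05, "MD": 0.14, "HD": 0.5, "CD": 1.0}

def conditional_prob(q_b: float, beta: float) -> float:
    """P(B|A) = (1 - beta)*Q_B + beta, which follows from formula (7):
    P(A*B) = (1 - beta)*Q_A*Q_B + beta*Q_A."""
    return (1.0 - beta) * q_b + beta

f_i = 1.0                                # initiating-event frequency [1/year], assumed
pfd_tcs, hep, pfd_tps = 0.1, 0.1, 0.01   # PFDs of PL1, PL2, PL3, assumed

# Formula (5): layers treated as independent
f_independent = f_i * pfd_tcs * hep * pfd_tps

# Formula (6): operator error treated as highly dependent (HD) on the BPCS layer
hep_given_tcs = conditional_prob(hep, BETA_H["HD"])
f_dependent = f_i * pfd_tcs * hep_given_tcs * pfd_tps

d = f_dependent / f_independent          # the coefficient d of formula (6)
print(f"F(indep) = {f_independent:.1e}/yr, F(dep) = {f_dependent:.1e}/yr, d = {d:.2f}")
```

With these assumed values the dependency coefficient comes out as d = 5.5, illustrating the point that d >> 1 is possible; and with HEP ≈ 1, as evaluated for the turbine case above, the operator layer provides no risk reduction at all, regardless of βH.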
Research works have been undertaken to develop advanced methods, data/knowledge bases and tools for knowledge-based, computer-aided functional safety management (Kosmowski 2007) oriented towards:

– Determining the required SIL for safety-related functions using (a) quantitative risk models or (b) risk graphs defined using qualitative information, with relevant knowledge bases and interactive graphical support,
– Verifying the SIL for E/E/PE systems or SIS using (a) quantitative probabilistic models or (b) block diagrams defined using qualitative information, with relevant data/knowledge bases and interactive graphical support for the design and operation phases,
– Knowledge bases for supporting the evaluation of dependent failures and human errors at the level of safety-related functions and systems, and accident scenarios,
– Knowledge bases for supporting the design and management of an integrated safety and security strategy in the life cycle of systems operating in networks.

The quantitative risk model can be a basis for developing an on-line dynamic risk assessment module within the OSS to reduce human error probabilities and the risks of abnormal states.

5 CONCLUSIONS

The paper outlines the knowledge-based methodology of layer of protection analysis (LOPA) for industrial hazardous systems with regard to human and organizational factors (H&OF). The layers of protection include the basic process control system (BPCS), the human-operator (HO) and the safety instrumented system (SIS), which can be dependent due to the influencing factors involved. The human and organizational factors are incorporated into the probabilistic models using a rule-based human reliability analysis taking into account some existing HRA methods.

The functional safety oriented approach offers a framework for more extensive HRA analysis, with emphasis on context-oriented human-operator behavior in situations of danger and safe failures (potentially partly detected and partly undetected) with regard to the safety-related functions of control and protection systems. These should be designed with regard to knowledge concerning human behavior and potential errors. Some additional research is needed to provide more comprehensive insights concerning the contextual influence of human factors to be included in HRA in the context of safety-related functions designed using programmable control and protection systems.

ACKNOWLEDGMENTS

The author wishes to thank the Ministry for Science and Higher Education in Warsaw for supporting the research, and the Central Laboratory for Labour Protection (CIOP) for co-operation in preparing a research programme concerning the safety management of hazardous systems including functional safety aspects.

REFERENCES

Byers, J.C., Gertman, D.I., Hill, S.G., Blackman, H.S., Gentillon, C.D., Hallbert, B.P. & Haney, L.N. 2000. Simplified Plant Risk (SPAR) Human Reliability Analysis (HRA) Methodology: Comparisons with Other HRA Methods. INEEL/CON-00146. International Ergonomics Association and Human Factors & Ergonomics Society Annual Meeting.
Carey, M. 2001. Proposed Framework for Addressing Human Factors in IEC 61508. Prepared for the Health and Safety Executive (HSE). Contract Research Report 373. Warrington: Amey Vectra Ltd.
COA 1998. Critical Operator Actions—Human Reliability Modeling and Data Issues. Nuclear Safety, NEA/CSNI/R(98)1. OECD Nuclear Energy Agency.
Dougherty, E.M. & Fragola, J.R. 1988. Human Reliability Analysis: A Systems Engineering Approach with Nuclear Power Plant Applications. A Wiley-Interscience Publication. New York: John Wiley & Sons Inc.
Dougherty, Ed. 1993. Context and human reliability analysis. Reliability Engineering and System Safety, Vol. 41 (25–47).
Embrey, D.E. 1992. Incorporating Management and Organisational Factors into Probabilistic Safety Assessment. Reliability Engineering and System Safety 38: 199–208.
Gertman, I.D. & Blackman, H.S. 1994. Human Reliability and Safety Analysis Data Handbook. New York: A Wiley-Interscience Publication.
Gertman, D., Blackman, H., Marble, J., Byers, J. & Smith, C. 2005. The SPAR-H Human Reliability Analysis Method. Idaho Falls: Idaho National Laboratory. NUREG/CR-6883, INL/EXT-05-00509.
GSAChP 1993. Guidelines for Safe Automation of Chemical Processes. New York: Center for Chemical Process Safety, American Institute of Chemical Engineers.
Hickling, E.M., King, A.G. & Bell, R. 2006. Human Factors in Electrical, Electronic and Programmable Electronic Safety-Related Systems. Warrington: Vectra Group Ltd.
Hollnagel, E. 2005. Human reliability assessment in context. Nuclear Engineering and Technology, Vol. 37, No. 2 (159–166).
Humphreys, P. 1988. Human Reliability Assessors Guide. Wigshaw Lane: Safety and Reliability Directorate.
IEC 61508:2000. Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems, Parts 1–7. Geneva: International Electrotechnical Commission.
IEC 61511:2003. Functional Safety: Safety Instrumented Systems for the Process Industry Sector, Parts 1–3. Geneva: International Electrotechnical Commission.
IEC 62061:2005. Safety of Machinery—Functional Safety of Safety-Related Electrical, Electronic and Programmable
Electronic Control Systems. Geneva: International Electrotechnical Commission.
Kosmowski, K.T. 2004. Incorporation of human and organizational factors into qualitative and quantitative risk analyses. Proceedings of the International Conference on Probabilistic Safety Assessment and Management (PSAM 7—ESREL '04). Berlin: Springer, Vol. 3: 2048–2053.
Kosmowski, K.T. 2006. Functional Safety Concept for Hazardous System and New Challenges. Journal of Loss Prevention in the Process Industries 19: 298–305.
Kosmowski, K.T., Śliwiński, M. & Barnert, T. 2006. Methodological Aspects of Functional Safety Assessment. Journal of Machines Operation and Maintenance (ZEM, Polish Academy of Science), Vol. 41 (148): 158–176.
Kosmowski, K.T. (ed.) 2007. Functional Safety Management in Critical Systems. Gdansk: Gdansk University of Technology.
LOPA 2001. Layer of Protection Analysis, Simplified Process Risk Assessment. New York: Center for Chemical Process Safety, American Institute of Chemical Engineers.
Rasmussen, J. & Svedung, I. 2000. Proactive Risk Management in a Dynamic Society. Karlstad: Swedish Rescue Services Agency.
Reason, J. 1990. Human Error. Cambridge University Press.
Swain, A.D. & Guttmann, H.E. 1983. Handbook of Human Reliability Analysis with Emphasis on Nuclear Power Plant Application. NUREG/CR-1278.
How employees' use of information technology systems shape reliable operations of large scale technological systems

Thale K. Andersen, Pål Næsje, Hans Torvatn & Kari Skarholt
SINTEF Technology and Society, Trondheim, Norway

ABSTRACT: Modern large scale technological systems—like power grids—rely considerably on information systems. Such organisations employ not one but a range of different information systems, which creates three important, interdependent challenges for safe and reliable operations. The first is the sheer volume of systems, which ties up organisational members in bureaucratic work, removing them from operational tasks and thus introducing additional stress. The second challenge is that employees must be willing to make explicit their previously tacit knowledge, rules of thumb and know-how—not written in formal job instructions—and enter this information into the systems, thereby risking the loss of personal assets relating to career and workplace identity. The third problem relates to data quality: without valid and reliable data the systems will not have any real value. The systems rely on the quality of key information entered by organisational members.
1 INTRODUCTION

Since the early 1990s Norwegian electrical power grid companies have focused on developing, implementing and upgrading their Network Information Systems (NIS)—systems containing information on the condition and functioning of the electrical network and its components, as well as geographic position. There is an increased focus on economic performance and more rigorous demands from governmental regulating institutions in terms of reporting. The decision to introduce digital NIS was typically based on an expected profitability from cost-benefit analysis. Earlier systems were costly in terms of man hours, with poor data quality, and did not support geographical information. Furthermore, they would in most cases offer no functionality beyond documentation purposes, thus not supporting new requirements for decision-making. The cost-benefit analysis would in most cases identify several areas, not only reporting, where substantial savings could be obtained. More than a decade later the results prove to be modest, especially regarding economic gains. The NIS remain underutilized, data quality is poor, new work practices are not fully integrated, and old systems and work practices run in parallel with new ones.

This story is not unique to the Norwegian context. While the exact details in the development and use of information and communication technology (ICT) will vary between countries and sectors depending on a range of characteristics and conditions, this is the story of work life in the nineties: how large scale technological organizations introduce large scale technological information systems in order to improve organizational efficiency and profit. The outcomes of these processes are not specific to a Norwegian context, either; international research on information systems reveals many similar examples. As argued: "The review highlights the relative lack of overall principles or frameworks for improving information management per se and the overall information management infrastructure including people, practices, processes, systems and the information itself" (Hicks 2007:235).

1.1 Three challenges for safe and reliable operations

The introduction of ICT systems creates three important, interdependent challenges tied to work practices and processes for safe and reliable grid operations. The first is the understanding of the interaction between technology and organisation, identifying actual as opposed to formal work practice in the adaptation and implementation of ICT. Second, employees must be willing to change their occupational identity and share their tacit knowledge, rules of thumb and know-how and enter this information into the systems, which might have consequences for their careers. The third is the shift in balance between operational and administrative tasks for the individual. This shift forces the individual to work more with computers. ICT systems rely on input from people codifying and storing information. Thus the systems tie up employees in bureaucratic work, removing them from operational
tasks and adding strain. Research has shown negative effects on health and safety from "all this paperwork" in the offshore sector (Lamvik & Ravn, 2004). These challenges all relate to data quality. Without highly valid and reliable data the ICT systems are not functional. And data validity and reliability depend on the precision and accuracy of the employees entering key information. Earlier research indicates that employees are not necessarily trained, motivated or equipped to handle these tasks (Næsje et al. 2005).

1.2 Outline of the paper

This paper will mainly look at the installers' role in power grid safety, and starts with a study of grid operating companies' current situation based on a variety of empirical data collected over a two-year period in two different companies located in two major cities in Norway.

Theoretically, our research-in-progress is inspired by the works of Barley and Kunda, Orr, Zuboff and Orlikowski on work practice (Barley, 1996; Orr, 1996; Barley and Kunda, 2001; Orlikowski, 2002; Zuboff, 1988). The underlying idea is to map and describe how work is carried out in order to get a better understanding of how to use ICT in grid maintenance organizations in order to ensure safe grid operations. As argued by Barley and Kunda (2001), a shift is needed in organizational thinking where a focus on the processual aspects of ICT-based work is more central as the nature of work is changing. Tasks and roles must be redistributed, the transfer of knowledge and technology needs to be rethought, and routines and workplace values should be renegotiated.

2 METHOD AND DATA

This is a qualitative study, funded by the Norwegian Research Council and the two grid companies involved. Our project group has been working closely with two large power grid companies since 2006. Over the two years we have conducted 52 semi-structured interviews (mostly face to face, but also some via telephone) with employees in all positions in the companies, from top management to installers. We have carried out four field visits/observations, each of them lasting several days. We have had access to a wide range of information and documentation describing rules and procedures. During the project period we have also reported and discussed preliminary findings with the two grid companies on several occasions. Feedback and discussions of findings have involved personnel in all positions, but focused on management.

All interviews have been taped and transcribed. Field notes have been written, and photos taken. The qualitative data has been stored and analyzed in Nudist 6. The analysis has followed explorative analytic methods. First, single interesting observations have been tracked and identified in the program. Second, these observations have been interpreted with the goal of identifying generic issues in the material (i.e. knowledge sharing, translation, competence). Third, the generic issues have been paralleled with relevant background variables in the data, such as age, ICT-proficiency, function, company and so on.

3 FINDINGS

The two companies we have studied have both chosen to invest heavily in the same large scale Net Information System (NIS). They have the same goal for their investment: more cost-efficient and reliable power grid operations. In this chapter we will describe the two companies, the work practice of their planners and installers, as well as their ICT reality and implementation strategy.

3.1 Description of companies and company groups

The major task of a power grid company is to develop and maintain its part of the national power grid, so electrical power can be delivered to all its customers. Since electrical power is vital to a modern society, the sector is carefully and strictly regulated, and any failure to deliver will result in sanctions from the national authorities. Power grid companies today usually employ an organisational model called "planner" and "doer". The planners analyse and plan tasks, deciding when, where and how grid maintenance should take place. They sketch development projects, bill customers, report to authorities and so on. The planners are usually educated as BSc or MSc in electricity.

The doers—hereafter called installers—carry out the manual work designed by the planners, and report both how they carried out their work and the state of the grid after they have performed their tasks. Both these reports (work conducted and state of grid) are fed into NIS for the planners to use in their further work. Unlike the planners, most of the installers do not have a college or university degree; rather they have vocational education, with various certifications which allow them to carry out a range of physical labour grid operations.

This has to do with what we will call the dilemma of data use: the total dependence of the planners on the data reported by the installers, and the installers' lack of perceived usefulness of the ICT, which does not motivate them to use and report to the systems.
The two companies are similar in several respects:

• Both are among the largest grid companies in Norway
• Both operate complex grids in big cities (by Norwegian standards) and surrounding areas with a high number of customers
• They both employ the same NIS
• Both have a history of growth by acquisition over the last decade
• Both started implementing NIS in the mid-90s, and carried out huge data capture projects to get NIS data initially
• Both acknowledge that gains from NIS have been smaller and slower in coming than initially calculated, but management maintains that NIS is crucial for their operations

The two companies differ in one important dimension. One of the companies, Company X, has chosen to outsource its installation work, and hence does not itself employ any installers. Rather, it hires all its installers, project by project. Company Y, on the other hand, has kept its installers, and is in this respect an integrated company with planners and installers in the same organisational unit. While there are important differences between outsourced and integrated companies, our experience tells us that there are also many similarities. In this paper we focus on the work practice and role of the installer, and from what we have seen in our interviews and fieldwork the similarities among installers clearly outweigh the dissimilarities when it comes to the handling and use of NIS. However, it should be noted from an occupational health point of view that the outsourced installers complained about deteriorated working conditions in a number of areas after being outsourced. A survey investigating this issue further is in progress.

3.2 Integration of ICT in the grid companies

Modern power grid operations are completely dependent on complex and secure ICT solutions as tools for safe and reliable functioning, and for becoming more cost-efficient. The quality of information is vital in order to ensure safe operations, involving trustworthiness, relevance, clarity, opportunities and availability. The quality of the information flow is equally important, including the attitude to learning, cooperative climate, process and competencies. Data quality thus has prime priority and is a condition for ICT-based cooperation—the data systems in use must be correct and continually updated. ICT makes this multi-actor reality possible.

NIS is installed and in use today among planners and management in the companies, but a key challenge is the installers' lack of competencies in computers and data systems. Another important challenge is the installers' attitude to the use of ICT-based tools as an inherent part of their routines and work practices—showing the identity issues lying between working physically on the grid versus increasingly working with a system. The question is which organisational elements need to be considered in order to facilitate this change in work practice, intensifying the interplay between technology and organisation.

3.2.1 The planners' use of ICT

The planners and the management of the two enterprises use NIS in their daily work. An area where the use of NIS has been successful is authority reporting. NIS has simplified and automated several of the steps in yearly reports. Such reporting was earlier a manual, cumbersome and time-consuming process (Næsje et al. 2005).

NIS is also used as a tool for early stage projecting when a new project is started, usually before it is decided when or by whom the project will be carried out. When determined, more detailed planning is done. However, for this more detailed planning work NIS is considered to be awkward and not up to date. The planners state that: "We use NIS as a first step for simple and overall calculations. We don't use NIS for detailed project planning. To do so is inconvenient; NIS lacks basic features, and data is unreliable. Once you get under the ground everything looks different anyway."

Thus, the project planners want to get out in the field, to have a look at it and discuss it with the doers. While the goal of top management is that "planners should be able to do all the planning without leaving their office desks", the planners themselves believe in visiting the project site, discussing with the installers, and relying on the installers' local knowledge as far as possible. They simply do not trust the data in NIS to be good enough to carry out project work without calibrating with the physical reality.

Although NIS is in use among the planners, they are aware of the system's shortcomings regarding data quality, and in order to carry out projects smoothly and establish reliable grids they prefer hands-on local knowledge instead of decontextualised information in an ICT system.

3.2.2 The installers' use of ICT

Most installers do not use NIS other than in the most limited way possible. Some of them manage to print out topographic instructions entered by planners, while most get a printed copy from their project leader, which is used as a reference out in the field. Where the NIS print-outs do not concur, they deal with the unforeseen problem physically—digging holes in the ground to check the situation themselves, and calling their employers, contractors and subcontractors.
As regards the use of computers as the installers' main tool for ICT-based operations: while the companies have provided most installers with a personal computer, during our visits we only rarely saw one of them opened. Most of the installers have 20 years of experience or more, and do not see the computer as a primary work tool. Another drawback is that most of the installers have very little knowledge of computers, let alone data systems. And as the data systems are not sufficiently updated, they often encounter unforeseen challenges in the field. As said by two of the installers, respectively 39 and 36 years old: "We didn't become installers because we liked going to school or because we were into computers", and "Now I just got a live-in partner, and she has started to show me a thing or two about computers and the internet". The installers have a work environment characterised by strong social relations and a high degree of professional pride. They express regret at the decreasing focus on professional development and training, and the unpredictable working situation makes it necessary to have several ongoing projects at the same time.

4 DISCUSSION

The discussion will focus on the important organisational processes that need to be made explicit in order to make installers use NIS more actively. First we will look at how ICT shapes safe and reliable operations today, and then at how to understand the interaction between ICT systems like NIS and the installers. It will further be outlined how ICT may represent a threat to the occupational identity and career development of installers—meaning that work identity can constitute a considerable barrier to use, and thus to more reliable operations. Finally, as the bureaucratisation of work imposes an important additional challenge, it is important to consider its implications for the use of NIS among the installers. A summary of barriers to good implementation of ICT systems, as well as some guidelines on how to integrate NIS in the installers' working reality, will conclude the discussion.

4.1 Man-machine interaction and the social construction of technology

The multitude and variance of the challenges tied to the interaction between ICT-based grid operations and installers outlined in the previous section show that the relationship between human factors and the use of technology is complex. Installers are dependent on high quality data from updated databases and reliable information in order to reach a high degree of accuracy in their operations. In order to handle the ICT-based systems, the installers need to be skilled not only in ordinary computer technology, but also in the actual software. They need the ability to enter and transmit new information, as well as the ability to find, extract, interpret, mediate and use data.

Understanding the organisation as not only defined through its operations and products but as an interpretative entity is to see that organisational processes, including the implementation of ICT, take place through communication; we create and share meaning through dialogue—dialogues mediate learning and are a collective tool. It is thus important to establish a culture for learning that goes beyond formal courses. Not least because "Formal descriptions of work and of learning are abstracted from actual practice. They inevitably and intentionally omit the details. In a society that attaches particular value to abstract knowledge, the details of practice have come to be seen as nonessential, unimportant and easily developed once the relevant abstractions have been grasped" (Brown & Duguid, 1991, p. 40).

Conventional learning theory—the point of departure for most studies—emphasises abstract over practical knowledge, and thus separates learning from performance, believing that learning constitutes the bridge between working and innovating. Thus emerges an understanding of complex work processes as decomposable into smaller and less complex parts, where it is not necessarily a precondition that the worker understands what he or she is doing (Brown & Duguid, 1991). This makes it seem as if installers perform their tasks consistent with a formal job description, which is also used as a decision-making tool for diagnosing and problem-solving. But when something goes wrong the installers will seldom look at the formal job description; they use their long experience and will eventually develop a non-sequential practice in order to compensate for the formal job description's lack of coverage. However, the installers will not use NIS properly if they do not understand its purpose, e.g. if it does not correspond to their working reality.

Levin (1997) forwards the view that both how the technology works and how to work it in a particular setting is something that needs to be learned, understood as knowledge that is socially constructed in the particular company. In this social process all types of knowledge interact, formal knowledge also, but especially various forms of tacit and operational knowledge, not least cultural skills. An organisation is always contingent on its organisational culture and procedures: ". . . routines are the factual organisation . . . [. . .] . . . The core essence of an organisational development process will thus be the facilitation of learning to shape a new understanding and the skills necessary to change the organisation." (Levin, 1997, p. 301). Thus, work practices or work routines must be transformed and adapted to the new working reality, through negotiation between all employees affected by the change. This might be especially useful in implementing the

262
understanding of how the system works, as well as why it is introduced, for the use of the new ICT—in this case NIS.

4.2 Occupational identity and tacit knowledge as threats to the integration of NIS

When using a computer, installers do not get the instant satisfaction of seeing the outcome, and thus ICT might not be perceived as particularly meaningful regarding the outspoken demands—getting the job done. On the contrary: instead of being given the possibility of extended interaction and integration with the administrative personnel, as all employees are now supposed to operate on and from the same platform (NIS), installers feel increasingly alienated not only from their original work tasks but also from the organisation itself, as the new company policy implies a greater physical separation of the technical installers from the planners and management. New ICT systems are thus implemented physically, but not integrated in a meaningful way in the installers' working situation, as they correspond on a very limited basis to their working reality.

By offering their installers two sessions of 45 minutes in which to learn the new systems—which first of all must be said to be too little—the management implicitly assumes 1) that their installers are familiar with working with computers and ICT systems in the first place, and 2) that their installers automatically understand the usefulness and impact of using the new systems, or worse, that they need not know what they are doing at work. This means that the companies take for granted employees' ability and willingness to learn and use the new technology, with minimal effort from the side of the organisation. Rather, the organisation must move from a focus on the performance of physical work to cognitively operated tasks, and these different ways of work need to be learned in different ways—they are by no means directly transferable to each other (Zuboff, 1988).

Change in work practice implies a change in routines. For true ICT-based change to happen, installers need to change their work practice, which is something that will not easily occur unless they get a good reason to do so, other than extrinsic rewards or sanctions (Nelson & Winter, 1982). Routines—by some also called norms or institutionalised behaviour, processes and structures—become important in defining and expressing employees' occupational identities. If installers are to be willing to change their routines, they will need a meaningful replacement of the occupational identity built by these routines. Installers may be less receptive to using new ICT as they have been doing perfectly fine up to now without it. When their manual tasks are automated, they might have difficulties in perceiving the change as valuable. From working in a reality where productivity outcomes are concrete and immediate, the installers are expected to operate more through computers, at the same time as they are told to lessen the physical contact with the other main part of the organisation, namely the part that might have helped them create references with regard to the new technology.

Occupational identity may thus be a potential source of resistance against ICT-based change, as installers experience a devaluation of their professionalism, and thus of their occupational identity, without a proper replacement. Experiencing ambiguity related to the use of ICT implies a feeling of insecurity vis-à-vis the new technology, as the informating side has not been implemented together with the actual computers and software. Installers are told to let go of their previous work identity—valuable in that it represents the profession of being a technical installer—but without an equivalent valuable identity offered as a replacement, it will not be perceived as interesting to do so.

An individual asset that may hamper the proper implementation and integration of NIS is the installers' tacit knowledge tied to work tasks and organisational functioning. In order to develop and use NIS, individuals and teams must be willing to share their knowledge. In times of organisational change based on ICT, the changes often imply such a great redefinition of boundaries tied to expertise and responsibilities that employees may feel that their—often hard-built—position in the formal or informal hierarchy is threatened. A way to obtain protection is not to share one's specific knowledge, and perhaps even to consciously produce counter-activities towards the systems' proper functioning (Schön, 1983; Argyris & Schön, 1978). This behaviour may be perceived by the individual installer as the protection of his career. Knowledge and authority are redistributed and form a new hierarchy through the implementation of ICT; those above the technology, and those understanding and thus handling it best, are those with the highest status. According to Zuboff (1988, p. 391), those ''. . . who must prove and defend their own legitimacy do not easily share knowledge or engage in inquiry. Workers who feel the requirements of subordination are not enthusiastic learners. New roles cannot emerge without the structure to support them.''

4.3 The bureaucratisation of grid operations

The strict meaning of the word ''bureaucracy'' is civil service, and it is traditionally associated with paper-heavy office work. The bureaucracy exerts a specific type of authority and power, with a great number of rules, regulations and routines (Weber, 1974). The rationality behind the bureaucracy is standardisation, legitimacy and easy error detection. In the power grid industry, the new bureaucracy is, to an increasing extent, the sum of the ICT-based systems, which need to
be handled in the correct manner by employees so they can perform their work tasks as stated in the company policy. ICT-induced bureaucratisation involves, as the findings in this study show, a strong focus on data. High-quality data further imply a vigorous emphasis on collecting the correct information; on the correct encoding and storage of information—that the collected data are placed in the correct systems and databases, and in the correct categories; and on the correct use of stored information, entailing correct interpretation of these data. This illustrates that the quality and reliability of data depend to a high degree on the installers handling them, and thus on their subjective judgment, abilities, willingness and motivation.

For teams, which often find themselves faced with new tasks due to automation and the elimination of routine jobs, it implies a simplification of communication and infrastructural support. But communication is more than the transmitting of symbols from one person to another. It also includes absorption and understanding, as well as practical use of the transmitted information—the significance can be difficult to keep through coding and decoding. Information technology makes it possible to operate on quantitative information that earlier was too complex to handle because of the calculation load and too many causal relationships. The limiting factors are the costs of the information and our ability to understand and use it.

The challenges imposed by the inherent characteristics of ICT are above all again connected to the interaction between the installers and the technology, showing that the bureaucratic effect it imposes separates the employees from each other, contributing to the disruption of social relations in the workplace, and has negative effects on the health and safety of employees, who struggle with the extra work this bureaucratisation represents. Trist (1981) refers to what he calls the 'primary work system', which is the concrete work group an employee belongs to and the framework for the performance of his or her daily work tasks. According to Zuboff (1988), technological change creates limiting conditions for what is possible, as technology is not neutral but specific. This gives a new angle vis-à-vis Groth's (1999) theoretical standpoint that humans are the main restricting agents for the possibilities for technological and organisational development. Both might be right. But technology does not exist in a vacuum. In order to be useful it needs to interact with both human and organisational systems. Bureaucratisation is an abstraction of work that must be explicitly considered.

4.4 Barriers to use and some guidelines for the successful integration of ICT systems

We have seen that the challenges to the use of NIS in our two grid companies are numerous, heterogeneous and complex. Organisational challenges are a general unpreparedness when it comes to abilities, motivation and perceived usefulness of the systems on the part of the installers; the lack of a learning culture, including awareness of the systems' context-specific conditions for use (local informal knowledge and actual work practice); a failure to recognise the implications of ICT-based organisational change for installers' occupational identity and career development; as well as little insight into the consequences of the bureaucratisation of work.

In addition to these more universal challenges, empirically based on a range of different studies, we also identified, in accordance with other empirical findings (Venkatesh et al. 2003), a set of specific factors especially relevant for the grid companies in question. The first is work experience: among installers having worked 20 years or more (and most have) in the same field of work, with approximately the same methods as when they started, the motivation is not very high when it comes to integrating manual and digital operations. This finding is of course tightly bound to the next: the lack of competence, and thus of self-confidence, among the installers when it comes to the handling of computers. The transition from action-oriented to cognitive-based skills may be seen as unattainable. A third barrier is earlier and existing work practice. Existing work practice is to a large degree informal and characterised by a number of heuristic and ad hoc methods. And finally, the physical equipment must be easy and convenient to use. In order to be able to use NIS and other systems while working out in the field, the installers need optimal network availability. For example, losing the network connection when driving through a tunnel, and having to restart the computer in order to get back on track again, is not optimal. Likewise, it must be easy and convenient to install the personal computer in the car.

In order to reduce the gap between formal and informal theory, and thus between NIS and the installers, the organisation's management needs to engage in an inquiry to make explicit the underlying assumptions in the installers' theory-in-use. Schön (1983) launches ''organisational inquiry'' as a tool for organisational change and development. An important element in this organisational inquiry is to raise consciousness of practice, values and underlying assumptions tied to actual performance, and then compare them to the written and formal procedures. Inconsistencies need to be identified and corrected in order to secure safe operations. Moreover, routines must be established for information sharing, relating both to tacit knowledge concerning work tasks and to competencies when it comes to computer handling. These routines must be perceived as necessary and beneficial by the installers in order to work properly, which is of crucial
importance for data quality and reliability. In this respect it is important to be aware of possible obstacles to organisational learning.

4.5 How ICT shape safe and reliable operations

As of today, ICT systems like NIS seem to have a smaller impact on safe and reliable operations than we might have expected after almost two decades of use. While planners use NIS for reporting and doing some project work, they also supplement information from NIS with ''on-the-spot'' observations and discussions with those who actually implement their planning. Installers to a large degree avoid using NIS. They have neither the training nor the motivation, and do not need to use NIS. For them NIS is not a support tool but rather a control mechanism which they are supposed to report to in order to document their work progress. However, installers are also given their work orders and maps from NIS, and hence to some degree use NIS information as the point of departure for their work. The usefulness of NIS on an abstract level is not self-evident to them, and reporting is something they try to do as little of as possible. In the field they prefer to rely on their own and their colleagues' local knowledge. Thus, installers and planners avoid relying entirely on a system without high-quality data. This is a two-sided dilemma, since high-quality data depend on high-quality reporting and use of the system.

Still, NIS is being used in decision making. It is used in buying services (in the tender processes) and in fixing the priorities of maintenance and development; it is used to prepare work orders for the installers; and it is used for reporting purposes. And use of, and reliance on, NIS is increasing as more and more data are entered and more and more functions are built into the system. While detailed project planning is difficult today, maybe the next version, or the version after, will be better. Top management is still likely to pursue the goal of the planner who does not leave his desk.

While both installers and planners today to a varying extent ignore and distrust NIS, in the future more and more decisions, work and operations will depend on the ICT systems. In this perspective it is necessary to get installers to use and report to NIS.

5 CONCLUSION

Grid operating companies after market deregulation embraced ICT systems in order to become increasingly effective and fast-responding.

The companies have focused on information systems in general, and NIS in particular, for more than a decade. The maximum efficiency and cost-saving aspects of NIS were seen as attainable through an implementation of the system at all organisational levels—from management, in order to analyse the reliability and efficiency of grid operations, to installers, using data to locate exactly where they should operate geographically out in the field and reporting back standard work data, as well as inconsistencies between the information systems and the actual field, in order to continuously update the systems. ICT is thus significant both as a grid-controlling and as an operating device.

As the two companies see it, they also need to become efficient ICT-based organisations so as to be able to respond to the increased complexity of their working environment and organisational structures, in order to continue to carry out reliable grid operations. But because of too little focus on strategy and on the necessary conditions for continuous use and functioning, the implementation process has more or less stagnated. Results show that while planners are coping fairly well, based on previous experience with computers and constant interaction with management and support functions, the installers have only been given between one and two hours of training in using their new equipment, which has basically been the only interaction they have had with other organisational layers regarding the implementation and use of the new technology.

Recent research on organisational change puts emphasis on the importance of well-conducted change processes; important elements are taking existing social norms into account, being aware of workforce diversity, making conflicts constructive, a clear distribution of roles and responsibilities, and the management's availability (Saksvik et al. 2007). Orlikowski (1996) views organisational change as enacted through the situated practices of organisational actors as they improvise, innovate and adjust their work routines. As ICT engenders automating as well as informating capacities (Zuboff, 1988), the implementation of NIS requires a process where dialogue and negotiations are key. Technology transfer, as the learning of new ICT, must be recognised as a collective achievement and not as top-down decision-making, or grid reliability will remain random and difficult to control.

The studied companies learned the necessity of rethinking several aspects of information handling in the process of becoming ICT-based somewhat the hard way. The systems are implemented but remain disintegrated, and thus data quality is poor. Some problems—intrinsically tied to the lack of focus on systems specifications before the start of the implementation processes—were the non-compatibility of the old systems with the new ones, as well as poor routines for data quality assurance and data maintenance. The lack of synchronisation between the development of the software and the functionalities actually needed in order to use it as an operating device is also a
major reason for this underutilisation, and the use of the system is not equally rooted in all organisational layers and subgroups.

Among management and project leaders the necessity of the systems seems clear, while the installers have not had the relationship between the systems and their specific tasks sufficiently explained to them, and thus do not see the relevance of adjusting their work practices to system requirements. As an earlier study showed: ''Data will not come before use, and use will not come before there is data. Thus, data collection must be supported or pushed by stakeholders, especially management and different operational units.'' (Næsje et al. 2005, p. 5). The analyses show that it is crucial to focus more strongly on aspects tied to the collection and the management of data, as this was to a large degree neglected in the selection of software. The grid operating company must on the one hand decide to what extent the individual installer shall be given access to the systems and, if so, whether with read-only or editing rights. On the other hand, the company needs to make sure that the intended use of these systems gets incorporated into the routines and work practices of its installers: ''The lack of closeness between information and operational tasks is a major barrier, hindering efficient completion.'' (Næsje et al. 2005, p. 5). The shift from local knowledge on how to solve tasks—present in the heads of the installers—to centralised knowledge leaning on integrated data in a common database is an important challenge for safe grid operations.

REFERENCES

Alvesson, M. (1993). Organizations as rhetoric: Knowledge-intensive firms and the struggle with ambiguity. Journal of Management Studies, 30(6), 997–1015.
Argyris, C. & Schön, D. (1978). Organizational learning: A theory of action perspective. USA: Addison-Wesley.
Barley, S.R. (1996). Technicians in the workplace: Ethnographic evidence for bringing work into organization studies. Administrative Science Quarterly, 41(3), 404.
Barley, S.R. & Kunda, G. (2001). Bringing work back in. Organization Science, 12(1), 76.
Brown, J.S. & Duguid, P. (1991). Organizational learning and communities of practice: Toward a unified view of working, learning, and innovation. Organization Science, 2(1), 40–57.
Groth, L. (1999). Future organizational design. Chichester, UK: John Wiley and Sons.
Hicks, B.J. (2007). Lean information management: Understanding and eliminating waste. International Journal of Information Management, 27(4), 233–249.
Lamvik, G. & Ravn, J.E. (2004). Living safety in drilling: How does national culture influence HES and working practice? (No. STF38 A04020). Trondheim: SINTEF Industrial Management, New Praxis.
Levin, M. (1997). Technology transfer is organizational development—An investigation into the relationship between technology transfer and organizational change. International Journal of Technology Management, 14(2/3/4), 297–308.
Nelson, R.R. & Winter, S.G. (1982). An evolutionary theory of economic change. Cambridge: Belknap Press of Harvard University Press.
Næsje, P.C., Torvatn, H.Y., et al. (2005). Strategic challenges in implementing NIS: Investigations on data quality management. IPEC, 7th International Power Engineering Conference, Singapore, IEEE Conference Proceedings.
Orlikowski, W.J. (2002). Knowing in practice: Enacting a collective capability in distributed organizing. Organization Science, 13(3), 249.
Orr, J.E. (1996). Talking about machines: An ethnography of a modern job. Ithaca, N.Y.: Cornell University Press.
Saksvik, P.Ø., Tvedt, S.D., Nytrø, K., Buvik, M.P., Andersen, G.R., Andersen, T.K. & Torvatn, H. (2007). Developing criteria for healthy organizational change. Work & Stress, 21, 243–263.
Schön, D. (1983). Organizational learning. In G. Morgan (Ed.), Beyond method: Strategies for social research (pp. 114–128). Beverly Hills, CA: Sage.
Trist, E.L. (1981). Evolution of sociotechnical systems. In A.H. van de Ven & W.F. Joyce (Eds.), Perspectives on organization design and behaviour (pp. 19–75). John Wiley & Sons.
Venkatesh, V., Morris, M.G., Davis, G.B. & Davis, F.D. (2003). User acceptance of information technology: Toward a unified view. MIS Quarterly, 27(3), 425–478.
Weber, M. (1974). Makt og byråkrati [Power and bureaucracy]. Oslo: Gyldendal.
Zuboff, S. (1988). In the age of the smart machine. New York: Basic Books.

Incorporating simulator evidence into HRA: Insights from the data analysis
of the international HRA empirical study

S. Massaiu, P.Ø. Braarud & M. Hildebrandt


OECD Halden Reactor Project, Norway

ABSTRACT: The observation of nuclear power plant operating crews in simulated emergency scenarios reveals a substantial degree of variability in the timing and execution of critical safety tasks, despite the extensive use of emergency operating procedures (EOPs). Detailed analysis of crew performance shows that crew factors (e.g. leadership style, experience as a team, crew dynamics) are important determinants of this variability. Unfortunately, these factors are problematic in most Human Reliability Analysis (HRA) approaches, since most methods do not provide guidance on how to take them into account or how to treat them in predictive analyses. In other words, factors clearly linked to the potential for errors and failures, and information about these factors that can be obtained from simulator studies, may be neglected by the HRA community. This paper illustrates several insights learnt from the pilot phase of the International HRA Empirical Study on the analysis, aggregation and formatting of the simulator results. Suggestions for exploiting the full potential of simulator evidence in HRA are made.

1 INTRODUCTION

This article is based on insights from the empirical data analysis of the first phase of the ''International HRA Empirical Study'' (Lois et al. 2008). The project is an international effort coordinated by the OECD Halden Reactor Project, aimed at assessing the strengths and limitations of existing HRA (Human Reliability Analysis) methods. The study is based on a comparison of observed performance in HAMMLAB (HAlden Man-Machine LABoratory) simulator trials with the outcomes predicted in HRA analyses. The project goal is to develop an empirically based understanding of the performance, strengths, and weaknesses of a diversity of methods currently available. This article focuses on empirical aspects of the project methodology.

2 THE INTERNATIONAL HRA EMPIRICAL STUDY

The Pilot study (2007) was designed around four sets of participants: 1) the operator crews, 2) the Halden experimental staff, 3) the HRA teams, and 4) the study assessment group.

In the period from October to December 2006, 14 crews of three licensed PWR operators per crew participated in an experiment performed in HAMMLAB (Braarud et al. 2007). All crews responded to two versions (a base case and a more challenging ''complex'' case) of two scenarios, a steam generator tube rupture (SGTR) and a total loss of feed water (LOFW). This article relates to the results of the analysis and comparisons of the so-called pilot phase of the International HRA Empirical Study, which is based on only the two variants of the SGTR scenario. Further, it is limited to only the first and most important phase of this scenario type, consisting of identification and isolation of the ruptured steam generator. These two scenarios are two versions of a SGTR: one base case, where there are no other complications than the tube rupture, and one complex case, where the radiation indications to diagnose the tube rupture are lacking. The HAMMLAB pressurized water reactor (PWR) emergency operating procedures (EOPs) used in the study were based on the Emergency Response Guidelines (ERGs) developed by the Westinghouse Owners Group.

The Halden experimental staff conducted the simulator sessions in the HAMMLAB facility and had the main responsibility for the analysis of the experimental data. Regulators, utilities and research institutions provided the HRA teams. The teams applied one or more HRA methods to obtain predictions for the Human Failure Events (HFEs) in the scenarios defined for the study. Table 1 summarizes the methods tested and the teams applying them.

The Study assessment group acted as the interface between the HRA teams and the experimental staff. As a starting point, and together with Halden staff, it provided an information package (analysis
Table 1. Methods tested and HRA teams.

Method                      Team                          Country
ATHEANA                     NRC staff + consultants       USA
CBDT                        EPRI (Scientech)              USA
CESA                        PSI                           Switzerland
CREAM                       NRI                           Czech Republic
Decision Trees + ASEP       NRI                           Czech Republic
HEART                       Ringhals + consultants        Sweden/Norway
KHRA                        KAERI                         South Korea
MERMOS                      EDF                           France
PANAME                      IRSN                          France
SPAR-H                      NRC staff + consultants, INL  USA
THERP                       NRC staff + consultants       USA
THERP (Bayesian enhanced)   VTT/TVO/Vattenfall            Finland/Sweden
– Simulations –
IDAC                        University of Maryland        USA
MICROSAINT                  Alion                         USA
QUEST-HP                    Risø                          Denmark
                            Politecnico di Milano         Italy

inputs) to the HRA teams and answered requests from the HRA teams for additional information and questions concerning ambiguities in the instructions and assumptions. The information package included, among other things, instructions to the HRA teams, scenario descriptions and HFEs, a characterization of the crews (e.g. their work practices and training), the procedures used in HAMMLAB, and forms for the responses of the HRA teams. Finally, the study assessment group reviewed the HRA team responses and performed the assessment and comparison of the predicted outcomes vs. the experimental outcomes, together with input from experimental staff from Halden.

2.1 Focus on the qualitative outcomes: driving factors and PSFs

At a high level, HRA methods have the same purpose (or aims) due to the role of the HRA within the PRA. These common aims are: 1) the identification of the HFEs to be included in the PRA accident sequence model, 2) the qualitative analysis of the HFEs, and 3) the quantification of the probability of these HFEs. The study focused mainly on the qualitative analysis and to a lesser degree on the quantitative analysis.

The HFEs were defined for the HRA analysts. Definition of the HFEs was needed in order to control for variability in the interpretation of the various teams as to what are the actions to be analysed. It should be noted that defining the HFEs for the HRA teams did not eliminate the qualitative analyses to be performed, since the HFEs were defined on a functional level, i.e. fails to perform X within Y minutes.

One of the most important aspects of HRA methods is the identification and evaluation of the factors ''driving'' performance of the HFEs, commonly referred to as performance shaping factors (PSFs) or the ''driving factors'' of performance. Comparing the specific factors identified as driving factors by the HRA teams for the defined HFEs with those observed in HAMMLAB was the main focus of the comparison.

The HRA teams typically provided their analyses and the outcomes they predicted in several ways. They filled out one form (''Form A'') that emphasized identifying the main drivers of performance in terms of PSFs, causal factors, and other influence characterizations explicitly identified through the HRA method they were using. In addition, in Form A, they were also asked to describe how and why a scenario might be difficult or easy for the crews in terms of the specifics of the scenarios, i.e. in more operational terms. They were also asked to provide an HEP (human error probability) for each HFE based on the application of their method. Additionally, the teams were asked to provide a Form B, a ''closed form'' where the responses had to be structured according to a modified version of the HERA taxonomy (Halbert et al. 2006). Finally, the HRA teams were asked to provide their method-specific analysis documented according to PRA (good) practices, which included the derivation of the HEPs.

2.2 Comparison of predicted and experimental outcomes

The core of the HRA empirical study is the evaluation of HRA methods by means of comparing their predictions with observed performance. Comparisons were performed on several levels:

– On the factors that most influenced the performance of the crews in the scenarios (''driving factors'').
– On the level of difficulty associated with the operator actions of interest (the HFEs). For the HRA predictions, the level of difficulty was mainly represented by the HEP.
– On the reasons for the difficulties (or ease) with which the crews performed the tasks associated with each HFE, and how these difficulties were expressed in operational and scenario-specific terms (''operational expressions'').

In addition, several other factors were evaluated, like the insights given by the HRA method for error reduction, sensitivity issues such as the impact of
qualitative choices on the HEP, and issues of guidance and traceability.

2.3 Experiment analysis methodology

As the qualitative HRA method predictions typically refer to the operational requirements and to the driving factors that an idealised crew would face in the scenarios of the study, the main challenge of the analysis was how to aggregate the performance of 14 different crews into one average or exemplary operational description, as well as into a general assessment of the driving factors. In other terms, while there were 14 observed operational stories for each scenario, together with as many constellations of factors affecting the crews, the HRA predictions bore (mainly) on one set of factors only and on one or a few operational options (see Figure 1).

Figure 1. Data integration and analysis. (The figure shows the raw observations (audio/video recordings, operational stories, interviews, simulator logs, OPAS data, on-line performance ratings, on-line comments, crew self-ratings, and observer ratings) interpreted, in light of the other types of raw data, into 14 crew-level performances of observed performances, influences and difficulties, and aggregated into 2 scenario-level summaries: 2 operational expressions and 2 sets of driving factors with ratings.)

2.4 Quantitative screening and crew selection

The starting point for the experimental analysis was to look at the quantitative data, namely performance figures generated from simulator logs (e.g. performance times), OPAS data (Skraaning 1998), expert and observer performance ratings, and crew PSF ratings. This was a necessary step for assessing the crews' performance of the HFE under consideration (e.g. time used for identification and isolation, SG1 level at isolation). This screening also provided information which was later used for the writing of summary operational stories (i.e. typical or average crew progressions through the scenarios), e.g. by providing the execution times of important procedural steps.

However, a thorough qualitative analysis was necessary to derive the required insights into drivers of performance. The time schedule of the study and the resource limitations led to the selection of a subset of crews for in-depth study. The selection was aimed at identifying a mixture of crews at both ends of the performance spectrum ("best" and "worst" performers). The criteria used in the selection process were:

– SGTR isolation time
– Ruptured SG level at isolation.

The criteria led to the selection of 9 crews: 3 base cases (2 "successes", 1 "failure") and 6 complex cases (3 "successes", 3 "failures"). Other crews were also analysed in depth, and this information was used to confirm and/or extend the tendencies identified from the analysis of the best and the worst performers.

2.5 In-depth analyses

The bases for the in-depth qualitative analysis were the audio-video recordings, the recorded on-line expert comments, the simulator logs, and the crew interviews. The core of the analysis process was the detailed review of the video recordings of the scenario phases corresponding to HFE 1. These reviews were structured so as to be relevant for comparison to the HRA analysis submissions.

The analysts viewed the video and transcribed key communications and events. They also wrote comments about salient aspects of crew performance. Immediately after the viewing, they completed a simplified version of the HERA system worksheets in order to record the PSF details identified during the video review in a common format. In completing HERA, the analysts also drew on additional data sources, such as the crew interviews, crew PSF questionnaires, and observer comments. Finally, the analysts summarised the observed episode in the form of an operational story, highlighting performance characteristics, drivers, and key problems in so-called "crew summaries".

The format of the crew summaries was based on the format for the reporting of the HRA methods assessment, as follows:

1. Short story of what happened in the selected part of the scenario (written after reviewing DVD, logs and interview), for identification and isolation separately, which consisted of:
   – extracts of crew communications, including time stamps;
   – a short summary of the observed episode in free form (not chronological), including comments on crew performance.
2. Summary of the most influencing factors affecting performance:
   – The PSFs were categorised as "direct negative influences", "negative influence present", "neutral influence", or "positive influence". In analyzing a performance, a PSF is a "direct negative" when its negative influence on the
crew's performance of the HFE was observed. In some cases, factors were identified as negative, but only as "present", meaning there was no clear evidence that they significantly affected the performance of the HFE.

3. Summary of the observed difficulty or ease the crew had in performing the HFE and of its assumed causes.

The "Summary of the most influencing factors affecting performance" was for each crew a combination of variable PSFs and constant PSFs. The constant PSFs were assessed by an expert panel and are constant for all crews, e.g. the quality of the interface. The constant PSFs then have a fixed assessment that is applied to all crew stories, to keep an overview of all assessed PSFs when looking at each individual story.

3 EXPERIMENT RESULTS

This paper concentrates on the empirical side of the pilot project. The results of the HRA methods-to-data comparisons are reported extensively in the pilot study final report (Lois et al. 2008). By results here we mean the simulator trial results.

The most striking result of the simulations was the presence of a large degree of variability in the crew performance of the first scenario phase, "identification and isolation". Even a visual inspection of Table 2, where the times the crews used to isolate the ruptured steam generator are displayed, shows that the scenario variant (base/complex) accounts for a large amount of the performance time differences. There is no overlap between the two groups, with the exception of the base case run of crew N.

Table 2. Performance times in the two scenarios.

Crew  Scenario  Time(1)  SG level(2)    Crew  Scenario  Time(1)  SG level(2)
M     Base      10:23    20             L     Complex   19:59    78
H     Base      11:59    10             B     Complex   21:10    100(3)
L     Base      13:06    6              I     Complex   21:36    70
B     Base      13:19    21             M     Complex   22:12    81
A     Base      13:33    17             G     Complex   23:39    88
I     Base      13:37    31             N     Complex   24:37    86
E     Base      14:22    40             H     Complex   24:43    91
K     Base      15:09    39             K     Complex   26:39    64
D     Base      16:34    55             D     Complex   27:14    100
J     Base      17:38    44             A     Complex   28:01    100
G     Base      18:38    39             C     Complex   28:57    99
F     Base      18:45    73             F     Complex   30:16    100
C     Base      18:53    57             J     Complex   32:08    100
N     Base      21:29    75             E     Complex   45:27    98

(1) From tube rupture to isolation; (2) At isolation; (3) Simulator problem.

Also, within both cases there is large variability in completion times. The range of performance times for the base case is 10:23 min to 21:29 min, with the range for the complex runs being 19:59 min to 45:27 min. While it is clear that the experimental manipulation had a significant effect, this alone does not reveal how complexity translates into performance difficulties or work styles which, in turn, produce the within-group time differences.

One sense in which complexity translates into operational behaviour is the way the crews progressed through the procedures. In the base case scenario, all crews took the same way through the procedures (E-0 "SGTR identification step"—E-3) and transferred to the isolation procedure by following an easily recognizable transfer condition (radiation indication). In the complex case, on the contrary, there were 8 different paths used among the 14 crews. In Table 3 below all paths and the grounds for transfer are displayed. In most instances the transfer to the isolation procedure was "knowledge based", i.e. the crew did not find an explicit transfer condition but judged the plant state to require the implementation of the ruptured SG isolation procedure (E-3).

A closer analysis of the two tables shows that there does not exist an ideal way through the procedures that will automatically lead the crew to success (fast transfer and isolation). Fast and slow crews often share the same procedure paths.

In both scenario variants, the analysis identified crew factors as the primary determinants of within-group performance time variability. Drivers categorized as team dynamics and work processes appeared to have an impact on performance time in the investigated emergency situation. In particular, the shift supervisor's situation understanding, his or her ability to focus on the main goal and to give directions to the crew, as well as good procedure reading, seemed to be key factors for success (Broberg et al. 2008a).

3.1 Crew factors

In both scenario variants, we have attributed the variability of performance times to crew factors. Crew factors can be distinguished into two categories (this categorization was not followed in the pilot study): 1) crew characteristics, and 2) crew mechanisms (also "crew interaction").

Crew characteristics are systematic, stable factors of the crews being investigated. Examples are: frequency and recency of training, the average experience of the teams, the degree of stability in the crew set-ups at the plant-unit level, and the work practices at the plant, such as the organization of work, shifts, and national characteristics. All these factors are needed in order to describe the "average" crew to be modeled.
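As a cross-check, the within-group ranges and the single overlapping run described for Table 2 can be recomputed directly from the reported times; a minimal Python sketch (crew labels and times are transcribed from the table, and the `seconds` helper is our own):

```python
# Isolation times from Table 2 (mm:ss, from tube rupture to isolation).
base = {"M": "10:23", "H": "11:59", "L": "13:06", "B": "13:19", "A": "13:33",
        "I": "13:37", "E": "14:22", "K": "15:09", "D": "16:34", "J": "17:38",
        "G": "18:38", "F": "18:45", "C": "18:53", "N": "21:29"}
complex_ = {"L": "19:59", "B": "21:10", "I": "21:36", "M": "22:12", "G": "23:39",
            "N": "24:37", "H": "24:43", "K": "26:39", "D": "27:14", "A": "28:01",
            "C": "28:57", "F": "30:16", "J": "32:08", "E": "45:27"}

def seconds(mm_ss: str) -> int:
    """Convert a mm:ss string to seconds."""
    m, s = mm_ss.split(":")
    return int(m) * 60 + int(s)

base_s = {crew: seconds(t) for crew, t in base.items()}
cplx_s = {crew: seconds(t) for crew, t in complex_.items()}

# Within-group ranges: 10:23-21:29 (base) and 19:59-45:27 (complex).
print(min(base_s.values()), max(base_s.values()))   # 623 1289
print(min(cplx_s.values()), max(cplx_s.values()))   # 1199 2727

# Base runs slower than the fastest complex run: only crew N overlaps.
overlap = [crew for crew, t in base_s.items() if t > min(cplx_s.values())]
print(overlap)  # ['N']
```

The check reproduces the statement in the text: the two groups do not overlap except for the base case run of crew N.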
Table 3. Procedure progressions and transfer grounds in complex scenario.

Crew  Procedure progression                      Ground for transfer to E-3
C     E-0 step 21                                Knowledge based (level)
G     E-0 step 21                                Knowledge based (level)
L     E-0 step 21                                Knowledge based (level) + ES-1.1 foldout (*)
N     E-0 step 21                                Knowledge based (level)
A     E-0 step 21—ES-1.1 foldout page            SG level (*)
M     E-0 step 21—ES-1.1 foldout page            SG level (*)
E     E-0 step 21—ES-1.1—E-0 step 19             SG1 gamma level 1 and 2 (slow crew) (*)
F     E-0 step 21—ES-1.1—E-0 step 19             Knowledge based (level)
I     E-0 step 21—ES-1.1—E-0 step 19             Knowledge based (level)
H     E-0 step 21—ES-1.1—FR-H5—E-0 step 19       Knowledge based (level)
B     E-0 step 24                                SG level (*)
D     E-0 step 24—25                             Knowledge based (level)
J     E-0 (second loop) step 14—E-2 step 7       Knowledge based (level)
K     E-0 step 19                                Gamma radiation (*). The crew manually trips
                                                 the reactor. This causes a lower level in SG1
                                                 (less scrubbing effect) and high pressure in
                                                 SG1 early, which likely leads to the PORV
                                                 opening on high pressure and the consequent
                                                 activation of the total gamma sensor.

(*) Decision guided or confirmed by procedure transfer point.
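The eight distinct paths among the 14 crews cited in the text can be recovered mechanically from Table 3; a small sketch (the path strings are our informal transcriptions of the progression column):

```python
from collections import Counter

# Procedure progression per crew, transcribed informally from Table 3.
paths = {
    "C": "E-0 step 21",
    "G": "E-0 step 21",
    "L": "E-0 step 21",
    "N": "E-0 step 21",
    "A": "E-0 step 21 > ES-1.1 foldout page",
    "M": "E-0 step 21 > ES-1.1 foldout page",
    "E": "E-0 step 21 > ES-1.1 > E-0 step 19",
    "F": "E-0 step 21 > ES-1.1 > E-0 step 19",
    "I": "E-0 step 21 > ES-1.1 > E-0 step 19",
    "H": "E-0 step 21 > ES-1.1 > FR-H5 > E-0 step 19",
    "B": "E-0 step 24",
    "D": "E-0 step 24-25",
    "J": "E-0 (second loop) step 14 > E-2 step 7",
    "K": "E-0 step 19",
}

counts = Counter(paths.values())
print(len(paths))             # 14 crews
print(len(counts))            # 8 distinct paths
print(counts.most_common(1))  # [('E-0 step 21', 4)] -- the most travelled path
```

This also shows that the most common path (direct transfer from E-0 step 21) was shared by fast and slow crews alike, in line with the observation that no procedure path automatically leads to success.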

Crew mechanisms are the elements and functions that describe how a team works (and, in HRA terms, possibly fails). Examples are leadership and roles, communication, openness, coordination, adaptability and prioritization.

The crew mechanisms, in addition to explaining how the crews perform, determine both crew-to-crew and crew-to-scenario variability, and should ideally be represented in the (crew) reliability model of an HRA method. In contrast, crew characteristics would determine systematic crew-to-crew variability (they generally refer to the preparedness of the crew for the task) and are, or should be, considered as context factors in HRA.

To understand the point, we could see traditional HRA as built around reliability models of individual cognition, where personal-level variability is the irreducible random variability across people (this is assumed to be small compared to the variation caused by the context on the cognitive functions). Were the reliability model to include crew cognition (as well as non-cognitive team interactions), the "systematic" crew-to-crew variability would be expected to be larger than the person-to-person variability, in part due to the fact that the crew would not be modeled on micro-tasks in laboratory settings, and in part because the status of knowledge about team interaction does not account for many general laws. The existence of a considerable crew-to-crew systematic variability is already reflected by the fact that many HRA methods possess "crew characteristics" PSFs. However, the lack of guidance on the use of this type of PSFs testifies to the little integration of crew characteristics in most methods, an integration that should ideally also account for the relations between the two kinds of crew factors: crew characteristics and crew interaction.

4 THE CHALLENGE: HRA PREDICTIONS VS. SIMULATOR RESULTS

The main effort of the data analysis process has been in finding a presentation format compatible with outputs obtainable from HRA applications (section 2 above). Overall, the study methodology has been judged adequate to the purposes of the study by all concerned parties: the HRA teams, the Steering Group, and the experimentalists. However, on specific aspects, potential improvements have been identified (Lois et al. 2008). This paper concentrates on areas of improvement relative to empirical data integration, formatting, and use. The main issue here is that crew factors and other dynamic factors, which were observed to influence performance, could be better represented and used in the method-to-data comparison and, in general, to inform HRA.

To begin with, let us recall that the experimental results were summarized in three main parts:

1. Response times for identification/isolation and ruptured SG levels.
2. Aggregated operational stories for the two scenario variants.
3. Aggregated driving PSFs (based on the driving factors' summaries for "best" and "worst" performing crews and on the operational stories).

These presentation formats were chosen to allow the comparison of HRA method predictions with observed simulator performance. The response times were necessary in order to assess the performance of the HFEs of the study. The aggregated stories were written in order to summarize the performance of the 14 different crews (in the two scenarios) into single operational expressions, which are the typical level of representation of HRA analyses (as a discretization of all possible scenario variants). The same goes for the summary of the driving PSFs, which could be considered as the PSFs of the aggregated stories, as opposed to the various configurations of context in the individual scenario runs.

Concerns could be raised about the level of accuracy and completeness of the empirical information reached through this process of aggregation and formatting, concerns which regard all three types of results output.

4.1 Failure vs. performance

Crew performance in the pilot study was operationalized as crew performance of the HFEs, in terms of completion time and ruptured SG level at isolation (the lower the better). When the crews were evaluated on the performance of the HFE, a strong emphasis was laid on time, with the "best" crews being the fastest to isolate and the "worst" the slowest. This is a consequence of defining the HFE on a time criterion, although the time criterion has a strong relation to several functional goals, including the PSA-relevant one of avoiding filling up the steam generators (Broberg et al. 2008b).

On the fine-grained level of a simulator trial, however, the speed of action can only be one of several indicators of good performance, and one that can never be isolated from the other indicators. For instance, a crew can act very fast when a shift supervisor takes an extremely active role, decides strategies without consultation and orders the crew to perform steps from procedures, although the latter is a reactor operator responsibility. This performance would not be optimal in terms of other indicators: first, such behavior would not be considered in accordance with the shift supervisor function and the training received. Further, it would reduce the possibility for second checks, with one person centralizing all diagnosis and planning functions. Third, it may disrupt subsequent team collaboration, as the reactor operator would feel dispossessed of his/her functions and could assume either a passive or an antagonistic position. In such cases there is a disjunction between the PSFs for the HFE, those that influence the speed of actions and hence quick success, and the factors which would normally be considered influences on good performance (like good consultations) but which could in the same cases slow down the performance of the HFE. In other words, there is a mismatch between the categorization required by the PRA HFE representation and the one implicit in the reliability models of HRA methods. The pilot study testifies to this mismatch: the PSF profile of the second fastest crew has many similarities to the profiles of the slow performing ones in the complex scenario.

In general terms, it is difficult to see how HRA can be directly informed by focusing on HEPs. In the first place, it was recognized that HEPs could not be easily derived from observed "errors" in the trials, even given the large number of runs. HFEs were therefore adapted to the constraints of a simulator exercise (Broberg et al. 2008b). Yet, they still bring the theoretical framework of the PRA, superimposing it on the crew performance. In other words, while the HFEs focus on what particular goals are not achieved (i.e. what prescribed human-initiated recovery functions are not performed), this cannot be a viable perspective for understanding crew performance. In fact, human and crew behavior in emergency scenarios is not directed by the discrete temporal succession of PRA goals (which moreover are unknown to the agents until their course of action is aligned to the "right" event response), but by the dynamic goals which derive from the interaction of contextual inputs with their interpretation and the responses to them. As a consequence, if the empirical basis of HRA is information on the reliability models and their parameters, this has to be derived from studying what was done and why, rather than focusing on what design-based actions were not performed (failures and errors). The approach of focusing on the reliability models and their parameters, rather than failures and errors, would have consequences both on the criteria for crew selection ("best" and "worst" crews would not be defined foremost from completion times) and on the PSF profiling (the best/worst crews' PSF profiles would have more consistency).

4.2 The difficult treatment of "variable" PSFs

The derivation of the driving PSFs for the base and complex scenarios is based on the driving factors identified during the DVD reviews and summarized in the crew summaries. The DVD review of individual scenario runs (as well as their final aggregation and evaluation) was influenced by the HERA terminology and other HRA-specific documents. This has been challenging for the experimentalists, since such classifications and their definitions are not observational
tools, and since they incorporate context-performance models not necessarily meant for fine-level analysis of crew behaviour and interaction.

Further, a distinction was made between "constant PSFs" and "variable PSFs". Constant PSFs were considered the same for all crews and, in part, could be determined before the actual runs (e.g. indication of conditions) based on:

– the scenario descriptions
– the nature of the simulated plant responses, procedures, and interface
– the plant-specific work practices of the participating crews.

Variable PSFs are those factors not supposed to be the same for all crews, and which had to be evaluated for each crew after the scenario run. Many of these PSFs have a dynamic nature in that they could be evaluated only as the result of their interaction with other context factors and of the interaction of these with crew behaviour. For instance, stress levels can vary across crews: a late identification could create high stress levels during isolation for a crew with little experience in working together, but not in a more experienced one. Most variable PSFs identified turned out to relate to crew characteristics/mechanisms (e.g. leadership style, accuracy of procedure reading) and as such were classified under "work practices", "crew dynamics" and "communication".

This classification prompts three orders of related problems. The first is that the variable, crew-interaction PSFs do not fit most of the current HRA methods, since those methods do not incorporate reliability models of crew interaction and functioning (at best they model the crew as a second level of information processing) and, foremost, cannot analytically treat crew-to-crew variability (they can at best accommodate it mathematically, in sensitivity analysis).

The second problem is that there is little guidance for most HRA methods and tools (e.g. HERA) on how to determine the presence and appropriate level of constant (systematic) crew-characteristics PSFs (e.g. work practices, differences in experience and cohesion). In HERA, for instance, both "work practices" and "crew dynamics" have sub-items on supervisor behaviour, so it is not clear where to classify that dimension. It is therefore hard to compare simulator results with predictions on such factors.

The third problem is that the results of the empirical study regarding main drivers and PSFs might have overweighted the importance of the constant/crew-independent PSFs at the expense of the variable and crew-level PSFs, because the former better fit the methodology of the study followed to produce the results. This is reflected by the fact that the treatment of "team dynamics", "work processes", and "communication" in the identification of main drivers was admittedly an unresolved issue in the pilot phase of the study. On the other hand, it is also true that the scenario-based differences (the manipulation of the study), complexity and procedure-situation fit, which were captured by the constant PSFs, did have strong and observed effects, as testified by the performance differences between the base and complex conditions.

4.3 The interaction between PSFs: observed vs. modelled models

For the identification of the driving factors from the crew summaries, the scenario events analyzed were evaluated against a list of PSFs: for each item in the given set, the presence, direction and effect were determined, as well as a description of its manifestation. For instance, in one crew summary "communication" was rated as "negative influence present" and described in the following way: "While working on the isolation, RO and ARO talk past each other, and the orders to the field operator are initially not what they intended".

This format is consistent with the conventional modeling of performance and PSFs for HRA in PRA, where the assessment problem can be formulated as follows:

Pf (Ti) = f (wi1 v(F1), ..., win v(Fn), ei)    (1)

where Pf (Ti) is the probability of failure of task Ti in a particular event sequence, F1, ..., Fn are the PSFs that influence human performance of the given task, v(F1), ..., v(Fn) are their quantified values, wi1, ..., win are weighting coefficients representing the influence of each PSF on task Ti, and ei is an error term representing model and data uncertainty. f represents the function that yields the probability estimate, which together with the parameters of the expression above could be called the reliability model (i.e. a model of human performance). Different HRA methods incorporate different reliability models. Strictly speaking, in the context of PRA, HRA empirical data is information about the parameters of the reliability models.

For several methods the reliability model or function f is one of independent factors, e.g. in SLIM:

Pf (Ti) = wi1 v(F1) + wi2 v(F2) + ... + win v(Fn) + ei    (2)

This type of model treats the PSFs as orthogonal, direct influences on the probability of task failure.

Even leaving aside the issue of failure, this kind of modeling is generally not adequate for describing task performance in simulator trials. In the first place, the assignments and ratings cannot be done "one-by-one", as the PSFs are not independent (HRA itself
recognizes them to be correlated or overlapping). In addition, the categorization in terms of "presence" and "directness" does not exhaust the range of possible interactions. An alternative and more realistic modeling would have to detail the entire assumed influence set, by specifying all direct and indirect links (and possibly reciprocal effects). Hypothetical, observed structural models would be a much closer approximation to the operational events as described and explained in the narratives of the crew summaries. With such models, the process of aggregation across crews would then be a process of generalizing influence patterns across crews and events.

It must be added that, although the drivers were presented and aggregated in tabular form, the method-to-data comparisons in the pilot study have been performed by using all available information: by combining information on PSFs with operational data, descriptions, and evaluations of difficulties, as well as by interacting with the experimentalists. Also, some more recent methods, such as ATHEANA and MERMOS, do not foremost adopt the "factorial" model illustrated above, but develop operational stories/deviation scenarios, which allowed for a more direct comparison to observed operation.

5 CONCLUSIONS

The pilot study has shown that it is possible to inform HRA by empirical evidence, provided that a sound methodology for comparing HRA method predictions and simulator results is followed. At the same time, issues for improvement have been identified.

One central idea of this paper is that HRA empirical content is information about the reliability models (i.e. models of crew performance) incorporated by the methods. In this view, HEPs are theoretical entities generated by the methods and their assumptions, which cannot be the object of observation, and therefore cannot be empirically informed independently of the process, methods and assumptions used to produce them. As a consequence, special care has to be taken when defining the objects of comparison between HRA predictions and simulator evidence, as illustrated in the discussion of failure vs. performance.

Also, detailed analyses of crew performance show that crew factors are important determinants of performance variability. If HRA wants to exploit the full potential of empirical information, we suggest crew factors (and especially crew mechanisms) as the centre of research for future HRA developments, to the same extent that individual cognition has been so far. Models of individual human cognition are unable to explain the full spectrum of crew interactions, even when a second (crew) level of information processing is modeled. Important aspects of crew behavior would be missed by such modeling.

Finally, the well-known issue of PSF interaction can no longer be avoided by HRA methods and empirical analysis alike, as complex patterns of causal factors are the essence of observed operating crews' behavior.

REFERENCES

Braarud, P.Ø., Broberg, H. & Massaiu, S. (2007). Performance shaping factors and masking experiment 2006: Project status and planned analysis, Proceedings of the Enlarged Halden Programme Group meeting, Storefjell, Norway, 11–16 March 2007.
Broberg, H., Hildebrandt, M., Massaiu, S., Braarud, P.Ø. & Johansson, B. (2008a). The International HRA Empirical Study: Experimental Results and Insights into Performance Shaping Factors, Proceedings of PSAM 2008.
Broberg, H., Braarud, P.Ø. & Massaiu, S. (2008b). The International Human Reliability Analysis (HRA) Empirical Study: Simulator Scenarios, Data Collection and Identification of Human Failure Events, Proceedings of PSAM 2008.
Hallbert, B., Boring, R., Gertman, D., Dudenhoeffer, D., Whaley, A., Marble, J., Joe, J. & Lois, E. (2006). Human Event Repository and Analysis (HERA) System, Overview, NUREG/CR-6903, U.S. Nuclear Regulatory Commission.
Lois, E., Dang, V.E., Forester, J., Broberg, H., Massaiu, S., Hildebrandt, M., Braarud, P.Ø., Parry, G., Julius, J., Boring, R., Männistö, I. & Bye, A. (2008). International HRA Empirical Study—Description of Overall Approach and First Pilot Results from Comparing HRA Methods to Simulator Data, HWR-844, OECD Halden Reactor Project, Norway.
Skraaning, G. (1998). The operator performance assessment system (OPAS), HWR-538, OECD Halden Reactor Project.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Insights from the "HRA international empirical study": How to link data and HRA with MERMOS

H. Pesme, P. Le Bot & P. Meyer


EDF R&D, Clamart, France

ABSTRACT: MERMOS is the reference method used by Electricité de France (EDF) for Human Reliability Assessment (HRA), to assess the operation of nuclear reactors during incidents and accidents in EDF Probabilistic Safety Assessment (PSA) models. It is one of the second-generation HRA methods that have participated in the "HRA international empirical study" organised by the NRC and the Halden Reactor Project. This international study is not yet finished, but it has already been an opportunity to debate relevant HRA issues during the workshops in Halden and in Washington in 2007.

In this paper we focus on the nature and meaning of predictive HRA, compared to the nature of data (from observations on simulators or from small real incidents). Our point of view on this subject is illustrated with an example of a MERMOS analysis implemented for the international study. Predictive HRA exists where failure cannot be observed: it is a way to explore and reduce uncertainty regarding highly reliable socio-technical systems. MERMOS is a method supported by a model of accident that makes it possible to describe the risk and to link it to data. Indeed, failure occurs when a way of operating (that usually leads to success) happens to be inappropriate to a very specific context. Data, in fact knowledge, is then needed to describe two things: the different operating ways (for example, focusing for a while on the recovery of a system), and specific situations (a serious failure of this system, with a problem in identifying it). The HRA analyst then has to find which combinations of operating habits and very specific situations could mismatch and lead to a serious failure of the human action required to mitigate the consequences of the accident.

These links between operating habits, small incidents and big potential accidents, which we try to describe in this paper, should be understood for decision making in the fields of safety, human factors and organisation; indeed, for example, changing a working situation might be very risky with regard to the whole panel of situations modelled in a PSA. HRA should thus help the decision-making process in the Human Factors field, alongside ergonomic and sociological approaches.

1 INTRODUCTION

An international empirical HRA (human reliability analysis) study was launched in 2007, with the Halden Research Institute and several HRA experts participating and debating [1]. The method designed and used for Human Reliability Analysis in EDF's PSA (probabilistic safety assessment) is MERMOS (in French: "méthode d'évaluation de la réalisation des missions opérateurs pour la sûreté"); it is a second-generation HRA method [2] [3]. A team from EDF has participated in this international study with the MERMOS method: indeed, it was an opportunity for EDF to make MERMOS more widely understood and to compare different HRA methods through the implementation of common examples. Some first insights have been presented in a paper for PSAM 9: how to describe potential failure and its influencing factors in a given situation [4].

This paper focuses on the use of simulator data for HRA: indeed, this international study has chosen a specific process for using simulator data, whereas the MERMOS method follows a very different one. We will try to explain these differences and describe how
data and HRA can be linked using the MERMOS method, giving examples taken from the analyses implemented for the international study.

2 HOW TO DEFINE HRA RESULTS AND HRA DATA

2.1 Nature and meaning of HRA

HRA is helpful when failure cannot be observed: in fact it is a way to explore and reduce uncertainty regarding highly reliable socio-technical systems. Thus HRA results intend to describe how failure can occur with regard to PSA criteria: they describe ''big'' failures, not the small ones that we can see on simulators from time to time.

2.2 Nature of HRA data

What we call data in HRA are data from simulations or from small real incidents. In the frame of the international study, we focus on the use of simulator data, more precisely data from dedicated simulations on Hammlab (the Halden simulator).

From the international empirical study point of view, the data collected on the simulator can be compared to HRA results, and it is even one of the main objectives of the study to show how the HRA methods could predict what could be observed on the simulator.

On the other hand, from our point of view, what could be observed on the simulator are only small errors which never lead to failure in the sense of PSA, given the high level of safety of a nuclear power plant. So such data cannot be compared to HRA results; however, there is of course a link between those data and HRA.

3 PROCESS OF COMPARISON OF HRA RESULTS TO HALDEN DATA IN THE INTERNATIONAL STUDY

The purpose of the international study is to compare the results of the methods to data, to see how well they could predict the factors that drive performance. We are very cautious with this purpose because we think that this process is not justified: simulator data are inputs for HRA (small failures, . . .) and cannot be compared to HRA results (failure at a safety mission level).

But, as quoted in the summary of the workshop, some models use ''operational stories'', like MERMOS: the advantage of such a method is that some expressions of these operational stories can be easily compared to observed operational expressions, and we think surely more easily than Performance Shaping Factors (PSF) can. We will give some examples further on.

MERMOS is not focused on general micro individual performance (including success) prediction, but on macro collective and systemic failure at a safety mission level. However, systemic failures occur only in very specific contexts, which include some ''operational expressions'' that we can observe on the simulator, as the comparison done in the international study shows well. Moreover, those MERMOS operational stories are a way to express how different factors can combine to lead to failure, as opposed to considerations on PSFs, which do not take their combinations into account.

However, it is not sufficient to compare the MERMOS operational expressions with the operational expressions of failure that have been observed during the dedicated simulations on Hammlab: what is interesting is, on the one hand, to observe the MERMOS operational expressions in any of the teams' tests (successful or not), and, on the other hand, to check that the combination of the items in the MERMOS scenarios leads to failure (even if the whole MERMOS failure scenario has not been observed).

4 LINKING DATA WITH MERMOS ANALYSES: EXAMPLES TAKEN FROM THE INTERNATIONAL STUDY

4.1 How to link data and MERMOS analyses

MERMOS is a method supported by a model of accident that makes it possible to describe the risk and to link it to data. Failure occurs when a way of operating (that usually leads to success) proves inappropriate in a very specific context.

We consider as data any information which is useful to the MERMOS application and that involves the entire operating system (interactions between team, interface and procedures), not only individual behaviour.

[Figure: diagram showing the same way of operating (CICA 1) leading to success in context A and to failure in context B, alongside non-conceivable scenarios.]

Figure 1. The MERMOS model of accident (from P. Le Bot, RESS 2003).
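The accident model of Figure 1 can be paraphrased as a small data-structure sketch: a failure scenario pairs a way of operating (CICAs) with the specific context in which it proves inappropriate. This is our illustrative reading, not part of MERMOS itself; the class and field names (`Scenario`, `cicas`, `context`, `conceivable`) are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    """Hypothetical encoding of a MERMOS failure scenario:
    a way of operating (CICAs) meeting a very specific context."""
    cicas: frozenset       # e.g. {"CICA 1: follow procedures step by step"}
    context: frozenset     # required situation features, e.g. {"context B"}
    conceivable: bool = True  # non-conceivable scenarios are excluded (Figure 1)

def leads_to_failure(scenario, observed_features):
    """A conceivable scenario 'fires' when the observed situation contains
    every feature the scenario requires: the usual way of operating then
    proves inappropriate in that specific context."""
    return scenario.conceivable and scenario.context <= observed_features

# CICA 1 succeeds in context A but fails in context B (cf. Figure 1)
s = Scenario(cicas=frozenset({"CICA 1"}), context=frozenset({"context B"}))
print(leads_to_failure(s, frozenset({"context A"})))  # False: success path
print(leads_to_failure(s, frozenset({"context B"})))  # True: failure path
```

The design choice mirrors the text: failure is a property of the combination, not of the way of operating or the context taken alone.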
There are three types of information:

1. Information interpreting a particular operating system context: ''Situation features describing the structural and contextual characteristics of the situation''. It is at this level in particular that individual errors are found.
   Example: ''The reactor operator makes a test error and is oriented to the ECP1 procedure''.
   This observation reflects the variations in human behavior which can create uncommon, particular contexts in which the resulting collective behavior may fail.
2. Information interpreting a configuration or orientation of the entire system: the ''CICA'' (Important Characteristics of Accident Management) allow the operation of the operating system to be described over time. They interpret a system configuration or orientation. They are rarely observed directly during the test, but are the result of the interpretation of these observations.
   Example: ''the system follows procedures step by step''.
3. Information which feeds expert judgment (for the evaluation of situation features and CICA).
   Example (context: loss of electric board ''LLE'' and overabundant safety injection): ''document complexity is delaying the diagnosis''.
   An expert who needs to analyze a human factor occurring in this context will be able to take its specificity into account and imagine failure scenarios where the complexity of documents and a delayed diagnosis influence the failure.

This data, gathered in a particular context (for this simulation, with the team being observed and for a particular task), is reusable for imagining failure scenarios for a similar HF (Human Factors) task, and also for a different HF task: many observations can in fact be generalized. Some can directly be used statistically: for example, graphs showing the distribution of the observed times needed for carrying out an action can be drawn up, although their creation and use will need to be refined. However, it is very difficult to decide what to incorporate into the samples, as the situations are never completely reproducible: in our experience, data taxonomies and other data categorizations are by nature reductive and dependent upon models which can quickly become obsolete.

What is more, their extrapolation to conservative PSA contexts is delicate and must be carried out with prudence. It is mainly through expert judgment that we can reuse this data in a quantitative manner, taking these limits into account when modeling both observations and predictions. In addition, observing simulations gives information which increases each analyst's knowledge of accident management (operation of the operating system, how procedures are carried out, etc.) and contributes to building up their expertise. The data gathered thus constitute a knowledge base of qualitative data—as opposed to a database of purely statistical data or probabilities—which can be used more easily for expert judgment.

Given these data, the HRA analyst then has to find which combinations of operating habits and very specific situations could mismatch and lead to a serious failure, as considered in PSAs [5]. Let us take an example to illustrate this.

4.2 Examples of analyses from the international study: HFE 1A & HFE 1B

Nine HFEs have been analysed for the international study. Let us take the first two as examples:

HFE 1A:
The initiating event is a steam generator tube rupture (SGTR, base scenario).
In this situation the mission is to identify and isolate the ruptured steam generator within 20 minutes after SGTR at full power, in order to prevent overfilling the ruptured steam generator. These 20 minutes were chosen by adding 5 minutes to the mean time and do not correspond to a PSA criterion; however, they approximately correspond to the overfilling of the steam generators.

HFE 1B:
The initiating event is an SGTR plus a major steam line break quickly isolated (so that the secondary radiation indications appear normal).
In this situation the mission is to identify and isolate the ruptured steam generator within 25 minutes after SGTR at full power, in order to prevent overfilling the ruptured steam generator. These 25 minutes were chosen by adding 5 minutes to the mean time and do not correspond to a PSA criterion; however, they approximately correspond to the overfilling of the steam generators.

4.3 Examples of data observed on Hammlab and found in the MERMOS analyses

In the MERMOS analyses of HFE 1A and HFE 1B, we can recognize some elements, which we can call ''operational stories'', that could also have been observed on the Halden simulator. This is not surprising, because MERMOS describes qualitatively the ways to fail, specifying precisely the situations that could partially be observed during simulations.

Here are some examples, gathering the comparison work from the expert board of the international study and some complements by EDF.
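The three categories above can be made concrete as record types in a small illustrative snippet; the class names and fields are our own invention, not MERMOS terminology, and the records simply echo the examples quoted in the text.

```python
from dataclasses import dataclass

@dataclass
class SituationFeature:      # type 1: context of the operating system
    text: str                # where individual errors are found

@dataclass
class CICA:                  # type 2: configuration/orientation of the whole system
    text: str                # interpreted, rarely observed directly

@dataclass
class ExpertJudgmentInput:   # type 3: feeds expert judgment
    context: str
    observation: str

# A qualitative knowledge base built from simulator observations
knowledge_base = [
    SituationFeature("The reactor operator makes a test error and is oriented to ECP1"),
    CICA("The system follows procedures step by step"),
    ExpertJudgmentInput("loss of electric board LLE, overabundant safety injection",
                        "document complexity is delaying the diagnosis"),
]

# Records can be filtered by type when imagining failure scenarios
# for a similar (or different) HF task.
cicas = [r for r in knowledge_base if isinstance(r, CICA)]
print(len(cicas))  # 1
```

The point of keeping the three types distinct is that they are reused differently: situation features and CICAs enter scenarios directly, while type-3 records support expert judgment.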
Operational expressions predicted in MERMOS for HFE 1A, compared with the operational expressions observed during the dedicated experiments in Halden:

- Predicted: ''not act with a sense of urgency''
  Observed: ''Several crews decided to take a meeting to assess status and develop a strategy before transferring to E-3 (based on radiation indication)''
- Predicted: ''trained transient''
  Observed: Good to very good training and experience
- Predicted: ''easy to diagnose transient''
  Observed: HMI and indication of conditions: very good

Operational expressions predicted in MERMOS for HFE 1B, compared with the operational expressions observed during the dedicated experiments in Halden:

- Predicted: The system does not perform the procedural steps fast enough and does not reach the early isolation step within the allotted time
  Observed: Supported by the evidence: ''A crew was delayed in E-0 due to their manual steam line break identification and isolation''. Also: ''They use some time, check if sampling is open''
- Predicted: ''Identification of the SGTR by checking steam generator levels can cause problems or time wastage.''
  Observed: ''Crew cannot explain SG1 level without radiation and lack of level in PRZ. Long discussions and unstructured meetings''
- Predicted: ''The absence of radioactivity does not facilitate diagnosis or enable other hypotheses to be developed for the event in progress.''
  Observed: This is strongly supported by the empirical evidence and is in fact a dominant operational issue (when combined with the procedural guidance's reliance on this indication and an apparent lack of training on the alternative cues).
- Predicted: ARO takes time to check that the level in SG#1 is rising uncontrollably. This is probable (assigned p = 0.3). The ARO will not be fast because this check is not often trained during SGTR scenarios [which rely more strongly on other cues]
  Observed: This is similar to the second operational expression above. Note that the lack of training on checking of alternative cues for SGTR is supported strongly by the empirical data.
- Predicted: The operators follow the instructions cautiously
  Observed: ''2 crews read carefully the foldout page''
- Predicted: Working through F-0, the SS wishes to quickly orientate the team towards FR-H.5
  Observed: SS abruptly to RO: ''you can go to FR-H.5'' (one is not allowed to enter this procedure before step 22)
- Predicted: Waiting for feedback from local actions leads to delays (isolation not completed in the time window)
  Observed: This refers to E-3 Step 3. The evidence indeed shows that the crews need a fair amount of time to complete this step due to the local actions mixed in with the control room actions.

4.4 Examples of data observed on Hammlab and not found in the MERMOS analyses

One important operational story that was observed on the Halden simulator, and that caused some delay, does not appear in any MERMOS failure scenario. Our explanation is that we did not find that there could be a problem of transfer to procedure E-3 for HFE 1B. Indeed, theoretically, as mentioned in the package, there are several paths to enter E-3 by following the procedures. From our experience with French teams, operators should have no difficulty entering E-3 if they follow the procedure strictly, without using their knowledge to transfer to E-3 directly from step 19. Because of that, we imagined that the main reason to fail was too strict a following of the procedures, leading to too much time being spent. In fact, exactly the opposite was observed: Halden's teams spent time transferring directly from step 19 to E-3, by communicating and coordinating.
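The comparison illustrated above — checking whether each predicted operational expression shows up in some crew's observed behaviour — can be sketched as a simple matching step. The keyword-overlap heuristic below is purely illustrative: the real comparison was done qualitatively by the study's expert board, and the threshold is an arbitrary choice for the sketch.

```python
def matches(predicted, observed_runs, threshold=0.5):
    """Naive keyword-overlap check between a predicted operational
    expression and observed crew behaviours (illustrative only)."""
    pred = set(predicted.lower().split())
    best = 0.0
    for obs in observed_runs:
        words = set(obs.lower().split())
        best = max(best, len(pred & words) / len(pred))
    return best >= threshold

observed = [
    "several crews decided to take a meeting to assess status",
    "crews read carefully the foldout page",
]
print(matches("crews take a meeting to assess status", observed))  # True
print(matches("completely unrelated expression", observed))        # False
```

Note that, as the text argues, a match in a successful run is still informative: what matters is whether the expression occurs, not whether that particular run failed.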
Only a good knowledge of the habits of the Halden teams could have alerted us and let us imagine that difference. We can explain that difference by the fact that today, French procedures (state-based procedures) are designed to provide the operators with a solution in any case. The operators therefore trust the procedures and follow them first. We do not agree that an important PSF for HFE 1B is inadequate procedure guidance: it is more likely the operators' lack of trust in the procedures, but this needs to be analyzed more deeply.

This difference between our predictions and the observations illustrates well one of the main objections we raised when the international study was launched: it is groundless to aim at predictive analyses without knowing the operators' way of working, and the best way to achieve this is to observe simulator tests. The purpose of HRA is not to predict what can be observed, but to predict from observations what cannot be observed.

4.5 Examples of data observed on EDF simulators and found in the MERMOS analyses

Last, it is important to underline that some of the meaningful operational stories in MERMOS could not be observed in the Halden dedicated simulations. Indeed, the MERMOS analyses make the most of all the simulations that we know (from our EDF simulators) and that could be extrapolated for this study.

Here are some examples of those data — operational expressions predicted in MERMOS for HFE 1A and HFE 1B, and whether they were observed on EDF simulators:

- ''local actions may cause delay'' — Yes (and this is simulated by the trainers)
- ''run through the procedures step by step'' — Yes
- ''the SS does not incite the operators to accelerate the procedural path'' — Yes
- ''the SS does not worry about the effective performance of the action'' — Yes
- ''delegation of . . . to the other operator'' — Yes
- ''the operator makes a mistake in reading . . . '' — Yes
- ''the shift supervisor leads or agrees the strategy of the operators'' — Yes
- ''suspension of operation'' — Yes (in order to gain a better understanding of the situation)

5 CONCLUSION

In this paper we have focused on the nature of predictive HRA, compared to the nature of data (from simulations or from small real incidents). Predictive HRA exists when failure cannot be observed: it is a way to explore and reduce uncertainty regarding highly reliable socio-technical systems. MERMOS is a method supported by a model of accident that makes it possible to describe the risk and to link it to data, as we could see through examples from the international study; the HRA analyst then has to find which combinations of operating habits and very specific situations could mismatch and lead to a serious failure of the human action required to mitigate the consequences of the accident, as considered in PSAs.

These links between operating habits, small incidents and big potential accidents that we have tried to describe in this paper should be understood for decision making in the fields of safety, human factors and organisation; indeed, for example, changing a working situation might be very risky with regard to the whole panel of situations modelled in a PSA. HRA should thus help the decision-making process in the Human Factors field, alongside ergonomic and sociological approaches, even if research is still needed to push its boundaries further.

REFERENCES

[1] Lois E., Dang V., Forester J., Broberg H., Massaiu S., Hildebrandt M., Braarud P. Ø., Parry G., Julius J., Boring R., Männistö I., Bye A. ''International HRA Empirical Study—Description of Overall Approach and First Pilot Results from Comparing HRA Methods to Simulator Data'', HWR-844, OECD Halden Reactor Project, Norway (forthcoming also as a NUREG report, US Nuclear Regulatory Commission, Washington, USA), 2008.
[2] Bieder C., Le Bot P., Desmares E., Cara F., Bonnet J.L. ''MERMOS: EDF's New Advanced HRA Method'', PSAM 4, 1998.
[3] Meyer P., Le Bot P., Pesme H. ''MERMOS, an extended second generation HRA method'', IEEE/HPRCT 2007, Monterey CA.
[4] Pesme H., Le Bot P., Meyer P. ''HRA insights from the International empirical study in 2007: the EDF point of view'', PSAM 9, Hong Kong, China, 2008.
[5] Le Bot P., Pesme H., Meyer P. ''Collecting data for MERMOS using a simulator'', PSAM 9, Hong Kong, China, 2008.

Operators’ response time estimation for a critical task using the fuzzy
logic theory

M. Konstandinidou & Z. Nivolianitou


Institute of Nuclear Technology-Radiation Protection, NCSR ‘‘Demokritos’’, Aghia Paraskevi, Athens, Greece

G. Simos
Hellenic Petroleum S.A., Aspropyrgos, Athens, Greece

C. Kiranoudis & N. Markatos


School of Chemical Engineering, National Technical University of Athens, Zografou campus, Athens, Greece

ABSTRACT: A model for the estimation of the probability of an erroneous human action in specific industrial and working contexts, based on the CREAM methodology, has been created using the fuzzy logic theory. The expansion of this model, presented in this paper, also covers operators' response time data related to critical tasks. A real-life application, a task performed regularly in a petrochemical unit, has been chosen to test the model. The reaction time of the operators in the execution of this specific task has been recorded through an indication reported in the control room. For this specific task the influencing factors with a direct impact on the operators' performance have been evaluated, and a tailor-made version of the initial model has been developed. The new model provides estimations that are in accordance with the real data coming from the petrochemical unit. The model can be further expanded and used for different operational tasks and working contexts.

1 INTRODUCTION

In Human Reliability Analysis the notion of human error does not correspond only to the likelihood that an operator will not perform correctly the task he has been assigned, but also (among other things) to the likelihood that he will not perform the assigned task within the required time. Most critical tasks include the concept of time in their characterization as ''critical'', and most of the error taxonomies developed specifically for human reliability analysis include errors like ''too early/too late'', ''action performed at wrong time'', ''delayed action'', ''operation incorrectly timed'', ''too slow to achieve goal'' and ''inappropriate timing'' (Embrey 1992, Hollnagel 1998, Isaac et al. 2002, Kontogiannis 1997, Swain & Guttmann 1983). What is thus important for Human Reliability Analysis is the identification and quantification of human error and, at the same time, the estimation of the response time of the operator in the performance of a critical task. In modeling human performance for Probabilistic Risk Assessment it is necessary to consider those factors that have the biggest effect on performance. The same is also valid for factors that influence operators' response time.

Many factors influence human performance in complex man-machine systems like the industrial context, but not all of them influence the response time of operators, at least not with the same importance. Many studies have been performed to estimate operators' response time, mainly for Nuclear Power Plants (Boot & Kozinsky 1981, Weston & Whitehead 1987). Those were dynamic simulator studies with the objective of recording the response time of operators under abnormal events (Zhang et al. 2007) and also of providing estimates for human error probabilities (Swain & Guttmann 1983). A fuzzy regression model has also been developed (Kim & Bishu 1996) in order to assess operators' response time in NPPs. The work presented in this paper is an application for the assessment of operators' response times in the chemical process industry.

A model for the estimation of the probability of an erroneous human action in specific industrial and working contexts has been created using the fuzzy logic theory (Konstandinidou et al. 2006b). The fuzzy model developed has been based on the CREAM methodology for human reliability analysis and includes nine input parameters, similar to the common performance conditions of the method, and
281
one output variable: the action failure probability of i. Selection of the input parameters
human operator. Validation of the model and sensi- ii. Development of the fuzzy sets
tivity analysis has already been performed (Konstan- iii. Development of the fuzzy rules
dinidou et al. 2006b, 2006a). iv. Defuzzification
The expansion of this model, presented in here,
covers also operators’ response time data related with The fuzzification process was based on the
critical tasks. The model disposes now of a second out- CREAM methodology and the fuzzy model included
put variable that calculates the estimated response time nine input variables similar to the common perfor-
of the operator performing a specific task in a specific mance conditions of the same methodology namely:
industrial context. For the estimation of the response Adequacy of organization, Working conditions,
time the model takes into account factors (common Adequacy of man-machine interface and operational
performance conditions) that influence the reaction of support, Availability of procedures, Number of simul-
the operator during this specific task. taneous goals, Available time, Time of day (circadian
Section 2 of this paper gives a quick overview of the rhythm), Adequacy of training and Crew collabora-
model developed for the estimation of human error tion quality. The fuzzy logic system has as output
probabilities while section 3 presents the expansion parameter the Action Failure Probability of Human
of the model to cover also operators’ response time. Operator.
Section 4 describes the real industrial task which will For the development of the fuzzy sets and the fuzzy
be used for the application of the model. Section 5 rules the phrasing and the logic of CREAM has been
presents the application of the model in this task and used. According to CREAM a screening of the input
a shorter version of the model which is more tailored parameters can give an estimation of the mode in which
made to include only those input parameters that affect an operator is acting (based on his Contextual Control
operators’ response time. Section 6 makes a com- Mode). The rules are constructed in simple linguistic
parison of the results between the two models while terms and can be understood at a common sense level.
section 7 presents the conclusions of this paper. At the same time these rules result in specific and
reproducible results (same inputs give same output).
The defuzzification process is performed through
2 FUZZY MODEL FOR HUMAN RELIABILITY the centroid defuzzification method (Pedrycz 1993),
ANALYSIS where an analytical calculation of the ‘‘gravity’’ center
produces the final result. The output fuzzy sets cover
A fuzzy logic system for the estimation of the prob- the interval from 0.5 ∗ 10−5 to 1 ∗ 100 (corresponding
ability of a human erroneous action given specific to the values of probability of action failure defined in
industrial and working contexts has been previously CREAM).
developed (Konstandinidou et al. 2006b). The fuzzy The system has operated with different scenarios
logic modeling architecture has been selected on and the results were very satisfactory and in the range
account of its ability to address qualitative informa- of the expectations (Konstandinidou et al. 2006b).
tion and subjectivity in a way that it resembles the These results can be used directly in fault trees and
human brain i.e. the way humans make inferences and event trees for the quantification of specific undesired
take decisions. Although fuzzy logic has been charac- events, which include in their sequences failures of
terized as controversial by mathematician scientists, human factors.
it is acknowledged that it offers a unique feature: the Another use of the model, which can be compared
concept of linguistic variable. The concept of a lin- to sensitivity analysis, deals with the input parameters
guistic variable, in association with the calculi of fuzzy of the model as influencing factors in Human Relia-
if–then rules, has a position of centrality in almost all bility. Factors which influence human reliability play
applications of fuzzy logic (Zadeh, 1996). a very important aspect in the quantification of human
According to L. Zadeh (2008) who first introduced error. The context in which the human action will take
fuzzy logic theory, today fuzzy logic is far less contro- place is defined by these factors. These are the factors
versial than it was in the past. There are over 50,000 that usually have the name of ‘‘Performance Shaping
papers and 5,000 patents that represent a significant Factors’’, or ‘‘Common Performance Conditions’’ or
metric for its impact. Fuzzy logic has emerged as a very ‘‘Performance Influencing Factors’’. Obviously those
useful tool for modeling processes which are rather factors, as their name indicates, influence the action
complex for conventional methods or when the avail- failure probability of the human operators, by increas-
able information is qualitative, inexact or uncertain ing it when they have a negative effect on it or by
(Vakalis et al. 2004). decreasing it when they support the action and the
The Mamdani type of fuzzy modeling has been operator. What is common knowledge (but not com-
selected and the development of the system has been mon practice) is that the better the quality of these
completed in four steps. factors the more reliable the operator behavior.

282
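The Mamdani pipeline described above (fuzzification, min-AND rule firing, max aggregation, centroid defuzzification) can be illustrated with a toy two-input version. The full model uses nine CPC inputs and thousands of rules; the membership functions, the [0, 1] rating convention and the two rules below are invented for this sketch and are not taken from the paper.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Toy CPC ratings on [0, 1]: 0 = worst, 1 = best (our convention).
def poor(x): return tri(x, -0.5, 0.0, 0.6)
def good(x): return tri(x, 0.4, 1.0, 1.5)

# Output universe: log10 of the action failure probability, 1e-5 .. 1.
z = np.linspace(-5.0, 0.0, 501)
high_p = np.array([tri(v, -2.0, 0.0, 1.0) for v in z])    # "failure likely" set
low_p  = np.array([tri(v, -6.0, -5.0, -3.0) for v in z])  # "failure unlikely" set

def infer(training, collaboration):
    """Mamdani inference: AND = min, aggregation = max, centroid defuzzification."""
    w_bad  = min(poor(training), poor(collaboration))     # rule 1 -> high probability
    w_good = min(good(training), good(collaboration))     # rule 2 -> low probability
    agg = np.maximum(np.minimum(w_bad, high_p), np.minimum(w_good, low_p))
    centroid = (agg * z).sum() / agg.sum()                # 'gravity' center in log space
    return 10.0 ** centroid

# A degraded context should yield a higher failure probability:
print(infer(0.1, 0.2) > infer(0.9, 0.9))  # True
```

Working on a logarithmic output universe mirrors the wide 0.5 × 10⁻⁵ to 1 probability range of the CREAM-based output sets; a linear universe would let the large values dominate the centroid.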
The fuzzy model has been used to detect the critical transitions in the optimization of human reliability, in a form of sensitivity analysis, for corresponding changes in each of the nine input parameters, tested in specific contexts. Changes in the working context have been evaluated through the corresponding reduction in the action failure probability of the human operator (Konstandinidou et al. 2006a). As concluded from the application of the model, the input parameters that induce the highest variations in the action failure probability are the ''adequacy of training'' and the ''crew collaboration quality''. For the parameter ''time of the day'' the model shows that operators are more prone to errors during the night hours (20:00–4:00) and also during shift endings and turnovers.

The results of this application are in the form of percentages. These percentages represent the variations induced in the output result, namely the action failure probability, by the variations in the input parameters. The meaning of these percentages is that with an improvement in the training of operators and in their ability to collaborate with each other, the level of human reliability will increase significantly. The values of these percentages are not so important; what matters most is that with the use of the fuzzy model the critical intervals are defined within which the significant variations are located. The determination of the critical transitions thus indicates the points on which the analyst should focus and the areas of improvement which are meaningful and essential. With numerical values of human error probabilities available, the analyst is able to prioritize the possible improvements in elements that affect operators' reliability. These results can furthermore be used in cost–benefit analysis, with the objective of comparing the cost of parameter adjustments to the impact they induce on the performance and reliability of the human operator.

3 FUZZY MODEL WITH RESPONSE TIME ESTIMATIONS

In order to produce estimates for the response time of operators in an industrial context, the fuzzy model for Human Reliability Analysis has been used. With this model as a basis, the fuzzy model for ''Operators' Response Time—ORT'' estimation has been built.

The functional characteristics of the initial model remained as they were defined: the same nine input parameters with the same defined fuzzy sets have been used, and the phrasing and the linguistic variables have remained the same too. This was very helpful in order to have a correspondence between the two models. For more details concerning the functional characteristics of the initial model, please refer to Konstandinidou et al. (2006b).

The new model has a new output parameter, namely ''operators' response time'', which provides the needed estimations. In order to maintain the connection with the initial model, the same names and notions in the output parameters were used. The output fuzzy sets correspond to the four control modes of the COCOM model, the cognitive model used in CREAM (Hollnagel 1998). Those modes are: the ''strategic'' control mode; the ''tactical'' control mode; the ''opportunistic'' control mode; and the ''scrambled'' control mode.

For the application of the ''ORT'' fuzzy model, the four control modes were used to define the time intervals within which the operator would act to complete a critical task. Hence quick and precise actions that are completed within a very short time are compatible with the ''strategic'' control mode; the ''tactical'' control mode includes actions within short time intervals, slightly broader than the previous one; the ''opportunistic'' control mode corresponds to slower reactions that take longer; while the ''scrambled'' control mode includes more sparse and time-consuming reactions.

The relevant time intervals as defined for the four control modes in the ''ORT'' fuzzy model are presented in Table 1. A graphical representation of the four fuzzy sets is given in Figure 1. The range of the four fuzzy sets is equivalent to the range used in the probability intervals of action failure probabilities in the initial model (Konstandinidou et al. 2006b), expressed in logarithmic values.

Table 1. Control modes and response time intervals.

Control mode     Operators' response time (minutes)
Strategic        0 < t < 0.1
Tactical         0.01 < t < 1
Opportunistic    0.1 < t < 5
Scrambled        1 < t < 10

[Figure: four overlapping fuzzy sets—Strategic, Tactical, Opportunistic, Scrambled—plotted over the 0–10 minute time interval.]

Figure 1. Fuzzy sets representation for the ''Operator response time'' output variable.
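The overlapping intervals of Table 1 can be sketched as membership functions over the 0–10 minute universe. The trapezoid corner points below are our guesses consistent with the table's min/max bounds; the paper's exact set shapes (Figure 1) are not reproduced here.

```python
# Illustrative membership functions for the four COCOM control modes over
# the response-time universe of Table 1 (minutes). Shapes are assumptions.

def trap(t, a, b, c, d):
    """Trapezoidal membership: rises a->b, flat at 1 from b->c, falls c->d."""
    if t <= a or t >= d:
        return 0.0
    if t < b:
        return (t - a) / (b - a)
    if t <= c:
        return 1.0
    return (d - t) / (d - c)

modes = {                         # Table 1: min < t < max (minutes)
    "strategic":     lambda t: trap(t, -0.01, 0.0, 0.01, 0.1),
    "tactical":      lambda t: trap(t, 0.01, 0.05, 0.5, 1.0),
    "opportunistic": lambda t: trap(t, 0.1, 0.5, 2.0, 5.0),
    "scrambled":     lambda t: trap(t, 1.0, 3.0, 10.0, 10.01),
}

def dominant_mode(t_minutes):
    """Control mode with the highest membership for a given response time."""
    return max(modes, key=lambda m: modes[m](t_minutes))

print(dominant_mode(0.2))  # a 12-second response falls in the tactical range
print(dominant_mode(7.0))  # a 7-minute response falls in the scrambled range
```

The deliberate overlap between adjacent modes matches the figure: a given response time can belong partially to two control modes, which is what lets the defuzzified output vary smoothly.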
A crucial step in the development of the model is the development of the fuzzy rules. A cluster of fuzzy rules covering all the possible combinations of the input parameters’ fuzzy sets has been developed in (Konstandinidou et al. 2006b). 46656 rules have been defined, taking into consideration the multiple fuzzy sets of each input parameter and using the logical AND operation as the building mode.

The fuzzy rules for the extension of the model retained the ‘‘if’’ part of the initial model, and the ‘‘then’’ part was changed accordingly to include the time notion.

An example (i.e. the first rule) is the following: ‘‘If the adequacy of organization is deficient AND the working conditions are incompatible AND the availability of procedures and plans is inappropriate AND the adequacy of man-machine interface and operational support is inappropriate AND the number of simultaneous goals is more than actual capacity AND the available time is continuously inadequate AND the time of the day is night AND the adequacy of training and experience is inadequate AND the crew collaboration quality is deficient THEN the operator would act in a SCRAMBLED way. Acting in a SCRAMBLED way means that the response time for the operator is between 1 and 10 minutes’’.

In this way every possible combination of the input fuzzy sets corresponds to one (and only one) output fuzzy set and to the relevant control mode with the associated time interval.

In order to have a crisp number as output variable (and not an output set) the centroid defuzzification method (Pedrycz 1993) has been used, as in the initial model. In this way the model comes up with specific estimates for operators’ response time expressed in minutes.

4 SPECIFIC APPLICATION FROM THE PROCESS INDUSTRY

In order to test the model a real-life application has been chosen. A specific task, the opening/closure of a manual valve in order to maintain a desired pressure drop, is performed regularly in a petrochemical unit. This task may be performed at least twice a day during normal operation in order to unclog the drain channel. The same task is performed during maintenance operation in order to shut down or start up the unit. In case of an abnormality that leads to the trip of the unit, or in case of equipment malfunction, the operators are called to act immediately and perform the same task in order to maintain the desired pressure drop so that the unit is not jeopardized. This is equivalent to emergency response situations.

The required time frame for the specific task is very tight. Operators must complete their actions within 1–2 minutes. Otherwise pressure may rise or drop beyond the safety limits and disturb the operation of the whole unit or, even worse (in case of extreme variations), result in equipment failure. Pressure rises and/or drops within a few seconds in the specific node, so the operators’ response is crucial and should be prompt. For the completion of the task one operator is needed.

The reaction time of the operators in the execution of this task has been recorded through the pressure drop indication reported in the control room. Data concerning the specific in-field task of the petrochemical unit have been gathered over a whole year. From those data it was noticed that normal reaction time is within 10–15 seconds (when performing the normal drain operation), reaction time during maintenance was around 1 minute, while reaction time in emergency situations was between 1 and 10 minutes depending on the case.

After discussion with the key personnel of the unit on the specific events that took place during the one-year period, the conclusions were that the elements that differentiate the reaction time of the operators are the level of experience each operator has and the number of tasks he is assigned to do at the same time. This number varies between normal operation, maintenance and emergency response situations. What has also been observed through the collected data is that the time of the day plays an important role in some situations: operators’ response time is different between day and night shifts.

Hence for this specific task the influencing factors that have a direct impact on the operators’ performance are: the circadian rhythm of the operator, expressed in terms of the hour of the day at which he/she is requested to perform the task; the experience and training he/she has obtained, expressed in years of presence in the specific unit (and the petrochemical plant); and the number of simultaneous goals, expressed in terms of parallel tasks to be performed during normal operation, maintenance (task performed in order to shut down or to start up the unit) or emergency situations (equipment malfunction, trip of the unit).

The conclusions of our observations were the basis for the development of a shorter version of the fuzzy model, a model that would include only the influencing factors of this application with the relevant fuzzy sets. This is meaningful since all the nine parameters that are included in the full version of the ‘‘ORT’’ model do not affect response time in this particular application, and the computational cost of the model is significantly decreased with the use of only three input parameters. Additionally, by building a new tailor-made model for the specific application, new fuzzy sets for the output parameter ‘‘operators’ response
time’’ can be used and adjusted according to real data.

5 FUZZY MODEL FOR THE SPECIFIC APPLICATION—‘‘SHORT ORT MODEL’’

For the development of this tailor-made ‘‘Operators Response Time—ORT’’ short model the Mamdani type of fuzzy modeling has been selected, and the development of the system has been completed in four steps.

i. Selection of the input parameters
Three input parameters have been chosen according to the conclusions stated in the previous section. These input parameters are:
a. The number of simultaneous goals
b. The adequacy of training and experience
c. The time of the day
The Operators’ Response Time was defined as the unique output parameter.

ii. Development of the fuzzy sets
In the second step, the number and characteristics of the fuzzy sets for the input variables and for the output parameter were defined. The definition of the fuzzy sets was made according to the observations from the real data and the comments of the key personnel as stated previously.
‘Number of simultaneous goals’: for the first input parameter three fuzzy sets were defined, namely ‘‘Normal operation’’, ‘‘Maintenance’’ and ‘‘Emergency Situation’’.
‘Adequacy of training and experience’: for the second input parameter two fuzzy sets were defined, namely ‘‘Poor Level of Training and Experience’’ and ‘‘Good Level of Training and Experience’’.
‘Time of the day’: for the last input parameter two fuzzy sets were distinguished, corresponding to ‘‘Day’’ and ‘‘Night’’.
‘Operators’ response time’: the output parameter had to cover the time interval between 0 and 10 minutes. Five fuzzy sets were defined to better depict small differences in reaction time, and the equivalent time range was expressed in seconds. The fuzzy sets with the time intervals each of them covers are presented in table 2. More precisely, operators’ response time is ‘‘Very good’’ from 0 to 20 seconds, ‘‘Good’’ from 10 to 110 seconds, ‘‘Normal’’ from 60 to 180 seconds, ‘‘Critical’’ from 120 to 360 seconds and ‘‘Very critical’’ from 270 to 1170 seconds. A graphical representation of the five fuzzy sets is given in figure 2 in order to visualize the range of each time set.

iii. Development of the fuzzy rules
The observation data and the expertise of the key personnel were the knowledge base for the development of the fuzzy rules. The following observations determined the definition of the fuzzy rules:
a. Time of the day (day/night) does not affect operators’ response time during normal operations
b. Time of the day (day/night) does not affect operators’ response time for operators with a good level of training and experience
According to the observed data and by taking into account the above mentioned statements, 8 fuzzy rules were defined for the short ‘‘ORT’’ fuzzy model:
Rule 1: ‘‘If number of goals is equivalent to normal operation and adequacy of training and experience is good then operators’ response time is very good’’.
Rule 2: ‘‘If number of goals is equivalent to normal operation and adequacy of training and experience is poor then operators’ response time is good’’.
Rule 3: ‘‘If number of goals is equivalent to maintenance and adequacy of training and experience is good then operators’ response time is good’’.
Rule 4: ‘‘If number of goals is equivalent to maintenance and adequacy of training and experience is poor and time is during day shift then operators’ response time is normal’’.
Rule 5: ‘‘If number of goals is equivalent to maintenance and adequacy of training and experience is poor and time is during night shift then operators’ response time is critical’’.
Rule 6: ‘‘If number of goals is equivalent to emergency and adequacy of training and experience is good then operators’ response time is normal’’.
Rule 7: ‘‘If number of goals is equivalent to emergency and adequacy of training and experience is poor and time is during day shift then operators’ response time is critical’’.
Rule 8: ‘‘If number of goals is equivalent to emergency and adequacy of training and experience is poor and time is during night shift then operators’ response time is very critical’’.

iv. Defuzzification
Since the final output of the fuzzy system modeling should be a crisp number for the operators’ response time, the fuzzy output needs to be ‘‘defuzzified’’. This is done through the centroid defuzzification method (Pedrycz 1993), as in the previously developed fuzzy models.
The fuzzy logic system has been built in accordance with the real data coming from the petrochemical unit. The testing of the model and its comparison with the full version will be presented in the section that follows.
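The four steps above can be sketched end to end. The following is a minimal Mamdani-style sketch, not the authors' code: the triangular membership shapes and their peak positions are assumptions read off Figure 2 (supports from Table 2, peaks at the interval midpoints), and the inputs are passed as already-fuzzified membership degrees for brevity. Min is used for AND, max for rule aggregation, and the centroid (Pedrycz 1993) yields the crisp response time:

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c, peak at b."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Output fuzzy sets over seconds; supports from Table 2, peaks assumed.
OUT_SETS = {
    "very_good":     (0, 10, 20),
    "good":          (10, 60, 110),
    "normal":        (60, 120, 180),
    "critical":      (120, 240, 360),
    "very_critical": (270, 720, 1170),
}

def short_ort(goals, training, shift):
    """Crisp response time (seconds) from the 8 rules of the short model.

    Arguments are membership dicts, e.g. goals = {"normal": 0,
    "maintenance": 0, "emergency": 1}, training = {"good": 0, "poor": 1},
    shift = {"day": 0, "night": 1}.
    """
    rules = [  # (firing strength via min as AND, output set)
        (min(goals["normal"], training["good"]), "very_good"),                        # Rule 1
        (min(goals["normal"], training["poor"]), "good"),                             # Rule 2
        (min(goals["maintenance"], training["good"]), "good"),                        # Rule 3
        (min(goals["maintenance"], training["poor"], shift["day"]), "normal"),        # Rule 4
        (min(goals["maintenance"], training["poor"], shift["night"]), "critical"),    # Rule 5
        (min(goals["emergency"], training["good"]), "normal"),                        # Rule 6
        (min(goals["emergency"], training["poor"], shift["day"]), "critical"),        # Rule 7
        (min(goals["emergency"], training["poor"], shift["night"]), "very_critical"), # Rule 8
    ]
    # Max-aggregate the clipped output sets over a 1-second grid, then
    # take the centroid of the aggregated membership function.
    grid = range(0, 1171)
    agg = [max(min(s, tri(float(x), *OUT_SETS[o])) for s, o in rules) for x in grid]
    total = sum(agg)
    return sum(x * m for x, m in zip(grid, agg)) / total if total else 0.0

# Inexperienced operator, emergency situation, night shift: only Rule 8 fires.
rt = short_ort({"normal": 0, "maintenance": 0, "emergency": 1},
               {"good": 0, "poor": 1}, {"day": 0, "night": 1})
```

With these assumed shapes the emergency/poor/night case defuzzifies to the centroid of the ‘‘Very critical’’ set; the exact figure depends on the membership shapes, so the sketch should not be expected to reproduce the paper's estimates exactly.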
Figure 2. Fuzzy sets representation for the ‘‘Operator Response Time’’ output variable of the short model (fuzzy sets Very Good, Good, Normal, Critical and Very Critical over the time interval in seconds).

Table 2. Output fuzzy sets for operators’ response time.

Fuzzy set       Time interval (in seconds)
Very good       0 < t < 20
Good            10 < t < 110
Normal          60 < t < 180
Critical        120 < t < 360
Very critical   270 < t < 1170

6 RESULTS FROM THE APPLICATION OF THE MODELS

6.1 ‘‘ORT’’ fuzzy model general application

The results from the general application of the full version of the ‘‘ORT’’ fuzzy model are presented in table 3. Details for a better understanding of these results are as follows.
The first row (first run) represents the worst case scenario, which means a context where all input parameters are judged as ‘‘inadequate’’ and are given the minimum possible value they can be assigned. In this scenario the ‘‘ORT’’ estimated a value of 6 minutes (480 seconds) for operators’ response time. The second row (second run) is still a worst case scenario but with slightly improved parameter values. In this case the ‘‘ORT’’ produced a slightly improved response time of 476 seconds.
Situation 3 (third row/run) is a best case scenario with all parameters assumed to be at the most efficient level. In this case the ‘‘ORT’’ estimated a time of 8 seconds for operators’ response time, while in the fourth situation, which represents the best case scenario with top values for all parameters, the estimated time was 7 seconds.
With the above runs the sensitivity of the model was tested. Indeed the model depicts differences in its input and the calculated result is altered accordingly.
The fifth row represents a medium/normal case. In this situation all parameters have a medium level of quality that corresponds to usual working contexts. In this case the operators’ response time was estimated at 59 seconds (∼1 minute).
The rest of the runs represent situations where all input parameters except ‘‘Number of simultaneous goals’’, ‘‘Time of the day’’ and ‘‘Adequacy of training and experience’’ have medium level values (equal to a value of 50). Changes in values were made only for the three input parameters that affect the specific task which was examined for the application of the model.
For these three parameters the values that were used, with their correspondence with operational situations (expressed in linguistic variables), are the following:
‘Number of simultaneous goals’: a value of 15 was chosen to represent emergency situations, while a value of 50 was attributed to maintenance and 90 to normal operation.
‘Time of day’: value 12 was assigned to day time while value 0 to night time.
‘Adequacy of training and experience’: a value of 0 depicts the inexperienced operator, a value of 50 is the operator with adequate training but with limited experience, while a value of 100 corresponds to the very experienced operator.
What can be seen from the results is that the model only depicts a difference in time response for the night shift of an inexperienced operator in an emergency situation (estimated reaction time 276 seconds). For the rest of the modeled situations (very and limited experienced operator; normal operation, maintenance and emergency situation; day and night time) the model estimates the same reaction time for all operators (59 seconds, approximately 1 minute).
The reaction time estimated is in accordance with the real time of operators’ response. The inflexibility of the model to provide different estimates according to different situations is explained by the fact that for six out of the nine inputs medium values were chosen. This affects the inference of the model by leaving a smaller number of rules to intervene in the calculation of the final result. From the total number of 46656 rules, only 18 rules are activated with the use of medium values.
Additionally, of the three parameters whose values are varied, two (number of simultaneous goals and time of the day) comprise fuzzy sets that have a neutral influence on the final output (Konstandinidou et al. 2006b). That means that even by improving the value of those inputs the estimated response time would remain the same, since the improvement has no effect on the output result.

6.2 ‘‘ORT’’ fuzzy model specific application

In order to overcome the observed shortcomings a new group of runs has been selected. This group comprises runs that correspond to the specific context under which the selected critical task is performed. For the specific application of the model the following input values were used.
Table 3. Results from the application of the ‘‘ORT’’ fuzzy model. Columns, in order: adequacy of organization; working conditions; procedures and plans; MMI and operational support; number of simultaneous goals; available time; time of day; training and experience; crew collaboration; operators’ response time (sec).

0 0 0 0 10 10 0 10 0 480
10 10 10 10 15 20 2 10 10 476
90 90 90 90 90 90 12 90 90 8
100 100 100 100 90 100 13 100 100 7
50 50 50 50 50 50 12 50 50 59
50 50 50 50 15 50 0 0 50 276
50 50 50 50 50 50 0 0 50 59
50 50 50 50 90 50 0 0 50 59
50 50 50 50 15 50 12 0 50 59
50 50 50 50 50 50 12 0 50 59
50 50 50 50 90 50 12 0 50 59
50 50 50 50 15 50 0 100 50 59
50 50 50 50 50 50 0 100 50 59
50 50 50 50 90 50 0 100 50 59
50 50 50 50 15 50 12 100 50 59
50 50 50 50 50 50 12 100 50 59
50 50 50 50 90 50 12 100 50 59
50 50 50 50 15 50 0 50 50 59
50 50 50 50 50 50 0 50 50 59
50 50 50 50 90 50 0 50 50 59
50 50 50 50 15 50 12 50 50 59
50 50 50 50 50 50 12 50 50 59
50 50 50 50 90 50 12 50 50 59
90 20 90 50 15 20 0 0 50 294
90 20 90 50 50 20 0 0 50 294
90 20 90 50 90 20 0 0 50 294
90 20 90 50 15 20 12 0 50 276
90 20 90 50 50 20 12 0 50 59
90 20 90 50 90 20 12 0 50 59
90 20 90 50 15 20 0 100 50 59
90 20 90 50 50 20 0 100 50 59
90 20 90 50 90 20 0 100 50 59
90 20 90 50 15 20 12 100 50 59
90 20 90 50 50 20 12 100 50 59
90 20 90 50 90 20 12 100 50 59
90 20 90 50 15 20 0 50 50 276
90 20 90 50 50 20 0 50 50 59
90 20 90 50 90 20 0 50 50 59
90 20 90 50 15 20 12 50 50 59
90 20 90 50 50 20 12 50 50 59
90 20 90 50 90 20 12 50 50 59
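The second block of Table 3 holds every combination of the three influencing inputs over the fixed task context described in the next section. Such a run grid can be generated mechanically; a small sketch (the fixed base values and the representative input values are those reported in the text, and the tuple ordering simply mirrors the table's columns):

```python
from itertools import product

# Fixed context of the valve task (organization, conditions, procedures, MMI).
BASE = (90, 20, 90, 50)
AVAILABLE_TIME, CREW = 20, 50

GOALS = (15, 50, 90)      # emergency, maintenance, normal operation
HOURS = (0, 12)           # night, day
TRAINING = (0, 50, 100)   # inexperienced, limited, very experienced

# One row per combination, in Table 3 column order (response time omitted).
runs = [BASE + (g, AVAILABLE_TIME, h, tr, CREW)
        for g, h, tr in product(GOALS, HOURS, TRAINING)]
assert len(runs) == 18    # matches the 18 task-specific rows of Table 3
```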

‘Adequacy of organization’: a value of 90 was used for this input parameter since the overall organization of the specific unit in safety issues is judged to be excellent.
‘Working conditions’: a value of 20 was given for this parameter since the task is performed in a petrochemical unit with disadvantageous working conditions (noise, poor ergonomics, bad lighting, smoke and bad odours).
‘Adequacy of procedures and plans’: a value of 90 was used for this input parameter because the complete sequence of actions for the specific task (as well as for the rest of the tasks) is provided in well written and updated procedures.
‘Adequacy of MMI and operational support’: a value of 50 was assigned to this input as the whole task is performed on a specific valve, so the MMI is always the same and does not interfere in the operators’ action.
‘Number of simultaneous goals’: as previously, a value of 15 was chosen to represent emergency situations, while a value of 50 was attributed to maintenance and 90 to normal operation.
‘Available time’: a value of 20 was assigned for this input as the available time for performing the specific task is always very short (1–2 minutes).
‘Time of day’: a value of 12 was used for day time while a value of 0 for night time.
‘Adequacy of training and experience’: a value of 0 depicts the inexperienced operator, a value of 50 the operator with limited experience, while a value of 100 corresponds to the very experienced operator.
‘Crew collaboration quality’: a medium value of 50 was assigned to this parameter since no crew is needed to perform this task (task performed by one operator only).
The results from this application of the model are presented in the second part of table 3.
The estimated times, with input expressed in linguistic variables for the three parameters which influence the operators’ response time in the specific application, are presented in table 4 in order to facilitate the comparison of results later on.
As expected, estimated times are greater for inexperienced operators during night (294 sec) and day (276 sec) and between emergency (276 sec) and normal operations or maintenance (59 sec). However, for a well trained operator the estimated times are always the same (59 seconds, ∼1 minute).
This is due to the same reason as explained in the previous section: number of simultaneous goals and time of the day have fuzzy sets with a neutral effect on the output result. Thus, changes in these specific input parameters do not affect the output.
In order to overcome this problem the fuzzy model would have to be modified. This was not feasible, since a change in the effect of the fuzzy sets would disturb the initial structure of the model and make it incompatible with the estimations of the action failure probabilities, which are already validated and in accordance with CREAM (Konstandinidou et al. 2006b). Instead, the team decided to produce a shorter version of the model that would be more amenable to adjustments and correspond better to the specific task.
The development of the short version was also advantageous from the computational point of view, as explained previously in section 4. The results from the application of the ‘‘ORT – short’’ fuzzy model are presented in the following paragraph.

6.3 ‘‘ORT-short’’ fuzzy model

The results from the application of the short version of the ‘‘ORT’’ fuzzy model are presented in table 4. With the use of the ‘‘ORT – short’’ model the estimates for operators’ response time differ when the input parameters change, in all possible combinations. In this way improvements in operators’ level of experience are depicted, as well as differences between time shifts.

Table 4. Results from the application of the two versions of the ‘‘ORT’’ fuzzy model.

Number of                         Time     ‘‘ORT’’       ‘‘ORT short’’
simultaneous goals    Training    of day   model (sec)   (sec)
Normal operation      Good        Day      59            13
Maintenance           Good        Day      59            60
Emergency situation   Good        Day      59            120
Normal operation      Good        Night    59            13
Maintenance           Good        Night    59            60
Emergency situation   Good        Night    59            120
Normal operation      Poor        Day      59            60
Maintenance           Poor        Day      59            120
Emergency situation   Poor        Day      276           240
Normal operation      Poor        Night    294           60
Maintenance           Poor        Night    294           240
Emergency situation   Poor        Night    294           570

According to the estimates of the ‘‘ORT – short’’ model, a well experienced operator will react in 13 seconds during normal operation in day and night shift, in 60 seconds during maintenance in day and night shift, and in 120 seconds in emergency situations during day and night shift. This is in accordance with observation b) that the time of the day does not affect the response time of an operator with a good level of training and experience. Subsequently, an inexperienced operator will react in 60 seconds during normal operation in day and night shift, in 120 seconds during maintenance in day time and 240 in night time shifts, and in 240 seconds in emergency situations during day shift and 570 in night shift. This is in accordance with observation a) that the time of the day does not affect the response time in normal operations.
The fuzzy logic system estimations are in accordance with the real data coming from the petrochemical unit. Indeed, observation data showed very slow and very critical response of inexperienced operators during night shifts and in emergency situations. In fact the registered response time in one such case reached the period of 10 minutes (600 seconds).
The model can be further expanded and used in different tasks and contexts, e.g. maintenance tasks, other in-field actions or control room operations in the running of a petrochemical unit or, more generally, of a chemical plant. The only constraints for the application of the model are the knowledge of the influencing factors for the specific tasks and the availability of real data.

7 CONCLUSIONS

The criticality of certain tasks in industrial contexts deals not only with the right performance of the task but also with the correct timing of the performance.
A critical task performed too late may in certain cases be equal to an erroneously performed task. Thus, for the purposes of Human Reliability Analysis, not only the Human Error Probability of a task is necessary but also the response time of the operator under specific conditions.
A model for the estimation of the probability of an erroneous human action in specific industrial and working contexts had previously been created based on the CREAM methodology and using fuzzy logic theory. The model has been expanded in order to produce estimates for operators’ response time. This version was called the ‘‘ORT’’ model.
In order to test the model, real data have been gathered from a petrochemical unit concerning a specific in-field task: the opening/closure of a manual valve in order to maintain a desired pressure drop. The recorded reaction time for the specific task is about 1 minute in certain conditions. In extreme cases the recorded time reached the period of 10 minutes. For this specific task the influencing factors that have a direct impact on the operators’ performance were: the time of the day when the task is performed; the level of training and experience of the operator; and the number of simultaneous goals the operator is asked to perform, differentiating between tasks during normal operation, during maintenance or in emergency situations (equipment malfunction, trip of the unit).
Three different applications of the model were made for the specific task. In the first application of the model (original version with response time as second output parameter), with medium values for all parameters except the three influencing ones, the estimated response time was within the range of expectations (1 minute). Although the model was sensitive to input variations, when medium values were used for most of the input parameters the model was rather inflexible. This was due to the fact that medium values deactivated most of the fuzzy rules of the original model (99% of the total fuzzy rules were not used).
When the same model was used with values representing the real life situation, the output resulted in more ‘‘sensitive’’ values. Estimates were in the range of expectations and variations in input induced variations in the output results. However, variations in number of simultaneous goals and time of the day were still not depicted (i.e. operators’ response time does not alter between day and night nor between different operations), since those two input parameters have fuzzy sets with neutral influence on the output result in the original structure of the model.
Therefore, and since only three parameters were acting as influencing factors in the specific task, an ‘‘ORT – short’’ model was developed to include only these input parameters. In this way the application of the model is more closely tailored to the specific task and considerable savings in computational cost are achieved. Indeed, the ‘‘ORT – short’’ fuzzy model came up with estimates of operators’ response time that are in accordance with the observed data and, additionally, more sensitive to input variations.
Differences between day and night shifts, as well as between tasks performed during normal operation, maintenance and emergency situations by experienced and inexperienced personnel, are well depicted with relevant differences in operators’ response times. In fact, in the extreme situation of an emergency during a night shift where an inexperienced operator is called to act, the estimated response time from the model is 570 seconds, which is in accordance with the observed data of 10 minutes (600 seconds).
Following the steps described in the present paper, ‘‘ORT – short’’ tailor-made models based on a fuzzy logic architecture can be developed for different tasks and contexts, e.g. maintenance tasks, other in-field actions or control room operations in the running of a chemical plant.

REFERENCES

Bott, T.F. & Kozinsky, E. 1981. Criteria for safety-related nuclear power plant operator actions. NUREG/CR-1908, Oak Ridge National Lab, US Nuclear Regulatory Commission.
Embrey, D.E. 1992. Quantitative and qualitative prediction of human error in safety assessments. Major Hazards Onshore and Offshore. Rugby: IChemE.
Hollnagel, E. 1998. Cognitive Reliability and Error Analysis Method (CREAM). Elsevier Science Ltd.
Isaac, A., Shorrock, S.T. & Kirwan, B. 2002. Human error in European air traffic management: the HERA project. Reliability Engineering & System Safety 75(2): 257–272.
Kim, B. & Bishu, R.R. 1996. On assessing operator response time in human reliability analysis (HRA) using a possibilistic fuzzy regression model. Reliability Engineering & System Safety 52: 27–34.
Kontogiannis, T. 1997. A framework for the analysis of cognitive reliability in complex systems: a recovery centred approach. Reliability Engineering & System Safety 58: 233–248.
Konstandinidou, M., Kiranoudis, C., Markatos, N. & Nivolianitou, Z. 2006a. Evaluation of influencing factors’ transitions on human reliability. In Guedes Soares & Zio (eds), Safety and Reliability for Managing Risk. Estoril, Portugal, 18–22 September 2006. London: Taylor & Francis.
Konstandinidou, M., Nivolianitou, Z., Kyranoudis, C. & Markatos, N. 2006b. A fuzzy modelling application of CREAM methodology for human reliability analysis. Reliability Engineering & System Safety 91(6): 706–716.
Pedrycz, W. 1993. Fuzzy Control and Fuzzy Systems, second extended edition. London: Research Studies Press Ltd.
Swain, A. & Guttmann, H. 1983. Handbook of Human Reliability Analysis with Emphasis on Nuclear Power Plant Applications. NUREG/CR-1278, US Nuclear Regulatory Commission.
Vakalis, D., Sarimveis, H., Kiranoudis, C., Alexandridis, A. & Bafas, G. 2004. A GIS based operational system for wild land fire crisis management. Applied Mathematical Modelling 28(4): 389–410.
Weston, L.M. & Whitehead, D.W. 1987. Recovery actions in PRA for the risk methods integration and evaluation programme (RMIEP). NUREG/CR-4834, Vol. 1. US Nuclear Regulatory Commission.
Zadeh, L.A. 1996. Fuzzy logic and the calculi of fuzzy rules and fuzzy graphs. Multiple-Valued Logic 1: 1–38.
Zadeh, L.A. 2008. Is there a need for fuzzy logic? Information Sciences, doi: 10.1016/j.ins.2008.02.012.
Zhang, L., He, X., Dai, L.C. & Huang, X.R. 2007. The simulator experimental study on the operator reliability of Qinshan nuclear power plant. Reliability Engineering and System Safety 92: 252–259.


The concept of organizational supportiveness

J. Nicholls, J. Harvey & G. Erdos


Newcastle University, UK

ABSTRACT: This paper describes the process of investigating, defining and developing measures for
organizational supportiveness in employment situations. The methodology centres on a focus group of peo-
ple of diverse age, gender, grade and commercial and industrial disciplines that met many times over a period
of several weeks. The focus group contribution was developed into a large questionnaire that was pilot tested
on a general population sample. The questionnaire was analysed using factor analysis techniques to reduce it
to a final scale of 54 items, which was evaluated by a team of judges, and was then field tested in a nuclear
power station. The analyses revealed a supportiveness construct containing eight factors, being: communication,
helpfulness, empowerment, barriers, teamwork, training, security and health and safety. These factors differ
from other support-related measures, such as commitment, by the inclusion of a ‘barrier’ factor. The findings
are evaluated with an assessment of the host company results and opportunities for further research.

1 INTRODUCTION

This paper aims to identify behaviours that contribute to the harmonisation of collective goals between employers and employees, and to develop a measure of these behaviours that can be used to identify strategies to promote the willing adoption of such behaviours within business organizations. To help achieve this, a construct of organizational supportiveness is developed that can be readily measured in organizations and will indicate areas where management can alter, introduce or delete practices and policies which will assist in the development of more positive internalised attitudes in their staff and consequently bring added value to their business (Baker, 1996). Employee expectations of their employers and their working environments can be measured through such a scale.
The objects of this paper are to:

• Define a construct of organizational supportiveness and demonstrate its construct validity
• Develop a measure of the construct and demonstrate its internal consistency
• Test the measure and demonstrate its generalisability.

2 BACKGROUND

The last 50 years have witnessed enormous changes in the social, political, cultural and working environments around the world. These changes have slowly, but progressively, impacted on the lifestyles of the general population and the ways in which we all interact with our family, friends, neighbours and work colleagues (Palmer, 2001). In the work context, new technology and increasing competition have forced businesses to examine their business strategies and working practices and to evolve new and innovative means of maintaining their competitive advantage in the face of stiff competition, particularly from emerging economies. This has resulted in the globalisation of markets, the movement of manufacturing bases, the restructuring of organizations, and changes to the employer/employee relationship. Also, the changing nature of traditional gender roles that has occurred in the latter half of the 20th century (Piotrkowski et al., 1987; Moorhead et al., 1997) and an increase in the popularity of flexible working policies (Dalton & Mesch, 1990; Friedman & Galinski, 1992) are impacting significantly upon the traditional psychological contract.
Wohl (1997) suggests that downsizing and the associated reduction in loyalty and secure employment have caused people to re-evaluate their priorities in respect of their working and private lives, whilst Moses (1997) states:

‘‘Providing career support for staff has never been more challenging for managers, given the diverse demographic landscape and the complexity of both the new work configurations and peoples lives. Managers must help staff navigate in a jobless, promotionless, securityless environment—a challenge that becomes even harder when much of the workforce may be temporary, contract or part time. The changes occurring in today’s workplace are confusing to employees. Managers can
facilitate a sense of reassurance and renewal, recognising that as leaders and representatives of their organizations, their participation is not only desirable but critical’’ (Moses, 1997: page 44).

So it is clear that supportiveness can be critical to the well-being of the workforce; it can also be stated that the implications of getting this wrong can have a serious impact upon the reliability and safety of an organizational system or its processes. But supportiveness can mean a lot more than social support, and can be related to commitment, trust and the psychological contract, amongst its many possible correlates. Consequently, the literature on supportiveness is diverse and varied. It includes the above issues and the work ethic, teamworking, organizational citizenship behaviours, expectations and aspirations, and may be seen in terms of theories of motivation, attribution, cognitive dissonance etc.
In order to develop an understanding of what supportiveness an organization might provide, we propose
The ability to trust enables people to interact, to build close relationships, is important for social exchange and is essential for psychological health and development (Asch, 1952; Erickson, 1959; Blau, 1964; Argyle, 1972; Kanter, 1977; Barber, 1983). Not only that, but positive relationships have been found between HR strategies and high organizational performance and commitment (Arthur, 1994; Huselid, 1995; Tsui et al., 1997). However, Koys (2001) researched the relationship between HR strategies and business performance and contends that HR outcomes influence business outcomes rather than the other way round. HR by its very title is concerned with people more than processes, and this suggests that it is in fact people that make the difference.
The prevailing management culture can influence support for alternative working arrangements differentially according to the supervisor’s perception of the subordinate’s career success (Johnson et al., 2008); this may be consistent with a good or poor culture and prevailing beliefs about trust. Leung et al. (2008) have shown that informal organizational supports, partic-
that the literature and research into supportiveness ularly those concerning relationship conflict, work
can be summarised into five main thematic areas of underload and lack of autonomy are more effec-
influence: tive than formal supports in reducing stress; again
these may also reflect issues of trust, management
∗ Supportiveness related constructs: psychological culture etc.
contract, organizational citizenship behaviours,
organizational commitment and social support 3 DEFINING A SUPPORTIVENESS
∗ Attitudes: trust, prevailing management culture, CONSTRUCT
beliefs, values, work ethic and motivation
∗ Lifestyle variables: domestic arrangements such Whilst some measures of ‘near’ concepts, such as
as working partner, childcare or care of relatives, social support, trust and commitment exist, none of
environment and non-work interests these are particularly relevant to the notion of the sup-
∗ Aspirational variables: promotion, development, portiveness that might provided by the organization.
transferable skills, job satisfaction, and flexible Thus, it was decided to establish the construct from
hours first principles by asking people what they considered
∗ Business issues: welfare, pensions, and safety at a supportiveness construct would be.
work, unions, legislation, and security. The construct was developed through an iterative
process of:
For each of these areas, a body of literature exists-
too much for one paper to cover, but exemplified by • Extracting variables from established constructs and
the following. concepts within existing literature.
The rapid changes to working life and societal val- • Holding informal interviews and conversations with
ues in the latter half of the 20th century, along with individuals and small groups from diverse employ-
changing economic circumstances and globalisation ment situations to collate their ideas about what
have impacted on the traditional psychological con- they considered that a good supportive organization
tract and new working practices are evolving to meet would be.
the demands of the changing work environment. There • Establishing a focus group of selected people who
is some evidence to suggest that traditional values of would brainstorm the potential variables of their
work ethic and organizational commitment may be perception of a supportive organization without the
waning in favour of more personal fulfilment outside influence of any prior input.
of work. However, it is not clear whether the pace of • Establishing a number of informal ambassador
change of peoples’ attitudes is less than or equal to groups whose collective output was ultimately fed
the pace of change of an emerging new psycholog- into the formal focus group.
ical contract and the effect of a dynamic workplace • Collating all inputs and conducting a ranking and
environment. rating exercise with the focus group.

292
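The final collation, ranking and rating step can be sketched in code. This is an illustrative reconstruction, not the authors' actual procedure: the rater data and the "free parking" element are invented, and the mapping of high/medium/low ratings to numbers is an assumption.

```python
# Illustrative sketch (invented data): collating high/medium/low ratings
# pooled from the focus and ambassador groups, then ranking candidate
# elements by mean rated strength.
from collections import defaultdict

# Assumed numeric coding of the subjective strength scale.
STRENGTH = {"low": 1, "medium": 2, "high": 3}

# (element, rating) pairs; "free parking" is a made-up example element.
ratings = [
    ("allowing personal creativity", "high"),
    ("allowing personal creativity", "medium"),
    ("providing medical support", "high"),
    ("providing medical support", "high"),
    ("free parking", "low"),
]

totals = defaultdict(list)
for element, rating in ratings:
    totals[element].append(STRENGTH[rating])

# Sort elements from strongest to weakest by mean rating.
ranked = sorted(
    ((sum(v) / len(v), k) for k, v in totals.items()), reverse=True
)
for mean, element in ranked:
    print(f"{element}: {mean:.2f}")
```

A real exercise would also record which rater gave each rating, so that disagreement between groups could be inspected before elements are carried forward.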
Table 1. The focus group constitution.

     Gender   Age     Industry                Job type
1    Female   31–40   Power industry          Senior manager
2    Male     51–60   Media                   Employee
3    Female   51–60   Leisure                 Employee
4    Female   51–60   Education               Middle manager
5    Female   <20     Commerce                Clerical employee
6    Male     21–30   Student                 n/a
7    Male     21–30   Construction            Employee
8    Male     31–40   Research/development    Professional
9    Male     41–50   Engineering             Technician
10   Male     41–50   Local Govt              Accountant
11   Female   21–30   Retailing               Assistant
12   Female   31–40   Health                  Nursing sister
13   Male     >61     Shipping                Employee
14   Female   31–40   Power industry          Clerical
15   Male     51–60   Off-shore               Technician
4 METHODOLOGY

4.1 The focus groups

The sample of people consulted in the various groups was chosen to be representative on the basis of age, gender, job grade, industrial sector and whether employed or unemployed. This was to provide confidence that all of the bases, including outliers, had been considered and that there was no inherent bias towards people currently in work. The constitution of the primary focus group is shown in Table 1.
The ambassador groups consisted of a network of independent and informal focus groups that were formulated and led by friends, colleagues and associates (ambassadors) who had received the same brief as the primary focus group.

4.2 The brief

The following brief was given to each member of the primary focus group and each ambassador, and provided the remit for the group's contribution.
''The purpose of this research is to identify behaviours which contribute to the harmonisation of collective goals between employers and employees and develop strategies that promote the willing adoption of such behaviours within business organizations. I hope to develop a concept of Organizational Supportiveness and ultimately a measure of this concept that can be operationalised into a questionnaire that can be used to assess the degree of supportiveness that is given by organizations to the employees of that organization. To assist me in defining this concept, I am asking you, both individually and collectively, to consider and list all of the means by which organizations do, or could, demonstrate their support to their workforce. In considering this task I would like you to include examples from your own experience of: Company policies, Temporary initiatives, Manager and co-worker traits, Personal aspirations that you may have held but not directly experienced. Include also any other factor that you consider could have an impact on this concept. It is equally important that you include both positive and negative examples and ideas of organizational supportiveness in the workplace, in order to assess not only what a supportiveness concept is, but also what is unsupportive.''

4.3 To develop a measure and establish its construct validity and reliability

The focus group process yielded 51 separate elements that the group rated in terms of their strength on a subjective scale of high, medium and low, reflecting the relevance of that element to an overall supportiveness concept. The 51 elements covered many diverse issues and included elements such as ''allowing personal creativity'' through ''providing medical support and pension schemes'' etc. Construct validity may be defined as ''the degree to which inferences can legitimately be made from the operationalisation in your study to the theoretical constructs on which those operationalisations were based'' (Trochim, 2002).
The theoretical construct in this study may be defined as the mutual harmonisation of the human and business imperatives of organizations. In order to assess construct validity, a second focus group, who were experienced in questionnaire design and followed Oppenheim's (1994) principles, formulated the 51 elements into a 517 item questionnaire of approximately
10 Likert-style items per element with a 7-point scale ranging from ''strongly disagree'' to ''strongly agree''. An initial pilot on 10 PhD management students was followed by a larger pilot in cooperating organisations in the general industrial and commercial domain, generating 103 responses (47 male, 32 female and 24 gender undisclosed). The questionnaires were factor analysed and rationalised into seven factors that were represented in a 54 item questionnaire; an eighth factor, health and safety, was omitted from this final questionnaire as it was felt that this would lengthen the questionnaire unnecessarily when such issues could be better tested using the organization's own health and safety inspection procedures. An established Organizational Commitment measure (Meyer & Allen 1997) was added for the field test in a single organisation (a nuclear power station) about which a great deal was known, i.e. the people, processes, operational culture, corporate history etc. Consequently, it was possible to take the mean scores of the returned questionnaires and ask three inter-rater judges to give their opinion of the mean score, between 1 and 7, for each question. This process, whilst not fitting the classical application of inter-rater reliability, is consistent with the approach described by Marques & McCall (2005) and yielded general consensus between the judges that the measure derived from the questionnaire was a true reflection of the actual level of organisational supportiveness present at that time.

5 RESULTS

5.1 The 8-factor solution

The results from the two independent datasets, the initial pilot test in the general population (N = 103) and the field test in a host organisation (N = 226), were compared and found to be statistically in 91% agreement. So the two datasets were combined into one (N = 329) and subjected to another exploratory factor analysis, where they demonstrated a 96% agreement with the field test result. The factors of this second exploratory factor analysis, with alpha coefficients for the host organisation (N = 226), are shown in Table 2. The eighth factor was identified as health & safety and conditions of employment. Since this involves clear lines of statutory enforcement and corporate responsibility, it is omitted henceforth from the analysis, as these are largely outwith the control of the local workforce.
Further analysis of each of the 7 factors in respect of age, gender, grade, length of service and shiftwork/daywork effects showed no significant difference from the generalised test results, with the exception of length of service effects for employees with less than 2 years' service, who demonstrated lower alpha coefficients for job security and teamwork. The results of this analysis are shown in Table 3.
One explanation for this may be that teamwork relies on some type of inter-personal bonding between co-workers that takes time to establish, and that security cannot be felt during a period of probationary employment. The standard terms and conditions of employment at the nuclear power station required all employees to satisfactorily complete a one/two year probationary period before being confirmed in their post. This observation may be consistent with the measure being a reliable indicator of perceived organizational supportiveness: if it is accepted that UK employment legislation permits unaccountable dismissal within the first 12 months of employment without any recourse to appeal, then it is reasonable for new employees not to feel secure in any organisation.
As a validity and benchmarking exercise to compare our factors with commitment, the commitment measure of Meyer and Allen (1997) was correlated with F1 communications and F4 barriers, with the results in Table 4. The results suggest that affective commitment is strongly associated with both factors, as indeed it often is with other work attitudes such as trust and job satisfaction. The normative commitment measure, which relates more to obligation, is more strongly related to F4 barriers than to communications, which would be consistent in terms of attitude theory.

5.2 Sample size issues

The two datasets, from the pilot test in the general population and the field test at the nuclear power station, were gathered approximately 12–14 months apart, and the closeness with which these two independent surveys agree with each other suggests that the construct has both internal consistency and construct validity. However, neither survey produced the very high ratios of cases to variables (N:P ratio) that factor analysis ideally requires to give high confidence in the outcome of the factor analysis. The first survey had a 2:1 N:P ratio and the second a 4.5:1 N:P ratio. As the questionnaires were identical, their data can be combined into a single dataset giving a 6.5:1 N:P ratio. This not only increases the cases to variables ratio but also mixes one set of data that could be focused on, and therefore skewed to a particular organization (the field test data), with data that were in effect randomly gathered from a number of different organizations (the general population data).
There is some debate about what constitutes a suitable sample size for factor analysis. In general, it is reasonable to say that the larger the sample, the better the results. Small samples are likely to yield spurious results where factors splinter and cannot be replicated in subsequent analyses, or may contain unrepresentative bias (Froman 2001). Minimum
Table 2. The 8-factor solution of the field test in the host organization, excluding factor 8 (N = 226).

Factor 1 COMMUNICATION (α = .866)
1 My Manager communicates too little with us.
2 I am told when I do a good job.
3 I can speak to my Manager at any time.
4 My Manager operates an open door policy.
5 My Manager is always open to my suggestions.
6 My Manager will always explain anything we need to know.
7 My Manager only tells me what I have done wrong, not what should be done to improve things.
8 I sometimes feel I am wrongly blamed when things go wrong.
9 I can speak to my Manager and tell him/her what I would like.

Factor 2 HELPFULNESS (α = .923)
10 My Manager will assist me with difficult tasks.
11 If I need advice and support, it is available here at work.
12 My Manager is loyal to me.
13 I feel I can rely on my Manager defending my group's interests.
14 My Manager always looks after him/herself first.
15 My Manager always plays down the good things that I do.
16 My Manager is only interested in his own career.
17 My Manager does not care about my domestic problems.
18 My Manager always sets unrealistic targets.
19 My Manager encourages me to continually develop my skills and knowledge.
20 Sometimes I am given a task that the Manager knows will probably fail.
21 My Manager's behaviour is very varied and inconsistent.
22 My Manager has their favourite people.
23 How my Manager treats me changes from day to day.

Factor 3 EMPOWERMENT (α = .903)
24 My workplace is one where I can be creative or flexible.
25 My Manager seeks my views.
26 I have choices and decisions to make in my job.
27 I am given freedom to implement my ideas at work.
28 My Manager listens to my ideas.
29 My Manager encourages me to think of new ideas.
30 I am allowed to make improvements to my daily routine.

Factor 4 BARRIERS (α = .912)
31 I trust this organization.
32 Managers here do not appreciate the work that I do.
33 Only the 'in crowd' get promotion here.
34 Management here do not care about the workers.
35 At my place of employment, it is every man for himself.
36 The culture at work does not foster loyalty.
37 Management here are manipulative.
38 Vindictiveness among managers is common at work.
39 Generally, Management here are supportive of the workflow.
40 This organization has business integrity.
41 The reasons given for decisions are not usually the real ones.
42 This organization always does the morally right things.
43 Managers here put their own 'agendas' before those of my group.
44 In this Company everybody is out for themselves.

Factor 5 TEAMWORK (α = .789)
45 Interaction with others in my group is an important part of what I do.
46 I generally enjoy working as part of a team.
47 To be successful, my job depends on the work of colleagues.
48 We depend on each other in my workgroup.

Factor 6 TRAINING (α = .708)
49 This organization provides training to maintain existing skills and to develop new skills.
50 There is a good induction training programme here now.
51 We are given the opportunity to get re-training whenever we want it.

Factor 7 SECURITY (α = .721)
52 I feel relatively secure here.
53 My job will still be here in 5 years time.
54 I feel my job is very safe.
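The α values quoted for each factor in Table 2 (and by subgroup in Table 3) are Cronbach's alpha coefficients, computed from the item variances and the variance of the summed scale score. A minimal sketch on invented responses, not the survey data:

```python
# Cronbach's alpha for a set of Likert items (rows = respondents,
# columns = items), using the standard formula
#   alpha = k/(k-1) * (1 - sum(item variances) / variance(total score)).
# The response data below are invented purely for illustration.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Internally consistent items yield alpha near 1; unrelated items near 0.
rng = np.random.default_rng(1)
trait = rng.normal(4, 1, size=200)                       # latent score per respondent
consistent = trait[:, None] + rng.normal(0, 0.3, (200, 4))
random_items = rng.normal(4, 1, (200, 4))
print(round(cronbach_alpha(consistent), 2))    # close to 1
print(round(cronbach_alpha(random_items), 2))  # close to 0
```

Negatively worded items (such as item 1 or item 14 above) would need reverse-coding before a computation like this, or the alpha would be spuriously low.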
Table 3. Factor means and Cronbach’s alphas by length of service in years.

Factor Mean score Cronbach coefficient

Length of service <1 1–2 2–5 5–10 10–20 >20 <1 1–2 2–5 5–10 10–20 >20

Communication 5.097 5.232 4.421 4.222 5.003 4.833 .790 .861 .907 .967 .844 .855
Helpfulness 4.908 4.948 4.350 3.978 4.711 4.600 .900 .921 .953 .986 .909 .933
Empowerment 4.304 4.335 3.929 4.190 4.777 4.758 .929 .764 .920 .841 .909 .905
Barriers 4.634 4.438 4.104 3.881 4.110 4.097 .803 .887 .946 .849 .893 .926
Teamwork 5.000 5.630 5.792 5.583 5.666 5.599 .880 .317 .837 .793 .722 .852
Training 4.917 5.014 4.625 3.444 4.617 4.920 .786 .840 .831 .627 .600 .681
Security 3.958 3.174 3.625 3.333 4.042 4.102 .730 .341 .795 −.60 .738 .740
N 8 23 24 3 80 88 8 23 24 3 80 88
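As a sketch of the factor-analytic pipeline behind these tables (the N:P ratios discussed in Section 5.2, then a seven-factor extraction), using synthetic data in place of the questionnaire returns; scikit-learn's `FactorAnalysis` is used as a stand-in, since the paper does not name its software:

```python
# Sketch on synthetic data: case-to-variable (N:P) ratios, then an
# exploratory factor analysis extracting seven factors, as in the
# paper's final solution. Not the authors' actual analysis.
import numpy as np
from sklearn.decomposition import FactorAnalysis

N_ITEMS = 51  # elements produced by the focus-group stage

# N:P ratios for the pilot (103), field test (226) and combined (329) samples.
ratios = {n: n / N_ITEMS for n in (103, 226, 329)}
for n, ratio in ratios.items():
    print(f"N = {n}: N:P = {ratio:.1f}:1")

# Synthetic stand-in for the combined 329 x 51 response matrix.
rng = np.random.default_rng(0)
X = rng.normal(loc=4.0, scale=1.5, size=(329, N_ITEMS))

fa = FactorAnalysis(n_components=7, random_state=0).fit(X)
loadings = fa.components_.T   # one row per item, one column per factor
print(loadings.shape)         # (51, 7)
```

On real data one would inspect the loadings to assign items to factors, which is how the 54 retained items map onto the seven factors of Table 2.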

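The factor-commitment comparisons reported in Table 4 are ordinary Pearson correlations with two-tailed significance. A minimal sketch of one such coefficient, on invented subscale scores rather than the study's data:

```python
# Pearson correlation with a two-tailed p-value, of the kind reported
# in Table 4. The subscale scores below are invented for illustration.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
communications = rng.normal(4.5, 1.0, 226)                  # F1 subscale scores
affective = 0.6 * communications + rng.normal(0, 0.8, 226)  # a related scale

r, p = pearsonr(communications, affective)
print(f"r = {r:.3f}, two-tailed p = {p:.2g}")
```

With N = 226 even moderate correlations are highly significant, which is consistent with the p < .0001 flags attached to the Table 4 coefficients.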
sample sizes suggested range from 3 to 10 subjects per item, with a minimum of between 100 and 300 subjects regardless of the number of items (Gorsuch 1983; Cattell 1978; Tinsley & Tinsley 1987; Nunnally & Bernstein 1994). Higher estimates range from 'a large sample of several hundred' (Cureton & D'Agostino 1983) to 20 subjects per factor (Arindell & Van der Ende 1985). The sample sizes used in this analysis are, for the pilot test, 103 subjects, which satisfies Gorsuch's (1983) criteria, and, for the field test, 226 subjects, which satisfies Gorsuch's (1983), Cattell's (1978) and Tinsley & Tinsley's (1987) criteria. For the combined sample of 329, all of the criteria are satisfied.

Table 4. Pearson's correlations between F1, F4 and the Meyer and Allen (1997) commitment scales.

Commitment scores    F1 Communications    F4 Barriers
Normative            .282∗∗∗              −.498∗∗∗
Continuance          .066                 −.041
Affective            .545∗∗∗              −.592∗∗∗

Key: ∗∗∗ = p < .0001.

5.3 Ordinal scale

From the results that were obtained, a subjective supportiveness scale was developed for the host organisation; this can be used for each subscale, and its linguistic anchors for each scale point are: 7 = exceptionally supportive, 6 = very supportive, 5 = supportive, 4 = neither, and so on to 1 = very poor.

6 CONCLUSION

In summary, the following outcomes have been achieved:
The important philosophical and practical elements that define an organizational supportiveness construct have been identified.
These elements have been rationalised into seven factors that can be measured and operationalised into policies and strategies in any organization.
An eighth important factor was identified (Health & Safety and Conditions of Employment) that may not be within the direct control of most people within an organization, but one for which clear lines of statutory enforcement and corporate responsibility exist. This factor has been omitted from the construct on the basis that health & safety is bounded by statute, and terms and conditions of employment are very often outwith the control of the local workforce other than during periods of formal collective bargaining.
A means of measuring the strength or level of supportiveness that currently exists within any organization, particularly in terms relative to a previous or subsequent measure, has been developed.
The construct validity and internal consistency of the measure have been demonstrated.
We have shown that organizational supportiveness is an independent construct compared to organizational commitment, and not a direct correlate of OC.
Trust and loyalty appear consistently as important throughout the management literature, and their absence appears regularly to be associated with negative consequences. The ability to trust enables people to interact, to build close relationships, is important for social exchange and is essential for psychological health and development (Erikson, 1959; Blau, 1964; Argyle, 1972; Kanter, 1977; Barber, 1983). A supportiveness construct such as this one with a focus on communication can build emotional trust, and an emphasis on developmental training can mitigate the uncertainty of the current business environment. In asking where the supportiveness construct fits into the body of theory and constructs that exist to describe and explain workforce attitudes to their employers and to business organisations, there is evidence that positive support through good HR policies is associated with high organizational performance as well as employee commitment and trust (Koys, 2001; Arthur, 1994; Huselid, 1995; Tsui et al., 1997). In addition, the evidence that employee assistance programmes and support systems more than pay for themselves demonstrates the usefulness of supportiveness.

REFERENCES

Arindell, W.A. & Van der Ende, J. (1985). An empirical test of the utility of the observation-to-variables ratio in factor and components analysis. Applied Psychological Measurement, Vol. 9: 165–178.
Argyle, M. (1972). The Social Psychology of Work. Harmondsworth: Penguin.
Arthur, J.B. (1994). Effects of human resource systems on manufacturing performance and turnover. Academy of Management Journal, Vol. 37: 670–687.
Baker, N. (1996). Let's talk about relationships. Workplace & family. www.findarticles.com/cf_0/m0FQP/n4301_v125/18717517/print.jhtml
Barber, B. (1983). The Logic and Limits of Trust. New Brunswick, NJ: Rutgers University Press.
Blau, P.M. (1964). Exchange and Power in Social Life. New York: Wiley.
Briner, R.B. (1996). Making Occupational Stress Management Interventions Work: The Role of Assessment. Paper to the 1996 Annual BPS Occupational Psychology Conference.
Cattell, R.B. (1978). The scientific use of factor analysis. New York: Plenum.
Cronbach, L.J. & Meehl, P.E. (1955). Construct validity in psychological tests. Psychological Bulletin, Vol. 52: 281–302.
Cureton, E.E. & D'Agostino, R.B. (1983). Factor analysis: An applied approach. Hillsdale, NJ: Erlbaum.
Dalton, D.R. & Mesch, D.J. (1990). The impact of flexible scheduling on employee attendance and turnover. Administrative Science Quarterly, Vol. 35: 370–387.
Erikson, E.H. (1959). Identity and the life cycle. Psychological Issues, Vol. 1: 1–171.
Friedman, D.E. & Galinski, E. (1992). Work and family issues: A legitimate business concern. In S. Zedeck (Ed.), Work, Families and Organizations. San Francisco: Jossey-Bass.
Froman, R.D. (2001). Elements to consider in planning the use of factor analysis. Southern Online Journal of Nursing Research, Vol. 2 No. 5: 1–22. www.snrs.org/publications/SOJNR_articles/iss05vol02_pdf
Gorsuch, R.L. (1983). Factor Analysis (2nd Ed.). Hillsdale, NJ: Erlbaum.
Huselid, M.A. (1995). The impact of human resource management practices on turnover, productivity, and corporate financial performance. Academy of Management Journal, Vol. 38: 635–672.
Johnson, E.N., Lowe, D.J. & Reckers, P.M.J. (2008). Alternative work arrangements and perceived career success: Current evidence from the big four firms in the U.S. Accounting, Organizations and Society, Vol. 33 No. 1: 48–72.
Kanter, R.M. (1977). Men and Women of the Corporation. New York: Basic Books.
Koys, D.J. (2001). The effects of employee satisfaction, organization citizenship behavior, and turnover on organizational effectiveness: A unit level longitudinal study. Personnel Psychology, Vol. 54 No. 1: 101–114.
Leung, M.Y., Zhang, H. & Skitmore, M. (2008). Effects of organisation supports on the stress of construction estimation participants. Journal of Construction Engineering and Management-ASCE, Vol. 134 No. 2: 84–93.
Marques, J.F. & McCall, C. (2005). The Application of Interrater Reliability as a Solidification Instrument in a Phenomenological Study. The Qualitative Report, Vol. 10 No. 3: 438–461.
Meyer, J.P. & Allen, N.J. (1997). Commitment in the Workplace. Sage.
Moorhead, A., Steele, M., Alexander, M., Stephen, K. & Duffin, L. (1997). Changes at work: The 1995 Australian workplace and industrial relations survey. Melbourne: Longman.
Moses, B. (1997). Building a life-friendly culture. Ivey Business Quarterly, Vol. 62 No. 1: 44–46.
Nunnally, J.C. & Bernstein, I.H. (1994). Psychometric Theory (3rd Ed.). New York: McGraw Hill.
Oppenheim, A.N. (1994). Questionnaire Design, Interviews and Attitude Measurement (2nd Edition). Pitman.
Palmer, M.J. (2001). Why do we feel so bad? (Personnel management in the health care industry.) http://www.findarticles.com/cf_0/m0HSV/6_14/79788282/print.jhtml
Piotrkowski, C.S., Rapoport, R.N. & Rapoport, R. (1987). Families and work. In M.B. Sussman & S.K. Steinmetz (Eds.), Handbook of marriage and the family (pp. 251–283). New York: Plenum Press.
Tinsley, H.E. & Tinsley, D.J. (1987). Uses of factor analysis in counselling psychology research. Journal of Counselling Psychology, Vol. 34: 414–424.
Trochim, M.K. (2002). Construct Validity. http://www.socialresearchmethods.net/kb/constval.htm
Tsui, A.S., Pearce, J.L., Porter, L.W. & Tripoli, A.M. (1997). Alternative approaches to the employee-organization relationship: Does investment in employees pay off? Academy of Management Journal, Vol. 40: 1089–1121.
Wohl, F. (1997). A panoramic view of work and family. In S. Parasuraman & J.H. Greenhaus (Eds.), Integrating work and family: Challenges and choices for a changing world. Westport, CT: Greenwood Publishing Group.

The influence of personal variables on changes in driver behaviour

S. Heslop, J. Harvey, N. Thorpe & C. Mulley


Newcastle University, Newcastle, UK

ABSTRACT: There are several theories that relate to drivers’ risk-taking behaviour and why they might choose
to increase the level of risk in any particular situation. Focus groups were run to identify transient factors that might
affect driver risk-taking; these were recorded, transcribed and content-analysed to obtain causal attributions. Five
main themes emerged, which could then be sub-divided; these themes are discussed in light of existing theories.
It was found that the attribution to self was the most frequent, but that causal explanations were consistent with
the theories of zero-risk, risk homeostasis and flow.

1 INTRODUCTION AND BACKGROUND

1.1 Introduction

Driver behaviour has, for many years, been recognised as the main cause of death and injury on the roads. Sabey and Taylor (1980), for example, showed that road users were a causal factor in 94% of collisions. The road environment and the vehicle were causal factors in 28% and 8.5% of collisions respectively (the total is greater than 100% because crashes can be attributable to multiple causes). The driver has repeatedly been shown to play a dominant role in collision causation (Department for Transport, 2007).
The research reported here investigates which motivations and other transient factors are most salient with respect to affecting driver behaviour.

1.2 Behavioural adaptation

In a seminal paper, Taylor (1964) measured the galvanic skin response (GSR) of drivers in the following road types or conditions: urban shopping streets; winding country roads; arterial dual carriageways; peak and off-peak; day and night, and found that GSR, taken to be a measure of subjective risk or anxiety, was evenly distributed over time over the range of roads and conditions studied. His results suggest that driving is a self-paced task governed by the level of emotional tension or anxiety that drivers wish to tolerate (Taylor, 1964). It is now accepted that drivers adapt to conditions and state of mind: Summala (1996, p. 103) stated that ''the driver is inclined to react to changes in the traffic system, whether they be in the vehicle, in the road environment, or in his or her own skills or states.''
Csikszentmihalyi's (2002) flow model represents the following two important dimensions of experience: challenges and skills. When individuals are highly skilled in an activity and are faced with low challenges, the individuals experience a state of boredom. Conversely, when individuals are not skilled in an activity and are faced with significant challenges, they experience a state of anxiety. When individuals' skills and the challenges posed by an activity are evenly matched, the individuals experience a pleasurable flow state. Csikszentmihalyi's model (2002) suggests that under normal driving conditions drivers remain within a flow channel, balancing challenges, ultimately controlled by speed and attention (Fuller, 2005). Should the driving environment become more challenging, a reduction in speed or an increase in attention, perhaps achieved by aborting secondary tasks, can bring the driver back into a flow state. If it became less challenging, an increase in speed or a reduction in attention, perhaps achieved by adopting secondary tasks, can similarly ensure that the driver remains in a flow state. Thus, by varying their speed or attention, drivers can remain in flow when challenges and/or skill levels change.
In the context of road safety, an understanding of the circumstances under which drivers might be motivated to accept (knowingly or otherwise) increased risks or challenges is of particular interest. The role of motives in driver behaviour has been recognised in two main theories: the zero-risk theory (Näätänen and Summala, 1974) and the theory of task difficulty homeostasis (Fuller, 2000).
The zero-risk theory suggests that driver behaviour is a function of mobility, 'extra motives' and safety (Summala, 1998). On the one hand, mobility and 'extra motives' such as self-enhancement, time goals, thrill-seeking, social pressure, competition, conservation of effort, pleasure of driving and maintenance of speed and progress push drivers towards higher speeds (Summala, 2007). On the other hand,
warning signals control risk taking if safety margins that are learned through experience are compromised (Summala, 1998).
The theory of task difficulty homeostasis argues that drivers monitor task difficulty and aim to drive within an acceptable range of difficulty (Fuller, 2000). Safety margins may be affected by motives, through their influence on the upper and lower limits of acceptable task difficulty. Fuller (2000) lists motives that push drivers towards higher speeds and reduced safety margins in three categories: social pressure; critical threshold testing; and other motives. Social pressure includes: maintaining the speed of flow in a traffic stream; pressure from passengers; fear of arriving late for an appointment; a desire to drive like others; and a wish to avoid sanctions. Critical threshold testing includes a desire to test the limits, perhaps attributable to sensation-seeking needs, and a drift towards the limits, due to a lack of negative feedback associated with previous risky driving practices. Other motives include time pressure, time-saving and gaining pleasure from speed.
Research has been carried out to establish how drivers are affected by motivation in terms of either increased or reduced safety margins: when drivers are motivated by time saving, they drive faster, thus taking increased risks (see for example: Musselwhite, 2006; McKenna and Horswill, 2006); similarly, when angry, they adopt aggressive driving patterns and drive faster, thus reducing safety margins (see for example: Mesken, Hagenzieker, Rothengatter and de Waard, 2007; Stephens and Groeger, 2006).
Whilst theories of driver behaviour acknowledge that motivations play an important role in driver risk taking, there has been no research to determine which are the most salient motives and transient factors of influence other than in relation solely to speed (Silcock et al., 2000).

2 METHODOLOGY

Eight individuals with full UK driving licences who drove on a frequent basis participated in the study. The study group comprised six males and two females, with ages spread across the range 17–65. Data were recorded using a portable cassette recorder with microphone.

2.1 Design and procedure

Two unstructured focus group discussions were held. Participants were encouraged to talk about their own driving behaviour, and the circumstances under which they felt that their driving behaviour changed, giving examples where appropriate. Both focus groups were transcribed for content analysis, including attributional analysis. Attributional statements have been defined as ''any statement in which an outcome is indicated as having happened, or being present, because of some event or condition'' (Munton, Silvester, Stratton and Hanks, 1999). Statements meeting this definition were extracted from the transcripts. Where several different causes or effects were given for a particular relationship, attributions were extracted and coded separately.

3 RESULTS

1026 attributional statements were extracted from the transcripts and were reduced into major causal areas (attribution frequency in parentheses):

self (1831)
physical environment (660)
other road users (251)
vehicle (105)
road authority (105)

The self theme can be further broken down into the following:

extra motives (879)
flow state (332)
perception (233)
journey type (142)
capability (116)
time pressure (103)
experience (26)

Data were then coded within each of these themes. A selection of the attributional statements with other road users, vehicle and the roads authority as causal themes is shown in Table 1.
Self and physical environment themes had to be further divided in order to preserve the richness of the data. The physical environment theme was divided into the following sub-themes: physical environment, road condition, and road familiarity. To illustrate the presence of these causal themes, a selection of attributional statements is shown in Table 2. The self theme was divided and is shown in Figure 2. To illustrate the presence of 'self' in the sub-themes, a selection of attributional statements is shown in Table 3.

4 DISCUSSION

We have shown from this research that there are five main categories of causal areas, being the driver him or herself; the physical environment; other road users;

Table 1. Statements where the cause was coded under 'other road users', 'vehicle' and 'roads authority'.

Attributional statement | Identity code | Freq.

[a] For 'other road users' as a cause
I'd accept a smaller gap, if I was waiting to get out | High traffic levels | 68
(the old bloody biddies driving like idiots) really, really gets up my nose | Incompetence | 17
there was no traffic, at all (on M1) so you just put your foot down | Low traffic levels | 12
then I saw a straight and took a risk that I wouldn't usually have taken | Slow car in front | 54
If I'm in a traffic jam I turn the radio up | Traffic jam | 22

[b] For 'vehicle' as a cause
driving the big car is helped because I've got a much better view | Large vehicle | 45
they all assume that I'm a builder or plasterer and therefore behave in a lewd way | White van | 19

[c] For 'roads authority' as a cause
I would definitely stick to the speed limit if it was wet | 60 mph limit | 22
I'm unwilling to overtake on those roads because partly they remind you how many people have been killed there | Road safety measures | 25
Beach Road (I think that's a forty limit) is one of my favourites for (breaking speed limits); It's . . . in effect, a three lane but it's configured as a two-lane road | Limit too slow | 20

Table 2. Statements where the cause was coded under 'physical environment', 'road condition', and 'road familiarity'.

Attributional statement | Identity code | Freq.

[a] Physical environment theme
I fell asleep on the motorway | Dual/motorway | 123
I drive with smaller safety margins in London | London | 18
You just go round them faster (as you drive along the route) | Many roundabouts | 15
In Newcastle old ladies are terribly slow and that irritates me | Newcastle centre | 10
Trying to overtake on those two stroke three lane roads is quite nerve racking | Open rural roads | 55
I'd accept a smaller gap, if I was in a hurry | Priority junction | 28
(I can enjoy the driving when on) familiar rural roads (in Northumberland) | Rural roads | 33
Driving round streets where people live, (I'm much more cautious) | Slow urban roads | 24
I got to the end of the Coast Road, and thought, I don't actually remember going along the road | Tyne and Wear | 40
Put me on a bendy country road: I'll . . . rise to the challenge | Tight rural roads | 98
I was a bit wound up so I was sort of racing, traffic light to traffic light | Urban roads | 26
(you've got 3-lane 30 mph roads; I think it is safe to drive there at 60 mph) | Wide urban roads | 14

[b] Road condition theme
(there was black ice) and I went up the bank and rolled the car | Adverse | 25

[c] Road familiarity theme
(the old people that come into the city centre here) changes the way I drive | Familiar | 88

the roads authority and the vehicle being driven. Of these, by far the most frequently occurring was 'self'. Each of these is considered below in relation to existing theory and evidence.
'Self' contained seven sub-themes: perception; capability; journey type; time pressure; extra motives; flow state; and experience; each of these is considered below.
In terms of perception, whether the situation felt safe or felt risky was frequently reported. Perception is a stage inherent within most models of driver behaviour (Fuller, 2000; Summala, 1998 and 2007; Näätänen and Summala, 1974). This study has found that people adjust their speed upwards or downwards according to the weather or their knowledge of the road; thus zero-risk is an inappropriate explanation prima facie, as the situation does not involve the opposing motives proposed by Summala (1998). However, the idea that drivers adapt to the environment in order to remain within a flow state or avoid boredom might be a better explanation, as would the homeostasis theory (Csikszentmihalyi, 2002; Fuller, 2000). Capability was also a frequently reported cause: it was often perceived to have increased (such as being in a larger vehicle) or reduced (through fatigue). If capability increased, drivers are expected to adapt by either

Table 3. Statements where the cause was coded under the umbrella theme, 'self'.

Attributional statement | Identity code | Freq.

[a] Perception theme
In adverse weather I drive more slowly | Feels risky | 102
because I knew it really well, I would drive faster | Feels safe | 123

[b] Capability theme
you drive differently in the small car | Reduced | 59
(I can drive faster when alone in the car because no distractions) | Increased | 57

[c] Journey type theme
I find I can push out into traffic when driving a mini bus | Club trip | 46
I actually unwound driving home | Commute home | 15
When I'm on a long journey I drive faster | Long journey | 20
I can enjoy the driving when it's pretty rolling countryside | Leisure trip | 38

[d] Time pressure theme
you have to (drive more aggressively (in London)) | Yes | 62
(I was driving calmly) partly because there wasn't a set time I had to be home | No | 41

[e] Extra motives theme
(Everyone expects you to abuse a minibus a bit, like U-turns in the road) | Act as 4×4 van driver | 32
I drive my fastest when I'm on my own | Alone in car | 13
(GPS) encourages me to keep the speed up | Boredom relief | 40
Yes, (I drive at a speed I'm comfortable at regardless of the speed limit) | Flow seeking | 52
if I'm following someone who is more risky (I will raise my risk threshold) | Following someone | 17
If somebody's driving really slowly I'll judge whether I can get past safely | Maintaining progress | 222
(Sometimes I find if you turn it up) you're just flying along | Music | 21
I was really upset, and I drove in a seriously risky manner | Negative emotions | 17
I know I can get distracted if I've got passengers in the car | Passengers | 11
I only drive slowly with you (because) I'd want you see me as a good driver | Show off safe driving | 31
I'll just tear down there (in order to impress my friend) | Show off risky | 18
I saw the traffic jams as a means of winding down and actually unwound | Relaxation | 57
I would hang back (if I saw risky driving in front of me) | Risk aversion | 110
I drive fast on bendy country roads | Risk seeking | 103
I just wanted to show him he was an idiot (so chased him) | Road rage | 10
I was starting to get really sleepy and I blasted the radio up really loud | Fatigue | 11

[f] Flow state theme
In adverse weather I drive more slowly | Anxious | 108
(If I'm in a traffic jam and it's boring) I annoy other drivers by . . . (singing) | Bored | 224

[g] Experience theme
Almost hitting the slow car in front makes me concentrate more for a while | Accident or near miss | 16

increased speed or reduced concentration; again these ideas are consistent with both flow and homeostasis explanations (Csikszentmihalyi, 2002; Fuller, 2000).
Extra motives, illustrated in Table 3, include: boredom relief, flow-seeking, maintaining progress, negative and positive emotions, penalty avoidance, relaxation, risk-aversion, risk and thrill-seeking and tiredness. The theory of flow (Csikszentmihalyi, 2002) would predict boredom relief motives to result in either faster driving or the adoption of secondary tasks, in either case reducing safety margins; flow-seeking motives should encourage faster driving when the challenges are low but slower driving when the challenges are high. The findings here would support this explanation, but they would also support the homeostasis and zero-risk theories, both of which emphasise the motivational tendencies propelling drivers to drive faster, such as sensation-seeking or to avoid lateness (Summala, 2007; Fuller, 2000; Näätänen and Summala, 1974). In our study, negative emotions such as anger resulted in drivers effecting more aggressive behaviour and hence reduced safety margins, whereas positive emotions had the reverse effect and resulted in increased safety margins; this is consistent with other findings (Underwood et al., 1999; Stephens and Groeger, 2006). Other motives such as penalty avoidance, relaxation, tiredness and risk-aversion all led to reduced speeds but probably increased boredom and took drivers out of their flow channel or band of acceptable task difficulty; again this is consistent with other findings (van der Hulst et al., 2001). Risk-seeking motives are likely to encourage driving at the upper bounds of the flow channel or band of acceptable task difficulty (Csikszentmihalyi, 2002).

Journey type was an important causal theme in our findings and included leisure trips, long journeys and commuting. Motivations related to different journey types have associated and different effects on driver behaviour: for example, club trips (found in this study) might encourage risky driving amongst peers (Arnett et al., 1997) whilst leisure trips may be expected to encourage a generally calmer and more relaxed driving manner. In general, for any journey type, whether the driver was under time pressure or not (including time-saving) was a frequently nominated factor of influence. Motorists are likely to drive faster and more aggressively when under time pressure, and by implication in a more relaxed fashion when not, as was found here (Musselwhite, 2006; McKenna and Horswill, 2006).
Flow state was found to be a key theme: whether the situation resulted in the driver feeling bored or anxious was reported frequently. The flow model (Csikszentmihalyi, 2002) suggests that a state of boredom or anxiety is not pleasant, and as such drivers are expected to look for opportunities to achieve a flow state either by adjusting task demands or attention, as our examples illustrate.
Even car insurance premia reflect experience, so unsurprisingly it was also found to be a causal influence, but it was only represented in our data by whether the driver had experienced an accident or near miss. This also fits with the commonly observed phenomenon of drivers slowing down after observing the evidence of an accident, only to speed up again later. Initial findings support the suggestion that drivers adopt more risk-averse attitudes after such events and as such are expected to increase safety margins for a period, driving within the lower limits of their flow channels; this explanation fits all three theories.
Other road users were also cited frequently, with high traffic levels the most frequently nominated cause. Predictably, high traffic levels were found to affect anger, violations and accidents; being held up by a slower car in front and responding aggressively were also causes cited in our study, again consistent with other research (Underwood et al., 1999; Bjorklund, in press). Additional causes within this theme included other drivers' incompetent or risky behaviours and traffic jams affecting anger, and low traffic levels encouraging higher speeds via reduced challenges (Bjorklund, in press; Csikszentmihalyi, 2002). That other road users were found to be a main causal theme in the determination of driver behaviour supports both the zero-risk and risk homeostasis theories, where perception of the environment (including other road users' behaviour) is a key stage within the models (Fuller, 2000; Summala, 2007; Näätänen and Summala, 1974).
The vehicle being driven, in terms of both size and image, was a factor of influence in our study. Driving a large vehicle appears to result in faster driving, presumably from increased visibility and reduced challenges, whilst the opposite appears to be the case when driving a small vehicle (Csikszentmihalyi, 2002). Vehicle image appeared to influence driver behaviour if the vehicle being used wasn't the driver's usual vehicle. In light of work by Newnam et al. (2004), drivers could be expected to drive at reduced speeds in non-private vehicles, but our findings indicate the opposite to be true.
The road authority can self-evidently be an influence, for example in terms of safety measures, traffic calming measures, chevrons and speed limits, and these in fact were frequently nominated causal factors. The fact that drivers nominated these as causal factors suggests that they are likely to slow down in these situations (McKenna and Horswill, 2006), possibly because of fear, responsibilities, correction of risk misperceptions and associations between warnings and increased challenges. Drivers also stated as a causal factor that the speed limit is too low; flow theory would suggest that they would weigh up the costs and benefits of speeding versus experiencing a state of boredom (Csikszentmihalyi, 2002).
The physical environment theme was divided into the physical environment, road condition and road familiarity. Road type was frequently reported as an influence, particularly motorways or dual carriageways, followed by tight rural roads and open rural roads; these different road types have different design speeds to which drivers are expected to adapt: for example, we found drivers saying they drove faster on open roads with high visibility, and slower on tight roads where it is reduced (Csikszentmihalyi, 2002). Within the road condition sub-theme, the only factor cited in this study was adverse conditions, such as water or snow on the road, where drivers would be expected to slow down, taking more care (Csikszentmihalyi, 2002; Stradling, 2007). In terms of road familiarity, drivers said they would drive faster on familiar roads, consistent with all three main theories, which suggest that challenges are reduced with familiarity (Stradling, 2007); this can be accommodated by an increase in speed and/or a reduction in attention. We also found road familiarity may be associated with increased confidence and a sense of ownership; an example of this being ''the old people that come into the city centre here change the way I drive''.
To summarise, the finding of self as the dominant cause is an interesting attribution, not entirely consistent with findings elsewhere that most drivers attribute blame to others; the latter is consistent with self-serving attributions and the fundamental attribution error in attribution theory (Hewstone, 1989). Most of our findings are consistent with all three theories of driver behaviour and flow (Fuller, 2000;

Summala, 2007; Csikszentmihalyi, 2002; Näätänen and Summala, 1974).

5 CONCLUSIONS

This research forms the preliminary base for a longer and more detailed study, but this required the identification of themes from which to proceed, and these have been identified here. In all, five major themes have been identified which sub-divide into 13 smaller themes, which can form the basis for evaluation and judgement in further interviews and surveys. However, even with these themes and our data supporting them as often very frequently occurring, it is not possible to identify which of the three main explanatory frameworks (flow theory, zero-risk theory or homeostasis) offers the best explanation for either our themes or their frequency of occurrence. All these theories seem to tell us that drivers may be motivated to drive faster, but whether this is to maintain flow, or for some form of extra motive, or as critical threshold testing, cannot easily be ascertained, even with the causal reasons provided in our attributional analyses.
The study so far has some limitations: we need more focus groups, but directed more towards causal factors, particularly for 'self'. Once these have been conducted and analysed, questionnaires can be built to test some of the more pertinent aspects of self and other causes. Additionally, we have focussed solely on drivers in one part of the United Kingdom, so more work will be done in relation to other driving areas, such as capital cities and rural environments.

REFERENCES

Arnett, J.J., D. Offer, et al. (1997). Reckless driving in adolescence: 'State' and 'trait' factors. Accident Analysis & Prevention 29(1): 57–63.
Bjorklund, G.M. (in press). Driver irritation and aggressive behaviour. Accident Analysis & Prevention.
Csikszentmihalyi, M. (2002). Flow: The classic work on how to achieve happiness. Rider.
DfT (2007). Road Casualties Great Britain: 2006—Annual Report. The Stationery Office.
Fuller, R. (2000). The task-capability interface model of the driving process. Recherche—Transports—Sécurité 66: 47–57.
Fuller, R. (2005). Towards a general theory of driver behaviour. Accident Analysis & Prevention 37(3): 461–472.
Hewstone, M. (1989). Causal Attribution: From cognitive processes to collective beliefs. Oxford: Blackwell.
McKenna, F.P. and M.S. Horswill (2006). Risk taking from the participant's perspective: The case of driving and accident risk. Health Psychology 25(2): 163–170.
Mesken, J., M.P. Hagenzieker, et al. (2007). Frequency, determinants, and consequences of different drivers' emotions: An on-the-road study using self-reports, (observed) behaviour, and physiology. Transportation Research Part F: Traffic Psychology and Behaviour 10(6): 458–475.
Munton, A.G., J. Silvester, et al. (1999). Attributions in Action: A Practical Approach to Coding Qualitative Data. John Wiley & Sons Ltd.
Musselwhite, C. (2006). Attitudes towards vehicle driving behaviour: Categorising and contextualising risk. Accident Analysis & Prevention 38(2): 324–334.
Näätänen, R. and H. Summala (1974). A model for the role of motivational factors in drivers' decision-making. Accident Analysis & Prevention 6(3–4): 243–261.
Newnam, S., B. Watson, et al. (2004). Factors predicting intentions to speed in a work and personal vehicle. Transportation Research Part F: Traffic Psychology and Behaviour 7(4–5): 287–300.
Sabey, B.E. and H. Taylor (1980). The known risks we run: the highway. Crowthorne: TRL Ltd.
Silcock, D., K. Smith, et al. (2000). What limits speed? Factors that affect how fast we drive. AA Foundation for Road Safety Research.
Stephens, A.N. and J.A. Groeger (2006). Do emotional appraisals of traffic situations influence driver behaviour? Behavioural Research in Road Safety: Sixteenth Seminar. Department for Transport.
Stradling, S.G. (2007). Car driver speed choice in Scotland. Ergonomics 50(8): 1196–1208.
Summala, H. (1988). Risk control is not risk adjustment: The zero-risk theory of driver behaviour and its implications. United Kingdom: Taylor & Francis.
Summala, H. (1996). Accident risk and driver behaviour. Safety Science 22(1–3): 103–117.
Summala, H. (2007). Towards understanding motivational and emotional factors in driver behaviour: Comfort through satisficing. In P.C. Cacciabue (ed.), Modelling Driver Behaviour in Automotive Environments—Critical Issues in Driver Interactions with Intelligent Transport Systems. Springer Verlag: 189–207.
Underwood, G., P. Chapman, et al. (1999). Anger while driving. Transportation Research Part F: Traffic Psychology and Behaviour 2(1): 55–68.
van der Hulst, M., T. Meijman, et al. (2001). Maintaining task set under fatigue: a study of time-on-task effects in simulated driving. Transportation Research Part F: Traffic Psychology and Behaviour 4(2): 103–118.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

The key role of expert judgment in CO2 underground storage projects

C. Vivalda
Schlumberger Carbon Services, Clamart, France

L. Jammes
Schlumberger Carbon Services, Paris La Défense, France

ABSTRACT: Carbon Capture and Storage (CCS) is a promising technology to help mitigate climate change
by way of reducing atmospheric greenhouse gas emissions. CCS involves capturing carbon dioxide (CO2) from
large industrial or energy-related sources, transporting, and injecting it into the subsurface for long-term storage.
To complement the limited knowledge about the site and the limited experience of the operational and long-term
behavior of injected CO2, a substantial involvement of expert judgment is necessary when selecting the most appropriate
site, deciding the initial characterization needs and the related measurements, interpreting the results, identifying
the potential risk pathways, building risk scenarios, estimating event occurrence probabilities and severity of
consequences, and assessing and benchmarking simulation tools. The paper sets the basis for the development
of an approach suited for CCS applications and its role in the overall framework of CO2 long-term storage
performance management and risk control. The work is carried out in the frame of an internal research and
development project.

1 INTRODUCTION

Carbon Capture and Storage (CCS) is a promising technology to help mitigate climate change by way of reducing atmospheric greenhouse gas emissions. CCS involves capturing carbon dioxide (CO2) from large industrial or energy-related sources, transporting it, and injecting it into the subsurface for long-term storage.
At the current state of the art, the behaviour of CO2 and injection-related effects are still not fully understood (IPCC, 2005). Extensive research activity is underway aiming at bridging these knowledge gaps, for instance characterizing the short- and long-term chemical interactions between CO2, formation fluids, and the rock matrix, or the mechanical effects due to increasing the formation pressure (Berard et al., 2007). Research and demonstration projects, along with more field data and history matching, will provide more accurate characterization of these physical processes and will permit continuous updates and tuning of initial evaluations.
Natural analogs, know-how from other industrial sectors, and the experience gained in pilot/demonstration projects where data gathering and measurement campaigns are being carried out are the references today for gaining a better understanding of the behavior of the injected CO2 at a storage site.
To complement the current limited knowledge of the site and of the behaviour of CO2 underground, a substantial involvement of expert judgment is necessary when selecting the most appropriate site, deciding the initial characterization needs and related measurements, interpreting results, identifying potential risk pathways, building risk scenarios, estimating event occurrence probabilities and severity of consequences, and assessing and benchmarking simulation tools.
The study of CO2 injection and fate currently has very few technical experts, and there is a need to capitalize on expertise from other domains, such as oil and gas, nuclear waste disposal, and chemicals, to obtain relevant and valuable judgments.
Formalized methods, able to elicit the sought information and data from individual expertise in a well-defined field and to analyze and combine the different experts' opinions into sound, consistent final judgments, seem the best way to achieve this.
Several bibliographic references concerning expert judgment elicitation processes exist in the literature (Simola et al., 2005), as well as practical applications in some industrial sectors, such as oil and gas, nuclear, and aerospace. The approaches presented, quite generic and in principle almost independent of the industrial area, represent a valuable background for the development of expert judgment elicitation processes dedicated to CO2 storage studies.
These general methodologies need to be made more specific to be applied in the domain of CO2 geological storage, whenever relevant data are lacking or

unobtainable, while the issues are very serious and/or very complex.
The paper sets the basis for the development of an approach suited for CCS applications and its role in the overall framework of CO2 long-term storage performance management and risk control. The work is carried out in the frame of an internal research and development project, and examples of potential applications are presented.

2 STATE OF THE ART

At the time the authors are writing (2008), there are no published formalized methods for expert judgment elicitation in the domain of CO2 storage. Nevertheless, expert judgment is tacitly or explicitly employed in studies and projects to compensate for lacking or poor information and data.
Experts are currently required to assess existing data related to a site, such as geophysics data, geological reports, petro-physical properties . . . and decide on further data acquisition and site characterization when a site is screened for suitability to store CO2. They are also asked: to estimate physical quantities, such as rock permeability, from measurements and provide degrees of confidence or probability density functions for the estimated quantities; to intervene in the qualitative and semi-quantitative steps of risk assessment to identify risk pathways and scenarios, and, when not otherwise possible, to assess their probability of occurrence and their potential consequences. They are expected to make choices among models to represent the actual CO2 behavior, for example during injection and storage in the underground formation over a long time period, and to assess and benchmark simulation tools.
CCS projects in Australia (Bowden & Rigg, 2004) are an example. Experts have been involved in the process of selecting the most suitable storage sites. The expert panel was formed by individuals carrying the relevant competencies, such as geomechanics, geochemistry, geology, project management, etc. and led by normative experts. A pool of experts has also filled, and is maintaining, a risk register for one selected site (Otway). The Weyburn project in Canada is another example (Whittaker, 2004). In this project experts have been questioned to identify the list of features, events and processes (FEPs) relevant to the project (Savage et al., 2004) and to build risk scenarios. Other examples, such as Sleipner in Norway or Illinois in the United States, can also be mentioned.
In all cases, no formal process was followed, but the quality of the results was assured by the excellence and common sense of the people on the panels.
Notwithstanding the importance of these first references and applications, it seems that a well-established and documented process should add value to the current practice and provide justified assumptions and results for external audit.

3 CHALLENGES

Using expert judgment to help overcome insufficient, missing, or poor data is challenged by the following:

– The experts are questioned on a domain that is still being discovered;
– CCS experts are contributing to the discovery of the domain and can be biased by their own understanding;
– There are no available statistics (e.g. on unwanted events such as leakage from the formation to the atmosphere, or on fault and fracture activation). However, new data from research and demonstration projects are recorded: they have to be continually mined to update the first subjective estimations;
– Experimental results cover a limited number of specific phenomena, for example wellbore cement degradation (Kutchko et al., 2007; Barlet-Gouédard et al., 2007). Extrapolations to the overall site behavior are not yet viable, and their limited number as well as confidentiality makes it difficult to build statistics. In spite of that, experts should be able to benefit from this information when providing judgments;
– Numerical modeling of physical phenomena is underway and is rapidly progressing, despite the limited feedback from the field necessary to calibrate and history match the models. This implies that current simulation models should consider the uncertainty of the input parameters and make explicit the degree of confidence of the results. The role of experts in assessing the models as well as providing estimations of input parameters is crucial;
– Generic risk pathways have been imagined, such as CO2 escape into potable aquifers due to wellbore failure, or vertical migration and leakage through abandoned or closed wells in the vicinity (Benson et al., 2002), but they are not yet fed back and quantified by experience. Starting from these generic pathways, experts may be required to build site-specific scenarios and assess their probability of occurrence and the magnitude of their consequences;
– Each site is ''unique'' (not reproducible). Knowledge about the structure and the properties of the geological system is usually obtained by interpreting existing relevant geophysical data, such as seismic, and measuring underground strata characteristics—mechanical, petro-physical, fluid flow properties, chemical composition. . . Experts will

always be required to interpret and assess site-specific data. Extrapolations to or from other sites are normally risky and shall be appropriately documented.

These challenges influence the way the formal process for expert judgment elicitation is built.

4 EXPERT JUDGMENT ELICITATION PROCESS

The internal project is currently in the phase of setting the requirements for the development of an expert judgment elicitation process able to:

– Address the challenges described above;
– Capture the most relevant information and data from the experts;
– Be well balanced between time and costs and the quality of the captured information;
– Address the specific needs and objectives of expert judgment use;
– Be reproducible.

The formal process will be based on existing practices and methods (Cooke et al., 1999; Bonano et al., 1989), which will then be customized to be implemented on CO2 storage projects.
The first is represented by natural and industrial analogs. These have the advantage of presenting some characteristics similar to CO2 storage, such as leakage/seepage mechanisms for natural analogs, long-term repositories for wastes, and gas injection into formations for gas storage; but also important differences, such as the CO2 being already in place for natural analogs, the short storage period (a few months to years) and reversible process for gas storage sites, and the dangerous substances involved for waste repositories. The second area is represented by other industrial applications, such as nuclear, aeronautics, chemical, oil and gas. . . The experts in these two areas need accurate training on the system characteristics, performance, and challenges to avoid underestimating the differences, such as the peculiar CO2 physical and chemical properties, long-term storage requirements, etc. and drawing biased conclusions.
The third area is represented by expertise outside industrial applications, such as policy and regulations, insurance, sociology, environmental protection. . . This contribution is very valuable, for example, when assessing the consequences of a given scenario and their impact on the population, the environment, etc. In this case the experts need a full understanding of the overall system and its challenges, spanning from the CO2 storage site to the legal framework, the local population habits, the flora and the fauna, etc.
On top of these three areas, a valuable set of experts
While building the process, it is important to define is found among the CCS practitioners, i.e. those who
the domains it will be applied to, and ensure it covers have or are participating in CCS research, demonstra-
all of them. tion and future large-scale projects. Their know-how
The following domains are identified: and feedback is very precious.
– Measurements interpretation (e.g. CO2 plume size, Depending on the objectives of the elicitation pro-
mass in place, etc.); cess, a balanced mix of the above expertise needs to
– Physical parameters estimation (e.g. permeability, be reached. The presence of experts outside the CCS
fault/fracture extent, etc.); community already accustomed to giving judgments
– Risk related parameters estimation (e.g. scenario should positively contribute to keep the approach up-
identification, likelihood and severity estimation, to-date and smooth the way towards implementation
etc.); of the elicitation process in this new field.
– Conceptual and mathematical models selection and Following the classical approach for expert judg-
validation (e.g. CO2 transport model, mechanical ment elicitation (Bonano, 1989) three main kinds of
model, etc.). experts will be selected:

All along these domains, the experts are also – Generalists: knowledgeable about various aspects
required to qualify or quantify uncertainty, e.g. in of the storage site and its performance;
qualitative assessment, in numerical values for key – Specialists: at the forefront of specialties relevant to
parameters, in predictions, etc. the storage site;
The process will be deployed into the following four – Normative experts: trained in probability theory,
main steps. psychology and decision analysis. Assist general-
ists and specialists with substantive knowledge in
articulating their professional judgments.
4.1 Experts selection
Typical specialties necessary to create a panel
There is a need to find a balanced way to select the for CO2 storage projects is geology, geomechanics,
experts knowing that the expertise on CO2 geologi- geochemistry, geophysics, petro-physics, monitoring
cal storage is being built in parallel to the increase techniques and instrumentations, materials, etc.
in knowledge. Three main external areas can be The size of the expert panel varies according to the
mentioned where useful know-how exists. domain and objects to be elicited.

307
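The three-role panel structure just described lends itself to a simple mechanical check when assembling a panel. The sketch below is purely illustrative: the role names restate those in the text, while the function, record layout and example members are assumptions, not part of the paper.

```python
# Illustrative sketch (names and layout are assumptions): a panel is a
# list of (name, role, specialty) records, and a panel is considered
# complete when the three classical roles all appear at least once.

REQUIRED_ROLES = {"generalist", "specialist", "normative"}

def panel_is_complete(panel):
    """True if every required role appears at least once in the panel."""
    roles_present = {role for _name, role, _specialty in panel}
    return REQUIRED_ROLES <= roles_present

panel = [
    ("A", "generalist", "storage-site performance"),
    ("B", "specialist", "geomechanics"),
    ("C", "specialist", "geochemistry"),
    ("D", "normative", "decision analysis"),
]
# panel_is_complete(panel) -> True; dropping expert D would make it False
```

A real selection procedure would of course also weigh panel size and the balance of specialties against the objects to be elicited, as the text notes.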
4.2 Experts elicitation instruments and modes

The mode of carrying out the experts' elicitation depends on the purposes of the elicitation, the domains it addresses, the level of general knowledge about the objects to be questioned, and the available expertise relevant to those objects.
It is envisaged to use at least two approaches to elicit knowledge. The first is to set up expert panels, so that all the experts are gathered together at least once and discussion and confrontation are encouraged. The second is a survey, in which the experts' judgments are elicited independently but not confronted. The results will then be analyzed by the normative expert(s).

4.3 Experts judgments analysis and combination

The judgments provided by the experts will then be analyzed and combined.
Methods to combine expert judgments exist in the literature. They vary from very simplistic methods, based on averaged values of the experts' estimations that give each expert the same weight, to more complicated and informative ones (Gossens & al., 1998), in which the experts are weighted and their judgments probabilistically aggregated.
The choice of the most appropriate method to combine the different judgments will vary depending upon the final use of the estimations. Three examples are presented below.
The first example considers a pool of experts gathered at the initial phase of a project to compile a risk register. Once the list of hazards, with their potential causes and consequences, is identified and recorded, the elicited outcomes will be the likelihood and severity of each identified hazard. These two parameters, at this stage of the project, are poorly known and reference statistics do not exist, so they will be classified using indexes (e.g. from 1 to 5), running for example from improbable to probable for likelihood and from negligible to catastrophic for severity. Each expert is asked to index the likelihood (IL) and severity (IS) of each hazard and to provide a credal interval (e.g. [IL − 1; IL + 1]). The judgments are then combined on the indexes and, due to the broad level of the analysis, a simple combination method, like averaging the individual indexes, will be sufficient to obtain an integer mean index Im. The aggregated credal intervals can also be evaluated.
The second example concerns a project where an existing well is being evaluated for potential re-use as an injection well (Gérard & al., 2006; Van Der Beken & al., 2007), and where cement evaluation logs are available. In this case, the permeability of the cement is one of the most significant factors driving the choice of re-using the well, because the probability of having CO2 leak through the cement depends on it. Here, more precise judgments are required, and experts can be asked to estimate the physical quantity and to provide the degree of confidence or the probability density function of this parameter in different sections of the well. More refined methods, such as the Classical Model suggested by Cooke & al. (1999), are better suited to this case, in order to find a rational consensus on the final distributions.
The third example addresses the estimation of the probability of a fault opening under certain pressure conditions, when data from site seismic measurements are very poor and the mechanical properties are uncertain. In this case, the experts may give very different judgments, and seeking consensus is less informative than estimating the degree of variability among the experts' best estimates. This provides a measure of the overall uncertainty around that variable, on which basis actions can be decided, such as asking for additional measurement campaigns to reduce the uncertainty bounds.

4.3.1 Uncertainty measures
Uncertainty associated with the judgments has to be measured and considered when the results of the elicitation process are analyzed. Efficient measures to look at are the following (Hoffman & al., 2006):
– Extent of agreement between experts and external attributions (e.g. analogues): use of relevant analyses (e.g. regression) to explain differences;
– Degree of variability among experts' best estimates: statistical measure of the consensus of expert opinion;
– Individual uncertainty: measure of the certainty experts have about their own attribution judgment.

The uncertainty is described by probability distributions or credal intervals when it is numerically quantifiable, or by qualitative statements, such as large, small . . . when it is not. The treatment of quantified uncertainties is done using traditional probabilistic and statistical methods.

4.4 Updating initial estimations

An important feature of a geological site is that its description becomes more and more accurate through the acquisition of new data, through monitoring, and through the running of calibrated simulation models. This implies that some of the objects initially evaluated by experts are later measured or calculated, and also that some judgments have to be updated because more information has become available. A positive outcome is that the uncertainty surrounding the overall CO2 storage system is narrowed.
Updating sessions are therefore necessary and valuable. They may require the involvement of only one or a few experts to inform the system of new consolidated data. Or they may require a pool of experts to re-run an elicitation process that considers all the new information at hand. The composition of the pool may be the same as at the former elicitation or may include different experts. It can even be modified in size and types of expertise when deemed necessary.

4.5 Structure

The process for expert elicitation is structured starting from an inventory of available practices and methods, properly combined in a decision diagram that indicates how and when they can be used with respect to the site life cycle and the corresponding path to follow.
The purpose of structuring the process in this way is to give each specific project the flexibility to select the most appropriate path, according to the degree of detail and precision expected from the elicitation process and the project data available at the moment of the assessment. An excerpt of the decision diagram is provided in Figure 1.

Figure 1. Excerpt of decision diagram for structured expert elicitation process selection.

The decision diagram is deployed all along the relevant phases of the CO2 storage site life cycle, which are the following: a) site selection, b) site screening, c) characterization, d) design, e) injection, and f) long-term storage/surveillance.


5 EXPERT JUDGMENT AND CO2 STORAGE PERFORMANCE MANAGEMENT

A methodology for performance management and risk control of CO2 storage sites, together with related models and tools, is being developed within Carbon Services; it includes a formal process to elicit experts' judgment and extract knowledge in terms of qualitative or quantitative appraisals (Vivalda & al., 2008).
The core of the methodology is the identification of the possible short-, mid- and long-term losses in system performance and their description in terms of reference risk pathways, relevant scenarios and their probabilistic evaluation. A risk pathway is defined as the mechanism of exposure of a receptor (potable aquifers, soils, living areas) to a stressor (CO2 leakage and related events). A scenario is defined as a hypothetical sequence of events constructed for the purpose of focusing attention on a causal process. Figure 2 shows an example of a CO2 storage site and a leakage pathway, i.e. CO2 vertical migration from a deep saline formation to a potable water aquifer through an abandoned well.
Experts are expected to lead the identification of the risk pathways and to build representative scenarios. In addition, for each representative risk pathway/scenario, the likelihood/probability of occurrence and its impact (i.e. the severity of its consequences) are sometimes estimated using expert judgment. This occurs quite often when the project is in its early phases and the available data and statistics are very poor.
Another step where the experts are required is the identification and selection of mitigation measures (preventive or corrective) when there is a need to lower or control some risks. Selecting the measures is a decision-making process in which the advantages of introducing a measure should be balanced against its effectiveness.

Figure 2. Example of CO2 storage site and risk pathway: CO2 migrating from the deep saline formation towards a potable water aquifer through caprock fractures, a fault, or an abandoned well; injector and monitoring wells are also shown.
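Two of the combination schemes discussed in Section 4.3 can be sketched in a few lines: the equal-weight index averaging of the first example, and the spread of experts' best estimates used as an uncertainty measure in the third example. The function names, the 1-5 scale bounds and the sample values below are illustrative assumptions, not values from the paper.

```python
# Sketch of two combination schemes from Section 4.3 (illustrative
# names and values). First example: equal-weight averaging of expert
# indexes on a 1-5 scale, with credal intervals [I-1, I+1] aggregated
# the same way. Third example: the spread of the experts' best
# estimates as a measure of the overall uncertainty on a variable.
from statistics import mean, pstdev

def combine_indexes(indexes):
    """Integer mean index I_m from equal-weight averaging."""
    return round(mean(indexes))

def combine_credal_intervals(indexes, lo=1, hi=5):
    """Average the per-expert credal bounds [I-1, I+1], clipped to the scale."""
    lows = [max(lo, i - 1) for i in indexes]
    highs = [min(hi, i + 1) for i in indexes]
    return mean(lows), mean(highs)

def expert_spread(best_estimates):
    """Variability among experts' best estimates (population std. dev.)."""
    return pstdev(best_estimates)

likelihood_indexes = [2, 3, 3, 4]                        # four experts, one hazard
i_m = combine_indexes(likelihood_indexes)                # -> 3
interval = combine_credal_intervals(likelihood_indexes)  # -> (2, 4)
spread = expert_spread([0.10, 0.25, 0.40])               # fault-opening estimates
```

A performance-weighted scheme such as Cooke's Classical Model would replace the plain `mean` with calibration-based expert weights; that machinery is beyond this sketch.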
The overall approach for CO2 performance management and risk control considers uncertainties, including those introduced by the judgments.

5.1 Expert judgment and modeling

Modeling is the core of the CO2 storage site performance management and risk control methodology. Its main purpose is to capture the main features of the overall site and to simulate the most important processes induced by CO2 injection, so as to represent reality in the most truthful way. Simulation models are used, for example, to quantitatively evaluate the risk pathways/scenarios. To take into account today's limited knowledge, the models have to be probabilistic and include uncertainty analysis. Experts are involved in the process of selecting the most appropriate and representative models.
The quality of the results depends on the quality of the simulation models and tools, and the role of experts in benchmarking the tools is significant. The decision diagram for structured expert elicitation process selection, as it is conceived today, does not address methods for model assessment, because these are of a different nature; a dedicated approach will be investigated.


6 EXAMPLES OF POTENTIAL APPLICATIONS

A few examples of potential applications of the process during the CO2 storage site life cycle are given below.

6.1 Site selection

Site selection corresponds to the phase where the sites suitable for CO2 storage are selected among a number of candidates against a list of comparison criteria. During this phase, the experts assess the available data, and they are explicitly questioned about the weights of the comparison criteria.

6.2 Site screening

Site screening corresponds to the phase where the site is described on the basis of the available information. The experts interpret existing data and form an initial understanding of the earth model. As the domain is new, the skill of domain experts has to be developed, and it seems advisable that, instead of a single expert, a panel of domain experts is formed and their assessments aggregated. The experts semi-quantitatively assess the likelihood and the severity of the hazards identified through e.g. a tabular process (e.g. risk register, . . . ), as is typical of the initial performance analysis.

6.3 Site characterization

Site characterization corresponds to the phase where additional measurements are made, following the directions given during the site screening phase on the basis of the preliminary performance assessment. In this phase experts interpret measurements. To carry out a detailed performance assessment, they identify risk pathways and scenarios. Experts are questioned about the probability of occurrence and the magnitude of the consequences of risk pathways/scenarios.
During this phase a static model can be built and dynamic simulations run to quantitatively assess the risk pathways or scenarios. The involvement of a group of experts in creating the model is crucial to take maximum profit of the various expertises and to solve conflicts when modeling choices are made, such as the size of the 3D area to be described, the inclusion of surface properties, the inclusion of distant wells, etc.

6.4 Long-term storage

Long-term storage corresponds to the phase where the stored CO2 is expected to stay in place and the site is monitored to confirm this expectation. Experts interpret monitoring data. Most of the interpretations are derived from indirect measurements and need to be understood. Experts update predictions as far as pathways/scenarios are concerned.


7 CONCLUSIONS

The paper underlines the main features of the use of expert judgment within CO2 geological storage projects. It shows the need for a justified and accepted method able to structure the use of expert judgment and the related elicitation processes, and to combine the outcomes to reach objective appraisals, even if uncertain. Such a method does not exist today in the CCS domain.
The work carried out within the project at the basis of this paper aims at bridging this gap through the definition of a structured approach able to capture expert knowledge, elicit judgments, treat uncertainties and provide sound estimations when there is a lack of field and operational data.


REFERENCES

Barlet-Gouédard, V., Rimmelé, G., Goffé, B. & Porcherie, O. 2007. Well Technologies for CO2 Geological Storage: CO2-Resistant Cement. Oil & Gas Science and Technology—Rev. IFP, Vol. 62, No. 3, pp. 325–334.
Benson, S.M., Hepple, R., Apps, J., Tsang, C.-F. & Lippmann, M. 2002. Lessons Learned from Natural and Industrial Analogues for Storage of Carbon Dioxide in Deep Geological Formations. Lawrence Berkeley National Laboratory, LBNL-51170.
Bérard, T., Jammes, L., Lecampion, B., Vivalda, C. & Desroches, J. 2007. CO2 Storage Geomechanics for Performance and Risk Management. SPE paper 108528, Offshore Europe 2007, Aberdeen, Scotland.
Bonano, E.J., Hora, S.C., Keeney, R.L. & von Winterfeldt, D. 1989. Elicitation and use of expert judgment in performance assessment for high-level radioactive waste repositories. NUREG/CR-5411; SAND89-1821. Washington: US Nuclear Regulatory Commission.
Bowden, A.R. & Rigg, A. 2004. Assessing risk in CO2 storage projects. APPEA Journal, pp. 677–702.
Cooke, R.M. & Gossens, L.J.H. 1999. Procedures guide for structured expert judgment. EUR 18820. Brussels: Euratom.
Gérard, B., Frenette, R., Auge, L., Barlet-Gouédard, V., Desroches, J. & Jammes, L. 2006. Well Integrity in CO2 environments: Performance & Risk, Technologies. CO2 Site Characterization Symposium, Berkeley, California.
Gossens, L.H.J., Cooke, R.M. & Kraan, B.C.P. 1998. Evaluation of weighting schemes for expert judgment studies. In Mosleh, A. & Bari, A. (eds), Probabilistic safety assessment and management. Springer, Vol. 3, pp. 1937–1942.
Helton, J.C. 1994. Treatment of Uncertainty in Performance Assessments for Complex Systems. Risk Analysis, Vol. 14, No. 4, pp. 483–511.
Hoffman, S., Fischbeck, P., Krupnik, A. & McWilliams, M. 2006. Eliciting information on uncertainty from heterogeneous expert panels. Discussion Paper, Resources for the Future, RFF DP 06-17.
IPCC, 2005. IPCC Special Report on Carbon Dioxide Capture and Storage. Intergovernmental Panel on Climate Change, Cambridge University Press, Cambridge, UK.
Kutchko, B.G., Strazisar, B.R., Dzombak, D.A., Lowry, G.V. & Thaulow, N. 2007. Degradation of well cement by CO2 under geologic sequestration conditions. Environmental Science & Technology, Vol. 41, No. 13, pp. 4787–4792.
Oberkampf, W.L., Helton, J.C., Joslyn, C.A., Wojtkiewicz, S.F. & Ferson, S. 2004. Challenge Problems: Uncertainty in System Response Given Uncertain Parameters. Reliability Engineering and System Safety, Vol. 85, pp. 11–19.
Savage, D., Maul, P.R., Benbow, S. & Walke, R.C. 2004. A generic FEP database for the assessment of long-term performance and safety of the geological storage of CO2. Quintessa, QRS-1060A-1.
Simola, K., Mengolini, A. & Bolado-Lavin, R. 2005. Formal expert judgment: an overview. EUR 21772, DG JRC, Institute for Energy, The Netherlands.
Van Der Beken, A., Le Gouévec, J., Gérard, B. & Youssef, S. 2007. Well Integrity Assessment and Modelling for CO2 injection. In Proceedings of WEC07, Algiers, Algeria.
Vivalda, C. & Jammes, L. 2008. Probabilistic Performance Assessment Methodology for long term subsurface CO2 storage. In Proceedings of PSAM 9, Hong Kong.
Whittaker, S.G. 2004. Investigating geological storage of greenhouse gases in southeastern Saskatchewan: The IEA Weyburn CO2 Monitoring and Storage Project. In Summary of Investigations, Misc. Rep. 2004-4.1, Vol. 1, Paper A-2, Saskatchewan Geological Survey, Sask. Industry Resources, Canada.
Integrated risk management and risk-informed decision-making

All-hazards risk framework—An architecture model

Simona Verga
DRDC CSS (Centre for Security Science), Ottawa, Canada

ABSTRACT: This work advocates an architecture approach for an all-hazards risk model, in support of harmonized planning across levels of government and different organizations. As a basis for the architecture, a taxonomic scheme has been drafted, which partitions risk into logical categories and captures the relationships among them. Provided that the classifications are aligned with the areas of expertise of various departments/agencies, a framework can be developed and used to assign portions of the risk domain to those organizations with relevant authority. Such a framework will provide a structured hierarchy where data collection and analysis can be carried out independently at different levels, allowing each contributing system/organization to meet its internal needs as well as those of the overarching framework into which it is set. In the end, the proposed taxonomy will provide a "blueprint" for the all-hazards risk domain, to organize and harmonize seemingly different risks and allow a comparative analysis.

1 INTRODUCTION

Terrorism has become a major concern in recent times, but the potential severity of health and natural disasters has been demonstrated in Canada, as well as worldwide, throughout history in one form or another. While vaccination campaigns and public education have led to the eradication of many of the past "killer" diseases, and while advances in science and engineering have led to more resilient human settlements and infrastructures, at least in our part of the world, we are far from invulnerable. The emergence of new diseases and the possibility of new pandemics; the exacerbation of weather events by global warming; the unforeseen effects of emerging technologies; the asymmetric distribution of wealth and resources, leading to increased demographic pressures; the increasing complexity of our infrastructures, stressing control mechanisms ever closer to the breaking point; and, not least, the ever greater expectation of protection that the population places on its government: these are examples that underscore the need for a robust, all-hazards approach to public safety and security.
In Canada, all levels of government—federal, provincial/territorial and municipal—share the responsibility to protect Canadians and Canadian society. Within each jurisdiction, the governments' public safety and security functions are shared among many departments and agencies. Hence, preparedness at the national level depends on synchronized efforts among many partners. For practical reasons, any attempt to formalize an overarching risk model needs to respect existing structures for ownership and responsibility in managing risks. As an example, in Canada, public safety and security functions are shared horizontally, within one jurisdiction, among several departments and agencies, and they cross jurisdictional boundaries based on the severity of the consequences of the realized risk event. This adds to the complexity of planning for and managing even single emergencies that escalate across jurisdictions and organizational boundaries. The added challenge of all-hazards planning comes from the lack of a coherent picture of the relative severity of the risks associated with various threats/hazards. Taking such an approach is, nonetheless, important.


2 UNDERSTANDING RISK

Risk is an intellectual construct, contingent on the belief that human intervention can influence the outcome of future events, as long as an effort is made to anticipate these events. Oftentimes risk has a negative connotation and is associated with the possibility of future harm or loss (Rowe 1977).

2.1 Risk—the "what", "why" and "how"

In simple terms, risks are about events that, when triggered, cause problems. Because risks refer to potential problems in the future, there is often a great deal of uncertainty with regard to how and to what degree such events may be realized. If an organization's interests are potentially affected, processes are set up to manage the uncertainty and strategies are developed to minimize the effect of such future events on desired outcomes. Risk management is the structured approach to setting up such processes and developing such strategies (Haimes 2004).

Figure 1. Generic risk management process, showing a typical sequence of steps commonly followed, as well as the review and feedback processes.

Although the details of the risk management approach vary widely across risk domains and organizations, the following steps are commonly followed: (1) risk identification; (2) risk assessment; (3) identification/analysis and implementation of risk treatment options; and (4) monitoring and evaluation. In most cases, risk management is a cycle that incorporates review and feedback loops. Figure 1 shows a generic risk management process.
Invariably, risk management starts with risk identification. An important part of risk identification is establishing the context. This implies selecting the domain of interest, establishing the identity and objectives of risk managers and other interested parties (stakeholders), the scope of the process, and the basis upon which risks will be evaluated (i.e., assumptions, constraints). After establishing the context, the next step in the process is to identify potential risks. Risk identification may be approached from different ends: it may start with identifying potential sources of the risk event, or with the risk event itself. More will be said on this essential step in risk management in subsequent sections, as risk identification and classification constitute the main topic of this paper.
Once a risk has been identified, it must be assessed. Risk is the product of Likelihood and Consequences, where "Likelihood" refers to the risk event's chance of occurrence and "Consequences" measures the severity and extent of the effects associated with the event. Indeed, any risk assessment methodology is likely to start with a "risk equation" along these lines. But while this formula is sufficiently general to gain acceptance with a broad audience of "risk professionals", it must be refined further in order to be of practical value to anyone tasked with a specific assessment. The major difficulty facing an "all-hazards" methodology is defining and finding good measures for "Likelihood" and "Consequences" consistently across risks of very different natures.

2.2 Classification of risks—various perspectives

Risks must be identified and described in an understandable way before they can be analyzed and managed properly. Risk identification should be an organized, thorough approach to seeking out probable or realistic risks. Among existing methods, the use of taxonomies meets and promotes the objectives listed before. Establishing appropriate risk categories provides a mechanism for collecting and organizing risks, as well as for ensuring appropriate scrutiny and attention for those risks that can have more serious consequences.
Taxonomies are meant to classify phenomena with the aim of maximizing the differences among groups. The term 'taxonomy' refers to the theory and practice of producing classification schemes. Thus, constructing a classification is a process with rules on how to form and represent groups (Greek word: taxis—order), which are then named (Greek word: nomos—law, or science). Taxonomies are useful if they are able to reduce the complexity of the domain studied into more tractable macro-classes (Coccia 2007).
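The generic "risk equation", Risk = Likelihood x Consequences, can be made concrete as a small scoring rule over a risk register. The 1-5 ordinal scales, the low/medium/high cut-offs and the example hazards below are illustrative assumptions, not values prescribed by the paper.

```python
# Minimal sketch of the generic risk equation (Risk = Likelihood x
# Consequences) on 1-5 ordinal scales; the category cut-offs and the
# example register entries are illustrative assumptions.

def risk_score(likelihood, consequences):
    """Product form of the risk equation."""
    if not (1 <= likelihood <= 5 and 1 <= consequences <= 5):
        raise ValueError("indexes must lie on the 1-5 scale")
    return likelihood * consequences

def risk_category(score):
    """Map a product score onto coarse categories (assumed cut-offs)."""
    if score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"

register = {
    "pandemic": (2, 5),   # rare but severe
    "ice storm": (4, 3),  # more frequent, moderate impact
}
ranked = sorted(register, key=lambda h: risk_score(*register[h]), reverse=True)
# ranked -> ["ice storm", "pandemic"]
```

As the text stresses, the hard part for an all-hazards methodology is not this arithmetic but defining "Likelihood" and "Consequences" measures that are consistent across risks of very different natures.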
The design of taxonomies can be a useful first Another approach is to ‘‘slice’’ risk with respect to
step in theory building. Classification as an output the domain on which the risk event has an impact, as
deals with how groups and classes of entities are illustrated in Figure 2: (1) the natural environment:
arranged, according to the taxonomic approach used. earthquake, flood, storm; (2) the built environment:
The numerous paradigms advanced by specialists in structures, transportation, IT networks; and (3) the
various risk fields may be distilled into a number of social environment: people, communities. We will
general perspectives, or ‘‘facets’’, on how to structure come back to this view in a later section, when we
the risk domain. Each of these perspectives offers cer- discuss building methods for assessing consequences
tain advantages for bringing forward a particular facet of risk events.
of a risk event. The multi-faceted approach is also Last but not least, the classification of risks based on
respectful of the inevitable dose of ambiguity regard- the nature of risk sources provides a basis to systemat-
ing the classification criteria, which may be more ically examine changing situations over time. The next
significant for certain classes of risks than for others. section presents a risk taxonomy based on the source
The blurring of traditional categories related to envi- of the risk event. The classification exercise provides
ronmental and technical risk events is a case in point a powerful way to visualize and understand a complex
(Cohen 1996). domain.
The above considerations illustrate the fact that the
construction of taxonomy inevitably confronts limi-
tations and requires execution of somewhat arbitrary 3 THE ALL-HAZARDS RISK ASSESSMENT
decisions. Nonetheless, even an imperfect structured TAXONOMY
approach is preferred, if it enables a comparative
analysis of broad risk categories, instead of punctual This section proposes a taxonomic approach that
treatment of an arbitrary set of risks in isolation. breaks down all-hazards risk by trying to identify risk
There are several ways of classifying risks. sources, and discusses the advantages of using this
From a time evolution perspective, there are two approach, as well as the shortcomings of the proposed
main sequences of events when talking about risks. scheme.
The first type of event sequence is a sudden occurrence
that brings immediate consequences. Examples are
3.1 Structuring and organizing existing
earthquake, structural collapse, or terrorist attack. The
knowledge
second type of event sequence happens gradually, and
the consequences may become apparent after a long One of the main objectives of the AHRA taxon-
period of time. Examples here are the risk to human omy exercise is directed primarily at retrieving and
health posed by the agricultural use of pesticides, or organizing risk-related knowledge that exists diffusely
that of emerging nanotechnologies. throughout various organizations and levels of the
Canadian government, focusing here on the federal
level. This knowledge constitutes the basis on which a
harmonized methodology for assessing national risks
will be built. Access to the most updated and relevant
information across risk domains under government
responsibility represents an essential enabler for risk
assessment methodology development.
As suggested in (Lambe 2007), an effective taxon-
omy has three key attributes: it provides a classifica-
tion scheme; it is semantic; and it can be used as a map
to navigate the domain. The proposed scheme has been
designed with those key attributes in mind.
First of all, the proposed scheme classifies risks
by grouping related risk events together, in categories
and subcategories structured in a way that reveals the
nature of the underlying relationships. The organi-
zation of classes is respectful of multiple types of
relationships; for example, the two major classes of
risk events are separated based on whether malicious
intent plays a role in the realization of any given risk
within the class.
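As a sketch, the intent-based split at the top of the hierarchy can be represented as a nested mapping; the subcategory and event names below are illustrative placeholders chosen for this text, not the actual AHRA classes:

```python
# Illustrative sketch of a two-class risk taxonomy split on malicious intent.
# Subcategory and event names are invented placeholders, not AHRA content.

TAXONOMY = {
    "Malicious Threats": {
        "terrorism": ["bombing"],
    },
    "Non-Malicious Hazards": {
        "natural disasters": ["earthquake", "flood"],
        "unintentional (man-made)": ["structural collapse"],
    },
}

def classify(event):
    """Walk the hierarchy and return (major class, subcategory) for an event."""
    for major, subcats in TAXONOMY.items():
        for subcat, events in subcats.items():
            if event in events:
                return major, subcat
    return None  # event not yet covered by the scheme

print(classify("flood"))  # ('Non-Malicious Hazards', 'natural disasters')
```

A lookup of this kind is what makes the taxonomy usable as a "map" of the domain: each event resolves to exactly one branch, or to none, which exposes a gap in the scheme.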
Figure 2. Impact taxonomy.

Figure 3. The all-hazards risk taxonomy.

Within each major category, the subcategories are decided based on similarity of attributes, although
the classification principle is clearer for some than for others (e.g., weather-related events and geological events are clear examples of Natural Disasters, while an ecological disaster may legitimately fall under both Natural Disasters and Unintentional events resulting from human action).

Second, the AHRA taxonomy is semantic; it functions within a controlled vocabulary that describes the content of the whole domain and each class and subclass; the agreed definitions for key risk terms are documented in the AHRA lexicon (Verga 2007).

To ensure the success of the taxonomy building exercise, the team of analysts at the Centre for Security Science has undertaken a collection of evidence from stakeholders—Canadian federal organizations with a public safety and/or national security risk mandate—with a focus on establishing common ground, mapping activities to risk categories and uncovering critical gaps.

Figure 3 shows the most current version of the AHRA Risk Event Taxonomy. It must be noted that the categories are not fully stabilized and they may evolve as more parties become engaged and provide input; in fact, as the global context evolves continuously (World Economic Forum 2008), which is most certainly going to affect local conditions, so do potential sources of risk, and it is desirable that the taxonomy go through periodic reviews and updates in order to stay adequate.

3.2 Taxonomy as basis for data collection

Based on the initial categories included in the risk taxonomy, the team of analysts proceeded to build a "Risk Domain Architecture". To that end, a survey was prepared and administered in order to collect relevant information from various risk communities within the Canadian federal government. The survey was structured around three main parts:

1. A standard first part allowed the respondents to self-identify and communicate their role as risk practitioners;
2. A second part consisted of a suite of questions developed to elicit specific information for use in the risk architecture—largely based on the risk categories in the taxonomy;
3. The survey ended with a series of open questions aimed at gaining an overall impression of the state of
risk assessment at the federal level, also designed to elicit additional input to further improve the taxonomy.

Figure 4. OV-05 operational activity model.

The information collected through the survey was used to construct a systems architecture, which helped illustrate how people and organizations were involved in risk activities, the relationships among them, and how their mandated responsibilities aligned with the categories proposed in the risk taxonomy. The survey also served to identify those participants willing to share risk-related tools, methods and assessments. More details about how the survey was administered, how the data was analyzed and the architecture built can be found in (Keown 2008). The next section summarizes the results that are most relevant to the purpose of this paper.

3.3 Results—a federal risk community architecture

Using the information captured through the survey, an architecture was constructed using the U.S. Department of Defense Architecture Framework (DoDAF). DoDAF represents a standard way to organize a systems architecture and is constructed using a number of complementary and consistent views, enabling both an operational and a systems perspective on the built architecture. A full discussion of DoDAF is beyond the scope of this paper and can be found in (DoDAF 2004). The AHRA architecture was built and visualized using Telelogic System Architect® Version 10.6. The diagrams created are too detailed to be viewed clearly, but a brief description of the results is included, together with magnified insets to showcase the main points.

Figure 4 shows the AHRA taxonomy "translated" into an operational view—OV-05, or Operational Activity Model. In this model, each risk event in the taxonomy is represented as an "activity" that needs to be completed for an all-hazards risk assessment to be thoroughly conducted. The figure shows an enlarged portion, for clarity. It should be noted that risk events such as "Ecological Disasters" or "Emerging Technologies", which in the AHRA taxonomy connect to more than one column, are not properly represented in this Operational Activity Model, since this type of relationship is not logical in an activity hierarchy.

In the survey, participants indicated their sections, the larger organizations they belonged to, and each of the taxonomic events they were actively reviewing through a risk assessment. This information was visualized using an Operational Activity Node, or OV-02, diagram. Activity Nodes represent sections within participating departments and agencies that conduct risk assessment activities. Figure 5 shows an example—a node representing the Canadian Ice Service section within the Department of Environment Canada.

Based on these results, it was possible to return to the activity hierarchy and indicate which nodes are responsible for each activity, as shown in Figure 6. Although not shown, the information collected through the survey also allowed mapping communications and other types of relationships between organizations.
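The node-to-activity linking described above can be thought of as an invertible mapping. The sketch below uses invented survey data (only the Canadian Ice Service name comes from the text) to show how responsible nodes and coverage gaps could be derived:

```python
# Sketch of linking survey respondents (activity nodes) to taxonomy events.
# Node and event names here are invented placeholders, not actual survey results.

node_events = {
    "Canadian Ice Service": ["ice storm", "flood"],
    "Public Health Section": ["pandemic"],
}

def nodes_for_event(event):
    """Invert the mapping: which nodes assess a given taxonomy event?"""
    return sorted(n for n, evs in node_events.items() if event in evs)

def uncovered(events):
    """Taxonomy events with no responsible node -- the 'critical gaps'."""
    covered = {e for evs in node_events.values() for e in evs}
    return [e for e in events if e not in covered]

print(nodes_for_event("flood"))                          # ['Canadian Ice Service']
print(uncovered(["flood", "pandemic", "earthquake"]))    # ['earthquake']
```

Inverting the survey data in this way is what allows the activity hierarchy to show both which nodes are responsible for each activity and which activities have no owner at all.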
Finally, based on the survey responses, it was possible to identify risk practitioners who volunteered to provide information on tools, methods or specific assessments. Based on the material collected from these respondents, and an extensive review of the research literature on the topic, the team of analysts at CSS hopes to develop a harmonized methodology capable of sustaining an all-hazards scope. The next section goes back to the AHRA taxonomy and shows how it can be used to guide methodology development.

3.4 Taxonomy as a basis for methodology development

Based on the discussion in the last paragraph of section 2.1, as well as extensive research on the methodology employed within the global risk management community, the universal, widely accepted risk assessment principle is illustrated by the equation:

Risk Magnitude = Likelihood of occurrence × Magnitude of Consequence

This is the fundamental equation in risk management, and it would be difficult to carry out authentic risk assessment without basing the overall process on this relationship. And while the choice of assessment processes, scaling/calibration schemes, or even the set of parameters must accommodate specific risks, as will be further discussed, this basic equation provides the common framework required for comparative analysis of different risks, albeit at a fairly high level.
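Applied numerically, the equation is a simple product. The sketch below scores two hypothetical events on arbitrary 1-5 scales; the scales, event names and values are illustrative assumptions, not the AHRA calibration:

```python
# Risk Magnitude = Likelihood of occurrence x Magnitude of Consequence.
# The 1-5 scales and the example values are illustrative assumptions only.

def risk_magnitude(likelihood, consequence):
    return likelihood * consequence

events = {
    "flood": (4, 3),         # (likelihood, consequence) on a 1-5 scale
    "cyber attack": (2, 5),
}

# Rank events by risk magnitude, highest first -- the comparative analysis
# across categories that a common equation makes possible.
ranked = sorted(events, key=lambda e: risk_magnitude(*events[e]), reverse=True)
print(ranked)  # ['flood', 'cyber attack']
```

Because every event, whatever its category, reduces to the same two factors, the resulting magnitudes can be compared directly, which is exactly the high-level comparability the text argues for.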
This being said, the next paragraphs discuss how the taxonomy can be used to understand the different requirements in treating each of the major risk categories.

Figure 5. Sample of OV-02 activity node.

Figure 6. OV-05 operational activity diagram linked to activity nodes.

The bottom part of Figure 3 showcases the discriminating principle that separates the two main classes of risk events: under "Malicious Threats", risk events originate in the malicious, intentional acts of enemy actors, who can be individuals, groups, or foreign states. The division hints at one major difference in the methodological approaches employed in the
respective domains: the adaptability of the source—the threat actor—in order to avoid risk treatment measures. For this reason, estimating the likelihood of a malicious risk event requires a different approach and different metrics than estimating the likelihood of a non-malicious event.

Figure 7. All-hazards risk assessment modules.

For the latter type, at least in principle, likelihood is a more straightforward combination of the frequency of occurrence for that event type, based on historical data, and the vulnerability/exposure of the human, natural and built environment exposed to the given hazard/threat. Although variations exist within each of the three major branches—unintentional (man-made), natural disasters, and health disasters—the calculation algorithm can be sufficiently consistent for those subcategories with a clear relationship to the class.
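The modular idea (one common equation, class-specific Likelihood modules) can be sketched as follows; the formulas are deliberately simplified stand-ins, not the actual AHRA calculations:

```python
# Sketch of class-specific Likelihood modules feeding one common risk equation.
# The formulas and numbers are simplified stand-ins, not AHRA calculations.

def likelihood_non_malicious(freq_per_year, vulnerability):
    """Historical frequency combined with a vulnerability/exposure factor (0-1)."""
    return freq_per_year * vulnerability

def likelihood_malicious(intent, capability):
    """Threat-actor based estimate from intent and capability scores (0-1)."""
    return intent * capability

def risk_magnitude(likelihood, consequence):
    """The common equation shared by all modules."""
    return likelihood * consequence

# A flood assessed from historical data vs. an attack assessed from threat scores:
flood = risk_magnitude(likelihood_non_malicious(0.5, 0.4), consequence=3.0)
attack = risk_magnitude(likelihood_malicious(0.8, 0.5), consequence=4.0)
print(flood, attack)
```

The point of the sketch is the structure, not the numbers: the two events are assessed by different Likelihood modules, yet their outputs land in the same equation and remain comparable.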
Specialization of Consequence assessment may also be necessary. However, it is conceivably possible to establish a common set of consequence parameters, along with adequate metrics, of which for each risk category only the relevant parameters will be selected and included in the assessment. The set of Consequence parameters may be organized around the three major domains in the Impact Taxonomy illustrated in Figure 2, which also includes overlaps of the domains, allowing for "hybrid" parameters.

Figure 8. Risk assessment tetrahedron—a graphical representation of the common risk equation.

The rationale for having a common set of Consequence parameters stems from the aim of the AHRA methodology, which, as stated in earlier sections of this document, is to enable a comparative analysis across broad categories of different risks, in order to inform decision-making at the highest level. The final output may be assumed to consist of a set of Risk Magnitude assessments—relating to the top risks of the day—which will need to be viewed both in absolute magnitude terms and in relative magnitude terms. In order that the relative assessment of Risk Magnitude—leading to priorities for action and/or allocation of resources—can work, there will have to be a common set of parameters for the understanding of the various consequences.

For example, it may be necessary to prioritize resources between impending natural disasters on the one hand and anticipated terrorist activity on the other. Common ways of understanding the total consequences for Canada of each Risk on the list will be required. For decision-making at the highest level, the risk magnitude assessment will probably need to be supported by high-level information on Consequences.

Thus, a logical way to proceed with the AHRA methodology is to develop a number of "modules" that reflect variations in the way in which the Risk Magnitude calculation needs to be performed for the different classes of risk events. The calculation of Likelihood could be rather different for Malicious Threats, for Natural Disasters and for Unintentional man-made disasters. Also, the calculations of Consequences could be quite different, although this paper strongly advocates a common approach for Consequence assessment. A "modular" AHRA would, however, need to provide commonality in the way in which the final results are presented to decision makers. Figure 7 illustrates a possible breakdown into risk assessment modules, while Figure 8 gives a graphical representation of the risk equation, showing how the modules can be brought together to provide a common picture of different assessments.

To end this section, a last look at the AHRA taxonomy is in order. Figure 3 also illustrates a challenge in finding the right "treatment" for risks that do not seem amenable to the same kind of tidy subdivision. The sub-categories "Ecological Disasters" and "Emerging Technologies" remain somewhat ill-defined and unwieldy. The difficulty with these two sub-categories originates in one shortcoming of the current scheme: the classification principle does not consider the time sequence in the realization of risk events, as discussed in section 2.2. From a time perspective, these two groups of risk would naturally fall under the "gradual" type; many of the other sub-categories belong to the "sudden occurrence" category, although some of the "boxes" in Figure 3 will break under the new lens. This last point highlights the challenges in the ambitious enterprise of tackling "all-risk" in "one battle", particularly the difficulty of bringing the time
dimension into the equation. Developing the AHRA methodology is work in progress in the Risk group at the DRDC Centre for Security Science, and this document is meant to illustrate the problem, the approach and initial results, as well as an awareness of the challenges associated with this exciting project.

4 CONCLUDING REMARKS

This paper proposes a taxonomic scheme that partitions the All-Hazards Risk Domain into major event categories based on the nature of the risk sources, and discusses the advantages of using this approach, as well as the shortcomings of the proposed scheme. The taxonomy enables an architecture approach for an all-hazards risk model, in support of harmonized planning across levels of government and different organizations. Provided that the classifications are aligned with the areas of expertise of the various departments/agencies, a framework can be developed and used to assign portions of the risk domain to those organizations with relevant authority. Essential actors, who are active in conducting assessments and/or performing functions that need to be informed by the assessments, are often invisible from the point of view of authority structures. Such a framework provides a structured hierarchy where data collection and analysis can be carried out independently at different levels, allowing each contributing system/organization to meet its internal needs, but also those of the overarching framework into which it is set.

Finally, based on the survey responses, it was possible to identify risk practitioners who volunteered to provide information on tools, methods or specific assessments. Based on the material collected from these respondents, and an extensive review of the research literature on the topic, the team of analysts at CSS hopes to develop a harmonized methodology capable of sustaining an all-hazards scope.

The paper also shows how the AHRA taxonomy can be used to guide methodology development. The taxonomy can be used to understand the different requirements in treating distinct risk categories, which in turn guides the choice of assessment processes, scaling/calibration schemes, or the set of parameters in order to accommodate specific risks. As a consequence, a national all-hazards risk assessment needs a fold-out, modular structure that reflects and is able to support the different levels of potential management decisions. At the same time, the taxonomy "pulls together" the different components and provides the common framework required for harmonized assessment and comparative analysis of different risks, albeit at a fairly high level.

A noteworthy quality of the risks identified in the taxonomy is that they do not exist, and cannot be identified and assessed, in isolation. Many are interconnected, not necessarily in a direct, cause-and-effect relationship, but often indirectly, either through common impacts or mitigation trade-offs. The better the understanding of this interconnectedness, the better one can design an integrated risk assessment approach and recommend management options. But this remains methodologically and conceptually difficult, due to the inherent complexity of the domain and our limited ability to represent it adequately.

The above considerations add to the methodological hurdles around the representation of interconnectedness inter- and intra-risk domains. In addition, one cannot leave out the global risk context, which is both more complex and more challenging than ever before, according to the World Economic Forum (Global Risks 2008).

REFERENCES

Coccia, M. 2007. A new taxonomy of country performance and risk based on economic and technological indicators. Journal of Applied Economics, 10(1): 29–42.
Cohen, M.J. 1996. Economic dimensions of Environmental and Technological Risk Events: Toward a Tenable Taxonomy. Organization & Environment, 9(4): 448–481.
DoD Architecture Framework Working Group. 2004. DoD Architecture Framework Version 1.0, Deskbook. USA: Department of Defence.
Global Risks 2008. A Global Risk Network Report, World Economic Forum.
Haimes, Y.Y. 2004. Risk Modeling, Assessment and Management. John Wiley & Sons.
Keown, M. 2008. Mapping the Federal Community of Risk Practitioners. DRDC CSS internal report (draft).
Lambe, P. 2007. Organising Knowledge: Taxonomies, Knowledge and Organisational Effectiveness. Oxford: Chandos Publishing.
Rowe, W.D. 1977. An Anatomy of Risk. John Wiley & Sons.
Verga, S. 2007. Intelligence Experts Group All-Hazards Risk Assessment Lexicon. DRDC CSS, DRDC-Centre for Security Science-N-2007-001.

Comparisons and discussion of different integrated risk approaches

R. Steen & T. Aven


University of Stavanger, Norway

ABSTRACT: There exist many discipline oriented perspectives on risk. Broadly categorised we may distin-
guish between technical, economic, risk perception, social theories and cultural theory. Traditionally, these
perspectives have been viewed to represent different frameworks, and the exchange of ideas and results has been
difficult. In recent years several attempts have been made to integrate these basic perspectives to obtain more
holistic approaches to risk management and risk governance. In this paper we review and discuss some of these
integrated approaches, including the IRGC risk governance framework and the UK Cabinet office approach.
A structure for comparison is suggested, based on the attributes risk concepts and risk handling.

1 INTRODUCTION

To analyze and manage risk, an approach or framework is required, defining what risk is, and guiding how risk should be assessed and handled. Many such approaches and frameworks exist. To categorize these, Renn (1992, 2007) introduces a classification structure based on disciplines and perspectives. He distinguishes between: statistical analysis (including the actuarial approach), toxicology, epidemiology, probabilistic risk analysis, economics of risk, psychology of risk, social theories of risk, and cultural theory of risk.

To solve practical problems, several of these perspectives are required. Consider a risk problem with a potential for extreme consequences and where the uncertainties are large. Then we need to see beyond the results of the probabilistic risk analyses and the expected net present value calculations. Risk perception and social concerns could be of great importance for the risk management (governance).

But how should we integrate the various perspectives? In the literature, several attempts have been made to establish suitable frameworks for meeting this challenge, integrating two or more of these perspectives. In this paper we restrict attention to four of these:

– The HSE framework, Reducing risks, protecting people (HSE 2001)
– The UK Cabinet Office approach (Cabinet Office 2002)
– The risk governance framework IRGC (Renn 2005)
– The consequence-uncertainty framework introduced by Aven (2003, 2007).

The aim of this paper is to compare these integrated perspectives, by looking at their differences and commonalities. The comparison is based on an evaluation of how risk is defined and how risk is handled (managed).

2 AN OVERVIEW OF SELECTED FRAMEWORKS

The main purpose of this section is to summarize the main ideas of the different integrated risk approaches mentioned above.

2.1 The HSE framework

The main purpose of this approach is to set out an overall framework for decision making by HSE (the Health and Safety Executive) and explain the basis for HSE's decisions regarding the degree and form of regulatory control of risk from occupational hazards. However, the framework is general and can also be used for other types of applications. This approach consists of five main stages: characterising the issue; examining the options available for managing the risks; adopting a particular course of action; implementing the decisions; and evaluating the effectiveness of the actions. Risk is reflected as both the likelihood that some form of harm may occur and a measure of the consequence.

The framework utilises a three-region approach (known as the tolerability of risk, TOR), based on the categories acceptable region, tolerable region and unacceptable region. Risks in the tolerable region should be reduced to a level that is as low as reasonably practicable (ALARP). The ALARP principle implies what could be referred to as the principle of 'reversed
onus of proof'. This implies that the base case is that all identified risk reduction measures should be implemented, unless it can be demonstrated that there is gross disproportion between costs and benefits.

2.2 The UK Cabinet Office approach

The UK Cabinet Office approach (Cabinet Office 2002) sets out how government should think about risk, and practical steps for managing it better. It proposes principles to guide the handling and communication of risks to the public. Risk refers to uncertainty of outcome, of actions and events, and risk management is about getting the right balance between innovation and change on the one hand, and avoidance of shocks and crises on the other. The approach is based on the thesis that the handling of risk is at heart about judgement. Judgement in the context of government decision making can, and should, be supported by formal analytical tools which themselves need enhancing. But these cannot substitute for the act of judgement itself. The approach frames how far formal risk analysis can be usefully enhanced and made systematic, so that there is greater clarity about where analysis ends—and judgement begins. It also explores and suggests what else we need to do to enhance our handling of risk and innovation.

2.3 The risk governance framework IRGC

The risk governance framework (Renn 2005) has been developed to provide structure and guidance on how to assess and handle risk at the societal level. The framework integrates scientific, economic, social and cultural aspects and includes the effective engagement of stakeholders. The framework is inspired by the conviction that both the 'factual' and the 'socio-cultural' dimensions of risk need to be considered if risk governance is to produce adequate decisions and results. It comprises five main phases: pre-assessment, risk appraisal, tolerability and acceptability judgment, risk management and communication. Risk is defined as an uncertain consequence of an event or an activity with respect to something that humans value. The framework gives importance to contextual aspects which either are directly integrated in a risk management process or otherwise form the basic conditions for making any risk-related decision. The framework also introduces a categorisation of risk problems which is based on the different states of knowledge about each particular risk.

2.4 The consequence-uncertainty framework

In the consequence-uncertainty framework introduced by Aven (2003, 2007), risk is defined as the two-dimensional combination of events/consequences and associated uncertainties. Focus is on the events and consequences (referred to as observable quantities), such as the number of fatalities and costs, and these are predicted and assessed using risk assessments. Probabilities and expected values are used to express the uncertainties, but it is acknowledged that they are not perfect tools for expressing the uncertainties. To evaluate the seriousness of risk and conclude on risk treatment, a broad risk picture needs to be established, reflecting also aspects such as risk perception and societal concern. The analyses need to be put into a wider decision-making context, which is referred to as a management review and judgment process.

3 COMPARISON OF THE APPROACHES

This section compares the four frameworks with respect to the risk perspectives and risk handling.

3.1 The risk concept

We introduce a context and terminology as follows: We consider an activity, from which events A may occur, leading to consequences C. The occurrence of A and the consequences C are subject to uncertainties U. The likelihoods and probabilities associated with the events and possible consequences are denoted L and P, respectively. Using these symbols, the risk definitions can be summarised in the following way:

– HSE framework: (L, C)
– Cabinet Office framework: U and (L, C)
– IRGC framework: C
– Consequence-uncertainty framework: (A, C, U).

Likelihood is normally understood as the same as probability, and in this paper we do not distinguish between these two concepts. Note however that some see likelihood as a more qualitative description than probability, which is restricted to a number in the interval [0, 1].

The Cabinet Office defines risk by uncertainty. Often the uncertainty is seen in relation to the expected value, and the variance is used as a measure of risk. As an example, consider the problem of investing money in a stock market. Suppose the investor considers two alternatives, both with expectation 1, and variances 0.16 and 0.08, respectively. As alternative 2 has the lowest risk (uncertainty), expressed by the variance, this alternative would normally be chosen. As another example, consider the number of fatalities in traffic next year in a specific country. Then the variance is rather small, as the number of fatalities shows rather small variations from year to year. Hence according to this definition of risk, we must conclude that the risk is small, even though the numbers of fatalities
Figure 1. Risk defined as a consequence (Aven & Renn 2008a).

are many thousands each year. Clearly, this definition of risk fails to capture an essential aspect, the consequence dimension. Uncertainty cannot be isolated from the intensity, size, extension etc. of the consequences. Take an extreme case where only two outcomes are possible, 0 and 1, corresponding to 0 and 1 fatality, and the decision alternatives are A and B, having uncertainty (probability) distributions (0.5, 0.5) and (0.0001, 0.9999), respectively. Hence for alternative A there is a higher degree of uncertainty than for alternative B, meaning that risk according to this definition is higher for alternative A than for B. However, considering both dimensions, both the uncertainty and the consequences, we would of course judge alternative B to have the highest risk, as the negative outcome 1 is nearly certain to occur.

The IRGC framework defines risk by C; see Figure 1. According to this definition, risk expresses a state of the world independent of our knowledge and perceptions. Referring to risk as an event or a consequence, we cannot conclude on risk being high or low, or compare options with respect to risk. Compared to standard terminology in risk research and risk management, it leads to conceptual difficulties that are incompatible with the everyday use of risk in most applications, as discussed by Aven & Renn (2008a) and summarised in the following.

The consequence of a leakage in a process plant is a risk according to the IRGC definition. This consequence may for example be expressed by the number of fatalities. This consequence is subject to uncertainties, but the risk concept is restricted to the consequence—the uncertainties and how people judge the uncertainties are a different domain. Hence a risk assessment according to this definition cannot conclude, for example, that the risk is high or low, or that option A has a lower or higher risk than option B, as it makes no sense to speak about a high or higher consequence—the consequence is unknown. Instead the assessment needs to conclude on the uncertainty or the probability of the risk being high or higher. We conclude that any judgement about risk needs to take into account uncertainties/likelihoods, so why not include this dimension in the risk concept?

Also the (L, C) and (P, C) definitions can be challenged. A probability does not capture all aspects of concern. To explain this we need to first introduce the two common ways of interpreting a probability: the classical relative frequency interpretation and the subjective Bayesian interpretation.

According to the classical relative frequency paradigm, a probability is interpreted as the relative fraction of times the events occur if the situation analyzed were hypothetically "repeated" an infinite number of times. The underlying probability is unknown, and is estimated in the risk analysis. Hence if this interpretation is adopted in the above definitions of risk, we have to take into account that the risk estimates could be more or less accurate relative to the underlying true risk. The uncertainties in the estimates could be very large, and difficult to express.

The alternative (the Bayesian perspective) considers probability as a measure of uncertainty about events and outcomes (consequences), seen through the eyes of the assessor and based on the available background information and knowledge. Probability is a subjective measure of uncertainty, conditional on the background information. The reference is a certain standard such as drawing a ball from an urn. If we assign a probability of 0.4 for an event A, we compare our uncertainty of A occurring with drawing a red ball from an urn having 10 balls where 4 are red. True probabilities do not exist.

However, a probability is not a "perfect tool" even for this purpose. The assigned probabilities are conditional on a specific background information, and they could produce poor predictions. Surprises relative to the assigned probabilities may occur, and by just addressing probabilities such surprises may be overlooked (Aven 2008a, b).

The consequence-uncertainty definition (A, C, U) may be rephrased by saying that risk associated with an activity is to be understood as (Aven and Renn 2008): uncertainty about and severity of the consequences of an activity (I), where severity refers to intensity, size, extension, and so on, and is with respect to something that humans value (lives, the environment, money, etc). Losses and gains, for example expressed by money or the number of fatalities, are ways of defining the severity of the consequences.

The main features of the definition are illustrated in Figure 2. The uncertainty relates to both the event and the consequences given that this event occurs.

3.2 Risk handling (management)

The risk handling covers all activities to direct and control an organisation with regard to risk. It typically
emphasis on consequences
eg if serious/irreversible or
Uncertainty need to address societal towards
Risk

Likelihood increasingly uncertain


concerns ignorance

rely on past
experience
Severity of generic hazard

Events and
Activity consequences
(outcomes) consider putative
conventional consequences
risk assessment and scenarios
Values at stake
Values at stake
Consequences increasingly uncertain

Figure 2. Illustration of the risk definition (A, C, U)


Figure 3. Procedures for handling uncertainty when assess-
and (I).
ing risks (HSE 2001).

It typically covers risk assessments, risk avoidance, risk reduction, risk transfer, risk retention, risk acceptance and risk communication (ISO 2002).

HSE framework

The main steps of the risk management process are presented in section 2.1. The steps follow to a large extent the standard structure for risk management processes, see e.g. AS/NZS 4360 (2004). However, the framework has some genuine features on a more detailed level, of which the following are considered to be of particular importance:

– Weight to be given to the precautionary principle in the face of scientific uncertainty. The precautionary principle describes the philosophy that should be adopted for addressing hazards subject to high scientific uncertainties. According to the framework the precautionary principle should be invoked where:
  a. There is good reason to believe that serious harm might occur, even if the likelihood of harm is remote.
  b. The uncertainties make it impossible to evaluate the conjectured outcomes with sufficient confidence.
– The tolerability framework (the TOR framework) and the ALARP principle. This framework is based on three separate regions (acceptable region, tolerable region and unacceptable region).
– An acknowledgment of the weaknesses and limitations of risk assessments and cost-benefit analysis. These analyses could provide useful decision support, but need to be seen in a wider context, reflecting that there are factors and aspects to consider beyond the analysis results. For many types of situations, qualitative assessment could replace detailed quantitative analysis. In case of large uncertainties, other principles and instruments are required, for example the precautionary principle.

The procedures (approaches) for handling uncertainties are illustrated in Figure 3. The vertical axis represents increasing uncertainty in the likelihood that the harmful consequences of a particular event will be realised, while the horizontal axis represents increasing uncertainty in the consequences attached to the particular event.

At the lower left corner, a risk assessment can be undertaken with assumptions whose robustness can be tested by a variety of methods. However, as one moves along the axes, increasingly assumptions are made that are precautionary in nature and which cannot be tested (HSE 2001).

Figure 4. Framework for handling risk and uncertainty.

Cabinet Office framework

The main features of the framework are shown in Figure 4. The framework is to a large extent based on the HSE framework. For example, it refers to the approach described by Figure 3, and the TOR framework as well as the ALARP principle constitute fundamental building blocks of the Cabinet Office framework. There are also several original aspects, and we would like to point at some of these:

The definition of three different levels of decision making: strategic, programme and operational.
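The two-axis logic of Figure 3 (HSE 2001) described above, with increasing uncertainty in the likelihood on one axis and increasing uncertainty in the consequences on the other, can be read as a crude selector for the assessment approach. The three-way split, the level labels and the returned strategy names below are our own illustrative simplification of the figure, not an algorithm given by HSE.

```python
# Illustrative reading of Figure 3 (HSE 2001): as uncertainty about the
# likelihood and/or the consequences of an event grows, the assessment moves
# from conventional risk assessment towards precautionary treatment.
# The discretisation into three levels is an assumption made for this sketch.

def handling_procedure(likelihood_uncertainty: str,
                       consequence_uncertainty: str) -> str:
    """Both arguments take one of: 'low', 'medium', 'high'."""
    levels = {"low": 0, "medium": 1, "high": 2}
    u = max(levels[likelihood_uncertainty], levels[consequence_uncertainty])
    if u == 0:
        # Lower left corner: assumptions can be tested by a variety of methods.
        return "conventional risk assessment"
    if u == 1:
        # Moving along the axes: untestable, precautionary assumptions appear.
        return "risk assessment with precautionary assumptions"
    # Towards ignorance: consider putative consequences and scenarios.
    return "precautionary principle / scenario analysis"


print(handling_procedure("low", "low"))      # -> conventional risk assessment
print(handling_procedure("high", "medium"))  # -> precautionary principle / scenario analysis
```

Taking the maximum of the two uncertainty levels reflects that either axis alone is enough to push the assessment away from the testable lower-left corner of the figure.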
Table 1. Risk problem category—uncertainty induced example—implications for risk management (Aven and Renn 2008b, adapted from Renn (2005)).

Risk problem category: Uncertainty-induced risk problems.

Management strategy 1: Risk informed and caution/precaution based (risk agent).
Appropriate instruments: Risk assessments; broad risk characterisations, highlighting uncertainties and features like persistence, ubiquity, etc. Tools include:
• Containment
• ALARP (as low as reasonably practicable)
• BACT (best available control technology), etc.

Management strategy 2: Risk informed; robustness and resilience focused (risk absorbing system).
Appropriate instruments: Risk assessments; broad risk characterisations. Improving capability to cope with surprises:
• Diversity of means to accomplish desired benefits
• Avoiding high vulnerabilities
• Allowing for flexible responses
• Preparedness for adaptation

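Table 1's recommendations can be held in a small lookup keyed by the side of the risk problem being managed. The dictionary layout and key names below are our own choice; the strategy names and instrument lists restate the two rows of Table 1.

```python
# The two management strategies of Table 1 for uncertainty-induced risk
# problems, restated as a lookup table. Keys and structure are our own;
# the content comes from Table 1.
IRGC_UNCERTAINTY_STRATEGIES = {
    "risk agent": {
        "strategy": "Risk informed and caution/precaution based",
        "instruments": [
            "Risk assessments; broad risk characterisations highlighting "
            "uncertainties and features like persistence and ubiquity",
            "Containment",
            "ALARP (as low as reasonably practicable)",
            "BACT (best available control technology)",
        ],
    },
    "risk absorbing system": {
        "strategy": "Risk informed; robustness and resilience focused",
        "instruments": [
            "Risk assessments; broad risk characterisations",
            "Diversity of means to accomplish desired benefits",
            "Avoiding high vulnerabilities",
            "Allowing for flexible responses",
            "Preparedness for adaptation",
        ],
    },
}


def instruments_for(target: str) -> list[str]:
    """Appropriate instruments for 'risk agent' or 'risk absorbing system'."""
    return IRGC_UNCERTAINTY_STRATEGIES[target]["instruments"]


print(instruments_for("risk absorbing system")[1])
# -> Diversity of means to accomplish desired benefits
```

Splitting the row on the risk agent (reducing the hazard) versus the risk absorbing system (hardening what is exposed) mirrors the two-line structure of the table itself.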
At the strategic level decisions involve the formulation of strategic objectives including major external threats, significant cross-cutting risks, and longer term threats and opportunities. At the programme level, the decision-making is about procurement, funding and establishing projects. And at the project and operational level, decisions will be on technical issues, managing resources, schedules, providers, partners and infrastructure. The level of uncertainty (and hence risk) will decrease as we move from the strategic level to the programme and then the operational level.

In addition, the focus is on risk appetite, i.e. the quantum of risk that you are willing to accept in pursuit of value. There is a balance to be made between innovation and change on the one hand, and avoidance of shocks and crises on the other. Risk management is often focused on risk reduction, without recognition of the need for taking risks to add value.

IRGC framework

On a high level the framework is similar to the two other frameworks presented above. However, on a more detailed level we find several unique features. One is related to the distinction between different types of situations (risk problems) being studied, according to the degree of complexity (Simple—Complex), Uncertainty and Ambiguity (Aven & Renn 2008b):

Simplicity is characterised by situations and problems with low complexity, uncertainties and ambiguities.

Complexity refers to the difficulty of identifying and quantifying causal links between a multitude of potential causal agents and specific observed effects.

Uncertainty refers to the difficulty of predicting the occurrence of events and/or their consequences based on incomplete or invalid databases, possible changes of the causal chains and their context conditions, extrapolation methods when making inferences from experimental results, modelling inaccuracies or variations in expert judgments. Uncertainty may result from an incomplete or inadequate reduction of complexity, and it often leads to expert dissent about the risk characterisation.

Ambiguity relates to i) the relevance, meaning and implications of the decision basis; or ii) the values to be protected and the priorities to be made.

For the different risk problem categories, the IRGC framework specifies a management strategy, appropriate instruments and stakeholder participation; see Table 1, which indicates the recommendations for the category uncertainty.

The consequence-uncertainty framework

The framework follows the same overall structure as the other frameworks and is characterised by the following specific features:

– It is based on a broad semi-quantitative perspective on risk, in line with the perspective described in section 3.1, with focus on predictions and highlighting uncertainties beyond expected values and probabilities, allowing a more flexible approach than traditional statistical analysis. It acknowledges that expected values and probabilities could produce poor predictions—surprises may occur.
– Risk analyses, cost-benefit analyses and other types of analyses are placed in a larger context (referred to as a managerial review and judgment), where the
limitations and constraints of the analyses are taken into account.
– The cautionary and precautionary principles are considered integrated features of risk management. The cautionary principle is a basic principle in risk management, expressing that in the face of uncertainty, caution should be a ruling principle, for example by not starting an activity, or by implementing measures to reduce risks and uncertainties (Aven & Vinnem 2007, HSE 2001). The precautionary principle is considered a special case of the cautionary principle; its definition is discussed in Section 4.

A risk classification structure is suggested by the combination of expected consequences EC and an assessment of uncertainties in underlying phenomena and processes that can give large deviations compared to the expected values.

Starting from the classification based on the traditional risk description using the expected consequences, we may modify the classification based on the uncertainty assessments: for example, if a system is classified as having a medium risk according to the expected consequences criterion, we may reclassify it as having high risk if the uncertainties in underlying phenomena and processes are very large. The uncertainties may be related to, for example, new technology, future use and demand for the system, and political events.

4 DISCUSSION

The four frameworks have similarities, as shown in Section 3. The overall principles are to a large extent overlapping:

– Risk perspectives highlighting events, consequences, probability (likelihood) and uncertainties.
– Risk management processes along the lines shown in Figure 4, and the use of risk reduction processes such as ALARP.
– A risk-informed use of risk analysis. It is acknowledged that risk analyses may provide useful decision support, but they need to be placed in a wider context, where their scope and limitations are taken into account. Quantitative risk analyses cannot replace sound judgments.

Different terminology is used, and different aspects are highlighted in the four frameworks. The terminology is to a varying degree consistent with the intentions and ambitions of the frameworks, as shown in Section 3.1. For example, in the IRGC framework risk is defined as an uncertain consequence of an event or an activity with respect to something that humans value. However, the framework in general does not restrict risk to a consequence. Rather the framework stresses the importance of reflecting consequences, likelihoods and uncertainties. The adjusted definition ''uncertainty about and severity of the consequences of an activity with respect to something that humans value'' (Aven and Renn 2008a) can be seen as a reformulation of the original one to better reflect the intention.

As another example, the Cabinet Office (2002) refers to risk as uncertainty, which would mean that risk is considered low if one expects millions of fatalities, as long as the uncertainties are low. Risk management certainly needs to have a broader perspective on risk, and this is of course also recognised by the Cabinet Office framework. The terminology may however be challenged.

When referring to the likelihood of an event we mean the same as the probability of the event. However, the term probability can be interpreted in different ways, as discussed in Section 3.1, and this would also give different meanings of likelihood. With the exception of the consequence-uncertainty framework, none of the frameworks have specified the probabilistic basis. In the consequence-uncertainty framework probability means subjective probability. Hence there is no meaning in discussing uncertainties in the probabilities and likelihoods. If such a perspective is adopted, how can we then understand for example Figure 3, which distinguishes between uncertainties about likelihoods (probabilities) and uncertainties about consequences?

The former types of uncertainties are referred to as epistemic uncertainties and are also expressed as second-order probabilities. This view is based on the idea that there exist some ''true'' probabilities out there, based on the traditional relative frequency approach, that risk analysis should try to accurately estimate. However, this view can be challenged. Consider for example the probability of a terrorist attack, i.e. P(attack occurs). How can this probability be understood as a true probability, by reference to a thought-constructed repeated experiment? It does not work at all. It makes no sense to define a large set of ''identical'', independent attack situations, where some aspects (for example related to the potential attackers and the political context) are fixed and others (for example the attackers' motivation) are subject to variation. Say that the attack probability is 10%. Then in 1000 situations, with the attackers and the political context specified, the attackers will attack in 100 cases. In 100 situations the attackers are motivated, but not in the remaining 900. Motivation for an attack in one situation does not affect the motivation in another. For independent random situations such ''experiments'' are meaningful, but not for more complex situations such as this attack case.

Alternatively, we may interpret the likelihood uncertainties in Figure 3 by reference to the level of
consensus about the probabilities, or by reference to the amount and quality of the background data and knowledge for the probabilities.

Or we may use other classification structures, for example the one which is based on expected consequences and uncertainties in underlying phenomena and processes (refer to the consequence-uncertainty framework).

The use of the precautionary principle needs to be based on an understanding of the risk and uncertainty concepts. The precautionary principle applies when there are scientific uncertainties about the consequences, but are uncertainties about the likelihoods and probabilities also included? This is discussed by Aven (2006).

The level of uncertainties would affect our management policies and strategies. We will always give weight to the cautionary and precautionary principles in case of large uncertainties. All frameworks acknowledge this, although the terminology varies. The HSE framework and the consequence-uncertainty framework highlight the cautionary principle, and not only the precautionary principle. The cautionary principle means that caution, for example by not starting an activity, or by implementing measures to reduce risks and uncertainties, shall be the overriding principle when there is uncertainty about the future occurrence of events and their consequences. The precautionary principle is a special case of the cautionary principle, used when there are scientific uncertainties about the consequences. In practice many refer to the precautionary principle in the meaning of the cautionary principle. However, this could be considered unfortunate, as it would mean a reference to this principle too often. Aven (2006), among others, prefers to restrict the precautionary principle to situations where there is a lack of understanding of how the consequences (outcomes) of the activity are influenced by the underlying factors.

To manage risk it is common to use a hierarchy of goals, criteria and requirements, such as risk acceptance criteria (defined as upper limits of acceptable risk) or tolerability limits, for example ''the individual probability of being killed in an accident shall not exceed 0.1%''. The use of such criteria constitutes an integrated part of the various frameworks. However, the weight given to the criteria varies. In the consequence-uncertainty framework these criteria do not play an important role. According to this framework such criteria should be used with care, and avoided if possible, in particular on a high system level, for example a plant or an industry (Aven & Vinnem 2007, Aven et al. 2006). It is argued that, principally speaking, a requirement (criterion) related to risk and safety cannot be isolated from what the solution and measure mean in relation to other attributes, and in particular costs. It is impossible to know what should be the proper requirement without knowing what it implies and what it means when it comes to cost, effect on safety, etc. If such criteria are defined, they give a focus on obtaining a minimum safety standard—and there is no drive for improvement and risk reduction. If a high level of safety is to be obtained, other mechanisms than risk acceptance criteria need to be implemented and highlighted, for example ALARP processes. Furthermore, no method has a precision that justifies a mechanical decision based on whether the result is over or below a numerical criterion.

HSE (2001) sets the value of a life (the cost one should be willing to pay for reducing the expected number of lives lost by one) equal to £1 million, and proposes that the risk of an accident causing the death of 50 people or more in a single event should be regarded as intolerable if the frequency is estimated to be more than one in five thousand per annum. HSE believes that an individual risk of death of one in a million per annum for both workers and the public corresponds to a very low level of risk and should be used as a guideline for the boundary between the broadly acceptable and tolerable regions.

For the offshore industry a value of a life of £6 million is considered to be the minimum level, i.e. a proportion factor of 6 (HSE 2006). This value is used in an ALARP context, and defines what is judged as ''grossly disproportionate''. Use of the proportion factor 6 is said to account for the potential for multiple fatalities and uncertainties. Hence the base case is that a risk reducing measure should be implemented, and strong evidence (costs) is required to justify non-implementation.

To verify these criteria, expected value based approaches such as cost-benefit analyses and cost-effectiveness analyses are used, calculating ICAF values (implied cost of averting one fatality, i.e. the expected cost per expected reduced number of fatalities). This approach is indeed questionable, as the expected values do not take into account the risks and uncertainties. A main objective of a safety measure is to reduce risk and uncertainties, but then we cannot use a principle based on expected values which to a large extent ignores the risk and uncertainties (Aven & Abrahamsen 2007).

All risk perspectives and frameworks considered in this paper acknowledge the need for taking into account the risk and uncertainties beyond the expected values, but the practice of using expected value based approaches is in direct conflict with this recognition.

5 FINAL REMARKS

Our analysis demonstrates that the four frameworks are based on the same type of fundamental ideas and principles. However, the terminology and practical
approaches and methods adopted differ substantially. This can be partly explained by different scientific traditions, as the frameworks have been developed in different scientific environments, and partly explained by different needs and objectives. The foundations of all the frameworks have not been clarified. Processes need to be initiated to strengthen the theoretical basis of the frameworks. An example in this direction is Aven and Renn (2008b).

REFERENCES

AS/NZS 4360 2004. Australian/New Zealand Standard: Risk management.
Aven, T. 2003. Foundations of Risk Analysis—A Knowledge and Decision Oriented Perspective. Wiley, NY.
Aven, T. 2006. On the precautionary principle, in the context of different perspectives on risk. Risk Management: an International Journal, 8, 192–205.
Aven, T. 2007. A unified framework for risk and vulnerability analysis and management covering both safety and security. Reliability Engineering and System Safety, 92, 745–754.
Aven, T. 2008a. A semi-quantitative approach to risk analysis, as an alternative to QRAs. Reliability Engineering & System Safety, 93, 768–775.
Aven, T. 2008b. Risk Analysis. Wiley, NJ.
Aven, T. and Abrahamsen, E.B. 2007. On the use of cost-benefit analysis in ALARP processes. International Journal of Performability Engineering, 3, 345–353.
Aven, T. and Renn, O. 2008a. On risk defined as an event where the outcome is uncertain. Submitted.
Aven, T. and Renn, O. 2008b. Determining the right level of investments in societal safety and security—the role of quantitative risk assessments. Submitted for possible publication.
Aven, T. and Vinnem, J.E. 2005. On the use of risk acceptance criteria in the offshore oil and gas industry. Reliability Engineering and System Safety, 90, 15–24.
Aven, T. and Vinnem, J.E. 2007. Risk Management, with Applications from the Offshore Oil and Gas Industry. Springer Verlag, NY.
Aven, T., Vinnem, J.E. and Røed, W. 2006. On the use of goals, quantitative criteria and requirements in safety management. Risk Management: an International Journal, 8, 118–132.
Cabinet Office 2002. Risk: improving government's capability to handle risk and uncertainty. Strategy Unit report. UK.
HSE 2001. Reducing risk, protecting people. HSE Books, ISBN 0 7176 2151 0.
HSE 2006. Offshore installations (safety case) regulations 2005, regulation 12: demonstrating compliance with the relevant statutory provisions.
ISO 2002. Risk management vocabulary. ISO/IEC Guide 73.
Renn, O. 2005. Risk Governance: Towards an Integrative Approach. White Paper No. 1, written by Ortwin Renn with an Annex by Peter Graham. International Risk Governance Council, Geneva.

Management of risk caused by domino effect resulting from design system dysfunctions

S. Sperandio, V. Robin & Ph. Girard
IMS, Department LAPS, University Bordeaux 1, France

ABSTRACT: The aim of this paper is to ensure the development of design projects in an environment with limited resources (material, human, know-how, etc.) and therefore satisfy the strategic performance objectives of an organization (cost, quality, flexibility, lead time, deadlines, etc.). The paper also sets out the problem of a real integration between risk management and design system management. With this intention, a new paradigm for the Risk Management Process (RMP) is proposed and then illustrated via an industrial case. Such an RMP includes the tasks of establishing the context, identifying, analysing, evaluating, treating, monitoring and communicating risks resulting from design system dysfunctions. It also takes into account risks caused by the domino effect of design system dysfunctions, and thereby completes risk management methodologies provided by companies that don't consider this aspect.

1 INTRODUCTION

In the design, organization and planning phases of a service or manufacturing system, the capacity of a system to maintain its performances in spite of changes related to its environment, or of uncertainty in the data used, constitutes a new issue which has arisen to complicate an already complex situation. Therefore, in response to the evolutions of its environment and its possible internal dysfunctions, a system evolves thanks to actions on its constitution and/or on the products manufactured and/or on the network to which it belongs (suppliers, subcontractors, customers, etc.). The design system can be defined as the environment where design projects (product, service, system or network design) take place. Its management requires controlling each phase of the product/service/system/network lifecycle, the design methodologies, the process and the associated technologies. The aim of this paper is to ensure the development of design projects in an environment with limited resources (material, human, know-how, etc.) and therefore satisfy the strategic performance objectives of an organization (cost, quality, flexibility, lead time, deadlines, etc.). The propagation of accidental scenarios and the concomitant increase in the severity of events are usually named the ''domino'' effect. However, risk management methodologies provided by companies don't take into account risks caused by the domino effect of design system dysfunctions. The paper therefore also sets out the problem of a real integration between risk management and design system management. With this intention, a new paradigm for the Risk Management Process (RMP) is proposed and then illustrated via an industrial case. Such an RMP includes the tasks of establishing the context, identifying, analysing, evaluating, treating, monitoring and communicating risks resulting from design system dysfunctions.

This paper is organized as follows. In section 2, general points of risk management are introduced. Then, the RMP relative to design projects is constructed. Sections 3, 4 and 5 detail the different phases of the RMP. Risks are examined according to three points of view (functional, organic, and operational), making it possible to determine their impacts on the objectives of the organization. These impacts are evaluated and quantified thanks to the FMECA methodology, which makes it possible to identify potential failure modes for a product or process before the problems occur, to assess the risk associated with those failure modes, and to identify and carry out measures to address the most serious concerns. An industrial case illustrates the methodology in section 6. Some concluding remarks and discussions are provided in the last section.

2 RISK MANAGEMENT: GENERAL POINTS

Project success or failure depends on how and by whom it is determined (Wallace et al., 2004). Many design projects deviate from their initial target due to modifications or changes to customer needs as well as internal or external constraints.
Figure 1. Risk Management Process relative to design projects.

Other projects may be simply stopped due to uncontrolled parameters and unidentified constraints arising from various processes within the project, such as the product design process or the supply chain design (identification of suppliers and subcontractors, for example). The consequence: no risk, no design. Risk processes do not require a strategy of risk avoidance but an early diagnosis and management (Keizer et al., 2002). Nevertheless, most project managers perceive risk management processes as extra work and expenses. Thus, risk management processes are often expunged if a project schedule slips (Kwak et al., 2004). In a general way, the main phases of risk management are (Aloini et al., 2007): context analysis (1), risk identification (2), risk analysis (3), risk evaluation (4), risk treatment (5), monitoring and review (6), and communication and consulting (7). In agreement with such a methodology, we propose to use the following process to manage the risks relative to design projects (Fig. 1). The different phases of the process are detailed hereafter.

3 RMP: CONTEXT ANALYSIS

3.1 Definition of the design project

The definition phase aims to determine the contents of the project: product, service, system and/or network design. The design process is the set of activities involved to satisfy design objectives in a specific context. The design objectives concern the product/service/system/network definition. They are constrained by the enterprise organization and the design steps, and are influenced by technologies or human and physical resources (Wang, 2002).
Figure 2. Design system modeling, interactions between factors influencing the design system (Robin et al., 2007).

Design is mainly a human activity, and it is also very complex to understand the tasks carried out by designers (Gero, 1998). When the resolution steps are known (routine design process), the project is structured according to different activities which transform the product/service/system/network knowledge. Then, the project is defined as the intention to satisfy a design objective (a technical function, a part, an assembly or a complex mechanism). The project manager decomposes the project according to the identified activities, and the actors' tasks are very prescriptive. In this case the respect of the deadline is the main performance objective. So, the project manager decides on the synchronisation of the human and material resources availability with the activities' needs.

In the other cases, design can't be considered as a problem-solving process, or as a creative or innovative design process, and activities don't structure the project. Design must be identified as a process which supports the emergence of solutions. In this case, the project is organised to favour the collaboration between actors of the process, and the project manager seeks to create design situations which facilitate the emergence of solutions. He decides on the adapted organization to favour collaborative work. Innovation corresponds to the application of new and creative ideas. Therefore, implementing an innovation project leads to investing in a project by giving up the idea of immediate profitability and by accepting a certain risk. Different types of innovation exist (organizational change, process or technological innovations, product development, etc.), which do not rest on the implementation of similar knowledge and means and thus do not generate the same risks. For example, a survey done by the Product Development and Management Association (PDMA) reveals that more than 50% of the sales in successful companies were coming from new products, and that the percentage was even over 60% in the most successful overall company (Balbontin et al., 2000).

3.2 Modeling of the design system

Modeling contributes to ideas development and structuring, and can be used as a support of reasoning and simulation. Design management requires understanding of the design process context in order to adapt actors' work if it turns out to be necessary. Hence, in their generic model of design activity performance, O'Donnell and Duffy insist on the necessity to identify the components of the design activity and their relationships (O'Donnell et al., 1999).

The design system can be defined as the environment where design projects (product or system or network design) take place. We have identified three factors influencing the design system which have to be considered to follow and manage suitably the
design system evolution and the design process (Robin et al., 2007):

– The context in which the design process takes place. It includes natural, socio-cultural and econo-organizational environments (external and internal environments). The external environment is the global context in which the enterprise is placed (its market, its rivals, its subcontractors . . .). The internal environment describes the enterprise itself: its structure, its functions and its organization. The internal environment is an exhaustive description of the system, in order to take into account all the elements which could have an influence on the decision-making at each decision-making level.
– The technological factor, which concerns the techno-physical environment (scientific and technological knowledge). Scientific knowledge regroups the natural sciences and the engineering sciences. Technological knowledge concerns the manufacturing practices and the technology. The interest is to have a global vision of the knowledge possessed and usable by the enterprise and to identify a potential lack of knowledge in some design tasks.
– The human and his different activities during the design process (the actor). Actor aspects have to consider the multiple facets of the designers. Human resources will be described with classical indicators (availability of a resource, hierarchical position, role in a project, training plan . . .). But factors very close to the actor's personality have to be taken into account too. These factors influence the design process and, more generally, the design system.

These three global factors are global performance inductors for the design system. These factors and their interactions are integrated in a model composed of a technological axis, an environment axis and an actor one (Fig. 2). Then specific objectives, action levers and performance indicators, dedicated to the design system, have to be identified according to the elements of this model.

To identify and manage relationships between the global factors influencing the performance of the design process, we propose to use product, process and organizational models (Fig. 2). The product model acts as a

up actor and external/internal environments (link 3, Fig. 2). The organization has to favour the allocation of adapted human resources to a specific situation in a particular context. These models are local performance inductors for the design system, and the interactions between them provide a dynamic vision of the design system evolution (links 4 to 6, Fig. 2). In this model, the description of the factors influencing the design system at each decision-making level provides a global vision of the design context. Hence, thanks to such a representation of the design context, the decision-maker can analyse the design situation and identify the particularities of each project. He is able to observe the evolution of each component (environment, technological and actor ones), the interactions between them, and consequently to adapt his project management method by taking the right management decision to satisfy objectives.

4 RMP: RISK ASSESSMENT

4.1 Risk identification

Risk can be planned or unexpected, of external or internal origin, linked to market trends, the eruption of new technologies, strategic or capitalistic decisions, etc. Risk identification is discovering, defining, describing, documenting and communicating risks before they become problems and adversely affect a project. There are various techniques that can be used for risk identification. Useful techniques include brainstorming methods as well as systematic inspections and technological surveys.

4.2 Risk analysis

According to Lemoigne (Lemoigne, 1974), complex system analysis refers to three points of view: a functional point of view, i.e. the description of system functionality and behaviour; an ontological or organic point of view, i.e. the description of the resources used (human or technical), materials and information, and the related control structures; and a genetic point of view, which renders system evolutions and development.
link between knowledge and external/internal environ- Consequently we have developed a modeling approach
ments (link 1, Fig. 2). Product is the expression of the to provide analysts with the appropriate view of a
scientific and technological knowledge of an enter- system. Approach regroups a functional model to sit-
prise and permits to evaluate its position on a market. uates the system within its environment, an organic
It’s a technical indicator which allows to make evolve view to depict the physical organization and resources
a firm or to identify a possible lack of competency to which achieve functions previously identified and
be competitive on a market. As process corresponds an operational view which stipulates the ways the
to the place where the knowledge is created and used organic system is exploited. The application of such
by the actors to develop the product, it connects actor a decomposition to the global and local factors of the
and knowledge (link 2, Fig. 2). Finally, influences of design system is proposed hereafter (Fig. 3). Exam-
environments on actors are taking into account by the ples of potential evolutions of these factors are also
mean of an organizational model. This model joins presented.

334
Global factors:
– Internal environment. Functional: evolution of the company role. Organic: evolution of the company structure. Operational: evolution of the company operating modes.
– External environment. Functional: evolution of the company place in the network. Organic: evolution of the partners in the network (subcontractors, collaborators,…). Operational: evolution of the partnerships (new contracts, new collaborations,…).
– Scientific and technological knowledge. Functional: company knowledge and know-how evolution (fundamental sciences). Organic: evolution of the methods and techniques associated to this knowledge. Operational: evolution of the operating modes.
– Actor. Functional: evolution of the actor's role. Organic: evolution of the actor's positioning in the structure. Operational: evolution of the actor's knowledge and competencies (training period,…).

Local factors:
– Product. Functional: evolution of the product service function. Organic: evolution of the product structure. Operational: behavioral evolution of the product.
– Process. Functional: evolution of the design activities. Organic: evolution of the design activities sequencing (anteriority, parallelism,…). Operational: evolution of the resources that realize the activities.
– Organization. Functional: evolution of the design system role. Organic: evolution of the design system structure (design centres). Operational: evolution of the design centres processes.

Figure 3. Potential modifications of local and global factors of the design system.

4.3 Risk evaluation

When a risk is identified, its impact or effect on the design system (and therefore on the design project, since the design system can be seen as the environment where design projects take place) is evaluated thanks to the FMECA methodology. Such a methodology makes it possible to highlight potential failure modes for a product or a process before the problems occur, to assess the risk associated with those failure modes, and to identify and carry out measures to address the most serious concerns (Kececioglu, 1991). In accordance with the FMECA methodology, the criticality of events is valued by the Risk Priority Number (RPN) (see Equation 1 below), which is linked to three criteria: the occurrence frequency of the event O, the gravity degree of this event G and the risk of its non-detection D.

RPN = O × G × D (1)

Figure 4. Criticality classes and domino effect on design system.
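Equation 1 lends itself to a direct computation. The sketch below (function names and input values are our own illustration, not taken from the paper) computes the RPN of each identified event and ranks events by decreasing criticality, which is the basis for prioritizing corrective actions:

```python
# Minimal FMECA-style sketch: compute RPN = O x G x D for each
# identified event and sort events by decreasing criticality.

def rpn(occurrence: float, gravity: float, non_detection: float) -> float:
    """Risk Priority Number as defined in Equation 1."""
    return occurrence * gravity * non_detection

def prioritize(events: dict[str, tuple[float, float, float]]) -> list[tuple[str, float]]:
    """Sort events (name -> (O, G, D)) by decreasing RPN."""
    scored = {name: rpn(*ogd) for name, ogd in events.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative values only:
events = {"A": (0.75, 45, 10), "B": (0.25, 65, 20)}
ranking = prioritize(events)
```

The event at the head of the ranking is the one whose dysfunctions should be treated first.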

The RPN makes it possible to organize possible corrective actions. The weighting of the criteria results from the expertise of leaders and/or analysts and/or operators, on the basis of experience.

5 RMP: RISK TREATMENT

Good risk treatment requires a decisional framework in order to visualize the different risk criticalities. We consider that a risk is (Sperandio et al., 2007):

Criticality classes, level of risk and decision:
– C1, acceptable in the present state (case 1, Fig. 4): no action or modification at the operational level; follow-up, monitoring and review; risk assessment.
– C2, tolerable under regular control (case 2, Fig. 4): modification at the organic level; follow-up, monitoring and review; risk assessment.
– C3, difficult to tolerate (case 3, Fig. 4): modification at the functional level; follow-up, monitoring and review; risk assessment.
– C4, unacceptable (case 4, Fig. 4): change of strategy; total reorganization of the project.

Figure 5. Criticality scale and decisional framework.
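The decisional framework of Figure 5 can be sketched as a small lookup table. The class names match the paper; the function and string constants are our own hypothetical encoding:

```python
# Hypothetical sketch of the Figure 5 decisional framework: each
# criticality class maps to a risk level and to the prescribed actions.

DECISIONS = {
    "C1": ("acceptable in the present state",
           ["no action or modification at the operational level",
            "follow-up, monitoring and review", "risk assessment"]),
    "C2": ("tolerable under regular control",
           ["modification at the organic level",
            "follow-up, monitoring and review", "risk assessment"]),
    "C3": ("difficult to tolerate",
           ["modification at the functional level",
            "follow-up, monitoring and review", "risk assessment"]),
    "C4": ("unacceptable",
           ["change of strategy", "total reorganization of the project"]),
}

def decide(criticality_class: str) -> list[str]:
    """Return the actions prescribed for a given criticality class."""
    _level, actions = DECISIONS[criticality_class]
    return actions
```

Such a mapping makes explicit that only a C1 risk leaves the structure of the design system untouched, while a C4 risk forces a total reorganization.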

Definition of the design project


Monitoring
and
Review
Modeling of the design system

Risk identification
Monitoring
Risk analysis and
Review
Risk evaluation

Monitoring
Corrective actions
and
No corrective action Review

Choice of design scenarios


Communication and consulting

Figure 6. Criticality classes and domino effect on the Risk Management Process.

– Slightly disturbing if it has no impact on the structure of the design system (case 1, Fig. 4): changes of resource capacities or modifications of operational constraints, for example. There are several types of operational constraints: precedence constraints, temporal constraints, cumulative constraints, disjunctive constraints, etc. A slightly disturbing risk belongs to criticality class number 1 (C1), which corresponds to a risk level acceptable in the present state.
– Fairly disturbing if it acts upon the organic definition of the design system (case 2, Fig. 4): capacities and/or competences of human and technical resources, legislative constraints, etc. Such a risk belongs to criticality class number 2 (C2), which corresponds to a risk level tolerable under regular control.
– Strongly disturbing if it requires strategic adjustments of the design system, impacting its functional characteristics (case 3, Fig. 4): change of industrial activity, integration of new technologies, modification of the supply chain, etc. Such a risk belongs to criticality class number 3 (C3), which corresponds to a risk level tolerable with difficulty.

– Fatal if it makes the design system obsolete (case 4, Fig. 4): bad investments, for example. Such a risk belongs to criticality class number 4 (C4), which corresponds to an unacceptable risk level.

The determination of these criticality classes depends on the specificities of each system. When all the risks impacting the design system are identified and all the corresponding RPN are calculated, a criticality scale can be created and the valuation of the classes can be done.

Reengineering the project during its life-cycle is a procedure triggered each time significant events which impact the project occur. The functional, organic and operational definitions of the project should then be tuned accordingly to support the reengineering reasoning.

A decisional framework enables the visualization of the risk criticalities. According to their value, the generic criticality of the risk consequences entitles one to define the actions to be carried out (see Figure 5). Obviously, such actions are dependent on the company's intention of doing risk management. Therefore, an effort combined with the actions of risk management can be defined according to a qualitative scale: for example, no action, vigilance or selective action, vigilance or periodic action, vigilance or continuous action, etc. Finally, the propagation of accidental scenarios and its impact on the Risk Management Process are presented hereafter (Figure 6). Such an approach makes it possible to take into account risks caused by the domino effect of design system dysfunctions.

6 INDUSTRIAL CASE

6.1 Introduction

The company produces ice creams for mass distribution. A characteristic of this branch of industry lies in the fact that production is seasonal. The off season corresponds to a relatively reduced production. At the height of the season, the company is required to have perfect control of its production equipment (very high rate of production), because any matter loss or overconsumption could have a strong impact on productivity. Therefore, the appearance of events susceptible to modify the functionality, the structure and/or the operational scenarios of the production system is extremely prejudicial for the company, and it is imperative to analyze the criticality of these events in order to rapidly launch adequate corrective actions.

Innovation has a key role to play in the performance of such a firm. Considered as a solution for growth and competitiveness, it is used by managers to create new sources of value.

In this example, we decide to analyze two possible evolutions (evolutions of the company or of its environment) which can constitute a risk for the design system. Our analysis is limited to the investigation of the impacts of the global inductors on the local inductors, but it is important to also consider that the modification of these global inductors can influence other global inductors. Let us consider two events E1 and E2. E1 corresponds to a breakdown in the process and E2 is a lack of natural aroma of chocolate.

6.2 Risk management

The criticality analysis of events susceptible to have an impact on the design system refers to a grid (Figure 7), in which the quotation scales of the occurrence frequency of the event (O), the gravity degree of the same event (G) and the risk of its non-detection (D) reflect the expectations of the various actors in the company:

– The occurrence scale (O) is defined by the appearance frequency of a dysfunction. An exceptional event is credited with 0.25 point; a rare event with 0.50 point; a recurrent event (weekly, daily) with 0.75 point; and a permanent event with 1 point.
– The gravity scale (G) considers the sum of the quotations obtained for the criteria previously defined: organization, process and product. Criteria are marked out of 25. Therefore, a criterion of minor gravity (operational consequence on the exploitation of the system) is credited with 5 points, while the criterion considered the most significant, i.e. the criterion which has big consequences on the exploitation of the system (functional definition), is credited with 25 points.
– The non-detection scale (D) measures the "probability" of not detecting a potential failure when the cause exists. The term "probability" used here is not standard, since the corresponding values are not contained between 0 and 1. Voluntarily, the graduation is bounded by 10 and 20, in order to clearly indicate the importance of the various causes at the origin of the dysfunctions.

It is then necessary to identify the origin of the risk and to quantify the impact of this risk on the elements of the design system (product, process, organization). Such work is carried out by the FMECA group: one or more experts who have knowledge of the tool and conduct the discussions, actors of the design who provide essential knowledge relating to the design system, and persons in charge of other services (production, marketing, etc.) who evaluate the impact of the risk in their own fields. In accordance with the FMECA methodology, the criticality of events is valued by the Risk Priority Number (Figure 7).

Figure 7. Impact of enterprise evolution on the design system. [Grid relating causes (evolution of the global inductors: internal/external environments, scientific and technological knowledge, actors; non-detection D quoted 10/15/20) to consequences (evolution of the local inductors: organization, product, process, quoted 5/15/25 at the functional, organic and operational levels) and to the occurrence O of the dysfunction (exceptional 0.25, rare 0.50, recurrent 0.75, permanent 1), yielding an RPN of 412.5 for R1 and 325 for R2.]

The Risk Priority Number of E1 is given by:

RPN1 = 15 × (15 + 25 + 15) × 0.5 = 412.5 (2)

The Risk Priority Number of E2 is given by:

RPN2 = 20 × (25 + 25 + 15) × 0.25 = 325 (3)

6.3 Conclusion

The Risk Priority Number of E1 is higher than the Risk Priority Number of E2. Therefore, in order to launch the corrective actions efficiently, it will be necessary to treat the dysfunctions due to E1 first. Moreover, the impacts of the risks on the design system have been quantified, which will make it possible to adjust the design strategy. Events leading to operational modifications of the company are common, and are an integral part of everyday life. With this in mind, the company engaged in a continuous improvement initiative (preventive and autonomous maintenance) in order to, on the one hand, prevent slightly disturbing events and, on the other hand, solve all the dysfunctions which can exist in production workshops (dysfunctions related to the workers' environment). Such a step aims at extending the life span of the equipment and decreasing the times of corrective maintenance.

7 CONCLUSION

Reengineering the design project during its life-cycle is a procedure triggered each time significant events which impact the project occur. The functional, organic and operational models (or definitions) of the design system should then be tuned accordingly to support the reengineering reasoning. The methodological guidelines are based on event criticality analysis. A classification of events was made to guide the analysts towards appropriate model tuning, such that the representation of the system remains permanently in conformity with the system despite the continuous modifications encountered by the system during its life-cycle. A risk management methodology is also provided in order to take into account risks caused by the domino effect of design system dysfunctions. The Risk Management Process includes the tasks of establishing the context, identifying, analysing, evaluating, treating, monitoring and communicating risks resulting from design system dysfunctions.

REFERENCES

Aloini, D., Dulmin, R., Mininno, V. (2007). Risk management in ERP project introduction: Review of the literature, in: Information & Management, doi:10.1016/j.im.2007.05.004.
Balbontin, A., Yazdani, B.B., Cooper, R., Souder, W.E. (2000). New product development practices in American and British firms, in: Technovation 20, pp. 257–274.
Gero, J.S. (1998). An approach to the analysis of design protocols, in: Design Studies 19 (1), pp. 21–61.
Kececioglu, D. (1991). Reliability Engineering Handbook, Volume 2. Prentice-Hall Inc., Englewood Cliffs, New Jersey, pp. 473–506.

Keizer, J., Halman, J.I.M., Song, X. (2002). From experience: applying the risk diagnosing methodology, in: Journal of Product Innovation Management 19 (3), pp. 213–232.
Kwak, Y.H., Stoddard, J. (2004). Project risk management: lessons learned from software development environment, in: Technovation 24, pp. 915–920.
Lemoigne, J.L. (1974). The manager-terminal-model system is also a model (toward a theory of managerial meta-models).
O'Donnell, F.J.O., Duffy, A.H.B. (1999). Modelling product development performance, in: International Conference on Engineering Design, ICED 99, Munich.
Robin, V., Rose, B., Girard, Ph. (2007). Modelling collaborative knowledge to support engineering design project manager, in: Computers in Industry 58, pp. 188–198.
Sperandio, S., Robin, V., Girard, Ph. (2007). PLM in the strategic business management: a product and system co-evolution approach, in: Proceedings of the International Conference on Product Lifecycle Management, July 11–13 2007, Milan, Italy.
Wallace, L., Keil, M., Rai, A. (2004). Understanding software project risk: a cluster analysis, in: Information & Management 42, pp. 115–125.
Wang, F., Mills, J.J., Devarajan, V. (2002). A conceptual approach managing design resource, in: Computers in Industry 47, pp. 169–183.

On some aspects related to the use of integrated risk analyses for the decision making process, including its use in the non-nuclear applications

D. Serbanescu, A.L. Vetere Arellano & A. Colli
EC DG Joint Research Centre, Institute for Energy, Petten, The Netherlands

ABSTRACT: Currently, several decision-support methods are being used to assess the multiple risks faced
by a complex industrial-based society. Amongst these, risk analysis is a well-defined method used in the
nuclear, aeronautics and chemical industries (USNRC, 1998; Haimes, 2004). The feasibility of applying
the Probabilistic Risk Assessment approach (USNRC, 1983) in the nuclear field (PRA-Nuc) for some new
applications has been already demonstrated by using an integrated risk model of internal and external events
for a Generation IV nuclear power plant (Serbanescu, 2005a) and an integrated risk model of random tech-
nical and intentional man-made events for a nuclear power plant (Serbanescu, 2007). This paper aims to
show how such experiences and results can be extended and adapted to the non-nuclear sectors. These extensions have been shown to trigger two main methodological novelties: (i) more extensive use of subjective probability evaluations in the case of non-nuclear applications, and (ii) inclusion of hierarchical systems
theory in the PRA modelling. The main aspects of the results and conclusions of the above-mentioned
cases, along with insights gained during this analysis are presented and discussed in this paper. In particu-
lar, this paper is a synthesis of insights gained from modelling experiences in extending PRA-Nuc to new
applications.

1 INTRODUCTION

Risk analysis methods are used to support decision-making for complex systems whenever a risk assessment is needed (USNRC, 1983). These methods are well-defined and routinely employed in the nuclear, aeronautics and chemical industries (USNRC, 1998; Haimes, 2004; Jaynes, 2003). This paper will describe experiences gained and lessons learnt from four applications: (i) use of PRA-like methods combined with decision theory and energy technology insights in order to deal with the complex issue of SES (Security of Energy Supply); (ii) use of PRA for new applications in the nuclear field (PRA-Nuc); (iii) use of PRA for modelling risks in non-nuclear energy systems, with special reference to hydrogen energy chains; and (iv) use of PRA for modelling risks in photovoltaic energy production. An important experience gained in this research is the discovery that the problems encountered in the use of this approach in nuclear and non-nuclear applications have common features and solutions, which can be grouped into the set of paradigms presented below.

2 CONSIDERATIONS ON PRA EXTENDED USE

The next paragraphs describe the seven "problem-solution" paradigms common to all four applications.

2.1 Suitable systems for PRA modeling

The systems that could be modelled using PRA-Nuc should be a special type of complex systems, i.e. complex, self-regulating, self-generating and hierarchical [Complex Apoietic Systems, CAS] (Serbanescu 2007a; Serbanescu et al, 2007a, 2007b).

The model to be built and the method to be used for the evaluation of the risk induced by a CAS have to comply with seven requirements: (i) To be systemic (i.e. use of systems theory in modelling), systematic (i.e. use of the same approach throughout the entire PRA process) and structured (i.e. consider the model hierarchical and evaluate each of the levels one by one, either by using a top-down or a bottom-up approach); (ii) To be able to define structures of dynamic cybernetic interrelations between components with both
random and intentional types of challenges to the CAS, and to solve nonlinear dynamic models by defining what linearity means for a CAS; (iii) To define the system as a whole, as being a result of the synergetic interfaces of its components, and to define also the CAS interface with the environment; (iv) To have a solution for the system control (like for instance distributed control, hierarchical control and/or external to the CAS unitary control); (v) To have a system management based on predefined objectives, such as energy/substance balance or risk impact, including validation and verification processes; (vi) To solve the specifics of the cause-effect issue for a CAS, which is connected in its turn to other issues like the definition of linearity, uncertainty and system structure modeling; (vii) To be dynamic and highly flexible in defining initial and boundary conditions.

The feasibility of extending PRA-Nuc applications was previously demonstrated for other CAS-like Generation IV nuclear power plants (Serbanescu, 2005a, 2005b), and an integrated risk model of random technical and intentional man-made events for a nuclear power plant was described in Serbanescu 2007a, 2007b.

2.2 System description

Figure 1 shows the general process of PRA analysis for CAS systems: for all the Initiating Events (IE), the reaction of the CAS is modeled considering also the expected End States (ES) of each scenario, while the PRA model for a CAS is represented in Figure 2.

Figure 2. Representation of a CAS model to be implemented in PRA-Nuc codes (e.g. Risk Spectrum).

Three levels of modeling can be identified: level 1 is related to the CAS reaction after all the functional barriers have reacted; level 2 represents the CAS model after the last physical barrier separating the installation from environment/population has been considered; and level 3 represents the final-total risk impact of the challenges to the installation.

As modeling also requires a representation of the interconnections between the systems, a set of "Connecting End States" (CES) is defined. CES are used to assure both "horizontal" (at the same level of the CAS model) and "vertical" (between levels of the CAS) connections between scenarios.

As an example, Figure 3 summarizes the CAS model being developed for the modeling of the hydrogen distribution station. Similar models were also built for security of energy supply evaluation (Serbanescu et al, 2008) and for the photovoltaic manufacturing process (Colli et al, 2008). Abbreviation codes for each of the elements are shown in Figure 3; abbreviations IV1, IV2 and IV3 are used for isolating valve systems.

Figure 3. Sample representation of the main systems of a hydrogen distribution installation.
Figure 1. General representation of the PRA process for a Complex Apoietic System (CAS).

2.3 Scenarios and faults (barriers) description

Figures 4 and 5 illustrate the general process of generating ES by combining the scenarios defined in the Event Trees with the Fault Trees of the systems, which are assumed to operate as barriers against various challenges. Such a process is known in PRA-Nuc as "integration of Fault Trees into Event Trees".

In these figures, the representation technique shown is similar to the graph theory approach and is in accordance with the existing standards (USNRC, 1983). Nodes and basic events (i.e. final failures of a given chain of failures) represent failures in both figures. However, in Figure 4 the nodes show the possibility of failure of entire systems of the CAS, which were designed to act as barriers, while in Figure 5 the nodes show the possibility of failure of given components in a given system. In Figure 4, ESα(1-n) represent end states of a given scenario, which indicate certain levels of damage after the scenario took place, while ESκ(1-n) are used as CES. The TOP nodes from Figure 5 are connected with the SC and EN type nodes in Figure 4. The calculation of the scenarios leading to an end state in Figure 4 will include combinations of failures at the level of the components of the CAS. Please note that the notations shown in these figures follow the PRA-Nuc principles and are used during the implementation of the model into the computer codes.

Figure 4. Sample representation of a CAS fault tree model.

Figure 5. Sample representation of a CAS event tree model.

It is possible to show that a CAS model of challenges generates a σ-algebra over the set of all possible ES (Serbanescu 2005a, 2007a). In such a structure, "risk" is defined as a norm in the measurable vector space of the sets of scenarios. The risk is defined as the distance between the normal state and an altered end state. This distance is usually calculated as a product between the probability of a given sequence of scenarios and the damages associated with the end states after the scenario takes place.

The scenarios resulting from Boolean combinations of failures differ in many aspects, for instance in the combination of failed barriers or in the ES (final conditions) after the scenario came to an end for the given level of the PRA model.

Figure 6 shows an example of a CAS scenario definition. End states for levels 1 and 2 are also known as Release Categories (RC), while ES of level 3 are known as Risk Categories (RK). More details and representative examples of these entities are presented in (Colli et al, 2008).

Figure 6. Sample representation of CAS scenario definition.

In any case, the ES identification is subject to continuous iterations and sensitivity analyses following the process previously represented in Figure 1. Table 1 is a sample of end states for the hydrogen installation shown in Figure 3. The "Code" column is only indicative, to show that a code is needed for the implementation of the model into PRA-Nuc computer codes. A sample of a typical IE list for the hydrogen installation presented in Figure 3 is: (i) Breaks (e.g. break
Table 1. Example of End States for Hydrogen installation. between the acceptable threshold for risk and the
frequency of a given event (IE) to happen.
End State Parameter(s) Code From a topological point of view, this threshold in
a 3-dimensional subset (defined by the risk, probabil-
Effect levels
ity of an IE and the dominant parameter of the CAS)
Overpressure 10 mbar EFOV4
500 mbar EFOV4 splits the space of possible solutions into two areas sep-
arating the acceptable solutions from the unacceptable
Thermal 3 kW/m2 FTH1 ones. The threshold line is linear if the axes of risks and
35 kW/m2 EFTH3 event probability/frequency are represented in the log-
Gas Cloud Not dangerous EFGC1 arithmic scale. This line is another representation (in
Dangerous in size EFGC2 a 3D form) of Pareto sets of solutions as presented in
(Serbanescu 2005a, 2007a), and for this reason, PRA
Harm effects
Overpressure Window break HOVDG1 can be also used to define the Pareto set of acceptable
Building collapse HOVDG2__ solutions for a CAS.
Eardrum rupture HOVINJ__ More in detail, (Serbanescu 2005a) one can con-
Fatalities 1% HOVFAT1__ sider a system as a cybernetic hierarchical one (e.g. of
Fatalities 100% HOVFAT3__ a CAS type as explained before).
Thermal Burns degree 1 HTHBU1 As it was shown in Serbanescu 2005a for such
Burns degree 2 HTHBU2__ systems the distance from the best solution is quan-
Glass/window fail HTHFIRE__ titatively described by a specific type of risk measure
Fatalities 1% HTHFAT1__ calculated as a loss of information entropy (Jaynes,
Fatalities 100% HTHFAT3__ 2003, Smithson, 2000). Consequently, the optimum
Fire Fatalities 1% HFIFAT1 for a certain problem is achieved when the Shannon
Fatalities 100% HFIFAT3 information entropy1 reaches a minimum, given the
variables constraints. In mathematical terms, this leads
Explosions Fatalities 1% HEXFAT1 to the task of finding the optimum for (1) with the limit
Fatalities100% HEXFAT3
of vessels/cylinders in storage cabinet unit); (ii) Leaks (e.g. leak from the hydrogen inlet ESD); (iii) Overpressurization (e.g. overpressurization of 10 mbar in the PD Unit); (iv) Thermal challenges to the installation (e.g. thermal impact of 3 kW/m2 in the PD Unit); (v) Fire, explosions, missiles (e.g. fire generated by installation failures in the PD unit); (vi) Support systems failure (e.g. loss of central control); (vii) External events (e.g. external flood in UND); (viii) Security threats (e.g. security challenge in an area close to the units).

A specific feature of the CAS model level 3 for security of energy supply (SES) is that the ES are SES risk-related criteria. In this case, scenarios leading to the various SES categories are ranked on the basis of their risk and importance, in accordance with the PRA-Nuc methodology. It is also important to note that for the SES case, a set of economic (e.g. energy price variations) and socio-political initiators (e.g. the result of a referendum on a given energy source) were also defined (Serbanescu 2008).

2.4 Objective function and risk metrics

For any type of CAS (nuclear, chemical, etc.), constraints exist in terms of risk. These constraints are usually represented in the form of a linear dependency, conditions (2); the objective function is of Shannon information entropy type:1

Sinf = Σi xi ln xi,    (1)

Σi xi ≤ α.    (2)

In formulas (1) and (2), X stands for any risk metric of a given PRA level (e.g. probability of failure or risk) and α stands for the limitations imposed on those parameters (e.g. an optimization target for the acceptable probability of a given ES, or risk values). The solution of equations (1) and (2) is represented in the form of a Lagrangean Function (LF).

1 Shannon information entropy is defined for the set of CAS risk metrics solution.

In order to analyse possible changes to the risk metrics, a tool is needed to evaluate the resulting impacts so that decisions can be taken on what course of action to follow. The proposed tool for carrying this out is based on Perturbation Theory (Kato, 1995). This approach is used for the matrices built with event trees and fault trees in specialized PRA computer codes. Therefore, in order to evaluate the impact on the CAS models of challenges and/or modifications and/or sensitivity-analysis-related changes, linearity is assumed within the "perturbated" limits of the CAS model. This linearity is related to the logarithmic risk metrics set of results.

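No numeric example accompanies formulas (1)–(2) in the paper, so the following is an illustrative stdlib-Python sketch only. It assumes the usual entropy sign convention Sinf = −Σ xi ln xi, an active constraint Σ xi = α, and hypothetical function names: stationarity of the Lagrangean LF(x, λ) = −Σ xi ln xi − λ(Σ xi − α) gives −ln xi − 1 − λ = 0, i.e. a uniform allocation.

```python
import math

def max_entropy_allocation(n, alpha):
    """Closed-form stationary point of the toy Lagrangean: all xi equal alpha/n."""
    return [alpha / n] * n

def entropy(x):
    """Sinf with the conventional minus sign: -sum(xi * ln xi)."""
    return -sum(xi * math.log(xi) for xi in x)

x_star = max_entropy_allocation(4, 1.0)   # four risk metrics, budget alpha = 1
print(x_star)                              # -> [0.25, 0.25, 0.25, 0.25]
print(entropy(x_star))                     # ln 4, the maximum under the constraint
# Any skewed allocation with the same budget scores lower:
assert entropy(x_star) > entropy([0.7, 0.1, 0.1, 0.1])
```

The uniform solution is an artifact of this toy objective; in the PRA-Nuc setting the xi would be tied to the event-tree/fault-tree structure, so the stationary point would be found numerically rather than in closed form.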
The verification of the impact of this assumption is one of the main targets of the sensitivity cases.

As a consequence of such an assumption, the building of the model and the calculation of solutions are guided by formulas of type (3), in which R represents the risk metrics function, the index NL refers to the "non-perturbated" initial model/solution, HC is the induced perturbation in the form of a Hazard Curve, R0 is the non-perturbated solution for the risk metrics (the reference solution), ΔR is the modification induced in the results by a given assumption, and εerror is the tolerance:

RNL = FNL(PM, HC) = F0 ⊗ HC = R0 + ΔR + εerror    (3)

Figure 8. 3-dimensional representation of the risk metrics results for CAS level 3.
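The linearity check that formula (3) implies can be sketched generically. This is an illustration only: F below is a one-dimensional stand-in for the model response (the paper applies the idea to event-tree/fault-tree matrices), and `first_order_check` is a hypothetical helper name.

```python
import math

def first_order_check(F, x0, h, tol):
    """Compare the exact perturbed response with the linear estimate R0 + dR."""
    eps = 1e-6
    dF = (F(x0 + eps) - F(x0)) / eps   # numerical sensitivity of F at x0
    r0 = F(x0)                          # non-perturbated reference solution R0
    r_lin = r0 + dF * h                 # linear estimate of the perturbed result
    eps_error = abs(F(x0 + h) - r_lin)  # residual, the epsilon_error of (3)
    return eps_error <= tol

# A logarithmic response, echoing the remark that the linearity is tied to
# the logarithmic risk metrics scale:
F = math.log
print(first_order_check(F, 1.0, 0.01, 1e-3))  # -> True: small perturbation
print(first_order_check(F, 1.0, 0.8, 1e-3))   # -> False: outside the linear range
```

In this spirit, a sensitivity case amounts to sweeping the perturbation h and recording where εerror first exceeds the chosen tolerance.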
As shown in Figure 7 and in formula (3), F is the function that builds the solutions of the PRA-Nuc model; it is defined on the sets of end states PM and the hazard curve (HC) of a given new phenomenon to be modeled, and takes its values in the sets of values obtained for the risk calculations of all ES, as defined by the set RNL (Serbanescu 2007a).

Figure 7. Sample representation of the definition of the solutions for PRA-Nuc in new applications based on the old ones.

An example of risk metrics results for a Generation IV nuclear power plant is explained in more detail in Serbanescu 2005a, where different approaches to evaluating risk metrics were discussed regarding their use in decision-making related to solutions optimized from a risk perspective. Figure 8 shows two approaches to presenting risk metric results. A2 is the classical approach, as defined by USNRC (1983). B2 is the approach based on the Lagrangean Function (LF) mentioned in Section 2.4 of this paper. If we would like to use these approaches in decision-making, it is important to note the subtle difference between A2 and B2. In the classical approach (A2), the user has to bear in mind both the space of acceptable solutions from a risk perspective (i.e. the paraboloid inside the cone) and the limitations (the cone). In the LF approach (B2), the limitations are already embedded in the space of acceptable solutions defined by the internal paraboloid (i.e. the cone is already embedded within the paraboloid). The external paraboloid in B2 indicates the external upper bound of the solutions and not the limits defined by the cone in A2.

In both cases the acceptable solutions from a risk perspective indicate scenarios for acceptable design solutions with various degrees of feasibility, providing also important information for the CAS risk optimization process.

As shown in Figure 8, there are 10 types of solutions identified in the solution space. For A2 these are: 1-low risk impact—not difficult design solutions; 2-low risk impact—solutions defined mainly by active systems features; 3-medium risk impact—achievable design solutions; 4-medium to high risk impact—solutions supported by passive features; 5-high but still acceptable risk impact—difficult design solutions. For B2 these are 6–10, where 6 corresponds to 5 (i.e. high but still acceptable risk impact—difficult design solutions) and 10 corresponds to 1 (i.e. low risk impact—not difficult design solutions). It is also worth noticing that the results shown in Figure 8 are typical for all the CAS mentioned in this paper (details for the other cases are in (Colli et al. 2008) and (Serbanescu et al. 2008)).

2.5 Risk calculation and decision making processes

One of the main uses of risk analyses is to support the decision making process in various areas. The graphical representation of the commonly used decision analysis methods was introduced by Howard (1984).

PRA-Nuc has been used to support the decision making process both as guiding information (Risk Informed Decision Making—RIDM) or

to assess risk based criteria (Risk Based Decision Making—RBDM). The experience gained in its application to the nuclear system led to the identification of a set of areas of applicability for both the deterministic and the probabilistic tools, as shown in Figure 9.

Figure 9. Sample representation of areas of applicability for decision making of deterministic and probabilistic approaches (Serbanescu 2007b).

In this classification the area of applicability is defined by two criteria: (i) credibility in the assumed degree of conservatism of the model and (ii) credibility in the uncertainty level of the built model and of the applied method to evaluate risk.

Formula (4) can be applied in order to find out how certain one can be about a risk analysis based on a probabilistic and deterministic set of results and/or how to "combine" them (if possible):

O = (P ⊗ U(P)) ⊗ (D ⊗ U(D)) ⊗ (F ⊗ U(F))    (4)

where the five ⊗ operators correspond, from left to right, to RP, RG1, RD, RG2 and RF.

The function O (the Objective of the decision process) is the result of a combination using a series of logic operators (⊗):

• RP and RD for reasoning on the credibility of the probabilistic and, respectively, the deterministic results, and RF for reasoning on the credibility of the reasoning based on feedback from experiments/real cases;
• RG1 and RG2 for connecting the results of reasoning based on:
  ◦ probabilistic evaluations (for the terms noted with P—probabilistic statements and UP—uncertainties of the probabilistic statements),
  ◦ deterministic evaluations (for the terms noted with D—deterministic statements and UD—uncertainties of the deterministic statements),
  ◦ feedback review statements (for the terms noted with F—statements based on feedback review and UF—uncertainty of the statements from feedback review).

It is also important to notice that risk analysis results are fundamentally divided into "deterministic" oriented statements and "probabilistic" oriented statements.

For deterministic judgments the result is composed of the criteria value D and the level of uncertainty in this value (UD); for the probabilistic results the components are P and UP. There is also a component of the results given by the feedback from the real object when compared to the model (the F set of statements).

The operator ⊗ will have a low (L), medium (M) or high (H) impact on the final function, as shown in Table 2, depending on the type of judgment case in which the decision maker positions himself (optimistic, pessimistic, etc.). How the final credibility should be assessed for a given set of deterministic results is illustrated in Table 2, which shows that the role of the decision maker can also be modeled and considered a priori, so that variations in the conclusions drawn from the same risk results by various interest groups can be predicted and understood. Understanding risk results is one of the main conditions for assuring a good risk governance process and for maximizing the use and impact of the risk evaluations.

2.6 Understanding errors and judgment biases

In the search for solutions to the task of self-assessing the credibility of risk analysis results, two groups of systematic biases are identified. The first is related to the generic scientific method and its drawbacks, for which a series of systematic errors (called "scientific myths") exists, as defined in (Mc Comas, W. 1996): (i) Hypotheses become theories which become laws; (ii) A hypothesis is an educated guess; (iii) A general and universal scientific method exists; (iv) Evidence accumulated carefully will result in sure knowledge; (v) Science and its methods provide absolute proof; (vi) Science is procedural more than creative; (vii) Science and its methods can answer all questions; (viii) Scientists are particularly objective; (ix) Experiments are the principal route to scientific knowledge.

The second group of biases is related to the "myths" generated by risk analyses for CAS, for which a list is defined by (Hansson S.O., 2000): (i) "Risk" must have a single, well-defined meaning; (ii) The severity of risks should be judged according to probability-weighted averages of the severity of their outcomes; (iii) Decisions on risk should be made by weighing total risks against total benefits; (iv) Decisions on risk

Table 2. Sample representation of the reasoning operators from formula (4) used in decision making statements.

Case 1: optimistic trust in risk results; Case 2: pessimistic trust in risk results; Case 3: neutral trust attitude on risk results; Case 4: overoptimistic trust of risk results; Case 5: overpessimistic trust attitude in risk results.

Impact function               Case 1  Case 2  Case 3  Case 4  Case 5
P                             L       L       M       L       M
RP                            L       L       M       L       H
U(P)                          H       H       H       H       H
TOTAL P                       L       L       M       L       H
D                             H       H       M       H       L
RD                            H       L       L       H       L
U(D)                          L       L       L       L       L
TOTAL D                       M       M       L       M       L
F                             H       H       M       H       L
RF                            H       L       L       H       L
U(F)                          L       L       L       L       L
TOTAL F                       H       M       L       M       L
RG1                           M       M       L       L       L
RG2                           H       M       M       H       M
O (total objective function)  M       M       H       L       L
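Table 2 is easy to mechanize as a lookup. The sketch below simply encodes the printed values; the combination operator ⊗ itself is only characterised qualitatively in the paper, so no calculus is implemented, and the helper names are hypothetical.

```python
# Table 2 encoded as data: qualitative impact (L/M/H) of each term of
# formula (4) for five decision-maker attitudes.

CASES = ["optimistic", "pessimistic", "neutral", "overoptimistic", "overpessimistic"]

TABLE2 = {
    "P":       ["L", "L", "M", "L", "M"],
    "RP":      ["L", "L", "M", "L", "H"],
    "U(P)":    ["H", "H", "H", "H", "H"],
    "TOTAL P": ["L", "L", "M", "L", "H"],
    "D":       ["H", "H", "M", "H", "L"],
    "RD":      ["H", "L", "L", "H", "L"],
    "U(D)":    ["L", "L", "L", "L", "L"],
    "TOTAL D": ["M", "M", "L", "M", "L"],
    "F":       ["H", "H", "M", "H", "L"],
    "RF":      ["H", "L", "L", "H", "L"],
    "U(F)":    ["L", "L", "L", "L", "L"],
    "TOTAL F": ["H", "M", "L", "M", "L"],
    "RG1":     ["M", "M", "L", "L", "L"],
    "RG2":     ["H", "M", "M", "H", "M"],
    "O":       ["M", "M", "H", "L", "L"],
}

def impact(term, case):
    """Qualitative impact (L/M/H) of a formula-(4) term for a given attitude."""
    return TABLE2[term][CASES.index(case)]

# e.g. the total objective function O for a neutral attitude:
print(impact("O", "neutral"))  # -> H
```

Such an encoding makes it straightforward to predict, a priori, how the same risk results would be read by differently positioned interest groups, as the text suggests.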

should be taken by experts rather than by laymen; (v) Risk-reducing measures in all different sectors of society should be decided according to the same standards; (vi) Risk assessments should be based only on well-established scientific facts; (vii) If there is a serious risk, then scientists will find it if they look for it.

2.7 Solutions to paradoxes

Identifying the existence of possible systematic errors, induced by the various pre-concepts usually generated in general science and in risk analyses of CAS in particular, is only the first step in an attempt to find a solution for the continuous improvement of the credibility of results.

In order to search for possible solutions to the systematic errors in the models and in the methods, some criteria were identified. One of them is related to the knowledge acquisition process. If the theory of the triadic approach as defined in (Peirce, C.S., 1931) is applied as one possible interpretation of the process, then the process of developing CAS models can be represented as in Figure 10.

Figure 10. Sample representation of the reasoning triads.

The process of finding solutions for the CAS model using an adequate method can then be represented as in Figure 11, where the area of acceptable solutions for the CAS model is defined by the intersection between the areas of the set of model representations and the set of solutions obtainable with the method described before.

The model is therefore defined by a set of three relations between the set of real objects (R), for which the set of CAS (S) models is built, while generating the set of systematic biases (M).

As shown in Figure 11, the triad of models is represented by KNW-SOC-ENV (being equivalent to the triad in Figure 10) and the triad of methods is represented by RSK-DEC-CAS, which is built based on PRA-Nuc with the specifics presented previously in this paper (cf. Sections 1, 3 and 4).

The intersection of the two triads (models and methods) generates four areas of solutions, represented by the concentric circles (0, I, II and III) shown in Figure 10. These circles indicate, in increasing order, the level of uncertainty and of the knowledge gained by a certain risk evaluation process.

Figure 11. Sample representation of the process of building methods in order to find solutions to the CAS type models.

Another important aspect is related to the search for systematic errors in CAS risk analysis. This search is done using the Cartesian approach in each triad of the model or method. This search, which is shown in Table 3, is used for all steps of the CAS modelling and solution-finding processes, independently of the risk evaluation approach type, i.e. "deterministic" or "probabilistic".

Table 3. Representation of the reasoning process as per (Descartes, 1637).

GOALS AND CRITERIA
R1 The goal of knowledge is to formulate judgments (true and fundamental) on Objects.
R2 Objects are targets for which knowledge is indisputable and real.
R3 The main criterion for sound knowledge is to gain for oneself an intuition of the objects.
THE TOOL IS DEFINED BY THE RULES
R4 The Tool is the Method.
R5 The tool's features are to aim at Organizing and Developing a Gradually Increased Complexity Hierarchy of the Objects—GICHO.
R6 In order to fulfill rule R5, a Complexity Level Criteria (CLC) derivation process is used, so that the following are performed:
➣ to define what is simple and hence the reference CL0;
➣ to define the departure of the other objects from CL0: ΔC = CL − CL0.
EVALUATE COMPLETENESS OF KNOWLEDGE
R7 Evaluate the objects in interaction (synergetic mode). The condition for a successful method is to consider knowledge (KNW) as a process—KNWP.
R8 The application of the method has a necessary and sufficient condition, i.e. the existence of an intuitive understanding of each object at all levels.
R9 The KNWP has to start from CL0 and to get confirmation at each step that the subject can understand the object.
R10 The subject improves himself by practicing on CL0.
RESULTS OF KNWP
R11 The subject has to use all mind tools to achieve KNW: i) Reasoning, ii) Imagination, iii) Senses and iv) Memory, in order to:
➣ perceive CL0 by intuition;
➣ build KNW starting from CL0.

In (Serbanescu 2005a) a suggested series of steps was identified in all the CAS modelling performed so far for new PRA applications.

In order to define the specific actions needed to avoid loss of control over the process of credibility self-assessment of results, three preliminary steps were performed (Serbanescu 2007b):

– First step: identify the detailed governing principle of each of the phases described below:
P1 A unique source modeled with an integrated method leads to the need to diversify the definitions of the objective functions.
P2 Deterministic and probabilistic approaches, as complementary approaches, seem to cope with the dual features of CAS risk models, but their use creates paradoxes if the applicability of each of them is not defined.
P3 Combining deterministic and probabilistic data by using weights in judgments on random variables generates numerical results whose credibility depends on a solution that is not at the level of a simple combination between the two types of data (deterministic and probabilistic).
P4 Testing the model in order to check its stability against the paradoxes identified up to this phase results in a closed logic loop that defines the reference case while remaining within the limits of the model itself.
P5 A CAS model can be in one of nine stages. They follow each other, and their duration, intensity and degree of challenge to paradoxes at each phase differ from one CAS type to another. Furthermore, the causal approach of identifying the beliefs behind the paradoxes of each phase results in challenges to the cause-effect approach.
P6 The CAS model is built for a given set of objective functions, and the metrics of this model is expected to include risk, which usually is not one of the intended goals for the user.
P7 Merging deterministic and probabilistic models forces the modeller and the user to have common statements about combinations of probabilistic (which are by default uncertain) and deterministic events. However, this can sometimes lead to conflicting conclusions with problems of logical inconsistency.

P8 The management of the risk model leads to managerial/procedural control in order to limit the uncertainty in the real process of CAS evaluation. However, this action in itself creates new systematic assumptions and errors and shadows the ones accumulated up to this phase.
P9 The completion of a nine-phase CAS cycle and its implementation reveal the need to restart the process for a better theory.

– Second step: define the beliefs identified to be the main features of each of the steps:
P1 It is assumed that there is a unique definition of risk and that risk science has a unique and unitary approach that gives all the answers.
P2 It is assumed that well-established scientific facts, showing both random and deterministic data used in a scientific manner, could provide support for certitude by using risk assessments.
P3 It is assumed that in the case of risk analyses a scientific method of universal use exists to evaluate the severity of risks by judging them according to their probability and the outcomes/damages produced.
P4 It is assumed that by using carefully chosen experience and model results one can derive objective results proving the validity of the results for the given CAS model.
P5 It is assumed that by using educated guesses and experiments scientists can find and evaluate any significant risk, due to the objectivity and other specific features of science.
P6 It is assumed that, based on the objectivity of science and the approach in risk analyses of evaluating risks against benefits, the results could be used as such in the decision-making process.
P7 It is assumed that by using a scientific method, which is honest and objective, and by using risk-reducing measures in all sectors of society (any type of CAS model), the combined use of well-proven tools in all science of analysis/deterministic and synthesis/probabilistic approaches assures success in CAS modeling.
P8 It is assumed that science is more procedural than creative (at least for this type of activity) and that the decisions themselves have to be made by trained staff and scientists.
P9 It is assumed that science evolves based on laws which appear by the transformation of hypotheses into theories, which become laws, and that for any CAS, in this case, if there is a real risk then the scientists will find it.

– Last step: a set of actions to prevent the generation of paradoxes is defined:
P1 Model the diversity of objective functions for the CAS metrics and use a hierarchy for its structure, looking for an optimum at each hierarchical level.
P2 The applicability areas of the deterministic and probabilistic parts in the CAS model and in the decision module are to be clearly defined and used.
P3 Use numerical and logical functions as switches and connectors between the deterministic and probabilistic parts of the CAS models and their metrics.
P4 Use a special sensitivity analyses phase to define the sensitivity to unseen/not clearly formulated assumptions embedded in the model, based on their reflection in the paradoxes. In the screening process of those issues, use diverse methods not included so far in the CAS model.
P5 Perform a full inventory of the identified paradoxes in the CAS model and of the possible beliefs generating them, in order to have a better understanding of the CAS model bias and to define the CAS reference model for further analysis.
P6 Identify the rules for the post-processing of risk analysis results to be used in the decision making process, by defining the place and desirability for the user of Risk Informed Decision Making—a module to be added to the actual results from risk analyses for CAS models.
P7 The merging of the deterministic and probabilistic approaches in CAS risk models is accompanied by an improved set of logical constructions of the formulations of results and of the modules merging the two approaches.
P8 Continuously check the objectivity of the CAS model through risk managerial actions, including their potential distortion of the model.
P9 Restart the cycle of modeling even if there is no user request for it, since there is always a need to have fully coherent answers to all the paradoxes encountered, even if they seem of no interest to the user and/or the scientific community.

The expected effect of the actions implemented in the CAS risk analysis process was represented and discussed in (Serbanescu 2005a). The main conclusion of this representation showed (with examples from Generation IV modeling) the importance of feedback modeling in CAS risk analyses in order to assure a convergent set of solutions by identifying and managing mainly the possible systematic errors.

3 CONCLUSIONS

The paper presents the main issues identified during the extended use of PRA-Nuc for new non-nuclear applications. The results obtained so far identify some common generic aspects, which need to be considered in order to properly implement and/or extend existing PRA-Nuc models and tools to new applications. More pilot cases and applications are also under way in order to further confirm the results obtained so far and/or to identify new issues to be considered.

REFERENCES

Colli, A. & Serbanescu, D. 2008. PRA-type study adapted to the multi-crystalline silicon photovoltaic cells manufacture process. ESREL 2008, under issue.
Descartes, R. 1637. Discours de la méthode. Paris: Garnier-Flammarion, 1966 edition. (Discourse on Method, 1637.)
Haimes, Y.Y. 2004. Risk Modeling, Assessment and Management, 2nd Edition. New Jersey: Wiley & Sons.
Hansson, S.O. 2000. Myths on risk. Talk at the conference Stockholm thirty years on. Progress achieved and challenges ahead in international environmental co-operation. Swedish Ministry of the Environment, June 17–18, 2000, Royal Institute of Technology, Stockholm.
Howard, R.A. & Matheson, J.E. (eds) 1984. Readings on the Principles and Applications of Decision Analysis, 2 volumes. Menlo Park, CA: Strategic Decisions Group.
Jaynes, E.T. 2003. Probability Theory—The Logic of Science. Cambridge, UK: Cambridge University Press.
Kato, T. 1995. Perturbation Theory for Linear Operators. Germany: Springer Verlag. ISBN 3-540-58661.
Mc Comas, W. 1996. Ten myths of science: Reexamining what we know. School Science & Mathematics, vol. 96, 01-01-1996, p. 10.
Peirce, C.S. 1931. Collected Papers of Charles Sanders Peirce, 8 vols. Edited by Charles Hartshorne, Paul Weiss, Arthur Burks. Cambridge, Massachusetts: Harvard University Press, 1931–1958. http://www.hup.harvard.edu/catalog/PEICOA.html
Serbanescu, D. 2005a. Some insights on issues related to specifics of the use of probability, risk, uncertainty and logic in PRA studies. Int. J. Critical Infrastructures, Vol. 1, Nos. 2/3, 2005.
Serbanescu, D. 2005b. Integrated risk assessment. ICRESH 2005, Bombay, India.
Serbanescu, D. & Kirchsteiger, C. 2007a. Some methodological aspects on a risk informed support for decisions on specific complex systems objectives. ICAP 2007.
Serbanescu, D. 2007b. Risk Informed Decision Making. Lecture presented at the VALDOC Summer School on Risk Issues, Smoegen (Sweden), Karita Research Sweden (organiser), JRC PB/2007/IE/5019.
Serbanescu, D. & Vetere Arellano, A.L. et al. 2008. SES RISK: a new method to support decisions on energy supply. ESREL 2008, under issue.
Smithson, M.J. 2000. Human judgment and imprecise probabilities. Website of the Imprecise Probabilities Project, http://ippserv.rug.ac.be, 1997–2000.
USNRC 1983. PRA Procedures Guide. A guide for the performance of probabilistic risk assessment for nuclear power plants. NUREG/CR-2300, February 1, 1983.
USNRC 1998. Regulatory Guide 1.174. An approach for using PRA in risk informed decisions on plant specific changes to the licensing basis. July 1998.


On the usage of weather derivatives in Austria—An empirical study

M. Bank & R. Wiesner


alpS—Centre for Natural Hazard Management, Innsbruck, Austria, University of Innsbruck,
Department of Banking and Finance, Innsbruck, Austria

ABSTRACT: This paper introduces the fundamental characteristics of weather derivatives and points out the relevant differences in comparison to classical insurance contracts. Above all, this paper presents the results of a survey conducted among Austrian companies which aims at investigating the objectives of weather derivatives usage and at analysing concerns regarding their application. The survey was conducted via face-to-face interviews among 118 firms from different sectors facing significant weather exposure, such as energy and construction companies, beverage producers and baths. As no other survey has put a focus on weather derivatives so far, this paper aims at filling the gap in relevant information regarding weather derivative practices. The results will grant a deeper insight into the risk management practices and the needs of potential customers. This facilitates the development of target-group-specific weather risk management solutions which may enhance the usage of weather derivatives in various industries.
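As a concrete preview of the contract attributes discussed in Section 2.2 below (underlying index, strike level, tick value, capped cash settlement), here is a minimal sketch with invented numbers. The heating-degree-day (HDD) sum used as the underlying is a common temperature index, but all function names and figures are illustrative only, not taken from the survey.

```python
def hdd_index(daily_mean_temps, base=18.0):
    """Aggregate the weather variable into the underlying index (HDD sum)."""
    return sum(max(0.0, base - t) for t in daily_mean_temps)

def call_payoff(index, strike, tick_value, cap=None):
    """Cash settlement: tick_value per index unit above the strike, optionally capped."""
    pay = max(0.0, index - strike) * tick_value
    return min(pay, cap) if cap is not None else pay

temps = [12.0, 15.0, 20.0, 9.0]             # daily mean temperatures over the period
idx = hdd_index(temps)                       # 6 + 3 + 0 + 9 = 18 HDD
print(call_payoff(idx, strike=10.0, tick_value=1000.0, cap=5000.0))  # -> 5000.0
```

Note that the settlement depends only on the index, not on any demonstrated loss: this is exactly the basis-risk trade-off contrasted with insurance contracts in Section 2.3.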

1 INTRODUCTION

Many industries, and accordingly their sales and income, are at risk due to weather fluctuations. It is estimated that nearly 20–30 percent of the U.S. economy is directly affected by the weather (CME 2006). Other estimations assume that 70 percent of all businesses face weather risk in some way (Jain & Foster 2000). Weather fluctuations still cannot be controlled, but with weather derivatives a relatively new financial tool has been created which offers protection against weather-related risks.

Weather derivatives are believed to have substantial market potential in weather-affected industries, but so far research has primarily been focused on pricing and valuation issues. Hence, it is the aim of this paper to investigate reasons and concerns regarding the use of weather derivatives, as this topic was not covered before. However, this is of fundamental interest for the application of these instruments.

The remainder of this paper is structured as follows. Section two provides the theoretical background of weather derivatives and points out the main differences in comparison to classical insurance contracts. The third section focuses on the descriptive analysis of the survey results regarding risk management practices and weather derivatives usage among Austrian companies. The conclusions are presented in section four.

2 THEORETICAL BACKGROUND

2.1 Weather risk and weather exposure

Weather risks refer to weather events which affect the revenues and earnings of a company in a negative way. The type and quantity of the relevant weather parameters depend on the business area and can include events such as temperature, precipitation or humidity (Schirm 2001). These kinds of weather characteristics are called "noncatastrophic events", which typically have a high frequency but low severity. They are contrasted with catastrophe-related low-frequency, high-severity risks, e.g. hurricanes, tornados, etc. Weather derivatives have been developed to facilitate protection against the profit impacts of adverse weather conditions, but not for property or catastrophic risk protection (Clemmons 2002).

Weather risks belong to the category of operational risks. In particular, they refer to the danger of losses due to external events, as weather conditions cannot be influenced by an enterprise. Furthermore, weather risks are considered to be volumetric risks, which can affect both supply as well as demand.

The sensitivity to weather conditions can be defined as weather exposure. It quantifies the dependency of operating figures on adverse weather conditions and is the first condition for a comprehensive weather risk management. The determination of the weather
exposure usually requires a detailed analysis of the company's data and of the potential influencing factors on the company's business success. Weather derivatives then serve as an efficient instrument to reduce weather risks and, respectively, weather exposure, and hence to minimize their potential negative influence on the company's success. A systematic weather risk management process therefore assures competitiveness and presumably causes positive effects (Jewson & Brix 2005):

– Reduces the year-to-year volatility of profits
– Profit smoothing leads to constant year-to-year tax burdens
– Low volatility in profits often reduces refinancing costs
– In a listed company low volatility in profits usually translates into low share price volatility, and less volatile shares are valued more highly
– Low volatility in profits reduces the risk of bankruptcy and financial distress

A sample of weather risks and the corresponding financial risks faced by various industries is given in Table 1.

Table 1. Illustrative links between industries, weather type and financial risks.

Risk holder (weather type): risk
Energy industry (Temperature): Lower sales during warm winters or cool summers
Energy consumers (Temperature): Higher heating/cooling costs during cold winters and hot summers
Beverage producers (Temperature): Lower sales during cool summers
Construction companies (Temperature/Snowfall): Delays in meeting schedules during periods of poor weather
Ski resorts (Snowfall): Lower revenue during winters with below-average snowfall
Agricultural industry (Temperature/Rainfall): Significant crop losses due to extreme temperatures or rainfall
Municipal governments (Snowfall): Higher snow removal costs during winters with above-average snowfall
Road salt companies (Snowfall): Lower revenues during low-snowfall winters
Hydro-electric power generation (Precipitation): Lower revenue during periods of drought

Source: Climetrix 2006.

2.2 Characteristics of weather derivatives

A weather derivative is a financial contract between two parties with a payoff which is derived from the development of an underlying index. Weather derivatives differ from common financial derivatives in that their underlying is not a tradable asset but meteorological variables.

As already noted, weather affects different companies in different ways. In order to hedge these different types of risk, weather derivatives can be based on a variety of weather variables. These include temperature, wind, rain, snow or sunshine hours. The only condition is that the weather variables can be transformed into an index and that they are measured objectively, without the influence of any counterparty (Cao et al. 2003, Dischel 2002).

The pay-off of these financial instruments is derived from a weather index, not from the actual amount of money lost due to adverse weather conditions. Therefore, it is unlikely that the pay-off will compensate the exact amount of money lost. The potential difference between an actual loss and the received pay-off is known as basis risk. Generally, basis risk is reduced when the company's financial loss is highly correlated with the weather and when contracts of optimum size, structure and location are used for hedging (Jewson & Brix 2005).

Typically, a standard weather derivative contract is defined by the following attributes:

– Contract period: defines a start date and an end date, usually a month or a season
– Measurement station: preferably close to the company's location to reduce geographical basis risk
– Weather variable: corresponding to the weather exposure and hedging needs
– Underlying: an index which aggregates the weather variable over the contract period
– Pay-off function: determines the cash-flows of the derivative
– Strike level: the value of the index at which the pay-off changes from zero to a non-zero value
– Tick and tick value: defines how much the pay-off changes per unit of the index

The pay-off functions of weather derivatives rely on the pay-off functions of traditional derivative instruments, e.g. options, collars, straddles or swaps (Hull 2006). Because weather is not a tradable asset, the exercise of weather derivatives always results in cash settlement. In addition, contracts may involve financial limits on the maximum pay-off.

2.3 Weather derivatives in comparison to weather insurance contracts

Traditional insurance companies already offer protection against weather risks. So, is there really a

The main difference between the two instruments is that the holder of an insurance contract has to prove that he actually suffered a financial loss due to adverse weather conditions in order to be compensated. Hence, weather insurance contracts involve two triggers: a weather event plus a verifiable financial loss. If the insured is not able to prove this, he will not receive payments from the contract (Becker & Bracht 1999). In contrast, the payments of weather derivatives rely solely on the weather index value; an actual loss does not have to be demonstrated. Weather derivatives therefore offer the option holder the advantage of prompt compensation, without the risk of a potentially lengthy process of proving the actual loss (Raspé 2002).

In conjunction with this feature, another distinctive criterion has to be considered. The buyer of a weather derivative does not need to have any weather sensitivity or any intention to protect himself against adverse weather conditions, i.e. he does not need to show an insurable interest. Insurance contracts, on the other hand, are based on the concept of insurable interest: a direct relationship between the insured risk and the insurance purchaser has to exist, i.e. a natural exposure is required (Culp 2002). Weather derivatives can also be bought for mere speculation (Alaton et al. 2002).

An additional important feature of weather derivatives compared to insurance is that two counterparties with opposed risk exposures can enter into a contract to hedge each other's risk, e.g. via a swap structure. This is usually not possible in the insurance market (Alaton et al. 2002). Furthermore, swap contracts allow protection at no upfront cost, whereas a premium must always be paid for insurance contracts (Becker & Bracht 1999).

The typical usage of weather derivatives and insurance contracts can be seen as another difference between the two. As stated before, weather derivatives are mainly constructed for protection against high-frequency/low-severity risks, whereas insurance contracts usually refer to risks of an extreme or catastrophic nature with a low probability of occurrence (Alaton et al. 2002). A possible explanation for these typical fields of application lies in the instrument-specific basis risk. The pay-off of weather derivatives shows no dependency on the actual loss, which creates a natural basis risk and implies the risk of insufficient compensation in extreme events. Insurance solutions are preferred in this situation, as they are indemnity-based and do not contain basis risk. In contrast, insurance contracts do not function well for normal, frequent weather risks, as all losses have to be proven, which is time-consuming and costly. For such risks, the straightforward compensation process of weather derivatives justifies the acceptance of basis risk. Furthermore, the existence of basis risk has the beneficial effect that weather derivatives are lower-priced than insurance contracts.

Other aspects in which derivatives and insurance differ include legal, tax and regulatory issues. A comprehensive discussion of these topics is given in Raspé (2002), Edwards (2002), Ali (2004) and Kramer (2006).

To summarize, weather derivatives and insurance contracts show significant differences but should not be seen as exclusively competing concepts. Rather, they should be used as complementary instruments in weather risk management, as both offer application-specific advantages.

3 THE SURVEY

3.1 Research methodology and respondents' profile

The literature lacks relevant information regarding actual weather derivatives usage, so a main objective of the survey was to fill this research gap. It was also considered useful to investigate the broader issues related to this topic; therefore, questions regarding general risk management practices were included in the questionnaire. The following research questions served as a starting point:

– How do companies perceive climatological risks?
– What reasons for use, and what concerns regarding weather derivatives, are named by active users?
– Which are the main factors that hinder companies from using weather derivatives?

The survey was conducted via face-to-face interviews among 118 Austrian firms from different sectors facing significant weather exposure, such as energy and construction companies, beverage producers and baths. The breakdown of respondents per sector is as follows: 37.3% energy, 25.4% baths, 20.3% beverages and 16.9% construction. The sample consists of 3.4% micro-, 26.4% small-, 23.1% medium- and 13.2% large-sized enterprises. Compared to the average distribution of enterprise sizes in Austria, a relatively large part of the sample is made up of medium- and large-sized companies, which is mainly attributable to the energy and construction sectors. The main findings of the survey are highlighted in this paper.

3.2 Weather risk and weather exposure

First, respondents were asked to specify on a five-point scale (1 for no weather risk and 5 for very high weather risk) to what extent their company faces weather risk.
Almost half of the firms (49.2%) stated that they face very high weather risks, and another 13% indicated high weather risks. The results show that the majority of firms in the analyzed sectors are significantly exposed to weather risks. Only 12% of firms said that they face no weather risks at all. Figure 1 presents the distribution of answers by weather risk category.

Figure 1. Weather risk faced by respondents. [Pie chart; reconstructed shares: very high 49%, moderate 21%, high 13%, no 12%, low 5%.]

As the majority of firms face considerable weather risks, we were interested to analyze whether they can quantify their exposure, since only a more detailed knowledge of the weather exposure facilitates an effective risk management process. Therefore, the companies were asked to indicate the percentage of operating revenues at risk due to adverse weather conditions. Only 7% of the respondents stated that their exposure is "unknown", which highlights the importance of weather risk in these business sectors. 26% of the firms have a moderate weather exposure affecting up to 10% of their revenues, and a further 14% state an exposure between 11% and 20% of their revenues. An aggregated 21% face a medium weather exposure affecting up to 40% of their revenues. 11.5% of respondents face a high exposure, with up to 50% of their revenues affected. Finally, a large group of respondents (21%) face a very high exposure, as weather risks threaten more than 50% of their revenues. The high proportion of firms with medium, high and very high exposure indicates that there is definitely a need for sophisticated weather risk management (Fig. 2).

Figure 2. Weather exposure faced by respondents. [Bar chart; share of operating revenues at risk: 1–10%: 26%; 11–20%: 14%; 21–30%: 11%; 31–40%: 10%; 41–50%: 11.5%; >50%: 21%; unknown: 7%.]

3.3 Risk perception and risk management

Given the relatively high exposures across the sample, it seems interesting to analyse whether an institutionalised risk management process exists. This topic was covered in the questionnaire by asking which department or person conducts the risk analysis and how frequently risk analysis is carried out. The results show that in the majority of cases (62%) the management board performs these analyses and that special departments are assigned only to some extent (15%). On the one hand, it can be positive that the executives are in charge of the risk management process, as it signals its importance. On the other hand, specialised departments could focus on and explore the potential risks in more detail than general managers probably do. Therefore, designated departments or employees seem to be beneficial from the company's perspective.

Another question of the survey asked respondents to indicate the frequency at which they analyse their risks. This can serve as an indicator of how important risks are perceived to be and whether a functioning risk management exists. 58% of the respondents to this question stated that they never assess their risks, nearly 12% indicated that they analyse risks at least sometimes, and 30.5% reported regular risk analysis.

These findings are worrying if we consider the significant weather risks and exposures stated by the companies. Therefore, it was also analyzed how the specified exposure corresponds with the regularity of risk analysis. It was expected that the frequency of risk analysis declines with diminishing weather exposure, i.e. that firms with a high exposure perform regular risk assessments, in contrast to low-exposure firms. The relationships between the level of weather exposure and the frequency of risk analysis are displayed in Figure 3.

As expected, a large number of companies with low exposure never perform risk analysis at all (52%); 33% of these respondents indicate regular risk assessments and another 15% state that they analyse risks at least sometimes. These findings are in line with the relatively low exposure. In contrast to this, the frequency of risk analysis does not improve with increasing exposure; instead, the proportion of firms with regular risk assessments even declines. Figure 3 shows that a large share of firms facing medium or very high exposures are not controlling their risks on a regular basis.
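A cross-tabulation of risk-analysis frequency by exposure class, as discussed above, can be produced directly from the raw questionnaire responses. The sketch below uses invented example data, not the survey's raw data:

```python
# Sketch: row-percentage cross-tabulation of exposure class vs.
# frequency of risk analysis. The response pairs are invented examples.
from collections import Counter, defaultdict

responses = [
    ("1-10%", "never"), ("1-10%", "regularly"), ("1-10%", "never"),
    ("21-30%", "sometimes"), ("21-30%", "never"),
    (">50%", "never"), (">50%", "regularly"), (">50%", "never"),
]

# Count answers per exposure class.
counts = defaultdict(Counter)
for exposure, frequency in responses:
    counts[exposure][frequency] += 1

# Convert counts to row percentages, one row per exposure class.
for exposure, freq_counts in counts.items():
    total = sum(freq_counts.values())
    shares = {f: round(100 * n / total) for f, n in freq_counts.items()}
    print(exposure, shares)
```

Each printed row corresponds to one bar of such a figure (shares of "never", "sometimes" and "regularly" within an exposure class).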
Figure 3. Cross table: frequency of risk analysis by weather exposure class. [Row percentages never / sometimes / regularly: 1–10%: 52/15/33; 11–20%: 60/20/20; 21–30%: 55/27/18; 31–40%: 60/10/30; 41–50%: 25/17/58; >50%: 68/0/32.]

Across the exposure classes, 55% to 68% of respondents never perform risk analysis, while the percentage of firms with regular risk assessments remains stable at between 18% and 32%. An exception seems to be the respondents belonging to the exposure class of 41% to 50%: the majority of this group states that it assesses risks regularly (58%) or at least sometimes (17%), and only 25% of these firms do not assess their risk profile.

A reduction of the exposure classes into three categories corresponding to low (1–20%), medium (21–40%) and high (>40%) exposure yields some improvement: 41% of respondents belonging to the high-exposure category indicate that they regularly analyse risks. But still 53% of these firms do not evaluate risks, and the proportion of risk analysis in the low and medium exposure classes remains the same.

Overall, these findings suggest that the majority of firms do not have an institutionalised risk management, as a significant percentage of respondents do not analyse their risks at all despite a considerable weather exposure. Unexpectedly, the results remain relatively stable across exposure classes, meaning that the majority of firms with high weather exposure also do not evaluate the riskiness of their business regularly. Accordingly, the results imply that firms underestimate or misinterpret their actual weather exposure and are presumably not able to control their weather-related risk efficiently.

3.4 Reasons and concerns regarding the use of weather derivatives

One objective of this study was to gain a deeper insight into the use of weather derivatives. The survey asked firms to indicate the importance of different factors in the decision to use weather derivatives. Eleven reasons were listed, and users were asked to rank them on a five-point scale (1 for not important and 5 for very important).

Three respondents actually use weather derivatives, two belonging to the energy and one to the construction sector. The small number of users was expected, as weather derivatives are a relatively new financial hedging tool. Besides that, today's weather derivative contracts are usually directed at large firms, as contract volumes need to be relatively high, given the fixed costs incurred in transactions, to allow reasonable premiums.

The respondents' primary reason to use weather derivatives is earnings stabilization: two respondents rank this factor as very important and one respondent as important. Another important factor in the decision to use weather derivatives is that no proof of an actual loss is required to receive a payout. Accordingly, the transparent and unproblematic settlement of weather derivatives is named as a major factor, as compensation relies only on the development of objective weather parameters. Moreover, respondents indicate that traditional insurance products do not offer comparable protection against weather-related risks or are not available.

The results show that classical risk management aspects, such as the reduction of earnings volatility, are the primary reasons for the application of weather derivatives. Besides that, instrument-specific advantages over traditional insurance are also perceived as important. Respondents also indicate that weather derivatives are not used for mere speculation. On the contrary, users of weather derivatives seem to be aware of their exposure and are searching for specific hedging tools to protect their profits.

The major concerns regarding the use of weather derivatives were also ascertained, to understand possible reasons for their limited use. A list of eleven possible concerns was presented in the questionnaire, to be ranked by respondents again on a five-point scale (1 for no concern and 5 for very high concern).

Tax, legal and accounting issues appear to be the companies' major concerns, as all users indicate very important, important or at least moderate concern. This can be attributed to the fact that weather derivatives are relatively new financial instruments, which still results in some uncertainty regarding their accounting and legal treatment (Mojuyé 2007). Difficulty in quantifying the firm's exposure and a general lack of knowledge of derivatives are also major concerns for the respondents. Further, difficulties in evaluating the hedge results are indicated. These results clearly suggest that, where companies have concerns regarding weather derivatives, this arises to a great extent from a lack of expertise.

Also, the potentially negative image of derivatives is a serious concern for users. Weather derivatives are often presented in the media as bets or gambling against the weather instead of focusing on their role in risk management.
This kind of press coverage could also contribute to the fact that potential end-users are reluctant to apply weather derivatives.

Interestingly, basis risk is not regarded as overly important by respondents: one company ranks it as important, while the others state moderate or no importance. These results are to some extent in contradiction to expectations, as the literature often names basis risk as a major problem in the application of weather derivatives. But as these respondents are actually using weather derivatives, basis risk either seems acceptable to them or they simply have not had any problems regarding this issue.

Overall, the results imply that lack of knowledge is the firms' major concern in the use of weather derivatives.

3.5 Reasons for not using weather derivatives

In the following, the analysis will focus on non-users and the reasons why they choose not to apply weather derivatives. To this end, non-users were asked to rank the importance of different factors in the decision against the usage of weather derivatives. A total of 16 reasons were listed and ranked by the respondents on a five-point scale (1 for not important and 5 for very important). The responses to this question are shown in Figure 4.

Figure 4. Reasons for not using weather derivatives. [Bar chart of mean, median and mode ratings per reason; means range from 1.95 to 3.92 across reasons such as "never considered using weather derivatives", "no benefits expected", "weather derivatives unknown", "lack of expertise in derivatives", "instrument does not fit firm's needs", "difficulty of quantifying the firm's exposure" and "difficulty of pricing and valuing derivatives".]

The most important reason for the decision against weather derivatives is that the respondents simply never considered using them (54% of answers in the category very important / mean 3.92 / mode 5). This is closely followed by the reason that firms do not know weather derivatives (45% very important / mean 3.56 / mode 5); in total, 70–80% of respondents rank this reason as moderate, important or very important. These findings suggest that the majority of potential users simply do not know that an instrument for hedging weather-related risks exists.

Respondents also indicate to a large proportion that they do not expect benefits from weather derivatives (44% very important / mean 3.62 / mode 5). The perception of an insufficient benefit may also be attributed to the limited knowledge of weather derivatives. This seems to be confirmed, as lack of expertise in derivatives is ranked in fourth position (39% very important / mean 3.52 / mode 5). Another concern receiving some attention was that weather derivatives may not fit the firm's needs (mean 3.21 / mode 5).

Difficulties in quantifying the weather exposure and in valuing weather derivatives were also ranked as relatively important; both are related to limited expertise. Some respondents also indicate that they potentially face too small a weather exposure.

All other reasons are perceived as relatively unimportant by respondents, given means below 3. Also, the reason "image of derivatives", which was important for weather derivative users, takes a subordinate position: a potentially negative image of weather derivatives or their promotion as bets on the weather seems to have no major influence on the decision not to use them.

In summary, the results indicate that the majority of companies are not aware of weather derivatives as a risk management tool, as they either do not know the instrument or never considered using it. Another important factor seems to be a general lack of knowledge regarding derivatives and their application; this concerns the instrument itself but also areas such as the quantification of weather exposure. Because of this lack of knowledge, companies also seem to be very sceptical in appreciating the potential benefits of weather derivatives.

The findings imply that a few predominant reasons are mainly responsible for the decision not to use weather derivatives. Therefore, a factor analysis was conducted to identify the significant factors; factors which strongly correlate with each other are reduced to one factor. A Kaiser-Meyer-Olkin test yields 0.84, which is above the eligible level of 0.8 and therefore qualifies the sample as suitable for factor analysis.
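The Kaiser-Meyer-Olkin statistic mentioned above compares the squared correlations between items to the squared partial correlations (obtained from the inverse correlation matrix). A sketch of this computation, using randomly generated illustrative data rather than the survey responses:

```python
# Sketch of the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy.
# Data are simulated (two latent factors driving six items), not the survey's.
import numpy as np

def kmo(data):
    """KMO = sum of squared correlations /
    (sum of squared correlations + sum of squared partial correlations),
    both taken over the off-diagonal elements."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations from the inverse correlation matrix.
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d
    off = ~np.eye(corr.shape[0], dtype=bool)
    r2 = (corr[off] ** 2).sum()
    p2 = (partial[off] ** 2).sum()
    return r2 / (r2 + p2)

rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))          # two latent factors
loadings = rng.normal(size=(2, 6))           # six observed items
items = factors @ loadings + 0.5 * rng.normal(size=(200, 6))
print(round(kmo(items), 2))                  # values above 0.8 suggest adequacy
```

With strongly correlated items (as here, where a few latent factors drive all responses), the statistic approaches 1; near-independent items push it towards 0.5 and below.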
Table 2. Rotated component matrix (factor loadings below 0.4 are not displayed).

                                                    Component
                                                    1      2      3
l  Difficulty of evaluating hedge results           .902
m  Uncertainty about accounting treatment           .872
n  Uncertainty about tax and legal treatment        .869
k  Difficulty of pricing and valuing derivatives    .833
o  Concerns about perception of derivative use      .678
j  Difficulty of quantifying the firm's exposure    .551   .455
d  No benefits expected                                    .742
e  Instrument does not fit firm's needs                    .716
a  Insufficient weather exposure                           .662
p  Company policy not to use derivatives                   .641
f  Exposure effectively managed by other means             .639
h  Costs of hedging exceed the expected benefits    .516   .616
i  Lack of expertise in derivatives                               .820
c  Never considered using weather derivatives                     .752
b  Weather derivatives unknown                                    .743

The factor analysis yields three significant factors, which together explain up to 66% of the total variance. Using the factor loadings given in the rotated component matrix, the questions can be assigned to the three extracted factors. The factor loadings shown in Table 2 are already sorted for easier interpretation, and loadings below 0.4 are not displayed.

Factor 1 appears to be related to "lack of knowledge" (l, m, n, k, j and, less so, o), as it mainly comprises reasons in the context of pricing, valuation, hedging and taxation uncertainties. Factor 2 could be named "no benefits expected" (d, e, f, h, a and, less so, p), as the reasons it contains focus on the fact that there might be better alternatives or that weather derivatives do not fit the firms' needs. Finally, factor 3 includes the questions regarding unawareness of weather derivatives in general (c, b, i), which leads to its designation as "instrument unknown".

The factor analysis thus indicates that the decision not to use weather derivatives can be mainly attributed to three factors: "lack of knowledge in derivatives", "no benefits expected" and "instrument unknown". The identified factors seem plausible given the preceding findings. As weather derivatives are relatively new instruments and many firms have never had experience with derivatives before, they lack expertise in evaluating and handling these instruments. Additionally, as nearly 60% of the companies do not analyze their risks on a regular basis, it seems reasonable that they are not able to quantify their exposure correctly. Further, they probably cannot appraise the potential benefits of weather derivatives, as they hardly know the instruments. This leads to the third factor, namely that weather derivatives are widely unknown. It is entirely possible that factors 1 and 2 also arise from a general unawareness of hedging with weather derivatives. Therefore, lack of knowledge regarding weather derivatives and their application seems to be the most important reason for not using them so far.

Taking this into consideration, weather derivatives should be actively promoted to the public and to potential end users to increase the awareness level. Furthermore, the functioning and characteristics of weather derivatives as a risk management tool should be highlighted to demonstrate their benefits and to build up some basic expertise within the firms. The promotion of risk analysis may also contribute to the success of weather derivatives, as firms become aware of so far neglected weather risks. Finally, if potential users know the instrument and its role in weather risk management, they will be more likely to apply it.

3.6 Potential usage of weather risk management instruments

Given the low number of active weather derivative users, we also asked the respondents to indicate whether they can imagine using weather derivatives or comparable instruments in the future. The results show that roughly one quarter (26.7%) of firms are generally willing to apply weather derivatives. Given the actual state, this proportion indicates a significant market potential for weather derivatives.

This question demonstrates that many respondents are open-minded about weather risk management with weather derivatives. Of course, it has to be investigated which firms can ultimately use these instruments, as basis risk and large contract sizes may have unfavourable impacts. The high cost of small weather derivative transactions could be reduced via bundling schemes, in which different companies located in one region share a derivative contract to reduce the costs and to achieve reasonable contract sizes.

Another possibility is the integration of weather derivatives into loans. This structure enables a company to enter into a loan agreement with a higher interest rate that already includes the weather derivative premium, which the bank pays to the counterparty. In case of an adverse weather event, the company pays only a fraction, or nothing, of the usual loan due, thus receiving financial relief in an economically critical situation.
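The weather-indexed loan mechanics described above can be sketched as follows; all amounts, rates, triggers and relief fractions are assumed purely for illustration:

```python
# Sketch of a weather-indexed loan: the borrower pays a higher interest
# rate that embeds the derivative premium, and part of the payment is
# waived after an adverse weather event. All numbers are assumed examples.

def loan_due(principal, base_rate, premium_rate, snow_days, trigger,
             relief_fraction):
    """Annual amount due: interest includes the embedded premium;
    if the adverse-weather trigger is hit, part of the payment is waived."""
    due = principal * (base_rate + premium_rate)
    if snow_days >= trigger:            # adverse weather event occurred
        due *= (1.0 - relief_fraction)  # the bank's hedge pays out instead
    return due

# Normal year: full payment at the loaded rate.
print(round(loan_due(1_000_000, 0.05, 0.01, snow_days=10, trigger=30,
                     relief_fraction=0.8)))   # 60000
# Extreme winter: most of the payment is waived.
print(round(loan_due(1_000_000, 0.05, 0.01, snow_days=35, trigger=30,
                     relief_fraction=0.8)))   # 12000
```

The bank recovers the embedded premium in normal years, while in adverse years the pay-out of its hedge with the derivative counterparty replaces the waived loan payment.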
Moreover, weather-indexed loans would be less likely to default, which is also favourable for the bank itself, as it strengthens the bank's portfolio and risk profile (Hess & Syroka 2005). Weather-indexed loans seem especially applicable for SMEs, as these are financed to a large extent by loans. This facilitates access for banks and offers considerable cross-selling potential. The potential use of a weather-linked credit by companies was tested in question 24.

The results show that 17% of respondents can imagine using a weather-indexed loan. In comparison to question 23, it does not seem as attractive to potential users as weather derivatives. On the one hand, this could be attributed to the more complex product structure; on the other hand, potential users could simply prefer a stand-alone product instead of buying some sort of bundled scheme. Further, the comments of respondents on questions 23 and 24 indicate that there is a general interest in weather derivatives but that additional information is requested. This highlights again that instrument awareness has to be improved and the lack of knowledge reduced.

4 CONCLUSION

The main objectives of this work were to give an introduction to weather derivatives and to investigate weather derivative practices. A survey conducted among 118 Austrian companies confirmed that the respondents face a significant weather exposure and that they are also to a large extent aware of it. On the other hand, the survey revealed that the majority of firms lack a sophisticated risk management concept, as they hardly control their risks on a regular basis. This seems worrying given their general knowledge of their exposure.

Respondents already using weather derivatives indicated that they generally consider them a useful tool in weather risk management. Primarily, weather derivatives are applied for earnings stabilization. Instrument-specific advantages, such as the absence of damage proofs and the prompt compensation payments, are emphasized. Typical concerns regarding the usage of weather derivatives are pricing, valuation, tax and legal issues. Further, the image of weather derivatives is perceived relatively critically. The study implies that even companies already using weather derivatives may need some additional assistance in the weather risk management process.

The relevance of different reasons in companies' decisions not to use weather derivatives was also investigated. A factor analysis on the survey results produced evidence that the reluctance of firms to use weather derivatives arises to a great extent from a "lack of knowledge" regarding derivatives. Furthermore, the factors "no benefits expected" and "instrument unknown" seem to restrict the usage of weather derivatives.

Considering these findings, the main challenge will be to increase the acceptance, awareness and knowledge of weather risk management and the corresponding tools. The results show that a significant proportion of companies not using weather derivatives so far may begin to use them as knowledge of these instruments increases.

Finally, the survey confirmed that many firms are interested in weather risk management, but the process is still in its infancy. Firms should put more emphasis on investigating their weather exposure and potential hedging measures to reduce the impact of adverse weather conditions on operating figures. Weather derivatives in particular offer the possibility to manage weather risks actively and flexibly. On the other hand, today's large contract sizes mainly hinder SMEs from applying weather derivatives. Hence, future research will focus on contract structuring issues, such as bundling schemes or weather-indexed loans, to facilitate the use of weather derivatives.

REFERENCES

Ali, P.U. 2004. The Legal Characterization of Weather Derivatives. Journal of Alternative Investments 7(2): 75–79.
Becker, H.A. & Bracht, A. 1999. Katastrophen- und Wetterderivate: Finanzinnovationen auf der Basis von Naturkatastrophen und Wettererscheinungen. Wien.
Cao, M. et al. 2003. Weather derivatives: A new class of financial instruments. www.rotman.utoronto.ca/∼wei/research/JAI.pdf, 16.06.2007.
Clemmons, L. 2002. Introduction to Weather Risk Management. In Banks, E. (ed.), Weather risk management: markets, products, and applications: 3–13. Basingstoke.
Climetrix 2006. Climetrix—Weather Derivatives Software. http://www.climetrix.com, 15.08.2006.
CME 2006. CME Weather Products. http://www.cme.com/trading/prd/weather/index14270.html, 09.08.2006.
Culp, C.L. 2002. The ART of risk management. New York.
Dischel, R.S. 2002. Climate Risk and the Weather Market. London.
Edwards, S. 2002. Accounting and Tax Treatment. In Banks, E. (ed.), Weather risk management: markets, products, and applications: 246–261. Basingstoke.
Hess, U. & Syroka, J. 2005. Weather-based insurance in Southern Africa. The International Bank for Reconstruction and Development, Washington.
Hull, J. 2006. Options, futures, & other derivatives. Upper Saddle River, NJ.
Jain, G. & Foster, D. 2000. Weather Risk—Beyond Energy. Weather Risk supplement to Risk magazine, August 2000. http://www.financewise.com/public/edit/energy/weather00/wthr00-beyondenergy.htm, 10.08.2006.
Jewson, S. & Brix, A. 2005. Weather derivative valuation: the meteorological, statistical, financial and mathematical foundations. Cambridge.
Kramer, A.S. 2006. Critical Distinctions between Weather Derivatives and Insurance. In Culp, C.L. (ed.), Structured finance and insurance: the ART of managing capital and risk: 639–652. Hoboken.
Mojuyé, B. 2007. Does the hot dog vendor lose out? International Financial Law Review 26(11): 27–29.
Raspé, A. 2002. Legal and Regulatory Issues. In Banks, E. (ed.), Weather risk management: markets, products, and applications: 224–245. Basingstoke.
Schirm, A. 2001. Wetterderivate: Einsatzmöglichkeiten und Bewertung. Working Paper, University of Mannheim.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Precaution in practice? The case of nanomaterial industry

H. Kastenholz & A. Helland


Technology and Society Lab, Empa, St. Gallen, Switzerland

M. Siegrist
Institute of Environmental Decisions, Consumer Behavior, ETH, Zurich, Switzerland

ABSTRACT: Nanoparticulate materials (NPM) pose many new, not yet completely answered questions for risk assessment, and concerns have been raised about their potential toxicity and life-cycle impacts. Voluntary industry initiatives have often been proposed as one of the most promising ways to reduce potential negative impacts of nanomaterials on human health and the environment. We present a study whose purpose was to investigate how the NPM industry in general perceives precaution, responsibility and regulation, how it approaches risk assessment in terms of internal procedures, and how it assesses its own performance. The survey shows that industry does not convey a clear opinion on responsibility and regulatory action, and that the majority of companies do not have standardized procedures for changes in production technology, input substitution, process redesign, and final product reformulation as a result of a risk assessment. Nevertheless, a clear majority of the companies found their existing routines regarding these procedures to be sufficient.

1 INTRODUCTION

Nanotechnology—the art and science of manipulating matter at the nanoscale to create new and unique materials and products—is a recently emerging and rapidly growing field whose dynamics and prospects pose many challenges to scientists and engineers as well as to society at large. This technology has a major potential to generate new products with numerous benefits. There are already many products containing nanoparticulate materials on the market today, and there are expectations for more applications ranging from lightweight materials, drug-delivery systems, and catalytic converters to food, cosmetics and leisure products.

However, the unique properties of nanomaterials and the rapid development of nanomaterial-based products have also raised many concerns over their consequences for human and environmental health (Nowack & Bucheli 2007; Nel et al. 2006; Oberdörster et al. 2005). Especially the impact of intentionally produced nanomaterials with targeted properties fulfilling specific functions has recently been under discussion. It has been shown that some nanomaterials may have damage potential if humans or the environment are exposed to them. Against this backdrop, different stakeholders have called for action to ensure the workplace, consumer and environmental safety of NPM production and products (Helland et al. 2006).

NPM may fall under different regulations depending on the application, but the regulations are currently found to be inadequate in dealing with the novel properties of NPM (Davis 2007). Therefore there is an ongoing discussion regarding assessing and managing the risks derived from NPM properties, the methodological challenges involved, and the data needed for conducting such risk assessments (EPA 2007; Morgan 2005). However, given that NPM may cause harm and that there are currently no regulations that take the specific properties of NPM into account, the responsibility for safe production and products is mostly left with industry. Risk assessment procedures and precautionary measures initiated by industry are therefore vital to managing the environmental health and safety of nanomaterials (Som et al. 2004). These aspects have been reflected in previous cases involving other emerging technologies as well.

The objectives of this paper were to investigate how the NPM industry in general perceives precaution, responsibility and regulation, how they approach risk assessment in terms of internal procedures, and how they assess their own performance. To this end, we first introduce the survey methodology. Secondly, the results of the survey are presented, and finally we discuss the implications of the results.

2 METHODS

In order to investigate how companies deal with the risks of nanomaterials, we conducted a written survey.

The data were collected in Germany and Switzerland between December 2005 and February 2006. The sample consisted of a total of 135 companies, 48 of them from Switzerland and 87 from Germany. The companies were identified through websites, literature reviews and personal contacts. A prerequisite for company selection was that companies should have NPM-based products available on the market. A total of 40 companies filled out the questionnaire, which represents a response rate of 29.6%. Before sending out the questionnaire, we contacted each company by phone and requested the person in charge of risk assessment procedures to fill out the questions.

The two largest industrial sectors were "chemicals and materials" and "consumer goods", and the most common application fields for NPM within these industrial sectors were coatings and thin films for different materials (e.g. glass, wood and textile), medical applications and electronic products. Twenty-five companies had fewer than 100 employees, 8 companies had between 100 and 1000 employees, 6 companies had more than 1000 employees and 1 company did not answer this question. Fourteen companies reported that they were "primary producers" of NPM, 21 companies were "downstream users" working with NPM purchased for their applications, 2 companies produced and purchased NPM for their applications, and 3 companies did not answer this question. In order to address the objectives we developed a number of questions regarding (1) industrial interpretations of the precautionary principle, life cycle responsibility and regulation, (2) industrial procedures, and (3) industry's own assessment of these respective procedures and areas.

3 RESULTS

Regarding the industrial interpretations of the precautionary principle, a clear majority of the responders found that all emissions should be kept As Low As Reasonably Achievable (ALARA) and that measures should be taken if specific criteria of potential irreversibility are fulfilled. However, no majority opinion was found regarding whether the burden of proof should be on the proposing actor. A principal component analysis identified only one factor, which explained a total variance of approximately 87.55%, and one can therefore conclude that the respondents answered all the questions in a consistent manner.

In the production phase of the life cycle, most of the companies felt responsible for potential environmental health impacts that may occur, whereas 2 thought this responsibility should be shared with the government. In the use phase, 24 companies opined that the responsibility should be borne mainly by industry, whereas only 8 thought the government or the consumer should be responsible and 4 thought it should be shared with other actors. In the end-of-life stage, the industrial opinion is divided: 17 companies argued that the responsibility should be taken by the industry alone, 10 thought the government or the consumer should be responsible, whereas 9 thought it should be shared with other actors. The responsibility can thus be seen as increasingly externalized throughout the life cycle stages.

In research and development, 19 companies considered that no regulations are needed, whereas 17 found that industrial standards should be established and 2 companies preferred governmental standards. In the production stage a clear majority of the companies found that NPM should be regulated by industrial standards, whereas in the usage stage and the end-of-life stage the company opinions were divided as to how best to regulate this area. In general, the industry can therefore be seen as divided on whether industrial standards or governmental regulations comprise the most appropriate form of regulation.

We asked: "If risk assessment reveals a lack of knowledge and there is a possibility of harmful effects, does your company have standardized criteria or procedures for a) raw material substitution b) final product reformulation c) process change?" The majority, 21 companies, did not indicate any standardized procedures following a risk assessment. Eleven companies were found to be promoting risk research, whereas 21 companies, a majority, did not promote any such research. Furthermore, a total of 15 companies had procedures involving different stakeholder concerns in product development (e.g. public consultations, hearings, scenario analysis).

In general, companies were quite satisfied with their current performance. All companies found that acquisition of the best available information was important, although the routines used for it could be improved. However, it was particularly in terms of promoting risk research, sharing knowledge (such as safety knowledge) with other organizations, and risk assessment methodology that the companies saw the greatest potential for improvement.

4 DISCUSSION

Industries perceive themselves as clearly responsible for potential impacts on human health and the environment in the research, development and production stages, but this responsibility is gradually being externalized to other stakeholders throughout the life cycle. This clear acknowledgement of industrial responsibility is in sharp contrast to a less uniform perception of regulation, where 38 companies wanted industrial standards or no regulations in the production stage, 25 companies wanted the same in the usage phase and 18 companies in the end-of-life stage. The

combination of increasing industrial externalization of responsibility and regulation throughout the life cycle may be problematic, as current regulations generally do not cover NPM. Thus we may have a situation where there is a vacuum concerning the question of who should monitor that NPM are being developed and used in a safe manner.

The prevention of possible harm arising from production or products by eliminating problems at the source may involve changes in production technology, input substitution, process redesign and re-engineering, and final product design and reformulation. About two thirds of the companies did not have such procedures in place, but at the same time most companies assessed their existing routines regarding these respective procedures and found them to be sufficient, with only a few companies thinking these procedures could be improved. This may imply that there are no strong incentives for considering alternative technology options, and thus a business-as-usual scenario seems the most likely path that industry would choose in this respect. One exception was the procedure of risk assessment. It may be argued that companies need assistance in this respect, as a majority of companies do not have risk assessment procedures in place but do express a wish to improve their performance. It is obvious that the current lack of established risk assessment and management frameworks for NPM is problematic for the companies (Reijnders 2006).

The awareness among key actors in a company (managers, engineers, researchers, safety experts, product designers etc.) is heavily influenced by social factors such as communication and cooperation with other stakeholders, in addition to the existing corporate culture and the core values of the company (Ashford & Zwetsloot 2000). Increasing awareness among companies' key actors through training and outreach activities can therefore comprise effective measures to improve the safety culture of the companies. The surrounding network of agents may therefore provide important contributions to industry for governing risks, but nonetheless greater industry contributions to the public knowledge base on NPM have been called for (Helland & Kastenholz 2007). We may conclude from our results that most companies have an established network in which risk information is exchanged. This information exchange usually takes place among companies and between companies and universities, although very few actively involve themselves in funding university research.

The majority of companies also found that information exchange was an area in which there was room for improvement. A further aspect resulting from insufficient communication with external stakeholders may be the consumer response. Nanotechnology experts may not be inclined to initiate the risk assessment expected by the public, as they perceive fewer risks associated with nanotechnology than lay people do (Siegrist et al. 2007a; Siegrist et al. 2007b).

5 CONCLUSIONS

Do the companies have to be pushed or pulled to improve their precautionary measures, or is the current situation satisfactory? In the case of NPM, industries find that regulation should be evidence-based, but the fate of NPM throughout the life cycle receives little industrial attention (Helland et al. 2008). Thus, we have a situation of interpreting whether the state of scientific evidence warrants regulatory action. We have shown that the industrial opinion is not clear on how to interpret the state of evidence, but that two things do stand out: emissions should be kept as low as possible, and specific characteristics of irreversibility are reason enough for taking precautionary measures. However, who should monitor and demonstrate such specific characteristics, as industry does not necessarily see that as its responsibility? How could a common approach be defined in order to create standards of risk assessment and other quality standards?

Building on the consensus opinion of the industry, one may draw two specific priorities for governmental bodies to investigate in collaboration with industry:

– What are the possible sources of NPM emissions throughout the product life cycle, how high are such emissions and what is the environmental fate of such emissions?
– What NPM characteristics may serve as signs of potential irreversibility for ecosystemic interactions (e.g. persistency, bioaccumulation, etc.) and which NPM specifically exhibit such characteristics?

REFERENCES

Ashford, N.A. & Zwetsloot, G. 2000. Encouraging inherently safer production in European firms: a report from the field. Journal of Hazardous Materials, 78: 123–144.
Davis, J.M. 2007. How to assess the risks of nanotechnology: learning from past experience. Journal of Nanoscience and Nanotechnology, 7: 402–409.
Helland, A., Kastenholz, H., Thidell, A., Arnfalk, P. & Deppert, K. 2006. Nanoparticulate materials and regulatory policy in Europe: An analysis of stakeholder perspectives. Journal of Nanoparticle Research, 8: 709–719.
Helland, A., Scheringer, M., Siegrist, M., Kastenholz, H., Wiek, A. & Scholz, R.W. 2008. Risk assessment of engineered nanomaterials—Survey of industrial approaches. Environmental Science & Technology, 42(2): 640–646.
Nowack, B. & Bucheli, T.D. 2007. Occurrence, behavior and effects of nanoparticles in the environment. Environmental Pollution, 150: 5–22.

Helland, A. & Kastenholz, H. 2007. Development of nanotechnology in light of sustainability. Journal of Cleaner Production, online: doi:10.1016/j.jclepro.2007.04.006.
Morgan, K. 2005. Development of a preliminary framework for informing the risk analysis and risk management of nanoparticles. Risk Analysis, 25(6): 1621–1635.
Nel, A., Xia, T., Mädler, L. & Li, N. 2006. Toxic potential of materials at the nanolevel. Science, 311: 622–627.
Oberdörster, G., Oberdörster, E. & Oberdörster, J. 2005. Nanotoxicology: An emerging discipline evolving from studies of ultrafine particles. Environmental Health Perspectives, 113(7): 823–839.
Reijnders, L. 2006. Cleaner nanotechnology and hazard reduction of manufactured nanoparticles. Journal of Cleaner Production, 14: 124–133.
Siegrist, M., Wiek, A., Helland, A. & Kastenholz, H. 2007a. Risks and nanotechnology: the public is more concerned than experts and industry. Nature Nanotechnology, 2: 67.
Siegrist, M., Keller, C., Kastenholz, H., Frey, S. & Wiek, A. 2007b. Lay people's and experts' perception of nanotechnology hazards. Risk Analysis, 27(1): 59.
Som, C., Hilty, L.M. & Ruddy, T.F. 2004. The precautionary principle in the information society. Human and Ecological Risk Assessment, 10(5): 787–799.
U.S. EPA 2007. Nanotechnology White Paper. U.S. Environmental Protection Agency: Washington, DC.


Risk based maintenance prioritisation

Gunhild Birkeland & Siegfried Eisinger


DNV, Høvik, Norway

Terje Aven
University of Stavanger, Stavanger, Norway

ABSTRACT: Maintenance prioritisation is about management of all pending maintenance activities of a
system. Due to lack of spare parts, personnel or access to the component to be maintained, the maintenance
activities are often delayed or deferred. This creates a maintenance backlog. To handle this backlog, prioritisation
of the maintenance work is required, and a common approach for doing this is to introduce a criticality measure
based on the consequences of failures. However, the adequacy of this approach can be questioned as it does not
reflect risk. This paper proposes a new prioritisation method meeting this critique. The method is based on the
use of SIL (Safety Integrity Levels) requirements. The method also produces risk monitoring indices useful for
management and authorities.

1 INTRODUCTION

On any industrial plant, the preventive and corrective maintenance activities must be managed and adapted to the available resources. It is necessary to keep a to-do list of maintenance activities and to prioritise at any time which tasks are to be performed first. The prioritisation is affected by external constraints like the availability of spare parts and access to the component to be maintained.

It is observed that maintenance prioritisation on many industrial plants is currently performed through a predetermined equipment criticality list which, in combination with failure impact, sets the priority of a given maintenance activity. Based on this priority, the maximum time until the maintenance activity shall be performed is set. Criticality only takes account of the consequences of a failure (see NORSOK (2001)). Since the frequency part is not taken into account, the approach is not risk based. It is believed that a fully risk based approach will lead to better prioritisation decisions and better monitoring of the risk associated with delayed maintenance. Risk indices can be defined for monitoring the risk level associated with the backlog.

In this paper a risk based prioritisation method is described. The method is based on SIL (Safety Integrity Levels), which refer to the reliability of safety functions. Simplicity is emphasised, in order to ensure that the method can and will actually be used by operators on industrial plants.

To test the developed methodology, a pilot case study was run on a part of a plant. The main results from this case study are reported in the paper.

2 THE METHOD

In the Oil & Gas industry, a risk evaluation and acceptance scheme has been established on component level. This scheme is related to IEC 61508 and IEC 61511 (see IEC (2000) and IEC (2003)). Since 2001, all new electronic safety-related systems on Norwegian platforms must comply with IEC 61508 (see also OLF (2001)). IEC 61508 only relates to electronic safety-related systems, but the concept of the Safety Integrity Level (SIL) within IEC 61508 can be used in an analogous way also for non-electronic systems. The SIL defines a maximum allowable unreliability of a safety function, see Table 1.

Table 1. Relation between SIL and Probability of Failure on Demand (PFD) (low demand mode) as given in IEC (2000).

SIL    PFD for low demand mode of operation
4      10^-5 – 10^-4
3      10^-4 – 10^-3
2      10^-3 – 10^-2
1      10^-2 – 10^-1
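Table 1 translates directly into data. The following Python sketch is our own illustration, not code from the paper: the dictionary name and helper function are assumptions, and the period T is taken in days (365 days per year), matching the downtime figures quoted in Section 2 for the SIL 2 band.

```python
# Illustrative encoding of Table 1 (assumed names, not from the paper):
# each SIL maps to a (lower, upper) PFD interval for low demand mode.
SIL_PFD_BANDS = {
    4: (1e-5, 1e-4),
    3: (1e-4, 1e-3),
    2: (1e-3, 1e-2),
    1: (1e-2, 1e-1),
}

def expected_downtime_days(p, T_days=365.0):
    """Expected downtime p * T over a period of T days (here one year)."""
    return p * T_days

# The SIL 2 interval limits correspond to 0.365 and 3.65 days/year:
lo, hi = SIL_PFD_BANDS[2]
print(round(expected_downtime_days(lo), 3), round(expected_downtime_days(hi), 2))
# -> 0.365 3.65
```

With the simplification p <= 0.005 used later in the text, the same helper gives the quoted expected downtime of about 1.8 days per year.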

If the safety function reliability is below the SIL value, the total system risk might become unacceptable. The SIL is thus a risk based measure and is as such a good candidate for the evaluation of risk related to deferred maintenance. Note however that if a safety function is less reliable than required according to the SIL, the exact increase of total system risk cannot be directly inferred.

The SIL value represents an established risk based measure for the maximum allowable unreliability of a safety function and is used as a basis for the following risk based prioritisation method.

Now suppose a safety function has failed. The problem is to determine the maximum time dm before the maintenance activity for this unit should start (referred to as the maximum delay time). Suppose the unit belongs to SIL category 2, i.e. the probability of failure on demand (PFD), p, should be between 10^-3 and 10^-2. To simplify, we require p ≤ 0.005. For a specific period of time (T), say one year, this corresponds to an expected downtime of 1.8 days. The interval limits 10^-3 and 10^-2 correspond to 0.365 days/year and 3.65 days/year, respectively. To determine the maximum acceptable delay time, we need to define what is "small" compared to 1.8 days. The actual delay time should not significantly increase this downtime. Obviously, if the maximum delay time is a factor 10 higher than 1.8 days, i.e. 18 days, it is too large. A factor 5 is smaller, but it still gives a major change in the total downtime. We see that we have to reduce the factor to about 1 in order not to significantly change the downtime, and the proposed method uses this number as a criterion for determining the maximum delay time.

We formalise this by introducing the delay time factor D, expressing the ratio between the additional delay time d due to deferred maintenance and the expected downtime due to the SIL requirement, i.e.

D = d / (p · T)    (1)

The criterion used to determine the maximum delay time can then be written

D ≤ 1,    (2)

i.e. dm ≤ p · T.

Next we address the case of redundant safety systems, i.e. safety functions consisting of two or more components in parallel such that one failure does not immediately cause the safety function to fail. It is observed that the dominant reason for failures of redundant safety functions is Common Cause Failures (CCF). To incorporate these failures into the method, we use a modifying factor β, the so-called β-factor (see IEC (2000) and IEC (2003)). Typical values for β range from 1% to 10%. For the present study, a β-factor of 10% was chosen and the degree of redundancy is taken into account as β0^r, where β0 = 10% and r is the degree of redundancy. For example, r = 1 means that the safety function tolerates one failure, and β0^r = 0.1. It follows that the expected additional delay time due to the deferred maintenance equals d · β0^r, and hence Equation (1) can be written

D = (d / (p · T)) · β0^r    (3)

A system might for example include a safety function rated at SIL 2 with a redundancy of 1. The function contains two sensors which can both detect a high pressure. Over one year the delayed maintenance of the sensors has amounted to 8.7 days. Using again p = 0.005, the delay factor becomes D = 0.48. The maintenance delay should be acceptable.

As will be demonstrated in Section 3, observed delay time factors Di* can be readily calculated from maintenance data for all safety functions i. Based on these we can produce overall indices and diagrams providing the management and the authorities with a picture of the maintenance backlog status. These indices and diagrams can be used as a basis for concluding on whether the maintenance of safety functions is going reasonably well or whether safety is 'eroding away'. An example of an approach for visualising the status is the following: calculate the delay time factors Di* for all safety functions i which have failed in the time interval T and present a graph showing the associated cumulative distribution. At D = 1 a dividing line is drawn which marks the maximum delay time. The number of safety functions to the right of this line should be small, i.e. the centre of the histogram should be well to the left of this line.

If several systems are to be compared, it would be interesting to calculate a single number which summarises the information on the maintenance performance with respect to deferred maintenance of safety systems. One could readily calculate the mean value D* of all Di*, with respect to the safety functions which have failed in the time interval T. The mean value gives the 'centre of gravity' of the distribution of the Di*'s, and as will be shown in Section 3, the mean value provides a good overview over the maintenance backlog status. Moreover, the mean value D* can easily be plotted as a function of time, which provides an overview over the evolution of the maintenance backlog in time. The mean value D* could also be plotted as an accumulated function of time, which provides an overview of the total delay time factor over a period of time. Care has however to be taken when using D* as a basis for judgments about the acceptability of the maintenance backlog. For example, requiring D* ≤ 1 would allow many large delay times (far beyond the limit 1), as long as the average is within the criterion 1. Hence, if a criterion should be based on D*, it should be considerably lower than 1.
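The arithmetic of Equations (1)–(3) and the overview indices can be sketched in a few lines of Python. This is our own illustration, not code from the paper: the function name and day-based units are assumptions, β0 = 10% is the value chosen in the study, the worked example reproduces the paper's SIL 2 sensor case, and the list of Di* values used for the indices is hypothetical.

```python
# Delay time factor, Equation (3): D = (d / (p * T)) * beta0**r.
# Units: d and T in days (T = 365 for one year). Illustrative sketch.

def delay_time_factor(d_days, p, r=0, beta0=0.1, T_days=365.0):
    """d_days: accumulated maintenance delay; p: required PFD (SIL band);
    r: degree of redundancy (0 = none); beta0: common cause beta-factor."""
    return (d_days / (p * T_days)) * beta0 ** r

# The paper's example: SIL 2 (p = 0.005), redundancy r = 1,
# 8.7 days of delayed maintenance over one year.
D = delay_time_factor(8.7, p=0.005, r=1)
print(round(D, 2))  # -> 0.48, within the criterion D <= 1

# Overview indices for a set of failed safety functions
# (hypothetical Di* values): mean D* and the share of factors above 1.
factors = [0.2, 0.48, 0.9, 1.7, 3.5]
mean_D = sum(factors) / len(factors)
share_above_1 = sum(f > 1 for f in factors) / len(factors)
```

Note that, as the text warns, a small mean_D can coexist with individual factors far above 1, which is why share_above_1 (the mass to the right of the D = 1 line) is tracked alongside the mean.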

Note that potential effects from compensating measures for failed safety functions are only included as a 'black and white' model: if a compensating measure exists, is perfectly effective and is in fact used, the respective failures of the safety function can be taken out of the list. If not all conditions are fulfilled, the failures are left in the list, conservatively assuming that the compensating measure is totally ineffective.

3 PILOT CASE STUDY

A pilot case study has been performed for a Norwegian offshore installation based on information from the computerised maintenance management system (CMMS). Notifications (failure reports) for this installation were extracted for the period 01.01.2007–31.12.2007. The total number of analyzed notifications is 1402. The pilot case study covers only corrective maintenance of safety functions.

The maintenance delay times could be extracted from the CMMS with reasonable accuracy. Unfortunately, the safety ratings of the safety functions were not given as SIL values, but as a safety criticality with 3 categories. For the purpose of the case study, a simple mapping between these categories and the SIL rating was performed. The mapping is considered reasonable for some functions where experience exists, but it must be validated for real-life usage of the method. The redundancy level was also readily available in the CMMS.

3.1 Distribution of delay time factor

As pointed out in Section 2, the distribution of the delay time factor over all safety functions which have failed in the time interval T will give valuable information on the maintenance backlog status.

In Figure 1 the cumulative distribution is shown for the pilot case. For about 70% of all failed safety functions the maintenance delay time factor was below 1, but there is also a considerable number of maintenance delays with a delay time factor well above 1; the maximum delay time factor in the period is in fact 162. The average delay time factor is D* = 4.3. As shown in Figure 1, the contribution from delay time factors >1 is rather high.

Figure 1. Cumulative distribution of delay time factor.

3.2 Maintenance performance as function of time

Next we study how the backlog evolves over time. Figure 2 presents the accumulated annual delay time factor D* as a function of time. Starting from a zero delay time factor, the figure shows how D* evolves through the year to arrive at the above reported D* = 4.3 at the end of the year.

Figure 2. Accumulated annual delay time factor D* as a function of time.

The different safety levels (SILs) are kept separate and are summed in the curve 'All SILs'. The mean delay time factors D* are computed on the basis of the actual delay times d at the specific point in time; for example, on 1 April a safety function that failed four days earlier is assigned a delay time d equal to 4 days (refer to Equation (3)).

Figure 2 illustrates clearly that there are periods with considerable backlog, i.e. the periods with high slope. During the end of 2007 the maintenance delays seem to be better under control than in the months of February and March.

Figure 2 also illustrates how this method can be used for active maintenance prioritisation. At any point in time a number of notifications are active. In order to minimise the maintenance delay factor (and thus the risk), focus should be put on the safety functions contributing to the highest slopes, i.e. the safety functions for which the factor β0^r / p is highest among the active notifications. For example, on 26 January a notification with SIL = 2 and r = 0 (no redundancy) is

created and it takes 51 days of delayed maintenance, contributing to a clearly visible slope in Figure 2. Other active notifications during this time contribute considerably less delay time. In general, attention should be placed on non-redundant SIL2 functions.

From Figure 2 one may conclude that the accumulated risk from delayed maintenance is too high, as there are many safety functions for which the PFD is more than doubled by delayed maintenance.

4 CONCLUSION AND FUTURE WORK

A new method for prioritisation of the maintenance backlog with respect to maintenance of failed safety functions has been proposed. The proposed method is linked to the established Safety Integrity Level regime from IEC 61508, which specifies the reliability requirements of safety functions. The method is thus inherently risk based. It furnishes a variety of risk indices to gain an overview over the status of the maintenance backlog of safety functions. The indices can be used to determine acceptance levels which tell the user whether the maintenance of safety functions is going reasonably well or whether safety is 'eroding away'. The indices can be used to monitor the maintenance performance through time, find important contributors to delayed maintenance, prioritise maintenance on a day-to-day basis, or compare the maintenance performance of several systems.

As part of future work it is planned to include preventive maintenance delays in the methodology, i.e. an analogous delay time factor originating from preventive maintenance being performed later than the planned time.

REFERENCES

IEC. 2000. IEC 61508—Functional safety of electrical/electronic/programmable electronic safety-related systems. First edition. IEC.
IEC. 2003. IEC 61511—Functional safety—Safety instrumented systems for the process industry sector. First edition. IEC.
NORSOK. 2001. NORSOK standard Z-008—Criticality analysis for maintenance purpose. Revision 2. Norwegian Technology Centre.
OLF. 2001. OLF 070—Recommended guidelines for the application of IEC 61508 and IEC 61511 in the petroleum activities on the Norwegian Continental Shelf. Revision 0.1. The Norwegian Oil Industry Association.

Shifts in environmental health risk governance: An analytical framework

H.A.C. Runhaar, J.P. van der Sluijs & P.P.J. Driessen


Copernicus Institute for Sustainable Development and Innovation, Utrecht University,
Utrecht, The Netherlands

ABSTRACT: Environmental Health Risks (EHRs) traditionally have been dealt with in a hierarchical and
technocratic manner. Preferably based on scientific expertise, standards set are uniform and based on legal
regulations. However, this approach has encountered implementation problems and deadlocks, particularly in
cases where scientific knowledge is at best incomplete and interests of powerful stakeholders conflict. Many
new approaches to manage EHRs have been implemented, which share two characteristics: an increased
integration of (a) cost-benefit and other considerations; (b) the public and other stakeholders; and (c) EHR
objectives in other sectoral policies, and an increased differentiation of EHR standards (partly as a consequence
of the former characteristic). Still little systematic empirical research has been conducted on the experiences
with these shifts in EHR governance, in particular in the light of the shortcomings of the ‘traditional’ approach
to EHR governance. This paper proposes an analytical framework for analyzing, explaining and evaluating
different categories of, and shifts in, EHR governance regimes. We illustrate our paper with the trends in EHR
governance described above.

1 INTRODUCTION

The physical environment poses many different risks to human health. In many countries worldwide, governments attempt to regulate environmental health risks (EHRs) by avoiding or reducing their occurrence, reducing or avoiding exposure to these risks, or by mitigating the effects of exposure to these risks.

A wide variety of approaches to dealing with EHRs can be observed. Not only are similar risks regulated differently in an international perspective as regards standards, principles for standard-setting or instruments implemented for reaching these standards; national governments appear not to have consistent regimes for dealing with different risks, either (Hood et al., 2004; O'Riordan, 1985; Zoetbrood, 2005).

To date, only limited research has been conducted on the reasons for the observed heterogeneity in EHR governance and the relative contribution of the various EHR governance regimes to the reduction of EHRs (Hood et al., 2004). The few publications in this area are primarily concerned with classifying EHR governance regimes and elaborating on the appropriateness of these regimes in distinct contexts (see Section 2). Yet, these papers are mainly conceptual in nature; empirical evaluations in this area are scarce.

In this paper we develop an analytical framework for addressing the above knowledge gap and show how this framework can be used for analyzing some trends in EHR governance that we observed in a recently conducted empirical quick-scan survey (Soer et al., 2008). We conclude the paper with some suggestions for further research in this area.

2 EHR GOVERNANCE: KEY CONCEPTS

2.1 What are EHRs?

In the risk research community, a substantial body of work has been carried out on how the concept 'risk' should be defined. The International Risk Governance Council (IRGC) defines risk as "an uncertain (generally adverse) consequence of an event or an activity with respect to something that humans value" (Bunting, 2008). Analytically, risks have two main components: the probability of an event and its consequences for health, environment and goods (cf. Neumann and Politser, 1992).

EHRs are risks stemming from the physical environment in which humans live. A distinction can be made between health risks stemming from natural events (e.g. floods, earthquakes) and man-induced EHRs (e.g. by the emission of toxic substances or the production of noise). In this paper we focus on the latter type of EHRs, as we expect that for these

369
types of risk there are more opportunities to influence them (in the case of natural events, EHR management will mainly be focused on adaptation and mitigation of adverse health effects and less on fundamentally addressing their causes).

2.2 EHR governance regimes

EHR governance regimes can be defined as "the complex of institutional geography, rules, practice, and animating ideas that are associated with the regulation of a particular risk or hazard" (Hood et al., 2004: 9). Only limited research has been conducted on the classification of risk regimes (Hood et al., 2004). Below we discuss a few such attempts.

In comparing the ways in which risks were regulated in different countries, O'Riordan (1985) identified four (partly overlapping) 'styles of risk regulation' that can be considered as four risk governance regimes. These regimes primarily differ in the way in which decision-making is organized (top-down, interactive etc.) and the extent to which consensus among stakeholders is possible.

Hood et al. (2004) consider risk governance regimes as systems consisting of interacting or at least related parts and identify nine distinct configurations or risk governance regimes, which they identify by means of two dimensions:

• 'Basic control system components': ways of gathering information; ways of setting standards; and ways of changing behavior in order to meet the standards (i.e. policy instruments);
• The 'instrumental' and 'institutional' elements of regulatory regimes: the 'regulatory regime context' (different types of risks at hand, the nature of public preferences and attitudes over risk, and the ways in which stakeholders are organized) and the 'regime content' (the policy setting, the configuration of state and other organizations directly engaged in risk regulation and their attitudes, beliefs and 'operating conventions').

The IRGC takes a more normative approach. It advocates a framework for risk governance that aims at, among other things, enhanced cost-effectiveness, equal distribution of risks and benefits, and consistency in risk assessment and management of similar risks (Bunting, 2008; Renn, 2006). The IRGC framework takes into account the following elements:

• The structure and function of various actor groups in initiating, influencing, criticizing or implementing risk decisions and policies;
• Risk perceptions of individuals and groups;
• Individual, social and cultural concerns associated with the consequences of risk;
• The regulatory and decision-making style (in part based on O'Riordan, 1985);
• The requirements with respect to organizational and institutional capabilities for assessing, monitoring and managing risks (including emergency management).

Other relevant elements of risk governance regimes found in the literature include culture, principles, and uncertainty and ambiguity. Rayner and Cantor (1987) discuss four different 'institutional cultures' that may underlie approaches to risk management. These cultures reflect distinct worldviews, based among other things on particular principles of social justice and perceived economic interests. The importance of culture and cultural values in the perception and selection of risks is also recognized by, among others, Douglas and Wildavsky (1982). It is not clear, however, whether culture is part of governance regimes or an important context variable influencing the way in which governance regimes are shaped and function. Related to cultural aspects are the principles behind standard setting; both elements are based on some core values. As regards principles for EHR standard-setting, De Hollander and Hanemaaijer (2003) for instance distinguish between four approaches, reflecting different principles:

• 'Right-based' approach: equal protection for all above a certain risk level;
• 'Utility-based' approach: maximization of benefits for society as a whole (i.e. achieving the greatest increase in overall public health at the lowest cost);
• 'Technology-based' approach: using the best available techniques;
• The precautionary principle: a cautious approach in the light of complexity, uncertainty and irreversibility.

The EEA (2001) emphasizes the importance of uncertainty in risk governance and suggests a framework for classifying risks regarding different forms of uncertainty (risk, uncertainty and ignorance) and actions (and guiding principles) for handling these situations. Klinke and Renn (2002) also discuss ambiguity: the presence of "contested views about the desirability or severity of a given hazard" (Klinke and Renn, 2002: 1092). UNESCO COMEST (2005) addresses the issue of risk governance under uncertainty and explores the implications of the precautionary principle for risk governance. Van der Sluijs (2007) reviews the challenges that uncertainty and precaution pose to risk governance, focusing on the interface between science and policy, and concludes that a precautionary—post normal—style of risk governance has slowly but steadily started to invade traditional approaches. Methodological challenges remain as
to how to further integrate different approaches to deal with uncertainty, in particular the more technical approaches and the approaches from the policy perspective.

The above discussion suggests that there is no such thing as a generally accepted framework for identifying and classifying risk governance regimes, but it provides us with what seem to be relevant elements of such a framework:

• EHR type in terms of (perceived) seriousness, certainty of the knowledge basis and the prevalence of conflicting interests among stakeholders as regards EHR handling;
• Governance style, referring to the plurality and diversity of actors allowed to be involved in decision-making, the moment when they are involved in decision-making processes, and their role in the implementation of EHR strategies;
• The role of science and other knowledge sources in decision-making on EHR objectives;
• The principles behind EHR objectives;
• Dominant instruments in the implementation of EHR strategies (e.g. hierarchical or more voluntary forms of regulation);
• Culture and values;
• The institutional context;
• The interlinkages between the above elements.

2.3 Evaluating EHR governance regime performance

Although many criteria can be employed for assessing the performance of EHR governance regimes (see for instance those employed by the IRGC in Section 2.2), one of the basic criteria would seem to be the extent to which they succeed in reducing EHRs to levels that are acceptable to decision-makers, the public, scientists and other stakeholders. At least two interrelated dimensions are relevant here. The first is the speed with which EHRs are identified, decision-making on their management is completed, and implementation is organized. The second is the outcome in terms of (perceived) risk reduction and its evaluation by relevant actors.

Yet it is also interesting to move beyond the level of individual EHRs, e.g. by looking at how EHR regimes impact upon other EHRs. Relevant in this light is the concept of 'risk migration': the situation where one risk is reduced but another one is created or enhanced. An example is polybrominated diphenyl ether (PBDE)-based flame-retardant compounds, which accumulated in the environment and in humans and in particular caused societal anxiety when they were found in breast milk (Alcock and Busby, 2006).

[Figure 1 shows a 2 x 2 matrix of policy problem types: structured problems (e.g. road maintenance); moderately structured / means problems (e.g. traffic safety); moderately structured / goals problems (e.g. abortion); and unstructured problems (e.g. car mobility).]
Figure 1. Types of policy problems (with examples). Source: Hoppe, 2002 (adapted).

2.4 Explaining EHR governance regime selection and performance

What explains the presence of particular EHR governance regimes, and how can we understand the contribution of these regimes to the reduction of EHRs? Hisschemöller and Hoppe (2001) and Hoppe (2002) adopted a useful theoretical model that links governance regimes to two specific characteristics of the policy problem at issue: the certainty of the knowledge basis and the extent to which norms and values converge (see Figure 1). In contrast to the authors discussed in the preceding Section, Hisschemöller and Hoppe (2001) and Hoppe (2002) explicitly consider these characteristics as independent variables relevant to the appropriateness and performance of governance regimes. Interactions and negotiations with, and input from, stakeholders are assumed to be necessary when the stakes of the various actors involved are high, norms and values diverge, and when there is high uncertainty about the causes of the policy problem or the impacts of alternative policy programs – i.e. when 'unstructured' policy problems are at issue. This unstructured problem category is similar to the post normal science type of risk assessment proposed by Funtowicz and Ravetz (1993). In these situations stakeholder involvement is required in all stages of policy-making, including analysis, both in order to get access to relevant information and to create support for policy-making. Examples of this mode of problem solving are discussed in Kloprogge and Van der Sluijs (2006) for the risks of climate change. 'Structured' policy problems, on the contrary, can be solved in a more hierarchical way. Here, policy can be left to public policy-makers; involvement of stakeholders is not needed for analysis or successful problem solving. In this case policy-making has a technical character and is often heavily based on scientific knowledge. In the case of moderately structured 'means' problems, stakeholder involvement is not required for recognition of the problem at issue,
but mainly for the selection of the means by which this goal is to be reached. Since there is high uncertainty about the effectiveness of, and stakeholder preferences for, various solutions, policy-makers search together with stakeholders for adequate problem-solving activities. Finally, in the case of moderately structured 'goals' problems, there is substantial agreement on certain knowledge but sometimes intense disagreement about the norms and values at stake and about the goals that should be set. Interaction with stakeholders is required in order to identify and streamline stakeholder preferences.
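The two problem characteristics in this typology act as a simple decision table, which can be made explicit in a few lines. The sketch below is our own illustrative rendering (the function name, labels and suggested responses are paraphrased from the discussion above, not taken from Hoppe), assuming each characteristic can be treated as a yes/no judgment:

```python
# Illustrative sketch of the Hoppe (2002) typology: two characteristics of a
# policy problem jointly determine its type and the form of stakeholder
# involvement suggested in the literature discussed above.

def classify_problem(knowledge_certain, values_converge):
    """Map the two problem characteristics onto one of four problem types."""
    if knowledge_certain and values_converge:
        return ("structured",
                "hierarchical, science-based policy-making; stakeholder "
                "involvement not needed for analysis or problem solving")
    if values_converge:  # knowledge uncertain, norms and values agreed
        return ("moderately structured (means)",
                "joint search with stakeholders for adequate means")
    if knowledge_certain:  # knowledge agreed, norms and values contested
        return ("moderately structured (goals)",
                "interaction with stakeholders to identify and "
                "streamline stakeholder preferences")
    return ("unstructured",
            "stakeholder involvement in all stages of policy-making, "
            "including analysis")

# Example: high uncertainty and diverging values, as with many EHRs.
problem_type, response = classify_problem(knowledge_certain=False,
                                          values_converge=False)
print(problem_type)  # unstructured
```

The mapping simply restates the four cases described in the surrounding text; the quadrants, not the code, carry the analytical content.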
Other bodies of literature focus on specific aspects of governance regimes, in particular the role of science in policy (e.g. Funtowicz and Strand, 2008; Gibbons et al., 1994). Many publications in that area start by emphasizing that science often cannot be the only knowledge producer, in particular in the case of uncertainties or moral issues where perceptions of stakeholders diverge and where consensus is required for successful policy (Funtowicz and Strand, 2008). This is in line with Hisschemöller and Hoppe (2001) and Hoppe (2002).

The above literature is useful as a starting point in the search for explanations for the EHR governance regimes found in practice and their performance, but two aspects deserve attention. One, the literature discussed is in part normative; it prescribes 'ideal' governance responses in particular situations. However, in practice, problems that can be characterized as 'unstructured' are not always dealt with according to what the literature suggests. In finding explanations for the presence of certain governance regimes, other aspects may therefore also be relevant (e.g. the existence of diverging interests and power relations; see also Flyvbjerg, 1998). Two, the literature starts from problems that have been accepted on societal and political agendas and does not explicitly address the issue of when problems become accepted. In the case of EHRs (and risks in general), calamities are often followed by intense public and political reactions irrespective of the probability of such events. Risk awareness can also be created by 'policy entrepreneurs' with an interest in bringing together problem-owners, decision-makers and solutions (Kingdon, 1995).

2.5 Towards an analytical framework for evaluating EHR governance regimes

Figure 2 brings together the elements of our analytical framework discussed above. (Block) arrows indicate the types of relationships between the elements that we expect to find in empirical analysis. Feedback loops from EHR governance regimes and their outcomes to their context are possible, but for the sake of this paper are ignored in the figure.

Figure 2. Analytical framework for characterizing, explaining and evaluating EHR governance regimes. Source: Authors.

3 UNDERSTANDING EHR GOVERNANCE IN PRACTICE: ILLUSTRATION

Rather than predefining distinct configurations of EHR governance regimes and analyzing and evaluating these, we suggest taking the elements from the analytical framework presented above to identify distinct EHR governance regimes, and developments in these regimes, from observations in practice. In this Section we illustrate the value of the analytical framework by analyzing and evaluating EHR governance regimes at the macro level, encompassing shifts in EHR regimes in general. Our framework, however, can also be used for micro-level analyses focusing on distinct regimes for a particular EHR in a particular country and time period.

In a quick-scan survey conducted in 2007, Soer et al. (2008) compared trends in EHR governance in the Netherlands with those in nine European countries, the U.S. and Australia. The authors observe shifts that they summarize as a turn from a 'traditional', specialized approach to a more integrated and differentiated approach. How can this trend be analyzed, understood and evaluated from our analytical framework? Below we first describe and evaluate the traditional approach to EHR governance and then try to explain the shifts and make an ex ante evaluation of them.

3.1 'Specialized' EHR governance regimes

3.1.1 Characterizing EHR governance regimes
The traditional way of dealing with EHRs, as observed in many western countries from the Industrial Revolution onwards, can be summarized as follows (e.g. Fiorino, 1990; Heriard-Dubreuil, 2001; O'Riordan, 1985; Sunstein, 2002):
• Regarding the role of stakeholders: usually, government agencies set standards in hierarchical ways. (Formal) public participation is limited or absent. Informal influences by stakeholder groups, however, do exist (see below);
• Regarding the role of science: scientists are the logical providers of knowledge on the nature and severity of EHRs and on the levels at which health risks are acceptable;
• Regarding principles underlying EHR standards: standards are based primarily on estimated health impacts, preferably derived from scientific risk assessments. Cost-benefit and other considerations usually do not play a major role in the early stages of EHR governance, i.e. in the definition of standards;
• Regarding standards: environmental health risks are dealt with on an individual basis. Where possible, quantified standards are formulated that are generic, i.e. apply to all of the situations specified and do not discriminate between groups of people. This results in detailed standards, which is one of the reasons why this approach is often considered 'technocratic'. Compensation of health risks geographically or between groups of people (either concerning one specific risk type or between different risk types) is usually not allowed;
• Regarding policy instruments: a linear approach to EHR governance is taken: implementation and the selection of instruments (of which legislation and licenses are typical) follow after risk assessment and standard-setting and do not play a large role in earlier stages of EHR policy.

Today this approach, which can be called 'specialized' due to its focus on individual EHRs, can still be observed worldwide. In Europe, for instance, this approach is visible in the EU Directives regulating concentrations of particulate matter and other EHRs.

3.1.2 Explaining 'specialized' EHR regimes
The technocratic and hierarchical character of EHR regimes reflects wider ideas on the role of governments in society, which were dominant in Western, liberal societies until some two decades ago. (Central) governments were considered to have a strong and leading role in addressing social problems, and there was much faith in the contribution that science could make to enhancing the effectiveness and efficiency of government policy (Fischer, 1997; Van de Riet, 2003). This has become institutionalized in very specialized (and growing) bureaucracies and an important role for (institutionalized) science-based knowledge providers ('policy analysis').

3.1.3 EHR governance outcomes and explanations
The 'specialized' approach has resulted in the reduction of various health risks. For instance, in the Netherlands air and water quality improvements were realized and smog levels were reduced dramatically, whereas the noise problem was stabilized (Keijzers, 2000). A similar picture emerges if we take a European perspective (e.g. EEA, 2003, 2007 regarding air and water quality). Yet the traditional approach to EHRs also encountered various (often interrelated) problems, which at least in part can be explained by characteristics of the EHR governance regime configuration.

First, various examples are known of EHRs with severe health consequences that have not been regulated at all. An important explanation is found in limitations in scientific knowledge, hindering the setting of clear EHR standards (Health Council of the Netherlands, 1995; Open University, 1998; U.S. EPA, 2004; VROM, 2004). However, another explanation is found in powerful lobbies of interest groups that succeed in lowering EHR standards (see for instance Hood et al., 2004 for the U.K. situation). Here strong economic interests and weak(ly presented) public (health) interests result in deadlock situations, with EHRs not being actively dealt with. A third explanation lies in the principles behind EHR standard setting, where cost-benefit considerations usually are not of primary importance. This has resulted in what is perceived as excessive costs of meeting EHR standards (e.g. Health Council of the Netherlands, 1995). An example is soil quality: in many western countries, including for instance Denmark and the Netherlands, there are vast amounts of land polluted by industrial activities that need to be cleaned up but for which funding is lacking (Soer et al., 2008). Another example is radon gas in buildings, causing lung cancer. At least in the Netherlands, the costs of reducing this risk are perceived as excessive (VROM, 2004). A fourth explanation of EHRs not being actively taken up is that it is not always clear which societal actor should take responsibility for dealing with EHRs, in particular when those causing risks or in control of instruments for reducing EHRs cannot be distinguished clearly or when governments are expected to deregulate (Renn, 2006; U.K. H.M. Treasury, 2005; VROM, 2004). It should be noted that examples are also found of EHR policies that have been criticized due to a limited scientific basis or low cost-effectiveness. In the Netherlands, for instance, millions of euros were made available for research on the health effects of radiation from mobile phone transmitting stations in response to public concerns, while there was little scientific evidence of such effects. This illustrates another important shortcoming of the specialized approach: difficulties in reconciling 'objective' scientific EHR assessments and 'subjective' EHR perceptions by the public. Perceptions of stakeholders are considered to have become more important due to scientific limitations, democracy concerns and a growing
dependence of governments on stakeholders. Even though knowledge has been built on the variables that affect the perceived seriousness of risks (e.g. voluntariness and personal control, catastrophic potential, emotional associations with the risk, trust in regulatory agencies, etc.) and guidelines are available for explicitly taking public perceptions of EHRs into account (e.g. Klinke and Renn, 2002), this issue has thus far only been addressed marginally in EHR governance regimes (e.g. in the Netherlands and elsewhere, for external safety a distinction is made between 'individual risk' and 'group risk', based on the societal impact of a group of people killed or injured in comparison to individual casualties). Yet an overall framework for incorporating public concerns is lacking.

A second and related problem is that the realization of EHR standards, or the setting of stricter EHR standards, is often hindered by conflicts between EHR objectives and those of other policy domains, in particular spatial planning. Especially in intensively used urban areas, further spatial developments are often hindered by strict EHR norms (e.g. Glasbergen, 2005; Wheeler and Beatley, 2004). Moreover, the institutional context, which places responsibility for EHR governance mainly with state agencies that, in addition, often prescribe detailed (minimum) standards for EHRs, does not provide an incentive for the actors involved (manufacturers, planners, etc.) to reduce EHRs beyond the legal minimum standards or to find innovative ways of reducing EHRs (Health Council of the Netherlands, 1995).

A third problem with the specialized approach relates to difficulties in dealing with 'cumulative' effects of EHRs (e.g. Health Council of the Netherlands, 1995; U.S. EPA, 2004). On the one hand, such effects are overlooked in the case of source-based regulations (in contrast to effect-based regulations, such as the EU Directives for air pollution that prescribe maximum concentrations of, for instance, particulate matter). On the other hand, problems exist because of a lack of knowledge on the combined effect of different EHRs (e.g. Robinson and MacDonell, 2006; VROM, 2004). Apart from that, the specialized approach sometimes results in 'risk migration' (see Section 2.3).

3.2 'Integrated' and 'differentiated' EHR governance regimes

3.2.1 Characterizing EHR governance regimes
A recent quick-scan survey of trends in EHR governance in ten European countries, the U.S. and Australia (Soer et al., 2008) revealed various changes in EHR governance regimes. It should be noted that these trends are not found in all of these countries with the same intensity. Nevertheless, the study points to the following changes in the elements of EHR governance regimes outlined in our analytical framework (see Figure 2):

• Regarding the role of stakeholders: the public and other stakeholders are increasingly involved in the formulation and implementation of EHR governance regimes. Not only is a more participatory approach to EHRs increasingly considered necessary in order to create support for EHR policies, it also stems from a desire to make other stakeholders co-responsible for preventing and reducing EHRs (Renn, 2006; U.K. H.M. Treasury, 2005; Soer et al., 2008; VROM, 2004). Participation sometimes serves to create legitimacy and trust, for instance in the U.K., where public protest arose after the public was exposed to 'unacceptable' risks relating to, among other things, BSE (Fisher, 2000);
• Regarding the role of science: a few efforts have been made to reconcile scientific and stakeholder risk perceptions. The U.K. H.M. Treasury (2005) for instance has proposed to make 'concern assessment' an integral part of risk management. Since 2006 there has been a legislative requirement to attempt to incorporate public opinion in the decision-making process by means of such an assessment, although it is recognized that "public attitudes to risk are difficult to measure" and that "public concerns may be founded on ignorance, false information, unreliable media reporting, etc." (U.K. House of Lords, 2006: 6). Eventually it is up to decision-makers to determine the relative weight of such factors;
• Regarding principles underlying EHR standards: cost-benefit, economic and other concerns are more explicitly (and perhaps also more intensively) included in EHR assessments (e.g. U.K. H.M. Treasury, 2005; VROM, 2004). In this way a better integration of EHR standard setting and EHR implementation is also pursued. There are, however, also countries that explicitly do not consider costs or feasibility in the risk assessment and standard setting stage (e.g. Australia and the U.S.; Soer et al., 2008);
• Regarding standards: an increased differentiation can be discerned. In the Netherlands, for instance, some experiments have been conducted with giving urban planners more policy freedom in the formulation of area-specific environmental ambitions (Runhaar et al., 2009). In this way a better coordination and local optimization of environmental (health) planning and spatial planning is facilitated. These experiments allow for a limited form of differentiation of environmental objectives (in a spatial sense). Compensation—a lower environmental quality in one domain being offset by an improvement in another—was also envisaged, but appeared impossible (also in light of EU Directives). Also in Scotland it is legally possible to generate
374
area-specific environmental standards (Soer et al., societal context. The trend towards more public and
2008). Obviously differentiation of EHR standards stakeholder participation and integration of policy
is not a new phenomenon—however in the past sectors form part of a broader shift towards more
differentiation usually related to predefined situa- integrated approaches to dealing with societal issues
tions. Differentiation is also envisaged in another in Western, liberal societies, which are commonly
form—an explicit and deliberate (rather than an referred to a change in planning from ‘government’
ad hoc) differentiation of EHR governance regimes to ‘governance’. There is no agreed-upon definition
for distinct types of EHRs. Up to now several of what governance entails (Jordan et al., 2005). Yet,
countries including the Netherlands, Malta and the often-mentioned characteristics are: a decline of cen-
United Kingdom have announced to adopt such an tral government’s ability to regulate society in a top-
approach (Soer et al., 2008). Yet implementation is down fashion, resulting in more horizontal forms of
still in an early stage. It is therefore still uncertain steering in cooperation with other government ties and
whether or not this type of differentiation is feasible, stakeholders outside the government; a more direct
realistic, or even desirable; and early inclusion of stakeholders in the formulation
• Regarding policy instruments: risk assessment and and implementation of policies; shared responsibili-
management are more integrated. Also, EHR objec- ties between state, civil society and market actors in
tives are increasingly integrated in other sectors, in the formulation and implementation of policies; (as
particular in spatial planning. Examples of coun- a consequence) a blurring of the traditional bound-
tries are Australia, Germany and the Netherlands. aries between state, civil society and market (Van
Regularly used tools for better incorporating EHRs Kersbergen and Van Waarden, 2001). Two important
in spatial planning are Health Effect Screenings, causes of this trend towards governance are a (per-
Health Impact Assessments and Strategic Environ- ceived) ineffectiveness of central government policy
mental Assessments. In Malta integration between due to, among other things, limits to the extent in
policy sectors is strived after through the formation which society can be known and steered, and the plea
of an inter-departmental committee on environment for more direct democracy.
and health (Soer et al., 2008).
On a more abstract level, these trends reflect a 3.2.3 ‘Specialized’ EHR regimes: Outcomes,
change to more integration (of cost-benefit and other explanations and challenges ahead
considerations in EHR standard-setting; of stakehold- The shift towards more integration and differentia-
ers in EHR in the formulation and implementation of EHR governance regimes; and of EHR objectives in other policy sectors) and more differentiation of EHR standards (partly as a consequence of the former shifts).

3.2.2 Explaining 'specialized' EHR regimes
The shifts in EHR governance regimes discussed above in part stem from the problems with the 'specialized' approach. In the Netherlands, for example, EHR planners have differentiated norms for soil quality according to land use functions. Soil norms for houses are stricter than those for industrial areas. The same approach has been followed as regards noise nuisance risks (Glasbergen, 2005). This differentiated approach is based on a desire to realize EHR reductions in a more cost-effective manner and to better reconcile EHR policies and spatial policies. Setting EHR standards based on 'best available technology' (BAT) or on what is 'reasonably practicable' (ALARP) is another well-known approach to reconciling EHR reduction and economic concerns. In this context these shifts can be considered as ways to overcome the outcomes of more 'specialized' regimes.

However, the above shifts can also be explained by more general developments in the broader [...]

[...]tion is observed in several countries but is far from crystallized. In many of the countries examined, plans are in immature stages or even only under consideration. Not much experience has yet been built up with increased integration and differentiation in EHR policies. Yet a few new dilemmas were identified, including the weighing of stakeholder opinions against scientific inputs and the weighing of health against other concerns (see also Glasbergen, 2005 and Runhaar et al., 2009). In addition, EHR planners state that they lack a clear framework for systematically dealing with EHRs. 'Old' problems include a lack of (scientific) data (e.g. on cumulative and risk migration effects), insufficient funding, and problems in communication between EHR planners, other sectoral planners and stakeholders. Finally, a potential risk of more integration is the compromising of EHR standards in favor of other ambitions (Soer et al., 2008).

It is far from certain that more integrated and differentiated governance regimes will completely replace more traditional EHR regimes. In Europe, EHRs are increasingly targeted at a supranational level (EU) by means of Directives that prescribe strict and uniform standards for acceptable EHRs. It is interesting to examine how these various forms of EHR regimes interact.

4 CONCLUSIONS

Given the limited attention being paid to EHR governance regimes in risk research, our aim was to develop an analytical framework for characterizing, explaining and evaluating such regimes. Based on a review of relevant literature we developed a framework, which we illustrated by means of some macro trends we observed in some Western countries.

The framework seems to be useful for guiding research into the above area as it allows for a systematic examination of relevant elements and possible relationships. In the analysis of recent shifts in EHR governance that we discussed as an illustration of our framework, not all elements were elaborated in much detail. Cultural influences on governance regimes, for instance, may be identified more explicitly in an international comparative study.

We suggest that further research be conducted in order to classify EHR governance regimes with the aid of our framework. What distinct configurations can be found in practice, next to the (perhaps artificially constructed) 'specialized' approach we discussed? And what relations between the various elements exist? The shifts towards integration and differentiation identified in the quick-scan survey conducted by Soer et al. (2008) may act as a starting point. Three remarks should be made here. One, in this study only a (probably non-representative) sample of countries was included. Two, the shifts identified were not observed in each of the sample countries with the same intensity. Three, the survey focused on the main characteristics of national EHR governance regimes; no in-depth research was conducted into specific EHRs. However, many of the trends that we observed are similar to those reported by, for instance, Amendola, 2001, De Marchi, 2003, Heriard-Dubreuil, 2001 and Rothstein et al., 2006.

REFERENCES

Alcock, R.E. & Busby, J. 2006. Risk migration and scientific advance: the case of flame-retardant compounds. Risk Analysis 26(2): 369–381.
Amendola, A. 2001. Recent paradigms for risk informed decision making. Safety Science 40(1–4): 17–30.
Bunting, C. 2008. An introduction to the IRGC's risk governance framework. Paper presented at the Second RISKBASE General Assembly and Second Thematic Workshop WP-1b. May 15–17. Budapest, Hungary.
De Marchi, B. 2003. Public participation and risk governance. Science and Public Policy 30(3): 171–176.
Douglas, M. & Wildavsky, A. 1982. How can we know the risks we face? Why risk selection is a social process. Risk Analysis 2(2): 49–51.
EEA. 2001. Late lessons from early warnings: the precautionary principle 1896–2000. Copenhagen: European Environmental Agency.
EEA. 2003. Europe's water: An indicator-based assessment. Copenhagen: European Environmental Agency.
EEA. 2007. Air pollution in Europe 1990–2004. Copenhagen: European Environmental Agency.
Fiorino, D.J. 1990. Citizen participation and environmental risk: a survey of institutional mechanisms. Science, Technology & Human Values 15(2): 226–243.
Fischer, F. 1997. Evaluating public policy. Chicago: Nelson-Hall Publishers.
Fisher, E. 2000. Drowning by numbers: standard setting in risk regulation and the pursuit of accountable public administration. Oxford Journal of Legal Studies 20(1): 109–130.
Flyvbjerg, B. 1998. Rationality and power. Democracy in practice (translated by S. Sampson). Chicago/London: University of Chicago Press.
Funtowicz, S.O. & Ravetz, J.R. 1993. Science for the post-normal age. Futures 25(7): 739–755.
Funtowicz, S. & Strand, R. 2008. Models of science and policy. In T. Traavik & L.C. Lim (eds), Biosafety first: holistic approaches to risk and uncertainty in genetic engineering and genetically modified organisms. Trondheim: Tapir Academic Press (forthcoming).
Gibbons, M., Limoges, C., Nowotney, H., Schwarzman, S., Scott, P. & Trow, M. 1994. The new production of knowledge: the dynamics of science and research in contemporary societies. London: Sage.
Glasbergen, P. 2005. Decentralized reflexive environmental regulation: opportunities and risks based on an evaluation of Dutch experiments. Environmental Sciences 2(4): 427–442.
Health Council of the Netherlands. 1995. Niet alle risico's zijn gelijk: kanttekeningen bij de grondslag van de risicobenadering in het milieubeleid (Not all risks are equal: a critical approach to environmental health risk policy; in Dutch). The Hague: Gezondheidsraad.
Heriard-Dubreuil, G.F. 2001. Present challenges to risk governance. Journal of Hazardous Materials 86(1–3): 245–248.
Hisschemöller, M. & Hoppe, R. 2001. Coping with intractable controversies: the case for problem structuring in policy design and analysis. In M. Hisschemöller, R. Hoppe, W.N. Dunn & J.R. Ravetz (eds), Knowledge, power, and participation in environmental policy analysis: 47–72. New Brunswick/London: Transaction Publishers.
Hollander, A. de & Hanemaaijer, A. 2003. Nuchter omgaan met risico's (Dealing sensibly with risks; in Dutch). Bilthoven: Netherlands Environmental Assessment Agency.
Hood, C., Rothstein, H. & Baldwin, R. 2004. The government of risk. Oxford: Oxford University Press.
Hoppe, R. 2002. Cultures of public policy problems. Journal of Comparative Policy Analysis: Research and Practice 4(3): 305–326.
Jordan, A., Wurzel, R.K.W. & Zito, A. 2005. The rise of 'new' policy instruments in comparative perspective: has governance eclipsed government? Political Studies 53(3): 477–496.
Keijzers, G. 2000. The evolution of Dutch environmental policy: the changing ecological arena from 1970–2000 and beyond. Journal of Cleaner Production 8(3): 179–200.

Kersbergen, K. van & Waarden, F. van. 2001. Shifts in governance: problems of legitimacy and accountability. The Hague: Social Science Research Council.
Kingdon, J.W. 1995. Agendas, alternatives, and public policies. New York: Harper Collins College.
Klinke, A. & Renn, O. 2002. A new approach to risk evaluation and management: risk-based, precaution-based, and discourse-based strategies. Risk Analysis 22(6): 1071–1094.
Kloprogge, P. & Sluijs, J.P. van der. 2006. The inclusion of stakeholder knowledge and perspectives in integrated assessment of climate change. Climatic Change 75(3): 359–389.
Neumann, P. & Politser, R. 1992. Risk and optimality. In F. Yates (ed.), Risk-taking Behaviour: 27–47. Chichester: Wiley.
Open University. 1998. Risico's: besluitvorming over veiligheid en milieu (Risks: decision-making on safety and the environment; in Dutch; course book). Heerlen: Open University.
O'Riordan, T. 1985. Approaches to regulation. In H. Otway & M. Peltu (eds), Regulating industrial risks. Science, hazards and public protection: 20–39. London: Butterworths.
Rayner, S. & Cantor, R. 1987. How fair is safe enough? The cultural approach to societal technology choice. Risk Analysis 7(1): 3–9.
Renn, O. 2006. Risk governance. Towards an integrative approach. Geneva: International Risk Governance Council.
Riet, O. van de. 2003. Policy analysis in multi-actor policy settings. Navigating between negotiated nonsense and superfluous knowledge. Ph.D. thesis. Delft: Eburon Publishers.
Robinson, P. & MacDonell, M. 2006. Priorities for mixtures health effects research. Environmental Toxicology and Pharmacology 18(3): 201–213.
Rothstein, H., Irving, Ph., Walden, T. & Yearsley, R. 2006. The risks of risk-based regulation: insights from the environmental policy domain. Environment International 32(8): 1056–1065.
Runhaar, H.A.C., Driessen, P.P.J. & Soer, L. 2009. Sustainable urban development and the challenge of policy integration. An assessment of planning tools for integrating spatial and environmental planning in the Netherlands. Environment and Planning B 36(2) (forthcoming).
Sluijs, J.P. van der. 2007. Uncertainty and precaution in environmental management: insights from the UPEM conference. Environmental Modelling and Software 22(5): 590–598.
Soer, L., Bree, L. van, Driessen, P.P.J. & Runhaar, H. 2008. Towards integration and differentiation in environmental health-risk policy approaches: An international quick-scan of various national approaches to environmental health risk. Utrecht/Bilthoven: Copernicus Institute for Sustainable Development and Innovation/Netherlands Environmental Assessment Agency (forthcoming).
Stoker, G. 1998. Governance as theory: five propositions. International Social Science Journal 50(155): 17–28.
Sunstein, C.R. 2002. Risk and reason. Safety, law, and the environment. New York: Cambridge University Press.
U.K. H.M. Treasury. 2005. Managing risk to the public: appraisal guidance. London: H.M. Treasury.
U.K. House of Lords. 2006. Economic Affairs—Fifth Report. London: House of Lords.
UNESCO COMEST. 2005. The precautionary principle. Paris: UNESCO.
U.S. EPA. 2004. Risk assessment principles and practices. Washington: United States Environmental Protection Agency.
VROM. 2004. Nuchter omgaan met risico's. Beslissen met gevoel voor onzekerheden (Dealing sensibly with risks; in Dutch). The Hague: Department of Housing, Spatial Planning, and the Environment.
Wheeler, S.M. & Beatley, T. (eds). 2004. The sustainable urban development reader. London/New York: Routledge.


What does ‘‘safety margin’’ really mean?

J. Hortal, R. Mendizábal & F. Pelayo


Consejo de Seguridad Nuclear, Madrid, Spain

ABSTRACT: The term "safety margin" has become a keyword when discussing the safety of nuclear plants, but there is still much confusion about the use of the term. In this paper the traditional concept of safety margin in nuclear engineering is described, and the need to extend the concept to out-of-design scenarios is argued. A probabilistic definition of safety margin (PSM) is adopted for scalar safety outputs at a scenario-specific level. The PSM is easy to generalize (to initiating events, multiple safety outputs, analytical margins) and, combined with the frequencies of initiators and accidents, makes up the plant risk. Both deterministic and probabilistic approaches to safety assessment find a natural explanation in terms of the PSM. The role of probabilistic margins in the safety assessment of plant modifications is also discussed.

1 INTRODUCTION

In recent years, the international nuclear community has become more and more concerned by the possibility that significant changes in plant design or in operation strategies result in adverse side effects usually referred to as "erosion of safety margins". A number of initiatives have been launched to address this problem due to the increasing number of plants applying for power uprates, life extensions, increased fuel burn-up, etc., where some voices claim that, even complying with applicable regulations, there could be an unacceptable loss of safety margins. Moreover, the development of new designs for nuclear power plants, where the existing technical regulations for LWRs are not necessarily applicable, raises the need to establish criteria for determining what is an acceptable level of safety.

A second reason for the discussion about safety margins is the increasing trend to apply the so-called Risk-informed Regulation (RIR). Starting with the pioneering Regulatory Guide 1.174 of the USNRC, most safety standards and guides on this matter ask for "maintaining enough safety margins" as a condition for the acceptability of any change being licensed in the framework of RIR.

For these and other reasons, the term "safety margin" has become a keyword when discussing the overall safety of plants, but there is still much confusion about the use of this term. This paper proposes a consistent concept of safety margin that could be helpful in discussions about overall plant safety without losing its traditional meaning.

2 THE TRADITIONAL CONCEPT OF SAFETY MARGIN

The introduction of safety margins in traditional engineering is a protection design technique aimed at providing some additional protection capability beyond the one that is considered strictly necessary. The benefit of using safety margins is two-fold. On the one hand, they make it possible to accommodate tolerances for little-known phenomena, uncertainties in model data, variabilities in initial or boundary conditions, and so on. On the other, they result in a significant simplification of the design methods, as they allow the design to be split into several decoupled stages where the applicable criteria are not too closely linked to the details of the phenomenology considered in the design analyses.

This approach, essentially deterministic, is considered conservative in most cases and, therefore, provides confidence that the protection is able to cope with or to mitigate challenging situations, including some that were not considered in the design analyses. In addition, it is convenient for developing safety regulations. This idea was also applied from the beginning in the nuclear industry and, in particular, in the analysis of Design Basis Transients and Accidents (DBT&A, from now on referred to as DBT), where the capabilities of the protection are assessed. A set of well defined, enveloping scenarios, classified into a few frequency classes, are taken as the design basis for the protection, and a set of safety variables are used as damage indicators or as indicators of challenges to the protection barriers. For this limited set of design basis scenarios

Figure 1. Safety margins in the analysis of Design Basis Transients and accidents.

it is possible to define class-specific acceptance criteria in terms of extreme allowed values of the safety variables, also called safety limits.

In this context, the concept of safety margin is applied on a scenario-specific basis and its meaning can be agreed without much difficulty. However, even at the single-scenario level, a great variety of margins appear, and all of them can properly be called safety margins. Figure 1 tries to represent these margins and how they relate to each other.

In this figure, the two left-most columns represent the barrier analysis and the two on the right side represent the analysis of radiological consequences. In the first column, a particular safety variable in a particular DBT is represented. Since the DBT is an enveloping scenario, the extreme value of the safety variable in the enveloped transients will stay below the value of the same variable in the DBT, which should, in turn, stay below the acceptance criterion or safety limit. There will be as many "left-most columns" as the number of safety variables times the number of DBT. This is indicated in Figure 1 by the dashed ellipse entitled "Other S.V. and Acc. Crit." In every one of these columns there will be an Analytical Margin and a Licensing Margin.

Each safety variable and its corresponding safety limit are selected to prevent a particular failure mode of a protection barrier. However, the safety limit is not a sharp boundary between safety and failure. Overpassing the safety limit means that there are non-negligible chances of a given failure mode but, in most cases, there is a margin (the Barrier Margin in figure 1) between the safety limit and the actual failure. A given failure mode of a particular barrier can result from a variety of transients, as indicated by the converging arrows linking the first two columns of figure 1.

As in the previous case, there are several possible modes of failure of each barrier, as indicated by the dashed ellipse "Other barrier failures". Each combination of barrier failures and type of accident gives rise to a particular release of radioactive products (source term), as indicated by the converging arrows linking the second and third columns. Again, a limited set of enveloping DBT is selected in order to perform the source term analysis. These DBT will be, in general, fewer and different from those used in the barrier analysis. The selection of DBT for source term analysis, and the analysis of these DBT to confirm that they remain below the Source Term Reference Limit, introduce two new margins, identified as the Source Term Analytical Margin and the Source Term Margin in figure 1.

Finally, the radiological effects of the source term are calculated in terms of doses. The use of the Source Term Margin makes it possible to decouple the dose calculations from the source term analysis. If the doses resulting from a release equal to the Source Term Reference Limit are lower than the Authorised Dose Limit, any change in the source term analysis does not force a recalculation of doses, provided that the new source term remains below the reference limit. The drawback of this approach is that it could make the application of the ALARA principles more difficult. In any case, the difference between the calculated dose and the Authorised Dose Limit is an additional margin, identified as the Dose Margin in figure 1.
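The way these scenario-specific margins stack can be made concrete with a small numeric sketch. All figures below are hypothetical, chosen only to illustrate the bookkeeping of Analytical, Licensing and Barrier Margins; they are not plant data.

```python
# Hypothetical peak-clad-temperature (PCT) figures, in K, purely illustrative.
safety_limit = 1477.0     # acceptance criterion (safety limit) for the safety variable
dbt_value = 1350.0        # extreme value calculated for the enveloping DBT
enveloped_value = 1250.0  # extreme value in a transient enveloped by the DBT
barrier_failure = 1600.0  # temperature at which the barrier would actually fail

analytical_margin = dbt_value - enveloped_value   # the DBT envelopes the real transients
licensing_margin = safety_limit - dbt_value       # DBT result vs. acceptance criterion
barrier_margin = barrier_failure - safety_limit   # safety limit vs. actual failure point

# The decoupling argument: as long as the DBT result stays below the safety
# limit, every enveloped transient automatically satisfies the criterion.
assert enveloped_value <= dbt_value <= safety_limit <= barrier_failure
print(analytical_margin, licensing_margin, barrier_margin)  # 100.0 127.0 123.0
```

Note that the three margins are in the same physical units only because a single safety variable is followed here; as the text discusses below, margins attached to different variables cannot simply be added.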

A Global Plant Margin is indicated in figure 1. However, this is only a qualitative concept. Note that each column of figure 1 corresponds to different physical magnitudes and, therefore, they cannot simply be summed up. In addition, we have concurrent margins that cannot easily be combined into a single margin measurement. It is clear that an adequate Global Plant Margin can only exist if all the partial margins exist. Moreover, the larger the partial margins, the larger the plant margin. However, a quantification of the Global Plant Margin is not possible.

3 THE NEED TO EXTEND THE CONCEPT OF SAFETY MARGIN

The difficulty of quantifying the Global Plant Margin resulting from the analysis of DBT is not the only limitation of this concept of plant margin. Worldwide experience in the operation of nuclear power plants soon showed that the exclusive use of the analysis of DBT to assess plant safety could be insufficient. Some events, especially the TMI accident, showed that more complicated scenarios, resulting from out-of-design sequences of events, needed to be addressed. The question of how to deal with so many possibilities made it inevitable to better evaluate their frequencies in order to weigh their relative importance. This gave rise to the incorporation of system reliability engineering techniques, as had been advocated by some precursor studies, like WASH-1400 in the USA or the Deutsche Risikostudie Kernkraftwerke in Germany. Among other important lessons learned from this experience was that operators and their actions were needed but not necessarily beneficial, so their impact should be taken into account.

Probabilistic Safety Assessment (PSA) techniques implement these new aspects of the safety analysis, but they have been applied only to the assessment of severe accidents and their consequences. Other types of accidents, more likely to occur but resulting in lower consequences, were left out of the scope of PSA. As a consequence, the rules on the use of PSA for licensing purposes, where they exist, often include a requirement to demonstrate that "enough safety margin is maintained". What is, then, enough safety margin? This question can be answered neither with the analysis of DBT alone nor with PSA alone. The need to answer it has given rise to some international initiatives. In particular, the NEA Committee on the Safety of Nuclear Installations (CSNI) promoted in 2003 an Action Plan on Safety Margins (SMAP 2007). A working group was established that developed a framework for integrated assessments of the changes to the overall safety of the plant as a result of simultaneous changes in plant operation or design.

The analysis framework proposed by the SMAP group consists of an adequate combination of traditional deterministic (analysis of DBT) and probabilistic (PSA) techniques. The use of event and fault trees, similar to those of PSA, applied to the assessment of any safety objective, is combined with simulation techniques, typical of DBT analyses, in order to quantify the exceedance frequencies of the limits that define the safety objectives being assessed. These exceedance frequencies are then used as risk indicators that characterize the overall plant safety.

The concept of safety margin proposed in this paper is an extension of the traditional concept of safety margins (as depicted in figure 1) and is especially suited for methodologies aimed, like SMAP, at answering the question of the sufficiency of safety margins.

4 DEFINITION OF SAFETY MARGIN

According to our previous discussion, the safety margin should be defined as a measure of the distance from a calculated safety output to some limit imposed on it. In general, both elements are calculated variables, which are uncertain magnitudes because they inherit the uncertainty of the inputs to the calculations and, moreover, they incorporate the uncertainty of the predictive models being used. We will adopt in this paper the classical representation of uncertain magnitudes as random variables.

Let us consider a transient or accident A in an industrial or technological facility, and a scalar safety output V calculated for A. V will be, in general, an extreme value (maximum or minimum) of a safety variable in a certain spatial domain during the transient. We will symbolize by V the random variable representing such a safety output as calculated with a realistic model M. Let us also suppose an upper safety limit L for V, and assume that L is time-independent. In this case, V will be the maximum value of a safety variable. This simple setting will be considered throughout the paper, the treatment for other settings (e.g. when L is a lower limit and V is a minimum during the transient) being completely analogous.

In general L can be considered a random variable as well, because it can have some uncertainty. For instance, L can be a damage threshold of some safety barrier, obtained directly from experimental measurements or from calculations with a model M'. The variable

D ≡ L − V    (1)

is sometimes called "safety margin" in the Reliability Engineering realm. But our intention is to reserve that denomination for a conservative function of D.
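The random-variable view of V, L and the difference D in (1) can be sketched by direct sampling. The normal distributions and all parameters below are purely illustrative assumptions, not taken from the paper:

```python
import random

random.seed(1)

# Purely illustrative: V (calculated safety output, the "load") and
# L (damage threshold, the "capacity") as independent normal variables.
N = 100_000
samples_V = [random.gauss(1100.0, 40.0) for _ in range(N)]
samples_L = [random.gauss(1250.0, 25.0) for _ in range(N)]

# D ≡ L − V, eq. (1): itself a random variable.
D = sorted(l - v for v, l in zip(samples_V, samples_L))

prob_positive = sum(d > 0 for d in D) / N   # PR{D > 0}: fraction of runs with V below L
d_low = D[int(0.05 * N)]                    # a low (5%) quantile of D: one possible
                                            # "conservative function of D"

print(prob_positive, round(d_low, 1))
```

The low quantile of D is one example of the conservative summaries the text alludes to; the probability PR{D > 0} is the quantity the paper develops next.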

When V < L, the acceptance criterion for V is fulfilled, and the result is successful or acceptable. On the contrary, when V > L, the limit is violated and the result is failed or unacceptable. When L has no uncertainty, the region V < L is the acceptance region for V and the complementary set V ≥ L is the rejection or forbidden region for V. Whenever L is uncertain such regions can still be defined, bearing in mind that they have random or uncertain boundaries. With this proviso, the safety margin can be defined as a measure of the distance from the safety output to the forbidden region V ≥ L. Uncertainties are to be considered in the safety margin definition.

The view of V and L as random variables is analogous to the load-capacity formulation in Classical Reliability (figure 2). V can be viewed as the load on a static component, whose capacity or resistance is L. The component fails whenever its capacity is overcome by the load.

A classical choice for the safety margin of V in A is:

SM(V; A) ≡ Max{k·Dγ, 0}    (2)

Dγ being the γ-quantile of the random variable D and γ a small number in the range [0,1]. This means choosing as safety margin a low quantile of the difference D, provided it is nonnegative. k is a constant chosen so that the margin is nondimensional.

The problem with a definition like (2) is that it does not provide an easy way of combining safety margins. Let us consider an initiating event IE and let A1, ..., AS be the accident sequences deriving from it (i.e. they are the combinations of IE with one or more concurrent failures, including single and coincident failures). The safety output V for IE is defined as the following random variable:

V = Vi with probability pi ≡ PR(Ai/IE)    (3)

where Vi is the output as calculated for the i-th sequence. Definitions (1) and (2) can then be applied to IE. But let us suppose that we know the safety margins for every sequence Ai, i = 1, ..., S, and the problem is posed of how to combine them in order to obtain the safety margin for the initiator IE. With a definition such as (2) the combination of safety margins is not straightforward. Therefore, a better definition of margin should be found.

A probabilistic definition of safety margin has been proposed (Martorell et al. 2005, Mendizábal et al. 2007). For our setting, we define:

PSM(V; A) ≡ PR{V < L/A}    (4)

or, from (1):

PSM(V; A) ≡ PR{D > 0/A}    (5)

namely, it is the probability of V being under its limit, conditioned to the occurrence of the accident A. The probabilistic safety margin (PSM) is the probability that the calculated V during A is in the safe region. Such a probability is implicitly conditioned to the use of the predictive model M.

A general definition of the PSM is:

PSM(V; A) ≡ PR{V ∈ RV/A}    (6)

where RV is the "acceptance region" of V, i.e. the region where V must stay as a safety requirement. The complement of RV is the forbidden region. As already stated, the boundaries of these regions (e.g. safety limits) can be uncertain.

Expression (4) can be written in terms of the probability distributions of both V and L. In the sequel we will suppose that both V and L are continuous random variables with probability density functions (pdf), and that they are independent, as most commonly occurs in safety analysis. Then

PR{V < L/A} = ∫_{−∞}^{+∞} fL(s) [∫_{−∞}^{s} fV(z; A) dz] ds    (7)

or, in a more compact form,

PR{V < L/A} = ∫_{−∞}^{+∞} fL(s) FV(s; A) ds    (8)

where fL is the pdf of L and fV, FV are respectively the pdf and cumulative distribution function (cdf) of V conditioned to the occurrence of A. This is a convolution integral, extended to the range of L.
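For independent normal V and L (an assumption made here purely for illustration; the parameters are invented), the convolution (8) can be evaluated numerically and checked against its well-known closed form for normals, Φ((μL − μV)/√(σV² + σL²)):

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pdf(x, mu, sigma):
    """Normal probability density function."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Illustrative parameters only: V ~ N(1100, 40), L ~ N(1250, 25).
mu_v, s_v = 1100.0, 40.0
mu_l, s_l = 1250.0, 25.0

# PSM = PR{V < L} via the convolution integral (8): ∫ f_L(s) F_V(s) ds,
# approximated by a midpoint rule over the effective support of L.
lo, hi, n = mu_l - 8.0 * s_l, mu_l + 8.0 * s_l, 20_000
h = (hi - lo) / n
psm_numeric = h * sum(
    pdf(lo + (i + 0.5) * h, mu_l, s_l) * phi((lo + (i + 0.5) * h - mu_v) / s_v)
    for i in range(n)
)

# Closed form for independent normals.
psm_exact = phi((mu_l - mu_v) / math.sqrt(s_v ** 2 + s_l ** 2))

print(psm_numeric, psm_exact)  # the two agree to several decimal places
```

In realistic safety analyses the distribution of V is not available in closed form, which is why the paper turns to sampling-based estimation of the PSM in section 5.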
Figure 2. Load-capacity formulation of safety margins.

The advantages of defining the safety margin as a probability are:

– PSMs are nondimensional, ranging in the interval [0,1]

– Its use can be extended to other situations, for instance to analytical safety margins
– PSMs can be generalized to multidimensional safety outputs and to combinations of accident sequences (e.g. initiating events)
– PSMs combine according to the laws of probability

The probability in (4) and (5) corresponds to the calculation uncertainty of V and L, sometimes called epistemic uncertainty, as opposed to the aleatory uncertainty arising from the unpredictability of the accidents (i.e. related to the occurrence of initiators, additional failures, some phenomena, etc.). In some sense, thus, the PSM for an accident sequence reflects the lack of knowledge about the behaviour of the safety variables.

The probability of V exceeding the limit is one minus the PSM:

1 − PSM(V; A) = PR{V ≥ L/A}    (9)

Let us now recall the initiator IE with the derived sequences A1, ..., AS. The definition (4) can be extended to IE:

PSM(V; IE) ≡ PR{V < L/IE}    (10)

which, according to the law of total probability, can be expressed as:

PSM(V; IE) = Σ_{j=1}^{S} pj PSM(V; Aj)    (11)

That is, the margin for the initiator is a weighted average of the margins for the sequences, the weights being the conditional probabilities. This is an example of how probabilistic margins combine. The same expression holds for the exceedance probabilities.

Now, let us suppose a safety barrier B having several failure modes, the i-th failure mode being typified by a safety output Vi with an upper safety limit Li, i = 1, ..., F. A safety margin can be assigned to the barrier, conditioned to the accident A:

PSM(B; A) ≡ PR{∩_{k=1}^{F} (Vk < Lk)/A}    (12)

which is the probability of no failure conditioned to A. It is a generalization of the PSM for a multidimensional safety output. Whenever the random variables Vi are independent, the probability in (12) factorizes and

PSM(B; A) = Π_{k=1}^{F} PR{Vk < Lk/A} = Π_{k=1}^{F} PSM(Vk; A)    (13)

This is another example of how probabilistic margins can merge into another margin.

Now let us focus again on the scalar safety output V and consider all the initiating events IEi, i = 1, ..., M that can start accidents challenging V. The frequency of V exceeding the limit L is:

ν(V > L) = Σ_{i=1}^{M} νi (1 − PSM(V; IEi))    (14)

where νi is the frequency of IEi. In (14) the frequencies of the initiators combine with the exceedance probabilities:

1 − PSM(V; IEi) = Σ_{j=1}^{S} pij (1 − PSM(V; Aij))    (15)

Aij being the j-th sequence evolving from the i-th initiator.

The epistemic uncertainty about V, represented by the PSM, merges in (14) with the aleatory uncertainty represented by the frequencies νi. We conclude that probabilistic safety margins can combine with initiator frequencies to produce exceedance frequencies, which constitute the plant risk.

5 CALCULATION OF PROBABILISTIC SAFETY MARGINS

If the probability distributions of V and L are known, the PSM is calculated as a convolution integral. But such a situation is rare. Very often, the calculation of V is difficult and time-consuming, so that the large random samples needed to confidently calculate the probability distribution are almost unthinkable.

The PSM can instead be calculated by means of Monte Carlo methods. Random values of V are obtained by randomly sampling the inputs to the model M and performing calculations for the accident A. The same procedure may yield random values for the limit L. Then, the PSM can be estimated by means of statistical methods. For instance, the same methods used in Reliability for the estimation of component failure-on-demand probabilities can be applied to the PSM; a survey of such methods can be found in (Atwood 2003). For an exposition of the estimation of PSM we refer to (Mendizábal et al. 2007, 2008). Strategies

of variance reduction (e.g. importance sampling) can be implemented in the Monte Carlo method.

6 PROBABILISTIC SAFETY MARGINS AND DETERMINISTIC SAFETY ASSESSMENT

The deterministic safety assessment (DSA) evaluates the response of a facility against initiating events using the so-called design basis transients (DBTs) (Pelayo & Mendizábal 2005). For each initiating event IE one or more DBTs are outlined, defined as enveloping or bounding events. This means that the consequences of a DBT are worse than those of the great majority of transients deriving from the initiating event. An obvious way of building up a DBT is by adding conservative assumptions to the initiator. Typically, some additional system failures are assumed (e.g. the single failure criterion).

Once the DBTs are established, their consequences are calculated and compared to the safety limits in order to prove that the probability of limit violation is low enough.

The deterministic methodologies can be classified as:

– Conservative, wherein the DBT consequences are calculated with conservative models and hypotheses
– Realistic or best-estimate, wherein the DBT consequences are calculated with realistic models and hypotheses, and their uncertainty is evaluated and included in the comparison with the safety limits. They are also called BEPU (best-estimate plus uncertainty) methodologies.

6.1 Conservative methodologies

In the frame described in section 4, let us consider a DBT calculated with a conservative methodology, producing a conservative value Vb of the safety output V. This means that the probability of V being less than Vb conditioned to IE is very close to 1. Notice that such a probability is formally a PSM of V with respect to Vb, and therefore has the sense of an analytical safety margin.

There is a quite interesting property of the probabilistic safety margin concerning the DBTs. The convolution integral (8) can be split around the constant value Vb:

PR{V < L/IE} = ∫_{−∞}^{Vb} fL(s) FV(s; IE) ds + ∫_{Vb}^{+∞} fL(s) FV(s; IE) ds    (16)

FV is an increasing function, and hence

∫_{Vb}^{+∞} fL(s) FV(s; IE) ds > FV(Vb; IE) ∫_{Vb}^{+∞} fL(s) ds = FV(Vb; IE) [1 − FL(Vb)]    (17)

Introducing (17) in (16) and expressing the cdfs as probabilities, the inequality

PR{V < L/IE} > PR{V ≤ Vb/IE} · PR{Vb < L/IE}    (18)

is derived, which can be expressed in terms of probabilistic margins:

PSM(V; IE) > PR{V ≤ Vb/IE} · PSM(V; DBT)    (19)

(19) says that the probabilistic safety margin for V is higher than the PSM for the DBT multiplied by the probability of V being lower than Vb, which can be called the analytical PSM for V. This is the product of the two shaded areas in Figure 3.

Figure 3. Lower limit to the probabilistic safety margin.

As we pointed out, the DBT is chosen so that the analytical PSM is close to 1. Then, (19) states that a sufficient condition for the PSM of V to be high enough is that the PSM for the DBT is close to 1. The inequality (19) contains the essence of the DSA: if a clearly conservative transient is outlined and it is shown to be clearly under the safety limit (high PSM), then it can be assured that the PSM for the initiating event is high as well.

Let us now recall the expression (14) for the exceedance frequency of the safety limit L. If an enveloping transient or DBT is set up for each initiating

384
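Inequality (19) is easy to check numerically. The sketch below is not from the paper: the Gaussian distributions for the safety output V and the limit L, and the bounding value Vb, are invented for illustration. It samples V and L independently and verifies that the PSM conditioned on the initiating event exceeds the product of the analytical PSM and the PSM of the DBT:

```python
import random

random.seed(42)

# Illustrative (assumed) model: V | IE ~ N(600, 40), L ~ N(800, 30),
# and a conservative bounding value Vb between the two.
N = 200_000
Vb = 700.0
V = [random.gauss(600.0, 40.0) for _ in range(N)]
L = [random.gauss(800.0, 30.0) for _ in range(N)]

psm_ie = sum(v < l for v, l in zip(V, L)) / N   # PR{V < L/IE} = PSM(V; IE)
analytical_psm = sum(v <= Vb for v in V) / N    # PR{V <= Vb/IE}, the analytical PSM
psm_dbt = sum(Vb < l for l in L) / N            # PR{Vb < L/IE} = PSM(V; DBT)

# Inequality (19): the margin for the IE dominates the product of the two.
assert psm_ie > analytical_psm * psm_dbt
print(psm_ie, analytical_psm * psm_dbt)
```

Here the analytical PSM is close to 1 by construction, so the product on the right-hand side is a tight, slightly conservative lower bound for PSM(V; IE).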
Let us now recall the expression (14) for the exceedance frequency of the safety limit L. If an enveloping transient or DBT is set up for each initiating event, the inequality (18) can be introduced in (14) to yield:

ν(V > L) < Σ_{k=1}^{M} νk [1 − PR(V ≤ Vbk) PR(Vbk < L)]    (20)

where Vbk is the value of V calculated for the k-th design basis transient. (20) trivially transforms into

ν(V > L) < Σ_{k=1}^{M} νk PR(V ≤ Vbk) PR(Vbk ≥ L) + Σ_{k=1}^{M} νk PR(V > Vbk)    (21)

(21) provides an upper bound to the exceedance frequency of the limit L. The first addend in the right hand side represents the contribution of the transients enveloped by the DBTs (those within the design basis), and the second one is the residual contribution stemming from the non-enveloped fraction (beyond design basis sequences). The right hand side of (21) gives the exceedance frequency of L supposing that the non-enveloped fraction gives rise to a limit violation (i.e. progress beyond design basis assumptions) and assigning the probabilistic margin of the DBTs to the enveloped fraction of transients. If the DBT is adequately chosen, the residual term can be neglected against the main one, and the maintenance of the safety margin is assured through the enveloping character and the safety margin of the DBTs. This is the basis of the deterministic design approach.

6.2 Realistic methodologies

In the realistic methodologies of DSA, the value Vb of the output V in the enveloping transient is no longer a constant value, but an uncertain variable, its uncertainty coming from the calculation process. But the cornerstone of the method is still valid: the high safety margin is assured through the definition of an enveloping transient with a safety margin high enough.

The DBT in realistic methodologies produces less conservative values of the safety outputs than those found in the conservative methodologies, because realistic models and hypotheses are used throughout and few pessimistic assumptions (e.g. the single failure criterion) are maintained. The uncertainties of the outputs are estimated instead.

The probabilistic definition of safety margins is implicit in the nuclear regulation which refers to realistic methodologies applied to DSA. In 1989, the USNRC issued Regulatory Guide 1.157 in order to give guidance to loss-of-coolant accident (LOCA) analyses performed with best-estimate models. This guide was the realistic counterpart to the overconservative guidance provided in 10 CFR 50.46 and Appendix K to 10 CFR 50, where the acceptance criteria for a LOCA analysis were spelled out in terms of several safety outputs (peak cladding temperature, local maximum oxidation of the cladding, core-wide oxidation) and the limits that they could not surpass. In RG 1.157, the requirement was that during a LOCA the limits are not to be violated, with a high probability. The ordinary deterministic criteria were transformed into probabilistic criteria, the probability being related to the uncertainty of the calculated safety outputs.

The BEPU methodologies, supplemented by statistical procedures such as those hinted at in section 5, can be used to estimate the PSM for DBTs. It is important to point out that the estimated PSM has statistical uncertainty, stemming from the finite size of the random samples. Therefore, an acceptance criterion for the PSM of the enveloping transient should read:

PR{PSM(V; DBT) > M0} ≥ 1 − α    (22)

that is, the margin must be higher than M0, a value close to 1, with a high statistical confidence (α is a low value, say 0.05). When the statistical sample is large enough (a possibility if the calculations with M are not time-consuming), (22) simplifies to:

PSM(V; DBT) > M0    (23)

7 THE ROLE OF PSM: PLANT MODIFICATIONS AND PSM AFFECTATION

So far we have defined probabilistic safety margins as building blocks of the NPP risk. It is on the regulatory side to decide the magnitude of the tolerable damage exceedance frequency as a function of the damage, making up the risk curve against which the design should be checked and a safety margin preserved.

It should be noted, see (14), that the exceedance frequency of a safety limit is not conditional on any particular event or sequence but cumulative over all of them. The design involves the characterisation of the initiating events covering all plant conditions and modes of operation (ANS 1983, IAEA 2001) and the delineation of the bounding cases, not necessarily only one case per IE. The core of the designers' job is the building of the bounding case and the methodology used to obtain the exceedance probability of the safety limit, and a description exceeds the scope of this paper.

The classification of the postulated events (within the design basis) is made according to the expected frequency of the initiating event or of the accident sequence.
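An acceptance criterion of the form (22) is often implemented in BEPU practice with Wilks' order-statistics formula, an implementation detail not spelled out in the paper: if all n independent best-estimate runs of the safety output stay below the limit, one can claim with confidence 1 − γ^n that at least a fraction γ of the output distribution lies below the limit, i.e. that PSM(V; DBT) ≥ γ. A minimal sketch with illustrative values:

```python
import math

# Wilks' first-order, one-sided tolerance formula: with all n runs below
# the safety limit, the confidence that PSM(V; DBT) >= gamma is 1 - gamma**n.
# The target values below are illustrative.
gamma = 0.95            # required margin M0, stated as a gamma-quantile
alpha = 0.05            # 1 - alpha = required statistical confidence

# smallest number of runs n satisfying gamma**n <= alpha
n = math.ceil(math.log(alpha) / math.log(gamma))
print(n)  # 59
```

With γ = 0.95 and α = 0.05 this reproduces the well-known 59-run sample size used in many "95/95" BEPU demonstrations.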
This classification approach should be made consistent with the above-mentioned assertion regarding the cumulative nature of the exceedance frequency criterion. No such classification is made within the realm of PSA, where design basis postulates break down.

In this section we will tackle the role of safety margins when modifications in a plant are being assessed. It is common practice to evaluate the impact on safety of plant modifications from a double perspective: i) deterministic (preservation of the design bases) and ii) probabilistic, estimating the impact of the change on exceedance frequencies of some safety limits.

Concerning i), the impact is customarily assessed against DBTs, verifying that specific limits are not violated. Current regulation requires that, for a plant modification to be acceptable, it should be verified that the frequency of initiating events is not affected and that enough safety margin is preserved for the DBTs. Both elements appear in the right hand side of (21). A thorough evaluation of the new plant conditions is required to confirm that the new plant operating conditions do not increase the expected frequency of initiating events or create new ones, that equipment needed to cope with events will perform according to design, and that the magnitude of the damage is kept low. Due to the limited number of events and the bounding approach used in the sequence delineation, PSM estimation is straightforward, although not an easy task, by means of the BEPU methodologies (Boyack 1989). The frequencies of accident sequences should be calculated by means of probabilistic techniques, including operational experience and precursory analyses. It is important to note that, in standard conservative methodologies, where the safety limits are given without uncertainty, the main term in (21) has only contributions from the DBTs producing limit violations.

As an example, following a power plant uprate, it has been observed that the operating conditions of plant equipment will become more demanding in terms of current intensity on the main plant transformers, implying that the frequency of load rejection events may notably increase. Similarly, the new enveloping case assuming a new power rate will result in a decrease of the PSM. Both effects tend to augment the upper bound in (21). As a result, a plant modification will require design changes so as to preserve the frequency of initiating events (modifications on the main transformer) and the preservation of the safety margin (limiting fuel peaking factors, new fuel designs, . . .). These results can also be achieved by reducing the epistemic uncertainty and the conservative bias of the simulation models used in the safety outputs calculation (hidden safety margins).

The evaluation of PSM affectation from the probabilistic perspective is nowadays subject to intense research (SMAP 2007). Several approaches are being explored. Following the standard PSA formalism, the possibility to make use of a limited number of bounding cases like those presented before breaks down, due to the fact that sequence delineation becomes more complex as the number of headers of the associated event tree increases. On the contrary, thermalhydraulic calculations within PSA do not make use of the BEPU methods, and uncertainty only stems from considerations linked to the failure rates of event tree headers (uncertainty on the frequency of the sequence).

Within Consejo de Seguridad Nuclear two convergent approaches are being studied. One of them rests on the concept of dynamic event trees, where sequence delineation is stochastic but conditioned by the past history being generated (Izquierdo & Cañamón 2008). An alternate short-term approach makes use of the BEPU methodologies and of current PSA results (Martorell et al. 2005). Such a method focuses on the analyses of standard "success sequences", where an aggregated PSM and frequency of exceedance are generated. Event tree failed sequences are assumed to have PSM = 0, and a contribution to the frequency of exceedance comes from the "success sequences" once BEPU methods are applied to the most relevant sequences. In both cases the incorporation of uncertainty stemming from the simulation (mainly thermohydraulic) has profound implications. For instance, the concept of cut set should be revised in the presence of such TH uncertainty, and interpreted as a collection of basic events whose simultaneous occurrence implies the violation of the safety limit with a high probability.

8 CONCLUSIONS

The role of safety margins in the safety analysis of nuclear plants has been examined through the present paper. The need for a general definition of safety margin is pointed out, and a probabilistic definition of safety margin is proposed. Probabilistic margins can combine with other probabilities and frequencies to make up the exceedance frequencies of safety limits. Both the deterministic and probabilistic approaches to safety analysis find easy explanation in terms of probabilistic safety margins, as does the safety assessment of plant modifications.

REFERENCES

ANS (American Nuclear Society) 1983. Nuclear Safety Criteria for the Design of Stationary Pressurized Water Reactor Plants. ANSI/ANS-51.1–1983.
Atwood, C.L. et al. 2003. Handbook of Parameter Estimation for Probabilistic Risk Assessment. Sandia National Laboratories—U.S. Nuclear Regulatory Commission. NUREG/CR-6823.
Boyack, B. et al. 1989. Quantifying Reactor Safety Margins. Prepared for U.S. Nuclear Regulatory Commission. NUREG/CR-5249.
IAEA. 2001. Safety Assessment and Verification for Nuclear Power Plants. Safety Guide. Safety Standards Series No. NS-G-1.2.
Izquierdo, J.M. & Cañamón, I. 2008. TSD, a SCAIS suitable variant of the SDTDP. Presented to ESREL 2008.
Martorell, S. et al. 2005. Estimating safety margins considering probabilistic and thermal-hydraulic uncertainties. IAEA Technical Meeting on the Use of Best Estimate Approach in Licensing with Evaluation of Uncertainties. Pisa (Italy), September 12–16, 2005.
Mendizábal, R. et al. 2007. Calculating safety margins for PSA sequences. IAEA Topical Meeting on Advanced Safety Assessment Methods for Nuclear Reactors. Korea Institute of Nuclear Safety, Daejon, October 30th to November 2nd 2007.
Mendizábal, R. et al. 2008. Probabilistic safety margins: definition and calculation. Presented to ESREL 2008.
Pelayo, F. & Mendizábal, R. 2005. El carácter determinista del análisis de accidentes en centrales nucleares. Seguridad Nuclear 35: 20–29.
SMAP (Task Group on Safety Margins Action Plan). 2007. Safety Margins Action Plan—Final Report. Nuclear Energy Agency, Committee on the Safety of Nuclear Installations, NEA/CSNI/R(2007)9.
Legislative dimensions of risk management
Accidents, risk analysis and safety management—Different perspective at a Swedish safety authority

O. Harrami
Swedish Rescue Services Agency, Karlstad, Sweden
Department of Fire Safety Engineering and Systems Safety, Lund University, Lund, Sweden

M. Strömgren
Swedish Rescue Services Agency, Karlstad, Sweden
Division of Public Health Sciences, Karlstad University, Karlstad, Sweden

U. Postgård & R. All
Swedish Rescue Services Agency, Karlstad, Sweden
ABSTRACT: This study gives an overview of approaches for risk management used within the Swedish
Rescue Services Agency (SRSA). The authority’s commission covers a broad spectrum of safety issues. Group
interviews were performed within different sectors of the organisation. The results show that several perspectives
on accidents and different understandings of safety terms exist within the SRSA. How the organisation uses risk
analyses and carries out risk evaluations differs among the sectors. The safety work includes various types
of accidents, injuries and incidents. The SRSA also uses a variety of strategies for safety based on tradition,
legislation and political direction. In such an extensive safety authority, it is not unproblematic to coordinate,
govern and regulate safety issues. Different safety paradigms and risk framings have created problems. But these
differences can also give opportunities to form new progressive strategies and methods for safety management.
1 INTRODUCTION

The Swedish Rescue Services Agency (SRSA) is a governmental authority that works with public safety, accident prevention and risk management. The authority is the result of the merging of a handful of former authorities and fields of activities. The SRSA has a central role in the management of public safety, which makes it interesting to study how the authority handles different kinds of safety issues and which principles govern the operation. Few studies have been performed on how Swedish national authorities take decisions and operate. Lundgren & Sundqvist (1996) studied how the Swedish Environmental Protection Agency takes decisions when handling permission and authorisation cases. Abrahamsson & Magnusson (2004) studied nine national authorities' work with risk and vulnerability analysis (the SRSA was one of the studied authorities).

The aim of this study is to give an overview of perspectives on, and approaches to, accident prevention and risk management used within the SRSA. The authority has around 800 employees and is a governmental safety organisation that handles a large variety of accidents and safety related subjects within different jurisdictions. The SRSA is active in many areas of expertise, for example fire safety, chemical safety, natural hazards and safety for children and the elderly. It covers four main legislations: the Civil Protection Act, the Transport of Dangerous Goods Act, the Law on Measures to Prevent and Limit the Consequences of Serious Chemical Accidents (Seveso Directive) and the Flammable and Explosive Goods Act. The authority also has commissions in education and international humanitarian relief and disaster operations.

In 2003 the deputy director-general of the SRSA commissioned the authors of this paper, who also were employed at the agency, to obtain an overview of the use of risk analysis within the SRSA, and to investigate the need for coordination and education. It was stated that knowledge of risk analysis methods should be regarded as a strategic area for the authority. At the same time, expertise on risk analysis was split into several parts of the authority and the coordination among the experts was poor. Some other circumstances that have contributed to this study being initiated are: a general increase in the use
of risk analysis, a development of competences and methodologies regarding risk management, an extension of the authority's commission to new risk areas and a recent reorganisation.

The commission was the start of work that in time expanded into a more comprehensive study, which also included subjects such as: the use of safety management terms, perspectives on accidents, strategies and processes for risk management, actors, tools for analysis of accident risks (e.g. risk analysis), risk evaluation and safety regulations. The study is described in depth in a Swedish report (All et al. 2006).

The objective of this paper is to describe and discuss some themes and underlying principles for risk management in a multifaceted national safety authority.

2 METHOD

The aspects that this paper studies are how the Swedish Rescue Services Agency works with safety, risk analysis, risk assessment and risk evaluation. Structured group interviews were performed with 12 different sectors within the organization. The reason why interviews were preferred to questionnaires was that many of the questions were difficult to describe with clear and simple questions. More correct interpretations were also facilitated, since different parts of the organization use terms and definitions in different ways.

In order to structure the application sectors, several logics and methods to divide the organisation were considered. The final division was based on a combination of legislations used, traditions, branches and contexts. The studied sectors are those working with:

1. Fire protection in buildings
2. Transportation of dangerous goods
3. Flammables
4. Explosives
5. Land use planning
6. Natural disasters
7. Environmental and risk appraisal for industry, including Seveso establishments
8. Emergency response support
9. Safety management in municipalities
10. National injury prevention program
11. Emergency preparedness for chemical, biological, radiological and nuclear (CBRN) incidents
12. Supervision, in four different jurisdictions

From each sector, two or three key persons were selected for the group interview. The selection of key persons was made in cooperation between the analysis group and the department director of each sector. The key persons should have had long experience of working within that sector, and be aware of the traditions and views prevalent within the sector.

The interviews were performed from October 2004 through March 2005. Each group interview took approximately three hours to perform. The same structured question guide was used for all group interviews. The question guide consisted of open questions and was structured around different themes such as: the phenomenon of accident and injury, actors, safety management, common terms, safety legislation and the use of risk analysis.

After the interviews, the preliminary results were presented to the studied sectors, both as a report and at a workshop. The feedback received was included in the final results.

In the analyses, comparisons were made of terms, strategies, and how risk analysis and risk evaluation were performed.

3 RESULTS AND ANALYSIS

3.1 Safety related terms

How central safety terms are used and understood within the different sectors has been studied.

Many sectors explain accidents with words like "suddenly", "undesired", "unintentional", "negative event", "event leading to injuries and damages", etc. Several sectors express that an accident is an event that will initiate a rescue operation. Some sectors only include unintentional events in the term accident, while others also regard intentional events, like arson, as accidents. For some sectors injuries are emphasized, while others focus on damage to property or the environment. There are also examples of sectors where definitions of the term accident are influenced by, or stated in, different legislations.

Risk and safety are very central terms for the SRSA, but still they seem to be hard to explain and define. These terms are often used as technical terms but also in a more general sense, as in "we have to reduce the risks" and "for a safer society". The sector Environmental and risk appraisal for industry refers to official documents, e.g. the Seveso II directive and national guidelines, when explaining the term risk. These are examples of technical definitions of risk. Almost all of the studied sectors described that risk, in one way or another, consists of two parameters: probability and consequences. There are, however, variations in how the sectors emphasise these two parameters. Three divergent groups of explanations of the term risk were found:

1. Probability (P)
2. Consequence (C)
3. A combination of probability and consequence
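The way the two parameters are combined is consequential. A toy sketch (the scores below are invented for illustration, not taken from the study) shows that combining identical probability and consequence scores by product or by sum can rank two hazards in opposite orders:

```python
# Two hypothetical hazards on simple ordinal scales:
# A is moderately likely with moderate consequences, B is unlikely but severe.
hazards = {"A": (3, 3), "B": (1, 6)}  # (probability score P, consequence score C)

by_product = sorted(hazards, key=lambda k: hazards[k][0] * hazards[k][1], reverse=True)
by_sum = sorted(hazards, key=lambda k: hazards[k][0] + hazards[k][1], reverse=True)

print(by_product)  # ['A', 'B']  (9 vs 6)
print(by_sum)      # ['B', 'A']  (7 vs 6)
```

The product convention (an expected-loss flavour) favours the frequent, moderate hazard, while the additive convention favours the rare, severe one; which ranking is appropriate depends on the sector's view of risk.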
In addition, there was a difference in what way the two parameters should be combined in group 3. Three types of mathematical functions were proposed: a sum of P and C, a product of P and C, and finally other relations between P and C. Another interesting finding is that many of the interviewed sectors did not feel comfortable when using the term risk.

If risk is a difficult word to explain, the term safety seems even more challenging to grasp. Many of the answers were quite vague and general. Only one sector, the National injury prevention program, gave a more academic and detailed explanation of the term safety by defining it as: ". . . a circumstance characterised by an adequate control of physical, material and moral threats, which contribute to the perception of being protected from hazards. Safety thus cannot be defined in an absolute meaning. Safety is a dynamic status. Safety is not only lack of injuries and hazards. Safety cannot be restricted to injury prevention." Many of the other sectors gave explanations like "safety is the opposite of risks, accidents or injuries". From the broad spectrum of received explanations of the term safety, four types of meanings were identified:

1. Control of risk
   a. Systems and organization to prevent accidents
   b. Measures that are taken to prevent accidents
2. Absence or low levels of
   a. Accidents
   b. Injuries and damage
3. Risk-free or without any risk
4. Feeling of safety and security

Meanings 1 and 2 are in some respect related to objective measures, quantifications and observations. Meanings 3 and 4 are more abstract and subjective than the first two.

The concepts of risk management and safety management (safety work) are both quite common in Sweden. The concepts have similar meanings both in a practical and a theoretical sense. How the sectors apprehend these two concepts has been studied. The answers show that most of the sectors use either the term risk management or the term safety management in everyday work. Although all sectors work thoroughly with risk and safety issues, many of them had difficulties giving a comprehensive explanation of these two concepts. Common explanations of the two concepts are: a process, steps, systematic activities and measures taken in order to improve safety. Risk management is sometimes perceived to be more systematic and more scientific than safety management. A more in-depth analysis of the answers and actual practices reveals that only a minority of the sectors sees any real differences between the two concepts. A couple of sectors indicated in their answers that they saw the two concepts as dissimilar.

3.2 Structure and coordination of the work with public safety

This section starts out with an analysis of the accidents that the sectors work with. It continues with a presentation of analyses made from four theories and models, in order to illustrate different views on, and structures of, operations at the SRSA. The analyses deal with questions about the constitution and definition of an accident, strategies to control safety and strategies for countermeasures.

An analysis was made of some accidents and types of events that the sectors work with (Tab. 1). The analysis illustrates the complexity of operations within the SRSA. Two sectors, Emergency response support and Safety management in municipalities, work with almost all of the listed events. On the other hand, there are some sectors that only handle a few types of events. Some types of events are handled only by a few of the sectors, while other events, e.g. fire and contact with hazardous substances, are handled by almost all studied sectors. When several sectors work with the same type of event, they deal with different aspects of the event. These aspects could for example depend on where the event takes place, whether certain substances are involved in the event, which legislations govern the event, the event's origin, etc. To exemplify this, the event type fire, which is handled by several sectors from different aspects, is analysed. The sector Fire protection in buildings is concerned with this event if it takes place in buildings. The sectors Dangerous goods, Flammables, Explosives, Environmental and risk appraisal for industry and Emergency preparedness for CBRN incidents are concerned with the event if the fire includes or affects hazardous materials. The sector Dangerous goods is also concerned with fires from a transportation point of view. The sector Land use planning is concerned with this event both from a "hazardous materials" and a "fires in buildings" point of view. The sector Natural disasters is interested in fire if it takes place in forests or other vegetation. The sector Emergency response support is concerned with the event from a fire extinguishing point of view. The sector Safety management in municipalities has a more comprehensive approach to the event type fire.

An analysis was made of what constitutes and defines accidents. There is a varied literature on the nature of accidents (Andersson 1991, Campbell 1997, Hollnagel 2004, Kjellén 2000). Despite all complex accident models, the results in this study have been analysed against a simplistic accident model, which states that an accident is constituted by a combination of an event and an injury/damage. It was found that the interviewed sectors focused on the two parts to a
Table 1. The table shows which types of event the different sectors work with. Note that sector 12 (Supervision) has been excluded in this table. Rows list sectors 1–11; columns list the types of events: fall; cuts; fire; explosion; collision, crash; drowning; suffocation; trapped in; arcing and electric shock; exposure to heat; exposure to cold; contact with hazardous substances; accidental release of substances; mechanical & construction collapse; lightning; storm; flooding; landslide. Legend: • = primary focus, ◦ = semi-focus.
varying extent, and hence a supplement to the model was made. The supplement displays the sectors' views on accidents based on the main focus, which could be either:

– Event-based
– Damage-based
– Event- and damage-based

Accidents that are event-based will be regarded as accidents independent of the magnitude and extent of the damages. A typical event-based accident is an unintentional explosion. Damage-based accidents, on the other hand, have the magnitude and extent of the damage as a starting-point for defining an accident. A typical damage-based accident is a natural disaster. Regarding natural disasters, Carr argued that a (natural) disaster is defined by human beings and not by nature, i.e. it should be defined and understood by its damage. The event itself is not an accident or a disaster: "not every windstorm, earth-tremor, or rush of water is a catastrophe" (Carr 1932 in Furedi 2007). For the last group (event- and damage-based), no clear emphasis is seen on the event or the damage; both are required for defining it as an accident. Most of the sectors have an event-based view on accidents, i.e. Fire protection in buildings, Dangerous goods, Flammables, Explosives and Emergency response support. The sectors Natural disasters and National injury prevention program have a damage-based view on accidents. The sectors Environmental and risk appraisal for industry and Safety management in municipalities have an event- and damage-based view. The remaining sectors used more than one of the views on accidents.

Next, the operation of the selected sectors was analysed and categorised in relation to Rasmussen's three types of control strategies: empirical safety control, evolutionary safety control and analytical safety control (Rasmussen 1994). These strategies describe the control of safety from the perspective of available information and suitable analysis method. The findings show that the sectors Emergency response support, National injury prevention program and Safety management in municipalities have their focal point in the empirical control strategy. Dangerous goods, Explosives and Flammables have their focal point in the evolutionary control strategy. The sectors Land use planning, Natural disasters and Emergency preparedness for CBRN incidents primarily utilize the analytical control strategy. Two of the sectors utilize two strategies. Fire protection in buildings uses both empirical and evolutionary safety control, and
Environmental and risk appraisal for industry uses both evolutional and analytical safety control.

The sectors' operations were also analysed using a model commonly used at the SRSA that describes the safety management process in five stages: 1) Prevent accidents, 2) Take mitigation actions before accidents occur, 3) Prepare rescue operations, 4) Carry out rescue operations, and 5) Take actions after rescue operations. These five strategies to control risk are divided up both from the perspective of time and by type of countermeasure. The first three stages are measures taken before an accident, while stages 4 and 5 are measures taken during and after an accident, respectively. Note that the model has some resemblance to the generic model the Federal Emergency Management Agency (FEMA) uses for disaster and emergency management, which has four stages: Mitigation, Preparedness, Response, and Recovery (FEMA 1996).

The five-stage model above has been used to determine where in the process the different sectors focus their work. Most of the interviewed sectors work with all five stages. Nine of the twelve selected sectors have their main focus in stages 1 and 2. The sectors Emergency response support and Emergency preparedness for CBRN incidents have their primary focus on stages 3, 4 and 5. Only the sector Safety management in municipalities has the same focus on all stages.

Finally, an analysis was made of the different sectors' utilization of Haddon's ten strategies for countermeasures. A revised version of the ten strategies, based on Folkhälsoinstitutet (1996), Gunnarsson (1978) and Haddon (1970), has been used in this analysis:

1. Eliminate the risk
2. Separate from the risk
3. Insulate the risk
4. Modify the risk
5. Equip to handle the risk
6. Train and instruct to handle the risk
7. Warn about the risk
8. Supervise the risk
9. Rescue if accident happens
10. Mitigate and restore

The twelve sectors utilized, in some way, most of the strategies for countermeasures presented above. The dominant strategies are numbers 3, 4, 5 and 6. Strategies 3 and 4 focus on actions that reform and reduce the hazard (risk source), while strategies 5 and 6 focus on preparing actions that can avoid and manage incidents.

3.3 Risk analysis and other similar tools

In this section the term risk analysis is used as a generic term for all types of investigations, examinations and analyses that are concerned with risk and safety issues. This could be e.g. regulation impact assessments, cost-effect analyses, risk inventories, risk evaluations, risk assessments, incidents and accident outcomes, environmental impact assessments, quantitative risk analysis (QRA), flood inundation mapping and fire spreading analysis.

A description has been made of how risk analyses are used in different sectors of the authority. The results show that risk analyses focus on very different parts of the socio-technical system (Rasmussen 1997). The main part of the SRSA's work with risk analyses, that is, reviewing conducted risk analyses and issuing directions on how to perform risk analyses, is primarily focused on a narrow part of the socio-technical system. These analyses are delimited to an operation (site-specific), an organisation or a geographic sector (often local). The SRSA (i.e. the interviewed sectors) also conducts its own safety investigations and risk analyses. Many of these analyses focus on a comprehensive part of the socio-technical system, e.g. from a national or regional point of view. Some analyses were also made from a narrow socio-technical system point of view, e.g. in connection with authorisations for an operation. An example of a risk analysis made from a comprehensive socio-technical point of view is the risk analysis of the "system" transportation of dangerous goods in Sweden or Europe. Other similar examples are the analysis of expected effects on fire safety and societal costs, conducted in connection with introducing guidelines on compulsory smoke detectors in residences (Sweden), and the ongoing investigations about possible requirements on self-extinguishing cigarettes (Sweden and Europe).

The purpose and aim of the risk analysis was discussed with the sectors. The results show that these differ between sectors and are unclear in some cases. The results also show that there is, on the one hand, an official purpose of a risk analysis and, on the other hand, a more pragmatic (non-stated) purpose. While the purpose and aim of risk analysis is only briefly and often vaguely described in Swedish regulation, the regulations and guidelines often state that the purpose and aim of a risk analysis have to be made clear early in the process. This transfer of the issue from a central to a local and operation-specific context makes it possible to direct the shaping of the risk analysis in the assignment or in the actual risk analysis. This is unfortunately not done to any great extent. The vague specification of the purpose in the regulation also creates insecurity among the stakeholders and supervising authorities about how the risk analyses should be conducted and reviewed. The most common purposes or aims found in the actual risk analyses usually quote what is stated in the regulation, or establish that the purpose is to fulfil the requirement in
certain regulation. Fulfilling requirements from insurance companies or from the organisation itself has also been stated. Another common category of purpose is that the risk analyses shall constitute some kind of basis for decisions for different stakeholders and public authorities. Yet another common category is the purpose to verify or show something. Examples of this are to verify that the regulation is fulfilled, that the safety level is adequate or that the risk level is acceptable. Other examples are to show that it is safe, that more actions do not have to be taken, that the risk is assessed in a proper manner or that a certain capacity or skill is obtained. A less common category of purpose is to display a risk overview or a comprehensive risk picture. An outline for a generic structuring of the different purposes of a risk analysis, based on the received answers, could hence be:

1. Formal requirements
   a. Legal based
   b. Non-legal based
2. Basis for decision-making
   a. Public decisions
   b. Non-public decisions
3. Verification
   a. Design of a detail, a part of a system or a system
   b. Risk level
4. Risk overview/comprehensive risk picture

3.4 Evaluation of risk and safety issues

An analysis was made of how the evaluation of risk and safety is conceived within different sectors of the SRSA. The results show that even though the sectors daily take decisions and standpoints that incorporate evaluation of risk, some of them do not consider themselves to be doing evaluations of risk.

Many of the sectors had difficulties describing how they assessed adequate or satisfactory safety. Ten clusters of answers were identified in the answers given by the sectors to the question "When does the SRSA know that 'adequate' safety is achieved?" A couple of these clustered answers were: when no accidents, injuries or deaths occur; it can be shown that a product or an operation does not have a higher risk level than other similar products or operations; it has been shown that the quantitative risk level is lower than established criteria; it has been shown that certain required equipment or functions exist; a decision becomes a legal case, goes to trial and becomes a precedent; all the parties are pleased and nobody complains about the safety level or actions taken. A couple of the sectors answered that it is not possible to determine if adequate safety is achieved. Also, one sector thought that adequate safety was not primarily a question of achieving a certain safety level or safety goal; instead the focus should be on the characteristics of the safety work.

The principles and directions used for evaluation of risk in the authority's operation were surveyed. Fifteen types of principles and directions were referred to by the sectors as starting points when doing risk evaluation. Some examples are: risk comparison, economic considerations, a zero risk target (vision zero), experience and tradition (practice), a lowest acceptable or tolerable risk level, national goals, the principle of avoiding catastrophes, the principle that a third party should not be affected by the accidents of others, and balanced consideration and compromises between competing interests. Some of the sectors did not have a clear picture of how the principles were utilized in real cases or situations. Also, in a couple of cases the sectors did not know which principles governed the evaluation of risk within their domain.

Based on some of the results presented above, an outline for a classification of risk criteria was made. The intention of the outline is to display some different aspects that the criteria, and hence also the evaluations, focus on. The outline has five main classes of criteria: 1) Safety actions and system design, 2) Rights-based criteria, 3) Utility-based criteria, 4) Comparisons, 5) Comprehensive assessments. More about the outline for classification of risk, and other results on how the SRSA conceives and does evaluation, is found in Harrami et al. (in press).

4 DISCUSSION

The extent of this study forces the presentation and analysis of the results in this paper to be quite general. Four experts employed by the authority have carried out the study. This may have influenced the results, even if the authors have tried to be balanced and objective.

The use and understanding of several central safety terms sometimes seem considerably divergent within the authority. One potential problem identified is that the ability to collaborate, both with other organisations and within the authority, may be negatively affected. Poor internal co-ordination could lead to unnecessary fractionising and bureaucratisation of the work with safety. Possible ways to tackle these problems could be to discuss terminology issues continuously and to use terms in a more precise and distinct manner, i.e. to define and explain the terms used better.

The authority uses a wide range of different theories, models and strategies for safety work. However, a development towards fewer standardised strategies is not advocated. Instead it is important to increase
the understanding of different theories, models and strategies, and the implications they might have for the authority's work. If used correctly, the diversity of theories, models and strategies for safety work could strengthen the authority and contribute to developing new ways to work with risk management.

The use of and knowledge about risk analyses have always been considered crucial for the SRSA. Even so, most sectors had difficulties describing and specifying in what way risk analyses affect safety and the significance of risk analysis in safety work. The authors had expected more comprehensive and profound discussions on the relevance, use, utilization and content of the risk analyses, and on how the reviewing and evaluation of the results was done. If risk analyses shall continue to be regarded as an important and strategic subject for the organisation in the future, there is a need to specify what function and importance risk analyses should have in the risk management regulated by the SRSA. There is also a need for the sectors to increase their competence as well as their focus on these issues.

The different sectors had difficulties expressing and describing how they assess adequate, sufficient, acceptable or tolerable safety. They were also uncertain about which values, directions and principles governed their operation and how these should be used. These insufficiencies, regarding knowledge of a general and non-technical nature, may be an obstacle to good transparency, the understanding of decisions, the communication of decisions and standpoints, and support to decision-makers. These findings are most probably the result of deficient reflection and discussion about risk evaluation issues within the authority. In order to develop a more robust assessment, the SRSA needs to develop a more substantial foundation for the evaluation of risk and safety issues.

Harms-Ringdahl & Ohlsson (1995) carried out a study among eleven Swedish safety authorities. They found that there were major differences between the authorities according to their areas of responsibility, traditions and operating conditions, manifested by differences in terminology, views on accidents and how they should be prevented. The results in this paper show that major differences also exist within a single authority.

The SRSA is the result of the merging of several operations of a handful of authorities and fields of activity. This has resulted in a governmental authority with a widespread operation that holds very different approaches, theories, strategies and practices. The width and diversity of the operation can be explained to a large extent by different safety paradigms based on tradition, legislation, sector boundaries and political directions. The diverse approaches, theories and models focus on different aspects of safety and risk, and will result in different analyses and explanations of causes, effects, relations, correlations and solutions. Hence, which theory and model is used probably affects both the organisation (the structuring of the sectors) and the operation of the authority. The coordination of the work, e.g. the delimitation of which aspects of an accident/event different sectors shall handle, also depends on which fundamental model and theory is applied. Therefore the choice and use of approaches, theories, strategies and practices must be conscious and deliberate. According to Hovden (2004), for over 100 years there have been "muddling through" processes to form the safety and rescue institutions and regulatory regimes, and this has resulted in an over-complex "jungle" of safety institutions. This is in some respects the case for the SRSA. The "muddling through" process within the SRSA has been ongoing for over two decades and has resulted in an authority with a very scattered commission and operation. On the other hand, there are also advantages with an integrated management of issues concerning safety, risk, accidents, disasters, injuries, damage and rescue operations. An authority that handles accidents in a more comprehensive way can co-ordinate the work better.

Finally, the authors also believe that many of the issues discussed above will become even more significant in the future, especially since a new national authority for safety, security and crisis will be established in January 2009. This authority will take over the tasks of the SRSA, the Swedish Emergency Management Agency and the National Board of Psychological Defence.

5 CONCLUSIONS

The divergence in theories, practice and use of terms within the SRSA, together with a lack of understanding, complicates co-operation and coordination within the SRSA and with other actors within the field of safety. These differences can also give opportunities for advancements in safety management. To achieve this, more profound discussions and reflections on how to perform safety management in such a diversified safety agency are necessary. Subjects that ought to be discussed are, for example:

– The basis on which events, accidents, injuries, crises and catastrophes should be included in the work at the authority
– Which risk management strategies should be used
– The importance of risk analysis and how it should be utilised in the safety work
– Standpoints on fundamental ethics, values and principles for evaluating risk.

Several of the issues that the authority handles are complex and do not have any simple and obvious
answers. Therefore these issues need to be handled with humility and caution. It is also important that the employees have understanding of and respect for how other sectors of the authority work, and that they question and critically examine their own work with safety. Co-operation within the organisation and with other authorities would be facilitated if the authority developed a strategy or a policy on how to handle central terms.

6 FUTURE WORK

The SRSA has initiated several activities based on the study, e.g. a project that aims to develop methods and strategies for municipal risk assessment and a development study regarding risk evaluation. It has also initiated a project group that co-ordinates risk communication and central terms and definitions.

ACKNOWLEDGEMENTS

We would like to thank the officials participating in the interviews for their openness, and the Swedish Rescue Services Agency for making it possible to conduct this study. We would also like to thank Prof. Lars Harms-Ringdahl and an anonymous referee for constructive comments on the conference paper.

REFERENCES

Abrahamsson, M. & Magnusson, S.-E. 2004. Användning av risk- och sårbarhetsanalyser i samhällets krishantering—delar av en bakgrundsstudie (in Swedish). LUCRAM report 1007. Lund: Lund University Centre for Risk Analysis and Management.
All, R., Harrami, O., Postgård, U. & Strömgren, M. 2006. Olyckor, riskanalyser och säkerhetsarbete—några olika perspektiv inom Räddningsverket (in Swedish). Report P21-480/07. Karlstad: The Swedish Rescue Services Agency.
Andersson, R. 1991. The Role of Accidentology in Occupational Injury Research. Ph.D. thesis. Arbete och hälsa, vetenskaplig skriftserie 1991:17. Stockholm: Karolinska Institutet.
Campbell, R. 1997. Philosophy and the accident. In R. Cooter & B. Luckin (eds), Accidents in History: Injuries, Fatalities and Social Relations. Amsterdam: Rodopi.
Carr, L.J. 1932. Disaster and the Sequence-Pattern Concept of Social Change. American Journal of Sociology 38(2): 207–218.
FEMA 1996. Guide for All-Hazard Emergency Operations Planning. State and Local Guide (SLG) 101. Washington: Federal Emergency Management Agency.
Folkhälsoinstitutet 1996. På väg mot ett skadefritt Sverige (in Swedish). Report 1996:117. Stockholm: Swedish National Institute of Public Health.
Furedi, F. 2007. The changing meaning of disaster. Area 39(4): 482–489.
Gunnarsson, S.O. 1978. Strategies for accident prevention. In R. Berfenstam, L.H. Gustavsson & O. Petersson (eds), Prevention of accidents in childhood: A symposium in the series of congresses and conferences celebrating the 500th anniversary of Uppsala University, held at the Department of Social Medicine, University Hospital, October 5–7, 1977, Uppsala. Uppsala: Uppsala University Hospital.
Haddon, W.J. 1970. On the escape of tigers: an ecologic note. American Journal of Public Health (December): 2229–2234.
Harms-Ringdahl, L. & Ohlsson, K. 1995. Approaches to accident prevention: A comparative study of eleven Swedish authorities. Safety Science 21(1): 51–63.
Harrami, O., Postgård, U. & Strömgren, M. (in press). Evaluation of risk and safety issues at the Swedish Rescue Services Agency. In Proceedings of ESREL 2008 and 17th SRA Europe Conference—Annual Risk, Safety and Reliability Conference, Valencia, Spain, 22–25 September 2008. Rotterdam: Balkema.
Hollnagel, E. 2004. Barriers and accident prevention—or how to improve safety by understanding the nature of accidents rather than finding their causes. Burlington: Ashgate.
Hovden, J. 2004. Public policy and administration in a vulnerable society: regulatory reforms initiated by a Norwegian commission. Journal of Risk Research 7(6): 629–641.
Kjellén, U. 2000. Prevention of Accidents Through Experience Feedback. London: Taylor & Francis.
Lundgren, L.J. & Sundqvist, G. 1996. Varifrån får miljövårdsbyråkraterna sin kunskap? In Lars J. Lundgren (ed.), Att veta och att göra—Om kunskap och handling inom miljövården (in Swedish): 129–171. Lund: Naturvårdsverket Förlag.
Rasmussen, J. 1994. Risk management, adaptation, and design for safety. In B. Brehmer & N.-E. Sahlin (eds), Future risks and risk management. Dordrecht: Kluwer Academic Publishers.
Rasmussen, J. 1997. Risk management in a dynamic society: A modelling problem. Safety Science 27(2–3): 183–213.


Evaluation of risk and safety issues at the Swedish Rescue Services Agency

O. Harrami
Swedish Rescue Services Agency, Karlstad, Sweden
Department of Fire Safety Engineering and Systems Safety, Lund University, Lund, Sweden

U. Postgård
Swedish Rescue Services Agency, Karlstad, Sweden

M. Strömgren
Swedish Rescue Services Agency, Karlstad, Sweden
Division of Public Health Sciences, Karlstad University, Karlstad, Sweden

ABSTRACT: This study investigates how evaluation of risk and safety is conceived and managed in different parts of the Swedish Rescue Services Agency (SRSA). Group interviews were performed within twelve different sectors of the organisation. The results show that some of the representatives do not consider themselves to be doing evaluations of risk, even though they take daily decisions and standpoints that incorporate evaluation of risk. In most sectors, profound reflection and discussion about these issues had only been carried out to a very limited extent. The different sectors had great difficulty expressing or describing how to assess adequate, sufficient, acceptable or tolerable safety. The SRSA needs to develop a more substantiated foundation for the evaluation of risk and safety issues in order to achieve better internal and external understanding of decisions, a more transparent process, easier and clearer communication of decisions and standpoints, and better support to decision-makers.

1 INTRODUCTION

Most national authorities have an important role in interpreting legislation in order to support political and legislative decisions, issue regulations and guidelines, and balance conflicting interests in decisions. At the same time, Reid (1999) has suggested that "Conflict arising from conflicting evaluations of given information generally cannot be resolved" and that "This sort of conflict commonly arises in relation to the socio-political context of risk, and it involves competing interests and incompatible value systems". So how does a national safety authority handle these difficult issues?

Lundgren & Sundqvist (1996) is one of the few studies that have investigated how Swedish national authorities operate. The study describes how the Swedish Environmental Protection Agency takes decisions when handling permission and authorisation cases. In 2002 the Swedish government appointed a commission on public safety in traffic tunnels, which, amongst other things, investigated the view on risk evaluation among the concerned national authorities, i.e. the Swedish Road Administration, the Swedish Rail Administration, the National Board of Housing, Building and Planning, and the Swedish Rescue Services Agency (SRSA). One of the conclusions of the investigation was that while the competent authorities agreed on the "risk picture", they did not agree on how the risk evaluation should be made and hence how risk treatment (safety actions and safety design) should be assessed for new road and rail tunnels in Sweden (Boverket 2005). This is an example of what Reid (1999) describes as "conflicting evaluations of given information (e.g. conflicting views of the acceptability of estimated risks)". The two other types of conflict situation described by Reid, "different information" and "conflicting information", are more connected to the limitation and estimation of the risk. The authors' earlier experience from risk evaluation in various fields (such as land use planning, large-scale hazardous materials, dangerous goods, natural hazards, infrastructure risks and fire protection) is that the stakeholders can have diverging views on early steps of the risk management process, e.g. risk identification or risk analysis. However, situations where the
stakeholders agree on the "risk picture" but disagree on how to evaluate and assess the risk are even more common.

The following study of how evaluations of risk and safety issues are done within the authority was made as part of a larger investigation of how the SRSA works with accidents, risk analysis and safety assessment. More findings from the study, and a more comprehensive description of how the study was initiated, are found in All et al. (2006) and Harrami et al. (in press).

The SRSA is a governmental organisation working with miscellaneous types of accidents, safety work and risk management in many arenas. The authority has around 800 employees and is active in many areas of expertise, and its legal competence covers four main legislations: the Civil Protection Act, the Transport of Dangerous Goods Act, the Law on Measures to Prevent and Limit the Consequences of Serious Chemical Accidents (Seveso Directive) and the Flammable and Explosive Goods Act. The authority also has commissions in education and in international humanitarian relief and disaster operations.

The objective of this paper is to describe and discuss how evaluation of risk and safety issues is carried out at the SRSA, and what the basis for the evaluation is.

2 METHOD

This paper studies how the Swedish Rescue Services Agency works with risk evaluation in its operation. Structured group interviews were performed with 12 different sectors within the organization. The reason why interviews were preferred to questionnaires was that many of the issues were difficult to describe with clear and simple questions. More correct interpretations were also facilitated, since different parts of the organization use terms and definitions in different ways.

In order to structure the application sectors, several logics and methods to divide the organisation were considered. The final division was based on a combination of legislations used, traditions, branches and contexts. The studied sectors are those working with:

1. Fire protection in buildings
2. Transportation of dangerous goods
3. Flammables
4. Explosives (including LPG)
5. Land use planning
6. Natural disasters
7. Environmental and risk appraisal for industry, including Seveso establishments
8. Emergency response support
9. Safety management in municipalities
10. National injury prevention program
11. Emergency preparedness for chemical, biological, radiological and nuclear (CBRN) incidents
12. Supervision in four different jurisdictions

From each sector two or three key persons were selected for the group interview. The selection of key persons was made in cooperation between the analysis group and the department director of each sector. The key persons should have had long experience of working within that sector, and be aware of the traditions and views prevalent within the sector.

The interviews were performed from October 2004 through March 2005. Each group interview took approximately three hours to perform. One structured question guide was used for all group interviews. The question guide consisted of open questions and was structured around different themes such as: the phenomenon of accident and injury, actors, safety management, common terms, safety legislation and the use of risk analysis.

This paper focuses on risk evaluation, and the questions used were:

– When does the SRSA know that "adequate" safety is achieved in the work with decisions, regulations, directions, handbooks, information, guidance, education etc.? How have you decided on these actions?
– Which principles and directions govern the decisions in your sector? Describe the principles.
– Are any legal or recommended risk or safety levels used? How are these levels described? Is it possible to assess if the levels are achieved?

After the interviews, the preliminary results were presented to the studied sectors both as a report and at a workshop. Received feedback was included in the final results.

3 THE DIFFERENT SECTORS' VIEWS ON EVALUATION OF RISK AND SAFETY ISSUES

Below is a summary of the answers, for each one of the twelve sectors, to the questions described in the method section above. The summaries are presented in the same order and have the same numbering as the presentation of the sectors in the previous method section.

The extent and depth of the answers varied a lot, which is apparent from the summaries. One has to bear in mind that the scope and focus of the commission vary between the studied sectors. There are also variations in other contextual conditions, such as the characteristics of the risk, who the stakeholders are and how adjacent legislation is formulated.

3.1 Fire protection in buildings

The work of this sector is done within the framework of the Civil Protection Act, which is applicable when assessing fire protection in existing buildings. The Building Code regulates the construction period up to the approval of the building. The legal requirements on safety actions shall express the lowest level accepted by society. The requirements shall also be economically reasonable, i.e. an action must give protection that is proportional to its cost. It is possible to require new actions to a certain extent, but consideration has to be given to the regulation that existed when the building was built. These circumstances make each case unique, which requires profound knowledge about different parts of the system as well as flexibility when assessing a building's safety and risk level. The principles used in the assessment are focused on saving lives and preventing injuries. A key principle is that people have to be able to evacuate the building before critical conditions arise. Another important principle is to protect third parties from injuries and damage. The interpretation of the regulation and the assessment of safety levels are also, to a certain degree, dependent on the political climate and on the degree to which recent fires have become media events.

3.2 Dangerous goods

The objective of the regulatory work for the transport of dangerous goods is to ensure safe transport. The regulatory work is done within the framework of the UN Economic & Social Council (ECOSOC) and is based on negotiations between the member states. Thirty-nine countries are involved in the work with the rules for road transport, and 42 countries work with the rules for rail transport. There is no set safety target level, but a general principle used in the assessment is that the more dangerous the substance is, the higher the required safety level. The rules are a balance between enabling transportation and ensuring safety. This means that the diverse conditions in the member states (weather, density of population, road quality, economy etc.) may play an important role in the negotiations. Since the regulatory work is based on negotiation, the political aspects are dominant, and this may also be reflected in the outcome. Cost-effect analyses are utilized to some extent. The rules mainly address the design of the technical system and, to some extent, organisational issues. It is difficult to assess if "adequate safety" is achieved.

3.3 Flammables

"Adequate safety" is assessed in official inspections at the site before permissions are issued. Aspects that are assessed are the design and performance of details, subsystems and systems. The assessment is performed based on experience from normal operations and incidents, as well as continuous dialogue and discussions with companies and the industry. Fulfilment of requirements is primarily achieved by living up to accepted practice, and only to a minor extent by satisfying a certain safety criterion. If the practice is not satisfied, the risk owner has to be credible when asserting that the activity still fulfils the requirements. There are three main principles used in the assessment: First, as few people as possible should be exposed to risk. Second, the control room shall be safe in order to handle emergencies. Third, people outside the plant shall not be affected. The assessment of damage and injuries is based on a design scenario that is "more reasonable" than a worst-case scenario. The design scenario is based on experience within the industry, such as information about incidents and similar systems. The rupture of a tank/cistern is, for example, not considered to be a reasonable empirical damage scenario, while a broken coupling is. Effects of organisational changes are very difficult to assess compared to changes in the technological system. The cost of an action shall be (economically) reasonable. There are no legal or recommended levels for safety, but the regulation states that as few people as possible should be injured. The industry calculates probabilistic risk, e.g. individual risk (IR) and societal risk (SR), in its risk analyses and wants the SRSA to use probabilistic risk criteria. The risk analyses that calculate IR and SR are difficult to assess, primarily because the analyses are very uncertain but also because no probabilistic risk criteria have been set. The sector thinks that the focus should be on actions and measures for improving safety instead of on the figures and results of risk analysis.

3.4 Explosives

A standard model, e.g. safety distances, is used when applicable; otherwise a risk analysis should show that the (process) method does not cause a larger risk than similar alternative standard methods. The criteria for liquefied petroleum gas (LPG) pipes are based on long traditions of manufacturing steel and on experience from incidents. The principles used for LPG constructions are: Firstly, the gas has to remain in the containment. Secondly, in case of leakage, the gas has to be ventilated. Thirdly, ignition has to be prevented. Finally, a single mistake or error should not in itself result in a dangerous situation. The safety criteria for explosion protection are based on international figures and information (e.g. from NATO) and adapted to domestic conditions. Three principles are used for assessing the safety of explosives: An initiation of explosion should be avoided. The explosives
should be protected from the surroundings. The surroundings should be protected from the explosives. It is very difficult for the sector to assess if the desired safety level is achieved. The absence of accidents is interpreted as an indication that the safety level probably is good. The applied criteria are absolute, but some consideration of the cost of actions may be taken.

3.5 Land use planning

Most cases are unique. It is therefore important to collect all possible documents, information and analyses. Usually a comprehensive assessment made within the frame of the Planning and Building Act balances different values and assesses the reasonableness. To a certain extent this assessment also has to take the Environmental Code into account, which assumes that certain codes of consideration have been followed, e.g. that sufficient knowledge and best available technology have been used, that chemicals have been replaced with less hazardous ones, and that the location of the activity in question is the most appropriate. In local regulations, safety distances are utilized, as well as prohibitions against establishing new hazardous operations. It is not possible to assess if adequate safety is achieved. One way to approach this issue is to put safety issues on the agenda. The more safety issues that are discussed and managed in the planning process the better, and in some sense the closer you get to "adequate safety". Some principles that are used in the application of risk criteria (Davidsson et al. 1997) are: (1) the principle of reasonableness, i.e. the activity in question should not imply risks that might reasonably be avoided or reduced; (2) the principle of proportionality, i.e. the risks that an activity gives rise to may not be disproportionate in relation to the benefits; (3) the principle of distribution, i.e. no individuals or groups should be put at a risk that far exceeds the benefits they derive from the activity; and (4) the principle of avoiding catastrophes, i.e. manageable accidents with limited consequences are preferred to ones with consequences of catastrophic magnitude.

3.6 Natural disasters

When assessing adequate safety with respect to flooding, the effects of calculated or estimated flows are compared to buildings, public functions and other properties and values that have to be protected. There are no general recommendations for how the municipalities should assess flooding risks. The municipalities decide permissible water levels for new settlements. The evaluation of landslides is mainly assessed through safety factor calculations, and nationally recommended safety factors are utilised. Much of the safety work performed in this sector is based on guidelines from the Commission on Slope Stability (1995) and the Swedish Committee for Design Flood Determination (Flödeskommittén 1990). The latter have recently been updated and replaced by new guidelines (Svensk Energi, Svenska Kraftnät & SveMin 2007).

3.7 Environmental and risk appraisal for industry

Most authorizations are new and unique cases. The work in this sector is therefore similar to the one described for land use planning (section 3.5). The authorization decision is based on risk analysis and on the operators' answers to complementary questions by the SRSA. It is very difficult to know if the safety level is adequate. Lately the regulation has been evolving towards more functional regulation. A couple of precedents exist that contain interpretations of how safety and risk evaluations within the jurisdiction of the Civil Protection Act and the Law on Measures to Prevent and Limit the Consequences of Serious Chemical Accidents (Seveso Directive) should be assessed. These precedents constitute directions for future assessments. The sector also refers to the same principles used in the application of risk criteria presented earlier in section 3.5 (Land use planning).

3.8 Emergency response support

This sector is both comprehensive and situation-dependent, and focuses on two aspects of risk evaluation: general planning of the fire and rescue work in the municipalities (task, size, location, equipment etc.) and safety for the personnel during a rescue operation. The evaluations done in connection with the general planning vary between municipalities and are based on different information and statistics. Statistics on incidents, injuries and accidents are common information sources. During the last years the SRSA has been developing and promoting other methods, which some municipalities have adopted, e.g. cost-effect methods as well as different measures, indicators and key ratios. Evaluations done during a rescue operation are to a certain degree directed by regulation, e.g. prerequisites for conducting a rescue operation and directions for certain activities (e.g. fire fighting with breathing apparatus and diving). The rescue commander has to make fast and sometimes difficult assessments and evaluations based on limited information.

3.9 Safety management in municipalities

The work of this sector is largely done in the municipalities, and is based on political decisions that give directions for the safety work. The priority given to safety issues differs between municipalities. A common guiding principle is that saving life is prioritised compared to saving property, and saving property is
prioritised compared to saving the environment. There are also continuous discussions about how resources should be allocated between small and large accidents. Usually the assessment is done by comparing safety levels and accident statistics with similar municipalities as well as with the nation as a whole. Target levels for the citizens' protection are in some cases expressed. Cost-effect methods are used to a certain extent.

3.10 National injury prevention program

The Safe Community concept promotes systematic safety work at the municipal level (Welander et al. 2004). The vision is zero injuries, but in reality technological and other changes cause new types of accidents and injuries, and this impedes the achievement of safety goals set within the frame of a vision. The Swedish Public Health Policy gives general directions that guide the assessments (Folkhälsoinstitutet 2003). Each municipality sets its own levels based on the guidelines for public health. A common way to direct safety work is to identify the most cost-effective actions. Priority is given to saving lives. Children's safety is often prioritised when establishing new housing estates.

3.11 Emergency preparedness for CBRN incidents

Adequate safety is generally very difficult to assess for CBRN incidents since the uncertainties are large. The response time for the national oil protection supplies is considered to be sufficient. The emergency protection and preparedness for nuclear incidents are also considered to be sufficient, even though the prerequisites for these incidents are continually updated. The assessments of the safety levels for these incidents are based on relevant scenarios. The preparedness has to be flexible in order to adapt to different situations.

3.12 Supervision

Supervision is made within the framework of four legislations, and is based on general information about the operation, risk analyses and other documents. It is not possible to process and consider everything in an operation; strategic choices are made on what to study. With improved experience, the supervisors learn how to get a good picture of the status of different operations and how to assess the "safety level". Supervision is made up of two main parts: the administrative organisation and the hands-on operation. The main starting point for the assessments is the intentions of the legislation. Hence the interpretations of the legislation become crucial for the safety assessment.

4 ANALYSIS AND DISCUSSION

4.1 The view of sectors on risk evaluation issues

At first during the interviews, some of the sectors did not consider themselves to be doing evaluations of risk, even though they take daily decisions and standpoints that in one way or another incorporate the evaluation of risk. These tasks include issuing permits and authorisations, issuing regulation and other guidelines, developing and providing methods and tools for safety work, promoting safety in certain lines of business or arenas, giving comments on submitted proposals etc. Evaluation seems to be a very integrated, natural and unconscious part of the everyday work done by the different sectors.

The reasons for the sectors' view that they are not doing evaluation have not been analysed thoroughly. However, two possible explanations have been seen, and they are both connected to the background and the role of the officials. Most officials have a technical or natural science background, even though this is gradually changing. This means that they are used to working with proposals and solutions that are expressed in figures and charts, and based on calculations and statistical methods. Inspired by the work of Lupton (1999), Summerton & Berner (2003) and Renn (1992), Hultman (2004) describes seven perspectives on risk within socio-technical systems. Most SRSA officials have, and work within, what Hultman describes as an "engineering perspective" on risk. One explanation could be that the term "evaluation" has been interpreted as using a certain systematic, scientific or quantitative method. Their background may have led their thoughts to associations with scientific evaluation, which they do not consider themselves to be doing. Another explanation could be the indistinct role of the officials. Most of them are scientific experts by profession. They also form a link between the legal world and other experts. If "evaluation" was interpreted as a legal issue, they probably did not consider themselves to be doing that either.

4.2 Assessment of adequate or satisfactory safety

Most of the respondents thought that the question of how they determine if adequate safety is achieved was relevant. At the same time, many of the sectors had difficulties describing how they did the assessment. Below is a summary of different views on how adequate or satisfactory safety is assessed. Adequate, satisfactory, sufficient, acceptable or tolerable safety is attained when:

– no accidents, injuries or deaths occur
– tests and examinations show that characteristics for a product or an activity meet predetermined thresholds
– it can be shown that a product or an activity/operation does not have a higher risk level than other similar products or activities/operations
– all experiences acquired by the involved parties are used as far as possible
– it has been shown that the quantitative safety/risk level is lower than established criteria
– it has been shown that certain required equipment or functions exist
– a process involving certain parties and including certain stages has preceded a safety decision
– a decision becomes a legal case, goes to trial and becomes a precedent
– a decision on safety is determined in relation to other factors (adequate safety is relative)
– all the parties are content and nobody complains about the safety level or actions taken.

Some of the sectors answered that it is not possible to know or determine if adequate safety is achieved. One sector thought that adequate safety was not primarily a question of achieving a certain safety level or safety goal. Instead, the focus should be on the characteristics of the safety work. An operation or activity that works systematically and strives for continuous safety improvements should be considered to have adequate safety.

The results show that the sectors have interpreted the question in several different ways. The sectors have either answered how requirements in regulation are satisfied (legal view) or how adequate safety is assessed from a scientific or from a philosophical point of view. The answers given to this question could also be summarised in three types of main discussions (discourses), where discussion types 1 and 2 were the most common (Fig. 1).

Type 1: Quantitative risk measures (outcomes and expectations) were discussed and contrasted with each other. In most cases the discussion also included some questions or criticisms of these risk measures (by some referred to as "the reality"). Type 2: Different safety actions, necessary functions and ways of managing safety were discussed and contrasted. In most cases the discussion also included issues on how to assess the reasonableness of required safety actions. Type 3: In a couple of cases the two earlier types of discussions (types 1 and 2) were contrasted with each other.

[Figure 1 (schematic): two boxes, "Outcomes & Expectation values" and "Actions & Functions & Way of work", linked by arrows 1, 2 and 3 to "the reality" and "reasonableness" respectively.]

Figure 1. Three types of discourses on adequate safety were identified during the interviews. Most common were the type 1 and 2 discourses.

4.3 Principles used for evaluation of risk and safety issues

All sectors gave examples of directions and principles that were used as starting points in the evaluation of risk and safety issues. Below is a summary:

– Risk comparison
– Ethics and values
– Economic considerations
– The precautionary principle
– Zero risk target (Vision zero)
– Political decisions and directives
– Experience and tradition (established practice)
– Lowest acceptable or tolerable risk level
– National goals (e.g. Public Health Policy)
– The principle of avoiding catastrophes
– Limitation and mitigation of consequences
– The evaluation is transferred to others
– A third party should not be affected by the accidents of others
– Balanced consideration and compromises between competing interests
– Saving human lives is prioritised in relation to saving property and the environment

As seen from the results, many sectors refer to more than one principle. There is sometimes no clear boundary between two principles. Also, some principles may include (parts of) other principles. It was often unclear to what extent the principles were used and what importance the different principles had. For some of the sectors it was not clear how the principles were utilized in real cases or situations. In a few cases the sectors did not have a clear picture of which principles governed the evaluation of risk within their domain. The uncertainty about the principles, their importance and how they are utilized may be due to the fact that almost no sector had been carrying out any profound reflections and discussions about these issues.

4.4 An outline for classification of risk criteria

An outline for classification of risk criteria has been made based on how the sectors assess adequate safety and what principles they use in the evaluation process. The outline is based on the classifications presented by
404
Mattsson (2000) and Morgan & Henrion (1990). These classifications have been expanded and modified to better fit the results found in this study, resulting in five groups of criteria:

1. Safety actions and system design criteria
2. Rights-based criteria
3. Utility-based criteria
4. Comparison criteria
5. Comprehensive assessments

Group 1 consists of four types of criteria: (a) detail regulation, which commonly regulates actions that target technology, operation and to some extent the organisation; (b) functional regulation, expressed as a certain function that has to be fulfilled in a given situation; (c) measures, indicators and key ratios, which could be a safety distance, an index or a ratio; (d) technology-based criteria, which state that the best available technology should be used.

Group 2 includes criteria that are rights-based, quantitative and express the highest permitted risk level: (a) zero-risk and similar goals for outcomes, expressed as the accepted level of deaths or injuries; (b) individual risk criteria; (c) societal risk criteria. The criteria (b) and (c) express the highest accepted level for expected values of deaths.

Group 3 consists of three types of criteria that all come from utility theory and are variants of utility-based criteria (Mattsson 2000): (a) cost-benefit: assess whether the sum of all benefits of an action exceeds the costs; (b) cost-effect: assess which action meets the safety goal at the lowest cost; (c) multi-attributive utility criteria: assess different alternatives based on a combination of several types of preferences with different weights.

Group 4 consists of criteria with three types of comparisons: (a) between the risk/activity in question and similar risks/activities; (b) between the risk in question and dissimilar types of risks, e.g. smoking, driving and climbing; (c) between different design alternatives, which is common in environmental impact assessment.

Group 5: The assessment of safety is done together with many other factors, and the assessment does not have to be quantitative. The assessment processes do not have to be as systematic as multi-attributive analysis. The outcome of such an assessment may be difficult to predict since these assessments are unique for each case.

In some cases "hybrid criteria" are used (Mattsson 2000). These combine different criteria: e.g. first societal or individual risk is assessed, followed by a cost-benefit analysis. Also, many of the presented criteria may be expressed as guideline values/levels and not absolute values/levels. The difference is that the guideline values are not unconditional and can be seen as something to strive for.

4.5 Assessment of reasonableness and the use of quantified risk criteria

The issue of reasonableness was common in the sectors' discussions on risk evaluation. The sectors discussed reasonableness both regarding the evaluation process and the requirements on safety actions. Most probably this is because the applicable legislations state that requirements on safety actions shall be "reasonable". These legislations may have been influenced by utility theory, but the assessments of reasonableness made by the courts have an even more comprehensive scope. The few existing precedents (stated by courts) have interpreted the meaning of "reasonable" as: the costs for an action shall be proportional to the safety turnout of the required action. Also, practical feasibility, as well as the question of existing versus new objects/activities, is taken into consideration when assessing if an action is reasonable.

In some of the interviewed sectors it is common that the industry uses quantitative risk analyses (QRA) that present the result as individual and societal risk (i.e. rights-based criteria). In some European countries, such as the Netherlands and Great Britain, quantitative risk criteria are used to assess such risk analyses (Ale 2005). There are no set quantitative risk criteria in Sweden that can be applied when assessing quantitative risk analysis. Therefore the industry (or a single risk owner) either sets its own criteria or refers to levels used in other countries.

This situation is perceived as problematic by some sectors, since Swedish legislation generally promotes risk decisions based on overall assessments of different aspects, standpoints and interests. Also, most of the sectors made a distinction between calculated risk (expectation values) and actions to manage risk, which is also exemplified by the two most common types of discussions displayed in Figure 1. Some sectors also interpreted the legislation as being more focused on risk-reducing actions than on risk as a level or a condition. The divergent views between the industry and some of the sectors on how to use and assess quantitative risk analysis raise several questions:

– Why does the industry calculate individual and societal risk, even though no quantitative criteria exist?
– Does the industry focus more on risk levels than on safety actions?
– Is it possible for authorities to evaluate the results of QRA, even though no quantitative criteria exist?

5 CONCLUSIONS

In this study we have striven to give a picture of how risk evaluation has been carried out within the SRSA, and to a certain extent also why.
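The quantified societal-risk check discussed in sections 4.4 and 4.5 can be made concrete with a small sketch. The following illustration is not taken from the study: the scenario frequencies are invented, the helper names `fn_curve` and `meets_criterion` are ours, and the criterion line F(N) ≤ C/N^a is only one commonly cited form (e.g. the Dutch guide value with C = 10^-3 per year and a = 2 is often quoted in the literature on societal risk).

```python
# Illustrative sketch (not from the paper): turn a QRA scenario list into an
# F-N curve and check it against a criterion line F(N) <= c / N**a.
# All numbers below are invented for the example.

def fn_curve(scenarios):
    """Return (N, F) points, where F(N) is the summed annual frequency
    of all scenarios with at least N fatalities."""
    ns = sorted({n for n, _ in scenarios})
    return [(n, sum(f for m, f in scenarios if m >= n)) for n in ns]

def meets_criterion(scenarios, c=1e-3, a=2):
    """True if every point of the F-N curve lies on or below F = c / N**a."""
    return all(f <= c / n**a for n, f in fn_curve(scenarios))

# Each scenario: (number of fatalities, frequency per year) -- invented data.
scenarios = [(1, 1e-4), (10, 1e-6), (50, 1e-8)]
print(fn_curve(scenarios))
print(meets_criterion(scenarios))
```

As the paper stresses, such a numerical check could at most be one input among many in a Swedish assessment, which also weighs actions, functions and reasonableness alongside any calculated risk level.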
The findings in this study show that the evaluations of risk and safety issues are done in many different ways, due to several context-dependent factors that influence the assessment. A study that investigates whether it is possible to find more universal foundations for evaluations of and decisions on risk within the authority is desired. Such foundations could perhaps supplement or replace some of the existing principles.

The evaluation component in the authority's processes that include decisions about risk has to be elucidated. The benefits of such elucidation could be, among other things, better internal and external understanding of the decisions, a more transparent process, easier and clearer communication of decisions and standpoints, better support to decision-makers such as other officials and politicians, and a reduction of the influence of individual preferences linked to certain officials.

A more systematic exploration and discussion of these questions is needed within the SRSA if we are to develop these issues and get a better general view and co-ordination of risk evaluation.

6 FUTURE STUDIES

Based on the study, the SRSA has concluded that improvements are needed and has initiated a development study regarding risk evaluation at the authority. The first ongoing step in the three-year project is to carry out a review of research done on the subject. The review will combine four knowledge compilations done by researchers from the fields of sociology, economics, philosophy and technology.

The results from the ongoing review could, together with the findings from the study presented in this paper, constitute a starting point for a more substantial foundation for the evaluation of risk and safety issues at the authority.

ACKNOWLEDGMENTS

We would like to thank the officials participating in the interviews for their openness, and the Swedish Rescue Services Agency for making it possible to conduct this study. We would also like to thank Prof. Lars Harms-Ringdahl and an anonymous referee for constructive comments on the conference paper.

REFERENCES

Ale, B.J.M. 2005. Tolerable or acceptable: A comparison of risk regulation in the United Kingdom and in the Netherlands. Risk Analysis 25(2): 231–241.
All, R., Harrami, O., Postgård, U. & Strömgren, M. 2006. Olyckor, riskanalyser och säkerhetsarbete—några olika perspektiv inom Räddningsverket (in Swedish). Report P21-480/07. Karlstad: The Swedish Rescue Services Agency.
Boverket 2005. Riskvärdering—delprojekt 2.1, bilaga till regeringsuppdrag Personsäkerhet i tunnlar (in Swedish). Karlskrona: The National Board of Housing, Building and Planning.
Commission on Slope Stability 1995. Anvisningar för Släntstabilitetsutredningar (in Swedish). Linköping: Royal Swedish Academy of Engineering Sciences.
Davidsson, G., Lindgren, M. & Mett, L. 1997. Värdering av risk (in Swedish). Report P21-182/97. Karlstad: The Swedish Rescue Services Agency.
Flödeskommittén 1990. Riktlinjer för bestämning av dimensionerande flöden för dammanläggningar. Slutrapport från Flödeskommittén (in Swedish). Statens Vattenfallsverk, Svenska Kraftverksföreningen and Sveriges Meteorologiska och Hydrologiska Institut.
Folkhälsoinstitutet 2003. Sweden's new public health policy. National public health objectives for Sweden. Report 2003:58. Stockholm: Swedish National Institute of Public Health.
Harrami, O., Strömgren, M., Postgård, U. & All, R. (in press). Accidents, risk analysis and safety management—different perspectives at a Swedish safety authority. In Proceedings of ESREL 2008 and 17th SRA Europe Conference—Annual Risk, Safety and Reliability Conference, Valencia, Spain, 22–25 September 2008. Rotterdam: Balkema.
Hultman, M. 2005. Att förstå risker—en kunskapsöversikt av olika kunskapsperspektiv. KBM:s Forskningsserie Report nr. 8. Stockholm: Swedish Emergency Management Agency.
Lundgren, L.J. & Sundqvist, G. 1996. Varifrån får miljövårdsbyråkraterna sin kunskap? In Lars J. Lundgren (ed.), Att veta och att göra—Om kunskap och handling inom miljövården (in Swedish): 129–171. Lund: Naturvårdsverket Förlag.
Lupton, D. 1999. Risk. Milton Park: Routledge.
Mattsson, B. 2000. Riskhantering vid skydd mot olyckor—problemlösning och beslutsfattande (in Swedish). Karlstad: Swedish Rescue Services Agency.
Morgan, M.G. & Henrion, M. 1990. Uncertainty—A guide to dealing with uncertainty in quantitative risk and policy analysis. Cambridge: Cambridge University Press.
Reid, S.G. 1999. Perception and communication of risk, and the importance of dependability. Structural Safety 21(4): 373–384.
Renn, O. 1992. Social theories of risk. In S. Krimsky & D. Golding (eds), Social theories of risk. Westport: Praeger Publications.
Summerton, J. & Berner, B. (eds) 2003. Constructing risk and safety in technological practice. London: Routledge.
Svensk Energi, Svenska Kraftnät & SveMin 2007. Riktlinjer för bestämning av dimensionerande flöden för dammanläggningar (in Swedish). Stockholm: Swedenergy, Svenska Kraftnät and SveMin.
Welander, G., Svanström, L. & Ekman, R. 2004. Safety Promotion: An Introduction. Stockholm: Karolinska Institutet.

Regulation of information security and the impact on top management commitment—A comparative study of the electric power supply sector and the finance sector

J.M. Hagen
Gjøvik University College

E. Albrechtsen
SINTEF Technology and Society/Norwegian University of Science and Technology, Trondheim, Norway

ABSTRACT: The paper compares how information security in critical infrastructures is regulated in two differ-
ent sectors and how the regulations can influence organizational awareness. It also compares how organizational
information security measures are applied in the sectors, and discusses how the sectors can learn from each
other. The findings document considerable differences in the legal framework and supervision practices, in the use of organisational information security measures, and in top management engagement. Enterprises belonging to the finance sector make more widespread use of organisational security measures, and their respondents are also more satisfied with the management engagement and with the organization's performance against the legal requirements. The paper argues that information security audits by authorities can be one important contribution to information security awareness and top management commitment to security, and that the sectors can learn from each other by sharing information on how they deal with information security.

1 INTRODUCTION

Sundt (2006) and Lobree (2002) claim that laws and regulations affect the implementation of information security in organizations. There is, however, limited information publicly available about how compliance with information security laws is currently dealt with.

This paper addresses how regulatory and supervisory activities are carried out, and how they impact information security practices and organizational awareness, in the finance and the electric power supply sectors in Norway. It also discusses how the sectors could learn from each other.

Finance and electric power supply are both critical infrastructures; additionally, electric power supply is the basic infrastructure for all kinds of service production dependent on computers and electronic communication services.

Norwegian enterprises have gradually increased their dependence on IT systems (Hagen 2007). In addition, studies on critical infrastructures and the vulnerability of society have over the last years documented that previously redundant physical systems, like electricity production and distribution, are now critically dependent on electronic communication and the Internet, and that connecting IT systems to the Internet has increased the risk of system breakdowns and serious failures (Nystuen & Hagen 2002, Hagen 2003). Norwegian enterprises are also exposed to computer crime, and there exist numerous weaknesses in the applied security systems (Kredittilsynet 2004, Hagen 2007, Hole et al. 2006).

Norwegian authorities, with support from the legal framework, supervise and measure compliance with information security regulations within the critical infrastructures. By law, authorities can force organizations to adopt information security measures. Yet, certain questions arise:

– Do the laws and the accompanying supervision process by the authorities have any effects on the enterprises' organizational security awareness and top management commitment?
– Are there any differences in security practices between the sectors?
– If there are differences, how could the enterprises within the different sectors learn from each other to strengthen information security?

The paper discusses how audit by authorities can contribute to organizational security awareness and top management commitment to security. Section 1 introduces the problems to be analyzed. Section 2 gives
state of the art regarding the management's role in information security. Section 3 presents the applied research method. Section 4 provides an overview of the Electric Power Supply and the Finance Industry in Norway and the applied supervision methodologies. In Section 5 we compare the two sectors, studying the security attitudes and applied organizational security measures. Discussion of the findings is given in Section 6. Section 7 provides the answers to the research questions, and Section 8 shows the way ahead and concludes that it will be useful to study security cultures within organizations.

2 THE MANAGEMENT'S ROLE

Information security has developed from a strict technological discipline to become a multidisciplinary responsibility for top management (Lobree 2002, Sundt 2006). Williams (2007) claims that even the board needs to be assured of effective risk management and the sharing of responsibility for information security by a number of individuals within the organization.

Information security law places responsibility for information security on the management and the boards. Dealing with authorities is a typical management task. Thus, supervision from authorities may be one way to raise top management's awareness. We expect that an engaged top management is important for how the organization adopts information security measures and complies with the law. Studies of safety have documented that management involvement is important for the safety work within companies (Simonds 1973, Simonds & Shafai-Sharai 1977). It is reasonable to believe that this experience could be transferred to the information security field.

3 METHOD

We have carried out two exploratory case studies; one on financial law and supervision, and one on energy law and supervision within the hydroelectric power supply sector. Data on supervision practices was collected during the summer of 2006 from textual materials and personal interviews with representatives from the Norwegian supervisory authorities and four companies. Authorities in Sweden, Finland, Denmark and the UK were also contacted by mail (Hagen et al. 2006).

In the spring of 2007, a web based survey was carried out among IT-officers in different sectors (Hagen et al. unpubl.). This survey addressed organizational security measures like user education, procedures and controls and tools, compliance to law and the effectiveness of supervision. Out of a total of 87 answers, only 34 were related to the electric power supply and finance sectors. This is a moderate sample, and limits our ability to conduct statistical analysis to reveal associations.

Therefore, we have chosen to present the results as descriptive statistics. A few hypotheses have also been formulated and tested by independent samples t-tests. The independent-samples t-test procedure compares means for two groups of cases. Ideally, for this test, the subjects should be randomly assigned to two groups, but this is not the case when we study enterprises within the energy and the finance sector. Therefore, we discuss how differences in other factors, e.g. respondents, could be masking or enhancing a significant difference in means. In addition, the paper has been through several rounds of validation by the informants.

4 THE ELECTRIC POWER SUPPLY AND FINANCE INDUSTRY IN NORWAY

4.1 The legal framework

In Norway, the Ministry of Oil and Energy holds the administrative responsibility for the energy sector. The Norwegian Water Resources and Energy Directorate (NVE) oversees the administration of the energy and water resources. The Norwegian electricity industry finds itself in a unique position internationally, as almost 100% of the country's electricity production comes from hydro electric power. Norwegian enterprises need to be licensed in order to supply and dispose of electrical energy. Enterprises act in accordance with many laws and regulations; however, the most important, dealing with contingency and information security, are the Energy Act, the Energy Direction and the Contingency Direction § 4 and 6. The regulations deal with the planning, building and conduct of different electrical enterprises, including emergency preparedness. They place certain requirements on electrical enterprises' protection of critical information systems and sensitive and classified information, and demand that enterprises provide the Assurance Reports.

The Ministry of Finance holds the superior administrative responsibility for the country's financial sector. The Financial Supervisory Authority of Norway oversees Norway's financial markets and regulates the financial sector. Like the hydroelectric power enterprises, financial enterprises act in accordance with many laws and regulations. The most important legal acts are the Financial Supervisory Act, and the related Information and Communication Technology (ICT) Direction. The ICT Direction deals with information security. Thus, one major difference between the sectors is that the financial sector has a dedicated information security direction, while the hydroelectric power supply industry is regulated according to § 4 and 6 in the Contingency Direction.

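As a side note to the method described in Section 3, the independent-samples comparison of sector means can be sketched in a few lines of code. The sketch below is illustrative only: the Likert scores are hypothetical (not the study's data), and the unequal-variance (Welch) variant of the test is an assumption on our part, since the paper does not state which variant was used.

```python
import math
from statistics import mean, variance

def welch_t_test(a, b):
    """Independent-samples (Welch) t statistic and degrees of freedom.

    The p-value would normally be read from a t distribution with df
    degrees of freedom (e.g. scipy.stats.t.sf(abs(t), df) * 2).
    """
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical Likert responses (1 = strongly disagree ... 5 = strongly agree):
electric_power = [3, 4, 3, 2, 4, 3, 4, 3]
finance = [5, 4, 5, 4, 5, 5, 4, 4]
t, df = welch_t_test(electric_power, finance)
print(f"t = {t:.2f}, df = {df:.1f}")  # negative t: the first group's mean is lower
```

With SciPy available, `scipy.stats.ttest_ind(electric_power, finance, equal_var=False)` returns the t statistic together with the two-sided p-value directly.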
4.2 The supervision methodology

The two Norwegian regulatory authorities chosen for this study conduct their supervisory controls in different manners; still, there are some similarities. In both sectors, supervision is conducted in the following way: The enterprises go through a pre-evaluation, which aims to evaluate the risks, importance and status of the enterprises. Then, the authority notifies the enterprises which are selected to undergo supervision. The authority requests information, arranges meetings, collects more information, analyses it and documents deviations. In this process the dialog is important. Then, the enterprises are given a deadline to close the deviations.

Both the Norwegian Water Resources and Energy Directorate and the Financial Supervisory Authority have developed guidelines to assist the enterprises in their effort to comply with information security law. In addition, they both offer training and give advice.

But there are also certain differences. The supervision methodology of the Norwegian Water Resources and Energy Directorate is based on NS-EN ISO 19011, which describes an audit methodology and specifies requirements for auditors. It has developed a handful of ''yes/no'' questions based on the text of the law. The Financial Supervisory Authority has developed a comprehensive questionnaire based on the ICT Directive and COBIT, and uses this to audit the IT processes. This is a far more comprehensive tool than the handful of questions posed by the Norwegian Water Resources and Energy Directorate. Within the hydroelectric power industry, information security supervision is just a part of a more comprehensive security supervision, while the finance sector supervision is characterized by a sole focus on information security, and is deeply rooted in the ICT Direction. In addition, more human resources are dedicated to the information security supervisory process within the financial sector than in the electric power industry.

When requirements to close deviations are not met, both authorities can withdraw the licences. Yet, punishment practices may vary: The Financial Supervisory Authority of Norway publishes any deviations that are not closed within the deadline. This may impact the market's trust in the current financial enterprise and management. The Norwegian Water Resources and Energy Directorate requires the deviation closed, and any enterprise not closing the deviation risks fines. Also, the addressee of the supervision report varies: Within the financial sector the report is addressed to the board, placing responsibility on the board; within the hydro electrical power supply industry, the supervisory report is addressed to the company (Hagen, Nordøen and Halvorsen, 2007).

5 COMPARATIVE ANALYSES

5.1 The answers

Of the 87 received answers, 21 were from hydro electrical enterprises and 13 from saving banks. Table 1 identifies the distribution of the answers and the respondents of the questionnaire.

Table 1. Respondents (N = 34).

                 Electric power   Finance

Manager                1             4
IT                    17             1
Economy                0             2
Security               2             3
Advisor                1             2

Total counts          21            12

The data shows that within the finance sector more managers answered the questionnaire compared with the hydro electric power industry, in which responses from IT personnel dominated.

About half of the respondents report that they outsource some or all IT operation. This corresponds with the findings in other surveys (Hagen 2007).

5.2 Top management commitment and quality of security work

The questionnaire included questions about the respondents' judgements of top management engagement and the compliance to information security law.

Table 2. Security attitudes towards information security (N = 34). Mean ranging from 1 = strongly disagree to 5 = strongly agree.

                                        Mean
                              Electric power   Finance   Sig.

Engaged top management             3.33          4.62    0.001
Info.sec. is frequently
  on the agenda                    2.29          3.08    0.057
Legal requirements
  are satisfied                    4.14          4.69    0.035
Supervision increases the top
  management's engagement          3.95          4.25    0.399

Table 2 summarises the results of independent samples t-tests. A Likert scale was used, ranging from 1 =

strongly disagree to 5 = strongly agree. The results show mostly high mean values, which means that the respondents agree to most statements. Tests of differences in attitudes reveal that the respondents within the finance sector view top management engagement to be higher compared with the respondents from the electric power supply sector. Within the finance sector, information security is also more often on the management's agenda, and there is also a stronger focus on meeting legal requirements. There are, however, no significant differences regarding the respondents' view on the effect of supervision, but a high mean value indicates that the respondents agree that supervision has some effect.

Table 3 shows that most organizational security measures are widely adopted in the finance sector. Security classification of systems and personnel is, however, mostly applied by electric power supply organizations. One possible explanation might be the security regime that developed after the Second World War with focus on protection of power plants and critical infrastructures in case of war.

Table 3. Percentage who have applied different organizational security measures in the electric power sector (N = 21) and financial sector (N = 13).

                                     Electric power   Finance

Written security policy                   76%           100%
Guidelines                                80%            92%
Non-disclosure agreement                  62%           100%
Employee training                         52%            92%
Awareness raising activities              48%            54%
Formal disciplinary process               38%            61%
Classification of systems/personnel       62%            54%

The survey also inquired about the frequency with which risk analysis and internal and external audits were conducted. Independent sample t-tests show significant differences between the sectors; the financial sector performs risk analysis and audits more frequently compared with the electric power industry.

Table 4. Risk analysis and audits (N = 34). Scale used: 1 = never, 2 < every third year, 3 < yearly, 4 = yearly.

                        Mean
                 Electric power   Finance   Sig. level

Risk analysis         3.05          3.77      0.003
Internal audit        3.10          3.85      0.002
External audit        2.62          3.46      0.016

Summing up the findings so far, it seems reasonable that the information security awareness is higher in the finance sector compared with the electric power industry. This is documented by the respondents' judgement of top management engagement and a more widespread use of organizational information security measures.

The data also shows that the overall majority of both the financial and the electrical power supply organisations have routines for the reporting of incidents, and that the employees in both sectors often act according to the routines and report security incidents. Looking at the share of enterprises experiencing security incidents (Table 5), we see that a larger number of electric power supply enterprises report incidents typically caused by insiders, compared with financial enterprises. A subsequent hypothesis may be that there exists a relationship between high organizational security awareness and low probability for security breaches by own employees. The data set is unfortunately too small to conduct a Chi-square test of the hypothesis.

Table 5. Number of enterprises with incidents (N = 34).

                                       N
                              Electric power   Finance
                                (N = 21)       (N = 13)

Abuse of IT systems                 6              1
Illegal distribution of non-
  disclosure material               2              1
Unauthorized deletion/
  alteration of data                4              1
Unintentional use violating
  security                         10              3

5.3 How could the sectors learn from each other?

Within the Norwegian electricity industry, focus placed upon information security has increased during the last years, both through national information security strategy initiatives and research on protection of critical infrastructure (Nystuen & Hagen, 2003). The traditional security focus in the sector has been on national security, physical security and emergency in case of natural disasters and war. Information security has become important in particular during the last 10–15 years, and is now a critical input in process operation and trade.

In the finance sector, however, information security has been close to core business ever since money became electronic signals. If the IT systems are down or the IT services not secure, this would not

only prevent money transfers, but also damage the reputation of the financial enterprise.

Compared with the electrical power supply sector, the respondents within the finance sector report more far-reaching security measures, higher management engagement in information security and more satisfaction with compliance to legal requirements. The financial sector is also supervised according to a dedicated information security direction and by a far more comprehensive method.

The electrical power supply sector could learn from the financial sector how to improve organizational awareness and adopt organizational information security measures. The financial sector could also learn from the electric power supply sector how to classify information, personnel and systems into different security classes, and how emergency exercises can prepare the organizations for different threat scenarios.

To learn and share knowledge about information security practices and their effects, mutual trust and secure meeting places are needed. The survey data shows that the organizations are participating in a wide range of different security forums, but we do not see many of them participate in the same forum. The authorities could facilitate meetings to establish connections among enterprises in the different sectors. Another possibility is that the National Security Agency facilitates such meetings through the NorCerts' forums. This could be a good starting point.

We expect that security systems within both sectors would benefit from information sharing and mutual learning. The electric power supply is the most critical infrastructure in modern society. In case of power outage, most services stop, including financial services. On the other hand, the electric power supply sector depends on the finance sector to transfer money when electricity is traded, and the whole society depends on both the electric power supply and the services and infrastructures for money transfers. If any of these critical services stops, then production and delivery of all kinds of services and goods stop, too. Moreover, in the modern society, there is a new evolving interdependency: the ICT systems that affect the sectors/domains beyond their boundaries. Both sectors depend critically on the ability to communicate, and on ICT providers and infrastructures that might be located in other countries (Nystuen and Fridheim, 2007).

6 DISCUSSIONS

6.1 The differences in legal framework and supervision practice

The question we raised in this paper asks whether the laws and accompanying supervision process of the authorities have any effects on the organizational security awareness of the enterprises under supervision. We know from this study that both electric power industry and finance sector authorities have developed guidelines and in addition provide their supervisory objects with advice and assistance. The differences in legal framework, amount of human resources spent on supervision of information security and applied supervision methodologies mirror the organizational security awareness in the two critical infrastructures. High organizational security awareness corresponds with less exposure to insider threats. The finance sector has adopted the most comprehensive approach, which is mirrored in wider use of security measures, a more engaged top management and less reporting of insider incidents. Our findings confirm the theories that a legal framework affects how information security measures are adopted by organizations (Sundt 2006, Lobree 2002).

It is reasonable to believe that the differences in laws and regulations and how compliance to laws is supervised can have an impact on organizational awareness, including top management engagement in information security. When supervising, the authorities place responsibility on the management. The quality of the supervision methods and the kind of applied sanctions might impact the engagement of the top management. If the management is engaged, it will be aware of the need for information security measures to comply with the laws, and assure that security measures are implemented. In this survey we found that high management engagement corresponds with a high degree of adopted security measures and a lower degree of insider incidents. The findings correspond with the experiences from safety management (Simonds 1973, Simonds & Shafai-Sharai 1977) and research on information security law (Sundt 2006, Lobree 2002).

We must, however, be aware of alternative explanations of the findings. Organizational information security awareness could be related to maturity in information security work within the sectors, or the closeness of information security to core business. Unfortunately, the data set is neither designed for it, nor large enough, to conduct statistical analysis to test the hypothesis.

6.2 Information security supervision in other countries and the possibility to generalize the findings

Supervision from a public authority requires a legal framework. Without laws and directions, there will be no public supervisory activities. The supervisory practices and related legal frameworks in Norway, Sweden, Denmark, Finland, and the UK are documented in Hagen et al. (2007). Norway has close relations to all these countries, both within the electric power supply sector and within the finance sector. The Norwegian

authorities use information security supervision to a larger extent than the other countries. The Norwegian financial sector is in an exceptional position because of the ICT Direction. The usual practice is to include information security requirements as a minor part of other laws and directions. In Denmark, however, there are regulations within the financial industry similar to the ICT Direction in Norway, and the supervisory process is almost equal to the Norwegian one. Supervision of information security within the hydroelectric power supply is, however, in a premature phase compared to Norway, because information security is not to the same extent regarded as critical. Sweden does not have information security laws for finance or hydroelectric power supply, but its Security Act can be applied to critical infrastructures. Finland has the Data Security Act, which is relevant for all sectors, yet there are differences regarding the practical organisation of emergency preparedness. The most significant differences are, however, related to UK practices. The UK does not have information security regulation similar to the Nordic countries, but the Centre for Protection of National Infrastructure (previously the National Infrastructure Security Coordination Centre, NISCC) contributes with knowledge and advice regarding cyber attacks against critical infrastructures. None of the countries, including Norway, use indicators or metrics to measure information security and compliance to law.

The differences in the national regimes make it difficult to generalize the findings in this study and to conduct meaningful comparisons. It could, however, be interesting to study the information security practices within other countries in more detail to learn more about the effects of different supervision approaches and also the effect of laws related to more market driven regimes.

6.3 Validity and reliability of the data

The paper is based on both in-depth case studies, as documented by Hagen et al. (unpubl.), and a survey. The report has been through several rounds of validation by the informants. These findings, describing the differences in laws and supervision practices, are therefore considered as both valid and reliable.

Turning to the survey, the major problem is the low response rate and the few responses. However, the answers distribute well among the sectors, enabling some discussion of variations among the sectors. Besides, personal interviews with representatives from the sectors confirm the findings.

Studying the identity of the respondents to the questionnaires, there were significant differences between the sectors. In the finance sector, mostly management answered, while in the electric power industry, mostly IT personnel answered. Both IT personnel and management should be aware of the use of organizational information security measures examined in the survey, as the measures would influence their work in some way. It is our view that the study produces valid and reliable results, despite the few answers and the variations in identity of the respondents to the questionnaire.

7 CONCLUSIONS

In this paper we raised three questions:

– Do the laws and accompanying supervision process of the authorities have any effects on the organizational security awareness and top management engagement of the enterprises under supervision?
– Are there any differences in security practices between the sectors?
– If there are differences, how could they learn from each other to strengthen the information security?

Our findings support the theory that laws have an effect on how organizations adopt information security measures. The findings indicate that both laws and the quality of the supervisory processes could have an effect on how the organizations adopt organizational information security measures.

There are differences between the sectors in legal framework, supervision methods and type of sanctions if the requirements are not met. These differences mirror the security practices, attitudes of top management engagement in information security and the exposure to insider incidents.

Cross-sector learning and knowledge sharing will strengthen the overall security culture and can be motivated by the mutual dependency between the sectors. Existing security forums can be used, or the authorities could facilitate meetings. In this way, security systems within both sectors can benefit from knowledge sharing.

8 THE WAY AHEAD

We will continue to study the effects of organizational information security measures. Further research will be conducted on the effect of implemented security policy and IT-user training on security culture.

ACKNOWLEDGEMENT

The authors would like to acknowledge Jan Hovden, Pål Spilling, Håvard Fridheim, Kjetil Sørlie, Åshild Johnsen and Hanne Rogan for their contribution to the work reported in this paper.

REFERENCES

COBIT, 2008. Available at www.isaca.org/cobit
Hagen, J.M. 2003. Securing Energy Supply in Norway—Vulnerabilities and Measures. Presented at the conference: NATO-Membership and the Challenges from Vulnerabilities of Modern Societies, The Norwegian Atlantic Committee, and the Lithuanian Atlantic Treaty Association, Vilnius, 4th–5th December, 2003.
Hagen, J.M., Albrechtsen, E. & Hovden, J. Unpubl. Implementation and effectiveness of organizational information security measures. Information Management & Computer Security, accepted with revision.
Hagen, J.M., Nordøen, L.M. & Halvorsen, E.E. 2007. Tilsynsmetodikk og måling av informasjonssikkerhet i finans og kraftsektoren. In Norwegian. [Audit tool and measurement of information security in the finance and power sector.] FFI/Rapport-2007/00880.
Hagen, J.M. 2007. Evaluating applied information security measures. An analysis of the data from the Norwegian Computer Crime survey 2006. FFI-report-2007/02558.
Hole, K.J., Moen, V. & Tjøstheim, T. 2006. Case study: Online Banking Security. IEEE Privacy and Security, March/April 2006.
Kredittilsynet (The Financial Supervisory Authority of Norway). Risiko- og sårbarhetsanalyse (ROS) 2004. In Norwegian. [Risk and vulnerability analysis 2004.]
Lobree, B.A. 2002. Impact of legislation on Information Security Management. Security Magazine Practices, November/December 2002: 41–48.
NS-EN ISO 19011. Retningslinjer for revisjon av systemer for kvalitet og/eller miljøstyring. [Guidelines for auditing systems for quality management and environmental management.]
Nystuen, K.O. & Fridheim, H. 2007. Sikkerhet og sårbarhet i elektroniske samfunnsinfrastrukturer—refleksjoner rundt regulering av tiltak. [Secure and vulnerable electronic infrastructures—reflections about regulations and security measures.] FFI-Report 2007/00941.
Nystuen, K.O. & Hagen, J.M. 2003. Critical Information Infrastructure Protection in Norway. CIP Workshop, Informatik, Frankfurt a.M., 29.09–02.10.03, 2003.
Simonds, R.H. & Shafai-Sharai, Y. 1977. Factors Apparently Affecting Injury Frequency in Eleven Matched Pairs of Companies. Journal of Safety Research 9(3): 120–127.
Simonds, R.H. 1973. OSHA Compliance ''Safety is good business''. Personnel, July-August 1973: 30–38.
Sundt, C. 2006. Information Security and the Law. Information Security Technical Report, 11(1): 2–9.
Williams, P. 2007. Executive and board roles in information security. Network Security, August 2007: 11–14.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

The unintended consequences of risk regulation

B.H. MacGillivray, R.E. Alcock & J.S. Busby


Department of Management Science, Lancaster University, UK

ABSTRACT: Countervailing risks have important implications for risk regulation, as they suggest that gains in
environmental and human health may be coming at a significant cost elsewhere, or even that in some situations,
our cures may be doing more harm than good. This paper expands the prevailing explanation of why risk
reducing measures frequently lead to harmful side-effects which render them uneconomic or even perverse.
Our expansion is three-fold: we highlight how confirmation bias and various pathologies of complex problem
solving may constrain regulators from foreseeing or even considering the harmful side-effects of proposed risk
reduction measures; argue that limited incentives and capacities for regulatory learning constrain the detection
and correction of harmful side-effects of interventions; and contend that the adversarial nature characterising
many risk conflicts systematically gives rise to perverse trade-offs. We conclude that adaptive, stakeholder-based
forms of regulation are best positioned to reform these pathologies of the regulatory state, and to produce a form
of governance best suited for recognising and addressing the need to make risk trade-offs in what are often highly
charged, contested situations.

1 A BRIEF HISTORY OF UNINTENDED CONSEQUENCES

In prehistoric societies, man's tasks were largely ephemeral problems having no significance beyond themselves. He collected firewood, hunted prey, forged tools, and sought mates. The low complexity of both the tasks and the social structures within which they lay meant that they could be pursued on an ad hoc basis, and only rarely did the need to view problems as being embedded within other problems arise (Dörner, 1996). Passing through early human history, man became more dependent on others and aware of his influence on the environment, and developed rudimentary economic and political systems to manage these interdependences. Initially, small-scale sufficiency economies and councils of elders were sufficient to coordinate the limited division of labour and facilitate an understanding of causal relationships between human activities and the natural environment (Hofstetter et al., 2002). However, as specialisation and differentiation grew, technology and industrialisation advanced, population burgeoned, and economic and social systems became increasingly intertwined, complexity has come to characterise the modern world. We now must deal with a series of closely, and often subtly, related problems, with the consequence that our approach to problem solving increasingly requires attention to interdependencies of social and natural systems, and an awareness of the often latent second and third-order effects of decisions and actions (Dörner, 1996). For various reasons, we seem ill-suited to this task.

This incongruity has led some of our most prominent social theorists and philosophers to focus on the unforeseen and unintended consequences of human actions, in contexts ranging from economics, to politics, to social policy (e.g. Popper, 1961; Merton, 1936; Hayek, 1973). The driving impetus was the idea that unintended consequences should be a central concern of the social sciences, as they constrain the ability to predict and therefore control the consequences of social interventions. In the past generation, the baton has been passed to scholars of risk regulation, a small but influential group of whom have focussed on the harmful, unintended consequences of risk reducing measures in the public and environmental health arenas (e.g. Graham and Wiener, 1996; Sunstein, 1990; Viscusi, 1996). For example, airbags may protect adults but kill children; gas mileage standards may protect the environment at the cost of thousands of lives annually, as they encourage manufacturers to sacrifice sturdiness for fuel-efficiency; drug-lags stemming from stringent testing requirements may protect the public from potentially adverse effects of un-tested pharmaceuticals, whilst at the same time diminishing the health of those who urgently need them; bans on carcinogens in food-additives may lead consumers to use non-carcinogenic products which nevertheless carry even greater health risks, and so on.

These countervailing risks have important implications for risk regulation, as they suggest that recent gains in environmental and human health may be coming at a significant cost elsewhere, or even that in some situations, our cures may be doing more harm than good. This paper expands the prevailing explanation of why risk reducing measures frequently lead to harmful side-effects (e.g. Graham and Wiener, 1995; Wiener, 1998), and argues that stakeholder-based forms of adaptive management are the most effective regulatory arrangements for making optimal risk trade-offs in what are often highly charged, contested situations requiring tragic choices to be made under conditions of ignorance.

2 PERVERSE RISK TRADE-OFFS: THE PREVAILING EXPLANATION

We shall briefly summarise the dominant explanation of why externalities arise from risk reducing measures, most prominently associated with Professors Wiener and Graham (see, e.g. Graham and Wiener, 1995; Wiener, 1998). In short, their argument is that the fundamental reason that risk reduction measures sometimes create countervailing risks is the interconnectedness of multiple risks, derived in turn from the interconnectedness of social and environmental systems. Of course, countervailing risks are not problematic for risk regulation per se; difficulties arise when, on balance, the trade-off between target and countervailing risks serves to render a regulation an inefficient allocation of resources, or, more dramatically, perverse in the sense that we are left with a net diminution of environmental or public health. And so, their argument goes, we must think systematically when considering risk reduction measures. Based on a series of detailed case studies, these scholars show that such systematic thinking is far from the norm in risk regulation, and argue that perverse or uneconomic trade-offs routinely arise due to the systematic failure of regulatory bodies to account for the full consequences of their interventions in a rational manner. This failure is explained in terms of psychology and with reference to some perverse aspects of regulatory

skew the balancing of target and countervailing risks away from normative ideals. They also draw attention to risk compensation, wherein the reductions in risk sought by regulatory interventions are partly offset by behavioural changes which may even shift the nature of the risk or the group who bears its burden. For example, safer car designs may encourage faster or more reckless driving, thus leading to a lower than expected decrease in risk to the driving public, and a net increase of risk to pedestrians. The scholars argue that these compensations are frequently unaccounted for in regulatory decision making.

From the perspective of regulatory law and politics, Graham and Wiener invoke public choice theory to explain why regulatory decision-making often departs from the norm of a balanced, deliberative consideration of the public good towards satisfying the vested interests of the more vocal interest groups, often with the effect of offloading countervailing harms on those whose voices were omitted from the regulatory process. Similarly, the scholars hold that jurisdictional specialisation and fragmentation of the regulatory state leave little incentive for regulatory agencies to monitor or even consider the harmful effects of their regulations on other domains (e.g. across geographical boundaries, hazard types, and media, etc.).

To summarise Graham and Wiener's view, then, externalities may arise from risk reduction measures because environmental and public health problems are not hermetically sealed, and so unintended, undesirable side-effects of interventions often arise. The trade-offs between countervailing and target risks are often uneconomic or perverse because of legal, political and psychological factors which prevent regulators from thinking rationally and synoptically about their interventions. Broadly speaking, we find much truth in this explanation. However, we suggest that the distorting effect of the availability heuristic is likely overstated, given that subjective judgements of probability are relatively minor components of overall risk perceptions (see, e.g. Sjöberg, 2000). But this is a minor quibble. Our broader argument is that there are additional factors at work in distorting risk trade-offs, which we expand on in the following section.
law and politics.
3 EXTENDING THE EXPLANATION
Graham and Wiener’s psychological angle posits
that the use of heuristics, chiefly availability, to make
3.1 The psychological perspective
probabilistic judgements leads to systemic error in
risk perception and by extension erroneous regulatory In addition to the errors which arise from heuristic
responses to risk. Similarly, cognitive biases, such as judgements, and other, unrelated, cognitive biases, we
loss aversion, are held to explain conservative tenden- can finger various pathologies of complex problem
cies in both risk assessment (e.g. default conservative solving as being plausible suspects in the genera-
assumptions) and risk management (e.g. the precau- tion of perverse risk trade-offs. In doing so, we
tionary principle; over-regulation of risks from new draw upon the dynamic decision making research pro-
sources vs. old sources), which are argued to further gramme (e.g. Dörner, 1996; Brehmer, 1992) which

416
explores individual and group decision making in simulated environments defined by the characteristics of complexity, opaqueness, and dynamism (e.g. natural resource management). Given these characteristics, such simulations are largely analogous to the dilemmas facing regulators tasked with protecting public and environmental health.

Unsurprisingly, this programme has revealed that people generally struggle to deal with complex problems; however, the errors committed by the participants were far from random, instead being indicative of general weaknesses or pathologies in reasoning and perception when dealing with complex, opaque, dynamic systems. Those of greatest relevance to us are (Dörner, 1996; Brehmer, 1992): the common failure to anticipate the side-effects and long-term repercussions of decisions taken; the tendency to assume that an absence of immediate negative effects following system interventions serves as validation of the action taken; and the habit of paying little heed to emerging needs and changes in a situation, arising from over-involvement in subsets of problems. Rather than appreciating that they were dealing with systems composed of many interrelated elements, participants all too often viewed their task as dealing with a sequence of independent problems (Dörner, 1996; Brehmer, 1992).

One conclusion that can be drawn from this is that people's mental models often fail to incorporate all aspects of complex tasks, and so making inferences about side-effects and latent consequences of actions is often a bridge too far (Brehmer, 1992). Although we are to some extent stuck with the limitations of our cognitive capacities, it is worth bearing in mind that mental models are not solely internal psychological constructs, but are to an extent socially constructed. It thus seems reasonable to assume that, in the context of risk regulation, the incorporation of a broad range of values, interests, and perspectives within the decision making process would help to counteract the tendency to frame the pros and cons of proposed risk reduction measures in an unduly narrow, isolated manner (i.e. expand regulators' mental models). Moreover, enhancing the mechanisms and incentives for regulatory agencies to monitor and evaluate the impacts of their interventions could ameliorate the tendency to neglect latent outcomes of decisions (i.e. lengthen regulators' mental models). We return to this later.

We hypothesise that a further contributory factor to the phenomenon of perverse risk trade-offs is the well established phenomenon of confirmation bias. This refers to the tendency to seek out and interpret new information in a manner which confirms one's preconceived views and to avoid information and interpretations which question one's prior convictions. The reader may, quite rightly, point out that there is a veritable litany of psychological biases that have been identified, and then ask why we consider this one to be particularly central to the problem of perverse risk trade-offs. The reason is that disputes over proposed public and environmental health measures are often highly charged, emotive issues, given that they concern the protection or sacrifice of the most fundamental values: human life, inter-generational equity, ecological health, and so forth. In such situations, where people commonly perceive moral absolutes as being at stake, said people can raise confirmation bias to an art form. Thus, those drawing attention to the potentially harmful consequences of proposed risk reduction measures may be viewed as mere reactionaries or industry stooges, and the information or data which they use to support their position treated with scepticism or simply ignored. In essence, we propose that confirmation bias leads to the systematic discounting or even neglect of the likelihood and extent of potentially harmful side-effects of proposed risk reduction measures, a clear formula for the generation of perverse risk trade-offs.

3.2 The perspective of regulatory law and politics

Much ink has been spent highlighting the legal and political constraints to optimising those trade-offs inherent to risk reduction measures (e.g. the use of absolute standards; the narrow and oft conflicting mandates and missions of regulatory agencies; laws forbidding the use of cost-benefit and health-health analysis in evaluating certain proposed interventions; the traditional legal framing of environmental conflicts as short-term, zero sum questions of authority, jurisdiction, prohibition and entitlement; see, e.g. Viscusi, 1996; Sunstein, 1990; Graham and Wiener, 1995). Suffice it to say that we question the wisdom of these tendencies towards absolutism and the fragmentation of the regulatory state, problems which we address in our proposals for reform.

However, the debate on the unintended consequences of risk regulation has focussed on how such legal and political factors constrain regulatory agencies from accounting for the potentially harmful side-effects of their interventions before they are enacted; little thought has been given to the detection and correction of these side-effects post implementation. This is curious, as the limited incentives and capacities of regulatory agencies to collect feedback and use this for error correction have been bemoaned by numerous scholars (e.g. Dryzek, 1987a; Weale, 1992). These limitations seem particularly problematic when one considers the radical uncertainty characterising much of the scientific and technical information underpinning regulatory decisions, as well as the inherent dynamism and stochastic nature of social and natural environments. And so proposals for regulatory reform
that seek to address the problem of perverse risk trade-offs should look with one eye towards error prevention, whilst casting the other towards error detection and correction.

3.3 The sociological perspective

We now turn to consider what a sociological perspective can tell us about the phenomenon of perverse risk trade-offs. Our central point is that the polarised nature of debates over proposed regulations, both in the public arena and within the lobbying process, is a key precondition for the generation of perverse trade-offs. For simplicity, consider the case where a regulatory measure for restricting the use of brominated flame retardants is proposed. Paradigmatically, we would see various groups lobbying the architects of the regulatory process (i.e. administrators, bureaucrats, legislators): NGOs, industry groups, think tanks, and so forth. Those favouring the proposed measure would of course highlight the forecast gains in environmental and human health arising from a ban, whilst those in opposition will draw attention to the economic costs and potential countervailing risks which it may give rise to (e.g. through reducing the level of protection from fires, or through promoting a shift to potentially hazardous chemical substitutes about which little is known).

At a fundamental level, these contrasting policy preferences arise from what are often sharply contrasting beliefs, values, norms and interests of the disputants. Of course, that people disagree is hardly revelatory. The problem which arises is that the current regulatory state systematically encourages adversarial relations between these groups, through, for example: the legal framing of many environmental conflicts as zero-sum questions of prohibition, authority and jurisdiction (Freeman and Farber, 2005); the frequent reliance on lobbying as a proxy for stakeholder consultations (when the latter occur, they tend to be infrequent and superficial); and the occasional resort to the court for final resolution of disputes. This promotion of competitive rather than co-operative behaviour encourages disputants to exaggerate differences between one another, to ascribe malign intentions to the positions of others, and to simplify conflicts through the formation of crude stereotypes (Fine, 2006; Yaffee, 1997). Thus, in many situations, disputes over risk trade-offs can resemble prisoners' dilemmas, where co-operation could lead to a mutually acceptable solution which balances harm against harm, but a fundamental lack of trust leaves the participants caught in a zero-sum struggle as they fear that any compromise would not be reciprocated.

Moreover, in such adversarial settings the underlying scientific, technical and economic data on which risk trade-offs are ostensibly based is often (mis)used by the various parties to entrench existing positions and to discredit opponents. This is particularly problematic given the objective dearth of our knowledge of many risks to public and environmental health, and the value-laden assumptions underlying any comparison of risk against risk (e.g. what price to put on human life, how to deal with issues such as equity, how to value ecological status, etc.). This relative indeterminacy leaves broad scope for the various disputants to interpret risk analysis outcomes in radically different ways, and, when they have access to power or resources, to shape the very outcomes themselves through adopting different assumptions and data gathering procedures, often within the framework of a defensible scientific methodology. And so even where regulatory agencies are able to formulate a policy which ostensibly balances between the diametrically opposed positions of those disputants (i.e. is able to escape the zero-sum trap), it is often based on highly politicised and value-laden data, meaning that the balancing of harm against harm is illusory.

4 A BRIEF PROPOSAL FOR REFORMING THE REGULATORY STATE

In this final section, we briefly argue that adaptive, stakeholder-based forms of regulation are best positioned to reform these pathologies of the regulatory state, and to produce a form of governance best suited to recognising and addressing the need to make risk trade-offs in what are often highly charged, contested situations. Our discussion is at a general, abstract level, leaving any debate over the finer points of such reforms to scholars of administrative law and regulatory policy.

Stakeholder-based governance refers to an array of practices where a broad cross-section of stakeholders, selected to represent different interests, come together, in person, for long-term dialogue to address policy issues of common concern, overseen by a neutral party which initiates, lubricates and oversees discussions, ensuring that they are governed by rules of reasoned discourse (e.g. ruling out threat, concealment of information, etc.), and in which decisions are made by consensus rather than diktat or majority rule (e.g. Dryzek, 1987b; Innes and Booher, 1999; McDaniels et al., 1999). A growing body of research suggests that stakeholder-based approaches help build trust and render participants less hostile to the views of others, providing the grounds for mutual understandings of stakeholder assumptions, interests, values, norms and perspectives (e.g. Dryzek, 1987b; Innes and Booher, 1999; McDaniels et al., 1999). In short, they create an environment which enables participants to find solutions which accommodate each others' interests without harming their own, or learn to view all
interests as interconnected and thus conceive of disputes as joint problems in which each has a stake (Innes and Booher, 1999). Here, the malign influences of narrow mental models, of confirmation bias, of absolutism, of adversarial relationships, and of the omitted voice (all, in some way, related to absolutism), may be expected to be in large part ameliorated.

Of course, this is not a foolproof solution, and harmful, unintended consequences will still arise from regulatory measures derived from even the most holistic process, in part due to the profound uncertainties characterising many public and environmental health risk dilemmas, and the temporal nature of social values, interests, and perspectives. It is not uncommon for public, governmental or scientific perceptions of the rationality of past risk trade-offs to migrate over time, leading to calls for corrective measures or even a sharp reversal of the path already taken, such that as observers of the decision process we are left with a sense of déjà vu, and a feeling that the governance of risk resembles a Sisyphean challenge. Thus, it is crucial that the regulatory state adopt a more adaptive approach (e.g. McDaniels et al., 1999), in the sense of viewing decision making iteratively, of placing a strong emphasis on the role of feedback to verify the efficacy and efficiency of the policies enacted, and of appreciating the wisdom of learning from successive choices.

5 CONCLUSIONS

To conclude, we have argued that stakeholder-based, adaptive approaches to regulatory governance should reduce the incidence of perverse risk trade-offs through a) integrating a broader range of values, perspectives, interests, and scientific and technical information into regulatory decision making; b) transforming what were previously framed as zero-sum disputes into co-operative searches for mutually acceptable solutions; and c) promoting the early detection and correction of undesirable side-effects of regulatory measures through providing mechanisms and incentives for learning.

ACKNOWLEDGEMENTS

The authors would like to thank the Leverhulme Trust for funding this research.

REFERENCES

Brehmer, B. 1992. Dynamic decision making: human control of complex systems. Acta Psychologica. 81(3): 211–241.
Dörner, D. 1996. Recognizing and avoiding error in complex situations. New York: Metropolitan Books.
Dryzek, J. 1987a. Ecological rationality. Oxford: Blackwell.
Dryzek, J. 1987b. Complexity and rationality in public life. Political Studies. 35(3): 424–442.
Fine, G.A. 2006. The chaining of social problems: solutions and unintended consequences in the age of betrayal. Social Problems. 53(1): 3–17.
Freeman, J. and Farber, D.A. 2005. Modular environmental regulation. Duke Law Journal. 54: 795.
Graham, J.D. and Wiener, J.B. 1995. Risk versus risk: trade-offs in protecting health and the environment. Cambridge, MA: Harvard University Press.
Hayek, F.A. 1973. Law, legislation and liberty. London: Routledge and Kegan Paul.
Hoffstetter, P., Bare, J.C., Hammitt, J.K., Murphy, P.A. and Rice, G.E. 2002. Tools for comparative analysis of alternatives: competing or complementary perspectives? Risk Analysis. 22(5): 833–851.
Innes, J.E. and Booher, D.E. 1999. Consensus building and complex adaptive systems: a framework for evaluating collaborative planning. Journal of the American Planning Association. 65(4): 412–423.
McDaniels, T.L., Gregory, R.S. and Fields, D. 1999. Democratizing risk management: successful public involvement in local water management decisions. Risk Analysis. 19(3): 497–510.
Merton, R.K. 1936. The unanticipated consequences of purposive social action. American Sociological Review. 1(6): 894–904.
Popper, K.R. 1961. The poverty of historicism. London: Routledge and Kegan Paul.
Sjöberg, L. 2000. Factors in risk perception. Risk Analysis. 20(1): 1–12.
Sunstein, C.R. 1990. Paradoxes of the regulatory state. University of Chicago Law Review. 57(2): 407–441.
Viscusi, W.K. 1996. Regulating the regulators. University of Chicago Law Review. 63(4): 1423–1461.
Weale, A. 1992. The new politics of pollution. Manchester: Manchester University Press.
Wiener, J.B. 1998. Managing the iatrogenic risks of risk management. Risk: Health, Safety and Environment. 9(1): 39–82.
Yaffee, S.L. 1997. Why environmental policy nightmares recur. Conservation Biology. 11(2): 328–337.
Maintenance modelling and optimisation

A hybrid age-based maintenance policy for heterogeneous items

P.A. Scarf
Centre for OR and Applied Statistics, University of Salford, UK

C.A.V. Cavalcante
Federal University of Pernambuco, Brazil

R.W. Dwight
Faculty of Engineering, University of Wollongong, Australia

P. Gordon
Faculty of Engineering, University of Wollongong, Australia

ABSTRACT: This paper considers a hybrid maintenance policy for items from a heterogeneous population.
This class of items consists of several sub-populations that possess different failure modes. There are a substantial
number of papers that deal with appropriate mixed failure distributions for such a population. However, suitable
maintenance policies for these types of items are limited. By supposing that items may be in a defective but
operating state, we consider a policy that is a hybrid of inspection and replacement policies. There are similarities
in this approach with the concept of ‘‘burn-in’’ maintenance. The policy is investigated in the context of traction
motor bearing failures.

1 INTRODUCTION

The occurrence of a mixture in a number of situations is discussed by Jiang and Jardine (2007). For example, a mixture arises in a population of heterogeneous items where some items have a short characteristic life giving rise to early failure and other items have a long characteristic life and are the subject of age-related wear-out. When the time to failure of an item has a mixture distribution, the failure behaviour is not immediately clear, since a mixture can give rise to different types of hazard rate functions and so different failure behaviour (Jiang and Murthy, 1998). In this sense, it is not straightforward to determine a suitable maintenance policy to reduce the early-phase failures and at the same time take account of failures during a wear-out phase. Thus, standard maintenance policies such as age-based replacement or regular inspection may not be the most appropriate policies for such items.

Inspection policies are concerned with determining the state of the item inspected and then maintaining accordingly. Many articles deal with this problem (Valdez-Florez and Feldman, 1989). The standard inspection policy checks an operating unit at successive times Tk (k = 1, 2, . . .). The most common assumption is that any failure is detected at the next inspection or check and is corrected immediately (Nakagawa, 2005). A derivative approach is based on delay time analysis. The delay time is the time lapse from when a system defect could first be noticed until the time when its repair can no longer be delayed because unacceptable consequences such as a serious catastrophe might arise due to failure (Christer, 1999). The delay time is essentially the time between defect arrival and failure due to this defect. In delay-time inspection models, the focus is both on defects and on failure. The majority of the inspection policy approaches only take account of the periodicity of state verification over the life of the item or some alternative planning horizon.

In age-based replacement, an item is replaced at failure and also preventively replaced when the item reaches age T. A block replacement policy replaces an item on failure and at times kΔ, k = 1, 2, . . ., regardless of the age of the item (Barlow and Proschan, 1965). The presumption here is that an item is located in a socket (in a system) and together the item and socket perform some operational function within the system (Ascher and Feingold, 1984). These models are appropriate for items that possess an increasing failure (hazard) rate function.
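For a single lifetime distribution, the long-run cost per unit time of the age-based replacement policy just described is the standard renewal-reward ratio [CR(1 − F(T)) + CF·F(T)] / ∫₀ᵀ (1 − F(t)) dt (Barlow and Proschan, 1965). A minimal numerical sketch follows; the Weibull parameters and cost values are illustrative assumptions, not values from this paper:

```python
import numpy as np

def age_based_cost_rate(T, eta, beta, c_r, c_f, n_grid=4000):
    """Long-run cost per unit time of age-based replacement at age T for a
    Weibull lifetime: [c_r*(1-F(T)) + c_f*F(T)] / E[min(lifetime, T)]."""
    t = np.linspace(0.0, T, n_grid)
    S = np.exp(-(t / eta) ** beta)  # survivor function 1 - F(t)
    # trapezoidal integral of S over [0, T] = expected cycle length
    mean_cycle = np.sum(0.5 * (S[1:] + S[:-1]) * np.diff(t))
    return (c_r * S[-1] + c_f * (1.0 - S[-1])) / mean_cycle

# With an increasing hazard (beta > 1), replacing well before the
# characteristic life is cheaper than effectively never replacing:
print(age_based_cost_rate(T=9.0, eta=18, beta=5, c_r=1.0, c_f=5.0))
print(age_based_cost_rate(T=100.0, eta=18, beta=5, c_r=1.0, c_f=5.0))
```

Minimising this ratio over T gives the critical replacement age; for a decreasing or non-monotone hazard, as with the mixtures discussed below, a useful finite optimum need not exist, which motivates a different approach for heterogeneous items.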
Items that are heterogeneous, arising from a mixed population, require a different approach to maintenance. Finkelstein and Esaulova (2001b) argue that periodic policies do not take account of systematic changes that occur in the pattern of ageing of items from a mixed population. In this paper, we propose a maintenance policy that is a hybrid of inspection maintenance and age-based replacement. The inspection phase of the policy deals with early failures. The age-based replacement phase deals with normal wear-out.

We will consider items for which failure implies immediate cost-consequences. In this context, for inspection to be viable, we suppose that a defect may arise prior to failure, and that these defects are detectable at inspection. Before developing this model in section 4, we discuss mixtures of failure distributions in general. Section 5 presents a numerical example based on traction motor bearing failure.

Note that the model we propose is related to burn-in maintenance. In this policy, items are subjected to a burn-in test. According to Cha et al. (2004), burn-in is a widely used method to improve the quality of products or systems after they have been produced. The implication is that the population of items produced is heterogeneous and poor quality items (with short operational lives) will be screened during burn-in. This early life screening is analogous to the inspection phase of the hybrid maintenance model that we propose. Burn-in maintenance modelling is concerned with determining, in combination, optima for the burn-in time and the subsequent preventive maintenance policy for the operational phase (Drapella and Kosznik, 2002; Jiang and Jardine, 2007). Our model is different in that both inspection and preventive replacement will be carried out during the operational phase of the item.

2 MIXED FAILURE DISTRIBUTIONS

According to Barlow and Proschan (1965), mixtures of distributions arise naturally in a number of reliability situations. For example, suppose two distinct assembly lines produce units with distributions F1 and F2, F1 ≠ F2. After production, units from both assembly lines flow into a common shipping room, so that outgoing lots consist of a random mixture of the output of the two lines. With this illustrative example, the authors define the mixed distribution as:

FMixed = pF1 + (1 − p)F2,   (1)

where p and 1 − p are the proportions of items from lines 1 and 2 respectively. From equation (1), the probability density function and failure (hazard) rate function are (Jiang and Murthy, 1998):

fMixed(t) = pf1(t) + (1 − p)f2(t),

hMixed(t) = w(t)h1(t) + [1 − w(t)]h2(t),

where

w(t) = p(1 − F1(t))/(1 − FMixed(t)).

The lifetime distribution for a mixture of items from two sub-populations is illustrated in Figure 1. Mixtures of this kind do not necessarily have an increasing failure (hazard) rate function. Some examples that support the last conclusion have been given by several authors (Glaser, 1980; Gupta and Gupta, 1996; Finkelstein and Esaulova, 2001a; Block et al., 2003). Jiang and Murthy (1998) discuss mixtures involving two Weibull distributions. Eight different behaviours for the failure (hazard) rate function of the mixed distribution are evident, depending on the parameter values of the underlying Weibull distributions. For the general case, there are five parameters; the shape of the failure (hazard) rate is dependent on the values of the two shape parameters, β1 and β2, the ratio η2/η1 of the two scale parameters (and not their individual values) and, finally, the mixing parameter p.

Figure 1. Mixed distribution (probability density against age/years) (____). Underlying Weibull distributions, We(η1 = 3, β1 = 2.5) (....) and We(η2 = 18, β2 = 5) (-----), mixing parameter p = 0.1.

Other authors have explored mixed failure distributions through their mean residual life function (Abraham and Nair, 2000; Finkelstein, 2002).

The fitting of Weibull distributions to failure data requires care. It is often the case that the data possess underlying structure that is not immediately apparent due to, for example, inspections, left and right censoring, or heterogeneity. It would be unfortunate to fit a two-parameter Weibull distribution to failures that arise from a mixture, and then adopt an age-based policy for the items based on the fitted two-parameter Weibull, since the implied critical replacement age would be inappropriate for both sub-populations of items (Murthy and Maxwell, 1981).
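The mixture formulas above are straightforward to evaluate directly. A short sketch using the parameters of Figure 1 (We(η1 = 3, β1 = 2.5), We(η2 = 18, β2 = 5), p = 0.1) confirms that the mixed failure (hazard) rate is non-monotone: it rises while the weak sub-population is failing, dips once that sub-population is exhausted, and rises again in the wear-out phase:

```python
import numpy as np

def weibull_pdf(t, eta, beta):
    """Weibull density with characteristic life eta and shape beta."""
    return (beta / eta) * (t / eta) ** (beta - 1) * np.exp(-(t / eta) ** beta)

def weibull_cdf(t, eta, beta):
    return 1.0 - np.exp(-(t / eta) ** beta)

def mixture_hazard(t, p, eta1, beta1, eta2, beta2):
    """h_Mixed(t) = f_Mixed(t) / (1 - F_Mixed(t)); equivalent to the
    w(t)-weighted combination of the component hazards given above."""
    f = p * weibull_pdf(t, eta1, beta1) + (1 - p) * weibull_pdf(t, eta2, beta2)
    F = p * weibull_cdf(t, eta1, beta1) + (1 - p) * weibull_cdf(t, eta2, beta2)
    return f / (1.0 - F)

t = np.linspace(0.1, 15, 300)
h = mixture_hazard(t, p=0.1, eta1=3, beta1=2.5, eta2=18, beta2=5)
dh = np.diff(h)
# Non-monotone: increasing early, decreasing in between, increasing late.
print(dh[0] > 0, (dh < 0).any(), dh[-1] > 0)
```

This is exactly the behaviour that makes a single critical replacement age inappropriate for the mixed population.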
of the fitting of Weibull distributions to data is given 4 THE MODEL
in Jiang and Murthy (1995). In particular, they discuss
the appearance of Weibull plots when the underlying We use the following notation. For an item,
distribution is a three parameter or a mixed Weibull
distribution. • X is the age at defect arrival with probability density
function fX ;
• H is the delay time from defect arrival to failure
3 MAINTENANCE MODELS WITH MIXED with probability density function fH , independent
DISTRIBUTIONS of X ;
• Δ is the interval between inspections;
As indicated above, there have been many contribu- • K is the number of inspections that will take place
tions on the theme of mixed distributions. However, during the inspection phase which has length KΔ;
there has been less discussion of maintenance policies • Y is the age at failure so that Y = X + H ;
for items with mixed failure distributions. Here we • T is the age at preventive replacement (≥KΔ).
review the main contributions in this area.
We assume that:
Murthy and Maxell (1981) proposed two types
of age-replacement policy for a mixture of items • fX (x) = pf1 (x) + (1 − p)f2 (x) where pis the mixing
from two sub-populations, 1 and 2. In policy I, it is parameter and f1 (x) and f2 (x) follow Weibull dis-
assumed that the lifetime distribution of items for each tributions with characteristic lives η1 , η2 and shape
sub-population is known, but it is not known if parameters β1 , β2 .
an operating item is of type 1 or 2. In the pol- • Inspections are perfect in that defects present will
icy II, it is assumed that the decision maker can, be identified;
by some test that costs α per item, determine if • Defective items are replaced at inspection instan-
an item is of type 1 or 2, and then subsequently taneously and the average cost of replacement of a
replace items from sub-population 1 at age T1∗ or defective item is CR ;
at failure, and replace items from sub-population • Failed items are immediately apparent, cause oper-
2 at age T2∗ or at failure, which ever occurs ational failure of the system, and are replaced
first. Finkelstein (2004) proposed a minimal repair instantaneously with average cost CF ;
model generalized to the case when the lifetime • At the critical replacement age, T , preventive
distribution function is a continuous or a discrete replacement of an item is instantaneous and again
mixture of distributions. As was stated in the intro- costs CR < CF ;
duction, some adaptations of burn-in models have • The cost of inspection is CI < CR ;
being proposed by Drapella and Kosznik (2002) and • On replacement of items (whatever the state of the
Jiang and Jardine (2007). The objective of these replaced item), the system is restored to the as-new
models is to find optimum solutions for a combined state.
burn-in-replacement policy, in order to take into con-
sideration the change in ageing behaviour of items from a heterogeneous population, represented by a mixed distribution. The cost advantage of the combined policy over the separate policies of burn-in and replacement is quite small. The combined policy is also much more complex, and therefore it is difficult to decide if the combined policy is superior (Drapella and Kosznik, 2002). Jiang and Jardine (2007) argue that preventive replacement is more effective in the combined policy.

The burn-in process is not always suitable since a long burn-in period may be impractical. Then, the early operational failure of short-lived items may become a possibility. For this reason, in the next section, we propose a model that is similar to a combined burn-in-replacement policy, but instead of a burn-in phase we propose a phase of inspection to detect the weak items. This can then accommodate, during operation, the evolution of the ageing behaviour of items from a heterogeneous population. In this way, we assume that the population of items consists of two sub-populations, one weak with a shorter lifetime and one strong with a longer lifetime, and failures are anticipated by a defective state. In order to reduce early failures, all items are inspected with frequency 1/Δ during the inspection phase up to age KΔ. An item will be replaced at the time of the ith inspection if it is defective; it will be replaced on failure; it will be replaced preventively at T if it survives to this age. The objective of inspection is to prevent early-life failures of weaker items. Inspections act as a natural, operational burn-in process, since the weak items will fail much earlier than strong items. The objective of preventive replacement, which takes place over a much longer time-scale, is to reduce wear-out failures in later life.

The failure model has three states: good, defective, failed. In the good and defective states, the system is operational. The notion of a defective state allows us to model inspections: if the delay-time is zero (two-state failure model), then, given our assumption that item
failures lead to immediate operational failure, inspection is futile. Note that the model might be extended to consider a mixed population of delay-times, a proportion of which are zero (Christer and Wang, 1995). This effectively relaxes the perfect inspection assumption because it implies that a proportion of failures cannot be prevented by inspection. We do not consider this model here, however.

The decision variables in the model are K, T and Δ. K and T are age-related, so that on replacement, the inspection phase begins again. Thus, the maintenance model is analogous to age-based replacement. The as-new replacement assumption implies that we can use the renewal-reward theorem and hence the long-run cost per unit time as objective function.

Within this framework, the length of a renewal cycle (time between renewals), V, can take different values, and

V = iΔ    (2a)

if (i − 1)Δ < X < iΔ ∩ X + H > iΔ, (i = 1, . . . , K). Thus, for example, V = Δ, and the item is replaced at first inspection, if X < Δ ∩ X + H > Δ. Also,

V = Y,  (i − 1)Δ < Y < iΔ,    (2b)

if X > (i − 1)Δ ∩ X + H < iΔ, (i = 1, . . . , K);

V = Y,  KΔ < Y < T,    (2c)

if X > KΔ ∩ X + H < T; and

V = T  if X > KΔ ∩ X + H > T.    (2d)

The cost incurred per cycle, U, is given by

U =
  iCI + CR          if V = iΔ, i = 1, . . . , K,
  (i − 1)CI + CF    if (i − 1)Δ < V < iΔ, i = 1, . . . , K,
  KCI + CF          if KΔ < V < T,
  KCI + CR          if V = T.    (3)

Developing equations (2a–d), we have that when K > 0 the expected renewal cycle length is

E(V) = Σ_{i=1}^{K} iΔ ∫_{(i−1)Δ}^{iΔ} (1 − FH(iΔ − x)) fX(x) dx

     + Σ_{i=1}^{K} ∫_{(i−1)Δ}^{iΔ} ∫_{0}^{iΔ−x} (x + h) fH(h) fX(x) dh dx

     + ∫_{KΔ}^{T} ∫_{0}^{T−x} (x + h) fH(h) fX(x) dh dx

     + T [ ∫_{KΔ}^{T} (1 − FH(T − x)) fX(x) dx + ∫_{T}^{∞} fX(x) dx ].

In the same way, developing equation (3), we have that when K > 0 the expected cost per cycle is

E(U) = Σ_{i=1}^{K} (iCI + CR) ∫_{(i−1)Δ}^{iΔ} (1 − FH(iΔ − x)) fX(x) dx

     + Σ_{i=1}^{K} [(i − 1)CI + CF] ∫_{(i−1)Δ}^{iΔ} FH(iΔ − x) fX(x) dx

     + (KCI + CF) ∫_{KΔ}^{T} FH(T − x) fX(x) dx

     + (KCI + CR) [ ∫_{KΔ}^{T} (1 − FH(T − x)) fX(x) dx + ∫_{T}^{∞} fX(x) dx ].

These expressions simplify when K = 0 to

E(V) = ∫_{0}^{T} ∫_{0}^{T−x} (x + h) fH(h) fX(x) dh dx + T [ ∫_{0}^{T} (1 − FH(T − x)) fX(x) dx + ∫_{T}^{∞} fX(x) dx ],

and

E(U) = CF ∫_{0}^{T} FH(T − x) fX(x) dx + CR [ ∫_{0}^{T} (1 − FH(T − x)) fX(x) dx + ∫_{T}^{∞} fX(x) dx ].
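As a concrete check of these cycle-length and cycle-cost expressions, the long-run cost rate E(U)/E(V) can be evaluated numerically. The sketch below is our own illustration, not the authors' code: the grid-based trapezoidal integration, the function names and the default costs are all assumptions; the distributional parameters mirror the numerical example later in the paper (Weibull-mixture time to defect, exponential delay time).

```python
import numpy as np

def trap(ys, xs):
    """Trapezoidal rule on sampled values."""
    ys, xs = np.asarray(ys, float), np.asarray(xs, float)
    return float(np.sum((ys[1:] + ys[:-1]) * np.diff(xs)) / 2.0)

def weibull_pdf(x, beta, eta):
    """Weibull density with shape beta and characteristic life eta."""
    x = np.asarray(x, dtype=float)
    return (beta / eta) * (x / eta) ** (beta - 1.0) * np.exp(-(x / eta) ** beta)

# Mixed population: time to defect X, exponential delay time H
p, b1, e1, b2, e2, lam = 0.10, 2.5, 3.0, 5.0, 18.0, 2.0  # lam = 1/mean(H)

def fX(x):  # mixture density of the time to defect X
    return p * weibull_pdf(x, b1, e1) + (1.0 - p) * weibull_pdf(x, b2, e2)

def FH(h):  # exponential delay-time distribution function
    return 1.0 - np.exp(-lam * np.maximum(h, 0.0))

def fH(h):  # exponential delay-time density
    return lam * np.exp(-lam * np.maximum(h, 0.0))

def seg(a, b, f, n=201):
    """Integral of f over [a, b] on a grid."""
    xs = np.linspace(a, b, n)
    return trap(f(xs), xs)

def seg_xh(a, b, upper, n=201):
    """Integral over [a, b] of int_0^{upper(x)} (x + h) fH(h) fX(x) dh dx."""
    xs = np.linspace(a, b, n)
    vals = []
    for x in xs:
        hs = np.linspace(0.0, max(upper(x), 1e-12), n)
        vals.append(trap((x + hs) * fH(hs), hs) * fX(x))
    return trap(vals, xs)

def cost_rate(K, Delta, T, CI=0.05, CR=1.0, CF=10.0):
    """Long-run cost per unit time E(U)/E(V) for the hybrid policy (K > 0)."""
    EV = EU = 0.0
    for i in range(1, K + 1):
        a, b = (i - 1) * Delta, i * Delta
        surv = seg(a, b, lambda x: (1.0 - FH(i * Delta - x)) * fX(x))
        fail = seg(a, b, lambda x: FH(i * Delta - x) * fX(x))
        EV += i * Delta * surv + seg_xh(a, b, lambda x: i * Delta - x)
        EU += (i * CI + CR) * surv + ((i - 1) * CI + CF) * fail
    surv_T = seg(K * Delta, T, lambda x: (1.0 - FH(T - x)) * fX(x))
    fail_T = seg(K * Delta, T, lambda x: FH(T - x) * fX(x))
    tail = 1.0 - seg(0.0, T, fX)  # P(X > T): no defect arrives before T
    EV += seg_xh(K * Delta, T, lambda x: T - x) + T * (surv_T + tail)
    EU += (K * CI + CF) * fail_T + (K * CI + CR) * (surv_T + tail)
    return EU / EV

print(cost_rate(5, 0.85, 10.25))
```

For the baseline decision variables of the numerical example (K = 5, Δ = 0.85, T = 10.25), this grid approximation should land near the reported optimum C∞* ≈ 0.224, subject to discretization error.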
Expressions for E(U) and E(V) in the case K = 0 could also be derived by noting that the hybrid policy reduces to age-based replacement with critical age T. Then we have that E(V) = ∫_{0}^{T} (1 − FY(y)) dy and E(U) = CF FY(T) + CR (1 − FY(T)), where Y = X + H and so FY(y) = ∫_{0}^{y} FH(y − x) fX(x) dx.

Using the above expressions for E(U) and E(V), the optimal hybrid policy of inspection up to age KΔ and replacement at age T for items from a heterogeneous population can be determined by minimizing the long-run cost per unit time:

C∞(T, Δ, K) = E(U) / E(V)

with respect to K, Δ and T.

It is interesting to note that the hybrid inspection and replacement policies that have been developed to date (e.g. Wang and Christer, 2003) are based on the notion of an increasing defect arrival rate and minimal repair. For these policies, inspections will tend to be carried out with increasing frequency as the item reaches the critical age for replacement. This is in direct contrast to the policy developed in this paper.

5 NUMERICAL EXAMPLE

Scarf et al. (2005) consider the lifetimes of bearings in 375V d.c. traction motors used by a commuter railway, and investigated preventive replacement policies for these bearings. The railway company uses 2296 traction motors (typically 32 per train), and over a period of study observed 39 bearing failures. These are shown in a Weibull plot (Fig. 2).

[Figure 2. Weibull plot for traction motor failures (axes: loglog{1/(1 − i/N)} against time to failure/years). For a single population of items, this implies β ≈ 2.5 and η ≈ 30 years.]

[Figure 3. Histogram of bearing failure times for 39 traction motors (frequency against time to failure/years).]

A histogram of bearing failure times is shown in Figure 3. It is plausible that these failures are early-life failures of items (here bearings) that arise from a mixed population. Furthermore, inspections and replacement may account for the reduced number of failures at ages 1, 2, 3, and 4, so that the mixing parameter is somewhat more than 39/2296. Plausible values may be in the range p = 0.04 to p = 0.10. We will assume a mixture with two sub-populations of bearings, the weak items with a Weibull distribution of time to defect (time in good state) with characteristic life η1 = 3 (years) and shape parameter β1 = 2.5, and the strong items with a Weibull distribution of time to defect (time in good state) with characteristic life η2 = 18 and shape parameter β2 = 5. Note, the parameters of the distribution of defect arrival time for the strong items are not based on the data here. As motors were preventively replaced at 7 years, we would expect to see only very few failures of long-lived (strong) items. An exponential distribution was arbitrarily chosen for the delay times, with mean λ = 1/2 (year). As the time to failure is the sum of the time to defective state and the delay time, the actual failure time distribution will have a characteristic life of approximately 3.5 years. The variance of time to failure will of course be larger than that of the underlying time to defect distribution.

A hybrid maintenance policy for such a heterogeneous population of bearings is now investigated. For the purpose of this investigation, we take the preventive replacement cost as our unit of cost (CR = 1), and consider a range of inspection and failure replacement costs.

The results are shown in Table 1. As can be observed there, when the sub-populations are well separated (large β2 or small η1), a hybrid policy with K*Δ* < T* is optimal. Two phases are then apparent—the early inspection phase, and the later non-inspection
Table 1. Optimum hybrid policy for various values of cost parameters and failure model parameters. Long-run cost per unit time of the optimum policy is C∞*. Unit cost equal to the cost of preventive replacement (CR = 1). The time unit here is taken to be one year, although this is arbitrary.

β1   η1   β2   η2   p     λ     CI     CF   C∞*    T*     Δ*    K*
2.5  3    5    18   0.1   0.5   0.05   10   0.224  10.25  0.85  5
2    3    5    18   0.1   0.5   0.05   10   0.225  10.25  0.87  5
3    3    5    18   0.1   0.5   0.05   10   0.222  10.25  0.83  5
2.5  2    5    18   0.1   0.5   0.05   10   0.216  10.25  0.60  5
2.5  4    5    18   0.1   0.5   0.05   10   0.228  10.81  1.44  7
2.5  3    4    18   0.1   0.5   0.05   10   0.253  10.45  0.99  10
2.5  3    3.5  18   0.1   0.5   0.05   10   0.269  10.23  0.88  11
2.5  3    3    18   0.1   0.5   0.05   10   0.288  10.21  0.75  13
2.5  3    5    12   0.1   0.5   0.05   10   0.313  7.51   0.78  9
2.5  3    5    24   0.1   0.5   0.05   10   0.171  13.67  0.84  5
2.5  3    5    18   0.1   0.25  0.05   10   0.243  10.29  2.74  1
2.5  3    5    18   0.1   1     0.05   10   0.198  11.44  1.33  8
2.5  3    5    18   0.05  0.5   0.05   10   0.185  9.86   2.96  1
2.5  3    5    18   0.15  0.5   0.05   10   0.254  10.64  0.64  7
2.5  3    5    18   0.1   0.5   0.025  10   0.202  11.25  0.68  16
2.5  3    5    18   0.1   0.5   0.075  10   0.233  10.40  1.75  2
2.5  3    5    18   0.1   0.5   0.05   2    0.103  14.56  0     0
2.5  3    5    18   0.1   0.5   0.05   20   0.310  9.33   0.48  10
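The mixture assumption behind these results can be checked by simulation. The following Monte Carlo sketch is our own illustration (the variable names and random seed are arbitrary), using the mixture parameters quoted in the text: the failure time is the Weibull defect time plus an exponential delay.

```python
import numpy as np

# Mixture model from the example: time to defect X ~ Weibull, delay H ~ Exp.
rng = np.random.default_rng(1)
n = 200_000
p = 0.10                       # mixing parameter (fraction of weak items)
beta1, eta1 = 2.5, 3.0         # weak sub-population (years)
beta2, eta2 = 5.0, 18.0        # strong sub-population (years)
mean_delay = 0.5               # mean delay time (years)

weak = rng.random(n) < p
beta = np.where(weak, beta1, beta2)
eta = np.where(weak, eta1, eta2)
defect_time = eta * rng.weibull(beta)                 # time in the good state
failure_time = defect_time + rng.exponential(mean_delay, n)

# Mean failure time of weak items: eta1*Gamma(1 + 1/beta1) + mean_delay ~ 3.16 y
print(failure_time[weak].mean())
# Fraction of strong items failing before the 7-year preventive replacement age
print((failure_time[~weak] < 7.0).mean())
```

The second fraction comes out very small (under about 1%), consistent with the remark that few failures of long-lived items would be observed before preventive replacement at 7 years.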

wear-out phase. As β2 becomes smaller and the sub-populations are less distinct, then K*Δ* ≈ T*, and it is optimal to inspect over the entire life. When the cost of inspection is varied, the optimum policy behaves as expected—lower inspection costs lead to more inspections. Also, a longer mean delay time leads to more inspections and vice versa, implying that inspections are only effective if there is sufficient delay between defect arrival and consequent failure. The effect of varying the mixing parameter can also be observed in this table.

6 CONCLUSION

In this paper, a hybrid maintenance policy for items from a heterogeneous population is proposed. A limited number of maintenance models for these kinds of items have been developed to date. In particular, we consider items that arise from a mixture of two sub-populations. The first sub-population represents weak or low-quality items (or possibly poor installation of items), the second stronger, more long-lived items. The concepts of delay-time in inspection maintenance and age-based replacement in preventive maintenance are combined in a hybrid policy that mitigates early failures of weak items and extends the age at preventive replacement of strong items. The behaviour of the policy is investigated for various values of the parameters of the underlying mixture and costs of inspection, preventive replacement, and failure replacement. This behaviour is as might be anticipated—where the cost of failure is large or the proportion of weak items in the mixed population is large, regular inspection in early life is recommended.

The similarity of the model to combined burn-in-replacement policies is discussed. The hybrid maintenance policy can be generalized and extended. Hybrid inspection and block replacement policies may be developed in a similar manner, although the calculation of the long-run cost per unit time will be more difficult. Extensions to repairable systems could also be considered.

ACKNOWLEDGEMENTS

The work presented here was carried out during the visit of the second author to the University of Salford. This visit was supported by CAPES (the Brazilian National Mobility Programme), grant number 1045/07-5.

REFERENCES

Abraham, B. & Nair, N.U. 2000. On characterizing mixtures of some life distributions. Statistical Papers 42(3), 387–393.
Ascher, H. & Feingold, H. 1984. Repairable Systems Reliability. New York: Marcel Dekker.
Barlow, R.E. & Proschan, F. 1965. Mathematical Theory of Reliability. New York: Wiley.
Block, H.W., Savits, T.H. & Wondmagegnehu, E.T. 2003. Mixtures of distributions with increasing linear failure rate. Journal of Applied Probability 40(2), 485–504.
Cha, J.H., Lee, S. & Mi, J. 2004. Bounding the optimal burn-in time for a system with two types of failure. Naval Research Logistics 51(8), 1090–1101.
Christer, A.H. 1999. Developments in delay time analysis for modeling plant maintenance. Journal of the Operational Research Society 50, 1120–1137.
Christer, A.H. & Wang, W. 1995. A delay-time based maintenance model for a multi-component system. IMA Journal of Management Mathematics 6(2), 205–222.
Drapella, A. & Kosznik, S. 2002. A short communication combining preventive replacement and burn-in procedures. Quality and Reliability Engineering International 18(5), 423–427.
Finkelstein, M.S. 2002. On the shape of the mean residual lifetime function. Applied Stochastic Models in Business and Industry 18(2), 135–146.
Finkelstein, M.S. 2004. Minimal repair in heterogeneous populations. Journal of Applied Probability 41(1), 281–286.
Finkelstein, M.S. & Esaulova, V. 2001a. On an inverse problem in mixture failure rates modelling. Applied Stochastic Models in Business and Industry 17(2), 221–229.
Finkelstein, M.S. & Esaulova, V. 2001b. Why the mixture failure rate decreases. Reliability Engineering and System Safety 71(2), 173–177.
Glaser, R.E. 1980. Bathtub and related failure rate characterizations. Journal of the American Statistical Association 75(371), 667–672.
Gupta, P.L. & Gupta, R.C. 1996. Ageing characteristics of the Weibull mixtures. Probability in the Engineering and Informational Sciences 10(4), 591–600.
Jiang, R. & Murthy, D.N.P. 1995. Modeling failure-data by mixture of 2 Weibull distributions: a graphical approach. IEEE Transactions on Reliability 44(3), 477–488.
Jiang, R. & Murthy, D.N.P. 1998. Mixtures of Weibull distributions—parametric characterization of failure rate function. Applied Stochastic Models and Data Analysis 14(1), 47–65.
Jiang, R. & Jardine, A.K.S. 2007. An optimal burn-in preventive-replacement model associated with a mixture distribution. Quality and Reliability Engineering International 23(1), 83–93.
Murthy, D.N.P. & Maxwell, M.R. 1981. Optimal age replacement policies for items from a mixture. IEEE Transactions on Reliability 30(2), 169–170.
Nakagawa, T. 2005. Maintenance Theory of Reliability. London: Springer.
Scarf, P.A., Dwight, R. & Al-Musrati, A. 2005. On reliability criteria and the implied cost of failure for a maintained component. Reliability Engineering and System Safety 89(2), 199–207.
Valdez-Florez, C. & Feldman, R.M. 1989. A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Naval Research Logistics 36(4), 419–446.
Wang, W. & Christer, A.H. 2003. Solution algorithms for a nonhomogeneous multi-component inspection model. Computers & Operations Research 30(1), 19–34.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

A stochastic process model for computing the cost of a condition-based maintenance plan

J.A.M. van der Weide
Department of Applied Mathematics, Delft University of Technology, Delft, The Netherlands

M.D. Pandey
Department of Civil Engineering, University of Waterloo, Waterloo, Canada

J.M. van Noortwijk
HKV Consultants, Lelystad, The Netherlands
Department of Applied Mathematics, Delft University of Technology, Delft, The Netherlands

ABSTRACT: This paper investigates the reliability of a structure that suffers damage due to shocks arriving randomly in time. The damage process is cumulative, being a sum of random damage increments due to shocks. In structural engineering, failure is typically defined as an event in which the structure's resistance, due to deterioration, drops below a threshold resistance that is necessary for the functioning of the structure. The paper models the degradation as a compound point process and formulates a probabilistic approach to compute the time-dependent reliability of the structure. Analytical expressions are derived for costs associated with various condition- and age-based maintenance policies. Examples are presented to illustrate computation of the discounted life-cycle cost associated with different maintenance policies.

1 INTRODUCTION

In industrialized nations, the infrastructure elements critical to the economy, such as bridges, roads, power plants and transmission lines, are experiencing aging-related degradation, which makes them more vulnerable to failure. To minimize the risk associated with failure of an infrastructure system, inspections and replacements are routinely carried out. Because of uncertainty associated with degradation mechanisms, operational loads and limited inspection data, probabilistic methods play a key role in developing cost-effective management models. The theory of stochastic processes and the renewal theorem have been fundamental to the development of risk-based asset management models.

The theory of renewal processes has been discussed in several monographs (Cox 1962, Feller 1957, Smith 1958 and Barlow and Proschan 1965). The compound renewal process involves two dimensions of uncertainty, i.e., the inter-arrival time of the event (shock or damage) and its magnitude are modeled as random variables (Smith 1958). The concept of using the compound renewal process for modeling degradation was summarized in Barlow and Proschan (1965). The analytical formulation of the compound homogeneous Poisson process is simpler (Mercer and Smith 1959). Later this model was generalized to the compound non-homogeneous Poisson process by Nakagawa and his co-workers (Nakagawa 2005, Nakagawa and Mizutani 2007). These models are fundamental to developing and optimizing condition-based maintenance strategies (Dekker 1996). The use of the compound renewal process in engineering reliability is very limited in practical applications due to its mathematical complexity and the involved computation associated with the evaluation of convolution integrals. Our experience also suggests that clarity of mathematical derivation is essential to promote practical engineering applications.

The objective of this paper is to present a conceptually simple and intuitive interpretation of renewal processes with applications to modeling condition-based maintenance policies. The proposed derivation is very general and it can be reduced to the special cases of homogeneous or non-homogeneous Poisson shock processes. Expressions are derived to compute life-cycle costs and other parameters of interest such as mean life-cycle length.

The failure of a structure occurs when its resistance, due to deterioration, drops below a threshold resistance that is necessary for adequate functioning of the
structure. This paper investigates the reliability of a structure that suffers damage due to shocks arriving randomly in time. The degradation is assumed to be a cumulative sum of random damage increments due to shocks. In fact, we consider a more general version of the shock model considered in Ito et al. (2005). The paper models degradation as a compound point process and formulates a probabilistic approach to compute the time-dependent reliability of the structure. Analytical expressions are derived for costs associated with various condition- and age-based maintenance policies. Simpler expressions are obtained in the case of the shock process being the Poisson process. Examples are presented to illustrate computation of the life-cycle cost associated with different maintenance policies.

This paper is organized as follows. In section 2, we present a general model of the stochastic degradation process. Section 3 describes a mathematical framework to evaluate the total maintenance cost of a system over a defined planning period. Three types of maintenance policies are given in section 4, and specific results are derived for the non-homogeneous Poisson process and illustrative examples are included. Conclusions are presented in section 5.

2 MODEL OF DEGRADATION PROCESS

2.1 Model

In this paper the degradation process is modeled by a cumulative damage model, where the system suffers damage due to shocks. We model the occurrence times of shocks as the points in a stochastic point process on [0, ∞). For t ≥ 0, we denote the total number of shocks during (0, t] by N(t), N(0) ≡ 0. Define the probability of getting j shocks in (0, t] as

Hj(t) = P(N(t) = j),    (1)

and the expected number of shocks as

R(t) = E(N(t)).    (2)

The probability that the jth shock occurs at time Sj during (0, t] is given as

Fj(t) = P(Sj ≤ t) = P(N(t) ≥ j) = Σ_{i=j}^{∞} Hi(t).    (3)

Consider that the jth shock causes damage of a random amount Yj, and it is assumed that the damage increments Y1, Y2, . . . are iid with the distribution function given as

G(x) = P(Y1 ≤ x).    (4)

It is also assumed that the damage increments Yj and the shock process N are independent. Since damage is additive, the total damage Z(t) at time t is given as

Z(t) = Σ_{j=1}^{N(t)} Yj.    (5)

Using the total probability theorem and independence of the sequence Y1, Y2, . . . and N(t), we can write the distribution of cumulative damage as

P(Z(t) ≤ x) = Σ_{j=0}^{∞} P(Σ_{i=1}^{j} Yi ≤ x, N(t) = j).    (6)

The distribution of the sum of j damage increments is obtained by convolution of G(x), given as

G^(j)(x) = P(Y1 + · · · + Yj ≤ x),    (7)

which can also be expressed as

G^(j+1)(x) = ∫_{0}^{x} G^(j)(x − y) dG(y) = ∫_{0}^{x} G(x − y) dG^(j)(y).    (8)

Note that G^(0)(x) = 1 for x ≥ 0.

Now the distribution of the total damage can be written in a more compact form as

P(Z(t) ≤ x) = Σ_{j=0}^{∞} Hj(t) G^(j)(x).    (9)

Substituting H0(t) = 1 − F1(t) and Hj(t) = Fj(t) − Fj+1(t), j ≥ 1, it can be written as

P(Z(t) ≤ x) = 1 + Σ_{j=1}^{∞} [G^(j)(x) − G^(j−1)(x)] Fj(t).    (10)

Suppose damage exceeding a limit xcr causes a structural failure; this equation can then be used to compute the reliability as P(Z(t) ≤ xcr). This is a fundamental expression in the theory of the compound point process, which can be used to derive probabilities of other events associated with a maintenance policy.
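Equation (9) lends itself directly to computation. As an illustration (our own sketch, not from the paper), take a homogeneous Poisson shock process and exponentially distributed damage increments, for which G^(j) is the Erlang distribution; both the Poisson weights Hj(t) and the Erlang terms are accumulated iteratively for numerical stability:

```python
import math

def erlang_cdf(x, j, mu):
    """G^(j)(x): CDF of the sum of j iid exponential (mean mu) increments."""
    if x < 0.0:
        return 0.0
    if j == 0:
        return 1.0            # G^(0)(x) = 1 for x >= 0
    z = x / mu
    term = math.exp(-z)       # Poisson(z) mass at 0, built up iteratively
    s = 0.0
    for m in range(j):
        s += term
        term *= z / (m + 1)
    return 1.0 - s

def damage_cdf(x, t, rate, mu, jmax=200):
    """P(Z(t) <= x) = sum_j H_j(t) G^(j)(x), eq. (9), truncated at jmax."""
    a = rate * t              # mean number of shocks R(t)
    w = math.exp(-a)          # H_0(t)
    total = 0.0
    for j in range(jmax):
        total += w * erlang_cdf(x, j, mu)
        w *= a / (j + 1)      # H_{j+1}(t) from H_j(t)
    return total

# Time-dependent reliability against a critical damage level x_cr:
x_cr, rate, mu = 10.0, 2.0, 1.0
print([round(damage_cdf(x_cr, t, rate, mu), 4) for t in (1.0, 3.0, 5.0)])
```

The printed sequence decreases with t, as expected: the longer the exposure, the larger the accumulated damage and the lower the reliability.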
2.2 Condition-based maintenance

When the total damage Z(t) exceeds an unacceptable level K, this is referred to as the degradation failure of the structure. The failure prompts a corrective maintenance action (CM) involving the replacement or complete overhaul of the component (as-good-as-new repair). On the other hand, a preventive maintenance action (PM) is triggered when the total damage exceeds a maintenance threshold level k, k < K. For the sake of generality, it is also assumed that the component will be replaced after a certain age T0, whichever occurs first. After a maintenance action the structure is good-as-new. We assume that the costs are cK for CM, c0T for replacement at time T0, and c0k for PM before time T0. Let T be the first time at which a maintenance action has to be done and let C denote the associated cost. Note that T is a random variable, T0 is a decision variable, and t is used to indicate an arbitrary (non-random) point in time. We need to derive the joint distribution of the pair (T, C) to calculate the life cycle cost.

Define the event that the total damage first exceeds the PM threshold k at the jth shock (j = 1, 2, . . .) as

Aj = { Σ_{i=1}^{j−1} Yi ≤ k < Σ_{i=1}^{j} Yi }.

The other two events, that PM or CM has to be done at the jth shock, are defined as

Aj^PM = Aj ∩ { Σ_{i=1}^{j} Yi ≤ K }

and

Aj^CM = Aj ∩ { Σ_{i=1}^{j} Yi > K }.

Finally, the event that no PM takes place at the jth shock time Sj is defined as

Bj = { Σ_{i=1}^{j} Yi ≤ k } = ∪_{i=j+1}^{∞} Ai.    (11)

Denote the probabilities of these events by αj = P(Aj), βj = P(Aj^PM), γj = P(Aj^CM) and πj = P(Bj). We can express these probabilities in terms of the distribution G as follows:

αj = G^(j−1)(k) − G^(j)(k)

βj = ∫_{0}^{k} [G(K − x) − G(k − x)] dG^(j−1)(x)

γj = αj − βj

πj = 1 − Σ_{i=1}^{j} αi = G^(j)(k).

For any n ≥ 1, the collections {A1, . . . , An, Bn} and {A1^PM, A1^CM, . . . , An^PM, An^CM, Bn} are both finite partitions of the sample space, so

Σ_{i=1}^{n} αi + πn = Σ_{i=1}^{n} βi + Σ_{i=1}^{n} γi + πn = 1.

The joint distribution of (T, C) can now be described as follows. If N(T0) = n, n ≥ 1, then

(T, C) = (Sj, c0k) on Aj^PM, j ≤ n,
(T, C) = (Sj, cK) on Aj^CM, j ≤ n,    (12)
(T, C) = (T0, c0T) on Bn.

If no shocks occurred up to time T0, i.e. N(T0) = 0, then (T, C) = (T0, c0T). We define for convenience of notation B0 = Ω, so P(B0) = 1.

3 MAINTENANCE COST MODEL

3.1 General

A new structure is installed in service and, after the first maintenance action at time T1 with cost C1, a new cycle starts. Denote the time to the next maintenance action by T2 and the cost by C2. We assume that the sequence (Tn, Cn) of random vectors is an iid sequence with

(Tn, Cn) =d (T, C),

where (T, C) is defined in Section 2 and =d denotes equality in distribution. The times at which the cycles start will be denoted by Sn = T1 + · · · + Tn, n ≥ 1, and S0 ≡ 0. Let N(t) denote the total number of maintenance actions during [0, t]:

N(t) = n ⇔ Sn ≤ t < Sn+1.    (13)

The total cost K(t) up to time t associated with this maintenance strategy is

K(t) = Σ_{j=1}^{N(t)} Cj.    (14)
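The event probabilities αj, βj, γj and the partition identity above can be checked by simulation. The following Monte Carlo sketch is our own illustration; the exponential damage increments and all parameter values are assumptions made purely for the example:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
k, K, mu = 2.0, 5.0, 1.0      # PM level, CM level, mean damage per shock
n, jmax = 100_000, 40         # replications; max shocks tracked per history

Y = rng.exponential(mu, size=(n, jmax))
Z = np.cumsum(Y, axis=1)                      # partial sums Z_1, Z_2, ...
first = (Z > k).argmax(axis=1)                # zero-based index of the A_j shock
cm = Z[np.arange(n), first] > K               # damage jumped straight past K

alpha = np.bincount(first, minlength=jmax) / n       # estimates of P(A_j)
beta = np.bincount(first[~cm], minlength=jmax) / n   # estimates of P(A_j^PM)
gamma = np.bincount(first[cm], minlength=jmax) / n   # estimates of P(A_j^CM)

# Partition identity: crossing k is (practically) certain within jmax shocks,
# so sum_j alpha_j = 1, and alpha_j = beta_j + gamma_j for every j.
print(alpha.sum(), np.abs(alpha - beta - gamma).max())
# For exponential increments, alpha_1 = P(Y_1 > k) = exp(-k/mu)
print(alpha[0], math.exp(-k / mu))
```

A further consequence of the exponential (memoryless) choice is that the overshoot above k is again exponential, so the total CM probability Σ γj is close to exp(−(K − k)/μ).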
A well-known result is that the long-term expected average cost C(T) per time unit is (Tijms 2003)

C(T) = lim_{t→∞} (1/t) K(t) = E(C) / E(T).    (15)

Consider exponential discounting with discount factor

D(t) = e^{−rt},    (16)

where r > 0 is a given discount rate. Let K(t, r) denote the discounted cost over [0, t] of the given maintenance policy, which is given as

K(t, r) = Σ_{j=1}^{N(t)} e^{−rSj} Cj 1{Sj ≤ t}.    (17)

The long-term expected equivalent average cost per time unit is

C(T, r) = lim_{t→∞} r E(K(t, r)) = r E(C e^{−rT}) / (1 − E(e^{−rT})),    (18)

see Weide et al. (2008).

3.2 Key results

In order to derive the cost rate using Eq. (15) or (18), a number of probability terms and expectations have to be evaluated. This section summarizes such expressions, whereas their derivations are presented in the Appendix.

There are three possible ways for the renewal of the structure at time T: preventive replacement when Z(T) exceeds k, replacement when the age reaches T0, or complete failure. The time interval between the renewals, also referred to as the renewal cycle, is a random variable and its expected value is given as

E(T) = Σ_{j=0}^{∞} πj ∫_{0}^{T0} Hj(x) dx.    (19)

It follows immediately from Eq. (11) that the probability that a maintenance action is necessary before time T0 equals

Pk = P(T < T0) = 1 − Σ_{n=0}^{∞} πn Hn(T0).    (20)

Let PK be the probability that we have to perform a corrective maintenance (CM) at time T. Then

PK = Σ_{n=1}^{∞} P(C = cK; N(T0) = n) = Σ_{n=1}^{∞} Σ_{j=1}^{n} γj Hn(T0) = Σ_{j=1}^{∞} γj Fj(T0).    (21)

The probability that at time T0 the total damage is still below the managerial level k, so that the cost of the maintenance action is c0T, is equal to

P(C = c0T) = P(T = T0) = 1 − Pk = 1 − Σ_{j=1}^{∞} αj Fj(T0).    (22)

It follows that

P(C = c0k) = 1 − (P(C = cK) + P(C = c0T)) = Σ_{j=1}^{∞} βj Fj(T0).    (23)

For the expected cost of the first maintenance action we get

E(C) = c0T (1 − Pk) + c0k (Pk − PK) + cK PK = c0T + Σ_{j=1}^{∞} [cK γj + c0k βj − c0T αj] Fj(T0).    (24)

If we assume c0 = c0k = c0T and cK = c0 + δK, we obtain the simple expression

E(C) = c0 + δK PK.    (25)

A general expression for the expected discounted cost over a cycle is derived, as shown in the Appendix, as

E(C e^{−rT}) = e^{−rT0} Σ_{n=0}^{∞} [c0 + δK Cn] Hn(T0) + Σ_{n=1}^{∞} (c0 Bn + cK Cn) ∫_{0}^{T0} Hn(x) r e^{−rx} dx,    (26)
where Bn = Σ_{j=1}^{n} βj and Cn = Σ_{j=1}^{n} γj. It follows that

E(C e^{−rT}) = e^{−rT0} E(C) + r Σ_{n=1}^{∞} (c0 Bn + cK Cn) ∫_{0}^{T0} Hn(x) e^{−rx} dx.

Taking C ≡ 1, the expected (discounted) length of the renewal cycle can be obtained as

E(e^{−rT}) = 1 − Σ_{n=0}^{∞} πn ∫_{0}^{T0} Hn(x) r e^{−rx} dx.    (27)

All the expressions derived above are completely general, without any specific assumptions about the form of the point process N that describes the random arrival of shocks in time; see section 2. Note that

lim_{r↓0} C(T, r) = lim_{r↓0} r E(C e^{−rT}) / (1 − E(e^{−rT})) = E(C) / E(T) = C(T).

4 MAINTENANCE POLICIES

This section reanalyzes the following three maintenance policies discussed by Ito et al. (2005). For simplicity, we will always assume in this section that c0T = c0k = c0 and cK = c0 + δK.

1. Model 1. The system undergoes PM at time T0 or at the first shock S1 producing damage Y1, whichever occurs first. This means that the PM level k = 0.
2. Model 2. The system undergoes PM at time T0 or CM if the total damage exceeds the failure level K, whichever occurs first. This means that the PM level k = K.
3. Model 3. The system undergoes PM only when the total damage exceeds the managerial level k. This means that T0 = ∞.

4.1 Model 1

In this case P(Bn) = 0 for all n ≥ 1 because k = 0. Using eq. (27), the expected (discounted) cycle length is obtained as

E(e^{−rT}) = 1 − I(T0),

where

I(T0) = ∫_{0}^{T0} H0(x) r e^{−rx} dx.

Since A1^CM = {Y1 > K}, A1^PM = {Y1 ≤ K} and An = ∅ for all n ≥ 2, we get

E(C e^{−rT}) = c0 (1 − I(T0)) + δK Ḡ(K) [1 − e^{−rT0} H0(T0) − I(T0)],

where, as usual, Ḡ(K) = 1 − G(K). It follows that the long-term discounted value of the expected equivalent average cost rate is given by Eq. (18) as

C(T, r) = r c0 (1 − I(T0)) / I(T0) + r δK Ḡ(K) [1 − e^{−rT0} H0(T0) − I(T0)] / I(T0).

4.2 Model 2

In this case k = K and Aj^PM = ∅, j ≥ 1, which leads to

E(e^{−rT}) = 1 − Σ_{j=0}^{∞} G^(j)(K) ∫_{0}^{T0} Hj(x) r e^{−rx} dx,

and

E(C e^{−rT}) = cK [1 − Σ_{n=0}^{∞} πn ∫_{0}^{T0} Hn(x) r e^{−rx} dx] − δK e^{−rT0} Σ_{n=0}^{∞} πn Hn(T0).

Using eq. (18), the long-term expected equivalent average cost per unit time follows from this.

4.3 Model 3

Since in this case T0 = ∞, it leads to

E(e^{−rT}) = 1 − Σ_{n=0}^{∞} πn ∫_{0}^{∞} Hn(x) r e^{−rx} dx,

and

E(C e^{−rT}) = Σ_{n=1}^{∞} (Σ_{j=1}^{n} [c0 βj + cK γj]) ∫_{0}^{∞} Hn(x) r e^{−rx} dx.

Using eq. (18), the long-term expected equivalent average cost per unit time follows from these expressions.
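To make the key results concrete, the sketch below (our own worked special case, not the authors' code) evaluates E(T), Pk, PK and E(C) for a homogeneous Poisson shock process with exponential damage increments. In that case G^(j) is the Erlang CDF, Fj(T0) is the Erlang CDF of the jth arrival time, and, by memorylessness of the overshoot above k, γj = αj e^{−(K−k)/μ}; all parameter values are assumptions for illustration.

```python
import math

def erlang_cdf(x, j, mu):
    """CDF of the sum of j iid exponentials with mean mu (1 for j = 0, x >= 0)."""
    if x < 0.0:
        return 0.0
    if j == 0:
        return 1.0
    z, term, s = x / mu, math.exp(-x / mu), 0.0
    for m in range(j):
        s += term
        term *= z / (m + 1)
    return 1.0 - s

# Assumed parameters for illustration
rate, mu = 1.0, 1.0            # shock rate; mean damage per shock
k, K, T0 = 2.0, 5.0, 4.0       # PM level, CM level, age limit
c0, dK, jmax = 20.0, 180.0, 100

# pi_j = G^(j)(k); int_0^T0 H_j(x) dx = (Erlang_{j+1} CDF at T0) / rate
ET = sum(erlang_cdf(k, j, mu) * erlang_cdf(T0, j + 1, 1.0 / rate) / rate
         for j in range(jmax))                                    # eq. (19)

# alpha_j = G^(j-1)(k) - G^(j)(k); F_j(T0) = Erlang_j CDF of T0
alpha = [erlang_cdf(k, j - 1, mu) - erlang_cdf(k, j, mu) for j in range(1, jmax)]
Pk = sum(a * erlang_cdf(T0, j, 1.0 / rate) for j, a in enumerate(alpha, start=1))
PK = math.exp(-(K - k) / mu) * Pk   # memoryless overshoot beyond k, eq. (21)
EC = c0 + dK * PK                                                 # eq. (25)
print(round(ET, 4), round(Pk, 4), round(PK, 4), round(EC, 4))
```

A useful internal check is that Pk computed from the αj agrees with 1 − Σj πj Hj(T0), i.e. eq. (20).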
5 POISSON SHOCK PROCESS

In this section we analyze the case when the shock process N(t) is a non-homogeneous Poisson process (NHPP) with a continuous intensity function λ(t). If the intensity λ(t) ≡ λ is constant, then the Poisson process is called homogeneous. The number of shocks during (0, t] is now Poisson distributed with mean

R(t) = ∫_{0}^{t} λ(u) du.    (28)

The probability of exactly j shocks during (0, t] is

Hj(t) = P(N(t) = j) = (R(t)^j / j!) e^{−R(t)},  j = 0, 1, . . .    (29)

The probability density of the time Sj at which the jth shock occurs is

fj(u) = (R(u)^{j−1} / (j − 1)!) λ(u) e^{−R(u)} = H_{j−1}(u) λ(u),  u > 0, j ≥ 1.    (30)

The probability Fj(t) that the jth shock occurs before time t is then

Fj(t) = P(N(t) ≥ j) = Σ_{i=j}^{∞} Hi(t),    (31)

or, using the formula for the probability density of the jth shock,

Fj(t) = ∫_{0}^{t} H_{j−1}(u) λ(u) du.    (32)

Consider a special case with the power law intensity model

λ(t) = a t^{b−1}  and  R(x) = (a/b) x^b.

Then

Hj(x) = (1/j!) ((a/b) x^b)^j exp(−(a/b) x^b),

and substituting y = (a/b) x^b, we get

∫_{0}^{T0} Hj(x) dx = (1/b) (b/a)^{1/b} (1/j!) [Γ(j + 1/b) − Γ(j + 1/b, (a/b) T0^b)],

where Γ(a, x) denotes the upper incomplete gamma function

Γ(a, x) = ∫_{x}^{∞} t^{a−1} e^{−t} dt.

Further,

Fj(T0) = ∫_{0}^{T0} H_{j−1}(x) λ(x) dx = (1/(j − 1)!) [Γ(j) − Γ(j, (a/b) T0^b)].

The expected cycle length is obtained as

E(T) = Σ_{j=0}^{∞} G^(j)(k) ∫_{0}^{T0} Hj(x) dx = (1/b) (b/a)^{1/b} Σ_{j=0}^{∞} (G^(j)(k) / j!) [Γ(j + 1/b) − Γ(j + 1/b, (a/b) T0^b)],

and the expected cycle cost is derived as

E(C) = c0 + δK PK = c0 + δK Σ_{j=1}^{∞} ([Γ(j) − Γ(j, (a/b) T0^b)] / (j − 1)!) ∫_{0}^{k} [1 − G(K − x)] dG^(j−1)(x).

So, the long-term expected average cost C(T0, k) per unit time can be derived from the above two equations.

5.1 Example

In this section, we analyze the three maintenance policies discussed in section 4 under the assumption that the arrival of shocks follows the Poisson process with a power law intensity function.

In the case of the maintenance policy of Model 1 and without discounting, the long-term expected average cost C(T0) per unit time is obtained as

C(T0) = [c0 + δK Ḡ(K) (1 − Γ(1, (a/b) T0^b))] / [(1/b) (b/a)^{1/b} (Γ(1/b) − Γ(1/b, (a/b) T0^b))].
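The power-law NHPP itself is easy to simulate via the time transformation S_j = R^{-1}(G_j), where the G_j are the arrival times of a standard (rate 1) Poisson process. The sketch below is our own illustration (names and parameter values are arbitrary choices); it checks that N(t) is indeed Poisson with mean R(t) = (a/b) t^b, as in eq. (29):

```python
import math
import numpy as np

rng = np.random.default_rng(7)
a, b, t = 2.0, 2.0, 1.5          # power-law intensity lam(u) = a u^{b-1}
n = 100_000                      # simulated process histories

R = (a / b) * t ** b             # expected number of shocks by time t

# Standard Poisson arrival times G_j, then S_j = R^{-1}(G_j) = (b G_j / a)^{1/b}
gaps = rng.exponential(1.0, size=(n, 50))
G = np.cumsum(gaps, axis=1)
S = (b * G / a) ** (1.0 / b)
Nt = (S <= t).sum(axis=1)        # number of shocks in (0, t]

print(Nt.mean(), R)                        # sample mean vs R(t)
print((Nt == 0).mean(), math.exp(-R))      # H_0(t) check, eq. (29)
```

For a Poisson count, the sample variance should also be close to R(t), which is a quick way to distinguish the NHPP from over- or under-dispersed alternatives.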
[Figure 1. Plot of C(T0) and the asymptotic value for the case a = 2 and b = 2, c0 = 20, δK = 180, Ḡ(K) = 0.40.]

[Figure 2. Plot of C(T0, r) and the asymptotic value for the case a = 2 and b = 2, c0 = 20, δK = 180, Ḡ(K) = 0.40, r = 0.04.]

As an example, take a = b = 2. The intensity λ(t) = 2t and R(t) = t², and we get

E(C)/E(T) = [c0 + δK Ḡ(K) (1 − e^{−T0²})] / [(√π/2) erf(T0)] → [c0 + δK Ḡ(K)] / (√π/2)  as T0 → ∞.

Figure 1 contains a plot of the long-term expected average cost C(T0) per unit time as a function of T0.

It is interesting to see what happens if we consider discounting. Taking again a = b = 2, we get

I(T0) = (r√π/2) e^{r²/4} [erf(T0 + r/2) − erf(r/2)],

and the long-term expected equivalent average cost per unit time is obtained by substitution of I(T0) into the expression derived in Subsection 4.1. The limiting value as T0 → ∞ is

r (c0 + δK Ḡ(K)) (1 − I(∞)) / I(∞),

where

I(∞) = (r√π/2) e^{r²/4} (1 − erf(r/2)).

Figure 2 contains a plot of the long-term expected equivalent average discounted cost C(T0, r).

In the rest of this section we only consider the case without discounting. In the case of maintenance policy 2, we have k = K, and the expression

∫_{0}^{k} [1 − G(K − x)] dG^(j−1)(x)

can be simplified to G^(j−1)(K) − G^(j)(K). So

Σ_{j=1}^{∞} ([Γ(j) − Γ(j, (a/b) T0^b)] / (j − 1)!) ∫_{0}^{k} [1 − G(K − x)] dG^(j−1)(x)

= Σ_{j=1}^{∞} ([Γ(j) − Γ(j, (a/b) T0^b)] / (j − 1)!) [G^(j−1)(K) − G^(j)(K)]

= 1 − Σ_{j=0}^{∞} (1/j!) ((a/b) T0^b)^j exp(−(a/b) T0^b) G^(j)(K).

So, the long-term expected average cost C(T0, K) per unit time is

C(T0, K) = [c0 + δK (1 − Σ_{j=0}^{∞} (1/j!) ((a/b) T0^b)^j exp(−(a/b) T0^b) G^(j)(K))] / [(1/b) (b/a)^{1/b} Σ_{j=0}^{∞} (G^(j)(K) / j!) (Γ(j + 1/b) − Γ(j + 1/b, (a/b) T0^b))].

In the case of maintenance policy 3 (T0 = ∞), the long-term expected average cost C(∞, k) per unit time is derived as

C(∞, k) = [c0 + δK Σ_{j=1}^{∞} ∫_{0}^{k} [1 − G(K − x)] dG^(j−1)(x)] / [(1/b) (b/a)^{1/b} Σ_{j=0}^{∞} (G^(j)(k) / j!) Γ(j + 1/b)].
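These closed forms for a = b = 2 are straightforward to evaluate. The sketch below is our own rendering of the Model 1 example (it evaluates the formulas only; the plotted curves of Figures 1 and 2 are not reproduced):

```python
import math

# Model 1, power-law NHPP with a = b = 2: undiscounted cost rate C(T0),
# its asymptote, and the discounted cost rate C(T0, r) from Subsection 4.1.
c0, dK, Gbar = 20.0, 180.0, 0.40
SQ = math.sqrt(math.pi) / 2.0

def C(T0):
    """Long-run expected average cost per unit time, no discounting."""
    return (c0 + dK * Gbar * (1.0 - math.exp(-T0 ** 2))) / (SQ * math.erf(T0))

def I(T0, r):
    """I(T0) = int_0^T0 H0(x) r e^{-rx} dx with H0(x) = exp(-x^2)."""
    return r * SQ * math.exp(r * r / 4.0) * (math.erf(T0 + r / 2.0)
                                             - math.erf(r / 2.0))

def C_disc(T0, r):
    """Long-run expected equivalent average cost per unit time, eq. (18)."""
    i = I(T0, r)
    num = r * (c0 * (1.0 - i)
               + dK * Gbar * (1.0 - math.exp(-r * T0) * math.exp(-T0 ** 2) - i))
    return num / i

asym = (c0 + dK * Gbar) / SQ     # limit of C(T0) as T0 -> infinity
print(round(asym, 2))            # about 103.8
```

As r ↓ 0, C_disc(T0, r) approaches the undiscounted C(T0), in line with the limit noted at the end of Section 3.2.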
6 CONCLUSIONS

The paper derives expressions for computing the discounted life cycle cost rate associated with maintenance policies involving age-based and condition-based criteria for preventive replacement. The paper generalizes the results presented in the literature (Ito et al. 2005), which is the main contribution of this paper. The proposed model is currently being extended to a more general case in which the shocks follow a renewal process.

7 APPENDIX

This Appendix contains the derivations of the key results in section 3. We start from

E[C e^{−rT}] = Σ_{n=0}^{∞} E[C e^{−rT}; N(T0) = n].

For n = 0 we have

E[C e^{−rT}; N(T0) = 0] = c0T e^{−rT0} H0(T0),

and for n ≥ 1 we split the expectation over the partition elements {A1^PM, A1^CM, . . . , An^PM, An^CM, Bn}. Using definition (12) of (T, C), we get

E[C e^{−rT}; N(T0) = n] = Σ_{j=1}^{n} c0k βj E[e^{−rSj}; N(T0) = n] + Σ_{j=1}^{n} cK γj E[e^{−rSj}; N(T0) = n] + c0T e^{−rT0} πn Hn(T0).

Since

Σ_{n=1}^{∞} Σ_{j=1}^{n} βj E[e^{−rSj}; N(T0) = n] = Σ_{j=1}^{∞} Σ_{n=j}^{∞} βj E[e^{−rSj}; N(T0) = n] = Σ_{j=1}^{∞} βj E[e^{−rSj}; N(T0) ≥ j] = Σ_{j=1}^{∞} βj E[e^{−rSj}; Sj ≤ T0],

we get

E[C e^{−rT}] = c0T e^{−rT0} Σ_{n=0}^{∞} πn Hn(T0) + Σ_{j=1}^{∞} (c0k βj + cK γj) E[e^{−rSj}; Sj ≤ T0].    (33)

The term E[e^{−rSj}; Sj ≤ T0] can be expressed in terms of the distribution function Fj of Sj as

E[e^{−rSj}; Sj ≤ T0] = ∫_{0}^{T0} e^{−rx} dFj(x) = e^{−rT0} Fj(T0) + ∫_{0}^{T0} Fj(x) r e^{−rx} dx,

or in terms of the probabilities Hn(t) as

E[e^{−rSj}; Sj ≤ T0] = Σ_{n=j}^{∞} hn,

where

hn = e^{−rT0} Hn(T0) + ∫_{0}^{T0} Hn(x) r e^{−rx} dx.

If we substitute this last expression in (33) we get, after interchanging the order of summation,

E[C e^{−rT}] = c0T e^{−rT0} Σ_{n=0}^{∞} πn Hn(T0) + Σ_{n=1}^{∞} (c0k Bn + cK Cn) hn,

where Bn = Σ_{j=1}^{n} βj and Cn = Σ_{j=1}^{n} γj.

Re-arranging the terms, we can also write

E[C e^{−rT}] = c0T e^{−rT0} H0(T0) + e^{−rT0} Σ_{n=1}^{∞} (c0T πn + c0k Bn + cK Cn) Hn(T0) + Σ_{n=1}^{∞} (c0k Bn + cK Cn) ∫_{0}^{T0} Hn(x) r e^{−rx} dx.    (34)

438
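The identity used in the appendix, expressing E[e^{−rS_j}; S_j ≤ T_0] through the probabilities H_n(t) = P(N(t) = n), can be spot-checked numerically. The sketch below assumes a homogeneous Poisson process of rate λ, so that S_j is gamma distributed; this is an illustrative special case of our choosing, not the paper's general shock process, and the quadrature is a plain trapezoid rule.

```python
import math

def H(n, t, lam):
    """P(N(t) = n) for a Poisson process of rate lam."""
    return math.exp(-lam * t) * (lam * t) ** n / math.factorial(n)

def h(n, T0, r, lam, steps=2000):
    """h_n = e^{-r T0} H_n(T0) + integral_0^T0 H_n(x) r e^{-rx} dx."""
    dx = T0 / steps
    integral = 0.0
    for i in range(steps + 1):
        x = i * dx
        w = 0.5 if i in (0, steps) else 1.0  # trapezoid weights
        integral += w * H(n, x, lam) * r * math.exp(-r * x) * dx
    return math.exp(-r * T0) * H(n, T0, lam) + integral

def lhs(j, T0, r, lam, steps=2000):
    """E[e^{-r S_j}; S_j <= T0] from the gamma density of S_j."""
    dx = T0 / steps
    total = 0.0
    for i in range(steps + 1):
        x = i * dx
        w = 0.5 if i in (0, steps) else 1.0
        dens = lam * math.exp(-lam * x) * (lam * x) ** (j - 1) / math.factorial(j - 1)
        total += w * math.exp(-r * x) * dens * dx
    return total
```

With these definitions, summing h_n over n ≥ j reproduces the left-hand side, because F_j(x) = Σ_{n≥j} H_n(x) and the h_n identity is integration by parts applied termwise.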
If c_0 = c_{0T} = c_{0k} and c_K = c_0 + δ_K, this formula simplifies to

E[Ce^{−rT}] = e^{−rT_0} ∑_{n=0}^∞ (c_0 + δ_K C_n) H_n(T_0) + ∑_{n=1}^∞ (c_{0k} B_n + c_K C_n) ∫_0^{T_0} H_n(x) r e^{−rx} dx,  (35)

which is the same as formula (26). Substituting C ≡ 1 in (34) and noting that

∑_{n=1}^∞ ∫_0^{T_0} H_n(x) r e^{−rx} dx = ∫_0^{T_0} (1 − H_0(x)) r e^{−rx} dx = 1 − e^{−rT_0} − ∫_0^{T_0} H_0(x) r e^{−rx} dx,

we get formula (27):

E[e^{−rT}] = e^{−rT_0} + ∑_{n=1}^∞ (1 − π_n) ∫_0^{T_0} H_n(x) r e^{−rx} dx = 1 − ∑_{n=0}^∞ π_n ∫_0^{T_0} H_n(x) r e^{−rx} dx.  (36)

Differentiating with respect to r followed by substitution of r = 0 yields

E[T] = ∑_{n=0}^∞ π_n ∫_0^{T_0} H_n(x) dx,

which is the same as formula (19).

REFERENCES

[1] Barlow, R.E. and Proschan, F., 1965. Mathematical Theory of Reliability. Wiley, New York.
[2] Cox, D.R., 1962. Renewal Theory. Methuen, London.
[3] Dekker, R., 1996. Applications of Maintenance Optimization Models: A Review and Analysis. Reliability Engineering and System Safety, (51): 229–240.
[4] Feller, W., 1957. An Introduction to Probability Theory and Its Applications, 3rd edn. Wiley, New York.
[5] Ito, K., Qian, C.H. and Nakagawa, T., 2005. Optimal preventive maintenance policies for a shock model with given damage level. Journal of Quality in Maintenance, 11(3): 216–227.
[6] Mercer, A. and Smith, C.S., 1959. A Random Walk in Which the Steps Occur Randomly in Time. Biometrika, 46: 30–55.
[7] Nakagawa, T., 2005. Maintenance Theory of Reliability. Springer Series in Reliability Engineering. London: Springer.
[8] Nakagawa, T. and Mizutani, S., 2007. A Summary of Maintenance Policies for a Finite Interval. Reliability Engineering and System Safety.
[9] Smith, W.L., 1958. Renewal Theory and Its Ramification. Journal of the Royal Statistical Society, Series B (Methodological), 20(2): 243–302.
[10] Tijms, H.C., 2003. A First Course in Stochastic Models. John Wiley and Sons, New York, NY.
[11] Weide, J.A.M. van der, Suyono and Noortwijk, J.M. van, 2008. Renewal theory with exponential and hyperbolic discounting. Probability in the Engineering and Informational Sciences, 22(1): 53–74.

A study about influence of uncertain distribution inputs in maintenance


optimization

R. Mullor
Dpto. Estadística e Investigación Operativa, Universidad Alicante, Spain

S. Martorell
Dpto. Ingeniería Química y Nuclear, Universidad Politécnica Valencia, Spain

A. Sánchez & N. Martinez-Alzamora


Dpto. Estadística e Investigación Operativa Aplicadas y Calidad, Universidad Politécnica Valencia, Spain

ABSTRACT: A considerable number of studies have been published in the last decade in the field of RAM+C based optimization considering uncertainties. They have demonstrated that the inclusion of uncertainties in the optimization gives the decision maker insight into how uncertain the RAM+C results are and that this uncertainty matters, as it can result in differences in the outcome of the decision making process. Several methods of uncertainty propagation have been proposed in the literature. In this context, this paper focuses on assessing how the choice of distribution for the uncertain input parameters may affect the output results.

1 INTRODUCTION

Safety related systems performance optimization is classically based on quantifying the effects that testing and maintenance activities have on reliability, availability, maintainability and cost (RAM+C). However, RAM+C quantification is often incomplete in the sense that important uncertainties may not be considered. A considerable number of studies have been published in the last decade in the field of RAM+C based optimization considering uncertainties (Bunea & Bedford, 2005, Marseguerra et al., 2004, Martorell et al., 2007). They have demonstrated that the inclusion of uncertainties in the optimization gives the decision maker insight into how uncertain the RAM+C results are and that this uncertainty matters, as it can result in differences in the outcome of the decision making process.

Several methods of uncertainty propagation have been proposed in the literature. Some of the existing methods follow an approach based on estimating tolerance intervals, exploiting the advantages of order statistics to provide distribution-free tolerance intervals for the output. The pioneering work was done by Wilks (Wilks, 1941). Later on, Wald (Wald, 1942) generalized Wilks's method to the multivariate, independent case. Finally, Guba (Guba et al., 2003) extended the method to allow dependence between the objective functions.

In this context, this paper focuses on assessing how the choice of distribution for the uncertain input parameters may affect the output. This work uses the results obtained in a previous paper (Martorell et al., 2006) where the estimation of different parameters related to equipment reliability and maintenance effectiveness was carried out using the Maximum Likelihood Estimation (MLE) method. As a consequence of the parameter estimation process, the results obtained in the RAM+C based optimization are uncertain. Herein the optimization problem is formulated as a multi-objective problem where reliability and cost act as decision criteria and maintenance intervals act as decision variables. A tolerance interval based approach is used to address uncertainty, which assures that a certain level of tolerance/confidence is achieved with the minimum number of samples.

The effect of the input parameters distribution is analyzed using a normal distribution. This assumption is the most realistic one, since the parameters have been estimated by the maximum likelihood method, which ensures the asymptotic normality of the parameters. A second study is performed considering dependent or independent input parameters.

2 RELIABILITY AND COST MODELS

As established in the introduction, this paper is focused on the maintenance optimization of safety related equipment, based on reliability and cost criteria, and considering uncertainty in the equipment reliability characteristics and in the maintenance effectiveness.

In this context, the objective of this section is to present the expressions of the objective functions, averaged reliability and cost, as functions of the decision variables (maintenance intervals), the reliability model parameters and the maintenance effectiveness. Two models of failure distribution, Weibull and linear, and two imperfect maintenance models, PAS (Proportional Age Set-Back) and PAR (Proportional Age Reduction), are considered.

2.1 Reliability function

The general expressions of the reliability function associated to linear and Weibull failure distributions, respectively, are given by:

R(w) = exp(−(α/2) w²)  (1)

R(w) = exp(−(w/η)^β)  (2)

where w is the age of the component, which depends on the imperfect maintenance model selected, α is the linear aging rate, and β and η are the shape factor and the characteristic time representing the time scale, respectively.

To take imperfect maintenance into consideration, the expressions of the PAS and PAR models described in (Martorell et al., 1999) are used. These expressions are the following:

w = t − ∑_{k=0}^{m−1} (1 − ε)^k ε τ_{m−k−1}  (3)

w = t − ε τ_{m−1}  (4)

where t is the chronological time, ε the maintenance effectiveness, which ranges in the interval [0, 1], τ the time between two maintenance activities and m the number of maintenance tasks executed in a period L.

Assuming that the period between maintenance activities is constant and equal to M, Eqns (3) and (4) simplify to:

w = t − mM + M/ε  (5)

w = t(1 − ε) + Mε/2  (6)

Now, substituting Eqns (5) and (6) into Eqns (1) and (2), the reliability functions of the PAS and PAR models are obtained as continuous functions of time. Thus, the expressions corresponding to the PAS approach considering linear and Weibull distributions are given by Eqns (7) and (8):

R(t) = exp(−(α/2)(t − mM + M/ε)²)  (7)

R(t) = exp(−((t − mM + M/ε)/η)^β)  (8)

Following the same reasoning, if the PAR approach is considered, the expressions corresponding to the reliability function are given by Eqns (9) and (10) for linear and Weibull distributions, respectively:

R(t) = exp(−(α/2)(t(1 − ε) + Mε/2)²)  (9)

R(t) = exp(−((t(1 − ε) + Mε/2)/η)^β)  (10)

Using Eqns (7) to (10), the averaged reliability function can be obtained as:

R̄ = (1/L) ∫_0^L R(t) dt  (11)

where R(t) is obtained from Eqns (7) to (10).

2.2 Cost function

The relevant costs in analyzing maintenance optimization of safety-related equipment include the cost of performing preventive and corrective maintenance and the cost associated with replacing the component (Martorell et al., 2002). The following expressions are used to quantify these costs:

c_m = 8760 · c_ma / M  (12)

c_c = 8760 · (1/M)(ρ + h*·M) c_ca  (13)

c_o = 8760 · c_oa / L  (14)
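The reliability expressions (7)–(11) can be evaluated directly. The sketch below implements the Weibull variants (8) and (10) and approximates the averaged reliability (11) with a trapezoid rule. The parameter values in the usage example are illustrative only, and the cycle index m of the PAS model is passed explicitly rather than derived from t.

```python
import math

def r_pas_weibull(t, m, M, eps, beta, eta):
    """Eq. (8): PAS reliability with Weibull aging, age w = t - m*M + M/eps."""
    w = t - m * M + M / eps
    return math.exp(-((w / eta) ** beta))

def r_par_weibull(t, M, eps, beta, eta):
    """Eq. (10): PAR reliability with Weibull aging, age w = t*(1-eps) + M*eps/2."""
    w = t * (1.0 - eps) + M * eps / 2.0
    return math.exp(-((w / eta) ** beta))

def averaged_reliability(R, L, steps=2000):
    """Eq. (11): mean of R(t) over [0, L], trapezoid rule."""
    dt = L / steps
    s = 0.5 * (R(0.0) + R(L))
    for i in range(1, steps):
        s += R(i * dt)
    return s * dt / L
```

Because the effective age w grows with t, R(t) is decreasing, so the average lies between R(L) and R(0).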
where the term c_m represents a yearly cost contribution as a consequence of performing preventive maintenance on the component over a year period, c_c represents a yearly cost contribution as a consequence of performing corrective maintenance, and c_o is the overhaul cost, which represents the yearly cost associated with replacing the component with periodicity L. In addition, the parameters c_ma and c_ca represent the cost associated with conducting each preventive and corrective maintenance, respectively, c_oa represents the total cost of replacing the component, while ρ is the cyclic or per-demand failure probability. For the PAS model, the average hazard function h* is obtained considering the linear and Weibull distributions, yielding the following expressions:

h* = Mα(2 − ε)/(2ε)  (15)

h* = M^{β−1}(1 − (1 − ε)^β)/(εη)^β  (16)

If the PAR model is considered, the expressions of the averaged hazard function are given by:

h* = (α/2)(εM + L(1 − ε))  (17)

h* = [(Mε + 2L(1 − ε))^β − (Mε)^β] / [L(1 − ε)(2η)^β]  (18)

Finally, the global cost of the equipment C can be derived by summing up the corresponding cost contributions of the relevant components using Eqns (12) to (14).

3 MULTI-OBJECTIVE OPTIMIZATION PROCEDURE

A Multiobjective Optimization Problem (MOP) considers a set of decision variables x, a set of objective functions f(x) and a set of constraints g(x) based on decision criteria. In our problem, the MOP consists in determining the maintenance intervals, over the replacement period, of each component of the equipment, which maximize the equipment reliability, R, while minimizing its cost, C, subject to restrictions generated by an initial solution (C_i, R_i), usually determined by the values of the current maintenance intervals implemented in the plant.

Applying the so-called weighted sum strategy, the multiobjective problem of minimizing the vector of objective functions is converted into a scalar problem by constructing a weighted sum of all the objective functions. In particular, if we have two objective functions, the MOP can be expressed as:

min ω e(C) + (1 − ω) e(R), subject to: R ≥ R_r; C ≤ C_r  (19)

that is, the minimization of convex combinations of both efficiency functions, ω being the weighting coefficient and e(C) and e(R) the efficiencies of each feasible solution, which can be evaluated as:

e(C) = (C − C_r)/(C_r − C_o)  (20)

e(R) = (R_r − R)/(R_o − R_r)  (21)

The point (C_r, R_o) is obtained by maximizing the equipment reliability with the cost function, evaluated at its initial value C_i, acting as a restriction, and (C_o, R_r) is the result of taking as objective function the equipment cost, which must be minimized keeping the reliability function greater than its initial value R_i.

The problem of optimization of the maintenance considering reliability and cost criteria has been solved using Sequential Quadratic Programming (SQP) (Biggs, 1975).

4 NON-PARAMETRIC METHOD OF UNCERTAINTY PROPAGATION

The problem to tackle in this section is how to bound the uncertainty present in the optimum values of the objective functions. In the uncertainty analysis, N sets of input parameters x = (x_1, x_2, …, x_N) containing the uncertain variables are selected randomly. After N runs with fluctuating inputs, using crude Monte Carlo sampling, we obtain N randomly varying output vectors which carry information on the fluctuating input.

The statistical evaluation of the output parameters is based on the distribution-free tolerance limits elaborated by Wilks (Wilks, 1941) and extended by Guba (Guba et al., 2003).

We assume the output comprises p dependent variables. Carrying out N runs, we get a sample matrix Y = [y_ij], where column j corresponds to the result obtained in the j-th sampling for the p variables. Using this matrix, the appropriate intervals [L_i, U_i]_{γ/β} can be obtained, that is, the construction of p pairs of random variables L_i(y_1, y_2, …, y_N) and U_i(y_1, y_2, …, y_N), i = 1, 2, …, p, such that

P(∫_{L_1}^{U_1} ··· ∫_{L_p}^{U_p} g(y_1, …, y_p) dy_1 … dy_p > γ) = β  (22)
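A useful consistency check on the hazard expressions (15)–(18): a linear aging rate α is the special case β = 2, η = √(2/α) of the Weibull model, so the linear and Weibull averaged hazards should coincide there. The sketch below encodes (12)–(18); the numeric values in the check are illustrative only.

```python
import math

def h_pas_linear(M, alpha, eps):
    """Eq. (15): averaged hazard, PAS model, linear aging."""
    return M * alpha * (2.0 - eps) / (2.0 * eps)

def h_pas_weibull(M, beta, eta, eps):
    """Eq. (16): averaged hazard, PAS model, Weibull aging."""
    return M ** (beta - 1.0) * (1.0 - (1.0 - eps) ** beta) / (eps * eta) ** beta

def h_par_linear(M, L, alpha, eps):
    """Eq. (17): averaged hazard, PAR model, linear aging."""
    return alpha / 2.0 * (eps * M + L * (1.0 - eps))

def h_par_weibull(M, L, beta, eta, eps):
    """Eq. (18): averaged hazard, PAR model, Weibull aging."""
    return ((M * eps + 2.0 * L * (1.0 - eps)) ** beta - (M * eps) ** beta) / (
        L * (1.0 - eps) * (2.0 * eta) ** beta)

def yearly_cost(M, L, cma, cca, coa, rho, h_star):
    """Eqs. (12)-(14): yearly preventive + corrective + overhaul cost."""
    cm = 8760.0 * cma / M
    cc = 8760.0 * (rho + h_star * M) * cca / M
    co = 8760.0 * coa / L
    return cm + cc + co
```

This equivalence also confirms that the 2ε denominator in (15) matches the ε^β denominator structure of (16).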
Since the probability of coverage depends on the unknown joint density function of the output g(y_1, …, y_p), it is necessary to use a procedure such that the probability β is independent of the joint distribution function of the output variables.

The generalized version of this method, known as the truncated sample range, consists of taking L as the greatest of the r smallest values in the sample and U as the smallest of the r largest values; that is, let y(1), y(2), …, y(N) be the sample values of y arranged in order of increasing magnitude, then the tolerance interval [L, U] is taken as [y(r), y(N − r + 1)] such that the required coverage/confidence couple is achieved. Coverage (γ) measures the proportion of the distribution included in the random interval [L, U], while confidence (β) is the confidence level.

In this work we use the non-truncated version, where r = 1, so the tolerance interval becomes [y(1), y(N)], the minimum and maximum values in the sample, respectively, and the coverage/confidence levels achieved depend only on the sample size N. This version consists of arranging all columns of the sample matrix in order of increasing magnitude of the first row and selecting y_1(1) as L_1 and y_1(N) as U_1. Then, the first and last columns are removed from the arranged sample matrix, which is arranged again, now in order of increasing magnitude of the second row. From this updated sample matrix, we choose y_2(1) as L_2 and y_2(N − 2) as U_2; then the first and last columns of this matrix are removed to continue selecting tolerance limits for the next variables. We continue this embedding procedure to the last row of the sample matrix and define the p-dimensional volume

V_p = [L_1, U_1] × [L_2, U_2] × ··· × [L_p, U_p]  (23)

which achieves Eq. (22) depending only on the sample size N.

Now, it is necessary to find the sample size N which achieves the coverage/confidence levels previously selected, since achieving these levels depends only on the selected sample size. The relation between the coverage/confidence levels and the sample size can be evaluated as:

β = ∑_{j=0}^{N−kp} \binom{N}{j} γ^j (1 − γ)^{N−j}  (24)

where γ and β are the coverage/confidence levels, N the searched sample size, k the number of limits of the tolerance intervals (k = 1, 2 for one- or two-sided confidence levels) and p the number of objective functions comprised in the output. Thus, the confidence level is the value, at N − kp, of the distribution function of a binomial random variable with sample size N and success probability γ, the coverage level.

Table 1 shows the number of runs needed to determine tolerance intervals for some usual couples of coverage/confidence levels.

Table 1. Number of runs to obtain tolerance intervals for k·p at γ/β levels.

γ      k·p   β = 0.90   β = 0.95   β = 0.99
0.90   1     22         29         44
       2     38         46         64
       3     52         61         81
       4     65         76         97
0.95   1     45         59         90
       2     77         93         130
       3     105        124        165
       4     132        153        198
0.99   1     230        299        459
       2     388        473        662
       3     531        628        838
       4     667        773        1001

5 APPLICATION CASE

The application case is focused on the optimization of the preventive maintenance associated to motor-operated safety valves of a Nuclear Power Plant. The equipment consists of two main components (actuator (A) and valve (V)) in serial configuration. The reliability (α, β and η) and maintenance effectiveness (ε) parameters have been previously estimated using the Maximum Likelihood Estimation (MLE) method. So, the problem considers how the uncertainty associated to the reliability and maintenance effectiveness parameters affects the maintenance optimization process based on system reliability and cost (R+C) criteria.

Table 2 shows the probability distribution, the imperfect maintenance model and the reliability data for the actuator and valve necessary to quantify the equipment reliability. These values represent the mean values obtained in the estimation process. In addition, the single cost data necessary to quantify the yearly equipment cost are shown in Table 3.

Additionally, maximum likelihood estimators have the property of being asymptotically normally distributed. Thus, it is possible to obtain the joint distribution of the parameters, which is given by:

(β, η, ε_A, α, ε_V) ∼ N(μ⃗, C⃗)  (25)

μ⃗ being the mean vector and C⃗ the variance-covariance matrix. By using the MLE method the following values are obtained:
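Equation (24) can be inverted numerically to reproduce Table 1: for a given coverage γ and target confidence β, one searches for the smallest N whose confidence reaches the target. A minimal Python sketch:

```python
import math

def confidence(N, gamma, k, p):
    """Eq. (24): beta = P(Bin(N, gamma) <= N - k*p)."""
    kp = k * p
    if N < kp:
        return 0.0
    return sum(math.comb(N, j) * gamma ** j * (1.0 - gamma) ** (N - j)
               for j in range(N - kp + 1))

def min_runs(gamma, beta, k, p):
    """Smallest sample size N achieving coverage gamma with confidence beta."""
    N = k * p
    while confidence(N, gamma, k, p) < beta:
        N += 1
    return N
```

For example, for one-sided limits on a single output (k·p = 1), the formula reduces to 1 − γ^N ≥ β, giving the familiar Wilks values 59 (0.95/0.95) and 22 (0.90/0.90); for two-sided limits on two outputs (k·p = 4) at 0.95/0.95, it gives the 153 runs used later in the application case.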
Table 2. Reliability data.

               Actuator   Valve
Distribution   Weibull    Linear
IM* model      PAS        PAR
ε              0.8482     0.7584
α              –          1.54E-9
β              7.4708     –
η              15400      –

Table 3. Single cost data for actuator and valve.

Component   c_ca [€]   c_ma [€]   c_oa [€]
Actuator    3120       300        1900
Valve       3120       800        3600

μ⃗ = (7.4707, 15397, 0.8482, 1.73e−9, 0.7584)ᵀ  (26)

and

C⃗ =
⎛ 0.1572       −5.5646      −1.6944e−3   0            0          ⎞
⎜ −5.5646      4.0808e+4    −2.3730      0            0          ⎟
⎜ −1.6944e−3   −2.3730      2.2587e−4    0            0          ⎟
⎜ 0            0            0            2.7619e−20   2.9647e−12 ⎟
⎝ 0            0            0            2.9647e−12   8.4261e−4  ⎠  (27)

The optimization process is performed under reliability and cost criteria, y = {R, C}, and the maintenance intervals of each component act as decision variables. The equipment reliability and the associated cost have been quantified using the analytical models previously introduced. A Sequential Quadratic Programming (SQP) method is used as the optimization algorithm (Biggs, 1975).

Both the equipment reliability and cost functions are considered to be deterministic in the sense that, when all the necessary input data for the model are specified, they provide only one value for every output. However, as the inputs of the equipment reliability and cost models fluctuate according to a distribution law reflecting the uncertainty in the parameters, the equipment reliability and cost will fluctuate in repeated runs. In this case, a multivariate normal distribution whose parameters are given by Eqns (26) and (27) is used to characterize the uncertainty.

Following the distribution-free tolerance interval approach discussed in Section 4 to address uncertainty, Wilks' equation results in 153 runs to achieve levels of 0.95/0.95 for coverage/confidence in two-sided intervals for two outputs. After arranging the columns of the sample matrix of runs, the first and last values of each output are selected as its lower and upper tolerance limits, with the column removal explained above, thereby obtaining 0.95/0.95 coverage/confidence limits for all solutions obtained in the optimization process, which constitute upper and lower tolerance limits to the Pareto front. Figures 1 and 2 show the tolerance limits obtained for the reliability and cost values, respectively.

Now, we analyze the influence on the tolerance limits of considering that the reliability and maintenance effectiveness parameters are normally distributed but assuming they are not dependent, that is:

β ∼ N(7.4707, 0.1572)
η ∼ N(15397, 40808)
ε_A ∼ N(0.8482, 2.2587e−4)
α ∼ N(1.7343e−9, 2.7619e−20)
ε_V ∼ N(0.7584, 8.4261e−4)

The values associated to the mean and the standard deviation of the normal distributions are obtained from the mean vector and variance-covariance matrix given by Eqns (26) and (27). Figures 3 and 4 show the results obtained in the optimization process for the reliability and cost functions, respectively.

Figure 1. R-C plot of uncertain results considering dependency and parameters normally distributed (Tolerance limits for cost function).
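The embedding procedure of Section 4 (sort the runs by one output, take the extreme runs as that output's tolerance limits, discard those runs, and repeat for the next output) can be written compactly. The sketch below is an illustrative implementation of that procedure; the function and variable names are ours.

```python
def tolerance_box(Y):
    """Two-sided distribution-free tolerance box for p outputs (r = 1).

    Y is a list of p rows, one per output, each holding the N run results.
    For each output in turn: sort the remaining runs by that output, take
    the extreme runs as [Li, Ui], then drop those two runs (columns).
    """
    rows = [list(r) for r in Y]
    box = []
    for i in range(len(rows)):
        n = len(rows[0])
        order = sorted(range(n), key=lambda j: rows[i][j])
        lo, hi = order[0], order[-1]
        box.append((rows[i][lo], rows[i][hi]))
        keep = [j for j in range(n) if j not in (lo, hi)]
        rows = [[row[j] for j in keep] for row in rows]
    return box
```

With p = 1 the box reduces to the ordinary sample range [y(1), y(N)]; with p outputs the procedure consumes 2p runs, which is why Eq. (24) uses N − kp.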
Figure 2. R-C plot of uncertain results considering dependency and parameters normally distributed (Tolerance limits for reliability function).

Figure 3. R-C plot of uncertain results considering independency and normally distributed parameters (Tolerance limits for cost function).

Figure 4. R-C plot of uncertain results considering independency and normally distributed parameters (Tolerance limits for reliability function).

Figure 5. R-C plot of uncertain results considering independency and uniformly distributed parameters (Tolerance limits for cost function).

Figure 6. R-C plot of uncertain results considering independency and uniformly distributed parameters (Tolerance limits for reliability function).

Comparing the tolerance intervals obtained in the two cases analyzed, an important reduction of the tolerance intervals is observed when the uncertain parameters are considered normally distributed and dependent.

Finally, we analyze the influence on the tolerance limits of the lack of knowledge of the joint distribution function. According to the principle of insufficient reason, a uniform distribution is used to characterize epistemic uncertainty in the absence of information. So, we assume that the estimated parameters (ε, α, β and η) follow uniform distributions. The upper and lower limits of the uniform distribution associated to each parameter are obtained as θ̂ ± 2σ_θ̂. Figures 5 and 6 show the results obtained in the optimization process for the reliability and cost functions, respectively.

Comparing the results shown in Figures 4–6, no significant difference is observed in the tolerance intervals obtained assuming a normal or a uniform distribution with independence among the parameters.
6 CONCLUDING REMARKS

This paper presents an analysis of the results obtained in the preventive maintenance interval optimization of safety-related equipment based on reliability and cost criteria, considering that the reliability and maintenance effectiveness parameters are random variables which introduce uncertainty in the decision making process.

So, how the input parameters distribution affects the output results is analyzed. Thus, different input parameter distributions and independency or dependency among parameters were considered. The application case shows that the effect on the tolerance intervals of the type of distribution of the parameters is not significant. However, significant differences are observed when dependence or independence among the parameters is considered.

ACKNOWLEDGMENTS

The authors are grateful to the Spanish Ministry of Education and Science for the financial support of this work in the framework of the Research Project Ref. ENE2006-15464-C02-01, which has partial financial support from the FEDER funds of the European Union.

REFERENCES

Biggs MC. Constrained minimization using recursive quadratic programming. Towards Global Optimization, North-Holland 1975: 341–349.
Bunea C, Bedford T. The effect of model uncertainty on maintenance optimization. IEEE Transactions on Reliability 2002; 51(4): 486–493.
Gill PE, Murray W, Wright MH. Practical Optimization. London: Academic Press 1981.
Guba A, Makai M, Pal L. Statistical aspects of best estimation method-I. Reliability Engineering & System Safety 2003; 80: 217–232.
Malik MAK. Reliable preventive maintenance scheduling. AIIE Transactions 1979; 11: 221–228.
Marseguerra M, Zio E, Podofillini L. Optimal reliability/availability of uncertain systems via multi-objective genetic algorithms. IEEE Transactions on Reliability 2004; 53(3): 424–434.
Martorell S, Sanchez A, Carlos S, Serradell V. A tolerance interval based approach to address uncertainty for RAMS+C optimization. Reliability Engineering & System Safety 2007; 92: 408–422.
Martorell S, Sanchez A, Carlos S, Serradell V. Age-dependent reliability model considering effects of maintenance and working conditions. Reliability Engineering & System Safety 1999; 64: 19–31.
Martorell S, Sanchez A, Carlos S, Serradell V. Simultaneous and multi-criteria optimization of TS requirements and maintenance at NPPs. Annals of Nuclear Energy 2002; 29(2): 147–168.
Nutt WT, Wallis GB. Evaluation of nuclear safety from the outputs of computer codes in the presence of uncertainties. Reliability Engineering & System Safety 2004; 83: 57–77.
Rocco CM, Miller AJ, Moreno JA, Carrasquero N, Medina M. Sensitivity and uncertainty analysis in optimization programs using an evolutionary approach: a maintenance application. Reliability Engineering & System Safety 2000; 67(3): 249–256.
Sanchez A, Martinez-Alzamora N, Mullor R, Martorell S. Motor-operated valve maintenance optimization considering multiple failure modes and imperfect maintenance models. Proceedings of ESREL 2007.
Wald A. Setting of tolerance limits when the sample is large. Annals of Mathematical Statistics 1942; 13: 389–399.
Wilks SS. Determination of the sample size for setting tolerance limits. Annals of Mathematical Statistics 1941; 12: 91–96.

Aging processes as a primary aspect of predicting reliability and life


of aeronautical hardware

J. Żurek, M. Zieja & G. Kowalczyk


Air Force Institute of Technology

T. Niezgoda
Military University of Technology

ABSTRACT: The forecasting of reliability and life of aeronautical hardware requires recognition of many
and various destructive processes that deteriorate the health/maintenance status thereof. The aging of technical
components of aircraft as an armament system proves of outstanding significance to reliability and safety of the
whole system. The aging process is usually induced by many and various factors, just to mention mechanical,
biological, climatic, or chemical ones. The aging is an irreversible process and considerably affects (i.e. reduces)
reliability and life of aeronautical equipment.

1 INTRODUCTION

The aging processes that affect aeronautical equipment are to a greater or lesser degree correlated with the item's time of operation or the number of cycles of its operation. In terms of aging processes, the items of aeronautical equipment can be divided into three groups (Żurek 2005):

• those with changes in the values of diagnostic parameters strongly correlated with time or amount of operation,
• those with changes in the values of diagnostic parameters poorly correlated with time or amount of operation, and
• those showing no correlation of changes in the values of diagnostic parameters with time or amount of operation.

For the items representing the first group, one can predict the instant of time when the diagnostic parameter reaches its boundary condition. One can also predict the time instant of the item's safe shut-down and then plan appropriate maintenance actions to be carried out.

2 A METHOD TO FORECAST RELIABILITY AND LIFE OF SOME SELECTED ITEMS OF AERONAUTICAL EQUIPMENT

What has been assumed in the already developed method is as follows (Żurek 2005):

1. The health/maintenance status of any item included in the aeronautical equipment can be described with diagnostic parameters available throughout the operational phase, designated in the following way:

X = (X_1, X_2, X_3, …, X_n)  (1)

2. The values of the diagnostic parameters change due to aging processes going on all the time. It is assumed that these changes are monotonic in nature; they can be presented in the following way:

ΔX_i = |X_i − X_i^nom|, i = 1, 2, 3, …, n  (2)

where: ΔX_i = absolute value of the deviation of the diagnostic parameter from the nominal value; X_i = current value of the i-th parameter; X_i^nom = nominal value of the i-th parameter.

3. Any item of the aeronautical equipment is serviceable (fit for use) if the following dependence occurs:

ΔX_i ≤ ΔX_i^g  (3)

where: ΔX_i^g = absolute value of the boundary deviation of the diagnostic parameter from the nominal value.
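The serviceability condition (2)–(3) is a componentwise check of the observed deviations against their boundary values. A minimal illustrative helper (names are ours, not from the paper):

```python
def serviceable(current, nominal, bounds):
    """Eq. (3): the item is fit for use if |Xi - Xi_nom| <= boundary
    deviation for every diagnostic parameter i."""
    return all(abs(x - x_nom) <= g
               for x, x_nom, g in zip(current, nominal, bounds))
```

A single parameter drifting past its boundary deviation is enough to declare the item unserviceable.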
To be clearer, the following terms (notations) Equation (9) is now rearranged to take the form of a
have been introduced: partial differential equation of the Fokker-Planck type:

ΔXi = zi (4) ∂u(zi , t) ∂u(zi , t) 1 2 ∂ 2 u(zi , t)


= −C + C (10)
ΔXi
g
=
g
zi (5) ∂t ∂zi 2 ∂zi2

where: zi = absolute value of deviation of the Since C is a random variable, an average value of
diagnostic parameter from the nominal value;
z_i^g = absolute value of the boundary deviation of the diagnostic parameter from the nominal value.

Equation (3) can therefore be written in the following form:

z_i ≤ z_i^g    (6)

4. Values of changes in diagnostic parameters grow randomly.
5. Changes in diagnostic parameters accepted for the assessment of the health/maintenance status of individual items of aeronautical equipment are independent random variables, i.e. a change of any of these parameters does not result in any change in the values of the other parameters.
6. The method is dedicated to some selected items of aeronautical equipment, namely those for which the rate of changes in diagnostic parameters can be described with the following dependence:

dz_i/dt = C    (7)

where: C = operating-conditions-dependent random variable; t = calendar time.

The dynamics of changes in the values of deviations of the assumed diagnostic parameters, if approached randomly, is described with a difference equation. One arbitrarily chosen parameter z_i has been accepted for analysis. The difference equation for the assumptions made takes the form (Tomaszek 2001):

U_{z_i, t+Δt} = P · U_{z_i − Δz_i, t}    (8)

where: U_{z_i, t} = probability that at the instant of time t the deviation of a diagnostic parameter takes the value z_i; P = probability that the value of the deviation increases by Δz_i within the time interval Δt.

Equation (8) takes the following form if function notation is used:

u(z_i, t + Δt) = u(z_i − Δz_i, t)    (9)

where: u(z_i, t) = time-dependent density function of changes in the diagnostic parameter.

The expected value of this variable is introduced. It has the form:

E[C] = ∫_{C_d}^{C_g} c f(c) dc    (11)

where: f(c) = density function of the random variable C; C_g, C_d = upper and lower values of the random variable C.

Taking account of equation (11) while considering formula (10), the following dependence is arrived at:

∂u(z_i, t)/∂t = −b ∂u(z_i, t)/∂z_i + (a/2) ∂²u(z_i, t)/∂z_i²    (12)

where: b = E[C], the average increment of the value of the deviation of the diagnostic parameter per unit time; a = E[C²], the mean square increment of the value of the deviation of the diagnostic parameter per unit time.

We need to find a partial solution of equation (12), one that at t → 0 converges to the so-called Dirac function: u(z_i, t) → 0 for z_i ≠ 0, but in such a way that the integral of u(z_i, t) equals unity for all t > 0. This solution takes the form:

u(z_i, t) = (1/√(2πA(t))) e^{−(z_i − B(t))²/(2A(t))}    (13)

where:

A(t) = ∫₀ᵗ a dt = at,    B(t) = ∫₀ᵗ b dt = bt

Function (13) is a probabilistic characteristic of changes of the diagnostic parameter due to the effects of aging processes, the rate of which can be determined with equation (7). The density function of changes in the value of the diagnostic parameter can be used directly to estimate the reliability and life of an aeronautical device whose health/maintenance status is estimated with this parameter. Applying the density function of changes in values of the diagnostic parameter to determine the distribution of the time of exceeding the boundary condition is a good example of such a solution.
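As a numerical sanity check on the solution (13) — a minimal sketch with illustrative values of a, b and t, none of them taken from the paper — the Gaussian density should integrate to unity and have mean B(t) = bt and variance A(t) = at:

```python
import numpy as np

def u(z, t, a, b):
    """Gaussian solution (13) of the Fokker-Planck equation (12),
    with spread A(t) = a*t and drift B(t) = b*t of the deviation."""
    A, B = a * t, b * t
    return np.exp(-(z - B) ** 2 / (2.0 * A)) / np.sqrt(2.0 * np.pi * A)

# Illustrative values only: a = E[C^2] = 0.2, b = E[C] = 0.5, t = 4.
a, b, t = 0.2, 0.5, 4.0
z = np.linspace(-10.0, 14.0, 240001)
dz = z[1] - z[0]
dens = u(z, t, a, b)

total = dens.sum() * dz                     # ~1 (normalization)
mean = (z * dens).sum() * dz                # ~B(t) = 2.0
var = ((z - mean) ** 2 * dens).sum() * dz   # ~A(t) = 0.8
```

Because (13) is the Green's function of (12), the same check holds for any t > 0, consistent with the Dirac initial condition.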
The probability of exceeding the boundary value by the diagnostic parameter can be presented using the density function of changes in the diagnostic parameter (Tomaszek 2001):

Q(t, z_i^g) = ∫_{z_i^g}^{∞} (1/√(2πat)) e^{−(z_i − bt)²/(2at)} dz    (14)

To determine the density function of the time of exceeding the admissible value of deviation z_i^g for the first time, one should use the following dependence:

f(t) = ∂Q(t, z_i^g)/∂t    (15)

Substituting equation (14) into equation (15) gives:

f(t) = ∂/∂t ∫_{z_i^g}^{∞} (1/√(2πat)) e^{−(z_i − bt)²/(2at)} dz    (16)

Using the properties of differentiation and integration, the following dependence is arrived at:

f(t)_{z_i^g} = ((z_i^g + bt)/(2t)) (1/√(2πat)) e^{−(z_i^g − bt)²/(2at)}    (17)

Equation (17) determines the density function of the time of exceeding the boundary condition by values of the diagnostic parameter. What is to be found next is the dependence that determines the expected value of the time of exceeding the boundary condition by the diagnostic parameter:

E[T] = ∫₀^∞ t f(t)_{z_i^g} dt    (18)

Hence

E[T] = z_i^g/(2b) + z_i^g/(2b) + a/(2b²) = z_i^g/b + a/(2b²)    (19)

We also need to find the dependence that determines the variance of the distribution of the time of exceeding the boundary condition by the diagnostic parameter. In general, this variance is determined with dependence (20):

σ² = ∫₀^∞ t² f(t)_{z_i^g} dt − (E[T])²    (20)

Hence

σ² = (a z_i^g + b (z_i^g)²)/b³ + 5a²/(4b⁴) − (z_i^g)²/(2b²)    (21)

The presented method of determining the distribution of the time of exceeding the boundary condition by the diagnostic parameter makes it possible to find the density function of the time of reaching the boundary state. On the basis thereof, one can determine the reliability of a given item of aeronautical equipment, the health/maintenance status of which is estimated by means of the diagnostic parameter under consideration:

R(t) = 1 − ∫₀ᵗ f(t)_{z_i^g} dt    (22)

The probability density function that determines the distribution of the time of the diagnostic parameter's value passing through the boundary condition also allows calculating the aeronautical item's life. Therefore, the level of risk of exceeding the boundary condition should be found:

Q(t)_{z_i^g} = ∫₀ᵗ f(t)_{z_i^g} dt    (23)

The value of time for which the right side of equation (23) equals the left one determines the life of an item of aeronautical equipment under the conditions defined with the above-made assumptions.

3 ESTIMATES OF LIFE AND RELIABILITY OF AIRBORNE STORAGE BATTERIES

Airborne storage batteries are those items of aeronautical equipment that show a strong correlation between changes in the values of diagnostic parameters and time or amount of operating time. Capacitance Q is a diagnostic parameter directly correlated with the aging processes that take place while operating airborne storage batteries, one which explicitly determines the expiry date thereof (Fig. 1). The presented method allows estimating the reliability and residual life of airborne storage batteries using diagnostic parameters recorded in the course of operating them.
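The closed-form mean (19) can be cross-checked against the density (17) by numerical integration; the boundary value z_g, drift b and diffusion a below are illustrative, not the battery data of the paper:

```python
import numpy as np

def f_first_passage(t, z_g, a, b):
    """Density (17) of the first time the deviation exceeds z_g."""
    return ((z_g + b * t) / (2.0 * t)
            * np.exp(-(z_g - b * t) ** 2 / (2.0 * a * t))
            / np.sqrt(2.0 * np.pi * a * t))

# Illustrative values only: boundary z_g = 10, b = 1.0, a = 0.1.
z_g, a, b = 10.0, 0.1, 1.0
t = np.linspace(1e-4, 30.0, 300000)
dt = t[1] - t[0]
f = f_first_passage(t, z_g, a, b)

mass = f.sum() * dt                        # total probability, ~1
E_T = (t * f).sum() * dt                   # numeric mean, eq. (18)
E_T_closed = z_g / b + a / (2.0 * b ** 2)  # closed form, eq. (19)

Q = np.cumsum(f) * dt                      # cumulative risk, eq. (23)
R = 1.0 - Q                                # reliability, eq. (22)
```

The closed form (19) is reproduced to within the integration-grid error, and Q and R give the risk and reliability curves of the kind used in section 3.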
A maximum-likelihood method was used in order to estimate the parameters a and b in equation (17). The calculations made so far for the 12-SAM-28 airborne storage batteries yield the characteristics of the density function f(t) of the time of exceeding the boundary condition by values of the diagnostic parameter, and of the reliability function R(t); they are shown in Figs. 2 and 3.

With the above-presented method applied, the following values of life T and residual life Tr have been obtained for particular 12-SAM-28 storage batteries (see Table 1).

Figure 1. Characteristic curves of the changes in capacitance values for storage batteries 12-SAM-28 (capacitance Q [Ah] against service life t [months]).

Figure 2. Characteristic curves of the density function f(t) for storage batteries 12-SAM-28 (f(t) against service life t [months]).

Figure 3. Characteristic curves of the reliability function R(t) for storage batteries 12-SAM-28 (R(t) against service life t [months]).

Table 1. Estimated values of life and residual life.

Storage batteries 12-SAM-28

No  Battery No.  T [months]  Tr [months]
1   109          27.7        13.76
2   574          29.2        11.28
3   180          34.54       17.5
4   112          41.19       23.47
5   525          50.71       32.17
6   170          47.8        30.38
7   159          26.13       10.61
8   330          23.73       6.98
9   280          29.12       12.32
10  596          20.85       5.7

4 CONCLUSIONS

The method introduced in this paper allows the analysis of the health/maintenance status of some selected items of aeronautical equipment because of the nature of the changes in the values of the diagnostic parameters available throughout the operational phase thereof. Determination of how the values of diagnostic parameters and deviations thereof increase enables determination of the time interval within which a given item remains fit for use (serviceable).

The dependence for the rate of changes in the value of the diagnostic parameter, i.e. equation (7), is of primary significance in this method. The method will not change substantially if other forms of this dependence are used. These different forms may result in changes of the coefficients in the Fokker-Planck equation (10), which in turn will result in changes of the dependences for both the average value and the variance of the density function of changes of the diagnostic parameter. The method also offers a capability of describing aging and wear-and-tear processes within a multi-dimensional system. The above-presented method allows:

• assessment of the residual life of some selected items of aeronautical equipment with the required reliability level maintained,
• estimation of the reliability and life of some selected items of aeronautical equipment on the grounds of diagnostic parameters recorded in the process of operating them,
• verification of the process of operating some selected items of aeronautical equipment to maintain the required level of reliability between particular checks.

The way of proceeding suggested in the method under examination can be adapted to the specific characteristics of the aging and wear-and-tear processes that affect various items of aeronautical equipment throughout the operational phase thereof.

REFERENCES

Jaźwiński, J. & Żurek, J. 2007. Wybrane problemy sterowania zapasami. Radom.
Tomaszek, H. & Wróblewski, M. 2001. Podstawy oceny efektywności eksploatacji systemów uzbrojenia lotniczego. Warszawa.
Żurek, J. 2006. Żywotność śmigłowców. Warszawa.
Żurek, J. & Tomaszek, H. 2005. Zarys metody oceny niezawodności statku powietrznego z uwzględnieniem uszkodzeń sygnalizowanych i katastroficznych (nagłych). Warszawa.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

An alternative imperfect preventive maintenance model

J. Clavareau
Université Libre de Bruxelles (U.L.B), Service de métrologie Nucléaire, Belgium
Grant F.R.I.A, Belgium

P.E. Labeau
Université Libre de Bruxelles (U.L.B), Service de métrologie Nucléaire, Belgium

ABSTRACT: All systems are subject to aging. When a component is aging, its global performances are decreasing. In order to reduce the effect of aging on a system, preventive maintenance actions can be performed. Yet the rejuvenation of a unit that can be achieved thanks to these interventions is most of the time not total, and does not correspond to the classical as-good-as-new assumption. Imperfect maintenance models have been proposed in the literature in order to embody the partial efficiency of preventive actions. This paper reviews the approaches that are available in the literature to model imperfect maintenance. It also proposes to extend and modify some of them, in order to obtain a simple and mathematically exploitable model, associated with a more realistic time-dependent behavior of the component than that corresponding to the application of previous imperfect maintenance models.

1 INTRODUCTION

All systems are subject to aging. When a component is aging, its global performances are decreasing. As a consequence of aging, the failure probability increases and the productivity of the operated system is lower. In order to reduce the aging effect on a system, preventive maintenance actions can be performed. The most easily modeled preventive maintenance is the replacement of the system: after maintenance, the system is As Good As New (AGAN). Maintenance policies based on preventive replacement of the system have been studied for a long time, starting from Barlow & Proschan 1965.

The reality is often quite different: preventive maintenance interventions are made to reduce the aging effect or to delay the appearance of these effects, but these actions are not perfect. The goal of imperfect preventive maintenance is to maintain the performances of the system within an acceptable range. Also, the efficiency and the cost of each maintenance intervention will depend on which type of action is undertaken. For example, in a car, oil, brake pads or tires have to be changed periodically, but none of these actions will make the car new. The way they affect the car state, the cost of each action and also the optimal periods to carry them out will differ from one action to the other. We also know that, despite all possible maintenance actions, the car ages. When the performances of the system tend to go outside the acceptable range mentioned above, it sooner or later becomes more interesting to replace it than to keep maintaining it. This trade-off between maintenance vs. failure costs and new investments defines the end of the working life of the system.

In practice, all the variables affecting the aging process are actually not known. The issue of reducing the evolution of all the different degradation factors to a probability law for the component lifetime, which is mathematically exploitable and depends on few parameters, remains open. Most authors consider strictly increasing failure rates, usually in the form of a single-mode, two-parameter Weibull law (see eq. (2) below). This form is actually quite different from the commonly accepted bathtub curve, as it constrains the failure rate to start from a zero value and often leads to estimations of the shape parameter lying between 1 and 2, hence to an exponent lower than 1 in the time-dependent expression of the failure rate.

In this paper, we propose, in section 2, a quick overview of the maintenance efficiency models available in the literature, and in section 3 we propose a model extending previous models to a more realistic description of the aging process, including the maintenance interventions. Section 4 briefly summarizes the different points treated in section 3.

2 MODELS REVIEW

As previously said, in Barlow & Proschan 1965, models with AGAN policies are studied.
Brown & Proschan 1983 envisaged a repair model in which a repair is minimal (As Bad As Old) with a probability p and degraded with a probability 1 − p. Minimal repair means that the failure rate after repair is the same as before the failure. The degradation of the component considered is an increase of the failure rate. With a constant failure rate between failures, this model leads to a piecewise increasing failure rate in time. A variant of this model for preventive maintenance can be deduced from there: the preventive intervention is AGAN with a probability p and ABAO with a probability 1 − p.

Generally, imperfect preventive maintenance models can be divided into two classes (Doyen & Gaudoin 2004), depending on whether the maintenance actions affect the failure rate (Nakagawa 1986, Nakagawa 1988, Zequeira & Bérenguer 2006) or the effective age (Canfield 1986, Kijima 1988, Malik 1979, Martorell et al., 1999) of the component. We can also cite a hybrid model (Lin et al., 2000) in which the maintenance actions reduce the value of both the failure rate and the effective age. Obviously, the value of the failure rate is determined by the value of the effective age, so there is a direct link between the two classes. Yet we will see below that the maintenance effect modeled with one of the two classes is not straightforwardly obtained with the other one.

These works discuss ways of modeling the maintenance effects either in case of preventive maintenance (PM) or in case of corrective maintenance (CM). Here we will summarize the main formulations in the case of preventive maintenance with minimal repair (the component's state just after repair is the same as just before failure), which is equivalent to computing the distribution of the first failure. If repairs are not minimal, the intervention effect can be modeled exactly in the same way as the intervention effect of an imperfect preventive maintenance. In this work we will thus focus on the formulation where CM is minimal.

The distribution of the failure time is completely characterized by the conditional failure intensity defined by:

∀t ≥ 0, λ_t = lim_{dt→0} (1/dt) P(N_{t+dt} − N_t = 1 | H_t)    (1)

where H_t is the past history of the component, i.e. the set of all events having occurred before t, and N_t is the number of failures observed up to time t. For the sake of simplicity, we will use the expression ''failure rate'' instead of ''conditional failure rate'' in the sequel of the text.

All the above-cited references assume that the initial failure rate (i.e. the failure rate before any maintenance action) is strictly increasing. The initial failure rate is then most of the time given by a Weibull-like behavior:

λ(t) = (β/α)(t/α)^{β−1}    (2)

where α is the scale parameter (in time units) and β is the shape parameter of the distribution.

2.1 Review of failure rate impact models

2.1.1 Arithmetic Reduction of Intensity (ARI) model
Following Doyen & Gaudoin 2004, we will define the Arithmetic Reduction of Intensity (ARI) as follows: the failure rate after maintenance is taken equal to the failure rate before maintenance, minus a given quantity. We can distinguish three particular cases:

– The reduction is a fraction of the augmentation of the failure rate since the last maintenance. We have, immediately after the Mth PM intervention:

λ^+_{T_M} = λ^−_{T_M} − ρ(λ^−_{T_M} − λ^+_{T_{M−1}})    (3)

where ρ is the efficiency of the maintenance, between 0 and 1, T_M is the time of the Mth PM action, λ^−_{T_M} the value of the failure rate just before the maintenance and λ^+_{T_M} the value of the failure rate just after the maintenance.

This model is called ARI1 and gives the following failure rate at time t:

λ_t = λ(t) − ρ λ(T_{M_t})    (4)

where M_t is the number of maintenance actions up to time t. We can see that the failure rate at time t is thus given by only one subtraction.

– The reduction is proportional to the global wear-out before the maintenance. This model is called ARI∞ and we have:

λ^+_{T_M} = λ^−_{T_M} − ρ λ^−_{T_M}    (5)

Considering a constant time interval T_PM between two consecutive preventive maintenance actions and minimal repair in case of failure, the ARI∞ model gives the following expression of the failure rate at time t:

λ_t = λ(t) − ρ Σ_{j=0}^{M_t−1} (1 − ρ)^j λ((M_t − j) T_PM)    (6)

where M_t is the number of maintenance actions performed up to time t. The reduction is thus proportional to a potentially infinite sum, hence the name of the model.
– An intermediate case between the two previous ones can be considered. If we keep only the first m terms in the sum in (6), we have the ARIm model:

λ_t = λ(t) − ρ Σ_{j=0}^{min(m−1, M_t−1)} (1 − ρ)^j λ((M_t − j) T_PM)    (7)

The ARI1 and ARI∞ models are then particular cases of the ARIm model. In the ARI family of models, the value ρ = 0 means that the intervention is ABAO; but, when ρ = 1, the intervention is not AGAN, because the failure rate evolution with time is different from the evolution of the initial failure rate of the component. This behavior is at the same time an advantage and a drawback of the model: there is a part of the aging, related to the working time, that is unavoidable, but the replacement of a component is not included in the model of impact. Figure 1 below gives an example of the resulting failure rate for different maintenance efficiency models, including the ARI1 and ARI∞ models, for a given initial failure rate. We can see that the ARI models result in a piecewise vertical translation of the failure rate value and hence keep the wear-out trend.

2.1.2 Nakagawa model
Nakagawa 1986, 1988 gives a model where each PM resets the value of the failure rate to 0. But after a PM, the failure rate evolution is increased by an adjustment factor θ bigger than 1. We have:

λ_t = θ^{M_t} λ(t − M_t T_PM)    (8)

This Nakagawa model assumes that the failure rate of a component after a PM action is the product of the adjustment factor and the failure rate before the action. The value of the adjustment factor is a measure of the PM quality. Indeed, if θ is equal to 1, the maintenance is AGAN. This model seems non-intuitive because, in a certain way, the PM action increases the failure rate. If the factor θ is intended to emphasize the fact that there is an irreversible part in the aging, it seems that θ must then depend on the working time. Indeed, in this situation, the higher the frequency of the PM, the higher the increase of the failure rate evolution, which is unexpected. Also, there is no ABAO possibility, because the model always resets the failure rate to 0.

2.1.3 Maintainable and non-maintainable failure mode model
In Zequeira and Bérenguer 2006, the failure process is separated into two failure modes: the maintainable failure mode, with the failure rate λ(t), and the non-maintainable failure mode, with the failure rate h(t). They propose to take into account the dependence between these two competing failure modes. The failure rate of the maintainable failure mode is the sum of the initial failure rate λ(t) plus a positive value p(t)h(t). They suppose that, without imperfect PM, the maintainable failure mode is AGAN and the non-maintainable failure mode remains unchanged.

Under this model, the failure rate of the component at time t is given by:

λ_t = h(t) + λ(t − M_t T_PM) + p(t − M_t T_PM) h(t)    (9)

With this model, the efficiency of maintenance decreases with time.

In this model, there is no flexibility in the maintenance impact to adapt it to a specific type of maintenance action. The maintenance efficiency does not depend on the actions undertaken on the component, but only on the internal parameters, which are the maintainable and the non-maintainable failure rates, as well as the model of dependence between these two failure modes. It also seems difficult to determine the dependence function p(t) for practical cases.

2.2 Review of effective age impact models

In these models, the actual working time is replaced by an effective age, denoted τ_t, in the estimation of the failure rate at time t. The value of this effective age should represent the state of wear-out of the component. The concept of effective age is sometimes also referred to as virtual age in the literature.

As for the failure rate model, following Doyen & Gaudoin 2004, we will define the Arithmetic Reduction of Age (ARA). In the ARA model, the effective age after maintenance results from a decrease of the effective age before the maintenance. We can distinguish between an ARA1 model, an ARA∞ model and the general ARAm model:

– ARA1: the reduction of the effective age due to the intervention is proportional to the increase of age between two maintenance actions:

τ^+_{T_M} = τ^−_{T_M} − ρ(τ^−_{T_M} − τ^+_{T_{M−1}})    (10)

with the efficiency ρ such that 0 ≤ ρ ≤ 1 and λ_t = λ(τ_t).

– ARA∞: the reduction of the effective age due to the intervention is proportional to the age before maintenance:

τ^+_{T_M} = τ^−_{T_M} − ρ τ^−_{T_M}    (11)

with 0 ≤ ρ ≤ 1 and λ_t = λ(τ_t).
– ARAm: by analogy with the ARIm, with a constant maintenance time interval and a constant efficiency, we can define:

τ_t = t − ρ Σ_{j=0}^{min(m−1, M_t−1)} (1 − ρ)^j (M_t − j) T_PM    (12)

with 0 ≤ ρ ≤ 1 and λ_t = λ(τ_t).

If ρ = 1, we have an AGAN intervention and, if ρ = 0, we have an ABAO intervention.

The model from Canfield 1986, the Kijima1 model (Kijima 1988) and the Proportional Age Setback (PAS) (Martorell et al., 1999) are all equivalent to the ARA1 model. The model from Malik 1979, the Kijima2 model (Kijima 1988) and the Proportional Age Reduction (PAR) (Martorell et al., 1999) are all equivalent to the ARA∞ model.

It is obvious that, with a failure rate of the form (2) and β = 2, the ARA models are equivalent to the ARI models.

Note also that, with a constant efficiency and a constant maintenance time interval, the ARA1 model gives:

τ_t = t − ρ M_t T_PM    (13)

and the ARA∞ model gives:

τ_t = (t − M_t T_PM) + Σ_{i=1}^{M_t} (1 − ρ)^i T_PM    (14)

For the ARA∞ model, the constant efficiency leads to an upper-bounded effective age:

lim_{M→∞} τ^+_{T_M} = ((1 − ρ)/ρ) T_PM    (15)

Indeed, using (14), we have:

T_PM lim_{M→∞} Σ_{i=1}^{M} (1 − ρ)^i = T_PM (1/(1 − (1 − ρ)) − 1) = ((1 − ρ)/ρ) T_PM    (16)

considering that ρ is between 0 and 1.

This upper bound of the effective age implies an upper bound on the failure rate of the component. We can see this behavior in Figure 1, where the ARA∞ curve is the lowest one. We can also see in Figure 1 that, obviously, the ARA models result in a piecewise translation in time of the initial failure rate.

Figure 1. Conditional failure rate for models ARI1, ARI∞, ARA1 and ARA∞.

2.3 Extended models

In Wu and Clements-Croome 2005, the authors extend the Nakagawa and the Canfield models. They assume that the maintenance efficiency is a random variable. This can be suitable when the maintenance efficiency is estimated between two boundaries, or to emphasize the fact that each maintenance operation may have a different impact, thus embodying the variability in the resulting state of the component.

To avoid the problem of the upper bound in age, Fouathia et al. 2005 propose not only to use, for the ARA∞ model, a random variable for the maintenance efficiency, but also to assume that the expected value of this efficiency decreases with the number of PM interventions undertaken on the component. In Clavareau and Labeau 2006, the same idea is used, but with an expected value of the efficiency decreasing with the effective age of the component.

Martorell et al., 1999 propose, in addition to the PAS and PAR models, to use the Accelerated Life Model (ALM) to describe the influence of the working conditions on the effective age. Between the maintenance actions, the augmentation of the effective age is no longer linear but proportional to a factor Ψ(z) = e^{ξᵀz}, where z is the vector of q working state variables and ξᵀ the transposed vector of the q corresponding regression coefficients. In their applications, they assume a constant value for the maintenance efficiency and constant working conditions. If the working conditions are constant, this model results in a change in the value of the scale parameter depending on the specific use of the component.

Related to this problem of working conditions, we can cite the proportional intensities and proportional hazards models. Kumar & Klefsjö 1994 review the existing literature on the proportional hazards model.
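The geometric convergence of the ARA∞ effective age to the bound (15) can be checked by iterating the update (11) with a constant interval and comparing with the closed form (14); the values of ρ and T_PM below are illustrative:

```python
def ara_inf_age_after_pm(M, rho, T_pm):
    """Effective age just after the M-th PM under ARA-infinity:
    the age grows by T_pm between PMs, and each PM removes the
    fraction rho of the current age (eq. 11)."""
    tau = 0.0
    for _ in range(M):
        tau = (1.0 - rho) * (tau + T_pm)   # tau+ = tau- - rho * tau-
    return tau

rho, T_pm = 0.3, 2.0                        # illustrative values
bound = (1.0 - rho) / rho * T_pm            # limit age, eq. (15)
closed_50 = sum((1.0 - rho) ** i * T_pm
                for i in range(1, 51))      # eq. (14) just after PM 50
ages = [ara_inf_age_after_pm(M, rho, T_pm) for M in (1, 5, 50, 500)]
```

After a few tens of interventions the age is numerically indistinguishable from the bound, which is what caps the ARA∞ failure rate in Figure 1.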
In Samrout et al., 2008, the authors propose that a PM action reduces the effective age of the component by a reduction factor (as in an ARA∞ model) depending on the costs of the intervention. Percy and Kobbacy 2000 propose that the intensity function is given by κ(t) = κ₀(t) e^{γᵀx} for hazards following a PM, and by λ(t) = λ₀(t) e^{ωᵀy} for hazards following a CM, where t measures the time since the most recent event and x is the vector of working state variables, which can possibly be modified by the maintenance actions.

3 MODEL EXTENSION

3.1 Bi-Weibull failure rate

The previously cited papers always consider strictly increasing failure rates in their applications. The latter follow a power law according to equation (2). This form is quite different from the bathtub curve usually agreed upon as the most realistic time-dependent evolution of the failure rates. After the ''infant mortality'' region, there is a possibly long period of life where the component's aging is limited, or even negligible, as long as the physical characteristics of the system under study are, or are maintained, within an acceptable range. Nonetheless, the component is aging and, at some point, the aging process accelerates and it becomes preferable to replace the component rather than maintain it. To approach this behavior, we can use an initial failure rate that is constant in a first period of time, before being increased by an additional contribution following the power law:

λ(t) = λ₀                              if t ≤ ν
λ(t) = λ₀ + (β/α)((t − ν)/α)^{β−1}     if t > ν    (17)

The corresponding cumulative distribution is given by the following bi-Weibull law:

F(t) = 1 − e^{−λ₀ t}                   if t ≤ ν
F(t) = 1 − e^{−λ₀ t − ((t−ν)/α)^β}     if t > ν    (18)

Later, we will call λ₀ the constant failure rate and ν the location parameter.

3.2 Maintenance model

With this assumption of a constant failure rate on an initial period, the previous models of imperfect intervention are partly irrelevant, as they assume that aging has already started from the beginning of the component operation. Indeed, it would be meaningless to reduce this constant failure rate if no major change is made on the component. In the same way, in the Brown and Proschan model of imperfect repair, there is no difference between the ABAO and the AGAN interventions before aging has started.

3.2.1 Location parameter impact model
Because the aim of the PM is to maintain the unit performance in an acceptable range, one can imagine that, before aging has started, the effect of a PM is to delay the wear-out onset of the component. We can model this delay by using, instead of an effective age, an effective location parameter in equations (17) and (18). The effect of a PM will then be to increase this effective location parameter. Two models can be proposed. In the first one, the location parameter is incremented by an amount proportional to the maintenance period T_PM:

ν^+_{T_M} = ν^−_{T_M} + ρ T_PM    (19)

where the maintenance efficiency ρ is between 0 and 1. When ρ is equal to 0, the intervention is ABAO, and when ρ is equal to 1, the intervention is AGAN.

Equation (19) gives:

ν^+_{T_M} = ν₀ + Σ_{i=1}^{M} ρ T_PM = ν₀ + M ρ T_PM    (20)

where ν₀ denotes the initial value of the location parameter.

In the second model, the location parameter is equal to the initial location parameter incremented by an amount proportional to the age of the component:

ν^+_{T_M} = ν₀ + ρ τ^−_{T_M}    (21)

If we suppose that repairs are minimal, we have τ^−_{T_M} = M T_PM, and thus equation (21) is equivalent to equation (20).

We can also see that the model of equation (21) is equivalent, in terms of failure probability, to the ARA∞ model. Indeed, let us compute the value τ^+_{T_M} − ν^+_{T_M}; because, in the model of equation (21), the age after maintenance, τ^+_{T_M}, is the same as the age before maintenance, τ^−_{T_M}, we have:

τ^+_{T_M} − ν^+_{T_M} = τ^−_{T_M} − ρ τ^−_{T_M} − ν₀    (22)

And the quantity τ^−_{T_M} − ρ τ^−_{T_M} is equal to the effective age after intervention in the ARA∞ model (equation (11)). Thus, when the effective age at time t, τ_t, is lower than the location parameter, the delay till the aging onset due to the imperfect PM can be symmetrically obtained by reducing the value of the effective age or by increasing the value of the location parameter.
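Equations (17) and (18) can be cross-checked numerically: the closed-form bi-Weibull CDF must equal 1 − exp(−∫₀ᵗ λ(s) ds). A minimal sketch follows; the parameter values α, β, λ₀, ν are illustrative only, not taken from the paper:

```python
import math

ALPHA, BETA, LAM0, NU = 8.0, 3.0, 0.02, 5.0   # illustrative values

def failure_rate(t):
    """Bi-Weibull failure rate, eq. (17): constant up to the location
    parameter nu, then an added Weibull power-law contribution."""
    if t <= NU:
        return LAM0
    return LAM0 + (BETA / ALPHA) * ((t - NU) / ALPHA) ** (BETA - 1)

def cdf(t):
    """Closed-form bi-Weibull cumulative distribution, eq. (18)."""
    if t <= NU:
        return 1.0 - math.exp(-LAM0 * t)
    return 1.0 - math.exp(-LAM0 * t - ((t - NU) / ALPHA) ** BETA)

# Cross-check: F(t) must equal 1 - exp(-integrated failure rate).
t_end, n = 15.0, 200000
dt = t_end / n
cum = sum(failure_rate((k + 0.5) * dt) * dt for k in range(n))  # midpoint rule
F_numeric = 1.0 - math.exp(-cum)
```

The agreement holds for any t, since the exponent in (18) is exactly the integrated hazard of (17).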
3.2.2 Extended effective age model
We thus propose to use the effective age as a degradation criterion. Depending on the previous interventions, the effective age will give the global performance state of the component. With this assumption, we have only one parameter describing the entire aging process of the component. Before any intervention on the component, its effective age is equal to its working time. The interventions can modify the value of the effective age and, between two consecutive interventions, in normal working conditions, it is assumed that the increase of the effective age is equal to the time elapsed between the two interventions.

In most of the works cited in the review of section 2, except Zequeira & Bérenguer 2006, Fouathia et al., 2005 and Clavareau and Labeau 2006, the maintenance action is characterized by a constant efficiency and a constant cost at each intervention, no matter what maintenance period is considered. With an effective age model, this leads to considering that the absolute effect of the maintenance is always higher when the maintenance time interval increases, but with no impact on the intervention cost. When we are trying to optimize the interval of maintenance, it seems not logical that the same intervention, with the same costs, can better restore the component only because the component is more aged.

The amount of effective age decrease after a PM intervention will obviously depend on the action undertaken during this intervention. We propose not to fix a priori the efficiency of the maintenance, but to relax this assumption and let the maintenance effect vary as a function of the action undertaken in the maintenance interventions. We will have:

τ^+_{T_M} = τ^−_{T_M} − δ_M    (23)

The amount of the decrease, δ_M, will characterize the effect of the Mth maintenance. The value of δ_M can be interpreted directly as the augmentation of the expected residual lifetime of the component due to the intervention. With this formalism, if δ_M is equal to 0, the intervention is ABAO; if δ_M is equal to τ^−_{T_M}, the maintenance is AGAN; if δ_M is lower than τ^−_{T_M}, it corresponds to an imperfect intervention; and, possibly, if δ_M is greater than τ^−_{T_M}, the intervention is more than perfect.

Note that the ARA1 and ARA∞ models are particular cases of this more general model. In ARA1, the maintenance effect δ_M is equal to ρ T_PM and, in ARA∞, we have δ_M = ρ τ^−_{T_M}.

In the case of the initial failure rate (17), we have a resulting conditional failure rate given by:

λ_t = λ₀                                if τ_t ≤ ν
λ_t = λ₀ + (β/α)((τ_t − ν)/α)^{β−1}     if τ_t > ν    (24)

When the effective age at time t, τ_t, is lower than the location parameter, the failure rate of the component is constant and the maintenance can delay the start of the aging process. When the age is greater than the location parameter, if, mathematically, relation (23) is still applicable, the meaning is different. We have two different cases: if δ_M is greater than τ^−_{T_M} − ν, the aging of the component is stopped and the failure rate is again constant, decreased down to the value λ₀. If δ_M is lower than τ^−_{T_M} − ν, the aging of the component is not stopped and the failure rate is still increasing, but with a translation in time corresponding to δ_M.

With this model, the corresponding cumulative distribution at time t, after an intervention at time t_s, is:

F(τ_t) = 1 − e^{−λ₀(t − t_s)}                                              if τ_t ≤ ν
F(τ_t) = 1 − e^{−λ₀(t − t_s) − ((τ_t − ν)/α)^β}                            if τ_t > ν and τ_s^+ ≤ ν    (25)
F(τ_t) = 1 − e^{−λ₀(t − t_s) − ((τ_t − ν)/α)^β} / e^{−((τ_s^+ − ν)/α)^β}   if τ_s^+ > ν

where τ_t = τ_s^+ + (t − t_s) is the effective age of the component at time t and τ_s^+ is the effective age just after the intervention.

3.2.3 Intervention impact model
When a component ages, part of this aging cannot be rejuvenated, unless a bigger cost is incurred. In practice, this model allows us to envisage that each maintenance operation may have a different impact. This impact should not be taken as constant. Moreover, we can assume, as Wu and Clements-Croome 2005, that δ_M is a random variable embodying the variability in the resulting state of the component. The value or the distribution of δ_M must also depend on the age of the component if the expected effect of a preventive maintenance is less and less efficient as the number of maintenance actions undergone by a component gets higher.

The model also allows considering a limited PM in order to maintain the initial working conditions and performances, and a more important PM with a better efficiency to recover the loss of performances due to degradation. The latter type of PM should be performed with a lower periodicity and it entails higher costs than the limited, usual one. With this point of view, the decision to replace a component is in competition with the decision to make, or not, a more important and costly maintenance with a large effect on the component state but with a higher risk of failure than the replacement.

Figures 2 and 3 illustrate the impact of this kind of mixed maintenance, for a component with an initial failure rate following (24) with λ₀ = 1/10 u.t.⁻¹,
λt = λ0 + ( ) if τt > ν
α α α = 10 u.t, β = 2.5, ν = 5 u.t.
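The piecewise form of (25) can be cross-checked numerically: the conditional probability of failure over an elapsed time t − ts, starting from effective age τs+, must equal one minus the exponential of the integrated hazard (24). A minimal sketch, using the parameter values of the Figure 2–3 example (the trapezoidal integration step is an arbitrary implementation choice):

```python
import math

lam0, alpha, beta, nu = 0.1, 10.0, 2.5, 5.0  # λ0 = 1/10 u.t.⁻¹, α = 10 u.t., β = 2.5, ν = 5 u.t.

def hazard(tau):
    """Failure rate (24) as a function of the effective age τ."""
    if tau <= nu:
        return lam0
    return lam0 + (beta / alpha) * ((tau - nu) / alpha) ** (beta - 1)

def cdf_closed_form(tau_s_plus, dt):
    """Cumulative distribution (25) an elapsed time dt after an intervention
    leaving the component at effective age τs+."""
    tau_t = tau_s_plus + dt
    expo = lam0 * dt
    if tau_t > nu:
        expo += ((tau_t - nu) / alpha) ** beta
    if tau_s_plus > nu:
        expo -= ((tau_s_plus - nu) / alpha) ** beta
    return 1.0 - math.exp(-expo)

def cdf_numeric(tau_s_plus, dt, steps=50_000):
    """Same quantity by trapezoidal integration of the hazard (24)."""
    h = dt / steps
    acc = 0.0
    for i in range(steps):
        acc += 0.5 * (hazard(tau_s_plus + i * h) + hazard(tau_s_plus + (i + 1) * h)) * h
    return 1.0 - math.exp(-acc)

# One check per branch of (25): τt ≤ ν, crossing ν, and τs+ > ν.
for ts_plus, dt in [(0.0, 3.0), (2.0, 8.0), (7.0, 4.0)]:
    print(ts_plus, dt, cdf_closed_form(ts_plus, dt))
```

The three cases exercise, respectively, the purely constant-rate branch, an interval crossing the aging threshold ν, and an interval entirely in the wear-out region.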
Figure 2. Comparison of the effective age evolution for two maintenance policies.

Figure 3. Comparison of the resulting failure rates for two different maintenance policies.

In Figure 2 we can compare the effective age evolution resulting from the ARA1 model, on the one hand, and that resulting from the same model with a bigger impact taking place at the third maintenance intervention, on the other hand. Figure 3 compares the resulting failure rates. We can see that the first intervention delays the start of the aging, while the second intervention stops the aging process. The bigger impact of the third maintenance intervention permits an effective age after intervention lower than the location parameter, when the effective age before intervention was higher than the location parameter. It means that the actions undertaken during this third intervention are sufficient to stop the wear-out process and rejuvenate the component state to its initial constant failure rate. The ARA1 model (and other models with constant efficiency) does not consider this possible variation in the action undertaken from one intervention to another. This variation seems natural when the working time grows, in order to keep the performances of the system in an acceptable range.

The model is also suitable in the case where the repairs are not minimal. In this case, each intervention, including repairs, will affect the effective age.

3.2.4 Global performances
When a component is aging, not only is its failure rate increasing but its global performances are decreasing. We consider that the component is described by its failure rate, by its consumption rate (working costs per unit of time) and by its purchase cost. If we consider that the effective age is a global indicator of the component state, we can propose to characterize the performances of the working component by a consumption rate η varying as a function of the value of the effective age. In the same way as for the failure rate, this consumption rate will increase with the effective age of the component, embodying the degradation of the performances with aging. The behavior of this aging process will depend on the component. But, as in the case of the failure rate, it is reasonable to suppose a constant part when the age is not too important and an increasing part once the age has reached a threshold value, and a law of the form (24), with appropriate parameters, can model this behavior.

4 SUMMARY AND PERSPECTIVES

After a quick overview of different maintenance models available in the literature, we propose, in this paper, to extend some of the considerations of these models. The first point is to use the effective age of a component as a global indicator of the component's state. The failure rate at time t will depend directly on this effective age at time t, but this is also true for other component performances such as, for example, the energy consumption of the component. If the effective age is a global state indicator, we propose that all the events that act on the component state, such as maintenance interventions or working conditions, modify the value of the effective age.

We think that it is not realistic to consider a strictly increasing failure rate from the very start of the component's lifetime. Indeed, such an assumption leads to considering that the likelihood to fail of a component starts increasing from the very moment it is operated. Yet the daily experience of maintaining equipment leads rather to interpreting maintenance as a way of preserving as long as possible the present performances, including the failure rate, of a unit, until an accumulation of wear-out triggers a rapid performance degradation, requiring different, more
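The trajectories of Figure 2 are easy to reproduce: between interventions the effective age grows by the full interval, and at each PM it drops by δM as in (23). The sketch below compares ARA1 (constant δM = ρ·TPM) with a variable-impact policy whose third intervention is large enough to push the age back below ν; the numeric values of ρ, TPM and the δ sequence are illustrative assumptions, not values from the paper:

```python
nu, T_pm, rho = 5.0, 4.0, 0.5  # location parameter ν, PM period, ARA1 efficiency (all illustrative)

def age_trajectory(deltas, T=T_pm):
    """Effective age just before/after each PM: τ⁺ = τ⁻ − δ (eq. 23),
    with linear growth by T between interventions."""
    tau_plus, history = 0.0, []
    for delta in deltas:
        tau_minus = tau_plus + T              # age increases by the full interval
        tau_plus = max(tau_minus - delta, 0.0)  # a more-than-perfect PM cannot go below age 0
        history.append((tau_minus, tau_plus))
    return history

ara1 = age_trajectory([rho * T_pm] * 4)       # constant effect: δM = ρ·T_PM = 2
mixed = age_trajectory([2.0, 2.0, 9.0, 2.0])  # same, but a much bigger third intervention

print(ara1)   # ages keep drifting upward past ν
print(mixed)  # third PM resets the age below ν, stopping the wear-out
```

Under ARA1 the post-intervention age grows without bound (aging is only slowed), while the variable-impact sequence brings the component back into the constant-failure-rate region, which is exactly the behavior discussed for Figures 2 and 3.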
costly maintenance resources. In order to model such a maintenance impact, we have to consider a bi-Weibull distribution for the failure times, with a constant failure rate and an increasing part starting after a certain useful life, which is lengthened via maintenance.

We think also that the maintenance interventions should not have a constant efficiency, but that these interventions act differently depending on which type of intervention is actually undertaken on the component. Consequently, we propose to relax the common hypothesis of a maintenance effect proportional to a given quantity (the maintenance interval or the effective age before intervention in the most common cases). We think that this formulation allows us to investigate more realistic issues, such as the compromise between different interventions (possibly including a replacement) leading to different rejuvenation levels of the component's effective age, but with different costs and constraints.

In conclusion, the proposed model is thought to be closer to the reality of the maintenance field. It allows more flexibility in the modeling, but it always keeps the assumption that the law parameters describing the failure rate are constant in time. Further work will investigate the possibility that the aging process is stopped but has degraded the component, resulting in a higher constant failure rate λM.

Also, events in the environment of the component can affect its performances, hence the aging of the component. For example, the failure of another component in the production process can increase the failure probability of the component without causing an instantaneous failure. Other events, such as an overvoltage or a difference in the working temperature, can cause such a behavior. As we use the effective age as a global state indicator of the component's state, one could think of modeling the effect of these events by an increase in the component's effective age.

REFERENCES

Barlow R., Proschan F., 1996, ''Mathematical theory of reliability'' (reprint of the 1965 edition), SIAM, Philadelphia.
Brown M., Proschan F., 1983, ''Imperfect repair'', J. Appl. Prob.; 20:851–860.
Canfield R.V., 1986, ''Cost optimization of periodic preventive maintenance'', IEEE Trans. Reliab.; 35:78–81.
Clavareau J. and Labeau P.E., 2006, ''Maintenance and replacement policies under technological obsolescence'', Proc. of ESREL'06, Estoril (Portugal), 499–506.
Doyen L., Gaudoin O., 2004, ''Classes of imperfect repair models based on reduction of failure intensity or effective age'', Rel. Engng. Syst. Safety; 84:45–56.
Fouathia O., Maun J.C., Labeau P.E., and Wiot D., 2005, ''Cost-optimization model for the planning of the renewal, inspection, and maintenance of substation facilities in the Belgian power transmission system'', Proc. of ESREL 2005, Gdansk (Poland), 631–637.
Kijima M., Morimura H., Suzuki Y., 1988, ''Periodical replacement problem without assuming minimal repair'', Eur. J. Oper. Res.; 37:194–203.
Kumar D. and Klefsjö B., 1994, ''Proportional hazard model: a review'', Rel. Engng. Syst. Safety; 44:177–88.
Lin D., Zuo M.J., Yam R.C.M., 2000, ''General sequential imperfect preventive maintenance models'', Int. J. Reliab. Qual. Saf. Eng.; 7:253–66.
Malik M.A.K., 1979, ''Reliable preventive maintenance scheduling'', AIIE Trans.; 11:221–8.
Martorell S., Sanchez A., Serradell V., 1999, ''Age-dependent reliability model considering effects of maintenance and working conditions'', Rel. Engng. Syst. Safety; 64:19–31.
Nakagawa T., 1986, ''Periodic and sequential preventive maintenance policies'', J. Appl. Probab.; 23:536–42.
Nakagawa T., 1988, ''Sequential imperfect preventive maintenance policies'', IEEE Trans. Reliab.; 37:295–8.
Percy D.F. and Kobbacy K.A.H., 2000, ''Determining economical maintenance intervals'', Int. J. Production Economics; 67:87–94.
Samrout M., Châtelet E., Kouta R., and Chebbo N., 2008, ''Optimization of maintenance policy using the proportional hazard model'', Rel. Engng. Syst. Safety; doi:10.1016/j.ress.2007.12.006.
Wu S., Clements-Croome D., 2005, ''Preventive maintenance models with random maintenance quality'', Rel. Engng. Syst. Safety; 90:99–105.
Zequeira R.I., Bérenguer C., 2006, ''Periodic imperfect preventive maintenance with two categories of competing failure modes'', Rel. Engng. Syst. Safety; 91:460–468.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5
An imperfect preventive maintenance model with dependent failure modes

I.T. Castro
University of Extremadura, Badajoz, Spain
ABSTRACT: Consider a system subject to two modes of failures: maintainable and non-maintainable. When-
ever the system fails, a minimal repair is performed. Preventive maintenances are performed at integer multiples
of a fixed period. The system is replaced when a fixed number of preventive maintenances have been completed.
The preventive maintenance is imperfect and the two failure modes are dependent. The problem is to determine
an optimal length between successive preventive maintenances and the optimal number of preventive mainte-
nances before the system replacement that minimize the expected cost rate. Optimal preventive maintenance
schedules are obtained for non-decreasing failure rates and numerical examples for power law models are given.

1 INTRODUCTION

The classic preventive maintenance models assume that the system becomes as new after each preventive maintenance. But actually, the improvement after each preventive maintenance depends on the system age as well as on the cost and the cumulative number of preventive maintenance tasks done since the last replacement. This action is called imperfect preventive maintenance. There is an extensive literature about imperfect preventive maintenance models; see (Ben-Daya et al. 2000), (Nakagawa 2005), (Osaki 2002) and (Pham 2003) for more details.

In this work, the system failures can be grouped into two modes, maintainable and non-maintainable, analogously to (Lin et al. 2001) and (Zequeira and Bérenguer 2006). A failure rate function is related to each failure mode. The preventive maintenance is imperfect because it can only reduce the failure rate of the maintainable failures but cannot change the failure rate of the non-maintainable failures. For example, the preventive maintenance may include tasks such as oiling, cleaning, partial system replacement, . . . , which can only improve some failures but cannot eliminate other failures related to the inherent design of the system.

(Lin et al. 2001) modeled the effects of the preventive maintenance tasks by the reduction of the failure rate of the maintainable failures and of the effective age of the system, using adjustment factors. In a more realistic approach, (Zequeira and Bérenguer 2006) considered that the failure modes are dependent. This dependency is expressed in terms of the failure rate: the failure rate of the maintainable failures depends on the failure rate of the non-maintainable failures. Assuming that the preventive maintenance is performed at times kT, where k = 0, 1, 2, . . . and T > 0, the failure rate of the maintainable failures after each preventive maintenance is given by

rk,T(t) = r(t − kT) + pk(t) h(t),  kT ≤ t < (k + 1)T,    (1)

where r(t) denotes the failure rate before the first preventive maintenance, h(t) is the non-maintainable failure rate and pk(t) is interpreted as a probability in the following sense: if the system has a non-maintainable failure at time t, then it will also undergo a maintainable failure with probability pk(t).

Analogously to (Zequeira and Bérenguer 2006), we assume that the two modes of failure are dependent. Preventive maintenance actions are performed at kT, k = 1, 2, . . . , and they affect the failure rate of the maintainable failures. We assume that the occurrence of maintainable failures depends on the total number of non-maintainable failures since the installation of the system. Assuming that the failure rate of the maintainable failures is zero for a new system, the maintainable failure rate after the k-th preventive maintenance action is

r1,k(t) = r1,0(t − kT) a^N2(kT),  kT ≤ t < (k + 1)T,    (2)

where a > 1, N2(kT) represents the number of non-maintainable failures in [0, kT] and r1,0(t) is the maintainable failure rate for a new system. After each preventive maintenance, the maintainable failure rate resets to zero. The adjustment factor a^N2(kT) represents
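Equation (2) is straightforward to code: the post-PM rate is the new-system rate restarted at kT and inflated by a^N2(kT). A small sketch — the power-law rate r1,0(t) = 6.33 t^0.2 anticipates the numerical example of Section 4, while the chosen k, T and N2 values are arbitrary illustrations:

```python
a = 1.02  # adjustment factor (> 1), as in the numerical example of Section 4

def r10(t):
    """Maintainable failure rate of a new system (power-law example, r1,0(0) = 0)."""
    return 6.33 * t ** 0.2

def r1k(t, k, T, n2_at_kT):
    """Equation (2): maintainable rate in [kT, (k+1)T) after the k-th PM,
    given the observed number of non-maintainable failures N2(kT)."""
    assert k * T <= t < (k + 1) * T
    return r10(t - k * T) * a ** n2_at_kT

# Right after a PM the maintainable rate restarts from r1,0(0) = 0,
# and the accumulated wear-out only rescales it by a^N2(kT):
print(r1k(6.0, 2, 3.0, 5))   # 0.0 at t = kT
print(r1k(7.5, 2, 3.0, 4))   # r1,0(1.5) scaled by a**4
```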
the effect of the wear-out of the system (due to the non-maintainable failures) on the occurrence of the maintainable failures. Practical applications of this model are shown in (Zequeira and Bérenguer 2006), where some examples of dependence between maintainable and non-maintainable failures are explained.

In this work, the preventive maintenance actions are performed at times kT, k = 1, 2, . . . , and the system is replaced whenever it reaches an age of NT after the last renewal. When the system fails, a minimal repair is performed. Costs are associated with the preventive maintenance actions, with the repairs and with the replacements. The objective is to determine an optimal length between preventive maintenances and the optimal number of preventive maintenances between replacements of the system.

2 FORMULATION

We consider a maintenance model where corrective and preventive maintenances take place according to the following scheme.

1. Before the first preventive maintenance, the maintainable failures arrive according to a non-homogeneous Poisson process (NHPP) {N1(t), t ≥ 0} with intensity function r1,0(t) and cumulative failure intensity function

H1(t) = ∫_0^t r1,0(u) du, t ≥ 0.

A maintainable failure is corrected by a minimal repair with negligible repair time. We assume that r1,0(t) is continuous, non-decreasing in t and zero for a new system, that is, r1,0(0) = 0.

2. The non-maintainable failures arrive according to an NHPP {N2(t), t ≥ 0} with intensity function r2(t) and cumulative failure intensity function

H2(t) = ∫_0^t r2(u) du, t ≥ 0.

A non-maintainable failure is corrected by a minimal repair and the repair time is negligible. We assume that r2(t) is continuous and non-decreasing in t.

3. The system is preventively maintained at times kT, where k = 1, 2, . . . and T > 0. The preventive maintenance actions only reduce the failure rate of the maintainable failures; the failure rate of the non-maintainable failures remains undisturbed by the successive preventive maintenances.

4. The non-maintainable failures affect the failure rate of the maintainable failures in the following way. Denoting by r1,k, k = 0, 1, 2, . . . , the failure rate of the maintainable failures in the interval [kT, (k + 1)T), r1,k is given by

r1,k(t) = r1,0(t − kT) a^N2(kT),

where a > 1.

5. The system is replaced at the N-th preventive maintenance after its installation. After a replacement, the system is ''as good as new'' and the replacement time is negligible.

6. The costs associated with the minimal repairs for maintainable and non-maintainable failures are C1 and C2 respectively. The cost of each preventive maintenance is Cm and the replacement cost is Cr (Cm < Cr). All costs are positive numbers.

The problem is to determine the optimal length between preventive maintenances and the total number of preventive maintenances before the system replacement. The optimization problem is formulated in terms of the expected cost rate.

From Assumptions 3 and 4, the preventive maintenance tasks are imperfect in a double sense. First, the preventive maintenance does not affect the failure rate of the non-maintainable failures. Second, the accumulated wear-out due to the non-maintainable failures affects the failure rate of the maintainable failures. This accumulated wearing is not eliminated by preventive maintenance actions, and it is reflected in the constant a. If a = 1, non-maintainable failures and maintainable failures are independent.

From (2), after the successive preventive maintenances, the failure rate of the maintainable failures is stochastic, and we shall use some results of the theory of doubly stochastic Poisson processes (DSPP). In a DSPP, the intensity of the occurrence of the events is influenced by an external process, called the information process, such that the intensity becomes a random process. An important property of the DSPP is the following. If {N(t), t ≥ 0} is a DSPP controlled by the process Λ(t), one obtains that

P[N(t) = n] = E[(1/n!) Λ(t)^n e^(−Λ(t))], n = 0, 1, 2, . . .    (3)

From (2), the random measure in t, where kT ≤ t < (k + 1)T and k = 0, 1, 2, . . . , is given by

Λk(t) = ∫_kT^t r1,0(u − kT) a^N2(kT) du = a^N2(kT) H1(t − kT),    (4)

where H1(t) denotes the cumulative failure intensity function of r1,0(t).
Denoting by N1(t) the number of maintainable failures in [0, t] and using (3) and (4), one obtains that

P[N1((k + 1)T) − N1(kT) = n] = E[(1/n!) (a^N2(kT) H1(T))^n exp(−a^N2(kT) H1(T))],

for n = 0, 1, 2, . . . . Furthermore, using

P[N2(kT) = z] = (H2(kT)^z / z!) exp(−H2(kT)),

for z = 0, 1, 2, . . . , the expected number of maintainable failures between the k-th preventive maintenance and the (k + 1)-th preventive maintenance, for k = 0, 1, 2, . . . , is given by

E[N1(kT, (k + 1)T)] = H1(T) exp((a − 1) H2(kT)).    (5)

We denote by C(T, N) the expected cost rate. From Assumption 6 and using Eq. (5), one obtains that

C(T, N) = [C1 H1(T) Σ_{k=0}^{N−1} exp((a − 1) H2(kT)) + C2 H2(NT) + Cr + (N − 1) Cm] / (NT),    (6)

where the numerator represents the expected cost between replacements of the system and the denominator the time between successive replacements of the system.

3 OPTIMIZATION

The problem is to find the values T and N that minimize the function C(T, N) given in (6). In other words, to find the values Topt and Nopt such that

C(Topt, Nopt) = inf{C(T, N), T > 0, N = 1, 2, 3, . . . }.    (7)

Theorems 1.1 and 1.2 state the optimization problem in each variable. The proof of these results can be found in (Castro 2008).

Theorem 1.1 Let C(T, N) be the function given by (6). For fixed T > 0, the finite value of N that minimizes C(T, N) is obtained for N = Nopt^T given by

Nopt^T = min{N ≥ 0 : A(T, N) > Cr − Cm},    (8)

where A(T, N) is given by

A(T, N) = C1 H1(T) [N gN(T) − Σ_{k=0}^{N−1} gk(T)] + C2 [N H2((N + 1)T) − (N + 1) H2(NT)],    (9)

with gk(T) = exp((a − 1) H2(kT)). If N* exists such that A(T, N*) = Cr − Cm, then Nopt^T is not unique. Furthermore, if T1 ≤ T2 then Nopt^T1 ≥ Nopt^T2.

Theorem 1.2 Let C(T, N) be the function given by (6). When N is fixed, the value of T that minimizes C(T, N) is obtained for T = Topt^N, where Topt^N is the value that verifies

B(Topt^N, N) = Cr + (N − 1) Cm,    (10)

where B(T, N) is the function given by

B(T, N) = C1 Σ_{k=0}^{N−1} gk(T) [r1,0(T) T − H1(T)] + C1 Σ_{k=0}^{N−1} gk(T) {H1(T)(a − 1) k T r2(kT)} + C2 (NT r2(NT) − H2(NT)).    (11)

Furthermore, if either r1,0 or r2 is unbounded,

lim_{t→∞} r1,0(t) = ∞,  lim_{t→∞} r2(t) = ∞,

then Topt^N < ∞.

In the optimization problem given in (7), the values Topt and Nopt must satisfy expressions (8) and (10). In general, one cannot obtain an explicit analytical solution for these equations and they have to be computed numerically. But one can reduce the search of the values Nopt and Topt to a finite set of values of the variable N. For that, we use a result similar to Theorem 11 given in (Zhang and Jardine 1998), p. 1118. The result is the following.
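The quantity A(T, N) in Theorem 1.1 plays the role of a scaled marginal cost: rearranging (6) yields the identity N(N + 1)T · [C(T, N + 1) − C(T, N)] = A(T, N) − (Cr − Cm), so C decreases in N exactly while A stays below Cr − Cm. This can be checked numerically — the cost and power-law parameters below anticipate the numerical example of Section 4 (with Cm = 25, the value consistent with the T* = 25/51.3431 step there), and gk(T) = exp((a − 1)H2(kT)) as in (9):

```python
import math

C1, C2, Cr, Cm, a = 5.0, 7.5, 200.0, 25.0, 1.02
H1 = lambda t: (4.0 * t) ** 1.2   # cumulative intensity, maintainable failures
H2 = lambda t: (0.2 * t) ** 1.2   # cumulative intensity, non-maintainable failures
g = lambda k, T: math.exp((a - 1.0) * H2(k * T))

def C(T, N):
    """Expected cost rate, equation (6)."""
    repairs = C1 * H1(T) * sum(g(k, T) for k in range(N))
    return (repairs + C2 * H2(N * T) + Cr + (N - 1) * Cm) / (N * T)

def A(T, N):
    """Marginal quantity of Theorem 1.1, equation (9)."""
    return (C1 * H1(T) * (N * g(N, T) - sum(g(k, T) for k in range(N)))
            + C2 * (N * H2((N + 1) * T) - (N + 1) * H2(N * T)))

# The marginal-cost identity holds for every N:
T = 3.575
for N in range(1, 30):
    lhs = N * (N + 1) * T * (C(T, N + 1) - C(T, N))
    rhs = A(T, N) - (Cr - Cm)
    assert abs(lhs - rhs) < 1e-6
```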
Lemma 1.1 We denote by T* the following expression

T* = Cm / C(Topt^1, 1),    (12)

and let Nopt^T* be the value of N that optimizes C(T*, N). Assuming that either r1,0 or r2 is unbounded, the problem of optimization of C(T, N) given in (7) has finite optimal solutions and

C(Topt, Nopt) = min_{1≤N≤Nopt^T*} {min_{T>0} C(T, N)}.    (13)

Proof. The proof of this result can be found in (Zhang and Jardine 1998). Note that, for the model presented in this paper, from Theorem 1.1, if T1 ≤ T2 then Nopt^T1 ≥ Nopt^T2. This condition is necessary for the proof of the result.

Remark 1.1 Analogously to (Zhang and Jardine 1998), C(Topt^1, 1) in (12) may be replaced by any C(Topt^i, i) for i ∈ {2, 3, . . . }.

4 NUMERICAL EXAMPLES

We assume that the intensity functions of the processes {N1(t), t ≥ 0} and {N2(t), t ≥ 0} follow a power law model with non-decreasing failure rates, that is,

ri(t) = λi βi (λi t)^(βi−1), t ≥ 0, i = 1, 2,

where βi > 1. Let λ1 = 4, β1 = 1.2, λ2 = 0.2, β2 = 1.2 be the parameters of the failure rates. Consequently, the failure intensities of the processes before the first preventive maintenance are given by

r1,0(t) = 6.33 t^0.2,  r2(t) = 0.1739 t^0.2,  t ≥ 0,

and

H1(T) = (4T)^1.2 ≥ H2(T) = (0.2T)^1.2,  T ≥ 0,

that is, before the first preventive maintenance, the expected number of maintainable failures is greater than the expected number of non-maintainable failures. Furthermore, the following costs are associated with the minimal repairs, the preventive maintenance tasks and the system replacement:

C1 = 5, C2 = 7.5, Cr = 200, Cm = 25.

Finally, let a = 1.02 be the adjustment factor. Figure 1 shows the graphs of C(T, N) given in (6) for different values of N. These values of N are

N = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 20, 30].

To analyze the values Topt and Nopt that verify (7), Figure 2 shows an enlargement of Figure 1 to clarify the procedure of finding the values Topt and Nopt. By inspection over these values, one obtains that the optimal values for this range are Topt = 3.575 and Nopt = 10, with an expected cost rate of C(3.575, 10) = 51.281.

Using Lemma 1.1, one can reduce the search of the optimal values T and N verifying (7) to a limited range of values of N. We shall follow the following steps.
Figure 1. Expected cost C(T, N) versus T for different values of N.

Figure 2. Expected cost C(T, N) versus T for different values of N.
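The reported optimum can be reproduced directly from (6). A minimal check, assuming Cm = 25 (the value consistent with step 2 of the search procedure, where T* = 25/51.3431 = 0.4869):

```python
import math

C1, C2, Cr, Cm, a = 5.0, 7.5, 200.0, 25.0, 1.02
H1 = lambda t: (4.0 * t) ** 1.2   # cumulative intensity of maintainable failures
H2 = lambda t: (0.2 * t) ** 1.2   # cumulative intensity of non-maintainable failures

def cost_rate(T, N):
    """Expected cost rate C(T, N), equation (6)."""
    pm_factor = sum(math.exp((a - 1.0) * H2(k * T)) for k in range(N))
    return (C1 * H1(T) * pm_factor + C2 * H2(N * T) + Cr + (N - 1) * Cm) / (N * T)

print(round(cost_rate(3.575, 10), 3))  # close to the reported 51.281
```

Evaluating the same function over a grid of T and the listed values of N reproduces the curves of Figures 1 and 2.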
1. To find the value Topt^12 that verifies

min_{T>0} C(T, 12) = C(Topt^12, 12).

Using Theorem 1.2 and the function B(T, 12) given in (11), Topt^12 verifies the equation

B(Topt^12, 12) − Cr − 11 Cm = 0.

Using a root search algorithm, one obtains that Topt^12 = 3.2256 and C(3.2256, 12) = 51.3431.

2. Calculate

T* = Cm / C(Topt^12, 12) = 25 / 51.3431 = 0.4869.

3. To find the value Nopt^0.4869 that verifies

min_{N≥1} {C(0.4869, N)} = C(0.4869, Nopt^0.4869).

For that, from Theorem 1.1 and Figure 3, we obtain by inspection that Nopt^0.4869 = 86.

4. Finally, we have to find Topt and Nopt that verify

min_{1≤N≤86} {min_{T>0} C(T, N)} = C(Topt, Nopt).

Figure 4 shows the values of C(Topt^N, N) for different values of N. The values Topt^N are obtained using a root search algorithm for the function B(T, N) given by (11) and for the different values of N. By inspection, one obtains that the optimal values for the optimization problem are Topt = 3.575 and Nopt = 10, with an expected cost rate of C(3.575, 10) = 51.281.

Figure 4. Function C(Topt^N, N) versus N.

5 CONCLUSIONS

In a system with two modes of failures and successive preventive maintenance actions, we have studied the problem of finding the optimal length T between successive preventive maintenances and the optimal number of preventive maintenances N − 1 before the total replacement of the system. The two modes of failures are dependent, and their classification depends on the reduction in the failure rate after the preventive maintenance actions. For fixed T, an optimal finite number of preventive maintenances before the total replacement of the system is obtained. In the same way, when N is fixed, the optimal length between successive preventive maintenances is obtained; if the failure rates are unbounded, this value is finite. To analyze the optimization problem in the two variables, except in some special cases, we have to resort to numerical computation, but one can reduce the search of the optimal values to a finite case.

ACKNOWLEDGEMENTS

This research was supported by the Ministerio de Educación y Ciencia, Spain, under grant MTM2006-01973.

Figure 3. Function A(0.4869, N) − Cr + Cm versus N.
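Step 1 of the procedure can be reproduced with a simple bisection on condition (10), using B(T, N) from (11) with gk(T) = exp((a − 1)H2(kT)) and Cm = 25 (the value consistent with T* = 25/51.3431 in step 2); the bracket [0.5, 10] is an illustrative choice on which B is increasing:

```python
import math

C1, C2, Cr, Cm, a = 5.0, 7.5, 200.0, 25.0, 1.02
r10 = lambda t: 6.33 * t ** 0.2
r2 = lambda t: 0.1739 * t ** 0.2
H1 = lambda t: (4.0 * t) ** 1.2
H2 = lambda t: (0.2 * t) ** 1.2
g = lambda k, T: math.exp((a - 1.0) * H2(k * T))

def B(T, N):
    """Left-hand side of condition (10), equation (11)."""
    s1 = sum(g(k, T) for k in range(N)) * (r10(T) * T - H1(T))
    s2 = sum(g(k, T) * H1(T) * (a - 1.0) * k * T * r2(k * T) for k in range(N))
    return C1 * s1 + C1 * s2 + C2 * (N * T * r2(N * T) - H2(N * T))

def solve_T(N, lo=0.5, hi=10.0, iters=60):
    """Bisection for B(T, N) = Cr + (N - 1) Cm."""
    target = Cr + (N - 1) * Cm
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if B(mid, N) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(solve_T(12), 4))  # close to the reported T_opt^12 = 3.2256
```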
REFERENCES

Ben-Daya, M., S. Duffuaa, and A. Raouf (2000). Maintenance, Modeling and Optimization. Kluwer Academic Publishers.
Castro, I.T. (2008). A model of imperfect preventive maintenance with dependent failure modes. European Journal of Operational Research, to appear.
Lin, D., J. Ming, and R. Yam (2001). Sequential imperfect preventive maintenance models with two categories of failure modes. Naval Research Logistics 48, 173–178.
Nakagawa, T. (2005). Maintenance Theory of Reliability. Springer-Verlag, London.
Osaki, S. (2002). Stochastic Models in Reliability and Maintenance. Springer-Verlag, Berlin.
Pham, H. (2003). Handbook of Reliability Engineering. Springer-Verlag, London.
Zequeira, R. and C. Bérenguer (2006). Periodic imperfect preventive maintenance with two categories of competing failure modes. Reliability Engineering and System Safety 91, 460–468.
Zhang, F. and A. Jardine (1998). Optimal maintenance models with minimal repair, periodic overhaul and complete renewal. IIE Transactions 30, 1109–1119.
Condition-based maintenance approaches for deteriorating system influenced by environmental conditions

E. Deloux & B. Castanier
IRCCyN/Ecole des Mines de Nantes, Nantes, France

C. Bérenguer
Université de Technologie de Troyes/CNRS, Troyes, France

ABSTRACT: The paper deals with the maintenance optimization of a system subject to a stressful environment.
The behavior of system deterioration can be modified by the environment. Maintenance strategies, based not only
on the stationary deterioration mode but also on the stress state, are proposed to inspect and replace the system
in order to minimize the long-run maintenance cost per unit of time. Numerical experiments are conducted
to compare their performance with classical approaches and thus highlight the economical benefits of our
strategies.

1 INTRODUCTION

Maintenance is an essential activity in any production facility. In this context, numerous research works have provided multiple relevant approaches to optimize maintenance decisions based on different characteristics of the system, in order to reduce the associated costs and at the same time to maximize the availability and the safety of the considered system (Rausand and Hoyland 2004). Condition-Based Maintenance (CBM) approaches (Wang 2002), where the decision is directly driven by a measurable condition variable which reflects the deterioration level of the system, have proved their efficiency compared to classical time- or age-based policies (Gertsbakh 2000), in terms of economical benefits and also in terms of system safety performance. Most maintenance decision frameworks do not take into account potential variations in the deterioration mode. The classical approaches in maintenance optimization do not make explicit the relationship between the system performance and the associated operating environment.

However, in most practical applications items or systems operate in heterogeneous environments and loads. The stressful environment and other dynamically changing environmental factors may influence the system failure rate or at least the degradation modes. Different deterioration models taking into account the stressful environment are provided in the reliability field (Singpurwalla 1995; Lehmann 2006), but none of them is developed in the maintenance area.

One way of capturing the effect of a random environment on an item's lifelength is to make its failure rate function (be it predictive or model) a stochastic process. Among the most important approaches are the well-known proportional hazard rate (Cox 1972) and the cumulative hazard rate (Singpurwalla 2006; Singpurwalla 1995), which consist in modeling the effect of the environment by introducing covariates in the hazard function. However, the estimation of the different parameters is quite complex for systems, even if efficient methodologies exist at the component level; see Accelerated Life Testing (Bagdonavicius and Nikulin 2000; Lehmann 2006). Other reliability works on non-parametric methods (Singpurwalla 2006) allow specifying the likelihood by taking into account all the information available. Nevertheless, the direct application of these models to maintenance optimization leads to very complex and inextricable mathematical equations.

In this context, the objective of this paper is to develop a maintenance decision framework for a continuously deteriorating system which is influenced by environmental conditions. We consider a system subject to a continuous random deterioration such as corrosion, erosion, cracks . . . (van Noortwijk 2007). The deterioration process is influenced by environmental conditions such as, e.g., humidity rate, temperature or vibration levels. First, a specific deterioration model based on the nominal deterioration characteristics (without any stress), the random evolution of the environment and the impact of the stress on the system is developed. Then, different maintenance strategies
which integrate not only the deterioration level but also the stress information are presented. Finally, numerical results based on generic data are used to compare these different approaches with a classical approach, highlighting the economic and safety benefits of our approach.

2 DESCRIPTION OF THE FAILURE PROCESS

We consider a single-unit system subject to one failure mechanism evolving in a stressful environment. The failure process results from an excessive deterioration level. This section is devoted to describing the system degradation, the evolution of the stress and the relationship between these two processes.

2.1 Stochastic deterioration model

The condition of the system at time t can be summarized by a scalar aging variable Xt (Deloux et al., 2008; Grall et al., 2006; Grall et al., 2002) which increases as the system deteriorates. Xt can be, e.g., the measure of a physical parameter linked to the resistance of a structure (length of a crack, . . . ). The initial state corresponds to a perfect working state, i.e. X0 = 0. The system fails when the aging variable exceeds a predetermined threshold L. The threshold L can be seen as a deterioration level which must not be exceeded for economic or safety reasons. The deterioration process after a time t is independent of the deterioration before this time. In this paper, it is assumed that (Xt)(t≥0) is a gamma process, i.e. the increment of (Xt)(t≥0) on a unit of length δt, Xδt, follows a gamma probability density function with shape parameter αδt and scale parameter β:

f(αδt,β)(x) = (β^(αδt) / Γ(αδt)) x^(αδt−1) e^(−βx) I(x≥0)   (1)

Thus, the non-decreasing mean degradation can be determined as:

E(Xt) = m(t) = αt/β   (2)

and the variance is proportional to the mean degradation:

Var(Xt) = m(t)/β = αt/β²   (3)

Note that the gamma process is a positive process with independent increments, hence it is sensible to use this process to describe the deterioration (van Noortwijk 2007). Another interest of the gamma process is the existence of an explicit probability distribution function which permits feasible mathematical developments. The gamma process is parameterized by two parameters α and β which can be estimated from the deterioration data. Gamma processes have recently received a tremendous amount of attention in the reliability and maintainability literature as a means to model the degradation of many civil engineering structures under uncertainty (van Noortwijk 2007).

2.2 Stress process

It is assumed that the system is subject to an environmental stress (e.g. temperature, vibrations, . . . ). It is considered that the environmental condition at time t can be summarized by a single binary covariate (Yt)(t≥0). It is assumed that Yt is an indicator of the environment evolution, i.e. it does not model the environment but only indicates whether the system is stressed or not (Yt = 1 if the system is stressed and 0 otherwise). The time intervals between successive state changes are exponentially distributed with parameter λ0 for the transition from the non-stressed to the stressed state (respectively λ1 for the transition from the stressed to the non-stressed state). At t = 0 the system is in the stressed state with probability r̄ = λ0/(λ0 + λ1).

2.3 Impact of the stress process on the system deterioration

Figure 1. Evolution of the deterioration process impacted by the stress process. [The figure shows a sample path of the deterioration X(t) together with the binary stress indicator Y(t), which alternates between the non-stressed state 0 and the stressed state 1.]

The system evolves in a stressful environment, and it is considered that the deterioration behavior can be
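The moment relations (2)-(3) can be checked by simulation. The sketch below is ours, not the paper's; it uses NumPy and the generic parameter values α₀ = 0.1, β = 7 given later in the numerical section, accumulating independent gamma increments into sample paths:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.1, 7.0           # generic deterioration parameters (Section 4)
dt, horizon = 1.0, 1000.0
n_steps = int(horizon / dt)
n_paths = 2000

# independent gamma increments with shape alpha*dt and rate beta, cf. Eq. (1)
increments = rng.gamma(alpha * dt, 1.0 / beta, size=(n_paths, n_steps))
paths = np.cumsum(increments, axis=1)   # X_t: non-decreasing sample paths

print(paths[:, -1].mean())   # close to alpha*horizon/beta  ~ 14.29, cf. Eq. (2)
print(paths[:, -1].var())    # close to alpha*horizon/beta**2 ~ 2.04, cf. Eq. (3)
```

The empirical mean and variance of X_t at the horizon match αt/β and αt/β², and every path is non-decreasing by construction, which is the property that makes the gamma process a natural wear model.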
impacted by this environment. The covariate influence can be reduced to a log-linear regression model: if Yt = y, Xy(δt) ∼ Γ(α0 e^(γy) δt, β) (Bagdonavicius and Nikulin 2000; Lehmann 2006), where γ measures the influence of the covariate on the degradation process. Thus, it is assumed that the system is subject to an increase in the deterioration speed while it is under stress (i.e. while Yt = 1), and the system deteriorates according to its nominal mode while it is not stressed (while Yt = 0). The parameters of the degradation process when the system is non-stressed are α0 δt (α0 = α) and β, and when the system is under stress α1 δt and β, with α1 = α0 e^γ. On average the shape parameter ᾱ is α0(1 + r̄(e^γ − 1)). α0 and β can be obtained by maximum likelihood estimation. γ can be assimilated to a classical acceleration factor and can be obtained with accelerated life testing methods.

Figure 1 sketches the evolution of the stress process and its influence on the system deterioration.

3 DEFINITION AND EVALUATION OF THE MAINTENANCE POLICY

This section presents the maintenance decision framework. First, the structure of the maintenance policy is presented to define when an inspection or a replacement should be implemented. The mathematical expressions of the associated long-run maintenance cost per unit of time are then developed to optimize the maintenance decision with regard to the system state behavior.

3.1 Structure of the maintenance policy

The cumulative deterioration level Xt can be observed only through costly inspections. Let cix be the unitary inspection cost. Even if non-periodic inspection strategies are optimal (Castanier et al., 2003), a periodic strategy is first proposed. The benefit of such a choice is a reduced number of decision parameters, only the inspection period τ, and an easier implementation of the approach in an industrial context. The inspection is assumed to be perfect in the sense that it reveals the exact deterioration level Xt.

A replacement can take place to renew the system when it has failed (corrective replacement) or to prevent the failure (preventive replacement). A corrective replacement is performed when the system is observed in the failed state during an inspection of Xt. We assume that the unitary cost of a corrective replacement cc is composed of all the direct and indirect costs incurred by this maintenance action; only the unavailability cost cu per unit of time the system is failed has to be added to cc. The decision rule for a preventive replacement is the classical control limit rule: if ξ is the preventive replacement threshold, a preventive replacement is performed during the inspection of Xt if the deterioration level belongs to the interval (ξ, L). Let cp be the preventive replacement cost (cp < cc). This maintenance policy is denoted Policy 0 hereafter.

3.2 Cost-based criterion for maintenance performance evaluation

The maintenance decision parameters which should be optimized in order to minimize the long-run maintenance cost are:

• the inspection period τ, which balances the cumulative inspection cost against earlier detection and prevention of a failure;
• the preventive maintenance threshold ξ, which reduces cost by the prevention of a failure.

An illustration of the maintenance decision is presented in Figure 2.

The degradation of the unmaintained system is described by the stochastic process (Xt)(t≥0). Let in the sequel the process (X̃t)(t≥0) describe the evolution of the maintained system state. It can be analyzed through its regenerative characteristics: after a complete replacement of the system (all the system components are simultaneously replaced), it is in the ''as good as new'' initial state and its future evolution does not depend any more on the past. These complete system replacement times are regeneration points for the process describing the evolution of the global maintained system state. Thanks to the renewal property, we can limit the study of the process to a renewal cycle, which significantly reduces the complexity of the analysis. The renewal-reward theorem (Asmussen 1987) implies that the cost function equals the expected cost per cycle divided by the expected length of a cycle:

C∞(τ, ξ) = lim(t→∞) C(t)/t = E(C(S))/E(S)   (4)

Figure 2. Evolution of the deterioration process and the stress process when the system is maintained. [The plot of X(t) shows the failure level L, reached through an excessive deterioration, the corrective replacement area above L and the preventive replacement area (ξ, L).]
where E(W) denotes the expected value of a random variable W and S is the length of a regenerative cycle. The cost C(S) is composed of the inspection, replacement and unavailability costs and can be written:

C∞(τ, ξ) = [cix E(S)/τ + cp + (cc − cp) P(XS > L)] / E(S) + cu E(Du(S)) / E(S)   (5)

where Du(t) represents the unavailability time before t. The computation of the long-run maintenance cost per unit of time is presented in Appendix A; it requires the evaluation of the reliability function of the maintained deterioration process in the different modes: stressed or not stressed.

3.3 Extension of the maintenance policy with the introduction of the knowledge on the stress

Previously, only the information given by the deterioration level has been used: Policy 0 is based on the stationary deterioration mode. It can however be useful to adapt the decisions for inspection and replacement to the observed time elapsed in the different operating conditions. Policy 0 is well-adapted when only an ''a priori'' knowledge of the stress process is available. The knowledge of the exact time actually elapsed in the different operating conditions should allow a better estimation of the present state. We develop in the next paragraphs two strategies (hereafter denoted Policy 1 and Policy 2) which offer the opportunity to adapt the decisions (inspection time and replacement threshold) as a function of the effective time elapsed in an operating condition (i.e. the time elapsed in the stress condition).

The decision framework for Policy 1 is defined as follows:

• Let r(t) be the proportion of time elapsed in the stressed condition since the last replacement, and let (τr(t), ξr(t)) be the optimized decision parameters for Policy 0 when r̄ = r(t). For example, curves similar to the one presented in Figure 3 can be obtained for all values of r(t).
• An inspection is performed at time t:
  – if X(t) ≥ ξr(t), a replacement is performed and the system is ''as good as new''. The next inspection is planned at time t + τ0 if the system is non-stressed or at t + τ1 if the system is stressed;
  – if X(t) < ξr(t), the system is left as it is and the next inspection is planned at time t + τr(t).
• The environment state is continuously monitored: r(t) is continuously evaluated and allows modifications of the decision parameters, the next inspection decision (τr(t), ξr(t)) being determined as a function of the new r(t). If the new inspection period leads to an inspection time lower than the present time, the system is inspected immediately with the decision (t, ξr(t)).

Figure 3. Evolution of the inspection period depending on the time elapsed in the stress state r̄. [Plot of the optimized inspection period (vertical axis, about 10-35) versus r̄ ∈ (0, 1).]

Even if Policy 1 minimizes the maintenance cost criterion, it is not easy to implement in industry; thus we propose another policy, Policy 2, which is the discrete counterpart of Policy 1. In order to reduce the number of updates of the decision parameters, we propose to introduce thresholds (l1, . . . , ln) on r(t). The decision framework for Policy 2 is defined as follows:

• if r(t) = 0, the associated decision parameters are (τ0, ξ0);
• if r(t) = 1, the associated decision parameters are (τ1, ξ1);
• if r(t) ∈ (lk, lk+1), the associated decision parameters are (τ(lk+1), ξ(lk+1)) if t < τ(lk+1) and (t, ξ(lk+1)) otherwise.

Figure 4 illustrates the updating of the decision parameters depending on r(t) and the number of thresholds. In this example, only one threshold on r(t) is considered: l1. At t = 0, the system is assumed to be in a new state and the system is stressed (Y(t) = 1), thus the decision parameters are (τ1, ξ1) (see T1 in Figure 4). These decision parameters remain unchanged while r(t) > l1. As soon as r(t) ≤ l1 (see T2 in Figure 4), the decision parameters are updated to (τ(l1), ξ(l1)). At T3, r(t) returns above l1 and the new decision parameters (τ1, ξ1) lead to an inspection time already passed; the deterioration level is therefore inspected immediately (cf. the two arrows from the curve r(t)).
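The decision-parameter lookup that drives these updates amounts to binning r(t) by the thresholds l1 < . . . < ln. A minimal sketch, with hypothetical threshold values and pre-optimized (τ, ξ) pairs (not taken from the paper):

```python
import numpy as np

# Hypothetical thresholds l_1 < ... < l_n on r(t) and one pre-optimized
# (tau, xi) pair per bin; more time under stress -> inspect sooner.
thresholds = [0.25, 0.5, 0.75]
params = [(32.0, 1.7), (26.0, 1.6), (20.0, 1.5), (15.0, 1.4)]

def policy2_params(r):
    """Return the (tau, xi) pair of the bin containing the monitored r."""
    return params[int(np.searchsorted(thresholds, r, side="right"))]

print(policy2_params(0.1))   # -> (32.0, 1.7): mostly non-stressed history
print(policy2_params(0.9))   # -> (15.0, 1.4): mostly stressed history
```

With n thresholds the parameters change at most when r(t) crosses a bin boundary, which is exactly the reduction in the number of updates that motivates Policy 2.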
Figure 4. Evolution of the inspection decisions depending on the state of the stress with Policy 2. [The figure shows the proportion of time elapsed in the stress state r(t) crossing the threshold l1, together with the stress indicator Y(t), at the instants T1, T2 and T3.]

The objective of Policy 2 is to optimize the number of thresholds in order to balance the number of decision parameter changes against the optimal cost given by Policy 1. In this paper, the number of thresholds is not optimized; we only present the performance of this policy as a function of the number of thresholds.

The following iterative algorithm is used for optimizing Policy 2:

• Step 0: initialization of the maintenance policy:
  – choice of the number of thresholds: n;
  – determination of the n policies (τn, ξn) optimized with Policy 0.

For i = 1:Nh (Nh = 1000)

• Step 1: estimation of the maintenance cost in one cycle:
  – Step 1.1: initialization of the stress state Y0 and of the deterioration level X0 = 0;
  – Step 1.2: simulation of all the dates of stress state change with the exponential law;
  – Step 1.3: determination of all the inspection periods as a function of the time elapsed in the stress state and of the n policies optimized according to the number of thresholds;
  – Step 1.4: simulation of the successive deterioration increments between two inspections, as a function of the dates of stress state change and the stress levels, until a replacement:
    ∗ if the cumulative deterioration is lower than the corresponding preventive threshold, the system is left as it is until the next inspection; this inspection incurs a cost cix;
    ∗ if the cumulative deterioration is greater than the failure threshold L, a corrective replacement is implemented and the failure time is estimated in order to know the time elapsed in the failed state Du. In this case the cost is cix + cc + cu·Du;
    ∗ if the cumulative deterioration is lower than the failure threshold but greater than the corresponding preventive threshold, a preventive replacement is implemented with the following cost: cix + cp.

End

• Step 2: estimation of the cost criterion: the long-run maintenance cost corresponds to the mean of the Nh costs.

4 NUMERICAL RESULTS

This section is devoted to comparing the economic performance of the three proposed policies. The numerical results are obtained with the R software (www.r-project.org): specific programs have been developed to numerically evaluate each expectation in Equation 5 in the case of Policy 0, and the classical numerical gradient procedure provided by R is used to determine the optimal cost and thus the optimized maintenance decision parameters. A Monte Carlo approach is used in the case of the non-periodic inspection strategies described in Section 3.3.

4.1 Economic performance of Policy 1 compared to Policy 0

The curves in Figure 5 are the respective representations of the optimized cost criterion for Policies 0 and 1. They are obtained when the mean proportion of time elapsed in the stressed state, r̄, varies from 0 to 1.

Figure 5. Variation of the optimized cost for Policy 0 (dashed curve) and of the estimation of the optimized cost for Policy 1 when r̄(λ0, λ1) varies from 0 to 1. [Plot of the cost (vertical axis, about 0.55-0.65) versus r̄ ∈ (0, 1).]
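Before turning to the numerical comparison, the Monte Carlo cost estimation of Steps 0-2 can be sketched in simplified form. The code below is our illustration, not the authors' R programs: it simulates renewal cycles of a single periodic policy under the stress-modulated gamma deterioration of Section 2 (shape rate α0·e^γ while stressed, α0 otherwise) and forms the renewal-reward ratio of total cost over total length. The switching rates λ0, λ1, the decision parameters (τ, ξ) and the discretization grid dt are assumptions; the remaining values follow the paper's generic data.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha0, beta, gamma_ = 0.1, 7.0, 0.3      # generic deterioration data (Section 4)
lam0, lam1 = 0.05, 0.075                  # stress switching rates (assumed)
L, tau, xi = 2.0, 10.0, 1.5               # failure level, period, threshold (assumed)
cix, cc, cp, cu = 5.0, 100.0, 30.0, 25.0  # generic cost data (Section 4)
dt = 0.1
n_sub = int(round(tau / dt))              # fine steps per inspection period

def one_cycle_cost():
    """Steps 1.1-1.4 for one renewal cycle; returns (cost, cycle length)."""
    x, t, cost, t_fail = 0.0, 0.0, 0.0, None
    y = 1 if rng.random() < lam0 / (lam0 + lam1) else 0   # Step 1.1: initial stress state
    next_switch = rng.exponential(1.0 / (lam1 if y else lam0))
    while True:
        for _ in range(n_sub):                            # Steps 1.2-1.4
            if t >= next_switch:                          # stress state change
                y = 1 - y
                next_switch = t + rng.exponential(1.0 / (lam1 if y else lam0))
            shape = alpha0 * (np.exp(gamma_) if y else 1.0)
            x += rng.gamma(shape * dt, 1.0 / beta)        # gamma increment
            t += dt
            if t_fail is None and x >= L:
                t_fail = t                                # hidden failure time
        cost += cix                                       # periodic inspection
        if x >= L:                                        # corrective replacement
            return cost + cc + cu * (t - t_fail), t
        if x >= xi:                                       # preventive replacement
            return cost + cp, t

Nh = 100
cycles = [one_cycle_cost() for _ in range(Nh)]
costs, lengths = map(np.array, zip(*cycles))
print(costs.sum() / lengths.sum())        # Step 2: long-run cost per unit time
```

Repeating the evaluation over a grid of (τ, ξ), or per bin of r(t) for Policy 2, and keeping the minimum reproduces the optimization loop described above.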
These curves are obtained for arbitrarily fixed maintenance data and operation costs: α0 = 0.1, β = 7, γ = 0.3, L = 2, cix = 5, cc = 100, cp = 30, cu = 25.

Figure 5 illustrates the benefit of the non-periodic inspection strategy: the non-periodic inspection strategy (full line) is always the policy which minimizes the long-run maintenance cost per unit of time. When, on average, the proportion of time elapsed in the stressed state, r̄, tends to 0 (never stressed) or to 1 (always stressed), Policy 1 tends to Policy 0. The economic benefit of Policy 1 compared to Policy 0 varies from 0% (when r̄ = 0 and r̄ = 1) to 5%.

Finally, the non-periodic scheme takes advantage here of an inspection interval adapted to the real proportion of time elapsed in the stress state, which allows significantly better performance than the periodic scheme.

4.2 Economic performance of Policy 2

The curves in Figure 6 are the respective representations of the optimized cost criterion for Policies 0, 1 and 2. They are obtained when the number of thresholds in the case of Policy 2 varies from 1 to 10. The curve for Policy 2 is the smoothed estimator of the respective costs obtained by simulation. These curves are obtained for arbitrarily fixed maintenance data and operation costs with the following values: α0 = 0.1, β = 7, γ = 0.3, L = 2, cix = 5, cc = 100, cp = 30, cu = 25, r̄ = 0.4. Figure 6 illustrates the benefit of the non-periodic strategy compared to the periodic strategy. The economic performance of Policy 2 increases with the number of thresholds and varies between 0.012% and 2.5% compared to Policy 1. When the number of thresholds increases, Policy 2 tends to Policy 1.

Figure 6. Optimized cost variation when the number of thresholds varies. [Plot of the cost (about 0.615-0.625) for Policies 0, 1 and 2 versus the number of thresholds.]

Policy 2 would show even better results than Policy 1 if a cost were introduced for each change of decision parameters. Moreover, Policy 2 is easier to implement than Policy 1 in an industrial context.

5 CONCLUSION

In this paper, different maintenance decision frameworks for a continuously deteriorating system which evolves in a stressful environment have been proposed. The relationship between the system performance and the associated operating environment has been modeled respectively as an acceleration factor for degradation and as a binary variable. A cost criterion has been numerically evaluated to highlight the performance of the different maintenance strategies and the benefits of updating the decision according to the history of the system.

Even if the last proposed structure for the maintenance decision framework has shown interesting performance, a lot of research remains to be done. First, the mathematical model of the cost criterion in this case has to be evaluated. A sensitivity analysis when the maintenance data vary should be performed. Moreover, during an inspection, the maintenance decisions for the next inspection are based only on the time elapsed by the system in the stressed state, but it could be interesting to use the information given by the degradation level as well, and so propose a mix of condition- and stress-based frameworks for inspections.

APPENDIX A

Let f(x, t) be the probability density function of the deterioration increment at time t of a system subject to aging and stress:

f(x, t) = (β^(ᾱt) / Γ(ᾱt)) x^(ᾱt−1) e^(−βx)   (6)

with ᾱ = α0(1 + r̄(e^γ − 1)), as in Section 2.3. The different expectations in Equation 5 can be evaluated as functions of the reliability of the maintained system at steady state, Rm(t). In order to determine Rm(t) for all t ∈ (0, S), two cases are identified:

• Case 1: no inspection of the system has been made before t (t < τ). The probability for the system to be in a good state is only a function of the observed deterioration level x ∈ (0, L) at time t. Hence for t < τ, we have:

Rm(t) = P(X(t) < L) = ∫₀^L f(x, t) dx   (7)
• Case 2: at least one inspection has been performed before t (t ≥ τ). The probability for the system never to have been replaced is a function of the observed deterioration level x ∈ (0, ξ) at the last inspection, at time [t/τ]·τ (where [.] denotes the integer part function), and of the deterioration level reached since the last inspection, y ∈ (x, L). Hence for t ≥ τ, we have:

Rm(t) = P(X(t) < L | X([t/τ]·τ) < ξ)
      = ∫₀^ξ ∫₀^(L−x) f(x, [t/τ]·τ) f(y, t) dy dx   (8)

E(S), the expected length of a regenerative cycle, can be expressed through the two following scenarios:

• the cycle ends with a corrective replacement;
• the cycle ends with a preventive replacement;

and is given by:

E(S) = ∫₀^∞ xτ (Rm((x − 1)τ) − Rm(xτ)) dx   [Scenario 1]
     + ∫₀^∞ xτ (Rm((x − 1)τ) − P(X(xτ) < ξ | X((x − 1)τ) < ξ)) dx   [Scenario 2]   (9)

with

P(X(xτ) < ξ | X((x − 1)τ) < ξ) = ∫₀^ξ ∫₀^(ξ−y) f(y, (x − 1)τ) f(z, xτ) dz dy   (10)

P(XS > L), the probability of a corrective replacement on a cycle S, is given by:

P(XS > L) = ∫₀^∞ xτ (Rm((x − 1)τ) − Rm(xτ)) dx   (11)

E(Du(S)), the expected value of the cumulative unavailability duration before S, is given by:

E(Du(S)) = ∫₀^∞ xτ (Rm((x − 1)τ) − Rm(xτ)) dx   (12)

REFERENCES

Asmussen, S. (1987). Applied Probability and Queues. Wiley Series in Probability and Mathematical Statistics. Wiley.
Bagdonavicius, V. and M. Nikulin (2000). Estimation in Degradation Models with Explanatory Variables. Lifetime Data Analysis 7, 85–103.
Castanier, B., C. Bérenguer, and A. Grall (2003). A sequential condition-based repair/replacement policy with non-periodic inspections for a system subject to continuous wear. Applied Stochastic Models in Business and Industry 19(4), 327–347.
Cox, D. (1972). Regression models and life tables. Journal of the Royal Statistical Society, Series B 34, 187–202.
Deloux, E., B. Castanier, and C. Bérenguer (2008). Maintenance policy for a non-stationary deteriorating system. In Annual Reliability and Maintainability Symposium Proceedings 2008 (RAMS 2008), Las Vegas, USA, January 28–31.
Gertsbakh, I. (2000). Reliability Theory With Applications to Preventive Maintenance. Springer.
Grall, A., L. Dieulle, C. Bérenguer, and M. Roussignol (2002). Continuous-Time Predictive-Maintenance Scheduling for a Deteriorating System. IEEE Transactions on Reliability 51(2), 141–150.
Grall, A., L. Dieulle, C. Bérenguer, and M. Roussignol (2006). Asymptotic failure rate of a continuously monitored system. Reliability Engineering and System Safety 91(2), 126–130.
Lehmann, A. (2006). Joint modeling of degradation and failure time data. In Degradation, Damage, Fatigue and Accelerated Life Models in Reliability Testing, ALT2006, Angers, France, pp. 26–32.
Rausand, M. and A. Hoyland (2004). System Reliability Theory: Models, Statistical Methods, and Applications (Second ed.). Wiley.
Singpurwalla, N. (1995). Survival in Dynamic Environments. Statistical Science 10(1), 86–103.
Singpurwalla, N. (2006). Reliability and Risk: A Bayesian Perspective. Wiley.
van Noortwijk, J. (2007). A survey of the application of gamma processes in maintenance. Reliability Engineering and System Safety doi:10.1016/j.ress.2007.03.019.
Wang, H. (2002). A Survey of Maintenance Policies of Deteriorating Systems. European Journal of Operational Research 139, 469–489.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Condition-based maintenance by particle filtering

F. Cadini, E. Zio & D. Avram


Politecnico di Milano—Dipartimento di energia, Italy

ABSTRACT: This paper presents the application of the particle filtering method, a model-based Monte Carlo
method, for estimating the failure probability of a component subject to degradation based on a set of observations.
The estimation is embedded within a scheme of condition-based maintenance of a component subject to fatigue
crack growth.

1 INTRODUCTION

When the health state of a component can be monitored, condition-based policies can be devised to maintain it dynamically on the basis of the observed conditions (Marseguerra et al., 2002 and Christer et al., 1997). However, often in practice the degradation state of a component cannot be observed directly, but rather must be inferred from some observations, usually affected by noise and disturbances.

The soundest approaches to the estimation of the state of a dynamic system or component build a posterior distribution of the unknown degradation states by combining a prior distribution with the likelihood of the observations actually collected (Doucet, 1998 and Doucet et al., 2001). In this Bayesian setting, the estimation method most frequently used in practice is the Kalman filter, which is optimal for linear state space models and independent, additive Gaussian noises (Anderson & Moore, 1979). In this case, the posterior distributions are also Gaussian and can be computed exactly, without approximations.

In practice, however, the degradation process of many systems and components is non-linear and the associated noises are non-Gaussian. For these cases, approximate methods, e.g. analytical approximations of the extended Kalman (EKF) and Gaussian-sum filters and numerical approximations of the grid-based filters, can be used (Anderson & Moore, 1979).

Alternatively, one may resort to particle filtering methods, which approximate the continuous distributions of interest by a discrete set of weighted ''particles'' representing random trajectories of system evolution in the state space, and whose weights are estimates of the probabilities of the trajectories (Kitagawa 1987 and Djuric et al., 2003 and Doucet et al., 2000).

In this paper, the particle filtering method for state estimation is embedded within a condition-based maintenance scheme of a component subject to fatigue crack growth (Rocha & Schueller 1996 and Myotyri et al., 2006 and Pulkkinen 1991). The crack propagation is modelled as a non-linear Markov process with non-additive Gaussian noise (Myotyri et al., 2006); the condition-based maintenance strategy considered amounts to replacing the component at an optimal time determined on the basis of a cost model from the literature (Christer et al., 1997); the component remaining lifetime, upon which the optimal replacement decision is based, is predicted by particle filtering.

The paper is organized as follows. In Section 2, the Bayesian formulation of model-based state estimation and the particle filtering method are recalled. Section 3 presents the application of the method to the prediction of the optimal time of replacement of a component subject to fatigue. Finally, some conclusions are drawn in Section 4.

2 MODEL-BASED STATE ESTIMATION AND PARTICLE FILTERING

Let us consider a system whose state at the discrete time step tk = kΔt is represented by the vector xk. The model (transition or state equation) describing the evolution of the system state is an unobserved (hidden) Markov process of order one:

xk = f(xk−1, ωk−1)   (1)

where f: R^(nx) × R^(nω) → R^(nx) is a possibly non-linear function and {ωk, k ∈ N} is an independent identically distributed (i.i.d.) state noise vector sequence of known distribution.

The transition probability distribution p(xk | xk−1) is defined by the system equation (1) and the known distribution of the noise vector ωk. The initial distribution of the system state p(x0) is assumed known.
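Equation (1) can be made concrete with a toy scalar example. The transition function f and the noise law below are illustrative choices of ours, not the paper's crack-growth model (which is introduced in Section 3):

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x, w):
    """Illustrative non-linear transition: slow growth plus state noise."""
    return x + 0.05 * np.sqrt(np.abs(x)) + w

x = 0.1                             # initial state, here assumed known
trajectory = [x]
for k in range(200):
    w = rng.normal(0.0, 0.01)       # state noise omega_{k-1}
    x = f(x, w)                     # Eq. (1): x_k = f(x_{k-1}, omega_{k-1})
    trajectory.append(x)

print(trajectory[-1])               # one random trajectory of the hidden state
```

Each run of the loop produces one random trajectory of the hidden state; the particle filtering method described next works with a whole population of such trajectories.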

A sequence of measurements {zk, k ∈ N} is assumed to be collected at the successive time steps tk. The sequence of measurement values is described by the measurement (observation) equation:

zk = hk(xk, υk)   (2)

where hk: R^(nx) × R^(nυ) → R^(nz) is a possibly non-linear function and {υk, k ∈ N} is an i.i.d. measurement noise vector sequence of known distribution. The measurements {zk, k ∈ N} are thus assumed to be conditionally independent given the state process {xk, k ∈ N}.

Within a Bayesian framework, the filtered posterior distribution p(xk | z0:k) can be recursively computed in two stages: prediction and update. Given the probability distribution p(xk−1 | z0:k−1) at time k − 1, the prediction stage involves using the system model (1) to obtain the prior probability distribution of the system state xk at time k via the Chapman-Kolmogorov equation:

p(xk | z0:k−1) = ∫ p(xk | xk−1, z0:k−1) p(xk−1 | z0:k−1) dxk−1
             = ∫ p(xk | xk−1) p(xk−1 | z0:k−1) dxk−1   (3)

where the Markovian assumption underpinning the system model (1) has been used.

At time k, a new measurement zk is collected and used to update the prior distribution via Bayes' rule, so as to obtain the required posterior distribution of the current state xk (Arulampalam 2002):

p(xk | z0:k) = p(xk | z0:k−1) p(zk | xk) / p(zk | z0:k−1)   (4)

where the normalizing constant is

p(zk | z0:k−1) = ∫ p(xk | z0:k−1) p(zk | xk) dxk   (5)

The recurrence relations (3) and (4) form the basis for the exact Bayesian solution. Unfortunately, except for a few cases, including linear Gaussian state space models (Kalman filter) and hidden finite-state space Markov chains (Wonham filter), it is not possible to evaluate these distributions analytically, since they require the evaluation of complex high-dimensional integrals. This problem can be circumvented by resorting to Monte Carlo sampling methods (Doucet et al., 2001 and Doucet et al., 2000 and Pulkkinen 1991 and Seong 2002).

Writing the posterior probability p(x0:k | z0:k) of the entire state sequence x0:k given the measurement vector z0:k as:

p(x0:k | z0:k) = ∫ p(ξ0:k | z0:k) δ(ξ0:k − x0:k) dξ0:k   (6)

and assuming that the true posterior probability p(x0:k | z0:k) is known and can be sampled, an estimate of (6) is given by (Kalos and Whitlock 1986):

p̂(x0:k | z0:k) = (1/Ns) Σ(i=1..Ns) δ(x0:k − x^i_0:k)   (7)

where x^i_0:k, i = 1, 2, . . . , Ns, is a set of independent random samples drawn from p(x0:k | z0:k).

Since, in practice, it is usually not possible to sample efficiently from the true posterior distribution p(x0:k | z0:k), importance sampling is used, i.e. the state sequences x^i_0:k are drawn from an arbitrarily chosen distribution π(x0:k | z0:k), called the importance function (Kalos and Whitlock 1986). The probability p(x0:k | z0:k) is written as:

p(x0:k | z0:k) = ∫ [p(ξ0:k | z0:k) / π(ξ0:k | z0:k)] π(ξ0:k | z0:k) δ(ξ0:k − x0:k) dξ0:k   (8)

and an unbiased estimate is obtained by (Doucet et al., 2001 and Arulampalam 2002):

p̂*(x0:k | z0:k) = (1/Ns) Σ(i=1..Ns) w*^i_k δ(x0:k − x^i_0:k)   (9)

where:

w*^i_k = p(z0:k | x^i_0:k) p(x^i_0:k) / [p(z0:k) π(x^i_0:k | z0:k)]   (10)

is the importance weight associated with the state sequence x^i_0:k, i = 1, 2, . . . , Ns, sampled from π(x0:k | z0:k), and p(z0:k | x^i_0:k) is the likelihood of the observation sequence.

Computing the weight requires knowledge of the normalizing constant p(z0:k) = ∫ p(z0:k | x0:k) p(x0:k) dx0:k, which cannot typically be expressed in closed form. It can be shown that, to overcome this problem, an estimate of the posterior probability distribution p(x0:k | z0:k) may be computed as (Doucet et al., 2001 and Arulampalam 2002):

p̂(x0:k | z0:k) = Σ(i=1..Ns) w̃^i_k δ(x0:k − x^i_0:k)   (11)
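The self-normalized importance-sampling construction of Eqs. (8)-(11) can be illustrated on a static toy problem. Both the target (known only up to a constant, mirroring the unknown p(z0:k)) and the proposal below are illustrative choices, not the paper's distributions:

```python
import numpy as np

rng = np.random.default_rng(6)
Ns = 100_000

def p_unnorm(x):
    """Unnormalized target density: N(1, 0.5^2) up to a constant."""
    return np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2)

def pi_pdf(x):
    """Proposal density: N(0, 2^2), easy to sample from."""
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))

x = rng.normal(0.0, 2.0, Ns)      # draws from the proposal pi
w = p_unnorm(x) / pi_pdf(x)       # non-normalized weights, cf. Eq. (10)
w_tilde = w / w.sum()             # self-normalized weights, cf. Eq. (11)

print((w_tilde * x).sum())        # estimate of the target mean (true value 1)
```

Normalizing the weights by their sum cancels the unknown constant of the target, which is exactly why the estimate (11) is computable while (9)-(10) are not.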
where the ‘‘Bayesian’’ importance weights w̃ki are i.e. π(xk | x0:k−1
i
| z0:k ) = p(xk | xk−1
i
) and the non-
given by: normalized weights (16) become (Tanizaki 1997 and
Tanizaki & Mariano 1998):
wki  
w̃ki = (12) wki = wk−1
i
p zk | xki (16)

Ns
j
wk
j=1 Actually, in many engineering applications, the
measurements are not available at each time step, but
p(z0:k | x0:k
i
)p(x0:k
i
)
wki = = wk∗i p(z0:k ) (13) rather at regularly scheduled or even opportunistically
π(x0:ki
| z0:k ) staggered time instants k1 , k2 , . . .. Denoting the obser-
vation sequence up to the current time k as the set
For on-line applications, the estimate of the distri- {z}k = {zj : j = k1 , . . ., kf ≤ k}, the Monte Carlo
bution p(x0:k | z0:k ) at the k-th time step can be obtained estimation formalism described above remains valid,
from the distribution p(x0:k−1 | z0:k−1 ) at the previous provided that the weights updating formulas (16) and
time step by the following recursive formula obtained
by extension of equation (4) for the Bayesian fil- (17) are applied only at the observation times k1 , . . . ,
ter p(xk | z0:k ) (Doucet et al., 2001 and Arulampalam kf , whereas the weights wki remain constant at all other
2002): time instants k  = k1 , . . ., kf .

p (x0:k | z0:k−1) p (zk | x0:k | z0:k−1)


p (x0:k | z0:k) =
p (zk | z0:k−1) 3 CONDITION-BASED MAINTENANCE
p (xk | x0:k−1 | z0:k−1) p (x0:k−1 | z0:k−1) p (zk | x0:k | z0:k−1) AGAINST FATIGUE CRACK
= DEGRADATION
p (zk | z0:k−1)
p (xk | x0:k−1) p (x0:k−1 | z0:k−1) p (zk | x0:k) The Paris-Erdogan model is adopted for describing the
=
p (zk | z0:k−1) evolution of the crack depth x in a component subject
p (zk | x0:k) p (xk | x0:k−1) to fatigue (Rocha & Schueller 1996 and Myotyri et al.,
= p (x0:k−1 | z0:k−1) (14) 2006 and Pulkkinen 1991):
p (zk | z0:k−1)
Furthermore, if the importance function is chosen dx
such that: = C(ΔK)n (17)
dt

k where C and n are constant parameters related to the
π (x0:k | z0:k) = π (x0 | z0) π(xj | x0:j−1 | z0:j ) component material properties (Kozin & Bogdanoff
j=1 1989 and Provan 1987), which can be estimated from
= π(xk |x0:k−1 | z0:k) π (x0:k−1 | z0:k−1) (15) experimental data (Bigerelle & Iost 1989) and ΔK is
the stress intensity amplitude, roughly proportional to
the square root of x (Provan 1987):
the following recursive formulas for the non-
normalized weights wk∗i and wki can be obtained: √
ΔK = β x (18)
i
p(x0:k | z0:k) with β being a constant parameter which may be
wk∗i =
π(x0:k
i
| z0:k) determined from experimental data.
p(zk | xki )p(xki | xk−1
i ) The intrinsic randomness of the process is
i
p(x0:k−1 | z0:k−1) p(zk | zk−1) accounted for in the model by modifying equation (18)
= as follows (Provan 1987):
π(xki | x0:k−1
i
| z0:k)π(x0:k−1
i
| z0:k−1)
p(zk | xki )p(xki | xk−1
i
) dx  √ n
∗i
= wk−1
1 = eω C β x (19)
π(xk | x0:k−1 | z0:k) p(zk | zk−1)
i i dt

p(zk | xki )p(xki | xk−1


i
) where ω ∼ N (0, σω2 ) is a white Gaussian noise.
wki = wk∗i p(z0:k) = wk−1
i
(15) For Δt sufficiently small, the state-space model
π(xki | x0:k−1
i
| z0:k)
(20) can be discretized as:

The choice of the importance function is obvi- xk = xk−1 + eωk C(ΔKt)nΔt (20)
ously crucial for the efficiency of the estimation.
In this work, the prior distribution of the hid- which represents a non-linear Markov process with
den Markov model is taken as importance function, independent, non-stationary degradation increments.

479
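With the prior taken as importance function, the filter reduces to propagating each particle through the discretized crack-growth model and multiplying its weight by the measurement likelihood whenever an observation arrives. A minimal sketch under illustrative assumptions — C, n, β, σ_ω and Δt follow the case study described below, but the Gaussian likelihood and the synthetic measurement are our stand-ins for the paper's actual measurement model:

```python
import math
import random

def propagate(x, C=0.005, n=1.3, beta=1.0, sigma_w=1.7, dt=1.0):
    """One step of the discretized stochastic Paris-Erdogan model:
    x_k = x_{k-1} + exp(omega_k) * C * (beta*sqrt(x_{k-1}))**n * dt."""
    omega = random.gauss(0.0, sigma_w)
    return x + math.exp(omega) * C * (beta * math.sqrt(x)) ** n * dt

def reweight(weights, particles, z, likelihood):
    """w_k^i = w_{k-1}^i * p(z_k | x_k^i), followed by renormalization."""
    w = [wi * likelihood(z, xi) for wi, xi in zip(weights, particles)]
    s = sum(w)
    return [wi / s for wi in w]

random.seed(1)
Ns = 500
particles = [1.0] * Ns                      # initial crack depth (arbitrary units)
weights = [1.0 / Ns] * Ns
gauss_lik = lambda z, x: math.exp(-0.5 * ((z - x) / 2.0) ** 2)  # illustrative stand-in

for k in range(1, 51):
    particles = [propagate(x) for x in particles]
    if k in (25, 50):                       # weights change only at observation times
        z = particles[0]                    # synthetic measurement, for illustration
        weights = reweight(weights, particles, z, gauss_lik)

x_hat = sum(w * x for w, x in zip(weights, particles))  # posterior mean crack depth
```

Between the observation times the weights stay constant, as noted above; only the particles move through the state model.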
For the observation z_k, a logit model can be introduced (Simola & Pulkkinen 1998):

ln [z_k/(d − z_k)] = β_0 + β_1 ln [x_k/(d − x_k)] + υ_k   (23)

where d is the component material thickness, β_0 ∈ (−∞, ∞) and β_1 > 0 are parameters to be estimated from experimental data, and υ is a white Gaussian noise such that υ ∼ N(0, σ_υ²).

Introducing the following standard transformations,

y_k = ln [z_k/(d − z_k)]   (24)

μ_k = β_0 + β_1 ln [x_k/(d − x_k)]   (25)

then Y_k ∼ N(μ_k, σ_υ²) is a Gaussian random variable with conditional cumulative distribution function (cdf):

F_{Y_k}(y_k | x_k) = P(Y_k < y_k | x_k) = Φ[(y_k − μ_k)/σ_υ]   (26)

where Φ(u) is the cdf of the standard normal distribution N(0, 1).

The conditional cdf of the measurement z_k related to the degradation state x_k is then:

F_{Z_k}(z_k | x_k) = F_{Y_k}(ln [z_k/(d − z_k)] | x_k) = Φ[(ln [z_k/(d − z_k)] − μ_k)/σ_υ]   (27)

with corresponding probability density function:

f_{Z_k}(z_k | x_k) = [1/(√(2π) σ_υ)] [d/(z_k(d − z_k))] exp{−½ [(ln [z_k/(d − z_k)] − μ_k)/σ_υ]²}   (28)

The maintenance actions considered for the component are replacement upon failure and preventive replacement. For a wide variety of industrial components, preventive replacement costs are lower than failure replacement ones, since unscheduled shut-down losses must be included in the latter.

At a generic time step k of the component's life, a decision can be made on whether to replace the component or to further extend its life, albeit assuming the risk of a possible failure. This decision can be informed by the knowledge of the crack propagation stochastic process and a set of available measurements {z}_k related to it, taken at selected times prior to k. The best time to replacement l_min is the one which minimizes the expression (Christer et al., 1997):

expected cost per unit time (k, l) = expected cost over the current life cycle (k, l) / expected current life cycle (k, l)   (29)

where:

• l denotes the remaining life duration until replacement (either preventive or upon failure),
• k + l denotes the preventive replacement time instant scheduled on the basis of the set of observations {z}_k collected up to time k,
• d* denotes the critical crack threshold above which failure is assumed to occur (d* < d),
• c_p denotes the cost of preventive replacement,
• c_f denotes the cost of replacement upon failure,
• p(k + i) = P(x_{k+i} > d* | x_{0:k+i−1} < d*, {z}_k) denotes the conditional posterior probability of the crack depth first exceeding d* in the interval (k + i − 1, k + i), knowing the component had not failed up to time k + i − 1 and given the sequence of observations {z}_k available up to time k,
• P(k + l) denotes the probability of the crack depth exceeding d* in the interval (k, k + l).

Neglecting the monitoring costs:

expected cost over the current life cycle (k, l) = c_p (1 − P(k + l)) + c_f P(k + l)   (30)

where:

P(k + l) = p(k + 1) + (1 − p(k + 1)) p(k + 2) + · · ·
  = p(k + 1), for l = 1
  = p(k + 1) + Σ_{i=2}^{l} [∏_{j=1}^{i−1} (1 − p(k + j))] p(k + i), for l > 1   (31)

and

expected current life cycle (k, l) = (k + l)(1 − P(k + l)) + (k + 1) p(k + 1) + Σ_{i=2}^{l} (k + i) [∏_{j=1}^{i−1} (1 − p(k + j))] p(k + i)   (32)

where the sum is void for l = 1.
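The cost-per-unit-time criterion and its components above can be evaluated directly once the conditional probabilities p(k + i) are available (in practice supplied by the particle estimate of Section 2); scanning l for the minimum then yields l_min. A sketch with an illustrative hazard profile (the p_cond values are ours, not the case-study output):

```python
def cost_rate(k, l, p_cond, cp=1.5e4, cf=7.5e4):
    """Expected cost per unit time for preventive replacement at k + l.

    p_cond[i-1] holds p(k+i): the conditional probability of first crossing
    d* in (k+i-1, k+i], given survival up to k+i-1 and the observations {z}_k.
    """
    surv, P, tail = 1.0, 0.0, 0.0
    for i in range(1, l + 1):
        first_fail = surv * p_cond[i - 1]   # prod_{j<i}(1 - p(k+j)) * p(k+i)
        P += first_fail                     # P(k+l)
        tail += (k + i) * first_fail        # failure part of the expected cycle
        surv *= 1.0 - p_cond[i - 1]
    cost = cp * (1.0 - P) + cf * P          # expected cost over the life cycle
    cycle = (k + l) * (1.0 - P) + tail      # expected current life cycle
    return cost / cycle

k = 100
p_cond = [0.0] * 300 + [0.02] * 200         # illustrative: no failure risk before k+300
l_min = min(range(1, len(p_cond) + 1), key=lambda l: cost_rate(k, l, p_cond))
```

With this profile the minimum falls exactly at the onset of failure risk (l_min = 300): extending the horizon further trades the cheap preventive replacement against the five-times-costlier failure replacement.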
Within the Monte Carlo sampling framework of Section 2, an estimate of the unknown conditional posterior probability p(k + i) may be obtained as:

p̂(k + i) = Σ_{m: x_{k+i}^m > d*, x_{0:k+i−1}^m < d*} w_k^m / Σ_{n: x_{0:k+i−1}^n < d*} w_k^n   (33)

where the numerator sums the weights of the sampled particles which fail for the first time in the time interval (k + i − 1, k + i) and the denominator those of the particles which survive up to time (k + i − 1).

The computational scheme proposed has been applied to a case study of literature (Myotyri 2006). The parameters of the state equation (22) are C = 0.005, n = 1.3 and β = 1, whereas those in the measurement equation (23) are β_0 = 0.06 and β_1 = 1.25. The material thickness is d = 100, with a failure threshold at d* = 80 in arbitrary units. The process and measurement noise variances are σ_ω² = 2.89 and σ_υ² = 0.22, respectively. The cycle time step is Δt = 1 and the mission time is T = 1000. The observations of the crack size are assumed to be collected at time steps k_1 = 100, k_2 = 200, k_3 = 300, k_4 = 400 and k_5 = 500. The parameters of the cost model are c_p = 1.5 · 10^4 and c_f = 7.5 · 10^4, also in arbitrary units (Christer et al., 1997). The number of particles sampled by the Monte Carlo procedure is N_s = 1000.

Figure 1 shows the true crack depth evolution over the mission time, if no preventive replacement is undertaken. The component is shown to fail at k = 577 (x_577 > d* = 80).

Figure 1. True crack depth evolution.

Figure 2 shows the expected cost per unit time computed in correspondence of the observation times k_1, k_2, k_3, k_4 and k_5.

Figure 2. Particle filter-estimated cost with critical level d* = 80 at the measurement times k_1 = 100, k_2 = 200, k_3 = 300, k_4 = 400 and k_5 = 500.

Table 1. Expected costs and replacement times.

Time step (k)   Minimum E[cost per unit time]   k_min   l_min
100             31.7                            527     427
200             31.7                            527     327
300             29.5                            535     235
400             28.0                            568     168
500             28.5                            535      35

Table 1 reports the minimum values of the expected cost per unit time, the corresponding time instants and the remaining life durations before replacement.

At the first measurement time k_1 = 100, the suggested time of replacement is 527, which is then successively updated following the additionally collected measurements up to the last one at k_5 = 500, when the recommendation becomes to replace the component at k = 535 or, equivalently, after l_min = 35 time steps. In this updating process, a decrease of the recommended time of replacement is observed from 568 at k_4 = 400 to 535 at the successive measurement time k_5 = 500; this coherently reflects the true crack depth evolution (Figure 1), which witnesses a somewhat sudden increase in the period between the two measurement times k = 400 and k = 500: the Bayesian updated recommendation based on the observation at k_5 = 500 coherently takes into account this increase, resulting in a reduced predicted lifetime.

4 CONCLUSIONS

In this work, a method for developing a condition-based maintenance strategy has been propounded. The method relies on particle filtering for the dynamic
estimation of failure probabilities from noisy measurements related to the degradation state.

An example of application of the method has been illustrated with respect to the crack propagation dynamics of a component subject to fatigue cycles and which may be replaced preventively or at failure, with different costs.

The proposed method is shown to represent a valuable prognostic tool which can be used to drive effective condition-based maintenance strategies for improving the availability, safety and cost effectiveness of complex safety-critical systems, structures and components, such as those employed in the nuclear industry.

REFERENCES

Anderson, B.D. & Moore, J.B. 1979. Optimal Filtering. Englewood Cliffs, NJ: Prentice Hall.
Arulampalam, M.S., Maskell, S., Gordon, N. & Clapp, T. 2002. A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking. IEEE Trans. on Signal Processing 50(2): 174–188.
Bigerelle, M. & Iost, A. 1999. Bootstrap Analysis of FCGR, Application to the Paris Relationship and to Lifetime Prediction. International Journal of Fatigue 21: 299–307.
Christer, A.H. & Wang, W. & Sharp, J.M. 1997. A state space condition monitoring model for furnace erosion prediction and replacement. European Journal of Operational Research 101: 1–14.
Djuric, P.M. & Kotecha, J.H. & Zhang, J. & Huang, Y. & Ghirmai, T. & Bugallo, M.F. & Miguez, J. 2003. Particle Filtering. IEEE Signal Processing Magazine 19–37.
Doucet, A. 1998. On Sequential Simulation-Based Methods for Bayesian Filtering. Technical Report. University of Cambridge, Dept. of Engineering. CUED-F-ENG-TR310.
Doucet, A. & Freitas, J.F.G. de & Gordon, N.J. 2001. An Introduction to Sequential Monte Carlo Methods. In A. Doucet, J.F.G. de Freitas & N.J. Gordon (eds), Sequential Monte Carlo in Practice. New York: Springer-Verlag.
Doucet, A. & Godsill, S. & Andreu, C. 2000. On Sequential Monte Carlo Sampling Methods for Bayesian Filtering. Statistics and Computing 10: 197–208.
Kalos, M.H. & Whitlock, P.A. 1986. Monte Carlo methods. Volume I: basics. Wiley.
Kitagawa, G. 1987. Non-Gaussian State-Space Modeling of Nonstationary Time Series. Journal of the American Statistical Association 82: 1032–1063.
Kozin, F. & Bogdanoff, J.L. 1989. Probabilistic Models of Fatigue Crack Growth: Results and Speculations. Nuclear Engineering and Design 115: 143–171.
Marseguerra, M. & Zio, E. & Podofillini, L. 2002. Condition-Based Maintenance Optimization by Means of Genetic Algorithms and Monte Carlo Simulation. Reliability Engineering and System Safety 77: 151–165.
Myotyri, E. & Pulkkinen, U. & Simola, K. 2006. Application of stochastic filtering for lifetime prediction. Reliability Engineering and System Safety 91: 200–208.
Provan, J.W. (ed) 1987. Probabilistic fracture mechanics and reliability. Martinus Nijhoff Publishers.
Pulkkinen, U. 1991. A Stochastic Model for Wear Prediction through Condition Monitoring. In K. Holmberg and A. Folkeson (eds), Operational Reliability and Systematic Maintenance. London/New York: Elsevier, 223–243.
Rocha, M.M. & Schueller, G.I. 1996. A Probabilistic Criterion for Evaluating the Goodness of Fatigue Crack Growth Models. Engineering Fracture Mechanics 53: 707–731.
Seong, S.-H. & Park, H.-Y. & Kim, D.-H. & Suh, Y.-S. & Hur, S. & Koo, I.-S. & Lee, U.-C. & Jang, J.-W. & Shin, Y.-C. 2002. Development of Fast Running Simulation Methodology Using Neural Networks for Load Follow Operation. Nuclear Science and Engineering 141: 66–77.
Simola, K. & Pulkkinen, U. 1998. Models for non-destructive inspection data. Reliability Engineering and System Safety 60: 1–12.
Tanizaki, H. 1997. Nonlinear and nonnormal filters using Monte Carlo methods. Computational Statistics & Data Analysis 25: 417–439.
Tanizaki, H. & Mariano, R.S. 1998. Nonlinear and Non-Gaussian State-Space Modeling with Monte Carlo Simulations. Journal of Econometrics 83: 263–290.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Corrective maintenance for aging air conditioning systems

I. Frenkel & L. Khvatskin


Center for Reliability and Risk Management, Industrial Engineering and Management Department,
Sami Shamoon College of Engineering, Beer Sheva, Israel

Anatoly Lisnianski
The Israel Electric Corporation Ltd., Haifa, Israel

ABSTRACT: This paper considers corrective maintenance contracts for aging air conditioning systems operating under varying weather conditions. Aging is treated as an increasing failure rate. The system can fall into unacceptable states for two reasons: through performance degradation because of failures or through an increase in the demand for cold. Each sojourn in an acceptable state, each repair and each entrance to an unacceptable state is associated with a corresponding cost. A procedure for computing this reliability associated cost is suggested, based on the Markov reward model for a non-homogeneous Poisson process. By using this model, an optimal maintenance contract that minimizes the total expected cost may be found. A numerical example for a real-world air conditioning system is presented to illustrate the approach.

1 INTRODUCTION

Many technical systems are subjected during their lifetime to aging and degradation. Most of these systems are repairable. For many kinds of industrial systems, it is very important to avoid failures or reduce their occurrence and duration in order to improve system reliability and reduce corresponding cost.

Maintenance and repair problems have been widely investigated in the literature. Barlow & Proschan (1975), Gertsbakh (2000) and Wang (2002) survey and summarize theoretical developments and practical applications of maintenance models.

With the increasing complexity of systems, only specially trained staff with specialized equipment can provide system service. In this case, maintenance service is provided by an external agent and the owner is considered a customer of the agent for maintenance service. In the literature, different aspects of maintenance service have been investigated (Almeida (2001), Murthy & Asgharizadeh (1999)).

Reliability theory provides a general approach for constructing efficient statistical models to study aging and degradation problems in different areas. Aging is considered a process which results in an age-related increase of the failure rate. The most common shapes of failure rates have been observed by Meeker and Escobar (1998) and Finkelstein (2002).

This paper presents a case study where an aging air conditioning system with minimal repair is considered. The Markov reward model is built for computing the reliability associated cost accumulated during the aging system's life span. By using the model, a corrective maintenance contract with minimal reliability associated cost can be defined from the set of different maintenance contracts available in the market. The approach is based on the non-homogeneous Markov reward model. The main advantage of the suggested approach is that it can be easily implemented in practice by reliability engineers.

2 PROBLEM FORMULATION AND MODEL DESCRIPTION

2.1 Reliability Associated Cost and Corrective Maintenance Contract

We will define the Reliability Associated Cost (RAC) as the total cost incurred by the user in operation and maintenance of the system during its lifetime. In that case RAC will comprise the operating cost, repair cost and the penalty cost accumulated during the system lifespan. Therefore,

RAC = OC + RC + PC   (1)

where

– OC is the system operating cost accumulated during the system lifetime;
– RC is the repair cost incurred by the user in operating and maintaining the system during its lifetime;
– PC is a penalty cost, accumulated during the system lifetime, which is paid when the system fails.

Let T be the system lifetime. During this time the system may be in acceptable states (system functioning) or in unacceptable ones (system failure). After any failure, a corresponding repair action is performed and the system returns to one of the previously acceptable states. Every entrance of the system into the set of unacceptable states (system failure) incurs a penalty.

A Maintenance Contract is an agreement between the repair team and the system's owner that guarantees a specific level of services being delivered. The Maintenance Contract defines important parameters that determine a service level and corresponding costs. The main parameter is the mean repair time T_r^m, where m (m = 1, 2, . . ., M) is a possible Maintenance Contract level and M is the number of such levels. The repair cost c_r^m for an individual repair action depends on the repair time and so corresponds to a maintenance contract level m. It usually ranges between the highest and lowest repair costs; thus T_r^min ≤ T_r^m ≤ T_r^max.

The problem is to find the expected reliability associated cost corresponding to each maintenance contract and choose the corrective maintenance contract minimizing this cost. According to the suggested approach, this cost is represented by the total expected reward, calculated via a specially developed Markov reward model.

2.2 Markov reward model for aging system

A Markov reward model was first introduced by Howard (1960), and applied to multi-state system (MSS) reliability analysis by Lisnianski & Levitin (2003) and others (Lisnianski (2007), Lisnianski et al., (2008)).

We suppose that the Markov model for the system has K states that may be represented by a state space diagram as well as transitions between states. Intensities a_ij, i, j = 1, . . ., K of transitions from state i to state j are defined by corresponding failure and repair rates.

It is assumed that while the system is in any state i during any time unit, some money r_ii will be paid. It is also assumed that if there is a transition from state i to state j the amount r_ij will be paid for each transition. The amounts r_ii and r_ij are called rewards. The objective is to compute the total expected reward accumulated from t = 0, when the system begins its evolution in the state space, up to the time t = T under specified initial conditions.

Let V_j(t) be the total expected reward accumulated up to time t, if the system begins its evolution at time t = 0 from state j. According to Howard (1960), the following system of differential equations must be solved in order to find this reward:

dV_j(t)/dt = r_jj + Σ_{i=1, i≠j}^{K} a_ji r_ji + Σ_{i=1}^{K} a_ji V_i(t),   j = 1, 2, . . ., K   (2)

The system (2) should be solved under the specified initial conditions: V_j(0) = 0, j = 1, 2, . . ., K.

For an aging system, its failure rate λ(t) increases with age. In the case of minimal repair, the intensities a_ij, i, j = 1, . . ., K of transitions from state i to state j corresponding to failures are dependent on time. The total expected reward can be found from the differential equations (2) by substituting the formula for λ(t) for the corresponding a_ij values.

3 NUMERICAL EXAMPLE

3.1 The system description

Consider an air conditioning system, placed in a computer center and used around the clock in varying temperature conditions. The system consists of five identical air conditioners. The work schedule of the system is as follows. For regular temperature conditions two air-conditioners must be on-line and three others are in hot reserve. For peak temperature conditions four air-conditioners have to be on-line and one is in hot reserve. The number of air conditioners that have to be on-line defines the demand level.

We denote:

c_op – the system operation cost per time unit;
c_r^m – the repair cost paid for every order of the maintenance team;
c_p – the penalty cost, which is paid when the system fails.

The aging process in the air-conditioners is described via the Weibull distribution with parameters α = 1.5849 and β = 1.5021; therefore λ(t) = 3t^0.5021. Service agents can suggest 10 different Corrective Maintenance Contracts, available in the market. Each contract m is characterized by a repair rate and a corresponding repair cost (per repair) as presented in Table 1.

The operation cost c_op is equal to $72 per year. The penalty cost c_p, which is paid when the system fails, is equal to $500 per failure.
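System (2) is linear in the V_j and can be integrated with any standard ODE scheme. A minimal explicit-Euler sketch for a hypothetical two-state unit (up/down with constant rates; all names and numbers here are illustrative, not the paper's, and a[j][i] denotes the intensity of the j → i transition):

```python
def reward_euler_step(V, a, r, dt):
    """One Euler step of dV_j/dt = r_jj + sum_{i!=j} a[j][i]*r[j][i]
    + sum_i a[j][i]*V[i] (Howard's reward equations)."""
    K = len(V)
    out = []
    for j in range(K):
        dV = r[j][j] + sum(a[j][i] * r[j][i] for i in range(K) if i != j)
        dV += sum(a[j][i] * V[i] for i in range(K))
        out.append(V[j] + dt * dV)
    return out

# hypothetical two-state unit: state 0 = up, state 1 = down
lam, mu = 0.5, 10.0              # failure and repair rate, 1/year
c_op, c_rep = 72.0, 40.0         # operating cost per year, cost per repair
a = [[-lam, lam], [mu, -mu]]     # a[j][i]: intensity of the j -> i transition
r = [[c_op, 0.0], [c_rep, 0.0]]  # r[1][0]: repair cost paid on each down -> up jump

V = [0.0, 0.0]
dt, T = 0.001, 10.0
for _ in range(int(round(T / dt))):
    V = reward_euler_step(V, a, r, dt)   # V[j]: expected cost over [0, T] from state j
```

For the aging 12-state system below, the same loop applies with the time-dependent λ(t) substituted into the intensity matrix at every step.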
Table 1. Maintenance contract characteristics.

Maintenance contract m   MTTR (days)   Repair cost c_r^m ($ per repair)
1                        3.36           36
2                        1.83           40
3                        1.22           46
4                        0.91           52
5                        0.73           58
6                        0.61           66
7                        0.52           74
8                        0.46           84
9                        0.41           94
10                       0.37          106

Following Lisnianski (2007), the variable demand, representing variable weather conditions, may be described as a continuous time Markov chain with 2 levels. The first level represents a minimal cold demand during the night and the second level represents peak cold demand during the day. The cycle time is T_c = 24 hours and the mean duration of the peak is t_d = 9 hours. The transition intensities of the model can be obtained as

λ_d = 1/(T_c − t_d) = 1/(24 − 9) = 0.066 hours^−1 = 584 year^−1,
λ_N = 1/t_d = 1/9 = 0.111 hours^−1 = 972 year^−1

By using the suggested method one will find the best maintenance contract level m that provides a minimum of Reliability Associated Cost during the system lifetime.

The transition intensity matrix for the system is as shown in (3), written in four 6 × 6 blocks (rows/columns 1–6 correspond to the regular-demand states, rows/columns 7–12 to the peak-demand states; the off-diagonal blocks are λ_d I and λ_N I, with I the 6 × 6 identity matrix):

    | a_11  2λ(t)  0      0      0      0    |
    | μ     a_22   2λ(t)  0      0      0    |
A = | 0     2μ     a_33   2λ(t)  0      0    |
    | 0     0      3μ     a_44   2λ(t)  0    |
    | 0     0      0      4μ     a_55   λ(t) |
    | 0     0      0      0      5μ     a_66 |

    | a_77  4λ(t)  0      0        0        0       |
    | μ     a_88   4λ(t)  0        0        0       |
B = | 0     2μ     a_99   3λ(t)    0        0       |
    | 0     0      3μ     a_10,10  2λ(t)    0       |
    | 0     0      0      4μ       a_11,11  λ(t)    |
    | 0     0      0      0        5μ       a_12,12 |

a = | A      λ_d I |
    | λ_N I  B     |   (3)

where

a_11 = −(2λ(t) + λ_d)        a_77 = −(4λ(t) + λ_N)
a_22 = −(2λ(t) + μ + λ_d)    a_88 = −(4λ(t) + μ + λ_N)
a_33 = −(2λ(t) + 2μ + λ_d)   a_99 = −(3λ(t) + 2μ + λ_N)
a_44 = −(2λ(t) + 3μ + λ_d)   a_10,10 = −(2λ(t) + 3μ + λ_N)
a_55 = −(λ(t) + 4μ + λ_d)    a_11,11 = −(λ(t) + 4μ + λ_N)
a_66 = −(5μ + λ_d)           a_12,12 = −(5μ + λ_N)
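The block structure of (3) is easy to generate programmatically, which also provides a check that every row sums to zero, as any intensity matrix must. A sketch (0-based indices; the Weibull hazard form below, λ(t) = β α^β t^(β−1) ≈ 3t^0.5021, is our reading of the stated parameters):

```python
def intensity_matrix(t, mu, lam_d=584.0, lam_N=972.0, alpha=1.5849, beta=1.5021):
    """12-state intensity matrix (3); states 0-5: regular demand, 6-11: peak."""
    lam = beta * alpha ** beta * t ** (beta - 1.0)   # ~ 3 * t**0.5021
    online = [2, 2, 2, 2, 1, 0,      # on-line units per state: failure rate = online*lam
              4, 4, 3, 2, 1, 0]
    a = [[0.0] * 12 for _ in range(12)]
    for s in range(12):
        j = s % 6                                    # failed units within the block
        if j < 5:
            a[s][s + 1] = online[s] * lam            # one more unit fails
        if j > 0:
            a[s][s - 1] = j * mu                     # one repair completes
        a[s][(s + 6) % 12] = lam_d if s < 6 else lam_N   # demand switching
        a[s][s] = -sum(a[s])                         # diagonal closes the row
    return a

A = intensity_matrix(t=1.0, mu=100.0)
```

The diagonal entries then reproduce the a_ii expressions listed above without writing them out by hand.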
3.2 The Markov Reward model for the system

The state-space diagram for this system is presented in Figure 1. Rectangles on the left and right sides of the state-space diagram designate the states of the air conditioners corresponding to each system state i = 1, . . ., 12. If an air conditioner is not available, the corresponding rectangle is dashed. The number of available air conditioners in any system state i defines the system performance g_i in this state. Also, near each state the corresponding demand w_i is presented.

As was written above, the technical requirements demand two on-line air conditioners under regular conditions and four in peak conditions, so that there are six acceptable states—states 1–4 and states 7–8—and six unacceptable states: states 5–6 and states 9–12.

To calculate the total expected reward, the reward matrix for the system is built in the following manner (see Lisnianski & Levitin (2003) and Lisnianski et al., (in press)).

If the system is in states 1–5 or 7–11, the operation cost associated with the use of the air conditioners, which depends on their number, must be paid during any time unit. In states 6 and 12 there are no on-line air-conditioners, so there are no rewards associated with these states.

The transitions from state 4 to state 5, from state 3 to state 9 and from state 8 to state 9 are associated with the entrance to unacceptable states, and the rewards associated with these transitions are the penalty cost.

The transitions from state 2 to state 1, from state 3 to state 2, from state 4 to state 3, from state 5 to state 4, from state 6 to state 5, from state 8 to state 7, from state 9 to state 8, from state 10 to state 9, from state 11 to state 10 and from state 12 to state 11 are
Figure 1. The state-space diagram for the air-conditioning system. (For each state i the available performance g_i is compared with the demand w_i, e.g. g = 5 > w = 2 in state 1 and g = 3 < w = 4 in state 9.)


associated with the repair of the air conditioner, and the reward associated with these transitions is the mean cost of repair. The reward matrix for the system of air conditioners, in the same 6 × 6 block form as (3), is as follows:

     | 2c_op  0      0      0      0      0 |
     | c_r^m  2c_op  0      0      0      0 |
R1 = | 0      c_r^m  2c_op  0      0      0 |
     | 0      0      c_r^m  2c_op  c_p    0 |
     | 0      0      0      c_r^m  c_op   0 |
     | 0      0      0      0      c_r^m  0 |

     | 0  0  0    0  0  0 |
     | 0  0  0    0  0  0 |
R2 = | 0  0  c_p  0  0  0 |
     | 0  0  0    0  0  0 |
     | 0  0  0    0  0  0 |
     | 0  0  0    0  0  0 |

     | 4c_op  0      0      0      0      0 |
     | c_r^m  4c_op  c_p    0      0      0 |
R4 = | 0      c_r^m  3c_op  0      0      0 |
     | 0      0      c_r^m  2c_op  0      0 |
     | 0      0      0      c_r^m  0      0 |
     | 0      0      0      0      c_r^m  0 |

r = | R1  R2 |
    | 0   R4 |

where the lower-left block is the 6 × 6 zero matrix (no rewards are attached to the demand transitions from peak to regular states).

Taking into consideration the transition intensity matrix (3), the system of differential equations that defines the Markov reward model of the air conditioning system for the calculation of the total expected reward may be written as shown in (4).
dV_1(t)/dt = 2c_op − (2λ(t) + λ_d) V_1(t) + 2λ(t) V_2(t) + λ_d V_7(t)

dV_2(t)/dt = 2c_op + c_r^m μ + μ V_1(t) − (2λ(t) + μ + λ_d) V_2(t) + 2λ(t) V_3(t) + λ_d V_8(t)

dV_3(t)/dt = 2c_op + 2c_r^m μ + c_p λ_d + 2μ V_2(t) − (2λ(t) + 2μ + λ_d) V_3(t) + 2λ(t) V_4(t) + λ_d V_9(t)

dV_4(t)/dt = 2c_op + 3c_r^m μ + 2c_p λ(t) + 3μ V_3(t) − (2λ(t) + 3μ + λ_d) V_4(t) + 2λ(t) V_5(t) + λ_d V_10(t)

dV_5(t)/dt = c_op + 4c_r^m μ + 4μ V_4(t) − (λ(t) + 4μ + λ_d) V_5(t) + λ(t) V_6(t) + λ_d V_11(t)

dV_6(t)/dt = 5c_r^m μ + 5μ V_5(t) − (5μ + λ_d) V_6(t) + λ_d V_12(t)

dV_7(t)/dt = 4c_op + λ_N V_1(t) − (4λ(t) + λ_N) V_7(t) + 4λ(t) V_8(t)

dV_8(t)/dt = 4c_op + c_r^m μ + 4c_p λ(t) + λ_N V_2(t) + μ V_7(t) − (4λ(t) + μ + λ_N) V_8(t) + 4λ(t) V_9(t)

dV_9(t)/dt = 3c_op + 2c_r^m μ + λ_N V_3(t) + 2μ V_8(t) − (3λ(t) + 2μ + λ_N) V_9(t) + 3λ(t) V_10(t)

dV_10(t)/dt = 2c_op + 3c_r^m μ + λ_N V_4(t) + 3μ V_9(t) − (2λ(t) + 3μ + λ_N) V_10(t) + 2λ(t) V_11(t)

dV_11(t)/dt = 4c_r^m μ + λ_N V_5(t) + 4μ V_10(t) − (λ(t) + 4μ + λ_N) V_11(t) + λ(t) V_12(t)

dV_12(t)/dt = 5c_r^m μ + λ_N V_6(t) + 5μ V_11(t) − (5μ + λ_N) V_12(t)   (4)

The system is solved under the initial conditions V_i(0) = 0, i = 1, 2, . . ., 12, using MATLAB®, the language of technical computing.

4 CALCULATION RESULTS

Figure 2 shows the expected Reliability Associated Cost for T = 10 years as a function of the Maintenance Contract Level (m). The seventh level (m = 7), which provides the minimal expected reliability associated cost ($14,574) for the system, corresponds to a mean repair time of 0.52 days. Choosing a more expensive Maintenance Contract Level, we pay an additional payment to the repair team. Choosing a less expensive one, we pay more in penalties because of transitions to unacceptable states.

Figure 2. The expected Reliability Associated Cost vs Maintenance Contract Level.

5 CONCLUSIONS

The case study for the estimation of the expected reliability associated cost accumulated during the system lifetime is considered for an aging system under minimal repair. The approach is based on the application of a special Markov reward model, well formalized and suitable for practical application in reliability engineering. The optimal corrective maintenance
contract (m = 7), which provides the minimal expected reliability associated cost ($14,574), was found.

REFERENCES

Almeida de, A.T. 2001. Multicriteria Decision Making on Maintenance: Spares and Contract Planning. European Journal of Operational Research 129: 235–241.
Barlow, R.E. & Proschan, F. 1975. Statistical Theory of Reliability and Life Testing. Holt, Rinehart and Winston: New York.
Finkelstein, M.S. 2002. On the shape of the mean residual lifetime function. Applied Stochastic Models in Business and Industry 18: 135–146.
Gertsbakh, I. 2000. Reliability Theory with Applications to Preventive Maintenance. Springer-Verlag: Berlin.
Howard, R. 1960. Dynamic Programming and Markov Processes. MIT Press: Cambridge, Massachusetts.
Lisnianski, A. 2007. The Markov Reward Model for a Multi-State System Reliability Assessment with Variable Demand. Quality Technology & Quantitative Management 4(2): 265–278.
Lisnianski, A., Frenkel, I., Khvatskin, L. & Ding Yi. 2007. Markov Reward Model for Multi-State System Reliability Assessment. In F. Vonta, M. Nikulin, N. Limnios, C. Huber-Carol (eds), Statistical Models and Methods for Biomedical and Technical Systems. Birkhaüser: Boston, 153–168.
Lisnianski, A., Frenkel, I., Khvatskin, L. & Ding Yi. 2008. Maintenance contract assessment for aging system. Quality and Reliability Engineering International, in press.
Lisnianski, A. & Levitin, G. 2003. Multi-state System Reliability. Assessment, Optimization and Applications. World Scientific: NJ, London, Singapore.
Meeker, W. & Escobar, L. 1998. Statistical Methods for Reliability Data. Wiley: New York.
Murthy, D.N.P. & Asgharizadeh, E. 1999. Optimal Decision Making in a Maintenance Service Operation. European Journal of Operational Research 116: 259–273.
Wang, H. 2002. A survey of Maintenance Policies of Deteriorating Systems. European Journal of Operational Research 139: 469–489.

Exact reliability quantification of highly reliable systems


with maintenance

R. Briš
Technical University of Ostrava, Czech Republic

ABSTRACT: The paper presents a new analytical algorithm which is able to carry out exact reliability quantification of highly reliable systems with maintenance (both preventive and corrective). A directed acyclic graph is used as the system representation. The algorithm makes it possible to take into account highly reliable and maintained input components. A new model of a repairable component subject to hidden failures, i.e. a model in which a failure is identified only at special, deterministically assigned times, is analytically deduced within the paper. All considered models are implemented in the new algorithm. The algorithm is based on a special new procedure which permits only the summation of two or more non-negative numbers that can be of very different magnitudes. If the summation of very small positive numbers represented in machine code is performed carefully, no error is committed in the operation. Reliability quantification is demonstrated on a real system from practice.

1 INTRODUCTION

A Monte Carlo simulation method (Marseguerra & Zio (2001)) is used for the quantification of reliability when accurate analytic or numerical procedures do not lead to satisfactory computations of system reliability. Since highly reliable systems require many Monte Carlo trials to obtain reasonably precise estimates of the reliability, various variance-reducing techniques (Tanaka et al., 1989), and techniques based on reduction of prior information (Baca 1993), have been developed. A direct simulation technique has been improved by the application of a parallel algorithm (Bris 2008) to such an extent that it can be used for real complex systems, which can then be modelled and quantitatively estimated from the point of view of reliability without the unrealistic simplifying conditions that analytic methods usually require. If a complex system undergoes a fault tree (FT) analysis, a new and fast FT quantification method with a high degree of accuracy for large FTs, based also on a Monte Carlo simulation technique and truncation errors, was introduced in Choi & Cho (2007). However, when it is necessary to quantitatively estimate highly reliable systems, for which unreliability indicators (i.e. probabilities of system non-function) fall to the order of 10^−5 and beyond (10^−6 etc.), the simulation technique, however improved, can meet the problems of prolonged and inaccurate computations.

Highly reliable systems appear ever more often, and in research they are closely connected with rapid technological progress. We can observe such systems, for example, in space applications, where a complex device is often designed for an operation lasting a very long time without the possibility of human intervention. Safety systems of nuclear power stations represent another example of highly reliable systems. They have to be reliable enough to comply with ever stricter internationally agreed safety criteria; moreover, they are mostly so-called sleeping systems, which start and operate only in the case of big accidents. Their hypothetical failures are then not apparent and are thus repairable only at optimally selected inspection times. The question is how to model the behaviour of these systems and how to find an efficient procedure for the estimation and computation of their reliability indicators.

2 A PROBLEM FORMULATION AND INPUT COMPONENT MODELS

2.1 Problem formulation

Let us have a system assigned with the help of a directed acyclic graph (AG), see (Bris 2008). Terminal nodes of the AG are established by the definition of the deterministic or stochastic process to which they are subordinate. From them we can compute a time course of the availability coefficient, or the unavailability, of individual terminal nodes, using the methodology of basic renewal theory, as for example in (Bris 2007). The aim is then to find the corresponding time course of the (un)availability coefficient for the highest SS node, which represents the reliability behaviour of the whole system.
2.2 Input component models

In the first phase of development we suppose an exponential distribution for the time to failure, and possibly for the time to restoration. Under this condition we can describe all frequently used models with both preventive and corrective maintenance by the three following models:

• Model with elements (terminal nodes in the AG) that cannot be repaired.
• Model with repairable elements (CM—Corrective Maintenance) for apparent failures, i.e. a model in which a failure is identified at its occurrence, immediately starting a process leading to its restoration.
• Model with repairable elements with hidden failures, i.e. a model in which a failure is identified only at special, deterministically assigned times appearing with a given period (moments of periodical inspections). If a failure is found at one of these times, an analogous restoration process starts, as in the previous case.

An accurate analytical computation of the time dependence of the (un)availability coefficient was explained and derived for the first two situations in Bris & Drabek (2007). Let us recall that in the first case, that of an element which cannot be repaired, the final course of the unavailability coefficient is given by the distribution function of the time to failure of the element:

P(t) = 1 − e^{−λt},

where λ is the failure rate.
    In the second case a relation can be derived, on the basis of the Laplace transformation, for a similar coefficient:

P(t) = 1 − (μ/(λ + μ) + λ/(λ + μ)·e^{−(λ+μ)t})
     = λ/(λ + μ)·(1 − e^{−(λ+μ)t}),  t > 0    (1)

where μ is the repair rate.
    The third model is rather more general than the model with periodical exchanges stated earlier in Bris & Drabek (2007), where we assumed a deterministic time to the end of repair. If we instead presume that the time to the end of repair is an exponential random variable, an analytical expression for the time course of the (un)availability coefficient must be derived.

2.3 Derivation of an unavailability coefficient for a model with repairable elements and occurrence of hidden failures

With the same notation for failure and repair rates as given above, the unavailability coefficient can be described by the following function:

P(τ) = (1 − P_C)·(1 − e^{−λτ}) + P_C·[1 + μ/(μ − λ)·(e^{−μτ} − e^{−λτ})],  τ > 0,    (2)

where τ is the time which has passed since the last planned inspection, and P_C is the probability of a non-functional state of the element at the moment of the inspection at the beginning of the interval to the next inspection.

Proof. Let T_P be the assigned period of inspections, i.e. of examinations of the functional state of an element. Let us further denote by
    P_A(t) . . . the probability that the element is, at time t, in the correct functional state,
    P_B(t) . . . the probability that the element is, at time t, in a failure state,
    P_C(t) . . . the probability that the element is, at time t, in a repair state.
    In the first interval, when t ∈ [0, T_P), the situation is qualitatively the same as for an element that cannot be repaired. In an interval between two inspections, however, the situation is different. If the given element is found failed, it passes into the state of repair. If we introduce a time variable τ as the time which has passed since the moment of the last inspection, this situation can be written as

P_B(τ = 0) = 0.

At the moment of inspection the element can be in the state of repair with a probability P_C, which can be written as

P_C(τ = 0) = P_C.

And finally,

P_A(τ = 0) = P_A

is the probability that the element is in correct function at the time of the inspection. It is apparent that

P_A + P_C = 1.
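As an aside, formulas (1) and (2) translate directly into code. The sketch below is ours, in Python rather than the paper's Matlab, and the rates and inspection-state probability used in the comments are illustrative, not taken from the paper:

```python
import math

def unavail_cm(t, lam, mu):
    # Eq. (1): unavailability of a repairable element with apparent
    # (immediately detected) failures; lam = failure rate, mu = repair rate.
    return lam / (lam + mu) * -math.expm1(-(lam + mu) * t)

def unavail_hidden(tau, lam, mu, pc):
    # Eq. (2): unavailability at time tau after a periodic inspection;
    # pc = probability of being in the repair state at the inspection moment.
    return (1.0 - pc) * -math.expm1(-lam * tau) \
        + pc * (1.0 + mu / (mu - lam) * (math.exp(-mu * tau) - math.exp(-lam * tau)))
```

A quick sanity check on both: with pc = 0, formula (2) collapses to 1 − e^{−λτ}, the non-repairable case, and for large t formula (1) approaches the steady-state value λ/(λ + μ).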
In the course of time between two consecutive inspections the following holds:

P_A(τ) − P_A(τ)·λ·dτ + P_C(τ)·μ·dτ = P_A(τ + dτ)
P_B(τ) + P_A(τ)·λ·dτ = P_B(τ + dτ)
P_C(τ) − P_C(τ)·μ·dτ = P_C(τ + dτ)

After elementary manipulation the following set of differential equations is obtained, with the known initial conditions stated above:

P_A′(τ) + λ·P_A(τ) − μ·P_C(τ) = 0
P_B′(τ) − λ·P_A(τ) = 0
P_C′(τ) + μ·P_C(τ) = 0

The solution of this set is:

P_A(τ) = (P_A + P_C·μ/(μ − λ))·e^{−λτ} − P_C·μ/(μ − λ)·e^{−μτ}
P_B(τ) = P_A + P_C + P_C·λ/(μ − λ)·e^{−μτ} − (P_A + P_C·μ/(μ − λ))·e^{−λτ}
P_C(τ) = P_C·e^{−μτ}

Then the probability that the element is not in the state of correct function at time τ within the interval is:

P(τ) = P_B(τ) + P_C(τ)
     = P_A + P_C + P_C·λ/(μ − λ)·e^{−μτ} − (P_A + P_C·μ/(μ − λ))·e^{−λτ} + P_C·e^{−μτ}
     = (1 − P_C)·(1 − e^{−λτ}) + P_C·[1 + μ/(μ − λ)·(e^{−μτ} − e^{−λτ})]

Thus the relationship (2) is proved.

Notes:
1. For the purposes of effective numerical evaluation on a computer (see further), the expression in the brackets can be converted into the form

1 + μ/(μ − λ)·(e^{−μτ} − e^{−λτ}) = 1 − μ/(μ − λ)·e^{−λτ}·(1 − e^{−(μ−λ)τ}),  τ > 0

2. In what follows we will need this expression to be always positive, which is easy to prove.


3 THE NEW ALGORITHM

3.1 Probabilities of functional and non-functional state

It is evident that the probability p of a functional state and the probability q of a non-functional state satisfy the relation

p + q = 1.

Taking into consideration the finite accuracy of real numbers in a computer, it matters which of p and q we compute. If we want to make full use of the machine accuracy, we have to compute the smaller of the two probabilities.

Example 1.1 If we computed, hypothetically, on a computer with three-digit decimal numbers, then for the value q = 0.00143 we would have, instead of the correct value p = 0.99857, only p = 0.999. Recomputing q = 1 − p, we would get q = 1 − p = 0.00100. It is apparent that a great loss of accuracy occurs if we compute p instead of q.

Since the probabilities of a non-functional state of a highly reliable system are very small, we have to concentrate on the numerical expression of these probabilities. For this purpose it is necessary to reorganize the computer calculation and to set certain rules which ensure that no accuracy is lost in the numerical process.

3.2 Problems with computer subtraction of two near real numbers

The probability of non-functioning of the simplest possible element (a model with elements that cannot be repaired) is given by the relation

P(t) = 1 − e^{−λt},

introduced above as an unavailability coefficient, λ being the failure rate. Similarly, for the other models of system elements the computation of the expression

1 − e^{−x},  for x ≥ 0    (3)

is the crucial moment in the numerical expression of the probability of non-function (the unavailability coefficient).
    For values x << 1, i.e. near 0, direct numerical evaluation of the formula as written would lead to great errors! In the subtraction of two near numbers a considerable loss of accuracy occurs. On a personal computer
the smallest number ε for which 1 + ε is still numerically distinguishable from 1 is approximately 10^−18. If x ≈ 10^−25, the true value of the expression (3) is near 10^−25, yet a direct numerical calculation of the expression gives zero!
    As the algorithm was created in the Matlab programming environment, the Matlab function "expm1", which enables exact calculation of the expression (3), was used for the needs of this paper.

3.3 The numerical essence of the probability of a non-functional state of a node

The probability of a non-functional state of a node of an AG whose individual input edges are independent is in fact given by going over all possible combinations of the probabilities of the input edges. For 20 input edges we regularly have a million combinations. One partial contribution to the probability of a non-functional state of the node in Figure 1 has the form:

q1·q2···qi−1·pi·qi+1···qj−1·pj·qj+1···q20,

where the number of occurring probabilities p (which here equals 2) cannot reach "m". The probability of a non-functional state of the node is generally given by a sum of a large quantity of very small numbers. These numbers are generally very different!

Figure 1. One node of an acyclic graph with 20 edges.

If the sum is carried out so that the addition runs in order from the biggest number to the smallest, certainly more accuracy is lost to rounding than if the addition runs in order from the smallest values to the biggest. And even in the second case it is not possible to determine effectively how much accuracy "has been lost".
    Note: In the case of dependence of the input edges (terminal nodes) we cannot express the behaviour of an individual node numerically. It is necessary to work with the whole relevant sub-graph. The combinatorial character of the quantification nevertheless stays unchanged.

3.4 The error-free sum of different non-negative numbers

The first step to the solution of this problem is to find a method for the "accurate" sum of many non-negative numbers.
    The arithmetic unit of a computer (PC) works in binary. A positive real number on today's PCs contains 53 valid binary digits, see Figure 2. The possible order (binary exponent) ranges from approximately −1000 to 1000. The line indicated as "order" in the figure gives the order of a binary digit.

Figure 2. A positive real number in binary scale.

The algorithm for the "accurate" quantification of sums of many non-negative numbers consists of a few steps:

1. The whole possible machine range of binary positions (bits) is partitioned into segments of 32 positions of orders, according to the scheme in Figure 3. The number of these segments will be approximately:

2000/32 ≈ 63

Figure 3. Segments composed of 32 binary positions.

2. The total sum is memorized as one real number composed of these 32-bit segments. Each of these segments has 21 additional bits used for carries.
3. First, a given non-zero number to be added to the sum is decomposed, according to the previously assigned fixed borders (step 1), into at most three parts, each containing at most 32 binary digits of the number, according to the scheme in Figure 4. The individual segments are indexed by the numbers 1–63.
4. Then the individual parts of this decomposed number are added to the corresponding members of the sum number, as in Figure 5.
5. Always after the processing of 2^20 numbers (the limit is chosen so that under no circumstances can the sum number overflow)
a modification of the sum number is carried out, indicated as the "clearance" process: a carry is separated upwards (in Figure 6 it is identified by the symbol β) and added to the upper sum.
6. If the final sum is required, the clearance process is carried out first. Then the group of three segment sums is processed, of which the upper one is the highest non-zero one (identified by the symbol α in Figure 7). We add these three sums in the usual binary way, where p in the following expression is given by the index of the highest non-zero segment:

sum = α·2^p + β·2^{p−32} + γ·2^{p−64}

Figure 4. Decomposition of a given non-zero number.

Figure 5. Adding a number to the sum number.

Figure 6. "Clearance" process.

Figure 7. Demonstration of the final summation.

So numbers in their full machine accuracy (53 binary digits beginning with 1) are the input of this adding process. The output is a single number in the same machine accuracy (53 binary digits beginning with 1). This number is the machine-nearest number to the exact error-free sum, which in principle contains up to 2000 binary digits.

3.5 Permissible context of usage not leading to loss of accuracy

The probability of a non-functional state of a repairable element (see section 2.3—a model with repairable elements and occurrence of hidden failures) is given by the formula (2), which can be adapted to

P(τ) = (1 − P_C)·α(τ) + P_C·β(τ),

where P_C is the probability of a non-functional state of the element at the moment of the inspection at the beginning of the interval till the next inspection, and α, β are non-negative, machine-accurate, numerically expressed functions.
    One contribution to the computation of a non-functional state of a node generally has the form

q1·q2···(1 − qk)···.

In both cases the factor (1 − q) occurs. It has already been explained that the recalculation p ≡ 1 − q can lead to a catastrophic loss of accuracy. A basic question then arises: do we have to modify the stated formulas further in order to remove the subtraction? Fortunately not. In the introduced context (1 − q)·α, where α is expressed numerically in full machine accuracy, there is no loss of machine accuracy! Thanks to the rounding of the final product to 53 binary digits, the lower orders of the expression (1 − q), i.e. the binary digits at the 54th and further places behind the first valid digit, practically cannot take effect.

3.6 Determination of system probability behaviour according to the graph structure

Let all elements appearing in the system be independent. The probability of a non-functional state of a system assigned with the help of an AG is then simply obtained by evaluating all nodes upwards. For instance, for the AG in Figure 8:

• numerical expression of the probability of a non-functional state of the terminal nodes is carried out, i.e. elements 8, 9, 10 and 5, 6, 7;
• numerical expression of the probability of a non-functional state of the internal node 4 is carried out, which is given by the following sum:

q8·q9·q10 + (1 − q8)·q9·q10
+ q8·(1 − q9)·q10 + q8·q9·(1 − q10)
+ q8·(1 − q9)·(1 − q10)
+ (1 − q8)·q9·(1 − q10)
+ (1 − q8)·(1 − q9)·q10
Figure 8. The system assigned with the help of an AG structure.

• numerical expression of the probability of a non-functional state of the internal node 3 is carried out, which is given by the single item

q5·q6·q7

• numerical expression of the probability of a non-functional state of the terminal node 2 is carried out;
• numerical expression of the probability of a non-functional state of the highest SS node 1 is carried out, which is given by:

q2·q3·q4 + (1 − q2)·q3·q4 + q2·(1 − q3)·q4
+ q2·q3·(1 − q4)

Note: The internal numbering of nodes is such that a node with a smaller number cannot be inferior to a node with a greater number. Nodes are numbered in decreasing order.

3.7 Importance of the new algorithm

The algorithm is based on the summation of many non-negative numbers that can be very different. The output of the summation is one number in full machine accuracy, which we call the error-free sum. The importance of the error-free algorithm can be explained on the following, very hypothetical, example:
    Let us have a system given by an acyclic graph similar to the one in Figure 1, with 30 input edges. In such a case the sum for the probability of a non-functional state of the highest SS node can be composed of about one billion (2^30) summands. If we have to sum (by software on a common PC) one number of value 2^−20 with an additional 2^30 numbers, all of them having the value 2^−75, we obtain the value 2^−20. If we apply the new algorithm, we obtain the error-free sum, which equals:

sum = 2^−20 + 2^−45

The error of the first result extends to about half of the mantissa of the result number, which must be considered a relevant error. Despite the fact that the example is very hypothetical, it clearly demonstrates the importance of the algorithm.


4 RESULTS WITH A TESTED SYSTEM

The new algorithm, implemented in Matlab, was tested on a real highly reliable system from the RATP company (Régie Autonome des Transports Parisiens). The system was used as a test case activity (Dutuit & Chatelet 1997) of the Methodological Research working group of the French Institute for Dependability & Safety (ISdF). The system is shown in Figure 9.

Figure 9. Block diagram of the tested system.

The system is a one-out-of-three active parallel system including three non-identical components. The first two components are periodically tested and repaired with test intervals T1 and T2 (T2 >> T1). The third component is neither tested nor repairable. Each component has a constant failure rate λi; the repair or replacement durations are negligible; the test intervals are very different, respectively 24 hours and 26280 hours. We can say that the failures of the first two components are hidden failures, which are revealed only when the next test occurs. At first glance we can say that the system is highly reliable. Our target is to calculate the dependence of unavailability on time.
    As a first step, the dependence of unavailability on time was calculated for the individual components, as
demonstrated in Figures 10–12. Secondly, the dependence of unavailability on time for the whole system in the course of the first 7 years of operation was determined, see Figure 13.

Figure 10. Dependence of unavailability on time (in hours) for Component 1.

Figure 11. Dependence of unavailability on time (in hours) for Component 2.

Figure 12. Dependence of unavailability on time (in hours) for Component 3.

Figure 13. Dependence of unavailability on time (in hours) for the whole system.

We see that the maintenance process is very important for the highly reliable system. Thanks to maintenance actions, the maximal unavailability in the course of the first six years is 2.5e−8.
    For a better demonstration of the new algorithm, let us consider a system composed of identical components, each such as Component 1, i.e. all of them having the same failure rate:

λ = 1e−7

The only difference between both systems is the type of maintenance. In Figure 14 we can see the dependence of unavailability on time for this system in the course of the first 7 years of operation.

Figure 14. Dependence of unavailability on time (in hours) for the whole system with all components having λ = 1e−7.

We can easily see that the unavailability dropped by three orders of magnitude. The maximal unavailability value in the course of the first six years of operation is about 3.3e−11.


5 CONCLUSION

Maintaining full machine accuracy above all requires not carrying out subtraction of near values. All required outputs therefore have to be expressed in the form of sums of numbers of consistent sign (in our case all non-negative).
    The problem of a sum of many non-negative numbers can be solved by decomposing the sum into more partial sums, which can be carried out without a loss!
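Both requirements (avoiding subtraction of near values, and lossless summation of many non-negative numbers) can be illustrated in a few lines. The sketch below is ours, in Python rather than the paper's Matlab: math.expm1 plays the role of the Matlab expm1 from section 3.2, and exact rational accumulation stands in for the 32-bit segmented accumulator of section 3.4:

```python
import math
from fractions import Fraction

# Avoiding subtraction of near values: for tiny x, 1 - exp(-x) cancels
# catastrophically, while expm1 keeps full machine accuracy.
x = 1e-25
naive_q = 1.0 - math.exp(-x)     # exp(-x) rounds to 1.0, so this is 0.0
accurate_q = -math.expm1(-x)     # correct to machine accuracy

def error_free_sum(xs):
    # Every finite double is an exact dyadic rational; accumulating with
    # unbounded integers and rounding once at the end mimics the paper's
    # segmented fixed-point accumulator.
    return float(sum(Fraction(v) for v in xs))

# A scaled-down version of the hypothetical example of section 3.7:
xs = [2.0 ** -20] + [2.0 ** -75] * 1024
plain = 0.0
for v in xs:
    plain += v                   # every 2^-75 contribution is rounded away

assert naive_q == 0.0 and abs(accurate_q - 1e-25) < 1e-40
assert plain == 2.0 ** -20
assert error_free_sum(xs) == 2.0 ** -20 + 2.0 ** -65
```

The standard-library function math.fsum gives the same correctly rounded result for such lists; the exact-rational version above merely makes the "round only once at the end" idea explicit.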
The process has been numerically realized within the Matlab programming environment.
    The numerical expression of the probabilities of a non-functional state of one node of an AG has a combinatorial character. We have to go over all combinations of input-edge behaviour leading to a non-functional state of the node. The astronomical increase of combinations with the increasing number of elements means that the program is usable only up to a certain size of system. Already on moderately exceeding the critical size of the system, an enormous increase of machine time occurs. All computations above ran below 1 s, on a Pentium(R) 4 CPU 3.40 GHz with 2.00 GB RAM.
    The model with repairable elements with hidden failures, i.e. a model in which a failure is identified only at special, deterministically assigned times, has been analytically derived within the paper. The final formula meets the requirement of the permissible context which is required in the presented algorithm.
    The algorithm enables exact unavailability analysis of real maintained systems with both preventive and corrective maintenance.


6 ACKNOWLEDGEMENT

This work is supported partly by The Ministry of Education, Youth and Sports of the Czech Republic—project CEZ MSM6198910007 and partly by The Ministry of Industry and Trade of the Czech Republic—project FT-TA4/036.


REFERENCES

Baca, A. 1993. Examples of Monte Carlo methods in reliability estimation based on reduction of prior information. IEEE Trans Reliab 42(4):645–9.
Briš, R. 2007. Stochastic Ageing Models—Extensions of the Classic Renewal Theory. In Proc. of First Summer Safety and Reliability Seminars 2007, 22–29 July, Sopot: 29–38, ISBN 978-83-925436-0-2.
Briš, R. 2008. Parallel simulation algorithm for maintenance optimization based on directed Acyclic Graph. Reliab Eng Syst Saf 93:852–62.
Briš, R. & Drábek, V. 2007. Mathematical Modeling of both Monitored and Dormant Failures. In Lisa Bartlett (ed.), Proc. of the 17th Advances in Risk and Reliability Technology Symposium AR2TS, Loughborough University: 376–393.
Choi, J.S. & Cho, N.Z. 2007. A practical method for accurate quantification of large fault trees. Reliab Eng Syst Saf 92:971–82.
Dutuit, Y. & Chatelet, E. 1997. TEST CASE No. 1, Periodically tested parallel system. Test-case activity of European Safety and Reliability Association. ISdF-ESRA 1997. In: Workshop within the European conference on safety and reliability, ESREL 1997, Lisbon, 1997.
Marseguerra, M. & Zio, E. 2001. Principles of Monte Carlo simulation for application to reliability and availability analysis. In: Zio E, Demichela M, Piccinini N, editors. Safety and reliability towards a safer world, Torino, Italy, September 16–20, 2001. Tutorial notes. pp. 37–62.
Tanaka, T., Kumamoto, H. & Inoue, K. 1989. Evaluation of a dynamic reliability problem based on order of component failure. IEEE Trans Reliab 38:573–6.

Genetic algorithm optimization of preventive maintenance scheduling for repairable systems modeled by generalized renewal process

P.A.A. Garcia
Universidade Federal Fluminense—Departamento de Administração, Volta Redonda—RJ, Brasil
Fundação Gorceix—Petrobras (CENPES), Rio de Janeiro, Brasil

M.C. Sant’Ana
Agência Nacional de Saúde, Rio de Janeiro, Brasil

V.C. Damaso
Centro Tecnológico do Exército, Rio de Janeiro, Brasil

P.F. Frutuoso e Melo


COPPE/UFRJ—Nuclear, Rio de Janeiro, Brasil

ABSTRACT: Reliability analyses of repairable systems are currently modelled through point stochastic processes, which intend to establish survival measures in a failure × repair scenario. However, these approaches do not always represent the real life-cycle of repairable systems. For a better and more coherent modelling of reality, one has the Generalized Renewal Process (GRP). With this approach, reliability is modelled considering the effect of a non-perfect maintenance process, which uses a better-than-old-but-worse-than-new repair assumption. Considering the GRP approach, this paper presents an availability model for operational systems and discusses an optimisation approach based on a simple genetic algorithm (GA). Finally, a case is presented, and the obtained results demonstrate the efficacy of combining GRP and GA in this kind of problem.

1 INTRODUCTION

The operational experience of Brazilian nuclear power plants, gained in recent years, has led research on probabilistic safety assessment (PSA) in the direction of preventive maintenance scheduling optimization to enhance operational safety.
    Generally, due to the complexity of nuclear systems, different optimization techniques have been applied. These are commonly based on evolutionary computing, specifically on genetic algorithms. Nowadays there are other heuristics, such as particle swarm optimization (Alkhamis & Ahmed, 2005) and ant colony optimization (Ruiz et al., 2007, and Samarout et al., 2005), among others. However, genetic algorithm approaches have been preferred (Taboada et al., 2007, Martorell et al., 2007, Martorell et al., 2006, and Garcia et al., 2006).
    The use of evolutionary techniques to establish an optimized preventive maintenance schedule is justified because these kinds of problems are generally combinatorial, where one intends to obtain, simultaneously, the best policy for all the components and to maximize the system availability. Indeed, the availability of the many components that comprise the entire system should be considered. In this case, a superimposed stochastic process combining all the stochastic processes concerning each component can be used.
    Whatever the optimization method chosen, the probabilistic modeling is of crucial importance and should be given appropriately. The present work aims to base the availability modeling of repairable systems on a generalized renewal process (Kijima & Sumita, 1986).


2 RANDOM PROCESSES COMMONLY USED IN PSA

Firstly, defining repairable systems is appropriate. According to Rigdon & Basu (2000), a system is called repairable if, when a failure occurs, it can be restored to an operating condition by some repair process other than replacement of the entire system.
    For situations where the downtime associated with maintenance, repair or replacement actions is
negligible compared with the mean time between failures (MTBF), point processes are used as probabilistic models of the failure processes. The commonly adopted point processes in PSA are as follows: (i) the homogeneous Poisson process (HPP), (ii) the ordinary renewal process (ORP) and (iii) the non-homogeneous Poisson process (NHPP). However, these approaches do not represent the real life-cycle of a repairable system (Modarres, 2006). Rather, they rest on assumptions that conflict with reality. In the HPP and ORP the device, after a repair, returns to an as-good-as-new condition, and in an NHPP the device, after a repair, returns to an as-bad-as-old condition.
    Kijima & Sumita (1986) introduced the concept of the generalized renewal process (GRP) to generalize the three point processes previously mentioned. With this approach, reliability is modeled considering the effect of a non-perfect maintenance process, which uses a better-than-old-but-worse-than-new repair assumption. Basically, the GRP addresses the repair assumption by introducing the concept of virtual age, which defines a parameter q that represents the effectiveness of repair.

2.1 Generalized renewal process

As mentioned, the probabilistic model to be considered in this work to approach repair actions, especially imperfect repairs, is the generalized renewal process (GRP). Nevertheless, for a complete understanding of the GRP, it is necessary to define the concept of virtual age (V_n).
    V_n corresponds to the calculated age of a particular piece of equipment after the n-th repair action. Kijima & Sumita (1986) proposed two ways of modeling this virtual age. The first one, commonly named type I, basically consists of the assumption that a repair action acts only on the time step just before it. With this assumption, the virtual age of a component increases proportionally to the time between failures:

V_i = V_{i−1} + q·Y_i    (1)

where V_i is the virtual age immediately after the i-th repair action, and Y_i is the time between the (i − 1)-th and i-th failures.
    The type II model considers that the repair can restore the system taking into account the elapsed time since the beginning of its life. In this model, the virtual age increases proportionally to the total time:

V_i = q·(Y_i + V_{i−1})    (2)

In both models q can be defined as a rejuvenation parameter.
    According to this modeling, assuming a value of q = 0 leads to an RP (as good as new), whereas the assumption of q = 1 leads to an NHPP (as bad as old). Values of q that fall in the interval 0 < q < 1 represent an after-repair state in which the condition of the system is "better than old but worse than new".

Figure 1. Visualization of virtual age (adapted from Jakopino, 2005).

On the basis of this proposition of virtual age, Kijima et al. (1988) proposed the following approach to calculate the conditional probability of failure:

F(T | V_n = y) = (F(T + y) − F(y)) / (1 − F(y))    (3)

where F is the cumulative distribution function.
    Considering (without loss of generality) that the time between failures is modeled by a Weibull distribution, it follows that

F(t_i, α, β, q) = 1 − exp[ ((q/α)·Σ_{j=1}^{i−1} t_j)^β − ((t_i + q·Σ_{j=1}^{i−1} t_j)/α)^β ]    (4)


3 AVAILABILITY MODELING OF REPAIRABLE SYSTEMS

A preventive maintenance policy has an important role in enhancing the system availability of any power plant. Nevertheless, the scheduling of preventive maintenance actions must consider aging characteristics.
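The two virtual-age rules (1)–(2) and the conditional probability (3) of section 2 translate directly into code. The sketch below is ours (Python; function names and the illustrative parameter values are not from the paper), with the Weibull CDF parameterized as in Eq. (4):

```python
import math

def virtual_age_type1(q, times):
    # Kijima type I, Eq. (1): V_i = V_{i-1} + q*Y_i -- each repair only
    # removes part of the damage of the last operating interval.
    v = 0.0
    for y in times:
        v = v + q * y
    return v

def virtual_age_type2(q, times):
    # Kijima type II, Eq. (2): V_i = q*(V_{i-1} + Y_i) -- each repair
    # rejuvenates the whole accumulated age.
    v = 0.0
    for y in times:
        v = q * (v + y)
    return v

def weibull_cdf(t, alpha, beta):
    # Weibull CDF, computed without cancellation for small arguments.
    return -math.expm1(-((t / alpha) ** beta))

def conditional_failure_prob(T, y, alpha, beta):
    # Eq. (3): probability of failing within a further time T, given that
    # the virtual age after the last repair is y.
    fy = weibull_cdf(y, alpha, beta)
    return (weibull_cdf(T + y, alpha, beta) - fy) / (1.0 - fy)

# q = 1 reproduces "as bad as old" (total age), q = 0 "as good as new".
assert virtual_age_type1(1.0, [10.0, 20.0, 30.0]) == 60.0
assert virtual_age_type1(0.0, [10.0, 20.0, 30.0]) == 0.0
```

With β = 1 the Weibull distribution is exponential and hence memoryless, so the conditional probability (3) no longer depends on the virtual age y, which is a convenient consistency check.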
The higher the degradation, the shorter the time between maintenance stops. In turn, to model system availability under this condition, the following assumptions were made:

• After each maintenance task, the component will not go back to an as-good-as-new condition;
• There is a probability of failure during the time between maintenances, leading to a corrective action and influencing the device unavailability;
• When a component is selected to be maintained, all the components associated with it will be unavailable too;
• The time between maintenances is not necessarily constant.

Based on these assumptions, a modeling approach to calculate the component availability under a given maintenance policy is presented.

Consider R(t) as the component reliability and A(t) as its availability, and assume, without loss of generality, that the failure process can be modeled by a Power Law process, with the time between failures modeled by a Weibull distribution. Then At1(t), the availability concerning a preventive maintenance scheduling, is given by

At1(t) = R(t|V0)·e^(−μt)·[μ·∫_{t0}^{t1} e^(μt′)·R^(−1)(t′|V0) dt′ + 1],   t0 ≤ t < t1
At1(t) = 0,   t1 ≤ t < t1 + Δman      (5)

where At1(t) is the availability for t0 ≤ t ≤ t1, t0 is the initial time and t1 is the time to the first maintenance stop; V0 is the virtual age of the component at the beginning of the simulation. This parameter is quite important because, in aging characterization, the component is not new but already carries a certain level of aging.

For the subsequent maintenance steps, the modeling is as follows:

Ati(t) = R(t|Vi−1)·e^(−μ(t−(ti−1+Δrep)))·[μ·∫_{ti−1}^{ti} e^(μ(t′−(ti−1+Δrep)))·R^(−1)(t′|Vi−1) dt′ + 1],   ti−1 + Δman ≤ t < ti
Ati(t) = 0,   ti ≤ t < ti + Δman      (6)

where Ati(t) is the availability between the (i−1)-th and i-th maintenance stops, R(t|Vi−1) = 1 − F(t|Vi−1), F(t|Vi−1) is given as in Equation (4), Δman is the maintenance time and Δrep the repair time.

Notice that, if ti + Δman is higher than the mission time, this figure is replaced by Tmiss, the mission time under evaluation.

Considering this modeling approach, the mean availability is

Ā = (1/Tmiss) · Σ_{i=0}^{n} ∫_{ti−1+Δman}^{ti} A(t) dt      (7)


4 MAINTENANCE DISTRIBUTION AND GENETIC MODELING

The genetic algorithm is used to set the preventive maintenance scheduling for each component, in order to optimize the system availability along the mission period. The instants selected by the GA for performing maintenance on a certain component should follow a distribution pattern. For instance, it is reasonable to suppose that the interval between maintenances is reduced as the system ages.

This assumption motivates a modeling that looks for solutions to the preventive maintenance planning with some ordering in that distribution (Damaso, 2006). The benefit of such an approach is to limit the universe of solutions to be considered, eliminating those that make no practical sense. Thus, the search process gains in efficiency and computational time.

In this work, a proportional distribution was adopted, in which the intervals between preventive maintenances follow a geometric progression (GP). The first interval, ΔT1, starts at the beginning of operation, at t = 0, and runs until the final instant of the first intervention, T1:

ΔT1 = T1 − 0 = T1.      (8)

The subsequent intervals are given by

ΔTi+1 = β · ΔTi,   0 < β ≤ 1,      (9)

where i = 1, 2, . . . , n, ΔTi is the i-th time interval and β is the proportionality factor (the common ratio of the GP). A unitary value of β means that the intervention instants are evenly distributed, that is, the intervals between interventions are all equal. The last interval, ΔTn+1, lies between the final instant of the last intervention, Tn, and the end of operation, Tf, and is given by

ΔTn+1 = Tf − Tn.      (10)

Considering that n interventions are foreseen during the component operation time, the expression for the

calculation of ΔT1, as a function of Tf, n and β, is given as

ΔT1 = Tf · (1 − β) / (1 − β^(n+1)).      (11)

The other intervals can be evaluated starting from the known value of ΔT1, by Equation (9).

For each component, the following values are determined: the number of maintenances, n; the proportionality factor, β; and the displacement of the maintenance scheduling, d, the latter allowing the scheduled maintenances to be anticipated or delayed in order to avoid undesirable coincidences of component unavailabilities. Such a set of parameters is metaphorically called the phenotype, and it represents a candidate solution. The candidate solution is codified in a structure denominated the genotype. One coding possibility is to use a superstructure, in which 16 bits per parameter were reserved in the employed GA. A figure concerning this structure is shown in Appendix I.


5 APPLICATION EXAMPLE

An important active system of the nuclear power plant Angra I, located in Angra dos Reis, in the Rio de Janeiro province, is the Component Coolant Water System (CCWS). Three motor-driven pumps, two heat exchangers and a set of valves that make possible, if necessary, system operation in different configurations compose this system. The operational and safety functions of the power plant can be maintained with only one line in operational condition. This line must be composed of, at least, one motor-driven pump and a heat exchanger.

Figure 2. Schematic diagram of CCWS.

The parameters of each component necessary for the analysis are listed in Table 1.

Table 1. Component parameters modeled by Equations (1) and (2).

Comp.       1/μ (days)   β       α (days)   q       V0 (days)
Pump 1      3            1.12    338.5      0.51    75
Pump 2      3            1.11    180        0.3     25
Pump 3      3            1.12    338.5      0.51    75
Motor 1     2            2.026   76         0.146   165
Motor 2     2            1.62    128        0.7     145
Motor 3     2            2.026   76         0.146   165
Valve 1     1            0.73    35.32      0.54    150
Valve 2     1            0.5     105.13     0       200
Valve 3     1            0.5     105.13     0       200
Valve 4     1            0.56    54.5       0.02    45
Heat Ex 1   5            1.47    321        0.35    272
Heat Ex 2   5            1.47    321        0.35    272

Notice that the repair rate concerns the mean time to repair considering an unexpected failure. The downtime (one day) associated with a scheduled preventive maintenance for all components is presented in Table 1.

In the simulation, the mission time is 1½ years, which is the time between reactor fuel recharges. The results obtained are shown in Table 2. Some components, due to the reduced mission time, i.e., 540 days, will be maintained just at the end of the mission.

Table 2. Maintenance days for the components of CCWS.

Components   Maintenance moments (days)
Motor 1      54, 108, 162, 216, 270, 324, 378, 432, 486
Motor 2      270
Motor 3      55, 109, 163, 217, 271, 325, 379, 433, 487
Heat Ex. 1   270
Heat Ex. 2   271

The system availability associated with the optimized maintenance schedule is 99.68%, and it is graphically displayed in Figures 3–14.


6 CONCLUDING REMARKS

The present paper showed a different approach to model system availability through generalized renewal processes. By means of this modeling, a simple GA is adopted to optimize the preventive maintenance scheduling.

Another approach presented in this paper concerns the genetic modeling, which was based on a geometric progression series. In this modeling, the genetic

algorithm should find the best proportionality factors β, which will distribute the times for preventive maintenance stops along the mission period.

These results are satisfactory and have contributed to demonstrating the efficacy of the proposed approach in this particular problem.

Further work is underway to consider a longer mission period and to verify, in greater detail, the maintenance distribution for all components of the system. We will also address the test scheduling optimization problem concerning safety systems in aging condition.

Figures 3–8 show the availability of each component versus time (days):

Figure 3. Availability of the motor 1.
Figure 4. Availability of the motor 2.
Figure 5. Availability of the motor 3.
Figure 6. Availability of the pump 1.
Figure 7. Availability of the pump 2.
Figure 8. Availability of the pump 3.
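The geometric-progression spacing of the maintenance instants (Equations 8–11) can be reproduced in a few lines. The sketch below is our illustration; the function names are not from the paper:

```python
def gp_intervals(tf, n, beta):
    """Split [0, tf] into n+1 intervals in geometric progression.

    The first interval follows Eq. (11),
    dT1 = tf * (1 - beta) / (1 - beta**(n + 1)),
    and each subsequent interval is beta times the previous one (Eq. 9).
    """
    if beta == 1.0:  # evenly spaced interventions
        dt1 = tf / (n + 1)
    else:
        dt1 = tf * (1 - beta) / (1 - beta ** (n + 1))
    return [dt1 * beta ** i for i in range(n + 1)]

def maintenance_instants(tf, n, beta):
    """The n intervention times: cumulative ends of the first n intervals."""
    instants, t = [], 0.0
    for dt in gp_intervals(tf, n, beta)[:-1]:
        t += dt
        instants.append(t)
    return instants

# 540-day mission with n = 9 evenly spaced stops (beta = 1):
print(maintenance_instants(540, 9, 1.0))  # 54.0, 108.0, ..., 486.0
```

For β = 1 this reproduces the Motor 1 schedule of Table 2 (54, 108, …, 486 days); β < 1 shortens the intervals as the component ages.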

Figures 9–14 show the availability of each component versus time (days):

Figure 9. Availability of the valve 1.
Figure 10. Availability of the valve 2.
Figure 11. Availability of the valve 3.
Figure 12. Availability of the valve 4.
Figure 13. Availability of the heat exchanger 1.
Figure 14. Availability of the heat exchanger 2.


REFERENCES

Alkhamis, Talal M. & Ahmed, Mohamed A. (2005). Simulation-based optimization for repairable systems using particle swarm algorithms. In Proceedings of the 37th Conference on Winter Simulation, pp. 857–861.
Damaso, Vinícius C. (2006). An integrated optimization modeling for safety system availability using genetic algorithms. PhD Dissertation, Federal University of Rio de Janeiro, Nuclear Engineering Program [in Portuguese].
Garcia, P.A.A., Jacinto, C.M.C. and Droguett, E.A.L. (2005). A multiobjective genetic algorithm for blowout preventer test scheduling optimization. In Proceedings of Applied Simulation and Modelling, Benalmádena, Spain.
Jacopino, A.G. (2005). Generalization and Bayesian solution of the general renewal process for modeling the reliability effects of imperfect inspection and maintenance based on imprecise data. PhD Dissertation, Faculty of the Graduate School of the University of Maryland, College Park.
Kijima, M. and Sumita, N. (1986). A useful generalization of renewal theory: counting process governed by non-negative Markovian increments. Journal of Applied Probability, Vol. 23, 71.
Martorell, S., Carlos, S., Villanueva, J.F., Sanchez, A.I., Galvan, B., Salazar, D. and Cepin, M. (2006). Use of multiple objective evolutionary algorithms in optimizing

surveillance requirements. Rel. Eng. Sys. Saf., Vol. 91, pp. 1027–1038.
Martorell, S., Sanchez, A. and Carlos, S. (2007). A tolerance interval based approach to address uncertainty for RAMS+C optimization. Rel. Eng. Sys. Saf., Vol. 92, pp. 408–422.
Modarres, Mohamed (2006). Risk Analysis in Engineering: Techniques, Tools, and Trends. Taylor & Francis, Boca Raton, FL.
Rigdon, S.E. and Basu, A.P. (2000). Statistical Methods for the Reliability of Repairable Systems. John Wiley and Sons, New York.
Ruiz, Rubén, García-Díaz, J. Carlos and Maroto, Concepción (2007). Considering scheduling and preventive maintenance in flowshop sequencing problems. Computers & Operations Research, Vol. 34, 11, pp. 3314–3330.
Samrout, M., Yalaoui, F., Châtelet, E. and Chebbo, N. (2005). New methods to minimize the preventive maintenance cost of series-parallel systems using ant colony optimization. Rel. Eng. Sys. Saf., Vol. 89, 9, pp. 346–354.
Taboada, Heidi A., Baheranwala, Fatema, Coit, David W. and Wattanapongsakorn, Naruemon (2007). Practical solutions for multi-objective optimization: An application to system reliability design problems. Rel. Eng. Sys. Saf., Vol. 92, pp. 314–322.

APPENDIX I

Genotype: a single binary superstructure in which 16 bits are reserved for each parameter.

Phenotype: for each component (Motor 1, Motor 2, Motor 3, Pump 1, Pump 2, ...), the decoded triple (n, β, d), where
n – number of foreseen interventions
β – proportionality factor
d – maintenance scheduling displacement

Figure I1. Schematic representation of the genotype/phenotype structure.
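As an illustration of this encoding, the sketch below decodes one component's 48-bit slice into its phenotype (n, β, d). The scaling ranges are our assumptions; the paper only states that 16 bits are reserved per parameter:

```python
def decode_gene(bits16, lo, hi):
    """Map a 16-bit string linearly onto the interval [lo, hi]."""
    return lo + int(bits16, 2) * (hi - lo) / (2 ** 16 - 1)

def decode_component(bits48, n_max=16, d_max=30.0):
    """Decode 48 bits into (n, beta, d): number of interventions,
    proportionality factor and scheduling displacement (days).

    Ranges are illustrative: n in [0, n_max], beta in [0, 1],
    d in [-d_max, d_max].
    """
    n = round(decode_gene(bits48[0:16], 0, n_max))
    beta = decode_gene(bits48[16:32], 0.0, 1.0)
    d = decode_gene(bits48[32:48], -d_max, d_max)
    return n, beta, d

print(decode_component("1" * 16 + "0" * 16 + "1" * 16))  # (16, 0.0, 30.0)
```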

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Maintenance modelling integrating human and material resources

S. Martorell
Dpto. Ingeniería Química y Nuclear, Universidad Politécnica Valencia, Spain

M. Villamizar, A. Sánchez & G. Clemente


Dpto. Estadística e Investigación Operativa Aplicadas y Calidad, Universidad Politécnica Valencia, Spain

ABSTRACT: Maintenance planning is a subject of concern to many industrial sectors, as plant safety and business depend on it. Traditionally, maintenance planning is formulated in terms of a multi-objective optimization (MOP) problem in which reliability, availability, maintainability and cost (RAM+C) act as decision criteria and maintenance strategies (i.e. maintenance task intervals) act as the only decision variables. However, the appropriate development of each maintenance strategy depends not only on the maintenance intervals but also on the resources (human and material) available to implement such strategies. Thus, the effect of the necessary resources on RAM+C needs to be modeled and accounted for in formulating the MOP, affecting the set of objectives and constraints. In Martorell et al. (2007), new RAM+C models were proposed to address explicitly the effect of human resources. This paper proposes the extension of the previous models, integrating explicitly the effect of material resources (spare parts) on the RAM+C criteria. The extended model makes explicit how the above decision criteria depend on the basic model parameters representing the type of strategies, maintenance intervals, durations, human resources and material resources. Finally, an application case is performed on a motor-driven pump, analyzing how the consideration of human and material resources would affect the decision-making.

1 INTRODUCTION

Often, RAMS+C (Reliability, Availability, Maintainability, Safety plus Cost) are part of the relevant criteria for decision-making concerning maintenance planning.

RCM (Reliability Centered Maintenance) is a good example of a systematic methodology to establish an efficient maintenance plan that copes with all of the equipment's dominant failure causes (Figure 1). Typically, the main objective in applying the RCM methodology has been to find the best set of maintenance strategies that provides an appropriate balance of equipment reliability and availability and the associated costs.

In pursuing the RCM goal, the decision-maker must face at least two main problems: 1) there is not a unique maintenance plan (i.e. set of maintenance strategies and technical resources) to cope with all the dominant failure causes, as shown in Martorell et al. (1995); 2) the frequencies suggested for performing specific maintenance tasks are based on deterministic criteria and are normally far from optimized. One challenge for the decision-maker is therefore to select the most suitable set of maintenance strategies with optimal frequencies based on reliability, availability and cost criteria, facing in this way a typical multi-objective RAM+C-based optimization problem (MOP). Ideally, the resolution of such a problem should yield the optimal set of maintenance strategies, named herein an ‘‘optimal maintenance plan’’, which would cope with the dominant failure causes of the equipment and at the same time maximize its reliability and availability while keeping costs at a minimum.

A weakness of the traditional method above is that the maintenance planning is formulated in terms of a MOP where RAM+C act as decision criteria and maintenance strategies (i.e. maintenance task intervals) act as the only decision variables. However, the appropriate development of each maintenance strategy depends not only on the maintenance intervals but also on the resources available to implement such strategies within a given period. Therefore, an extended maintenance planning (see Figure 1), in which the maintenance intervals affect not only the maintenance strategies but also the necessary resources (human and material) for their implementation, seems more realistic in the many situations where the available resources are limited or also need to be minimized. Thus, the effect of the necessary resources on RAM+C needs to be modeled and accounted for in

formulating the MOP, affecting the set of objectives and constraints.

Figure 1. Extended maintenance planning within a RCM framework.

Figure 2. Conceptual framework for the RAM+C based integration of technical, human and material resources.

In Martorell et al. (2007), new RAM+C models were proposed to address explicitly the effect of human resources. A sensitivity study was performed to analyze the effect observed on cost and unavailability when human resources are taken into account. This study concluded that the implementation of a given maintenance plan supposes adopting both the optimized frequencies and the optimized resources, as adopting only optimized frequencies, without considering the necessary human resources, can entail large deviations from the unavailability and cost goals presumed at their optima.

Based on the model previously presented in Martorell et al. (2007), the objective of this paper is the extension of that model, integrating explicitly the effect of material resources (spare parts) on the RAM+C criteria. This extended model makes explicit how the above decision criteria depend on the basic model parameters representing the type of strategies, maintenance intervals, durations, and human and material resources.

The paper is organized as follows. First, the model assumptions are presented. Section 2 presents RAM+C models that include the effect of not only human but also material resources on the previous criteria. Section 3 shows an application case, performed on a motor-driven pump, analyzing how the consideration of human and material resources would affect the decision-making. Finally, Section 4 presents the concluding remarks.


2 RAM+C MODELS

A quantitative model needs to be used to assess how testing and maintenance choices affect equipment RAMS+C attributes. Thus, the relevant criteria are to be formulated in terms of the decision variables using appropriate models. These models have to be able to show explicitly the relationships among the RAMS+C criteria (decision criteria) and the variables of interest for the decision-making (decision variables), e.g. surveillance and maintenance task intervals, human resources, spare parts, etc.

Figure 2 shows the conceptual framework considered in this paper for the RAM+C based integration of technical, human and material resources in support of the extended maintenance planning. This figure follows the principles introduced in Martorell et al. (2005a) in what concerns RAMS+C-informed decision-making and the role of testing and maintenance planning.

The next subsections introduce reliability, availability, maintainability and cost models that depend explicitly both on the maintenance intervals and on the human and material resources. Opportunistic maintenance has not been considered in the models developed. This approach extends the models proposed in Martorell et al. (2007).

2.1 Reliability models

The inherent reliability (R) of an equipment item depends on its dominant failure causes, which in turn depend on its physical design characteristics and working conditions, and on the level and quality of the testing and maintenance tasks imposed on it to cope with such dominant failure causes (Martorell et al., 2005a).

Thus, working conditions introduce degradation (α) as time passes, which affects the equipment age (w). In this way, degradation imposes a loss of equipment performance, which can be represented in terms of a deterioration of the equipment failure rate (λ) and can result in equipment failure when degradation goes beyond the point defined in the functional specification. This situation is specific to each degradation mechanism and the corresponding dominant failure cause, the latter representing a sort of interruption of the functional capability of the equipment.

On the other hand, testing and maintenance tasks affect the component age. In general, one can assume that each testing and maintenance activity improves the age of the component by some degree, depending on its effectiveness; this is often called ‘‘imperfect maintenance’’, which is a natural generalization of the extreme situations (Bad As Old (BAO) and Good As New (GAN) models).

There exist several models developed to simulate imperfect maintenance (Martorell et al., 1999b). In this paper, two models that introduce the improvement effect of the maintenance through an effectiveness parameter are considered. Both models assume that each maintenance activity reduces the age of the component in view of the rate of occurrence of failures. These models are the Proportional Age Reduction (PAR), represented by

λ = λ0 + (α / (2·f)) · [1 + (1 − ε)·(f·RP − 1)]      (1)

and the Proportional Age Setback (PAS), represented by

λ = λ0 + (α / (2·f)) · ((2 − ε) / ε)      (2)

where RP is the Replacement Period, f is the frequency associated with each testing and maintenance task, and ε its effectiveness in preventing the equipment from developing the particular degradation or failure cause.

2.2 Maintainability models

Maintenance represents all activities performed on equipment in order to assess, maintain or restore its operational capabilities. Maintenance introduces two types of positive effects: a) corrective maintenance restores the operational capability of failed or degraded equipment, and b) preventive maintenance increases the intrinsic reliability of non-failed equipment beyond its natural reliability.

On the contrary, maintenance also introduces an adverse effect, called the downtime effect, which represents the time the equipment is out of service to undergo maintenance (corrective, preventive, repair, overhaul, etc.). Thus, the adverse effect depends on the Maintainability (M) characteristics of the equipment. Maintainability represents the capability of the equipment to be maintained under specified conditions during a given period, which depends not only on the equipment's physical characteristics, imposing a given number of man-hours (H) to perform an individual maintenance task, but also on the human and material resources (e.g. task force, spare parts) available to perform the task, delay times, etc., which influence the real downtime (d) of the task (see Figure 3).

Figure 3. Schematic view of equipment maintainability.

Therefore, the downtime depends, among other factors, on the delay associated with the availability of human (TDH) and material (TDM) resources. Both delay times are assumed equal to 0 (TDH = 0 and TDM = 0) for a scheduled task.

Based on Figure 3, the downtime, d, can be estimated for a maintenance task using the following relationship:

d = F(ST)·(d′ + TDH) + (1 − F(ST))·(d′ + max(TDH; TDM))      (3)

where F(ST) is the cumulative distribution function of a random variable x representing the demand of spare parts, that is,

F(ST) = P(x ≤ ST)      (4)

with ST being the inventory associated with a particular spare. Thus, Eqn. (4) represents the probability that the amount of spares x required for performing a given maintenance task is available, so that the delay (if any) must be due to the human delay only, see Eqn. (3).

On the other hand, in the case of a spares shortage, which occurs with a probability given by

1 − F(ST) = P(x > ST)      (5)

the delay is given by the maximum of the human and material delays, max(TDH, TDM). In addition, d′ is the downtime if the delay time is assumed to be zero, TD = 0, which can be estimated as

d′ = H / ((ηP·NP + ηE·NE) · κ[NP + NE])      (6)

where NP and NE are the numbers of own and external personnel involved in the task, respectively. Both own personnel and external workforce have an associated efficiency in performing the task, represented by ηP and ηE, respectively.
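Equations (3), (6) and (7) combine into an expected task downtime. A minimal sketch follows, with our function names; the parameter values in the example are assumptions for illustration:

```python
import math

def kappa(n, k=0.5):
    """Law of decreasing effectiveness, Eq. (7): exp(K * (-1 + 1/N))."""
    return math.exp(k * (-1.0 + 1.0 / n))

def base_downtime(h, n_own, n_ext, eta_own=1.0, eta_ext=1.0, k=0.5):
    """Eq. (6): downtime d' with zero delays, for H man-hours of work."""
    return h / ((eta_own * n_own + eta_ext * n_ext) * kappa(n_own + n_ext, k))

def expected_downtime(h, n_own, n_ext, f_st, tdh, tdm, **kw):
    """Eq. (3): mean downtime including human (TDH) and material (TDM)
    delays; f_st = F(ST) = P(x <= ST) is the in-stock probability, Eq. (4)."""
    d_prime = base_downtime(h, n_own, n_ext, **kw)
    return f_st * (d_prime + tdh) + (1.0 - f_st) * (d_prime + max(tdh, tdm))

# 8 man-hours, two own workers, spares in stock 90% of the time,
# 1 h human delay, 24 h material delay on a stockout:
print(round(expected_downtime(8, 2, 0, f_st=0.9, tdh=1.0, tdm=24.0), 2))  # 8.44
```

The stockout branch dominates the expectation whenever the material delay is much longer than the human one, which is what motivates modeling spare-part inventories explicitly.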

The function κ[·] represents the law of decreasing effectiveness of human resources, which is formulated as follows (Ricardo, 1821):

κ[N] = exp(K · (−1 + 1/N))      (7)

where K ranges in the interval [0, 1].

2.3 Availability models

Each couple of dominant failure cause and maintenance task is associated with at least one contribution to the equipment unavailability, which corresponds to one of the following equations (Martorell et al., 2002).

ur = 1 − (1/(λ·I)) · (1 − e^(−λ·I)) ≈ ρ + (1/2)·λ·I      (8)

uS = fS · dS      (9)

uN = fN · dN · G[AOT]      (10)

Eqn. (8) represents the unavailability associated with an undetected failure corresponding to the particular sort of failure cause being considered. In addition, I represents the interval for performing a scheduled maintenance task that is intended, or supposed, to detect the occurrence of such a failure.

Eqn. (9) represents the unavailability contribution associated with a planned or scheduled testing or maintenance task, where fS and dS represent the frequency and downtime, respectively, of the scheduled activity, given by fS = 1/I and Eqn. (3), respectively.

Finally, Eqn. (10) represents the unavailability contribution associated with a non-planned or unscheduled maintenance task, where fN and dN represent the frequency and downtime, respectively, of the activity, given for example by fN = λ for on-failure corrective maintenance and by Eqn. (3), respectively. In addition, the function G[AOT] represents the probability of ending the non-planned activity before the maximum allowed outage time AOT, which can be formulated using the following relationship:

G[AOT] = 1 − e^(−AOT/dN)      (11)

where AOT ranges in the interval [0, +∞[. The factor AOT accounts for the fact that maintenance tasks at NPPs operating at full power have a limited allowed outage time, typically to limit the unavailability contribution due to unscheduled maintenance activities. It is simple to realize that when this limitation does not apply, then AOT → ∞ and consequently G[AOT] → 1.

2.4 Cost models

Two types of costs are considered: a) costs associated with performing surveillance and maintenance activities, and b) costs related to the management of the spare parts inventory. These costs are developed in the following subsections.

2.4.1 Surveillance and maintenance costs
The relevant costs in analyzing test surveillance and maintenance (TS&M) optimization of safety-related equipment include the contributions to the cost model of standby components, which, in general, undergo surveillance testing, preventive maintenance and corrective maintenance to restore their operability after a failure has been discovered during a test (Martorell et al., 2000b, 1996c, 1999). Each couple of dominant failure cause and maintenance task is associated with one cost contribution to the equipment total LCC (Life-Cycle Cost), which corresponds to one of the following equations, in accordance with the unavailability contributions above (Martorell et al., 2001):

cS = 8760 · fS · c1S      (12)

cN = 8760 · fN · c1N      (13)

cD = 8760 · fN · (1 − G[D]) · c1D      (14)

Eqn. (12) represents the yearly cost contribution of performing a planned or scheduled testing or maintenance task over a year period. Eqn. (13) represents the yearly cost contribution of performing a non-planned or unscheduled maintenance task over a year period. Eqn. (14) represents the yearly cost contribution associated with the number of plant outages, and the corresponding loss of production, estimated to occur over a year time horizon as a consequence of unscheduled maintenance activities exceeding the allowed downtime D. In addition, the variables c1S and c1N in Eqns. (12) and (13) represent the unitary costs of performing one single task, scheduled or non-planned respectively, which can be formulated using the following relationship:

c1i = NP · cHP · TP + NE · cHE · TE      (15)

where usually TP = d and TE = max{dmin ≥ 0, d}, representing the time spent by the NP own and NE external personnel, respectively. In addition, cHE represents the hourly cost of external personnel, assumed to be a fixed amount, while cHP is the hourly cost of own personnel, which is not constant.
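The unavailability and yearly cost contributions of Eqns. (8)–(14) can be wired together directly. The sketch below uses our own names, and the placement of ρ follows the approximation given with Eq. (8); the example numbers are assumptions:

```python
import math

def u_undetected(lam, interval, rho=0.0):
    """Eq. (8): unavailability from undetected failures between tests;
    for small lam*interval this is approximately rho + lam*interval/2."""
    return rho + 1.0 - (1.0 - math.exp(-lam * interval)) / (lam * interval)

def g_aot(aot, d_n):
    """Eq. (11): probability of ending an unscheduled task within AOT."""
    return 1.0 - math.exp(-aot / d_n)

def yearly_costs(f_s, f_n, c1s, c1n, c1d, allowed_d, d_n):
    """Eqs. (12)-(14): yearly scheduled, unscheduled and outage costs."""
    c_s = 8760.0 * f_s * c1s
    c_n = 8760.0 * f_n * c1n
    c_d = 8760.0 * f_n * (1.0 - g_aot(allowed_d, d_n)) * c1d
    return c_s, c_n, c_d

# Hourly failure rate 1e-5, monthly test (I = 730 h):
print(round(u_undetected(1e-5, 730.0), 5))  # 0.00364
```

Note how the downtime dN of Eq. (3) feeds both G[AOT] and, through the task frequencies, every cost term, which is the channel through which spare-part delays propagate to cost.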

The hourly cost cHP depends on a number of factors, as proposed in the following equation:

cHP = (NP · S) / (Neq · Σ_{∀A∈P} (fA · TPA))      (16)

where the variables S and Neq represent the annual salary and the number of similar components, respectively, assigned to each of the NP own personnel. The summation extends over all of the tasks, scheduled or not, affecting the equipment and performed by the NP personnel.

Finally, c1D represents the unitary cost of a plant outage.

2.4.2 Costs of spare parts
The cost related to the management of spare parts is assumed to consist of the following contributions:

Ci = ch + co + cex      (17)

where the inventory holding cost, ch, includes the capital cost of money tied up in inventory and the physical cost of having the inventory; co includes the regular and urgent ordering costs; and cex represents the holding surplus or excess cost.

The following subsections present a detailed description of the models adopted to represent the different contributions, which have been developed under the following assumptions:

1. The demand rate of spares associated with failures is equal to λ. The value of the parameter λ is given by Eqn. (1) or (2), depending on the imperfect maintenance model selected, PAS or PAR.
2. The demand of spares due to failures is modeled using a Poisson process under a constant demand rate λ.
3. Demands of spares associated with surveillance and maintenance tasks are planned well in advance, so there is only a regular ordering cost.
4. The period between two consecutive orders of spares is equal to L.
5. ST − R spare parts are ordered when the ordering point R is reached.

2.4.2.1 Inventory holding costs
Inventory holding costs include the capital cost of money tied up in inventory and the physical cost of having the inventory. The total average holding cost is often determined as a percentage of the average total inventory cost (Crespo, 2007). This percentage can usually not be the same for all types of items; for example, it should be high for computers due to obsolescence.

The demand for spares during the period L, and consequently the variation of the level of spares in the inventory, depends on the number of failures and maintenance tasks in this interval L.

In order to obtain this cost contribution, it is first required to obtain the average inventory level in the period L, considering the demand of spares associated with failures. Two possibilities are considered, shown in Figures 4 and 5, respectively.

Figure 4 shows a situation in which the demand of spare parts, xi, in the period L is lower than the original inventory level ST.

Figure 4. Case 1: The original inventory is enough to supply the demand of spare parts during the period L.

Figure 5 represents the case in which the demand, xi, exceeds the original inventory, ST, in the period L. When the inventory level is equal to the reorder point, R, new spares are ordered. However, in this case, the ordered spares arrive after L1 and the original inventory is not enough for the period L. As observed in Figure 5, L1 is the duration in which the original spare inventory decreases from ST to 0.

Figure 5. Case 2: The ordered spares arrive after L1 and the original inventory is not enough to supply the demand of spare parts during the period L.
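Under assumptions 1–2 above (Poisson spare demand at rate λ), the in-stock probability F(ST) of Eq. (4) and the stockout probability of Eq. (5) follow directly. A sketch, with helper names of our own:

```python
import math

def poisson_cdf(k, mean):
    """P(x <= k) for a Poisson random variable with the given mean."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i)
               for i in range(k + 1))

def service_level(st, lam, period):
    """Eq. (4): F(ST) = P(x <= ST), demand accumulated over one period."""
    return poisson_cdf(st, lam * period)

def stockout_probability(st, lam, period):
    """Eq. (5): 1 - F(ST) = P(x > ST)."""
    return 1.0 - service_level(st, lam, period)

# Expected demand of 2 spares per ordering period, stock ST = 4:
print(round(service_level(4, 0.01, 200.0), 3))  # 0.947
```

The stockout probability is exactly the weight of the delayed branch in the downtime model of Eq. (3), so raising ST trades holding cost against expected downtime.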

Based on Figures 4 and 5, the average inventory level, Iav, in the period L can be evaluated as:

Iav = ST/2 + R − λ·TDM + Σ_{xi=R}^{∞} (xi + R)·P(x = xi)    (18)

Based on the average inventory level, Iav, given by Eqn. (18), the yearly cost contribution due to inventory holding, ch, can be evaluated as:

ch = p · Iav · (csp + cdp)    (19)

where p is the percentage of the average total inventory cost considered, and csp and cdp represent, respectively, the cost of a spare part and the depreciation cost per spare part.

2.4.2.2 Inventory ordering costs
The yearly cost contribution due to inventory ordering, cio, includes: a) the cost of a regular order, cro; b) the cost of an expedited order, ceo; and c) the urgency or emergency order cost, cuo. It can be formulated as follows:

cio = (λ/ST) · (ceo + cro + cuo)    (20)

where ceo is a fixed cost per order, cro is the cost of a regular order, which can be calculated as:

cro = csp · ST    (21)

and cuo is the emergency order cost, which is calculated (see Figure 5) as:

cuo = cu · Σ_{xi=R}^{∞} (xi − R)·P(xTDM = xi)    (22)

where cu is the emergency cost per spare part and xTDM is the demand of spare parts during the delay associated with material resources.

2.4.2.3 Holding excess costs
Based on Figure 4, when xi < (ST − R) the average surplus stock cost can be evaluated as:

cex = (cop − cdp) · Σ_{xi=0}^{ST} ((ST − R) − xi)·P(x = xi)    (23)

where cop is the opportunity cost per spare part, a consequence of the fact that the capital invested in its purchase is not available for other uses, and cdp represents the component depreciation cost.

2.5 Aggregation of availability and cost models for a given strategy

As proposed in Martorell et al. (2007), under the RCM approach, each couple of dominant failure cause and maintenance strategy is associated with a global efficiency that partially affects equipment availability and the associated costs. These are associated with both the probability that the failure occurs and with the development of the strategy itself, which, in turn, depends on the frequency of performing the scheduled and unscheduled tasks belonging to the maintenance strategy and on their corresponding durations and costs. Thus, the unavailability and cost models for a given strategy can be formulated by simply aggregating the previous single-task models for the k tasks involved in the j-th strategy used to cope with the i-th failure cause (i → j).

Ui→j = Σ_{∀ k∈j} Ui→j,k    (24)

Ci→j = Σ_{∀ k∈j} Ci→j,k    (25)

Eqns. (24) and (25) are similar to their equivalents in Martorell et al. (2007). However, notice that the sort of contributions and their formulation vary as follows.

Concerning the unavailability contributions, the main difference comes from the novel formulation of the duration of a maintenance task to cope with a given dominant failure cause, see Eqns. (3) to (7), which now also addresses the possibility of not having spare parts available at the time of a demand, in particular for performing an unscheduled task (e.g. corrective maintenance), which introduces a delay in starting the task while one waits for a spare part that has been ordered urgently.

Concerning the cost contributions, one important difference also comes from the novel formulation of the duration of the maintenance task to cope with a given dominant failure cause. Moreover, the novel formulation of the cost for a given couple (i → j) now addresses additional cost contributions, i.e. those used to address the cost of managing spare parts as described in Section 2.4.2, which are also formulated for a given couple (i → j). The remaining cost contributions, i.e. those associated with surveillance and maintenance costs as described in Section 2.4.1, remain almost the same as formulated in Martorell et al. (2007), except concerning Eqn. (15) of this paper, where an additional third constant term appeared on the right-hand side, which was used to account for the cost of the spares used in a given task. This additional term no longer makes sense, as it was included to address in a simplified way the cost of managing spare parts, which is now well represented by the extended modeling of Section 2.4.2.

2.6 Equipment-based aggregation of availability and cost models

Following the reasoning introduced above, one can realize there is a need to find a set of maintenance strategies to prevent the component from developing all of its dominant failure causes, since more than one normally applies.

According to the study in Martorell et al. (2007), there is not a unique combination of maintenance strategies to cope with all the dominant failure causes. Each combination is associated with a given equipment availability and corresponding cost, given by

U = Σ_{∀ i→j} ui→j    (26)

C = Σ_{∀ i→j} ci→j    (27)

3 APPLICATION EXAMPLE

The application case is performed on the basis of the results obtained in Martorell et al. (2005b), where the best maintenance plan to cope with all the dominant failure causes of a motor-driven pump was found. Later, in Martorell et al. (2007), it was studied how the management of human resources affects decision-making on the optimal frequency for performing the scheduled tasks included in the above maintenance plan. The objective of the application example in this paper is to analyze how the management of spare parts, as well as the management of human resources, affects this decision-making, using the extended models of equipment unavailability and cost contributions presented in Section 2.

For sake of completeness of the presentation, Table 1 shows again the six dominant failure causes considered for the motor-driven pump analyzed in this study after performing the RCM process described in Martorell et al. (1995). In addition, Table 2 shows the maintenance plan selected in Martorell et al. (2005b), which allows covering all the dominant failure causes of the equipment. This table shows the associated surveillance and maintenance tasks belonging to a type of strategy identified to be appropriate (Y) or not (N) to control every cause, and the maintenance intervals (I), originally optimized.

Tables 3 and 4 show the additional data necessary for using the models proposed in Section 2. Table 3 shows the data related to human resources. Table 4 shows the data related to material resources.

Table 1. Motor-driven pump dominant failure causes.

# Cause   Code   Description
c1        IAL    Inadequate lubrication
c2        DEM    Damaged electric or electronic module
c3        MBW    Motor bearing wear
c4        PIW    Pump impeller wear
c5        SPD    Set point drift
c6        MIS    Misalignment

Table 2. Maintenance plan selected to cover dominant failure causes.

Task                           I (hrs)   Failure causes
                                         c1  c2  c3  c4  c5  c6
Lub oil change (t1)            26000     Y   N   Y   N   N   N
Operational test (t2)          13000     Y   N   N   N   Y   N
Visual inspection Motor (t3)   26000     Y   Y   N   N   N   N
Visual inspection Pump (t4)    26000     N   N   N   Y   N   Y

Table 3. Data related to human resources.

Parameter                           Own personnel   External personnel
Efficiency                          0.9             1
Delay for unscheduled tasks         0 h             3 h
Delay for scheduled tasks           0 h             0 h
Cost                                20000 €/year    30 €/hour
Neq                                 100             —
Persons (N)                         [0, 1, 2, 3]    [0, 1, 2, 3]
K (law of decreasing
effectiveness)                      0.25

Table 4. Data related to material resources {c4, t4}.

Parameter                        Value
RP                               40 years
Spare part cost, csp             1328 €
Emergency cost per order, cu     1992 €
Fixed ordering cost, cfo         132 €
Percentage p                     20%
TDM                              720
Reorder point R                  1
Opportunity cost, cop            66 €
Depreciate cost, cdp             33 €/year
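As an illustration, the spare-part cost contributions of Section 2.4.2 can be evaluated with the Table 4 data. The demand rate λ, the order size ST and the reading of TDM = 720 as hours are assumptions made only for this sketch, and the fixed ordering cost of Table 4 is used as the fixed cost per order of Eqn. (20); the expressions mirror Eqns. (18) to (22):

```python
from math import exp, factorial

def pmf(k, mean):
    """Poisson demand probability P(x = k)."""
    return exp(-mean) * mean ** k / factorial(k)

# Table 4 data (costs in euros); lam and ST are illustrative assumptions.
c_sp, c_u, c_fo = 1328.0, 1992.0, 132.0
c_dp, p_pct = 33.0, 0.20
R, t_dm = 1, 720.0
lam = 1.0e-3              # assumed spare-part demand rate (per hour)
ST = 2                    # assumed stock replenished per order cycle
mean_dm = lam * t_dm      # expected demand during the resupply delay

# Average inventory level, Eqn. (18), truncating the infinite sum.
i_av = ST / 2 + R - lam * t_dm + sum(
    (x + R) * pmf(x, mean_dm) for x in range(R, 60))

# Yearly holding cost contribution, Eqn. (19).
c_h = p_pct * i_av * (c_sp + c_dp)

# Emergency order cost, Eqn. (22), regular order cost, Eqn. (21),
# and the ordering contribution, Eqn. (20).
c_uo = c_u * sum((x - R) * pmf(x, mean_dm) for x in range(R, 60))
c_ro = c_sp * ST
c_io = lam / ST * (c_fo + c_ro + c_uo)
```

Varying λ, R and ST in such a sketch reproduces the trade-off explored in the application example: larger stocks reduce the chance of an urgent order at the price of a higher holding cost.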

Figure 6. Unavailability and cost effects of using different couples [NP, NE]. (U-C plot of unavailability versus cost for several strategies of human resources and stocks; each solution is labelled with a triplet (Np, Ne, Stock), and three series are shown: without stocks, with stocks but without a reorder point, and with a reorder point (R).)
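Each curve in a plot such as Figure 6 is a non-dominated (Pareto) set in the U-C plane. A generic sketch of how such a set can be extracted from evaluated alternatives; the (Np, Ne, Stock) points and their U-C scores below are invented for illustration and are not the paper's results:

```python
def non_dominated(points):
    """Return the (label, U, C) points not dominated by any other point;
    lower is better in both unavailability U and cost C."""
    front = []
    for label, u, c in points:
        dominated = any(u2 <= u and c2 <= c and (u2 < u or c2 < c)
                        for _, u2, c2 in points)
        if not dominated:
            front.append((label, u, c))
    return sorted(front, key=lambda p: p[2])  # order by increasing cost

# Illustrative (Np, Ne, Stock) evaluations, not the paper's exact values.
evaluated = [
    ("(1,0,0)", 0.0513, 2500.0),
    ("(2,0,0)", 0.0500, 4200.0),
    ("(2,0,1)", 0.0502, 5000.0),   # dominated by (2,0,0): worse in both U and C
    ("(1,1,1)", 0.0491, 6100.0),
    ("(1,2,2)", 0.0480, 9000.0),
]
front = non_dominated(evaluated)   # the non-dominated curve in U-C space
```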

For sake of simplicity, it has been considered herein that task 4, which covers dominant failure cause 4 as shown in Table 2, is the only one that requires spare parts. Data related to the equipment reliability characteristics and others not included herein are the same as proposed in Muñoz et al. (1997).

As said, the results obtained in Martorell et al. (2005b) have been adopted as a reference point. Next, a sensitivity study has been performed to analyze the effect observed on the unavailability and cost scores when the human and material resources are taken into account. Several management policies have been considered, which address own versus external personnel and spare part inventory only, for the sake of simplicity.

Figure 6 shows the effect of changing the human and material resources, i.e. the use of different triplets [NP, NE, ST] representing own personnel, external taskforce and spare parts inventory, respectively, for performing the maintenance plan selected with the periods shown in Table 2. Three alternatives were considered: a) without stocks, b) with stocks but without reorder point, and c) with stocks and reorder point. Consequently, three curves were found, each representing a non-dominated set of solutions in the U-C space for the corresponding alternative.

Figure 6 shows that the alternative with the lowest costs corresponds to the case of having no inventory. Nevertheless, this situation is the one that imposes the highest unavailability. On the contrary, when a stock of spare parts is considered, the unavailability decreases at the expense of an increase in costs.

Comparing the two alternatives with and without reorder point, the former provides better results in terms of both unavailability and cost, which may suggest that management of spare parts with a reordering point dominates equivalent solutions without reordering.

4 CONCLUDING REMARKS

This paper proposes the extension of previous models developed by the authors to integrate explicitly the effect of material resources (spare parts) on RAM+C. This novel modeling allows accounting explicitly for how the above decision criteria depend on the basic model parameters representing the types of strategies, maintenance intervals, durations, human resources and material resources.

An application example is performed on a motor-driven pump, analyzing how the consideration of human and material resources affects the decision-making. It shows how changes in managing human and material resources affect both cost and unavailability. It is also observed that unavailability can be reduced by introducing a spare inventory, although, logically, this option supposes a greater cost. Finally, management

of spare parts with a reordering point provides better results than without reordering.

ACKNOWLEDGMENTS

The authors are grateful to the Spanish Ministry of Education and Science for the financial support of this work (Research Project ENE2006-15464-C02-01, which has partial financial support from the FEDER funds of the European Union).

REFERENCES

Axsäter, S. 2006. Inventory Control. Springer, United States of America.
Crespo, A. 2007. The Maintenance Management Framework: Models and Methods for Complex Systems Maintenance. Springer Series in Reliability Engineering.
Kaufmann, A. 1981. Métodos y Modelos de Investigación de Operaciones. CIA. Editorial Continental, S.A. de C.V., Mexico.
Blank, L.T. & Tarquin, A. 2004. Engineering Economy. McGraw-Hill Professional, United States.
Martorell, S., Muñoz, A. & Serradell, V. 1995. An approach to integrating surveillance and maintenance tasks to prevent the dominant failure causes of critical components. Reliability Engineering and System Safety, Vol. 50, 179–187.
Martorell, S., Sanchez, A. & Serradell, V. 1999. Age-dependent reliability model considering effects of maintenance and working conditions. Reliability Engineering & System Safety, 64(1):19–31.
Martorell, S., Carlos, S., Sanchez, A. & Serradell, V. 2001. Simultaneous and multi-criteria optimization of TS requirements and maintenance at NPPs. Annals of Nuclear Energy, Vol. 29, 147–168.
Martorell, S., Sanchez, A., Carlos, S. & Serradell, V. 2004. Alternatives and challenges in optimizing industrial safety using genetic algorithms. Reliability Engineering & System Safety, Vol. 86 (1), 25–38.
Martorell, S., Villanueva, J.F., Carlos, S., Nebot, Y., Sánchez, A., Pitarch, J.L. & Serradell, V. 2005a. RAMS+C informed decision-making with application to multi-objective optimization of technical specifications and maintenance using genetic algorithms. Reliability Engineering & System Safety, 87(1):65–75.
Martorell, S., Carlos, S. & Sanchez, A. 2005b. Use of metrics with multi-objective GA. Application to the selection of an optimal maintenance strategy in the RCM context. In Proceedings of the European Safety and Reliability Conference ESREL 2005. Taylor & Francis Group, pp. 1357–1362.
Martorell, S., Carlos, S. & Sanchez, A. 2006. Genetic algorithm applications in surveillance and maintenance optimization. In Computational Intelligence in Reliability Engineering. Volume 1: Evolutionary Techniques in Reliability Analysis and Optimization. Springer.
Martorell, S., Villamizar, M., Sanchez, A. & Clemente, G. 2007. Maintenance modeling and optimization integrating strategies and human resources. In European Safety and Reliability Conference (ESREL 2007), Stavanger, Norway.
Muñoz, A., Martorell, S. & Serradell, V. 1997. Genetic algorithms in optimizing surveillance and maintenance of components. Reliability Engineering and System Safety, Vol. 57, 107–120.
Ribaya, F. 1999. Costes. Ediciones Encuentro, Madrid.
Ricardo, D. 1821. On the Principles of Political Economy and Taxation. John Murray, Albemarle Street, London.
Sarabiano, A. 1996. La Investigación Operativa. Edisorfer, S.L., Madrid.
Sherbrooke, C. 2004. Optimal Inventory Modeling of Systems. United States of America.
Starr, M. & Miller, D. 1962. Inventory Control: Theory and Practice. Prentice-Hall, Englewood Cliffs, United States of America.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Modelling competing risks and opportunistic maintenance


with expert judgement

T. Bedford & B.M. Alkali


Department of Management Science, Strathclyde University, Graham Hills Building, Glasgow, UK

ABSTRACT: We look at the use of expert judgement to parameterize a model for degradation, maintenance and repair, providing detailed information which is then calibrated at a higher level through coarse plant data. Equipment degradation provides signals by which inferences are made about the system state. These may be used informally through red/yellow/green judgements, or may be based on clear criteria from monitoring. Information from these signals informs the choices made about when opportunities for inspection or repair are taken up. We propose a stochastic decision model that can be used for two purposes: a) to gain an understanding of the data censoring processes, and b) to provide a tool that could be used to assess whether maintenance opportunities should be taken or whether they should be put off to a following opportunity or scheduled maintenance. The paper discusses the competing risk and opportunistic maintenance modelling with expert judgement and the broad features of the model. Numerical examples are given to illustrate how the process works. This work is part of a larger study of power plant coal mill reliability.

1 INTRODUCTION

In this paper we take a highly (subjective) decision-oriented viewpoint in setting up a model to capture the interaction between system degradation and opportunistic maintenance/inspection. This model is primarily a decision model, rather than a reliability model, although reliability provides the context. The model considers the uncertainty about the state of the system from the maintainer/operator point of view, and models how information gained about the system (which includes the usage time, but also other pieces of information) will change that uncertainty.

The model we have is a complicated stochastic model that we have to simulate because of the lack of a convenient closed-form solution, but it is designed to fit closely to an expert judgement elicitation process, and hence to be relatively easy to quantify through a combination of expert judgement and plant data. Because of the difficulties of using messy plant data, we give preference to a combination of expert judgement and plant data as the most cost-efficient way of quantifying the model. In other words, the model quantification process uses expert information to give the fine details of the model, while using plant data to calibrate that expert data against actual experience. We have used a strategy of reducing the difficulty of expert judgements by eliciting relative risks, as we believe these to be easier to elicit than absolute risks.

The motivation for looking at this model was the aim to produce realistic competing risk models that capture the way in which failure events are censored by other events, in particular by maintenance. A number of competing risk models have been produced with the reliability area in mind. However, they have not adequately captured the fact that much censoring occurs due to opportunistic maintenance interventions.

Finally, we remark that because the model is supposed to capture the typical behaviour of the system, we shall not worry too much about describing the tail behaviour of lifetime distributions outside the regime of practical interest. In practice, changes to system decision making are only ever going to be incremental, and therefore they allow for revision of the model further into higher time values if that becomes necessary.

1.1 Background

The background to this paper is a project studying the reliability of power-generating plant coal mills. This equipment is heavily used and must be maintained regularly. The coal mills accumulate high operating hours and are necessary for the coal-fired boilers in the power plant generating units. Maintenance activities are planned to a fixed number of operating hours, but regular unscheduled maintenance is also required due to rapid degradation. Due to the high costs of lost

production, there is a real need to avoid unplanned shutdowns. However, there are regular opportunities arising from the failure of systems elsewhere in the plant. Furthermore, there is a level of redundancy in the coal mills which may be used when required. Hence it is the case that equipment in need of repair and operating in a degraded state may continue to be used until an opportunity presents itself. The model described here is designed to capture this type of situation and to be quantifiable by expert judgements.

1.2 Competing risks

The competing risk problem arises quite naturally in coal mill reliability. An intuitive way of describing a competing risks situation with k risks is to assume that to each risk is associated a failure time Tj, j = 1, . . ., k. These k times are thought of as the failure times if the other risks were not present, or equivalently as the latent failure times arising from each risk. When all the risks are present, the observed time to failure of the system is the smallest of these failure times, along with the actual cause of failure. For further discussion of competing risk models in general see Tsiatis (1975) and Crowder (2001). Bedford (2005) also discussed the use of competing risk models in reliability. Specific models are considered in Langseth & Lindqvist (2006), Bunea & Mazzuchi (2006) and Cooke & Morales-Nápoles (2006). Yann et al. (2007) introduce a generalized competing risk model and use it to model a particular case in which the potential times to the next preventive and corrective maintenance are independent conditionally on the past of the maintenance processes. For more specific applications in reliability, see the survey of reliability databases in Cooke and Bedford (2002).

1.3 Opportunistic maintenance

Opportunistic maintenance can be defined as a strategy in which preventive maintenance actions on a component can be performed at any time, prompted by other components' failures or by the arrival of the preventive replacement ages of designated components (separate replacement) or by joint replacement, Kececioglu & Sun (1995). Opportunistic maintenance can clearly have impacts on component and hence on system reliability. Opportunities, interrupt replacement options and many undesirable consequences of interruptions are discussed in Dagpunar (1996). Opportunities occur randomly and sometimes have restricted durations, implying that only restricted packages can be executed. The main idea is to set up a model to determine an optimal package for individual packages and to develop a cost criterion, Dekker & Smeitink (1994). Maintenance opportunities often depend on the duration and on economic dependence of set-up costs, which may require a compromise in some circumstances.

The concept of opportunistic maintenance and competing risk modeling is important within an industrial power plant. The role of expert judgments within the maintenance area may give insight and a better understanding of the interrelationships between events, which could strongly support our modeling later. Expert judgment in maintenance optimization is discussed in Noortwijk et al. (1992). Their elicited information is based on discretized lifetime distributions from different experts. That is, they performed a straightforward elicitation of failure distribution quantiles, rather than the kind of information being sought for the model we build here.

2 BASE MODEL

Here we consider the basic model discussed in Alkali & Bedford (2007). This model explicitly assumes that some kind of information about the state of the system is known to the decision-maker. The basic idea behind the model is that the decision maker's perceptions about the state of the system are driven by discrete events whose occurrence changes the underlying failure rate of the system. Such discrete events might be external shocks, or might be internal shocks resulting from internal degradation of the system. These shocks occur randomly but can be observed by the decision maker. We discuss the single failure rate and the multiple failure rate cases.

2.1 Single failure mode model

In this model only one category of failure mode is considered. The model requires us to define the failure time distribution through a process rather than through the more conventional failure rate/survival function approach, although these quantities could be calculated.

We define an increasing sequence of shock times S0 = 0, S1, S2, . . . and, for the period between shocks (Si−1, Si], a failure rate λi(t), t ∈ (0, Si − Si−1]. The distribution of the failure time T is defined conditionally: given that T > Si−1, the probability that T > Si is equal to

exp( − ∫_0^{Si − Si−1} λi(t) dt )    (1)

and, given that t ∈ (0, Si − Si−1], it has conditional failure rate λi(t − Si−1).
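The construction above can be sampled directly: because the hazard within each inter-shock period is constant, a fresh exponential lifetime can be drawn in each period by the memoryless property. A minimal sketch of the simple parametric version (exponential times between shocks, constant rates); all parameter values are illustrative assumptions:

```python
import random

def sample_failure_time(mus, lams, rng):
    """Sample T for the simple parametric model: exponential times between
    shocks (means mus) and constant failure rate lams[i] in period i;
    lams must have one more entry than mus, the last applying forever
    after the final shock."""
    t = 0.0  # start of the current inter-shock period
    for i, lam in enumerate(lams):
        # Period length: exponential, infinite after the last shock.
        gap = rng.expovariate(1.0 / mus[i]) if i < len(mus) else float("inf")
        # Constant hazard within the period: draw a fresh exponential
        # lifetime (valid by memorylessness).
        life = rng.expovariate(lam)
        if life < gap:
            return t + life  # failure occurs inside this period
        t += gap             # survive until the next shock

mus = [1.0, 1.0]            # illustrative mean times between shocks
lams = [0.01, 0.1, 1.0]     # illustrative rates, rising after each shock
samples = [sample_failure_time(mus, lams, random.Random(s)) for s in range(500)]
```

With the last rate applying over an unbounded period, the function always returns a finite failure time.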

Clearly, given S0 = 0, S1, S2, . . ., the failure rate for T is fixed (deterministic), but prior to knowing S0 = 0, S1, S2, . . . the failure rate is stochastic.

This definition of a failure time, while relatively unusual, has the advantage that it links fairly easily to an elicitation scheme that can be used to assess a simplified model using expert judgement. The simple parametric version of this model assumes that there is a finite number of shocks, that the times between shocks are exponential (possibly with different parameters) and that the failure rate between shocks is constant. Hence the parameters of this model are the number of shocks n, the mean times between shocks μ1, μ2, . . ., μn and the failure rates between shocks λ1, λ2, . . ., λn (where Si is the ith shock and λi is the rate in the period (Si−1, Si], and Sn+1 = ∞). The expression for the conditional distribution function given the shock times is given as

F(t | S1, S2, . . ., Sn) = 1 − exp( − [ Σ_{i=1}^{j−1} (Si+1 − Si)λi + (t − Sj)λj ] )    (2)

where λi is the failure rate after shock i, and Sj is the largest event time less than t.

2.2 Opportunistic maintenance time

In order to model the censoring process, we need to model two things:

• how opportunities arise, and
• when opportunities are taken.

In the setting we are considering, opportunities arise typically as a result of faults elsewhere in the plant, typically upstream. As there is a whole mixture of different fault types occurring, it is natural to assume that opportunities arise according to a Poisson process. The modelling of when opportunities are taken is the area where we can capture the idea that opportunities will be taken with some knowledge of the system state, and hence that the censoring of failure data will be correlated to the actual failure times. In this model we assume that the decision maker will choose to take the first opportunity after the system has encountered a critical shock, or has been assessed as being in a critical state. We denote this time, the time at which the system would be maintained, as Z. Note that Z can equivalently be written as the first opportunity after the subjective failure rate has increased above a critical level. This is probably the way the decision makers will be thinking of Z. However, so as not to confuse the definition of the model parameters with steps taken in the elicitation process (where experts are asked for ratios of failure rates, rather than absolute failure rates), we define Z in terms of a critical shock rather than a critical failure rate.

2.3 Competing risk data

The model above will generate competing risk data as follows. Let X be the time of the system failure, and let Z be the time at which an opportunity would be taken. The data that would be recorded in the plant records would be min(X, Z) and the indicator of which event had occurred. Note that, as a consequence of the above model, and consistent with practice, we have that P(X = Z) = 0.

From the plant record data we can estimate only the subsurvival functions for X and Z, but not the true survival functions. Recall that the survival functions are

SX(t) = P(X > t),  SZ(t) = P(Z > t),    (3)

and the subsurvival functions are

S*X(t) = P(X > t, X < Z),  S*Z(t) = P(Z > t, Z < X)    (4)

A simple quantity that measures the degree of censoring is S*X(0) = P(X < Z). This is simply the probability that the next event will be a failure rather than a maintenance action.

2.4 Multiple failure modes

Very often there is more than one failure mode relevant to system performance. In terms of competing risk, these may be modelled as different variables that are competing among themselves to fail the system. For the purposes of this paper we assume that there are just two failure modes. In many cases it seems reasonable to suggest that different failure modes develop according to statistically independent processes, but as they develop further there may be some dependence. We capture these ideas in the following model. Associated to each failure mode there is a set of shock times S0 = 0, S1, S2, . . .. We denote the ith shock associated to the kth failure mode by Si^k and assume that there are m(k) shocks for the kth mode.

The associated failure rate, given that we have just had the ith shock for Failure Mode 1 and the jth shock for Failure Mode 2, is denoted by λi,j.

Theorem 1.1 Suppose for a given system there are two modes in which the system can fail. If the two shock time processes are independent, and λi,j can be written in the form λi,j = λi + λj, then the failure times X1 and X2 from failure modes 1 and 2, respectively, are independent.

Sketch Proof: Consider the conditional joint survival probability given the shock times,


P(X1 > t, X2 > t | S0^1, . . ., Sm(1)^1, S0^2, . . ., Sm(2)^2)    (5)

and by splitting the terms for the shocks, the resulting distribution is given as

P(X1 > t | S0^1, . . ., Sm(1)^1) · P(X2 > t | S0^2, . . ., Sm(2)^2)    (6)

Hence the conditional survival probability factorizes because of the additivity assumption on λi,j. Under the assumption that the two shock processes are independent, we can then say that the unconditional survival probabilities also factorize.

Hence, when the two failure mode processes are considered to be truly independent, they can be modelled as two different cases of the base model, and the failure rates added together. However, when they are not considered independent, we can capture this in one of two ways:

1. The shock time processes are dependent.
2. The failure intensities are not additive.

The simplest way in which the shock processes could be dependent is for there to be common shocks for both failure modes. For the purposes of this paper we shall not consider more complex forms of dependency between the shock processes. Regarding failure intensities, we would typically expect that the failure intensities are additive for early shocks, and then may become superadditive for late shocks, that is, λi,j > λi + λj.

Remark 1.1 When there is more than one failure mode, there is a modelling issue relating to the meaning of the failure intensities. While in the single failure mode model we can simply consider the failure intensity to cover failures arising for any reason, when there are two or more failure modes we have to distinguish failure intensities arising from the different failure modes. This avoids double counting of any residual failure intensity not ascribable to those two failure modes, but also ends up not counting it at all. Therefore, in this case, if there are significant failure intensities from residual failure causes, then it is best to explicitly assess these alongside the main failure modes, so that the failure intensities can be properly added.

3 ELICITATION

The purpose of using this modelling approach is that it makes discussion with experts easier. However, there is always a problem with obtaining good expert data, and some parts of the model will be easier to assess than others. In particular, one would expect:

• Straightforward: identification of the failure modes, identification of signals (note that this is often done informally through the use of green/amber/red codings, but may not be thought of as a signal in the sense that we are using it).
• Moderate difficulty: quantification of distributions for times to shocks, specification of relative risks (i.e. ratios of the λi,j).
• Very difficult: specification of absolute values for the λi,j.

The elicitation procedures and model quantification/validation steps have to be designed to use information at the least difficult level possible. Clearly, the model does require information about the absolute values of the λi,j, but we can also use empirical data to calibrate the model. This represents something of a departure from the previous competing risk models that we have considered, where the aim was usually to check identifiability of the model, i.e. whether or not the parameters could be estimated from the data.

3.1 Single model

Here we assess the set of signals, the mean times to signals, and the relative increase in risk after a signal.

3.2 Two failure modes

The model assumes that, by default, the failure intensities arising from different failure modes are additive. Hence they can be elicited in a first round according to the procedure used for the single mode situation. However, to provide a check on interactions, we check whether there is a critical signal level after which we could expect the other failure mode to be affected, or whether there is a common signal. Then we explicitly elicit relative risk values above the critical signal levels.

4 MODEL CALIBRATION

For simplicity, from now on we just consider the single failure mode case, but the other case works similarly. Since the elicitation steps above only give the failure rates up to an unknown constant, we have to calibrate the overall model. Suppose that the relative failure rates elicited from the experts are κ1, . . ., κn, so that the actual failure rates are of the form ακ1, . . ., ακn. The following result allows us to consider the impact of the calibration variable α.

Theorem 1.2 The marginal distribution for X is stochastically decreasing as a function of α.

Proof. It is enough to show that the probability of surviving beyond time t is decreasing as a function of α. In turn, for this, it is enough to show that the conditional

518
probability of failure after time t given the shock times, is increasing. However, that is clear since this is

    exp(−α[Σ_{i=1}^{j−1} (S_{i+1} − S_i)κ_i + (t − S_j)κ_j])

where j is the largest integer such that S_j < t.

In practice we can use this result to scale model outputs to the observed failure data. However, to do that we also need to take into account the censoring taking place in the observed data. What we actually see in the observed data is the effect of current opportunistic maintenance decisions.

Theorem 1.3 If the opportunistic maintenance intervention shock level is held constant, then the subdistribution function evaluated at any point t is increasing as a function of α. In particular, the probability of observing a failure, P(X < Z), is increasing as a function of α.

The proof is similar to that given above, using the fact that the opportunistic maintenance time is independent of the failure time, given the shock times, for

    P(X = t, t < Z | S_1, S_2, . . . , S_n)
        = P(X = t | S_1, S_2, . . . , S_n, t < Z) P(t < Z | S_1, S_2, . . . , S_n)
        = P(X = t | S_1, S_2, . . . , S_n) P(t < Z | S_1, S_2, . . . , S_n).

The first term is increasing in α while the second is constant in α. The above result allows us to use the empirically observable quantity P(X < Z) as a way of calibrating the model by finding the appropriate α.

Although we have concentrated here on trying to establish a value for the scaling parameter α, it is worth also looking at other parameters. In particular, although we assumed that the experts were able to give mean times until shocks, and the rate of opportunities, it is possible to see whether a scaling adjustment of these values would improve the overall fit of the model to observed data. These are, however, secondary "tuning parameters" that should only be considered after the α has been fitted. This fine tuning is then carried out using a different distance quantity—for example using Kolmogorov-Smirnov distance on the observed and model-based subdistribution functions.

5 RESULTS

We illustrate the results of the method proposed here by using simulated data.

The example supposes that there are two shocks that occur in the system, and that the mean time to the first is 0.2 and to the second is 0.8. The distribution of time between shocks is modelled as exponential. There are three failure rates associated with the periods demarcated by the two shocks. We suppose that the failure rate ratios are estimated as increasing by a factor of 10 each time, that is, the failure rate in the second period is 10 times the initial rate, and that in the third period is 10 times that in the second. We assume that the mean time between opportunities is 1.2, and that there is a major overhaul after time 10 (so that no times observed will ever be larger than 10 in any case). Finally, we assume that the current opportunistic maintenance strategy is to take the first opportunity that arises after the second shock.

Because we do not have a closed form solution to the model, we simulate it as follows:

1. Simulate the shock times.
2. Calculate the conditional distribution function for the lifetime given the shock times as in Equation 2, and simulate a lifetime.
3. Simulate a set of opportunity times and then choose the first one beyond the critical shock time, as the time at which opportunistic maintenance would be carried out.

We sampled 1000 cases (in Excel) and used these to numerically estimate the quantities of interest such as the failure distributions and the probability of observing a failure.

Assuming that the scaling variable α = 1, we get the following distributions for the underlying failure time, maintenance time, and their minimum, see Figure 1. When we increase the scaling parameter to 1.5, this increases the overall failure rate, thus making it more likely that a failure will be observed. This is illustrated in Figure 2. When we reduce the scaling parameter to 0.5, this reduces the overall failure rate, making it less likely that a failure will be observed, as illustrated in Figure 3 below.

Figure 1. Failure time distribution with α = 1. [Curves: Failure time, Taken opportunity, Minimum; time axis 0–10.]
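The three-step simulation scheme above, together with the calibration idea of Section 4, can be sketched in Python. Because P(X < Z) is monotone in α (Theorem 1.3), a simple bisection can recover α from an observed failure fraction. Note that the base rate κ1 = 0.02 is a hypothetical choice: the text fixes only the 1 : 10 : 100 ratios between the three periods, so the absolute probabilities produced below will not match Table 1.

```python
import random

random.seed(1)

# Assumed base rate: the paper fixes only the 1:10:100 ratios, so
# kappa_1 = 0.02 is a hypothetical choice made here for illustration.
KAPPAS = [0.02, 0.2, 2.0]     # relative failure rates in the three periods
MEAN_SHOCK = [0.2, 0.8]       # mean times to the first and second shocks
MEAN_OPP = 1.2                # mean time between maintenance opportunities
HORIZON = 10.0                # major overhaul time: nothing is observed later

def simulate_case(alpha):
    # Step 1: simulate the shock times.
    s1 = random.expovariate(1.0 / MEAN_SHOCK[0])
    s2 = s1 + random.expovariate(1.0 / MEAN_SHOCK[1])
    # Step 2: simulate a lifetime X from the piecewise-constant intensity
    # alpha * kappa_j by inverting the cumulative hazard.
    target = random.expovariate(1.0)
    bounds = [0.0, s1, s2, float("inf")]
    x, acc = float("inf"), 0.0
    for j, kappa in enumerate(KAPPAS):
        rate = alpha * kappa
        segment = (bounds[j + 1] - bounds[j]) * rate
        if acc + segment >= target:
            x = bounds[j] + (target - acc) / rate
            break
        acc += segment
    # Step 3: maintenance time Z is the first opportunity arising after
    # the second shock (opportunity gaps are exponential, mean MEAN_OPP).
    z = 0.0
    while z <= s2:
        z += random.expovariate(1.0 / MEAN_OPP)
    return min(x, HORIZON), min(z, HORIZON)

def prob_failure(alpha, n=20000):
    """Monte Carlo estimate of P(X < Z), the probability of observing a failure."""
    return sum(x < z for x, z in (simulate_case(alpha) for _ in range(n))) / n

def calibrate(target, lo=0.1, hi=3.0, iters=12, n=10000):
    """Bisection for alpha, valid because P(X < Z) is increasing in alpha."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if prob_failure(mid, n) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With these assumed rates the estimated failure probability increases with α, as Theorem 1.3 requires, and `calibrate(p_obs)` recovers a scaling parameter whose simulated P(X < Z) is close to the observed failure fraction.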
Figure 2. Failure time distribution with α = 1.5. [Curves: Failure time, Taken opportunity, Minimum; time axis 0–10.]

Figure 3. Failure time distribution with α = 0.5. [Curves: Failure time, Taken opportunity, Minimum; time axis 0–20.]

Finally we give a table showing how the probability of observing failure depends on the scaling parameter α. This confirms empirically the theoretical result given above, and shows that the scaling parameter is a first order model parameter.

Table 1. Probability of failure with varying α values.

alpha    P(X < Z)
0.4      0.230
0.6      0.256
0.8      0.313
1.0      0.346
1.2      0.402
1.4      0.470
1.6      0.599

6 CONCLUSIONS

Expert judgement has an important role to play in the development of fine detail models. Such models are difficult to quantify from plant data both because of the cost of cleansing the data sufficiently to make it amenable to statistical analysis, and also because—due to identifiability problems—it may not be possible to characterize a unique model in any case.

The model discussed here is designed to bring the mathematical modelling closer into line with the way plant operators and maintainers adjust their beliefs about the reliability of the equipment. Although we have described the model in terms of shocks occurring to the system, in practice these may not be hard discrete events, but the progression of degradation past a standardized level (for example, where a vibration monitor consistently measures high vibration, or where the criteria used by the operator to move the equipment state from green to amber state have been met). Such transitions are the ways in which staff assess a change in the risk of system failure, and therefore it is reasonable to build the subjective component reliability around them.

The model described here is a dependent competing risk model where maintenance times are statistically dependent on the component lifetime, and where different failure modes can also be modelled in a dependent way. In the former case the dependence arises through the use of signals to the decision maker about the component state, which both change the rate of failure and which also lead to opportunities for maintenance being taken. In the latter case the dependency arises through the same signals marking failure rate changes for distinct failure modes, and through other potential interactions.

ACKNOWLEDGEMENTS

This research is part of an EPSRC funded project on dependent competing risks. We would like to thank Scottish Power for its support in this project.

REFERENCES

Alkali, B.M. and T. Bedford. 2007. Competing Risks and Reliability Assessment of Power Plant Equipment. 17th Advances in Risk and Reliability Symposium (AR2TS), 17th–19th April, Loughborough University.
Bedford, T. 2005. Competing risk modelling in reliability, in Modern Mathematical and Statistical Methods in Reliability, Series on Quality, Reliability and Engineering Statistics, Vol. 10. Eds A. Wilson, N. Limnios, S. Keller-McNulty, Y. Armijo. CRC Press.
Bunea, C. and T. Mazzuchi. 2006. Competing Failure Modes in Accelerated Life Testing. Journal of Statistical Planning and Inference 136(5): 1608–1620.
Cooke, R. and O. Morales-Napoles. 2006. Competing Risk and the Cox Proportional Hazards Model. Journal of Statistical Planning and Inference 136(5): 1621–1637.
Crowder, M. 2001. Classical Competing Risks. Chapman & Hall: Boca Raton.
Dagpunar, J.S. 1996. A maintenance model with opportunities and interrupt replacement options. Journal of the Operational Research Society 47(11): 1406–1409.
Dekker, R. and E. Smeitink. 1994. Preventive Maintenance at Opportunities of Restricted Durations. Naval Research Logistics 41: 335–353.
Dijoux, Y., L. Doyen and O. Gaudoin. 2007. Conditionally Independent Generalized Competing Risks for Maintenance Analysis. 5th Mathematical Methods in Reliability Conference, Strathclyde University, Glasgow, Scotland.
French, S. 1986. Calibration and the Expert Problem. Management Science 32(3): 315–321.
Kececioglu, D. and F.B. Sun. 1995. A General Discrete-Time Dynamic-Programming Model for the Opportunistic Replacement Policy and Its Application to Ball-Bearing Systems. Reliability Engineering & System Safety 47(3): 175–185.
Langseth, H. and B. Lindqvist. 2006. Competing Risks for Repairable Systems: A Data Study. Journal of Statistical Planning and Inference 136(5): 1687–1700.
Meyer, M.A. and J.M. Booker. 2001. Eliciting and Analyzing Expert Judgment: A Practical Guide. SIAM.
van Noortwijk, J.M., R. Dekker, R. Cooke and T. Mazzuchi. 1992. Expert judgement in maintenance optimization. IEEE Transactions on Reliability 41(3): 427–432.
Tsiatis, A. 1975. A nonidentifiability aspect of the problem of competing risks. Proceedings of the National Academy of Sciences of the USA 72(1): 20–22.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Modelling different types of failure and residual life estimation for condition-based maintenance

Matthew J. Carr & Wenbin Wang
CORAS, Salford Business School, University of Salford, UK

ABSTRACT: Probabilistic stochastic filtering is an established method for condition-based maintenance activities. A recursive filter can be used to estimate the time that remains before a component, or part, fails using information obtained from condition monitoring processes. Many monitored processes exhibit specific behavioural patterns that typically result in specific types of failure. These behavioural patterns are often identifiable using historical data and are defined here, for future monitoring purposes, as potential failure patterns that result in a specific type of failure.
This paper presents a case application of a model designed to predict the forthcoming failure type using stochastic filtering theory. Firstly, a stochastic filtering model is developed for each potential failure type. Secondly, a Bayesian model is used to recursively assess the probability that an observed behavioural pattern for a given component, or part, will result in a specific type of failure. At monitoring points during the operational life of a component, estimation of the residual life is undertaken using a weighted combination of the output from the individual stochastic filters. The model is applied to an oil-based condition monitoring data set obtained from aircraft engines and the benefits of modelling the individual failure types are assessed.

1 INTRODUCTION

Condition-based maintenance (CBM) involves the scheduling of maintenance and replacement activities using information obtained from one or more condition monitoring (CM) processes over time. As with many reliability applications, we are interested in modelling the time remaining before failure, or residual life (RL), of a component, part or piece of machinery. Relevant prognostic techniques include: proportional hazards models (see Makis & Jardine (1991) and Banjevic & Jardine (2006)), proportional intensity models (see Vlok et al. (2004)), neural networks (see Zhang & Ganesan (1997)) and stochastic filters (see Wang & Christer (2000) and Wang (2002)).

This paper documents research into the categorisation and modelling of different CM behavioural patterns that typically result in specific types of failure. The investigation is an applied version of a methodology originally presented in Carr & Wang (2008b) for a single vibration CM parameter that is extended and modified here to cater for multivariate oil-based CM information.

At monitoring points throughout the life of a component and under the influence of a given failure type, probabilistic stochastic filtering (see Carr & Wang (2008a)) is used to model the conditional RL using the multivariate CM information. A Bayesian model is used to calculate the probability associated with each individual failure type occurring and weight the RL output from the individual stochastic filters accordingly.

2 NOTATION

To describe the RL estimation process for a given component, the following notation is introduced:

– i = 1, 2, . . . is an index of the CM points,
– ti is the time of the ith CM point, where t0 = 0,
– M ∈ {1, 2, . . ., r} is a discrete random variable (r.v.) representing the forthcoming failure type for the component,
– T > 0 is a continuous r.v. representing the failure time of the component.

At the ith CM point:

– Xi is a continuous r.v. representing the RL, where Xi = T − ti > 0, and xi is the unknown realisation,
– Yi = {Yi1, Yi2, . . ., Yiq} is a random vector of q CM parameters with realisation yi = {yi1, yi2, . . ., yiq},
– Vi = {y1, y2, . . ., yi} is a q × i matrix of historical CM observations,
– Ωi = {y1 ≤ Y1 ≤ y1 + dy, . . ., yi ≤ Yi ≤ yi + dy}, where dy↓0.
For a data set consisting of N component CM histories:

– nj is the number of component histories for which a posterior analysis has revealed that the jth failure type occurred, and we have N = Σ_{j=1}^{r} nj.

For the component index, k = 1, 2, . . ., nj:

– bk is the number of monitoring points,
– T̃k is the failure time,
– tkbk is the time of the last CM point; T̃k − tkbk > 0.

3 DEFINITIONS

Before monitoring begins, it is necessary to specify and parameterise the component elements. Under the assumptions of failure type j:

– f0(x|j)dx = Pr(x ≤ X0 ≤ x + dx | X0 > 0, M = j) is a prior pdf representing the initial residual life,
– g(y|x, j)dy = Pr(y ≤ Yi ≤ y + dy | x ≤ Xi ≤ x + dx, M = j) is a conditional pdf representing a stochastic relationship between the CM parameters and the RL,
– p0(j) = Pr(M = j | X0 > 0) is a prior pmf for the unknown failure type.

4 THE FAILURE TYPE RL ESTIMATION MODEL

Under the assumption of an unknown forthcoming failure type, a posterior pdf for the RL of an individual component is established at the ith CM point as

    Pr(x ≤ Xi ≤ x + dx | Xi > 0, Ωi) = fi(1)(x|Vi)dx = Σ_{j=1}^{r} fi(x|j, Vi) pi(j|Vi) dx   (1)

The posterior conditional pdf for the RL under the jth failure type is obtained recursively, using stochastic filtering theory, as

    fi(x|j, Vi) = g(yi|x, j) fi−1(x + ti − ti−1 | j, Vi−1) / ∫0^∞ g(yi|u, j) fi−1(u + ti − ti−1 | j, Vi−1) du   (2)

where f0(x|j, V0) ≡ f0(x|j). At the ith CM point, a posterior conditional pmf for the forthcoming failure type is recursively established as

    pi(j|Vi) = [∫0^∞ g(yi|u, j) fi−1(u + ti − ti−1 | j, Vi−1) du] pi−1(j|Vi−1) dyi
               / Σ_{d=1}^{r} [∫0^∞ g(yi|u, d) fi−1(u + ti − ti−1 | d, Vi−1) du] pi−1(d|Vi−1) dyi   (3)

A point estimate of the RL is evaluated as the conditional mean

    E[Xi | Xi > 0, Ωi] = ∫0^∞ x fi(1)(x|Vi) dx   (4)

However, when utilising the model to evaluate maintenance and replacement decisions, the description of the posterior conditional pdf for the RL of the component is usually more useful than a point estimate. Particularly useful is the cdf

    Pr(Xi ≤ x | Xi > 0, Ωi) = Fi(1)(x|Vi) = ∫0^x fi(1)(u|Vi) du   (5)

We parameterise f0(x|j) and g(y|x, j) using maximum likelihood estimation and the nj CM histories from components that failed according to the jth failure type. The likelihood is the product of the relevant conditional probabilities for all the histories:

    L(θj) = Π_{k=1}^{nj} [Π_{i=1}^{bk} (dyki)^{−1} Pr(yki ≤ Yki ≤ yki + dyki | Xki > 0, M = j, Ωk,i−1)
            × Pr(Xki > 0 | Xk,i−1 > 0, M = j, Ωk,i−1)]
            × (dx)^{−1} Pr(T̃k − tkbk ≤ Xkbk ≤ T̃k − tkbk + dx | Xkbk > 0, M = j, Ωkbk)   (6)

where θj is the unknown parameter set.
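Equations (2), (4) and (5) can be implemented directly on a discrete grid over the residual life. The sketch below is an illustration, not the authors' code: it uses the Weibull prior and exponential-form μ fitted later in section 6.2 for failure type 1, and noise-free observations from a hypothetical component with failure time T = 800 (the grid resolution, monitoring times and T are all assumptions made here).

```python
import numpy as np

# Discretised residual-life grid (an assumed resolution for illustration).
dx = 1.0
x = np.arange(0.0, 3000.0, dx)

def weibull_pdf(t, a, b):
    t = np.maximum(t, 1e-12)
    return (a ** b) * b * t ** (b - 1.0) * np.exp(-((a * t) ** b))

def filter_update(f_prev, dt, y, mu, sigma):
    """One step of equation (2): shift the previous RL pdf by the elapsed
    time dt, weight by the observation density g(y|x), and renormalise.
    The Gaussian normalising constant cancels in the renormalisation."""
    shifted = np.interp(x + dt, x, f_prev, right=0.0)   # f_{i-1}(x + dt)
    g = np.exp(-(y - mu(x)) ** 2 / (2.0 * sigma ** 2))
    post = g * shifted
    return post / (post.sum() * dx)

# Failure-type-1 values fitted in section 6.2.2 of the paper.
a1, b1, s1 = 0.001606, 5.877, 0.334
mu1 = lambda v: -2.326 + 4.386 * np.exp(-0.002746 * v)

f = weibull_pdf(x, a1, b1)
f /= f.sum() * dx                    # prior f_0(x|1) on the grid
T_true, t_prev = 800.0, 0.0          # hypothetical component, fails at 800
for ti in (100.0, 200.0, 300.0):
    y = mu1(T_true - ti)             # noise-free observation at the true RL
    f = filter_update(f, ti - t_prev, y, mu1, s1)
    t_prev = ti

mean_rl = (x * f).sum() * dx         # equation (4): conditional mean RL
cdf = np.cumsum(f) * dx              # equation (5): conditional cdf
```

After three consistent observations the posterior pdf remains properly normalised and its mean sits between the prior mean and the true residual life, which is the behaviour equation (2) is designed to produce.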
After inserting the relevant pdfs and re-arranging, the likelihood function can be written as

    L(θj) = Π_{k=1}^{nj} [(Π_{i=1}^{bk} ∫0^∞ g(yki|u, j) fk,i−1(u + tki − tk,i−1 | j, Vk,i−1) du) fkbk(T̃k − tkbk | j, Vkbk)]   (7)

See Carr & Wang (2008) for details. Maximisation of equation (7), with respect to θj, is undertaken using an optimisation algorithm such as the BFGS quasi-Newton method on the log of the likelihood function. To select between different forms for f0 and g, we use the Akaike information criterion (AIC).

The prior probabilities that each failure type will occur for a given component are estimated from the data set and described using the pmf p0(j) ≈ nj/N for j = 1, 2, . . . , r.

5 ASSESSING THE MODEL

To assess the applicability of the methodology for a given scenario, we compare the failure type RL estimation model against a single general stochastic filter for the RL with no failure type assumptions. For the general model, the posterior conditional RL pdf is

    fi(2)(x|Vi) = g(2)(yi|x) fi−1(2)(x + ti − ti−1 | Vi−1) / ∫0^∞ g(2)(yi|u) fi−1(2)(u + ti − ti−1 | Vi−1) du   (8)

The comparison is achieved using an average mean square error (AMSE) criterion. At the ith CM point for the kth component, the MSE is

    MSEki = ∫0^∞ (x + tki − T̃k)² fi(a)(x|Vi) dx   (9)

for models a = 1, 2. For the failure type RL model, the AMSE is

    AMSE = Σ_{j=1}^{r} Σ_{k=1}^{nj} Σ_{i=1}^{bk} MSEjki / N   (10)

and for the general filter, the AMSE is

    AMSE = Σ_{k=1}^{N} Σ_{i=1}^{bk} MSEki / N   (11)

6 INITIAL CASE ANALYSIS

In preparation for a larger project, we applied the failure type RL estimation model to a test data set of component monitoring histories (with associated failure times) obtained from a certain model of aircraft engine. The objective of the analysis is to assess the applicability and performance of the failure type model when applied to multivariate oil-based CM information.

We select forms and parameterise using historical CM data sets and the model is applied to new component monitoring information for demonstration purposes. The model is also compared with a general model with no failure type assumptions to illustrate the benefits of the failure type model. More information on the analysis will be given at the conference and in an extended version of this paper.

6.1 The data

The CM data we are considering consists of the parts per million (ppm) of contaminating metallic particles in oil lubrication samples that are obtained from a type of component used in aircraft engines. At each monitoring point, we observe the ppm of iron (Fe), copper (Cu), aluminium (Al), magnesium (Mg) and chromium (Cr). Initially, 10 component CM histories are used for the purpose of fitting the models. The CM information from 2 'new' components is then used to demonstrate the application of the models and compare the performance of the failure type RL estimation model and the general model with no failure type assumptions.

6.2 Model fitting

To reduce the dimensionality of the data and remove any potential collinearity between the individual CM variables over time, principal components analysis (PCA) is applied to the covariance matrix of the CM variables. After PCA, we have a vector of principal components at time ti: yi = {yi1, yi2, . . ., yik}, where

    yic = αc1 Fei + αc2 Cui + αc3 Ali + αc4 Mgi + αc5 Cri

represents the cth principal component and Fei, for example, is the cumulative iron reading at the ith CM point. Figure 1 illustrates the standardised first principal component over time and the associated failure times for the 10 components that are to be used to fit the models.

6.2.1 Failure type RL estimation model

For the failure type RL estimation model, the individual filters are parameterised using those CM histories that are relevant to the particular failure type.
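The PCA step of section 6.2 can be sketched with synthetic data standing in for the Fe/Cu/Al/Mg/Cr readings (the real data set is not published with the paper). A single shared wear trend makes the five cumulative series collinear, so the first principal component of the covariance matrix captures almost all of the variance, mirroring the paper's finding that only the first component is significant.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for one component's cumulative ppm readings of
# Fe, Cu, Al, Mg and Cr over 20 monitoring points: the five metals
# accumulate together, which is exactly the collinearity PCA removes.
trend = np.cumsum(rng.uniform(0.5, 1.5, size=20))
loadings = (1.0, 0.6, 0.4, 0.3, 0.2)
readings = np.column_stack([k * trend + rng.normal(0.0, 0.1, size=20)
                            for k in loadings])

# PCA applied to the covariance matrix of the CM variables.
centred = readings - readings.mean(axis=0)
cov = np.cov(centred, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)            # eigh: ascending eigenvalues
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

scores = centred @ eigvec                        # y_ic in the paper's notation
explained = eigval / eigval.sum()

# Standardised first principal component, the sole model input in the paper.
first_pc = (scores[:, 0] - scores[:, 0].mean()) / scores[:, 0].std()
```

As the paper notes, the same loadings and standardisation fitted here would have to be reapplied verbatim to any new component's readings before they enter the filters.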
Figure 1. Illustrating the first principal component over time and the associated failure times for the model fit data. [Standardised 1st PC plotted against time, 0–1800.]

As indicated in figure 1, it is possible to group the CM histories according to the observed failure times and the behaviour of the first principal component over time. The first failure type is defined as underlying behaviour that results in failure within the range 0 ≤ T < 1200 and for the second failure type, we have T ≥ 1200 as the specified range.

Under the assumptions of the jth failure type, the prior pdf for the RL is

    f0(x|j) = αj^{βj} βj x^{βj−1} e^{−(αj x)^{βj}}   (12)

where x > 0 and αj, βj > 0 for j = 1, 2. For linearly independent principal components, we have the combined conditional pdf

    g(yi|x, j) = Π_{c=1}^{k} g(yic|x, j)   (13)

For the particular data set under consideration, only the first principal component is found to be significant, in that it describes a satisfactory amount of the variance within the data set. As such, the first principal component is used as the sole input to the models. The stochastic relationship between the standardised first principal component and the underlying RL is described using a normal distribution as

    g(y|x, j) = (σj² 2π)^{−1/2} e^{−(y−μj(x))² / (2σj²)}   (14)

where μj(x) is a function of the RL. Note that, when applying the models to new component information, the same principal component and standardisation transformations must be applied to the CM data before insertion into the models.

Using equations (2), (12) and (13), the posterior conditional RL pdf at the ith CM point (under the influence of the jth failure type) is

    fi(x|j, Vi) = (x + ti)^{βj−1} e^{−(αj(x+ti))^{βj}} Π_{z=1}^{i} e^{−(yz−μj(x+ti−tz))²/(2σj²)}
                  / ∫_{u=0}^{∞} (u + ti)^{βj−1} e^{−(αj(u+ti))^{βj}} Π_{z=1}^{i} e^{−(yz−μj(u+ti−tz))²/(2σj²)} du   (15)

and using equation (3), the posterior conditional pmf for the forthcoming failure type is

    pi(j|Vi) = [(σj² 2π)^{−1/2} ∫_{u=0}^{∞} (u + ti)^{βj−1} e^{−(αj(u+ti))^{βj}} Π_{z=1}^{i} e^{−(yz−μj(u+ti−tz))²/(2σj²)} pi−1(j|Vi−1) du
                / ∫_{u=0}^{∞} (u + ti−1)^{βj−1} e^{−(αj(u+ti−1))^{βj}} Π_{z=1}^{i−1} e^{−(yz−μj(u+ti−1−tz))²/(2σj²)} du]
               / Σ_{d=1}^{r} [(σd² 2π)^{−1/2} ∫_{u=0}^{∞} (u + ti)^{βd−1} e^{−(αd(u+ti))^{βd}} Π_{z=1}^{i} e^{−(yz−μd(u+ti−tz))²/(2σd²)} pi−1(d|Vi−1) du
                / ∫_{u=0}^{∞} (u + ti−1)^{βd−1} e^{−(αd(u+ti−1))^{βd}} Π_{z=1}^{i−1} e^{−(yz−μd(u+ti−1−tz))²/(2σd²)} du]   (16)

Under the assumptions of the failure type model, the conditional pdf for the RL at the ith CM point, fi(1)(x|Vi), is given by equation (1), using equations (15) and (16), for i = 0, 1, 2, . . . and r = 2.

6.2.2 Failure type 1

In equation (12), a Weibull prior distribution is used to model the initial residual life under the influence of a given forthcoming failure type.
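The recursions (15) and (16) with r = 2 can be sketched on a grid. The block below uses the parameter values fitted in sections 6.2.2 and 6.2.3, but the monitoring times and the synthetic "type 1" component (failure at T = 800, noise-free observations on the type-1 trajectory) are illustrative assumptions, not data from the paper.

```python
import numpy as np

dx = 2.0
x = np.arange(0.0, 3000.0, dx)   # residual-life grid (assumed resolution)

def weibull_pdf(t, a, b):
    t = np.maximum(t, 1e-12)
    return (a ** b) * b * t ** (b - 1.0) * np.exp(-((a * t) ** b))

def gauss(y, m, s):
    return np.exp(-((y - m) ** 2) / (2.0 * s ** 2)) / np.sqrt(2.0 * np.pi * s ** 2)

# Parameter values fitted in sections 6.2.2 and 6.2.3 of the paper.
types = {
    1: dict(a=0.001606, b=5.877, s=0.334,
            mu=lambda v: -2.326 + 4.386 * np.exp(-0.002746 * v)),
    2: dict(a=0.000582, b=19.267, s=0.279,
            mu=lambda v: -1.228 + 3.268 * np.exp(-0.003995 * v)),
}
f = {}
for j, par in types.items():
    f[j] = weibull_pdf(x, par["a"], par["b"])
    f[j] /= f[j].sum() * dx                 # prior f_0(x|j) on the grid
p = {1: 0.7, 2: 0.3}                        # prior type probabilities p_0(j)

T_true, t_prev = 800.0, 0.0                 # failure before 1200: a type-1 case
for ti in (100.0, 250.0, 400.0, 550.0, 700.0):
    y = types[1]["mu"](T_true - ti)         # noise-free type-1 observation
    like = {}
    for j, par in types.items():
        shifted = np.interp(x + (ti - t_prev), x, f[j], right=0.0)
        post = gauss(y, par["mu"](x), par["s"]) * shifted
        like[j] = post.sum() * dx           # predictive density of y under j
        f[j] = post / max(like[j], 1e-300)  # equation (15) update for type j
    total = p[1] * like[1] + p[2] * like[2]
    p = {j: p[j] * like[j] / total for j in p}   # equation (16) update
    t_prev = ti
```

Early observations are ambiguous between the two types, but as the component ages the type-2 hypothesis (which expects a long life near the type-2 Weibull scale) becomes incompatible with the readings and the posterior probability concentrates on type 1.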
Using the likelihood function in equation (7) and the 7 component histories deemed to have failed according to failure type 1, the shape and scale parameters are estimated as α1 = 0.001606 and β1 = 5.877 respectively. A number of forms are considered for the function μ1(x) in equation (14) that describes the relationship between the RL and the observed (and transformed) CM information under the influence of failure type 1. The estimated parameters and the AIC results are given in table 1, where the objective is to minimise the AIC.

Table 1. The estimated parameters and selection results under the influence of failure type 1.

          μ1(x)
          A1 + B1 x    A1 + B1/x    A1 + B1 exp{−C1 x}
A1        1.628        −0.031       −2.326
B1        −0.006281    18.882       4.386
C1        –            –            0.002746
σ1        0.393        0.866        0.334
ln L(θ1)  −63.036      −165.753     −41.852
AIC       132.072      337.506      91.704

The selected function is μ1(x) = A1 + B1 exp(−C1 x) where A1 = −2.326, B1 = 4.386, C1 = 0.002746 and the standard deviation parameter is σ1 = 0.334. In addition, the prior probability that failure type 1 will occur for a given component is estimated as p0(1) = 7/10 = 0.7.

6.2.3 Failure type 2

For failure type 2, the prior Weibull RL pdf is parameterised as α2 = 0.000582 and β2 = 19.267. The estimated parameters and selection results for the function μ2(x) from equation (14) are given in table 2. The selected function is μ2(x) = A2 + B2 exp(−C2 x) where A2 = −1.228, B2 = 3.268, C2 = 0.003995 and σ2 = 0.279. The prior probability that failure type 2 will occur is p0(2) = 0.3.

Table 2. The estimated parameters and selection results under failure type 2.

          μ2(x)
          A2 + B2 x    A2 + B2/x    A2 + B2 exp{−C2 x}
A2        0.606        −0.401       −1.228
B2        −0.001624    11.761       3.268
C2        –            –            0.003995
σ2        0.618        0.851        0.279
ln L(θ2)  −88.109      −118.21      −13.531
AIC       182.218      242.42       35.062

6.2.4 The general RL estimation model

Without any failure type assumptions, the prior pdf for the RL is taken to be

    f0(2)(x) = α^β β x^{β−1} e^{−(αx)^β}   (17)

where x > 0 and α, β > 0. The relationship between the standardised first principal component and the underlying RL is described using the conditional pdf

    g(2)(y|x) = (σ² 2π)^{−1/2} e^{−(y−μ(x))²/(2σ²)}   (18)

where σ > 0 and a number of different functional forms are considered for μ(x).

For the general model with no failure type assumptions, all 10 CM histories are used for parameterisation. The likelihood function of equation (7) is used to estimate the parameters with the sub-script j removed from consideration. The estimated parameters for the prior RL pdf in equation (17) are α = 0.0009735 and β = 1.917. The estimated parameters and selection results for μ(x) are given in table 3.

Table 3. The estimated parameters and selection results for the general model.

          μ(x)
          A + Bx       A + B/x      A + B exp{−Cx}
A         0.735        −0.168       −1.308
B         −0.002052    14.367       3.454
C         –            –            0.004233
σ         0.696        0.894        0.333
ln L(θ)   −236.721     −292.852     −71.27
AIC       479.442      591.704      150.54

The selected function is μ(x) = A + B exp(−Cx) where A = −1.308, B = 3.454, C = 0.004233 and the standard deviation parameter is σ = 0.333.

From equation (8), the posterior conditional pdf for the RL under the general model assumptions is

    fi(2)(x|Vi) = (x + ti)^{β−1} e^{−(α(x+ti))^β} Π_{z=1}^{i} e^{−(yz−μ(x+ti−tz))²/(2σ²)}
                  / ∫_{u=0}^{∞} (u + ti)^{β−1} e^{−(α(u+ti))^β} Π_{z=1}^{i} e^{−(yz−μ(u+ti−tz))²/(2σ²)} du   (19)
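The AIC values in tables 1–3 can be reproduced from the reported log-likelihoods via AIC = 2k − 2 ln L, where k is the number of free parameters in each μ form plus the standard deviation (so k = 3 for the linear and reciprocal forms and k = 4 for the exponential form). The parameter counts here are our reading of the tables rather than something the paper states explicitly.

```python
def aic(k, log_lik):
    """Akaike information criterion: AIC = 2k - 2 ln L."""
    return 2 * k - 2 * log_lik

# Failure type 1 (table 1): the linear form A1 + B1*x has k = 3
# (A1, B1, sigma1); the exponential form adds C1, so k = 4.
print(round(aic(3, -63.036), 3))   # -> 132.072
print(round(aic(4, -41.852), 3))   # -> 91.704
# General model (table 3), exponential form:
print(round(aic(4, -71.27), 2))    # -> 150.54
```

Minimising this quantity, as the paper does, trades log-likelihood against the extra parameter C, which is why the exponential form wins in all three tables despite being the most complex.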
6.3 Comparing the models

The models are compared using new component data. The first component is known (in hindsight) to have failed according to failure type 1 and the second according to type 2. We demonstrate the tracking of the appropriate forthcoming failure type over time using equation (16) and compare the mean RL estimates at each CM point with those obtained using the general model.

With regard to the modelling of optimal maintenance and replacement decisions, the availability of a posterior conditional pdf for the residual life is of greater use than a point estimate. With this in mind, we use equations (10) and (11) to compare the AMSE produced by the models for the two new components.

Figures 2 and 3 illustrate the tracking of the failure type and the conditional RL estimation for the first new component. It is evident from figure 2 that the failure type model correctly tracks the forthcoming failure type over time. The conditional RL estimation process is more accurate using the failure type model, as illustrated in figure 3, when compared with the general model. This is also reflected in the AMSE results with 94884.66 for the failure type model and 105309.92 for the general model.

Figure 2. Tracking the forthcoming failure type over time for the first new component. [Probability of failure types 1 and 2 against time, 0–700.]

Figure 3. Comparing the conditional RL estimates of the failure type and general models for the first new component. [Curves: Actual RL, FM Model RL Estimate, General Model RL Estimate.]

Figures 4 and 5 illustrate the tracking of the failure type and the conditional RL estimation for the second new component that is known, in hindsight, to have failed according to failure type 2. Again, the failure type model correctly tracks the forthcoming type, as illustrated in figure 4. Figure 5 demonstrates that the failure type model tracks the RL more accurately and rapidly than the general model. This is again reflected in the AMSE results with 68172.64 for the failure type model and 84969.59 for the general model.

Figure 4. Tracking the underlying failure type over time for the second new component. [Probability of failure types 1 and 2 against time, 0–1500.]

Figure 5. Comparing the conditional RL estimates of the failure type and general models for the second new component. [Curves: Actual RL, FM Model RL Estimate, General Model RL Estimate.]

7 DISCUSSION

In this paper, we have presented a brief overview of a model for failure type analysis and conditional RL estimation. The modelling concepts have been demonstrated using a trial oil-based data set of component monitoring observations. Although the data set is relatively small, the results do indicate that the failure type model could be very useful for CM scenarios that display different behavioural patterns and have categorisable failures.
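The quantity behind the AMSE comparison above is the MSE integral of equation (9), which can be evaluated numerically for any gridded RL pdf. The toy Gaussian densities below are our own illustration (not the paper's filters); they show that the criterion is smaller for a density that sits more tightly about the actual residual life, which is why a lower AMSE indicates better decision support.

```python
import numpy as np

dx = 1.0
x = np.arange(0.0, 2000.0, dx)   # residual-life grid

def mse(pdf, t_ki, T_k):
    """Equation (9): MSE_ki = integral of (x + t_ki - T_k)^2 f(x) dx."""
    return ((x + t_ki - T_k) ** 2 * pdf).sum() * dx

def gaussian_rl_pdf(centre, sd):
    f = np.exp(-((x - centre) ** 2) / (2.0 * sd ** 2))
    return f / (f.sum() * dx)

T_k, t_ki = 800.0, 300.0                    # true failure time, current CM point
tight = gaussian_rl_pdf(T_k - t_ki, 50.0)   # pdf tight about the actual RL
loose = gaussian_rl_pdf(T_k - t_ki, 200.0)
```

For a pdf centred on the true residual life the criterion reduces to the pdf's variance, so the tighter density scores markedly better; averaging these values over components and CM points gives the AMSE of equations (10) and (11).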
528
of different failure types can be used to represent REFERENCES
different types of operation, or potentially different
faults in the system that are affecting the future life of Makis, V. and Jardine, A.K.S. (1991) Optimal replacement
the component. in the proportional hazards model, INFOR, 30, 172–183.
In the initial analysis, we compared the performance Zhang, S. and Ganesan, R. (1997) Multivariable trend anal-
of the failure type model with a general model with no ysis using neural networks for intelligent diagnostics of
rotating machinery, Transactions of the ASME Journal of
failure type assumptions. The models are compared Engineering for Gas Turbines and Power, 119, 378–384.
using a MSE criterion. At each monitoring point, the Wang, W. and Christer, A.H. (2000) Towards a general con-
MSE criterion compares the fit of the established con- dition based maintenance model for a stochastic dynamic
ditional RL pdf about the actual underlying residual system, Journal of the Operational Research Society, 51,
life. When utilised in maintenance and replacement 145–155.
models, if the density is tighter about the actual value, Wang, W. (2002) A model to predict the residual life of rolling
the decisions are improved in the sense that, greater element bearings given monitored condition information
operational time is available whilst still avoiding the to date, IMA Journal of Management Mathematics, 13,
occurrence of failures. The AMSE is substantially 3–16.
Vlok, P.J., Wnek, M. and Zygmunt, M. (2004) Utilising sta-
smaller, particularly in the second case, when using tistical residual life estimates of bearings to quantify the
the failure type model. influence of preventive maintenance actions, Mechanical
We are currently in the process of applying the Systems and Signal Processing, 18, 833–847.
model to a much larger project involving multiple Banjevic, D. and Jardine, A.K.S. (2006) Calculation of reli-
potential failure types that are categorised according ability function and remaining useful life for a Markov
to both the nature of the CM information and the failure time process, IMA Journal of Management Math-
associated failure times. ematics, 286, 429–450.
Carr, M.J. and Wang, W. (2008a) A case comparison of
a proportional hazards model and a stochastic filter for
condition based maintenance applications using oil-based
ACKNOWLEDGEMENT condition monitoring information, Journal of Risk and
Reliability, 222 (1), 47–55.
The research reported here has been supported by the Carr, M.J. and Wang, W. (2008b) Modelling CBM failure
Engineering and Physical Sciences Research Council modes using stochastic filtering theory, (under review).
(EPSRC, UK) under grant EP/C54658X/1.

529
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Multi-component systems modeling for quantifying complex maintenance strategies

V. Zille, C. Bérenguer & A. Grall
Université de Technologie de Troyes, Troyes, France

A. Despujols & J. Lonchampt
EDF R&D, Chatou, France

ABSTRACT: The present work proposes a two-stage modeling framework which represents both a complex maintenance policy and the functional and dysfunctional behavior of a complex multi-component system, in order to assess its performance in terms of system availability and maintenance costs. The first stage consists of a generic component model developed to describe the component degradation and maintenance processes. At the second stage, a system of several components is represented and its behavior is simulated when a given operating profile and maintenance strategy are applied, so as to estimate the maintenance costs and the system availability. The proposed approach has been validated on a simplified turbo-pump lubricating system.

1 INTRODUCTION

1.1 Industrial context

The large impact the maintenance process has on system performance makes its optimization important, a task rendered difficult by the various and sometimes antagonistic criteria, such as availability, safety and costs, that must be taken into account simultaneously.

In order to work out preventive maintenance programs in nuclear and fossil-fired plants, EDF applies a dedicated Reliability Centered Maintenance (RCM) method to critical systems, which makes it possible to determine the critical failure modes and helps the experts propose applicable, effective and economic preventive maintenance tasks.

Maintenance strategies are then established by considering various possible options: the nature of maintenance (corrective, preventive), the type of tasks (overhaul, monitoring, scheduled replacements, ...), their frequency, the maintenance line (repair on site or in workshop), etc. Such choices are frequently supported by expert opinion, good sense and intuition, that is, qualitative information, which could be helpfully complemented by quantitative information resulting from deterministic and/or probabilistic calculations.

However, due to the probabilistic nature of failures, it is not easy to compare different options on a quantified basis. It is difficult to evaluate the results of applying a maintenance program over several years in terms of availability, safety level and costs, and decisions are generally based on qualitative information.

For this reason it appears convenient to develop methods to assess the effects of the maintenance actions and to quantify the strategies.

1.2 Scientific context

Numerous maintenance performance and cost models have been developed during the past several decades, see e.g. (Valdez-Flores & Feldman 1989), with different interesting objectives reached.

However, they remain difficult to adapt to multi-component systems and to complex maintenance strategies such as those developed and applied in the RCM context, since most of them are devoted to simple maintenance strategies (periodic maintenance, condition-based maintenance, age-based maintenance, ...) with a finite number of actions and defined effects (perfect inspections, perfect replacements, minimal repair, ...), or are applied to single-unit systems, see e.g. (Dekker 1996, Moustafa et al., 2004).

Other approaches, based on stochastic simulation, already allow more complex maintenance strategies to be taken into account, but they generally focus on developing simulation techniques, such as Monte Carlo simulation, or optimization procedures, see e.g. (Marseguerra and Zio 2000).

Finally, only a few maintenance simulation works pay attention to component degradation

and failure phenomena and to the effects of maintenance actions. All these observations have led to the definition of a general and overall description of all the aspects related to a system, the behavior of its components, and the different maintenance tasks applied.


2 MODELING APPROACH PRESENTATION

2.1 System representation for maintenance strategies assessment

The assessment of the performance of complex maintenance programs, resulting for example from the implementation of RCM, encounters several methodological difficulties, whose resolution constitutes the scientific challenge of this work. These difficulties are due first to the complexity of the systems on which maintenance programs are implemented (systems constituted of several dependent components, with several degradation mechanisms and several failure modes possibly in competition to produce a system failure) and secondly to the complexity of the maintenance programs themselves (diversity of maintenance tasks).

Since both system behavior and the maintenance process are strictly linked to the way system components evolve and react to maintenance tasks, maintenance strategies can be assessed through a two-level system description, representing the system and its components and the way their respective behaviors and maintenance processes interact.

2.2 Overall model

Equipment failure mode analyses, carried out for RCM method application, are often based on a FMECA matrix which contains, if complete, the necessary information to represent the complexity of the problem, since the data collected characterizes:

• the components, with their failure modes, degradation mechanisms and applied maintenance tasks;
• the system, which is composed of these components and characterized by:
  – its operation, which influences the component degradation mechanisms;
  – its unavailability, which results from component failure mode occurrences and/or from the preventive and corrective maintenance tasks carried out;
• the maintenance rules, depending on the maintenance policy, which gather the elementary tasks carried out on the components.

To model reality as accurately as possible it is convenient to represent these various characteristics, and these observations have resulted in an overall model, shown in Figure 1, divided into four parts sharing information.

Figure 1. Overall structure for multi-component system maintenance modeling: white arrows represent interactions between the various parts of the system level, black arrows represent the interactions between system and component levels. Assessment results are obtained in terms of maintenance costs and availability performance evaluations. (Figure elements: system operation behaviour model, system failure model, generic component model, system-level maintenance model, performance evaluations.)

A generic component model is developed for each system component, and the four modeling levels can be validated independently and then associated by means of interactions in order to simulate complete systems, as described in the following sections.


3 MODEL OF COMPONENT

3.1 Generic modeling of a maintained component

To model system behavior, one must represent how the components it is made of can evolve during their mission time through different operational states: functioning, failure, and unavailability for maintenance. These behaviors are consequences of the evolution of the different degradation mechanisms that impact the components and may lead to failure mode occurrences. Thus, it is necessary to describe these evolutions and the way maintenance tasks can detect them and repair if necessary, in order to prevent or correct their effects on the system (Bérenguer et al., 2007).

All the phenomena and aspects that need to be represented, and the impacts they can have on one another, are described in Figure 2 and presented in more detail in Sections 3.2 and 3.3.

Here again, a real complexity comes from the huge number of possible relations between the various characteristics to be represented. Indeed, influencing factors often affect several degradation mechanisms, which can in turn involve several failure modes. Moreover, symptoms can be produced by various degradations, and maintenance tasks can make it possible to detect or repair several degradations.
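As a rough illustration of this decomposition, the four interacting parts of Figure 1 can be caricatured as objects exchanging state information. The class and method names below are illustrative assumptions for a minimal sketch, not the authors' implementation:

```python
class ComponentModel:
    """Generic component model: exposes its state to the system level."""
    def __init__(self, name):
        self.name = name
        self.failed = False            # a failure mode has occurred
        self.in_maintenance = False    # unavailable for a maintenance task

    def available(self):
        return not (self.failed or self.in_maintenance)


class SystemOperationModel:
    """Nominal behavior and operating rules: activates a stand-by component."""
    def activate_standby(self, components):
        for c in components:
            if c.available():
                return c.name          # first available component is activated
        return None                    # no branch left: system failure


class SystemDysfunctionModel:
    """Aggregates component states into system unavailability indicators."""
    def unavailable_components(self, components):
        return [c.name for c in components if not c.available()]


class SystemMaintenanceModel:
    """System-level maintenance rules, e.g. opportunistic grouping."""
    def maintenance_candidates(self, components):
        return [c.name for c in components if c.failed or c.in_maintenance]


# Two redundant pumps, as in the case study of Section 5; Po1 has failed:
pumps = [ComponentModel("Po1"), ComponentModel("Po2")]
pumps[0].failed = True
print(SystemOperationModel().activate_standby(pumps))          # -> Po2
print(SystemDysfunctionModel().unavailable_components(pumps))  # -> ['Po1']
```

In a full simulator these objects would exchange messages at every simulated event, as described in Section 4.2; here they only illustrate the direction of the information flow.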

Figure 2. Generic modeling of a maintained component. Rectangular elements represent the phenomena and aspects involved in a maintained component behavior (operating profile, environment, system operation, influencing factors, degradation mechanisms, symptoms, failure modes, preventive and corrective maintenance, system dysfunction and effects on the system) and arrows describe how they interact.

Figure 3. Degradation and failure processes modeling. Rectangular elements represent general aspects and phenomena involved in the described behavior (influencing factors, degradation levels 0 to n, failure rates, failure mode occurrence, maintenance) and oval elements specify their different possible states. Black arrows represent evolution transitions and grey arrows represent the impact of one phenomenon on another.

Component unavailability, due either to failure mode occurrence (unscheduled) or to a preventive maintenance operation (scheduled), is deterministic in terms of its effects on system operation. The unavailability duration is also linked to resource-sharing problems, since specific materials or maintenance repair teams have to be available at a given time to perform the needed tasks.

3.2 Component degradation behavior

Component behavior is defined by the way degradation mechanisms can evolve and may lead to some failure mode occurrence. This evolution is due to the influence of various factors such as operating conditions, the environment, the failure of another component, etc.

As shown in Figure 3, each degradation mechanism evolution is described through various degradation levels, and at each level an increasing failure probability represents the random occurrence of the different possible failure modes. The degradation level can be reduced by the performance of maintenance actions.

Such a representation is motivated by the fact that the different degradations that can affect a component can be detected and observed by specific preventive maintenance tasks, which may lead to preventive repair on a conditional basis. Thus, it seems more appropriate to describe each mechanism separately than to describe the global degradation state of the component.

Due to the large number of degradation mechanisms, and in order to obtain a generic model applicable to a large range of situations, it appears convenient to propose various alternative representations of degradation mechanism evolution, so as to choose among them the one most adapted to a given situation, depending on the component, the mechanisms and, obviously, the information data available. Different classical approaches can be considered (statistical, semi-parametric, life-time models, ...).

Another important aspect related to component degradation mechanisms is that of symptoms, i.e. phenomena that may appear due to one or more mechanism evolutions and whose detection gives information about the component degradation without observing the degradation itself directly. When possible, this type of detection can be made by less costly tasks which do not require the component to be stopped, and therefore presents real advantages.

As for degradation evolution, symptom apparition and evolution can be modeled through levels and by means of thresholds that represent the symptom's significance, i.e. the fact that it testifies to a degradation evolution.

Figure 4. Symptom apparition representation. When a symptom reaches a given threshold it becomes significant of a degradation evolution, and its detection may make direct observation of the degradation unnecessary. (Figure elements: degradation mechanism, insignificant/significant symptom, symptom observation, maintenance detection and repair.)
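The level-based description above (discrete degradation levels, a level-dependent failure probability, maintenance resetting the level) can be sketched in a few lines of discrete-time simulation. All numerical values here are invented for illustration and are not taken from the paper:

```python
import random

def simulate_component(steps, p_degrade, fail_prob_per_level,
                       overhaul_period, seed=0):
    """Discrete-time sketch of one maintained component: the degradation
    level climbs at random, each level carries a higher per-step failure
    probability, and a periodic perfect overhaul resets the level to 0."""
    rng = random.Random(seed)
    level, failures = 0, 0
    max_level = len(fail_prob_per_level) - 1
    for t in range(1, steps + 1):
        if level < max_level and rng.random() < p_degrade:
            level += 1                  # degradation mechanism evolves
        if rng.random() < fail_prob_per_level[level]:
            failures += 1               # a failure mode occurs
            level = 0                   # corrective repair, As Good As New
        elif t % overhaul_period == 0:
            level = 0                   # preventive overhaul and repair
    return failures

# Failure probability increasing with the degradation level (assumed values):
print(simulate_component(10_000, 0.05, [0.0001, 0.001, 0.01, 0.05], 50))
```

Varying `overhaul_period` in this toy model reproduces the qualitative trade-off discussed later in the paper: frequent overhauls keep the level low at the price of more maintenance actions.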

Figure 4 describes the relationships between degradation, symptom and maintenance.

3.3 Component maintenance process

Within the RCM method, different types of maintenance tasks can be performed, devoted to different phenomena to be detected and with different effects on the component behavior. The model developed has been dedicated to this type of strategy so that such strategies can be represented integrally.

Indeed, maintenance effectiveness is modeled to represent the ability of preventive actions to detect component degradations, and the ability of both preventive and corrective actions to modify and keep under control the degradation mechanism evolution in order to avoid failure occurrence, as shown in Figures 3 and 4 by the impact of maintenance on the various phenomena described.

Other maintenance policies, such as opportunistic maintenance, are defined in the system-level model, and their activation is made possible by information exchange between the component-level model and the system maintenance model.

In addition to duration and costs, tasks differ in terms of nature, activation condition, and effects on the component state and on its availability, as shown in Table 1. Regarding repair task effectiveness, repair actions are considered either As Good As New, As Bad As Old, or partial, and detection tasks are defined with possible non-detection and false alarm risks.

Table 1. RCM method maintenance task characteristics.

Task                   Activation                      Effects

Corrective maintenance
Repair                 Failure mode occurrence         Unavailability; failure repair

Predetermined preventive maintenance
Scheduled replacement  Time period elapsed             Unavailability

Condition-based preventive maintenance
External inspection    Time period elapsed             No unavailability; symptom detection
Overhaul               Time period elapsed             Unavailability; degradation detection
Test                   Failure observed during         Unavailability; failure repair
                       stand-by period
Preventive repair      Symptom > detection threshold   Unavailability; degradation repair
                       or degradation > repair
                       threshold

A focus can be made on the differences between the various condition-based maintenance detection tasks, i.e. external inspections, overhauls and tests, and their relative advantages and disadvantages. Indeed, they differ both in terms of the unavailability engendered for the component under maintenance and in terms of detection efficiency, with a corresponding impact on the task cost. In particular, on the one hand, performing an overhaul implies both scheduled component unavailability and a high cost, but is really efficient in terms of detection, since it consists of a long and detailed observation of the component to evaluate its degradation states and eventually decide to repair it preventively. On the other hand, external inspections are less expensive and consist of observing the component without stopping it. These two advantages imply a larger distance from the degradation, with associated risks of non-detection or false alarm. Obviously, this kind of task can easily be used to observe eventual symptoms characteristic of one or more degradation mechanism evolutions, and so some appreciation error can exist when a preventive repair decision is taken after the component inspection (treatment of the wrong degradation mechanism whereas another one is still evolving, with an increasing probability of failure mode occurrence). Finally, tests are expensive but efficient tasks performed on stand-by components to detect an eventual failure before the component is activated, but they can have adverse effects on the component.


4 SYSTEM MODEL: MAINTENANCE COST ASSESSMENT

4.1 Three models to describe the system behavior

The system level consists in representing a system of several components and simulating its behaviour when given operating profiles and maintenance strategies are applied, so as to estimate the maintenance costs and the system availability. This is done through three different models describing its dysfunctional and functional behaviour and the maintenance rules.

The system dysfunction model describes all the degradation/failure scenarios that can affect the system. It gives out the global performance indicators of the maintained system. Indeed, it allows the evaluation of the system unavailability, due either to a failure or to some maintenance actions, and also of the associated maintenance costs.

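The detection errors and repair effectiveness notions of Table 1 (non-detection and false-alarm risks; As Good As New, As Bad As Old or partial repairs) can be expressed as small stochastic routines. The probability values and the halving rule for a partial repair are illustrative assumptions, not values from the paper:

```python
import random

def inspect_component(level, significant_level, p_miss, p_false_alarm, rng):
    """Imperfect detection: a significant degradation may be missed
    (non-detection) and an insignificant one may be flagged (false alarm)."""
    if level >= significant_level:
        return rng.random() >= p_miss
    return rng.random() < p_false_alarm

def repair(level, mode):
    """Repair effectiveness: As Good As New, As Bad As Old, or partial
    (here a partial repair is assumed to halve the degradation level)."""
    if mode == "AGAN":
        return 0
    if mode == "ABAO":
        return level
    return level // 2

rng = random.Random(42)
print(inspect_component(3, significant_level=2, p_miss=0.1,
                        p_false_alarm=0.05, rng=rng))  # -> True (detected)
print(repair(4, "partial"))  # a level-4 degradation is reduced to level 2
```

A condition-based policy would then trigger `repair` only when `inspect_component` returns a detection and the observed level exceeds the repair threshold of Table 1.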
The system operation model aims at describing the nominal behavior and the operating rules of the system. This model interacts with the component models and evolves according to the operating profile and to the needs of the system: activating a required component, stopping a superfluous component, etc. Obviously, the operating behavior of the system cannot be described by the simple juxtaposition of the component-level models, and it is necessary to take into account all the possible interactions and dependences between components. At this level one can model spare equipment, the activation of defense systems in case of an equipment failure, the stopping of a line when one of its components is under maintenance, etc.

In the system maintenance model, one defines the maintenance strategy applied to the system. It allows the description of grouping procedures, which are used to take advantage of economic or technical dependences between components in the case of opportunistic maintenance. This model also includes resource sharing and availability problems due to a limited number of repair teams or to specific tools and spare parts stocks.

4.2 Interactions between the different models

The three system-level models and the component-level models interact in order to represent completely the system behavior, its unavailability and its expenditures, knowing the behavior of its components and the maintenance tasks that are carried out.

The component-level models give information on component states (failure, unavailability for maintenance) and on maintenance costs to the three system-level models, which evolve according to this input data and possibly send feedback data.

As shown in Figure 1, the system operation model sends information to the component-level models to activate a stand-by component or to stop an auxiliary component that has become useless after the repair of the main component.

The system maintenance model can send data to the component-level model to force the maintenance of a component coupled with a component already in maintenance.

4.3 Maintenance strategy assessment

The overall model presented permits maintenance strategy assessment by evaluating the performance obtained from the system to which a given strategy is applied. This takes into account both the maintenance costs, which depend on the number of tasks performed and the resources used, and the system availability and unavailability during its mission time.

A global maintenance cost rate can be defined by Equation 1:

Cost(Strategy) = lim_{TMiss→∞} [ ( Σ_i n_i c_i + t_su c_su + t_uu c_uu ) / TMiss ]    (1)

where TMiss = the mission time throughout which the system operates; n_i = the number of maintenance tasks of type i performed; c_i = the cost of maintenance task i; t_su = the time the system is under scheduled unavailability; t_uu = the time the system is under unscheduled unavailability; c_su = the cost per unit time of scheduled unavailability; c_uu = the cost per unit time of unscheduled unavailability.

According to Equation 1, one can assess the global costs of strategies that differ in terms of task types and frequencies, knowing their relative operation costs and their impact on the associated durations of scheduled and unscheduled unavailability.


5 CASE STUDY AND RESULTS

5.1 Model development and validation: Petri nets and Monte Carlo simulation

The proposed generic methodology has been implemented using Stochastic Synchronized Petri Nets (SSPN) coupled with Monte Carlo simulation to compute the performance assessment of industrial systems, see e.g. (Barata et al., 2002; Bérenguer et al., 2007; Dutuit et al., 1997).

For system dependability studies, SSPN offer a powerful and versatile modeling tool that can be used jointly with Monte Carlo simulation, which is widely used in this kind of work, see e.g. (Barata et al., 2002; Simeu-Abazi & Sassine 1999).

In order to validate the modeling approach, simulations of real complex system behavior have been carried out to study the effects of parameter variations, such as the maintenance task periods, on the system behavior, with interesting results that have encouraged further developments.

5.2 Case study: Turbo-pump lubricating system

We provide here results obtained on a part of a simplified turbo-pump lubricating system, described in Figure 5, to underline the main originality of the proposed approach, that is, the capability of multi-component system modeling and complex maintenance assessment. In particular, the objective of the

case study was to compare different possible strategies, composed of different types of tasks, in terms of global cost, depending on the task periodicity.

Figure 5. Part of a turbo-pump lubricating system made of two pumps Po1 and Po2, two check valves Cl1 and Cl2, and a sensor Ca.

Expert opinions and information data have been collected to define the component and system characteristics, as well as those of the maintenance tasks possibly performed, so that the modeling approach could be applied and simulated.

Indeed, for each component, degradation mechanism and maintenance task, basic parameters such as those in Table 2 have to be specified.

Obviously, maintenance tasks are also described in terms of periodicity and decision rule criteria, that is, which degradation levels can be observed and when preventive repairs are decided to be performed. These characteristics define the maintenance strategy applied and simulated.

To model the particular case presented, for each component the main degradation mechanisms have been characterized in terms of levels of degradation and the relative failure rates for the various possible failure modes, the possible symptoms and their probability or delay of apparition and evolution up to some significant thresholds. We also defined the evolution transitions from one degradation level to the next and, finally, the influencing factors that affect the mechanism evolution. In the present case study, mechanism evolution has been modelled using a Weibull life-time distribution, whose parameters depend on the mechanisms described and on the information available from the experts, to compute the time of the transition from one degradation level to the next.

In particular, the following statements, described in Figures 6 and 7, have been defined regarding the different relationships between the degradation mechanisms, associated symptoms and failure modes considered for each component.

Figure 6. Description of the relationships between the degradation mechanisms (A: bearing wear; B: oxidation), failure modes (1: unscheduled shutdown; 2: impossible starting) and symptoms (1: vibrations; 2: temperature) considered in the modeling of the behavior of pumps Po1 and Po2.

Figure 7. Description of the relationships between the degradation mechanisms (C: axis blocking; D: joint wear), failure modes (3: no opening; 4: external leaks) and symptom (3: deposits) considered in the modeling of the behavior of check valves Cl1 and Cl2.

Concerning sensor Ca, only the very rare random occurrence of an electronic failure has been considered, with no symptom nor degradation mechanism.

Then, different tasks were considered to define the preventive maintenance strategies applied to the system, with condition-based maintenance for the pumps and check valves and systematic maintenance for the sensor:

– Pumps Po1 and Po2 degradation can be notified directly by practicing overhauls, specific to each degradation mechanism, or indirectly thanks to external inspections, to detect the symptoms that may have appeared. Then, depending on the detection results, preventive repair can be performed to avoid failure mode occurrences. As previously said, overhauls are very effective detection tasks but engender component scheduled unavailability

for maintenance, whereas external inspections are made without stopping the component but present a risk of error in the detection (non-detection or false alarm). Moreover, in the proposed case study, one of the symptoms can appear due to both degradation mechanisms, which implies another error risk for the repair decision that can be taken after the component inspection: the possible repair of the wrong mechanism, without reducing the failure mode occurrence probability linked to the other mechanism.
– Check valves Cl1 and Cl2 degradation can also be notified directly by practicing overhauls specific to each degradation mechanism. Since one of the degradation mechanisms cannot be detected through an associated symptom, external inspections can only be performed to detect the evolution of the other mechanism, avoiding the performance of the corresponding specific overhaul.
– Sensor Ca failure is supposed to be random and rare, so a systematic repair is performed before its expected occurrence.

Table 2. Basic parameters needed for model simulation.

Degradation mechanism evolution description

Levels                    Basic evolution description    Influencing factors impact
Minimal and maximal       Representation chosen and      Modification of the basic
degradation thresholds    parameter values (e.g.         parameters depending on
                          Weibull law parameters)        the influencing factors

Degradation mechanism consequences description

Levels                    Failure                        Symptoms
For each degradation      Failure modes that can         Symptoms that appear;
level                     occur; failure rates           eventual delay or probability

Maintenance tasks (external inspections, overhauls, tests)

Observations              Effectiveness                  Parameters
Failures, degradations    Error risk (non-detection      Cost, duration, resources
or symptoms observed      and false alarm)

Maintenance tasks (preventive and corrective repairs)

Impacts                   Effectiveness                  Parameters
Degradations or failures  Repair type (AGAN, ABAO        Cost, duration, resources
repaired or reduced       or partial)

Finally, the way the entire system can function or become failed or unavailable for maintenance has been represented by means of classical reliability tools such as fault and event trees. In particular, the two branches composed of a pump and a check valve operate in redundancy, and the switch-over from one to the other is made thanks to the sensor, which detects the failure of the activated branch and leads to the activation of the stand-by one.

Condition-based and systematic maintenance tasks are performed with a given frequency in order to detect and, if necessary, repair component degradations so as to avoid failures. After the activated branch has failed, the other branch is activated so as to make the corrective repair possible without creating additional unavailability.

The system can be unavailable either in a scheduled way, for the performance of a preventive maintenance task, or in an unscheduled way, due to some component failure.

Different strategies, differing essentially in the type of preventive tasks performed on the pumps and check valves, have been simulated, and for each one an evaluation of the relative maintenance costs as defined by Equation 1 has been computed according to the variation of the task periodicity. The objective was

Figure 8. Maintenance global costs for a strategy made of overhauls only, as a function of increasing maintenance task periodicity.

Figure 9. Maintenance global costs for a strategy made of external inspections, as a function of increasing maintenance task periodicity.

Figure 10. Minimal global costs for maintenance strategies made of both overhauls and external inspections. The black bar represents the minimal cost of the overhauls-only strategy; the grey bars represent the minimal costs of the combined overhaul and external inspection strategies.

to compare the strategies in terms of the cost difference induced by preferring one type of detection task to the other, i.e. overhauls versus external inspections, for the preventive maintenance of the pumps and check valves, given the relative characteristics of the tasks.

Figure 8 presents the results obtained for a strategy only made of overhauls for the pumps and check valves, whereas in Figure 9 only external inspections were performed to detect the pumps and check valves degradation.

As is easy to note, even though external inspections are less costly and do not induce system unavailability, the second strategy leads to higher global maintenance costs. That is obviously due to the fact that all the external inspection tasks present non-detection and false alarm error risks, which are even more important regarding degradation mechanism C. It is indeed impossible to detect its evolution through symptom detection, and it has been assumed that external inspection does so with a very poor efficiency, due to the distance between the degradation evolution and the task performance.

Thus, it is interesting to assess the advantage of performing both types of tasks. By so doing, it is possible to control the degradation evolution of the system components indirectly when possible, with external inspections devoted to symptom detection, and also in a more direct way with overhauls. The latter are more efficient in terms of degradation detection and, when performed with a higher periodicity, the global maintenance costs can be reduced.

Figure 10 presents the minimal global costs for different strategies composed as follows:

– external inspections supported by overhauls to detect degradation mechanisms A, B and D, i.e. those detectable through some symptom observation, with the overhauls performed with a higher periodicity than the external inspections;
– overhauls to detect the evolution of degradation mechanism C, since it is not efficient to observe it through external inspections.

Again, the strategies differed in terms of task periodicity and are here compared to the minimal cost

corresponding to the strategy composed only of overhauls, to show that some appropriate combinations make it possible to reduce the global maintenance costs.


6 CONCLUSIONS

The objective of the work presented here is to model and to simulate maintenance programs in order to provide quantitative results which could support choices between different maintenance tasks and frequencies. The approach is dedicated to multi-component systems and to RCM-type complex maintenance strategies. This is done through a structured and modular model which allows taking into consideration dependences between system components due either to failures or to operating and environmental conditions. Maintenance effectiveness is modeled to represent the ability of preventive actions to detect component degradations, and the ability of both preventive and corrective actions to modify and keep under control the degradation mechanism evolution in order to avoid failure occurrence. The model also takes into account resource sharing problems, such as repair teams or spare parts stocks. A case study regarding the modeling of a turbo-pump lubricating system shows how the approach can efficiently be used to compare various maintenance strategies.


REFERENCES

Barata, J., Guedes Soares, C., Marseguerra, M. & Zio, E. 2002. Simulation modeling of repairable multi-component deteriorating systems for on-condition maintenance optimisation. Reliability Engineering and System Safety 76: 255–267.
Bérenguer, C., Grall, A., Zille, V., Despujols, A. & Lonchampt, J. 2007. Modeling and simulation of complex maintenance strategies for multi-component systems. In MM2007 Proceedings, Maintenance Management Conference, Roma, 27–28 September 2007.
Dekker, R. 1996. Applications of maintenance optimization models: a review and analysis. Reliability Engineering and System Safety 51(3): 229–240.
Dutuit, Y., Chatelet, E., Signoret, J.P. & Thomas, P. 1997. Dependability modeling and evaluation by using stochastic Petri Nets: Application to two test cases. Reliability Engineering and System Safety 55: 117–124.
Marseguerra, M. & Zio, E. 2000. Optimizing maintenance and repair policies via a combination of genetic algorithms and Monte Carlo simulation. Reliability Engineering & System Safety 68(1): 69–83.
Moustafa, M.S., Abdel Maksoud, E.Y. & Sadek, S. 2004. Optimal major and minimal maintenance policies for deteriorating systems. Reliability Engineering and System Safety 83(3): 363–368.
Simeu-Abazi, Z. & Sassine, C. 1999. Maintenance integration in manufacturing systems by using stochastic Petri Nets. International Journal of Production Research 37(17): 3927–3940.
Valdez-Flores, C. & Feldman, R.M. 1989. A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Naval Research Logistics Quarterly 36: 419–446.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Multiobjective optimization of redundancy allocation in systems with imperfect repairs via ant colony and discrete event simulation

I.D. Lins & E. López Droguett
Department of Production Engineering, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil

ABSTRACT: Redundancy Allocation Problems (RAPs) are among the most relevant topics in reliable system design and have received considerable attention in recent years. However, the proposed models are usually built on simplifying assumptions about system reliability behavior that are hardly met in practice. Moreover, the optimization of more than one objective is often required, for example, to maximize system reliability/availability and minimize system cost. In this context, a set of nondominated solutions (system designs with compromise values for both objectives) is of interest. This paper presents an Ant Colony Optimization (ACO) approach for multiobjective optimization of availability and cost in RAPs considering repairable systems subjected to imperfect repairs handled via Generalized Renewal Processes. The dynamic behavior of the system is modeled through Discrete Event Simulation. The proposed approach is illustrated by means of an application example involving repairable systems with series-parallel configuration.

1 INTRODUCTION

Suppose that a system is composed of several subsystems in series and that each of them may have a number of redundant components in parallel. The determination of the quantity of redundant components in order to maximize system reliability characterizes a redundancy allocation problem.

Increasing the redundancy level usually improves system reliability, but it also increases the associated system costs. With the purpose of incorporating cost limitations, RAPs are often modeled as single-objective optimization problems that maximize system reliability subject to cost constraints. However, in real circumstances, one may wish to consider the associated costs not as a constraint but as an additional objective to be attained. In these situations, in which multiple objectives are taken into account, a multiobjective optimization approach is necessary in modeling the RAP.

RAPs are essentially combinatorial, since the aim is to find optimal combinations of the available components to construct a system. The complexity of such problems may increase considerably as the number of components grows, leading to situations where the application of classical methods such as integer programming (Wolsey 1998) is prohibitive in light of the time required to provide results. Alternatively, heuristic methods such as Genetic Algorithms (GAs—Goldberg 1989, Michalewicz 1996) and Ant Colony Optimization (ACO—Dorigo & Stützle 2004) can be adopted. Both GA and ACO are stochastic approaches inspired by nature. GA mimics the natural evolution process and has been widely applied in solving RAPs (Cantoni et al. 2000, Busacca et al. 2000, Taboada & Coit 2006, Taboada et al. 2007). However, GA demands a substantial computational effort when considering large-scale problems. ACO, in turn, is based on the behavior of real ants and was proposed by Dorigo et al. (1996) to solve hard combinatorial optimization problems.

In a number of practical situations, components and systems are repairable, that is, after a failure they undergo maintenance actions that do not consist in their entire replacement (Rigdon & Basu 2000). Moreover, such maintenance actions often restore the system to an intermediate condition between "as good as new" and "as bad as old", that is, imperfect repairs are performed. In order to model failure-repair processes of systems subjected to imperfect repairs, Generalized Renewal Processes (GRP—Kijima & Sumita 1986) can be applied.

The consideration of repairable systems composed of components under deterioration processes and subjected to imperfect repairs (with significant downtimes) provides a realistic handling of RAPs. On the other hand, the introduction of such real characteristics makes the evaluation of the system failure-repair process analytically intractable. Therefore, Discrete Event Simulation (DES—Banks et al. 2001) can be applied in order to obtain system performance measures such as system availability. Basically, DES

attempts to imitate system behavior by randomly generating discrete events (e.g. failures) during simulation time (mission time). In addition, the flexibility of DES permits the introduction of many real aspects of the system, such as taking into account the availability of maintenance resources during mission time.

1.1 Previous works

Shelokar et al. (2002) exemplify the application of ACO coupled with a strength Pareto-fitness assignment to handle multiobjective problems in reliability optimization. In the context of electrical systems, Ippolito et al. (2004) propose a multiobjective ACO to find the optimal planning strategy for electrical distribution systems. The authors use as many ant colonies as there are objectives, and the algorithm is divided in two different phases: a forward phase, in which each ant colony attempts to optimize a separate objective, and a backward phase, in which nondominated solutions are taken into account. ACO was applied by Liang & Smith (2004) to solve a RAP with the single objective of maximizing system reliability subject to cost and weight constraints. They assume nonrepairable components with constant reliability values over mission time. Nahas & Nourelfath (2005) use ACO to find the best set of technologies to form a series system with the aim of obtaining maximal reliability given budget constraints. As in Liang & Smith (2004), reliability values are fixed during mission time. Zhao et al. (2007) use a multiobjective ant colony system approach in order to maximize reliability in a RAP with a series-parallel system formed by k-out-of-n: G subsystems. It is important to emphasize that the authors do not obtain a set of nondominated solutions, since they only aim to maximize system reliability subject to cost and weight constraints. Nevertheless, the features of each component—reliability (r), cost (c) and weight (w)—are summarized in the quotient r/(c + w), which is used as problem-specific heuristic information during the execution of ACO.

The majority of the works in the literature regarding ACO for solving RAPs do not give enough attention to the system reliability aspect. RAPs are usually built on simplifying assumptions about the reliability behavior of the system and of its components in order to facilitate the problem modeling and solution. These simplifications (e.g. considering nonrepairable components with constant reliability values over mission time) are often unrealistic and do not permit the evaluation of some important real-world problems. Indeed, models for repairable systems in RAPs using GA as the optimization algorithm assume that components have constant failure rates, i.e., that they have an underlying Exponential distribution to model the times between failures (see, for example, Busacca et al. (2001), Chiang & Chen (2007), Juang et al. (2008)). This hypothesis does not allow for the incorporation of the effects of component deterioration and may lead to gross estimation errors of some important performance measures such as system availability (Bowles 2002). Although Cantoni et al. (2000) tackle repairable systems with imperfect repairs, they consider constant failure rates from the moment that a component returns into operation until the occurrence of the very next failure. Moreover, they use a Brown-Proschan (B-P) model of imperfect repairs, which is a specific type of failure intensity model (Doyen & Gaudoin 2004).

In a recent paper, Kuo & Wan (2007) emphasize that optimization approaches other than GA, such as ACO, may be investigated and more widely applied in solving RAPs. Besides, they also stress that non-renewable multi-state systems ought to be taken into account in RAP approaches.

In this paper, as an attempt to join ACO and non-renewable systems in RAPs, a multiobjective ACO algorithm coupled with DES is provided to solve RAPs in the context of repairable systems with components subjected to imperfect repairs. The two selected objectives are the system average availability and the total system cost. DES is used to obtain the former objective and also some parameters required for the cost calculation over the established mission time. The system cost is composed of the components' acquisition and operating costs, corrective maintenance costs and costs incurred due to system unavailability. A limited amount of maintenance resources is considered to give maintenance support to the entire system. As soon as a component fails, a maintenance resource is required and, if it is available, a delay time is generated according to an Exponential distribution. Each component is supposed to have times to failure (TTF) modeled by Weibull distributions. The times to repair (TTR), in turn, are assumed to be exponentially distributed with different means.

The desired outcome is a set of nondominated system designs. The decision maker can then have an idea of the system average availability during mission time and of the related costs that might have been spent by the end of that period. Moreover, note that the system life cycle is thus taken into account, and it is not merely an issue of obtaining compromise designs at the system acquisition moment.

The paper organization is as follows. Section 2 introduces some concepts related to GRP and to multiobjective optimization. Section 3 details the use of multiobjective ACO, whereas Section 4 describes the DES approach considered in this work. In Section 5 an example application is discussed and the following section gives some concluding remarks.

2 PRELIMINARIES

2.1 Generalized renewal processes

Failure-repair processes of repairable items are often modeled according to either Nonhomogeneous Poisson Processes (NHPP) or Renewal Processes (RP). Both of them are counting processes, meaning that, in the context of reliability engineering, the times to repair are negligible compared to the component operational time. NHPP assume that repairs restore the system to an operational state with the same condition it had just before failure, that is, corrective maintenance corresponds to minimal repairs. On the other hand, RP assume that repairs bring the system back to a like-new condition (a renewal occurs), and then corrective maintenance corresponds to perfect repairs. Moreover, in RP, times between failures (TBF) are independent and identically distributed (i.i.d.) with an arbitrary density function, whereas in NHPP they are neither independent nor identically distributed.

Nevertheless, the hypotheses of minimal or perfect repairs required by NHPP or RP, respectively, are often not realistic. In practical situations, maintenance actions usually restore the system to a condition between these two extremes, that is, they are imperfect repairs and the component returns into operation in a condition better than old and worse than new. To overcome this limitation, Generalized Renewal Processes (GRP) can be used to model failure-repair processes of components subjected to imperfect repairs. GRP are a sort of virtual age model, in which repair actions result in a reduction of the component's real age. A parameter q (effectiveness or rejuvenation parameter) is introduced in the model and it assumes values related to the maintenance action efficacy (see Table 1). The common values of q are in [0, 1].

The rejuvenation parameter is used in the calculation of the component virtual age (Vn), defined as follows:

Vn = Vn−1 + q·Xn   (1)

where Xn is the time between the (n − 1)th and the nth failure. By definition, V0 = 0. Thus, the expansion of Equation 1 yields:

Vn = q · Σ_{i=1}^{n} Xi   (2)

where Σ_{i=1}^{n} Xi is the real component age.

The definition of virtual age presented in Equations 1 and 2 is in accordance with the Kijima Type I model (Kijima & Sumita 1986, Yañes et al. 2002), which assumes that the repair acts just on the very last failure and compensates only the damage accumulated in the interval between the (n − 1)th and nth failures. Hence, it reduces only the additional age Xn.

The cumulative distribution of the ith TBF (xi) can be calculated by a probability distribution function conditioned on the (i − 1)th component virtual age, as follows:

F(xi | vi−1) = P(X ≤ xi | V > vi−1) = [F(vi−1 + xi) − F(vi−1)] / [1 − F(vi−1)]   (3)

If the TBF are supposed to have a Weibull distribution, Equation 3 becomes

F(xi | vi−1) = 1 − exp[(vi−1/α)^β − ((vi−1 + xi)/α)^β]   (4)

When the TTR are not negligible compared to the component operational time, they may be considered in the evaluation of its failure-repair process. In this situation, there are two stochastic processes—one regarding the failure process and the other related to the repair process. The superimposing of these two stochastic processes results in an alternating process that characterizes the component state (either operational or unavailable due to a repair action). However, when the system is composed of several components, each one having an alternating process, the analytical handling of the entire system failure-repair process becomes infeasible. In these cases, DES may be used as an alternative to overcome such difficulties.
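For illustration, the inverse-transform sampling of successive TBF implied by Equations 1-4 can be sketched as follows (a minimal sketch; the function and parameter names are ours, not from the paper):

```python
import math
import random

def sample_kijima_type1_tbf(alpha, beta, q, n_failures, rng=None):
    """Sample successive times between failures (TBF) of a component whose
    failure process follows a GRP (Kijima Type I) with Weibull underlying TTF.

    Inverting Equation 4: for virtual age v and u ~ U(0, 1),
    x = alpha * ((v/alpha)**beta - ln(1 - u))**(1/beta) - v.
    """
    rng = rng or random.Random(42)  # fixed seed only for reproducibility
    v = 0.0                         # virtual age, V0 = 0
    tbfs = []
    for _ in range(n_failures):
        u = rng.random()
        # inverse of the conditional Weibull CDF of Equation 4
        x = alpha * ((v / alpha) ** beta - math.log(1.0 - u)) ** (1.0 / beta) - v
        tbfs.append(x)
        v += q * x                  # Kijima Type I update (Equation 1)
    return tbfs
```

With q = 0 the draws reduce to i.i.d. Weibull samples (perfect repair), and with q = 1 the virtual age equals the real age (minimal repair), reproducing the classification of Table 1.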
Table 1. Repair classification according to values of parameter q.

Value        Repair type

q = 0        Perfect
0 < q < 1    Imperfect
q = 1        Minimal

2.2 Multiobjective optimization

The general formulation of a multiobjective optimization problem is as follows:

Max z = [f1(x), f2(x), . . ., fk(x)]   (5)

Subject to

gi(x) = 0, i = 1, . . ., p   (6)
hi(x) ≤ 0, i = p + 1, . . ., m   (7)

where z is the vector formed by the k objective functions, x is the n-dimensional vector of decision variables, p is the number of equality constraints gi(x) and m − p is the quantity of inequality constraints hi(x). Frequently, in the multiobjective context, a unique solution that optimizes all objectives is very difficult to find or does not exist. Thus, a set of compromise solutions among objectives, i.e. nondominated solutions, may be sought. A solution is said to be nondominated if, for all objectives, it has a performance at least as great as the performances of the other solutions and, for at least one of the objectives, its performance overcomes the performance of the others. In mathematical terms:

x1 ≻ x2 ⇔ fh(x1) ≥ fh(x2), ∀h and fh(x1) > fh(x2), for some h   (8)

where ≻ denotes that x1 dominates x2 (i.e., x1 is nondominated) considering a maximization problem. Otherwise, if a minimization problem is taken into account, the symbols ≥ and > must be replaced by ≤ and <, respectively. Once the set of nondominated solutions (also known as the Pareto front) is obtained, the decision maker can choose any of the solutions to be implemented according to his preferences.

Optimization methods usually handle multiobjective problems by transforming them into a single-objective problem and then applying classical mathematical programming methods such as linear and nonlinear programming (Luenberger 1984). The Weighted Sum Method and the ε-Perturbation Method are examples of traditional multiobjective approaches (Coello et al. 2002, Deb 1999). However, a significant drawback in applying those methods is the fact that they have to be executed several times in order to obtain different nondominated solutions. Moreover, objective functions that lack features such as continuity and differentiability render these classical methods useless.

Alternatively, stochastic optimization methods based on nature, such as ACO, can be applied to multiobjective problems given their flexibility in handling each objective separately. Since these methods consider various potential solutions simultaneously, a number of Pareto solutions can be reached in a single execution of the algorithm.

3 MULTIOBJECTIVE ACO

The multiobjective ACO (multiACO) put forward in this work is a problem-specific algorithm based on the single-objective Ant System for the Traveling Salesman Problem (TSP) proposed by Dorigo et al. (1996). The following subsections are dedicated to the discussion of the proposed multiACO.

3.1 Input data

Initially it is necessary to identify the number of subsystems in series (s) and the maximum (ni,max) and minimum (ni,min) number of redundant components in the ith subsystem, i = 1, . . ., s. Each subsystem can be composed of different technologies, which may have different reliability and cost features. Hence, the quantity of available component types (cti, i = 1, . . ., s) that can be allocated in each subsystem is also required. With this information, the ants' environment is constructed.

Moreover, information about the components' TTF and TTR distributions and the components' related costs needs to be specified. The ACO-specific parameters nAnts, nCycles, α, β, ρ and Q are also required; they are discussed in Subsection 3.3.

3.2 Environment modeling

The environment to be explored by the ants is modeled as a directed graph D = (V, A), where V is the vertex set and A is the arc set. D has an initial vertex (IV) and a final vertex (FV) and is divided in phases that are separated by intermediate vertices. An intermediate vertex indicates the ending of a phase and also the beginning of the subsequent phase and is therefore common to adjacent phases. In this work, a phase is defined as the representation of a subsystem, and vertices within a phase represent either its extremities or the possible components to be allocated in parallel in such a subsystem. The quantity of vertices of the ith phase is equal to ni,max · cti plus the two vertices indicating its beginning and its ending.

Vertices are connected by arcs. Firstly, consider a problem with a unique subsystem. Hence, there is only one phase with IV and FV, and intermediate vertices are not necessary. IV is linked to all vertices within the existent phase (except FV), which in turn are connected with each other and also with FV. Now suppose a problem involving two subsystems. Then, an intermediate vertex plays the role of FV for the first phase and also the role of IV for the vertices within the second phase. All vertices in the second phase (except the vertex indicating its beginning) are linked to FV. These instructions can be followed in the case of s subsystems. Arcs linking vertices within the ith phase belong to such phase.

For the sake of illustration, Figure 1 shows an example of a directed graph representing an ants' environment, where s = 2, n1,min = 1, n1,max = 2, ct1 = 2, n2,min = 1, n2,max = 1, ct2 = 4.

Figure 1. Example of directed graph representing an ants' environment.

3.3 Transition probabilities and pheromone updating

Suppose that an ant k is in vertex v at time t. Then ant k must choose a vertex w to visit at time t + 1. Several limitations are imposed on this selection: (i) vertex w must not have been visited by ant k previously; (ii) if ant k is in phase i and has not visited at least ni,min vertices yet, w cannot be the very next intermediate vertex (or FV, if phase i is the last one); (iii) if ant k has already visited ni,max vertices, then w must be the subsequent intermediate vertex (or FV, if phase i is the last one). Besides, let Wk be the set of allowed vertices where ant k can be at t + 1. Hence, the probability of ant k going from v to w is given by the following expression (Dorigo & Stützle 2004):

p^k_vw = [τvw(t)]^α · [ηvw]^β / Σ_{u∈Wk} [τvu(t)]^α · [ηvu]^β   (9)

where τvw(t) is the pheromone quantity on the arc linking vertices v and w at time t, ηvw is the visibility of the same arc, α is the relative importance of the pheromone quantity, and β is the relative importance of ηvw.

In this work, the amount of pheromone is initially set to 1/(ni,max · cti) for each arc within the ith phase. A cycle is finished when all ants reach FV. When that occurs, the pheromone quantity on the arcs is updated (i.e., multiACO is an ant-cycle algorithm—see Dorigo et al. (1996) and Dorigo & Stützle (2004)). Let m be the highest number of visited vertices in a cycle. Then, at time t + m, the pheromone quantity on each arc is updated in accordance with the rule:

τvw(t + m) = (1 − ρ) · τvw(t) + Δτvw   (10)

where ρ is the pheromone evaporation between times t and t + m. The quantity Δτvw is obtained by

Δτvw = Σ_{k=1}^{nAnts} Δτvw,k   (11)

in which Δτvw,k is the pheromone laid by ant k on the arc linking vertices v and w between times t and t + m, and nAnts is the total number of ants. If ant k has not crossed such arc during the cycle, then Δτvw,k = 0. Otherwise, Δτvw,k is proportional to the entire solution contribution (SC). Hence, in the latter case, Δτvw,k = Q · SC, in which Q is a constant. In this work, SC is defined as a quotient involving the values obtained by ant k for the handled objectives:

SC = Cmax · Ak/Ck   (12)

where Cmax is the maximum system cost obtained in the current cycle and Ak and Ck are, respectively, the system mean availability and the system total cost met by ant k. Cmax is used in order to bring the denominator in SC into (0, 1].

Visibility (ηvw) is problem-specific heuristic information that does not change during algorithm execution. Arcs whose ending vertex is an intermediate vertex or FV have their visibilities set to 1. Conversely, if w represents a component, then arcs toward w have their visibilities defined as follows:

ηvw = [MTTFw/(MTTFw + MTTRw)] · (cai,max/caw)   (13)

where MTTFw and MTTRw are, respectively, the mean time to failure and the mean time to repair of the component represented by w. In addition, cai,max is the maximum acquisition cost observed among the available components of the ith subsystem (phase) and caw is the acquisition cost of the component represented by w. Similarly to the case of SC, cai,max is used in Equation 13 to bring the cost measure into (0, 1]. Note that the first factor in Equation 13 is the mean availability of the considered component, provided that it was renewed after a corrective maintenance action. In this paper, however, components are subjected to imperfect repairs and such an interpretation only holds up to the first failure. Besides, instead of involving other types of costs, the factor regarding cost only uses the acquisition cost. This is due to the fact that the acquisition cost does not change over the component lifetime. Nevertheless, Equation 13 is still applied as an attempt to locally guide ants to vertices representing components with higher mean availabilities and lower costs.
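Equations 9-12 can be sketched as follows (a minimal sketch with data structures of our own choosing; tau and eta map arcs, i.e. vertex pairs, to values):

```python
import random

def choose_next_vertex(v, allowed, tau, eta, alpha=1.0, beta=1.0, rng=None):
    """Pick the next vertex w by roulette-wheel selection (Equation 9)."""
    rng = rng or random.Random()
    weights = [(tau[(v, w)] ** alpha) * (eta[(v, w)] ** beta) for w in allowed]
    r = rng.random() * sum(weights)
    acc = 0.0
    for w, wt in zip(allowed, weights):
        acc += wt
        if acc >= r:
            return w
    return allowed[-1]

def update_pheromone(tau, paths, avails, costs, rho=0.5, Q=1.0):
    """Ant-cycle pheromone update (Equations 10-12)."""
    c_max = max(costs)                      # maximum system cost in the cycle
    for arc in tau:
        tau[arc] *= (1.0 - rho)             # evaporation, Equation 10
    for path, a_k, c_k in zip(paths, avails, costs):
        sc = c_max * a_k / c_k              # solution contribution, Equation 12
        for arc in zip(path, path[1:]):     # arcs crossed by ant k
            tau[arc] += Q * sc              # deposit, Equations 10-11
    return tau
```

Arcs not crossed by any ant receive only the evaporation term, as prescribed by Δτvw,k = 0.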

3.4 Dominance evaluation

The desired result of multiACO is a set of nondominated solutions (N), which may contain all compromise system designs found during the algorithm run. Therefore each ant is evaluated for each objective and has a number of associated objective values equal to the quantity of objectives (each objective is treated separately).

The set N is updated at the end of each cycle. Firstly, however, a set of candidate solutions (CS) to be inserted in N is obtained by assessing the dominance relation among the ants within the cycle under consideration. If ant k is dominated by other ants in the current cycle, then it is not introduced in CS. Otherwise, if ant k is nondominated in relation to all ants in the present cycle, then it is inserted in CS. Next, it is necessary to evaluate the dominance relation of each element in CS with respect to the solutions already stored in N. Suppose that ant k is in CS; then: (i) if ant k is dominated by elements in N, ant k is ignored; (ii) if ant k dominates solutions in N, all solutions dominated by ant k are eliminated from N and a copy of ant k is inserted in N.
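The update of N described above can be sketched as follows (a minimal sketch assuming the two objectives of this paper, availability to be maximized and cost to be minimized; solutions are (availability, cost) tuples):

```python
def dominates(a, b):
    """True if design a dominates design b (Equation 8, adapted so that the
    first objective is maximized and the second minimized)."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def update_nondominated_set(archive, cycle_ants):
    """One end-of-cycle update of the nondominated set N (Subsection 3.4)."""
    # candidate solutions CS: ants nondominated within the current cycle
    cs = [s for s in cycle_ants
          if not any(dominates(o, s) for o in cycle_ants if o != s)]
    for s in cs:
        if any(dominates(n, s) for n in archive):
            continue                                   # (i) ignore ant k
        # (ii) drop solutions dominated by ant k, then insert a copy of it
        archive = [n for n in archive if not dominates(s, n)]
        archive.append(s)
    return archive
```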

3.5 The algorithm


Figure 2 shows the pseudocode of the proposed multiACO. The required input data was discussed in Subsection 3.1.

4 DISCRETE EVENT SIMULATION

DES attempts to mimic the behavior of real systems by randomly generating discrete events during simulation time (Banks et al. 2001). The inherent flexibility of DES permits the introduction of several real-world aspects in reliability problem modeling.

In this paper, DES is coupled with the multiobjective ACO described in the previous section at the objectives evaluation stage. Each system design represented by an ant has some of its dynamic features (e.g. mean availability, mean number of repairs per component, among others) assessed via DES.

Figure 2. Pseudocode of the proposed multiobjective ACO.

Suppose that ant k has its associated system design evaluated over a predefined mission time (tM). In DES, firstly tM is divided into n equal steps. For each step, a number of events (e.g. failures, start/end of repair) are generated for every component. If a component fails, it is necessary to assess whether there are sufficient maintenance resources to perform the repair. If so, the failed component must wait a certain amount of time until such resources are ready to initiate the repair. Otherwise, if no resources are available, the failed component is inserted in a queue in order to wait until its required resources become available. After a repair, the component is immediately brought back into operation. Moreover, at the end of a step, the system state (either available or unavailable) is evaluated. Since the entire system state depends on the state of its components, it is necessary to construct the relationship among components in order to obtain the system logic. In the implemented DES algorithm, such construction is done by means of a BDD (Binary Decision Diagram—see Rauzy (2001)). These stochastic transitions are replicated several times for every step until tM is reached. Figure 4 presents the coupling of multiACO + DES.

The system availability is estimated via DES as follows. Let zik [=1 (available); =0 (unavailable)],

zk [=1 (available); =0 (unavailable)] and tk be the component state, the system state and the time at the kth step, respectively. Moreover, let ck be a counter of the number of times that the system is available by the end of the kth step, hi(·) and mi(·) be the time to failure and repair time probability density functions of component i, and A(tk) be the system availability at time tk. If nC is the number of system components, a DES iteration can be written in pseudocode as shown in Figure 3.

Figure 3. Pseudocode for the system availability estimation.

In a nutshell, the algorithm described above may be thought of in the following way. While the process time ti is lower than the step ending time tk, the following steps are accomplished: (i) the time to failure τi of component i is sampled from hi(·); (ii) ti is increased by τi; (iii) the condition (ti ≥ tk) means component i ends the kth step in an available condition (zik = 1); otherwise, component i failed before the kth step; (iv) in the latter case, the repair time xi is sampled from mi(·) and ti is increased by it; if (ti ≥ tk) the component i ends the kth step under a repair condition and therefore unavailable (zik = 0); (v) upon assessing the states of the nC components at the kth step, the system state zk is assessed via the corresponding system BDD; (vi) finally, the counter ck is increased by zk. The aforementioned procedure is repeated M times, a sufficiently large number of iterations, and then the availability measure A(tk) at the kth step is estimated by dividing the value ck by M.

A number of random variables are obtained via DES and fed back to multiACO with the aim of calculating the objectives (that is, the system mean availability and the system total cost): the system operating/unavailable time and each component's number of repairs and operating time.

5 EXAMPLE APPLICATION

This section presents an example application in which the desired result is a set of nondominated solutions representing alternative designs for a system subjected to imperfect repair. The two objectives are the mean availability, which is obtained by means of DES, and the system total cost. The latter objective is composed of the acquisition (CA), operating (CO) and corrective maintenance (CCM) costs of the components and also of the costs incurred due to system unavailability (CU). These costs are defined as follows:

CA = Σ_{i=1}^{s} Σ_{j=1}^{mi} caij · xij   (14)

where s is the number of subsystems, mi is the quantity of components in subsystem i, caij is the acquisition cost of the jth component type of the ith subsystem and xij is the quantity of that component.

CO = Σ_{i=1}^{s} Σ_{j=1}^{mi} Σ_{k=1}^{xij} coij · toijk   (15)

where coij is the operating cost per unit time for the jth component type of the ith subsystem and toijk is the operating time of the kth copy of that component.

CCM = Σ_{i=1}^{s} Σ_{j=1}^{mi} Σ_{k=1}^{xij} ccmij · nijk   (16)

where ccmij is the corrective maintenance cost for the jth component type of subsystem i and nijk is the quantity of repairs to which the kth component is subjected over mission time.

CU = cu · tu   (17)

where cu is the cost per unit time related to system unavailability and tu is the system unavailable time during the mission time. Hence, C = CA + CO + CCM + CU, in which C is the system total cost. The variables toijk (Equation 15), nijk (Equation 16) and tu (Equation 17) are obtained via DES of the system dynamics.

Moreover, a repairable system with 3 subsystems in series, S1, S2 and S3, is considered. The minimum and maximum number of components in parallel and the quantity of different technologies for each subsystem are listed in Table 2.
Figure 4. Flowchart of the used multiACO + DES.

Table 2. Subsystem features for the example application.

      ni,min   ni,max   cti

S1    1        3        5
S2    1        6        5
S3    1        4        3

Components are supposed to have their failure process modeled according to a GRP. More specifically, a Kijima Type I model is assumed, with TTF given by Weibull distributions with different scale (α, in time units), shape (β) and rejuvenation (q) parameters. On the other hand, TTR are exponentially distributed with different parameters (λ) per component type. In addition, components are subjected to imperfect repairs. As soon as a component fails, the repair does not start immediately and it is necessary to check the availability of the required maintenance resources. If resources are available, a random time representing the logistic time for resource acquisition is generated according to an Exponential distribution with λ = 1, i.e., the failed component must wait up to such time to go under repair. Otherwise, the failed component waits in a queue for the required maintenance resources.

Component characteristics are presented in Table 3. The components' operating and corrective maintenance costs are taken as 2% and 10% of the acquisition cost, respectively. Moreover, cu = 500 monetary units.

The parameters for the multiACO were nAnts = 100, nCycles = 50, α = 1, β = 1, ρ = 0.5 and Q = 1. System designs were evaluated over a mission time equal to 365 time units. A set of 47 nondominated solutions was obtained, as shown in Figure 5. Some of the alternative system designs indicated in Figure 5 are presented in Figure 6.

5.1 Return of investment analysis

All system designs in the set of nondominated solutions are optimal in accordance with the multiobjective approach. However, the decision maker may select only one system design to be implemented. In order to guide such a selection, he can make a return-of-investment analysis, that is, observe the gain in system mean availability in relation to the required investment in the corresponding system design. Mathematically, the return of investment (ROI) is:

ROI = (Ak − Ak−1)/(Ck − Ck−1)   (18)

Table 3. Components features for example application.

     #   fTTF               fTTR      ca     co    ccm

S1   1   Wei(40, 1.8, 0.5)  Exp(1.5)  9900   198   990
     2   Wei(38, 1.6, 0.5)  Exp(1.2)  9400   188   940
     3   Wei(34, 1.5, 0.5)  Exp(0.9)  8500   170   850
     4   Wei(31, 1.4, 0.5)  Exp(1.0)  8800   176   880
     5   Wei(30, 1.5, 0.5)  Exp(1.1)  8200   164   820
S2   1   Wei(20, 1.4, 0.4)  Exp(0.9)  7000   140   700
     2   Wei(27, 1.5, 0.4)  Exp(1.2)  8700   174   870
     3   Wei(22, 1.2, 0.4)  Exp(0.8)  7800   156   780
     4   Wei(29, 1.4, 0.4)  Exp(1.1)  9100   182   910
     5   Wei(23, 1.4, 0.4)  Exp(1.1)  7500   150   750
S3   1   Wei(19, 1.2, 0.6)  Exp(0.5)  5500   110   550
     2   Wei(25, 1.8, 0.6)  Exp(0.6)  5800   116   580
     3   Wei(22, 1.4, 0.6)  Exp(0.6)  5200   104   520
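The fTTF entries Wei(α, β, q) in Table 3 can be sampled under the Kijima Type I assumption by inverse transform from the Weibull distribution conditional on survival past the current virtual age (vn = vn−1 + q · xn ). The sketch below is illustrative only: the function name is hypothetical, and the repair durations and logistic waiting times of the paper's DES are deliberately omitted.

```python
import math, random

random.seed(1)

def kijima1_failure_times(alpha, beta, q, horizon):
    """Successive failure instants under a Kijima Type I virtual-age model
    with underlying Weibull(scale=alpha, shape=beta) TTF and rejuvenation q."""
    t, v, failures = 0.0, 0.0, []
    while True:
        u = random.random()
        # Weibull TTF conditional on survival past virtual age v,
        # from inverting S(v + x)/S(v) = u:
        x = alpha * ((v / alpha) ** beta - math.log(u)) ** (1.0 / beta) - v
        t += x
        if t > horizon:
            return failures
        failures.append(t)
        v += q * x   # Kijima I: only a fraction q of the last sojourn ages the unit

# component type 1 of subsystem S1 in Table 3: Wei(40, 1.8, 0.5)
times = kijima1_failure_times(alpha=40.0, beta=1.8, q=0.5, horizon=365.0)
```

With q = 0 the sampler reduces to a renewal process ("as good as new"); with q = 1 the virtual age equals the real age ("as bad as old").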

Figure 6. System designs for selected solutions in Figure 5.

Table 4. ROI of some selected system designs.

Solution   Mean availability   Cost     ROI

B          0.763128            348357   7.555 × 10−6
C          0.816275            355391
D          0.990152            773650   1.200 × 10−8
E          0.991839            906926
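Equation 18, applied to the (mean availability, cost) pairs in Table 4, reproduces the tabulated ROI for the B→C step; a minimal sketch (the helper name is illustrative):

```python
def roi(a_prev, c_prev, a_next, c_next):
    """Return of investment between adjacent Pareto solutions (Equation 18)."""
    return (a_next - a_prev) / (c_next - c_prev)

roi_bc = roi(0.763128, 348357, 0.816275, 355391)  # ~7.555e-6, as in Table 4
roi_de = roi(0.990152, 773650, 0.991839, 906926)  # far smaller gain per unit cost
```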

where k and k − 1 are adjacent solutions in the Pareto front, and Ak , Ak−1 , Ck , Ck−1 are their respective system mean availabilities and system costs.
As an example, Table 4 presents the return of investment of solutions B and C, and of solutions D and E. Note that the investment on system design is 7,034 monetary units to go from solution B to C, for a gain of about 0.05 in mean availability, whereas such investment is 133,276 monetary units to gain about 0.001 in mean availability when upgrading from configuration D to E.

Figure 5. Nondominated set for example application.

6 CONCLUDING REMARKS

Although there are some works in the literature applying ACO in RAPs, they often make simplifications concerning system reliability behavior that are usually not satisfied in practical situations. Therefore this paper presented an attempt to tackle RAPs considering the more realistic repairable system behavior consisting of imperfect repairs. This was achieved by coupling multiobjective ACO and DES. In this context, the dynamic behavior of potential system designs was evaluated by means of DES, providing the decision maker a better comprehension of the costs incurred during mission time due to different mean availability values.
The proposed multiobjective ACO algorithm can certainly be improved by applying more sophisticated pheromone updating rules and by coupling it with a local search algorithm with the aim of obtaining more accurate solutions. Moreover, some improvements can be made in the reliability modeling. For example, the amount of maintenance resources could itself be a decision variable, and hence the decision maker could obtain compromise system designs taking into consideration the required quantity of maintenance resources over the mission time.

REFERENCES

Banks, J., Carson, J.S., Nelson, B.L. & Nicol, D.M. 2001. Discrete event system simulation. Upper Saddle River: Prentice Hall.

Bowles, J.B. 2002. Commentary—caution: constant failure-rate models may be hazardous to your design. IEEE Transactions on Reliability 51(3): 375–377.
Busacca, P.G., Marseguerra, M. & Zio, E. 2001. Multiobjective optimization by genetic algorithms: application to safety systems. Reliability Engineering & System Safety 72: 59–74.
Cantoni, M., Marseguerra, M. & Zio, E. 2000. Genetic algorithms and Monte Carlo simulation for optimal plant design. Reliability Engineering & System Safety 68: 29–38.
Chiang, C.-H. & Chen, L.-H. 2007. Availability allocation and multiobjective optimization for parallel-series systems. European Journal of Operational Research 180: 1231–1244.
Coello, C.A.C., Veldhuizen, D.A.V. & Lamont, G.B. 2002. Evolutionary algorithms for solving multiobjective problems. New York: Kluwer Academic.
Deb, K. 1999. Evolutionary algorithms for multicriterion optimization in engineering design. In: Proceedings of Evolutionary Algorithms in Engineering and Computer Science (EUROGEN'99).
Dorigo, M., Maniezzo, V. & Colorni, A. 1996. Ant system: optimization by cooperating agents. IEEE Transactions on Systems, Man and Cybernetics 26(1): 29–41.
Dorigo, M. & Stützle, T. 2004. Ant colony optimization. Massachusetts: MIT Press.
Doyen, L. & Gaudoin, O. 2004. Classes of imperfect repair models based on reduction of failure intensity or virtual age. Reliability Engineering & System Safety 84: 45–56.
Goldberg, D.E. 1989. Genetic algorithms in search, optimization, and machine learning. Reading: Addison-Wesley.
Ippolito, M.G., Sanseverino, E.R. & Vuinovich, F. 2004. Multiobjective ant colony search algorithm for optimal electrical distribution system strategical planning. In: Proceedings of 2004 IEEE Congress on Evolutionary Computation. Piscataway, NJ.
Juang, Y.-S., Lin, S.-S. & Kao, H.-P. 2008. A knowledge management system for series-parallel availability optimization and design. Expert Systems with Applications 34: 181–193.
Kijima, M. & Sumita, N. 1986. A useful generalization of renewal theory: counting process governed by non-negative Markovian increments. Journal of Applied Probability 23: 71–88.
Kuo, W. & Wan, R. 2007. Recent advances in optimal reliability allocation. IEEE Transactions on Systems, Man and Cybernetics 37(4): 143–156.
Liang, Y.-C. & Smith, A.E. 2004. An ant colony optimization algorithm for the redundancy allocation problem (RAP). IEEE Transactions on Reliability 53(3): 417–423.
Luenberger, D.G. 1984. Linear and nonlinear programming. Massachusetts: Addison-Wesley.
Michalewicz, Z. 1996. Genetic algorithms + data structures = evolution programs. Berlin: Springer.
Nahas, N. & Nourelfath, M. 2005. Ant system for reliability optimization of a series system with multiple-choice and budget constraints. Reliability Engineering & System Safety 87: 1–12.
Rauzy, A. 2001. Mathematical Foundations of Minimal Cutsets. IEEE Transactions on Reliability 50: 389–396.
Rigdon, S.E. & Basu, A.P. 2000. Statistical methods for the reliability of repairable systems. New York: John Wiley & Sons.
Shelokar, P.S., Jayaraman, V.K. & Kulkarni, B.D. 2002. Ant algorithm for single and multiobjective reliability optimization problems. Quality and Reliability Engineering International 18(6): 497–514.
Taboada, H. & Coit, D.W. 2006. MOEA-DAP: a new multiple objective evolutionary algorithm for solving design allocation problems. Under review. Reliability Engineering & System Safety.
Taboada, H., Espiritu, J. & Coit, D.W. 2007. MOMS-GA: a multiobjective multi-state genetic algorithm for system reliability optimization design problems. In print. IEEE Transactions on Reliability.
Wolsey, L.A. 1998. Integer programming. New York: John Wiley & Sons.
Yañes, M., Joglar, F. & Modarres, M. 2002. Generalized renewal process for analysis of repairable systems with limited failure experience. Reliability Engineering & System Safety 77: 167–180.
Zhao, J.-H., Liu, Z. & Dao, M.-T. 2007. Reliability optimization using multiobjective ant colony system approaches. Reliability Engineering & System Safety 92: 109–120.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Non-homogeneous Markov reward model for aging multi-state system under corrective maintenance

A. Lisnianski
The Israel Electric Corporation Ltd., Haifa, Israel

I. Frenkel
Center for Reliability and Risk Management, Industrial Engineering and Management Department,
Sami Shamoon College of Engineering, Beer Sheva, Israel

ABSTRACT: This paper considers reliability measures for aging multi-state system where the system and
its components can have different performance levels ranging from perfect functioning to complete failure.
Aging is treated as failure rate increasing during system life span. The suggested approach presents the non-
homogeneous Markov reward model for computation of commonly used reliability measures such as mean
accumulated performance deficiency, mean number of failures, average availability, etc., for aging multi-state
system. Corresponding procedures for reward matrix definition are suggested for different reliability measures.
A numerical example is presented in order to illustrate the approach.

1 INTRODUCTION

Many technical systems are subjected during their lifetime to aging and degradation. After any failure, maintenance is performed by a repair team. The paper considers an aging multi-state system, where the system failure rate increases with time.
Maintenance and repair problems have been widely investigated in the literature. Barlow & Proshan (1975), Gertsbakh (2000), Valdez-Flores & Feldman (1989) and Wang (2002) survey and summarize theoretical developments and practical applications of maintenance models. Aging is usually considered as a process which results in an age-related increase of the failure rate. The most common shapes of failure rates have been observed by Gertsbakh & Kordonsky (1969), Meeker & Escobar (1998), Bagdonavicius & Nikulin (2002) and Wendt & Kahle (2006). An interesting approach was introduced by Finkelstein (2002), where it was shown that aging is not always manifested by an increasing failure rate. For example, it can be an upside-down bathtub shape of the failure rate, which corresponds to a decreasing mean remaining lifetime function.
After each corrective maintenance action or repair, the aging system's failure rate λ(t) can be expressed as λ(t) = q · λ(0) + (1 − q) · λ∗ (t), where q is an improvement factor that characterizes the quality of the overhauls (0 ≤ q ≤ 1) and λ∗ (t) is the aging system's failure rate before repair (Zhang & Jardine 1998). If q = 1, the maintenance action is perfect (the system becomes ''as good as new'' after repair). If q = 0, the failed system is returned back to a working state by minimal repair (the system stays ''as bad as old'' after repair), in which case the failure rate of the system is nearly the same as before. The minimal repair is appropriate for large complex systems where the failure occurs due to one (or a few) component(s) failing. In this paper we are dealing only with minimal repairs and, therefore, q = 0. In such a situation, the failure pattern can be described by a non-homogeneous Poisson process and reliability measures can be evaluated by using a non-homogeneous Markov reward model.
In this paper, a general approach is suggested for computing reliability measures for aging MSS under corrective maintenance with minimal repair. The approach is based on a non-homogeneous Markov reward model, where a specific reward matrix is determined for finding each different reliability measure. The main advantage of the suggested approach is that it can be easily implemented in practice by reliability engineers.

2 MODEL DESCRIPTION

According to the generic Multi-state System (MSS) model (Lisnianski & Levitin 2003), MSS output performance G(t) at any instant t ≥ 0 is a discrete-state

continuous-time stochastic process that takes its values from the set g = {g1 , g2 , . . ., gk }, G(t) ∈ g, where gi is the MSS output performance in state i, i = 1, 2, . . ., k. Transition rates (intensities) aij between states i and j are defined by corresponding system failure and repair rates. The minimal repair is a corrective maintenance action that brings the aging equipment to the conditions it was in just before the failure occurrence. Aging MSS subject to minimal repairs experiences reliability deterioration with the operating time, i.e. there is a tendency toward more frequent failures. In such situations, the failure pattern can be described by a Poisson process whose intensity function monotonically increases with t. A Poisson process with a non-constant intensity is called non-homogeneous, since it does not have stationary increments (Gertsbakh 2000). Therefore, in this case the corresponding transition intensities will be functions of time aij (t).

2.1 Non-homogeneous Markov reward model

A system's state at time t can be described by a continuous-time Markov chain with a set of states {1, . . ., K} and a transition intensity matrix a = |aij (t)|, i, j = 1, . . ., K. For a Markov reward model it is assumed that if the process stays in any state i during the time unit, a certain cost rii is paid. It is also assumed that each time the process transits from state i to state j a cost rij should be paid. These costs rii and rij are called rewards (Hiller & Lieberman 1995). A reward may also be negative when it characterizes a loss or penalty. Such a reward process associated with system states or/and transitions is called a Markov process with rewards. For such processes, in addition to a transition intensity matrix a = |aij (t)|, i, j = 1, . . ., K, a reward matrix r = |rij |, i, j = 1, . . ., K should be determined (Carrasco 2003).
Let Vi (t) be the expected total reward accumulated up to time t, given that the initial state of the process at time instant t = 0 is state i. According to Howard (1960), the following system of differential equations must be solved under specified initial conditions in order to find the total expected rewards:

dVi (t)/dt = rii + Σ_{j=1, j≠i}^{K} aij (t) rij + Σ_{j=1}^{K} aij (t) Vj (t),   i = 1, 2, . . ., K (1)

In the most common case, MSS begins to accumulate rewards after time instant t = 0, therefore, the initial conditions are:

Vi (0) = 0, i = 1, 2, . . ., K (2)

If, for example, the state K with the highest performance level is defined as the initial state, the value VK (t) should be found as a solution of the system (1).
It was shown in Lisnianski (2007) and Lisnianski et al. (2007) that many important reliability measures can be found by the determination of rewards in a corresponding reward matrix.

2.2 Rewards determination for computation of different reliability measures

For an availability computation, we partition the set of states g into g0 , the set of operational or acceptable system states, and gf , the set of failed or unacceptable states. The system states acceptability depends on the relation between the MSS output performance and the desired level of this performance — the demand, which is determined outside the system. In the general case the demand W (t) is also a random process that can take discrete values from the set w = {w1 , . . ., wM }. The desired relation between the system performance and the demand at any time instant t can be expressed by the acceptability function Φ(G(t), W (t)) (Lisnianski & Levitin 2003). The acceptable system states correspond to Φ(G(t), W (t)) ≥ 0 and the unacceptable states correspond to Φ(G(t), W (t)) < 0. The last inequality defines the MSS failure criterion. In many practical cases, the MSS performance should be equal to or exceed the demand. Therefore, in such cases the acceptability function takes the following form:

Φ(G(t), W (t)) = G(t) − W (t) (3)

and the criterion of state acceptability can be expressed as

Φ(G(t), W (t)) = G(t) − W (t) ≥ 0 (4)

Here, without loss of generality, we assume that the required demand level is constant, W (t) ≡ w, that all system states with performance greater than or equal to w correspond to the set of acceptable states, and that all system states with performance lower than w correspond to the set of unacceptable states.
We define the indicator random variable

I (t) = 1, if G(t) ∈ g0 ; 0 otherwise. (5)

The MSS instantaneous (point) availability A(t) is the probability that the MSS at instant t > 0 is in one

of acceptable states:

A(t) = Pr{I (t) = 1} = Σ_{i∈g0} Pi (t) (6)

where Pi (t) is the probability that at instant t the system is in state i.
For an aging MSS an average availability is often used. The MSS average availability Ā(T ) is defined as the mean fraction of time when the system resides in the set of acceptable states during the time interval [0, T ]:

Ā(T ) = (1/T ) ∫_0^T A(t) dt (7)

To assess Ā(T ) for MSS, the rewards in matrix r can be determined in the following manner.

• The rewards associated with all acceptable states should be defined as 1.
• The rewards associated with all unacceptable states should be zeroed, as well as all the rewards associated with transitions.

The mean reward Vi (T ) accumulated during the interval [0, T ] defines the time that the MSS will be in the set of acceptable states in the case where state i is the initial state. This reward should be found as a solution of the system (1). After solving the system (1) and finding Vi (t), the MSS average availability can be obtained for every i = 1, . . . , K:

Āi (T ) = Vi (T )/T (8)

Usually the state K is determined as the initial state, or in other words the MSS begins its evolution in the state space from the best state with maximal performance.
Mean number Nfi (T ) of MSS failures during the time interval [0, T ], if state i is the initial state. This measure can be treated as the mean number of MSS entrances into the set of unacceptable states during the time interval [0, T ]. For its computation, rewards associated with each transition from the set of acceptable states to the set of unacceptable states should be defined as 1. All other rewards should be zeroed. In this case the mean accumulated reward Vi (T ) obtained by solving (1) provides the mean number of entrances into the unacceptable area during the time interval [0, T ]:

Nfi (T ) = Vi (T ) (9)

Mean performance deficiency accumulated within the interval [0, T ]. The rewards for any state number j in a Markov reward model should be defined as

rjj = w − gj , if w − gj > 0; 0, if w − gj ≤ 0. (10)

All transition rewards rij , i ≠ j, should be zeroed. Therefore, the mean reward Vi (T ) accumulated during the time interval [0, T ], if state i is the initial state, defines the mean accumulated performance deficiency

Vi (T ) = E{ ∫_0^T (W (t) − G(t)) dt } (11)

where E is the expectation symbol and G(0) = gi .
Mean Time To Failure (MTTF) is the mean time up to the instant when the MSS enters the subset of unacceptable states for the first time. For its computation, the combined performance-demand model should be transformed: all transitions that return the MSS from unacceptable states should be forbidden, as in this case all unacceptable states should be treated as absorbing states.
In order to assess MTTF for MSS, the rewards in matrix r for the transformed performance-demand model should be determined as follows:

• The rewards associated with all acceptable states should be defined as 1.
• The rewards associated with unacceptable (absorbing) states should be zeroed, as well as all rewards associated with transitions.

In this case, the mean accumulated reward Vi (t) defines the mean time accumulated up to the first entrance into the subset of unacceptable states, or MTTF, if the state i is the initial state.
Reliability function and Probability of MSS failure during the time interval [0, T ]. The model should be transformed as in the previous case — all unacceptable states should be treated as absorbing states and, therefore, all transitions that return the MSS from unacceptable states should be forbidden. Rewards associated with all transitions to the absorbing states should be defined as 1. All other rewards should be zeroed. The mean accumulated reward Vi (T ) in this case defines the probability of MSS failure during the time interval [0, T ], if the state i is the initial state. Therefore, the MSS reliability function can be obtained as:

Ri (T ) = 1 − Vi (T ), i = 1, . . ., K (12)
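The reward-matrix recipes above are mechanical once the state performances g and the constant demand w are fixed. A sketch for the average-availability and performance-deficiency cases follows; the helper names are illustrative and not from the paper:

```python
def availability_rewards(g, w):
    """r_ii = 1 for acceptable states (g_i >= w); everything else zero."""
    n = len(g)
    r = [[0.0] * n for _ in range(n)]
    for i, gi in enumerate(g):
        if gi >= w:
            r[i][i] = 1.0
    return r

def deficiency_rewards(g, w):
    """r_jj = w - g_j when positive, 0 otherwise (Equation 10); transitions zeroed."""
    n = len(g)
    r = [[0.0] * n for _ in range(n)]
    for j, gj in enumerate(g):
        r[j][j] = max(w - gj, 0.0)
    return r

g = [0, 215, 325, 360]                   # performance levels of the Section 3 example
r_avail = availability_rewards(g, 300)   # diagonal (0, 0, 1, 1)
r_def = deficiency_rewards(g, 300)       # diagonal (300, 85, 0, 0)
```

For the power-unit example of the next section, these two calls reproduce the diagonals of the reward matrices used there for average availability and accumulated performance deficiency.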

3 NUMERICAL EXAMPLE

Consider a multi-state power generating unit with nominal generating capacity 360 KWT. The corresponding multi-state model is presented in Figure 1 and has 4 different performance levels: the complete failure level (g1 = 0), two levels with reduced capacity (g2 = 215 KWT, g3 = 325 KWT), and the level of perfect functioning (g4 = 360 KWT).
Aging was indicated as the increasing transition failure rate λ42 (t) = 7.01 + 0.2189t 2 . Other failure rates are constant: λ41 = 2.63 year −1 and λ43 = 13.14 year −1 . Repair rates are the following: μ14 = 446.9 year −1 , μ24 = 742.8 year −1 , μ34 = 2091.0 year −1 .
The demand is constant, w = 300 KWT, and power unit failure is treated as the generating capacity decreasing below the demand level w.
The state-space diagram for the system is presented in Figure 1.
By using the presented method we assess the MSS average availability, the mean total number of system failures, the accumulated mean performance deficiency, the Mean Time To Failure and the Reliability function for a 5 years time interval.

Figure 1. State space diagram of generated system. (States 1–4 with performance levels g1 = 0, g2 = 215, g3 = 325 and g4 = 360 KWT, demand level w = 300 KWT, failure rates λ41 , λ42 (t), λ43 and repair rates μ14 , μ24 , μ34 .)

According to the state space diagram in Figure 1 the following transition intensity matrix a can be obtained:

    | −μ14   0        0      μ14                     |
a = | 0      −μ24     0      μ24                     |
    | 0      0        −μ34   μ34                     |
    | λ41    λ42 (t)  λ43    −(λ41 + λ42 (t) + λ43 ) |    (13)

In order to find the MSS average availability Ā(T ) according to the introduced approach we should present the reward matrix r in the following form:

            | 0  0  0  0 |
r = |rij | = | 0  0  0  0 |
            | 0  0  1  0 |
            | 0  0  0  1 |    (14)

The system of differential equations (1) will be presented as the following:

dV1 (t)/dt = −μ14 · V1 (t) + μ14 · V4 (t)
dV2 (t)/dt = −μ24 · V2 (t) + μ24 · V4 (t)
dV3 (t)/dt = 1 − μ34 · V3 (t) + μ34 · V4 (t)
dV4 (t)/dt = 1 + λ41 · V1 (t) + λ42 (t) · V2 (t) + λ43 · V3 (t) − (λ41 + λ42 (t) + λ43 ) · V4 (t)    (15)

The system of differential equations must be solved under the initial conditions: Vi (0) = 0, i = 1, 2, 3, 4.
The results of the calculation can be seen in Figure 2. Calculation results are presented for two cases: for the aging unit with λ42 (t) = 7.01 + 0.2189t 2 and for the non-aging unit where λ42 = 7.01 ≡ constant.
In order to find the mean total number of system failures Nf (t) we should present the reward matrix r in the following form:

            | 0  0  0  0 |
r = |rij | = | 0  0  0  0 |
            | 0  0  0  0 |
            | 1  1  0  0 |    (16)
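System (15) is a linear ODE system with one time-dependent coefficient, so any explicit integrator works if the step is small relative to the fastest repair rate (μ34 ≈ 2091 year −1 ). The sketch below is my own plain RK4 illustration, using the rates of the example (not the authors' solver), and recovers an average availability consistent with Figure 2:

```python
def rk4(deriv, v, t_end, dt):
    """Classic fixed-step RK4 for dV/dt = deriv(t, V)."""
    t = 0.0
    while t < t_end - 1e-12:
        k1 = deriv(t, v)
        k2 = deriv(t + dt / 2, [x + dt / 2 * k for x, k in zip(v, k1)])
        k3 = deriv(t + dt / 2, [x + dt / 2 * k for x, k in zip(v, k2)])
        k4 = deriv(t + dt, [x + dt * k for x, k in zip(v, k3)])
        v = [x + dt / 6 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(v, k1, k2, k3, k4)]
        t += dt
    return v

m14, m24, m34 = 446.9, 742.8, 2091.0     # repair rates, year^-1
l41, l43 = 2.63, 13.14                   # constant failure rates, year^-1
l42 = lambda t: 7.01 + 0.2189 * t ** 2   # aging failure rate

def system15(t, V):
    """Equation (15): reward 1 accumulated in the acceptable states 3 and 4."""
    V1, V2, V3, V4 = V
    return [-m14 * V1 + m14 * V4,
            -m24 * V2 + m24 * V4,
            1 - m34 * V3 + m34 * V4,
            1 + l41 * V1 + l42(t) * V2 + l43 * V3
              - (l41 + l42(t) + l43) * V4]

T = 5.0
V = rk4(system15, [0.0, 0.0, 0.0, 0.0], T, dt=2e-4)
avg_avail = V[3] / T    # Equation (8), starting from the best state 4
```

The fixed step dt = 2e-4 keeps the explicit scheme stable against the stiff repair rate μ34; a stiff solver would allow far larger steps.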

Figure 2. Calculation of the MSS average availability. (Dash dot line: λ42 (t) = 7.01 + 0.2189t 2 . Bold line: λ42 = 7.01 ≡ constant.)

Then the system of differential equations (1) will be obtained:

dV1 (t)/dt = −μ14 · V1 (t) + μ14 · V4 (t)
dV2 (t)/dt = −μ24 · V2 (t) + μ24 · V4 (t)
dV3 (t)/dt = −μ34 · V3 (t) + μ34 · V4 (t)
dV4 (t)/dt = λ41 + λ42 (t) + λ41 · V1 (t) + λ42 (t) · V2 (t) + λ43 · V3 (t) − (λ41 + λ42 (t) + λ43 ) · V4 (t)    (17)

The system of differential equations must be solved under the initial conditions: Vi (0) = 0, i = 1, 2, 3, 4.
The results of the calculation are presented in Figure 3.

Figure 3. Mean total number of system failures. (Dash dot line: λ42 (t) = 7.01 + 0.2189t 2 . Bold line: λ42 = 7.01 ≡ constant.)

In order to find the Accumulated Performance Deficiency we should present the reward matrix r in the following form:

            | 300  0   0  0 |
r = |rij | = | 0    85  0  0 |
            | 0    0   0  0 |
            | 0    0   0  0 |    (18)

The system of differential equations (1) will be presented as follows:

dV1 (t)/dt = 300 − μ14 · V1 (t) + μ14 · V4 (t)
dV2 (t)/dt = 85 − μ24 · V2 (t) + μ24 · V4 (t)
dV3 (t)/dt = −μ34 · V3 (t) + μ34 · V4 (t)
dV4 (t)/dt = λ41 · V1 (t) + λ42 (t) · V2 (t) + λ43 · V3 (t) − (λ41 + λ42 (t) + λ43 ) · V4 (t)    (19)

The system of differential equations must be solved under the initial conditions: Vi (0) = 0, i = 1, 2, 3, 4.
The results of the calculation are presented in Figure 4.

Figure 4. Accumulated performance deficiency (KW·Hours). (Dash dot line: λ42 (t) = 7.01 + 0.2189t 2 . Bold line: λ42 = 7.01 ≡ constant.)

For computation of the Mean Time To Failure and the Probability of MSS failure during the time interval, the state space diagram of the generated system should be transformed—all transitions that return the system from

unacceptable states should be forbidden and all unacceptable states should be treated as an absorbing state. The state space diagram may be presented as follows.
According to the state space diagram in Figure 5, the transition intensity matrix a can be presented as follows:

    | 0               0      0                       |
a = | 0               −μ34   μ34                     |
    | λ41 + λ42 (t)   λ43    −(λ41 + λ42 (t) + λ43 ) |    (20)

In order to find the Mean Time To Failure we should present the reward matrix r in the following form:

            | 0  0  0 |
r = |rij | = | 0  1  0 |
            | 0  0  1 |    (21)

Figure 5. State space diagram of generated system with absorbing state. (States 4 (g4 = 360) and 3 (g3 = 325) above the demand level w = 300, and the absorbing state 0.)

The system of differential equations (1) will be presented as follows:

dV0 (t)/dt = 0
dV3 (t)/dt = 1 − μ34 · V3 (t) + μ34 · V4 (t)
dV4 (t)/dt = 1 + (λ41 + λ42 (t)) · V0 (t) + λ43 · V3 (t) − (λ41 + λ42 (t) + λ43 ) · V4 (t)    (22)

The system of differential equations must be solved under the initial conditions: Vi (0) = 0, i = 0, 3, 4.
The results of the calculation are presented in Figure 6.
In order to find the Probability of MSS failure during the time interval [0, T ] we should present the reward matrix r in the following form:

            | 0  0  0 |
r = |rij | = | 0  0  0 |
            | 1  0  0 |    (23)

The system of differential equations (1) will be presented as follows:

dV0 (t)/dt = 0
dV3 (t)/dt = −μ34 · V3 (t) + μ34 · V4 (t)
dV4 (t)/dt = λ41 + λ42 (t) + (λ41 + λ42 (t)) · V0 (t) + λ43 · V3 (t) − (λ41 + λ42 (t) + λ43 ) · V4 (t)    (24)

The system of differential equations must be solved under the initial conditions: Vi (0) = 0, i = 0, 3, 4.

Figure 6. Mean time to failure.
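The accumulated reward V4 (t) of system (22) flattens at the MTTF from the best state once t greatly exceeds the expected absorption time. A sketch with a plain fixed-step RK4 (my own illustration, not the authors' solver), using the rates of the example and state 0 as the absorbing failure state:

```python
m34, l41, l43 = 2091.0, 2.63, 13.14      # rates of the numerical example, year^-1
l42 = lambda t: 7.01 + 0.2189 * t ** 2   # aging failure rate

def system22(t, V):
    """Equation (22); V = (V0, V3, V4), state 0 absorbing (V0 stays 0)."""
    V0, V3, V4 = V
    return [0.0,
            1 - m34 * V3 + m34 * V4,
            1 + (l41 + l42(t)) * V0 + l43 * V3
              - (l41 + l42(t) + l43) * V4]

def rk4(deriv, v, t_end, dt):
    """Classic fixed-step RK4 for dV/dt = deriv(t, V)."""
    t = 0.0
    while t < t_end - 1e-12:
        k1 = deriv(t, v)
        k2 = deriv(t + dt / 2, [x + dt / 2 * k for x, k in zip(v, k1)])
        k3 = deriv(t + dt / 2, [x + dt / 2 * k for x, k in zip(v, k2)])
        k4 = deriv(t + dt, [x + dt * k for x, k in zip(v, k3)])
        v = [x + dt / 6 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(v, k1, k2, k3, k4)]
        t += dt
    return v

mttf = rk4(system22, [0.0, 0.0, 0.0], 1.0, 2e-4)[2]   # roughly 0.1 year, cf. Figure 6
```

Integrating to 1 year is ample here, since absorption from state 4 happens at a rate of roughly λ41 + λ42 (t) ≈ 10 year −1 .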

Figure 7. MSS reliability function during the time interval [0, T ].

The results of the calculation of the MSS reliability function according to formula (12) are presented in Figure 7.
From all the graphs one can see the age-related unit reliability decreasing compared with the non-aging unit. In the last two figures, the graphs of the mean time to failure and the reliability functions for the aging and non-aging unit are almost the same, because the first unit failure usually occurs within a short time (less than 0.5 year according to Figure 7) and the aging impact is negligibly small for such a short period.

4 CONCLUSION

1. The non-homogeneous Markov reward model was developed for reliability measures computation for aging MSS under corrective maintenance with minimal repair.
2. The method is based on a different reward matrix determination for the non-homogeneous Markov reward model.
3. The suggested approach is well formalized and suitable for practical application in reliability engineering.
4. The numerical example is presented in order to illustrate the suggested approach.

REFERENCES

Bagdonavicius, V. & Nikulin, M.C. 2002. Accelerated life models. Boca Raton: Chapman & Hall/CRC.
Barlow, R.E. & Proshan, F. 1975. Statistical Theory of Reliability and Life Testing. New York: Holt, Rinehart and Winston.
Carrasco, J. 2003. Markovian Dependability/Performability Modeling of Fault-tolerant Systems. In Hoang Pham (ed.), Handbook of Reliability Engineering: 613–642. London, NJ, Berlin: Springer.
Finkelstein, M.S. 2002. On the shape of the mean residual lifetime function. Applied Stochastic Models in Business and Industry 18: 135–146.
Gertsbakh, I. 2000. Reliability Theory with Applications to Preventive Maintenance. Berlin: Springer-Verlag.
Gertsbakh, I. & Kordonsky, Kh. 1969. Models of Failures. Berlin-Heidelberg-New York: Springer.
Hiller, F. & Lieberman, G. 1995. Introduction to Operation Research. NY, London, Madrid: McGraw-Hill, Inc.
Howard, R. 1960. Dynamic Programming and Markov Processes. Cambridge, Massachusetts: MIT Press.
Lisnianski, A. 2007. The Markov Reward Model for a Multi-state System Reliability Assessment with Variable Demand. Quality Technology & Quantitative Management 4(2): 265–278.
Lisnianski, A., Frenkel, I., Khvatskin, L. & Ding Yi. 2007. Markov Reward Model for Multi-State System Reliability Assessment. In F. Vonta, M. Nikulin, N. Limnios & C. Huber-Carol (eds), Statistical Models and Methods for Biomedical and Technical Systems: 153–168. Boston: Birkhaüser.
Lisnianski, A. & Levitin, G. 2003. Multi-state System Reliability. Assessment, Optimization and Applications. NJ, London, Singapore: World Scientific.
Meeker, W. & Escobar, L. 1998. Statistical Methods for Reliability Data. New York: Wiley.
Trivedi, K. 2002. Probability and Statistics with Reliability, Queuing and Computer Science Applications. New York: John Wiley & Sons, Inc.
Valdez-Flores, C. & Feldman, R.M. 1989. A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Naval Research Logistics 36: 419–446.
Wang, H. 2002. A survey of Maintenance Policies of Deteriorating Systems. European Journal of Operational Research 139: 469–489.
Wendt, H. & Kahle, W. 2006. Statistical Analysis of Some Parametric Degradation Models. In M. Nikulin, D. Commenges & C. Huber (eds), Probability, Statistics and Modelling in Public Health: 266–279. Berlin: Springer Science + Business Media.
Zhang, F. & Jardine, A.K.S. 1998. Optimal maintenance models with minimal repair, periodic overhaul and complete renewal. IIE Transactions 30: 1109–1119.


On the modeling of ageing using Weibull models: Case studies

Pavel Praks, Hugo Fernandez Bacarizo & Pierre-Etienne Labeau


Service de Métrologie Nucléaire, Faculté des Sciences Appliquées, Université Libre de Bruxelles (ULB), Belgium

ABSTRACT: Weibull models appear to be very flexible and widely used in the maintainability field for
ageing models. The aim of this contribution is to systematically study the ability of classically used one-mode
Weibull models to approximate the bathtub reliability model. Therefore, we analyze lifetime data simulated
from different reference cases of the well-known bathtub curve model, described by a bi-Weibull distribution
(the infant mortality is skipped, considering the objective of modeling ageing). The Maximum Likelihood
Estimation (MLE) method is then used to estimate the corresponding parameters of a 2-parameter Weibull
distribution, commonly used in maintenance modeling, and the same operation is performed for a 3-parameter
Weibull distribution, with either a positive or negative shift parameter. Several numerical studies are presented,
based first on large and complete samples of failure data, then on a censored data set, the failure data being
limited to the useful life region and to the start of the ageing part of the bathtub curve. Results, in terms of
quality of parameter estimation and of maintenance policy predictions, are presented and discussed.

1 INTRODUCTION

Our contribution is motivated by a common situation in industrial maintenance, where the failure dataset is often limited, not only in size, but also in time to the useful period and the very beginning of ageing, owing to preventive maintenance operations. Although many Weibull models are available in the literature (for detailed information, consult e.g. (Murthy, Xie, and Jiang 2004)), there is a trend in industry to use, for maintenance applications, simple one-mode Weibull models with a limited number of parameters to cover such cases.

The main aim of the paper is to analyze how acceptable this approximation is, by systematically studying the ability of classically used one-mode Weibull models to approximate the bathtub reliability model. For this reason, we would like to illustrate how satisfactorily the ageing model corresponding to the bathtub curve reliability model - in which the infant mortality is skipped, considering the objective of modeling ageing - can be estimated by a one-mode Weibull model with 2 or 3 parameters. Indeed, 2- or 3-parameter Weibull models, though commonly used in maintenance optimization, do not map well more than one part of the bathtub curve reliability model. We will see that randomly sampling failure times from the bathtub model and subsequently estimating by the maximum likelihood method the parameters of one-mode, 2- or 3-parameter Weibull models can lead to unexpected results in terms of parameter estimation as well as to erroneous predictions of an optimal maintenance policy. Although the properties of Weibull models are widely studied, a comparable systematic benchmark study is hardly found in the current literature.

The paper is organized as follows. We begin with a definition of the four-parameter model describing the assumed reality (i.e. our reference cases) in Section 2. In Section 3 we present the one-mode Weibull models with which the reference cases are approximated. Section 4 deals with the maintenance cost models that have been chosen to compare the optimal maintenance period predictions associated with the reference cases and with the one-mode Weibull approximations. Parameter estimations for one-mode Weibull models inferred from samplings of the reference bathtub curves are presented in Section 5. Results are presented and discussed in Section 6 according to the quality of the lifetime distribution and to the optimal maintenance period predicted, as mentioned above. Finally, conclusions are summarized in Section 7.


2 FOUR-PARAMETER MODEL OF THE BATHTUB CURVE

We will use as reference model a three-parameter Weibull distribution with a positive location parameter, combined with an exponential model. We therefore assume the following hazard function:

λ(t) = λ0 + (β/η) · ((t − ν)/η)^(β−1) · H(t − ν)    (1)

Symbol λ0 ≥ 0 denotes the failure rate associated with random failures. Symbols η > 0, β > 0 and

ν ≥ 0 are called scale, shape and location parameters, respectively. H(.) is the step function.
The 3-parameter Weibull rate models the failure mode due to ageing, and is active only for t > ν. In other words, a failure of the system at time t > 0 can occur by:
a) a contribution due to the exponential distribution, which describes a random failure of the system for any t > 0;
b) a contribution due to the Weibull distribution for t > ν, where ν is a positive location parameter (a shift). It means that we assume that this failure mode, which is based on the effect of ageing of the system, can occur only if t > ν.


3 APPROXIMATE WEIBULL MODELS USED IN THE COMPARISON

We assume the following probability density functions (PDF) for our parameter estimation study:
• Weibull model with 2 parameters:

f_a(t) = (β_a/η_a) · (t/η_a)^(β_a−1) · e^(−(t/η_a)^β_a)    (2)

When ν = 0 and λ0 = 0, the model is reduced to the 2-parameter Weibull law. The corresponding cumulative distribution function (CDF) is:

F_a(t) = 1 − e^(−(t/η_a)^β_a)    (3)

The scale parameter η_a has the following meaning: solving equation (3) for t = η_a, we have:

F_a(η_a) = 1 − e^(−(η_a/η_a)^β_a) = 1 − e^(−1) = 1 − 0.3679 = 0.6321    (4)

So the scale parameter η_a is the 63rd percentile of the two-parameter Weibull distribution. The scale parameter is sometimes called the characteristic life. The shape parameter β describes the speed of the ageing process in Weibull models and has no dimension.
• Weibull model with 3 parameters, including a positive shift ν > 0:

f_b(t) = (β_b/η_b) · ((t − ν_b)/η_b)^(β_b−1) · e^(−((t − ν_b)/η_b)^β_b) · H(t − ν_b)    (5)

• Weibull model with 3 parameters, including a negative shift ν < 0. The Weibull model with a negative shift can be used to model pre-aged components, hence avoiding to implicitly assume λ_c(0) = 0. The PDF has the following conditional form:

f_c(t) = [(β_c/η_c) · ((t − ν_c)/η_c)^(β_c−1) · e^(−((t − ν_c)/η_c)^β_c) / e^(−(−ν_c/η_c)^β_c)] · H(t)    (6)


4 MAINTENANCE COSTS ANALYSIS

We assume the following hypotheses in our analysis of the maintenance policy:
• Costs are kept constant and do not change with time (interest rate or inflation are not considered).
• Repair durations are considered negligible.
• Failures are detected instantaneously.
• Labour resources are always available to repair.

The following three maintenance strategies (Kececioglu 1995; Vansnick 2006) are considered.

4.1 Periodic replacement: As Bad as Old (ABAO)

In this policy an item is replaced with a new one at every Tp time units of operation, i.e. periodically at times Tp, 2Tp, 3Tp, . . . If the component fails before Tp time units of operation, it is minimally repaired, so that its instantaneous failure rate λ(t) remains the same as it was prior to the failure.
The expected total cost will be presented per unit time and per preventive cost, and the predetermined maintenance interval is denoted by Tp. For the analysis, it is convenient to introduce a coefficient r, the ratio of the curative cost over the preventive cost:

r = C_c / C_p    (7)

The total expected cost per unit time can be written as

γ(Tp) = [C_p + C_c · E(N(Tp))] / Tp    (8)

where E(N(Tp)) is the expected number of failures in the time interval [0, Tp]. If the element is not replaced but only minimally repaired (ABAO), the expected number of failures at time Tp is

E(N(Tp)) = ∫_0^Tp λ(t) dt.    (9)

Finally, the total expected cost per unit time and per preventive cost is given by

γ(Tp)/C_p = 1/Tp + (r/Tp) · E(N(Tp)).    (10)
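For the hazard function of eq. (1), the integral in eq. (9) has the closed form E(N(Tp)) = λ0·Tp + ((Tp − ν)/η)^β once Tp > ν, so the ABAO cost rate of eq. (10) can be evaluated and minimized directly. The following sketch does this by grid search; the function names are ours and the parameter values are illustrative assumptions, not the paper's case studies:

```python
def expected_failures(tp, lam0, eta, beta, nu):
    """E(N(Tp)) of eq. (9): lam0*Tp plus the Weibull wear term, active for Tp > nu."""
    wear = ((tp - nu) / eta) ** beta if tp > nu else 0.0
    return lam0 * tp + wear

def abao_cost_rate(tp, r, lam0, eta, beta, nu):
    """Eq. (10): total expected cost per unit time and per preventive cost."""
    return 1.0 / tp + (r / tp) * expected_failures(tp, lam0, eta, beta, nu)

# Illustrative parameters (assumed for this sketch, not taken from the paper):
lam0, eta, beta, nu, r = 1.0 / 14000, 10000.0, 3.0, 5000.0, 10.0
grid = range(100, 40001, 100)
tp_opt = min(grid, key=lambda t: abao_cost_rate(t, r, lam0, eta, beta, nu))
```

For Tp ≤ ν the cost rate decreases like 1/Tp, so for β > 1 the optimum necessarily lies in the ageing region Tp > ν; this is the trade-off the paper optimizes for each reference case.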

4.2 Age replacement: AGAN (As-Good-As-New)

The AGAN maintenance strategy consists of replacing the component with a new one after it has failed or when it has been in operation for Tp time units, whichever comes first. The total expected cost per unit time and per preventive cost of a component depends on a summation of contributions of preventive and curative costs:

γ(Tp)/C_p = [R(Tp) + r · (1 − R(Tp))] / ∫_0^Tp R(t) dt.    (11)

4.3 Block replacement: AGAN

In this maintenance strategy an item is replaced preventively (maximum repair) at time points Tp, 2Tp, 3Tp, and so on. In addition to this, the component is maintained correctively (maximum repair) each time it fails. The total expected cost per unit time and per preventive cost can be expressed as (Kececioglu 1995)

γ(Tp)/C_p = 1/Tp + r · (1 − R(Tp)) / ∫_0^Tp R(t) dt.    (12)


5 PARAMETER ESTIMATIONS OF ONE-MODE WEIBULL LAWS BASED ON SAMPLINGS FROM THE BATHTUB CURVE MODEL

The aim of this section is to show the possible paradox between the classical reference to bathtub-shaped failure rates and the usual way of resorting to basic, one-mode Weibull laws to model ageing and deduce maintenance policy optimization. This is first performed by assuming that the real behavior of the failure rate obeys a bathtub shape in which the infant mortality is neglected (so that a bi-Weibull is used), by sampling failure data from this assumed reality, then by estimating the parameters of reduced 2- or 3-parameter Weibull models, and by comparing the predictions of the optimal maintenance period based on this estimation with that deduced from the assumed reality. In a second step, we also analyse failure data censored at the real MTTF point, which is an approximation of the industrial reality where data are usually limited.
Parameters (η, β, ν) and λ0 have to be chosen to define what the initial ''reality'' of the failure density of the component is.
Actually, three of these parameters correspond to characteristic times:
• parameter 1/λ0 represents the Mean Time To Failure (MTTF) of the random failure mode taken alone,
• parameter η is the characteristic time of the ageing failure mode, and
• parameter ν is a time shift corresponding to the delay before the onset of ageing.

5.1 Characterization of the useful period

In this section, we specify the value of the parameter related to the useful period (i.e. the period with a constant failure rate λ0) of our model of reality.
We set 1/λ0 = 14000 h, because this value is close to that estimated for the reciprocating compressors whose modelling was an inspiration for our study (Vansnick 2006). This arbitrary choice hence defines the time units of the problem, and only 3 independent parameters remain.

5.2 Parameters describing ageing

In this section, we specify the values of the parameters related to the ageing part of our model of reality. We assume the following parameters in our tests:
• The values (2, 2.5, 3, 3.5, 4) are assumed for the shape parameter β. We have chosen these values because when β = 2, the increasing failure rate part (due to ageing) increases linearly with time, and for values of β higher than 2 the curve increases non-linearly.
• The value of the scale parameter η is selected in the following way: since MTTF = ν + η·Γ(1 + 1/β) holds for a single Weibull law, the straightforward observation is that η is a characteristic time over which the ageing process spans. We can then use the quantity ηλ0 as a measure of the relative duration of the ageing failure mode with respect to the purely random one. Let us assume that ηλ0 ≤ 1 holds in our cases, because the ageing period lasts less time than the useful life period. We then use the following range of values (λ0 is kept constant in all cases):

ηλ0 = (1, 0.9, 0.8, 0.7, 0.6)    (13)

Even if this assumption might be valid in many cases, in some situations one could meet a behavior of the component where ageing starts quite quickly, even if it is then likely to be a slower process than in other cases.
• The value of the location parameter ν is selected from the following equation:

R(ν) = (0.9, 0.8, 0.7, 0.6, 0.5, 0.4)    (14)

Here R(ν) denotes the reliability function of the reference model at the end of the useful period

(i.e. the period with constant failure rate λ0):

R(ν) = 1 − F(ν) = exp(−νλ0)    (15)

R(ν) is chosen between 0.9 and 0.4 because for values over 0.9 the useful life would be too short, and values under 0.4 are not worth analysing. This is because, for a constant failure rate, the residual reliability at t = MTTF is equal to 0.3679. Due to this, R(ν) = 0.4 is a good limit, as in practice a piece of equipment will not be operated longer.
In order to express what ''reality'' is being considered in each case, it can be specified by the 3-uple (ηλ0, β, R(ν)). For example, the ''reality'' (ηλ0 = 1, β = 3, R(ν) = 0.8) can be expressed as (ηλ0, β, R(ν)) = (1, 3, 0.8).
In total, 5 × 5 × 6 = 150 samples of data have been analysed, in each of the following two situations:
1. In the first situation, we used a complete set of N = 1000 failure data taken from the ''Model of Reality'', see eq. (1), with the purpose of estimating the quality of the approximation obtained with the single-mode Weibull laws.
2. In the second situation we used a set of N = 100 observation data; however, in this case suspended data are also considered. The MTTF point is chosen as the predetermined maintenance interval Tp. All values greater than this period will be considered as suspended data. This is an approximation of the industrial reality where failure data are usually limited.


6 RESULTS OF THE ESTIMATIONS

In this section the predetermined maintenance interval (Tp) obtained with the 2- and 3-parameter Weibull models will be compared to the Tp given by the reference model, which was presented in Section 5. As examples, the next figures show the results of the analysis done in these two different situations. Figures 1 and 2 show sorted results (i.e. over the 150 reference cases considered) of the optimal predetermined maintenance intervals for the reference model (denoted as 4P ABAO), with bounds corresponding to the minimum cost multiplied by the factor 1.05 (denoted as 4P Int. Sup and 4P Int. Inf, respectively). We assume the ratio between the curative cost and the preventive cost to be r = 10 in all cases. These figures also show the poor ability of the 2- and 3-parameter Weibull models (denoted as 2P ABAO and 3P ABAO, respectively) to provide a fitting of the assumed reality.

Figure 1. Optimal predetermined maintenance intervals for Situation 1.

Figure 2. Optimal predetermined maintenance intervals for Situation 2.

In the first situation, the results estimated from the 2- and 3-parameter Weibull models are conservative: the estimated values of the shape parameter were in the interval [1.17, 1.9] for the 3-parameter Weibull model with a positive location parameter, and the same range was obtained for the 2-parameter Weibull model. On this account, the ability of these approximations to model the wear-out effects given by our reference model is limited. It does not cover the reference model, which is represented by shape parameters within the interval [2, 4]. In fact, the estimation using the Weibull model with a positive location parameter is not very beneficial: because of the random failures, represented by the second part of the bathtub model, the effect of a positive shift of the location parameter is negligible in our cases: the estimated values of the location parameter were very close to zero (≈ 2·10^−8 h).
It seems that the paradox in the Weibull estimation leads to a conservative maintenance policy in the first situation. Optimal maintenance intervals estimated by one-mode Weibull models are in almost all cases smaller than the optimal maintenance period of the assumed reference model: estimated values of Tp for the 3-parameter Weibull model with a positive location parameter and also for the 2-parameter

Weibull model correspond to 35-82% of the values of Tp obtained with the reference models. When the 3-parameter Weibull model with a negative location parameter is assumed, the situation is better: the estimated values of Tp reached 43-100% of the reference Tp. In 7 cases (i.e. in 4.66% of all assumed cases), the value of Tp was correctly estimated.
In the second situation (i.e. with censored data), the results estimated from the 2- and 3-parameter Weibull models are also not desirable: in 36.66% of the estimations done we obtain a value of the shape parameter β < 1 (a rejuvenation!). This means that there is no minimum of the expected cost per unit time and per preventive cost function. It is thus not interesting to schedule a systematic preventive maintenance; this is completely wrong in comparison with the reference model. In 43.34% of the estimations, the estimated value of the shape parameter lies in the interval [1, 1.2]. These values also do not correspond to the reference model, and the expected cost per unit time and per preventive cost becomes almost constant. Finally, in 20% of the analyses, the estimated value of the shape parameter is between 1.2 and 1.62. This fact does not enable us to model correctly the wear-out effects given by our reference model.
As was said in Section 5.2, in order to express what ''reality'' is being considered in each case, the parameters that determine it can be expressed by the 3-uple (ηλ0, β, R(ν)). Let us analyse the behavior of two selected realities more deeply.

6.1 Reality (ηλ0, β, R(ν)) = (1, 3, 0.8)

In this case, obtaining good policy predictions may not be determined by the quality of the wear-out period approximation.
In Figure 3, the hazard function of the reality and its 2- or 3-parameter approximations for the first situation are presented. The symbol ''Zero'' represents the MLE results related to the Weibull model with 2 parameters. The symbol ''Neg'' represents the Weibull model with 3 parameters including a negative shift, and finally the symbol ''Pos'' represents the Weibull model with 3 parameters including a positive shift. We can see that the wear-out period of the assumed bathtub ''reality'' is hardly modelled by the one-mode distributions. Although the Weibull 2P and 3P approximations accurately model only the useful period of the reality, it can be seen that it is not a problem to find the optimal preventive maintenance period Tp in both situations, see Figures 4 and 5. This happens because in these situations the optimal preventive maintenance period is found before the ageing process becomes strong. For this reason, the approximations appear acceptable.

Figure 3. Parameters of the reality (1, 3, 0.8), Situation 1. Failure rate.

Figure 4. Parameters of the reality (1, 3, 0.8), Situation 1. Expected cost per unit time and per preventive cost.

In this example it is also worth noticing the MLE results obtained with the Weibull 2P and 3P approximations in the second situation: (η, β, ν) = (12487, 1.11, 0) and (η, β, ν) = (12319, 1.18, −91). We observe that these similar parameter estimates do not imply similar estimates of Tp, see Figure 5. This little difference in shape parameters implies that the Weibull 2P displays difficulties in finding a minimum of its corresponding expected cost per unit time and per preventive cost function.

6.2 Reality (ηλ0, β, R(ν)) = (0.6, 2, 0.6)

Figure 6 contains the failure rates for Situation 1. Although cost predictions in Situation 1 remain numerically non-problematic, see Figure 7, both approximation functions display difficulties in finding a minimum of their corresponding expected cost per unit time and

per preventive cost function in the second situation, much like the 2P approximations from Figure 5: the MLE results obtained with the Weibull 2P and 3P approximations in the second situation have shape parameters very close to 1: (η, β, ν) = (10355, 1.09, 0) and (η, β, ν) = (10353, 1.12, −62).

Figure 5. Parameters of the reality (1, 3, 0.8), Situation 2. Expected cost per unit time and per preventive cost.

Figure 6. Parameters of the reality (0.6, 2, 0.6), Situation 1. Failure rate.

Figure 7. Parameters of the reality (0.6, 2, 0.6), Situation 1. Expected cost per unit time and per preventive cost.


7 CONCLUSIONS

The main aim of the paper was to systematically study the ability of classical one-mode Weibull models to approximate the bathtub reliability model. The novelty of the work includes the construction of a benchmark study in which 150 carefully selected test cases were analyzed. We presented various numerical experiments showing that random sampling from the well-known bathtub reliability model and subsequent MLE of Weibull models with 2 or 3 parameters lead to potentially dangerous shortcuts.
For huge, complete sets of failure data, the estimated shape parameters are in almost all cases inside the interval [1.17, 1.9] for the Weibull model with a positive location parameter, or close to 6 when the Weibull model with a negative location parameter is assumed. These estimations do not correspond at all to the wear-out failures represented in our reference model by shape parameters within the interval [2, 4]. This paradox in the Weibull estimation leads to a conservative maintenance policy, which involves more cost and can even be risky: too short maintenance periods can reduce the reliability of the system, because not all of these interventions finish successfully. Moreover, this estimation drawback is observed even in cases with a large set of failure data (1000).
For limited sets with censored data, the estimated values of the shape parameter are often close to one or even less than one, and the estimated predetermined maintenance intervals are not connected to the reference model. It could be very interesting to repeat the tests contained in this contribution and also to provide a sensitivity analysis of the estimated parameters with goodness-of-fit tests (Meeker and Escobar 1995).
The message of our contribution should be kept in mind during the Weibull parameter estimation process, in which real failure data are also analyzed: although Weibull models are flexible and widely used, we would like to point out the limited ability of one-mode Weibull models with 2 or 3 parameters

to successfully fit more than exactly one part of the bathtub curve reliability model.


ACKNOWLEDGMENT

The postdoctoral stay of Pavel Praks (from VŠB-Technical University of Ostrava, the Czech Republic) at the Université Libre de Bruxelles, Belgium, is part of the ARC Project ''Advanced supervision and dependability of complex processes: application to power systems''.


REFERENCES

Kececioglu, D. (1995). Maintainability, Availability and Operational Readiness Engineering Handbook. Prentice Hall.
Meeker, W. Q. and L. A. Escobar (1995). Statistical Methods for Reliability Data. John Wiley & Sons.
Murthy, D. N. P., M. Xie, and R. Jiang (2004). Weibull Models. John Wiley & Sons.
Vansnick, M. (2006). Optimization of the maintenance of reciprocating compressors based on the study of their performance deterioration. Ph.D. thesis, Université Libre de Bruxelles.
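The benchmark loop described in Sections 5 and 6 — sampling N = 1000 failure times from a reference bathtub ''reality'' and refitting a 2-parameter Weibull by maximum likelihood — can be sketched as follows. This is a standard-library sketch: the sampler uses the competing-risks minimum of the exponential and shifted-Weibull modes of eq. (1), the profile-likelihood bisection solver is a generic textbook method, and the function names are ours, not the authors' implementation. The case shown corresponds to the reality (ηλ0, β, R(ν)) = (1, 3, 0.8).

```python
import math
import random

def sample_bathtub(n, lam0, eta, beta, nu, rng):
    """Failure times from the reference model of eq. (1): the minimum of an
    exponential mode (rate lam0) and a nu-shifted Weibull mode (eta, beta)."""
    return [min(rng.expovariate(lam0), nu + rng.weibullvariate(eta, beta))
            for _ in range(n)]

def weibull_mle(data, iters=80):
    """Fit a 2-parameter Weibull law (eqs. (2)-(3)) by maximum likelihood.
    The shape solves the profile equation sum(t^b ln t)/sum(t^b) - 1/b = mean(ln t),
    found by bisection; data are rescaled by their maximum for stability."""
    m = max(data)
    u = [t / m for t in data]
    mean_log = sum(math.log(x) for x in u) / len(u)
    def profile(b):  # monotone increasing in b
        s = sum(x ** b for x in u)
        sl = sum((x ** b) * math.log(x) for x in u)
        return sl / s - 1.0 / b - mean_log
    lo, hi = 1e-3, 100.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if profile(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    scale = m * (sum(x ** shape for x in u) / len(u)) ** (1.0 / shape)
    return scale, shape

rng = random.Random(1)
lam0 = 1.0 / 14000          # useful period, as in Section 5.1
eta, beta = 1.0 / lam0, 3.0  # eta*lam0 = 1
nu = -math.log(0.8) / lam0   # R(nu) = 0.8, from eq. (15)
data = sample_bathtub(1000, lam0, eta, beta, nu, rng)
scale_hat, shape_hat = weibull_mle(data)
```

Per Section 6, refits of this kind typically return shape estimates well below the true ageing shape β = 3, which is the paradox the paper documents.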


On-line condition-based maintenance for systems with several modes of degradation

Amélie Ponchet, Mitra Fouladirad & Antoine Grall


Université de Technologie de Troyes, Institut Charles Delaunay, France

ABSTRACT: In this paper, a maintenance model for a deteriorating system with several modes of degradation
is proposed. The time of change of mode and parameters after the change are unknown. A detection procedure
based on an on-line change detection/isolation algorithm is used to deal with unknown change time and unknown
parameters. The aim of this paper is to propose an optimal maintenance versus detection policy in order to
minimize the global maintenance cost.

1 INTRODUCTION

In this paper, we consider a system which nominally deteriorates according to a given mode. The mean deterioration rate increases suddenly at an unknown time. It is supposed that at this time a change of mode occurs. On-line information about the change of mode is available from on-line monitoring. To deal with the unknown change time, an adequate on-line change detection algorithm is used. In this framework, we propose to develop a maintenance model and to optimize the decision rule by taking into account an on-line change detection algorithm.
The particularity of this model is the presence of unknown deterioration parameters in the second mode. These parameters are unknown, but the set to which they belong is known. We deal with these parameters by using an adequate detection/isolation algorithm in the framework of conditional maintenance.
Throughout this paper we deal with a system which nominally deteriorates according to a given known mode and whose mean deterioration rate increases suddenly at an unknown time. The time of change of the degradation mode is called the change time. An adaptive maintenance policy based on on-line change detection procedures has already been proposed in (Fouladirad, Dieulle, and Grall 2006; Fouladirad, Dieulle, and Grall 2007) for the case where the parameters of the accelerated mode are known. In this paper we suppose that the parameters of the accelerated mode are unknown. They take values belonging to a known set. As these parameters are no longer known in advance (even if they take values belonging to a known set), the methods proposed in (Fouladirad, Dieulle, and Grall 2006; Fouladirad, Dieulle, and Grall 2007) are no longer appropriate. Hence, in this paper, to take into account the presence of unknown parameters in the accelerated mode, a condition-based maintenance policy embedded with a detection/isolation procedure is proposed. The on-line change detection/isolation algorithm should be adequate in the sense that it should minimize the detection delay and have a low false alarm rate and a low false isolation rate. The proposed parametric maintenance decision rule, which uses an on-line detection procedure for the change of mode, is optimized in order to minimize an average maintenance cost criterion.
The considered system is subject to continuous random deterioration, and a scalar ageing variable can summarize the system condition (Wang 2002). When the system deteriorates, the ageing variable increases, and a failure occurs when the ageing variable (the system state) exceeds a known fixed threshold L called the failure threshold. The failure threshold corresponds to the limit value of wear beyond which the mission of the system is no longer fulfilled. The information about the system state is only collected through inspections, which means that such systems are usually subject to non-obvious failures. Therefore, the system is known to have failed only if the failure is revealed by an inspection. The system can be declared ''failed'' as soon as a defect or an important deterioration is present, even if the system is still functioning. In this situation, its high level of deterioration is unacceptable either for economic reasons (poor product quality, high consumption of raw material, etc.) or for safety reasons (high risk of hazardous breakdowns). An example is given in (van Noortwijk and van Gelder 1996), where the problem of maintaining coastal flood barriers is

modeled. The sea gradually erodes the flood barrier, and the barrier is deemed to have failed when it is no longer able to withstand the pressure of the sea.
If through inspections it is discovered that the system has failed, a corrective maintenance operation immediately replaces the failed system by a new one. Preventive maintenance actions are performed in order to avoid a failure occurrence and the resulting period of inactivity of the system. The preventive maintenance operation is less costly than the corrective maintenance operation. The preventive maintenance action takes place when the system state exceeds a predetermined threshold known as the preventive threshold.
The inter-inspection interval times and the value of the preventive threshold are two factors which influence the global maintenance cost. For example, in the case of costly inspections it is not worthwhile to inspect the system often. But if the system is scarcely inspected, the risk of missing a failure occurrence increases. In (Grall, Dieulle, and Berenguer 2002), (Dieulle, Bérenguer, Grall, and Roussignol 2003) and (Bérenguer, Grall, Dieulle, and Roussignol 2003) the authors propose condition-based inspection/replacement and continuous-monitoring replacement policies for a single-mode deteriorating system. In those previous works, a maintenance cost model is proposed which quantifies the costs of the maintenance strategy, together with a method to find the optimal strategy leading to a balance between monitoring and maintenance efficiency. When the system undergoes a change of mode it seems reasonable to incorporate the on-line information available about the system in the maintenance decision rule. In (Fouladirad, Dieulle, and Grall 2006; Fouladirad, Dieulle, and Grall 2007) the authors studied on-line change detection in the framework of condition-based maintenance in the case of known parameters after the change.
In this paper an adaptive maintenance policy based on an embedded optimal on-line change detection algorithm is proposed. The originality is due to the fact that the parameters after the change (i.e. the parameters of the second mode) can take unknown values, but these values belong to a known and finite set.
In section 2, the deteriorating system is described. In section 3, an adequate maintenance decision rule is proposed. Section 4 is devoted to the presentation of the on-line change detection algorithm. A method for the evaluation of the maintenance cost is proposed in section 5. In section 6 numerical implementations of our results are presented and analyzed.


2 SYSTEM DESCRIPTION

The system studied in this paper is an observable system subject to accumulation of damage. The system state at time t can be summarized by a scalar random ageing variable Xt. When no repair or replacement action has taken place, (Xt)t≥0 is an increasing stochastic process with initial state X0 = 0. If the state of the process reaches a pre-determined threshold, say L, the system is said to have failed. The system can be declared ''failed'' as soon as a defect or an important deterioration is present, even if the system is still functioning. This means that it is no longer able to fulfill its mission in acceptable conditions. The threshold L is chosen with respect to the properties of the considered system. It can be seen as a safety level which must not be exceeded. The behavior of the deterioration process after a time t depends only on the amount of deterioration at this time.
The parameters of the deterioration process (Xt)t≥0 can suddenly change at time T0: after T0 the mean deterioration rate suddenly increases from a nominal value to an accelerated rate. The first mode corresponds to a nominal mode denoted by M1 and the accelerated mode is denoted by M2. In this paper the case of two degradation modes is treated, but this choice is not restrictive. The results exposed in this paper can be generalized to the case of multiple degradation modes, which can be the subject of further work.
In this paper, it is assumed that the deterioration process in mode Mi (i = 1, 2), denoted by (Xt^i)t≥0, is a gamma process, i.e. for all 0 ≤ s ≤ t, the increment of (Xt^i)t≥0 between s and t, Y(t−s) = Xt^i − Xs^i, follows a gamma probability distribution function with shape parameter αi·(t − s) and scale parameter βi. This probability distribution function can be written as follows:

f(αi(t−s), βi)(y) = [1/Γ(αi(t − s))] · [y^(αi(t−s)−1) · e^(−y/βi) / βi^(αi(t−s))] · 1{y≥0}.    (1)

The mean of the increment in mode Mi is αi·(t − s)·βi and its variance is αi·(t − s)·βi², so the average deterioration rate is αi·βi per unit time.
It should be recalled that the gamma process is a positive process with independent increments. It implies frequent occurrences of tiny increments, which makes it relevant to describe gradual deterioration due to continuous use, such as erosion, corrosion, concrete creep, crack growth and wear of structural components (van Noortwijk 2007; Cooke, Mendel, and Vrijling 1997; Çinlar, Bažant, and Osman 1977; Blain, Barros, Grall, and Lefebvre 2007; Frangopol, Kallen, and van Noortwijk 2004). A further advantage of the gamma process is the existence of an explicit probability distribution function, which permits feasible mathematical developments. It has been widely applied to model condition-based maintenance, see (van Noortwijk 2007).
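Because disjoint intervals contribute independent Gamma(αi·Δt, βi) increments, a deterioration trajectory of the kind described by eq. (1) can be simulated directly. A minimal sketch with the standard library (the parameter values are hypothetical, not taken from the paper):

```python
import random

def simulate_gamma_path(alpha, beta, dt, horizon, rng):
    """One trajectory of a gamma deterioration process: over each step of
    length dt the increment is Gamma-distributed with shape alpha*dt and
    scale beta, independently of the past (eq. (1) of this section)."""
    t, x = 0.0, 0.0
    path = [(t, x)]
    while t < horizon:
        x += rng.gammavariate(alpha * dt, beta)  # independent, stationary increment
        t += dt
        path.append((t, x))
    return path

rng = random.Random(42)
# Hypothetical nominal-mode parameters (alpha1, beta1) = (1.0, 0.5):
path = simulate_gamma_path(alpha=1.0, beta=0.5, dt=1.0, horizon=100.0, rng=rng)
```

A change of mode at T0 can be simulated by switching (alpha, beta) to the accelerated pair once t ≥ T0; since the path is nondecreasing, a crossing of the failure threshold L is revealed at the first inspection with Xt ≥ L, matching the inspection scheme of Section 3.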

The originality of this paper is due to the existence of unknown parameters after the change. In (Fouladirad, Dieulle, and Grall 2006; Fouladirad, Dieulle, and Grall 2007) the second mode parameters are known in advance, but in this paper the second mode parameters are no longer known in advance. They take their values in the following set:

S = {(α21, β21), . . . , (α2K, β2K)}   (2)

As card(S) = K, there are K different possibilities for the second degradation mode.

It is supposed that the mode M2 corresponds to an accelerated mode, so in the second mode the parameters are such that α2·β2 > α1·β1. The case of a slower second mode is discarded because the systems we consider can neither be stabilized nor deteriorate more slowly than before.

3 MAINTENANCE POLICY

In the framework of this paper, the considered deteriorating systems cannot be continuously monitored. Hence, the deterioration level can only be known at inspection times. We shall denote by (tk)k∈N the sequence of inspection times defined by tk+1 − tk = Δt (for k ∈ N), where Δt is a fixed parameter. If the deterioration level exceeds the threshold L between two inspections, the system continues to deteriorate until the next inspection and can undergo a breakdown. Since the failure of such a system can have disastrous consequences, a preventive maintenance action has to take place beforehand, and the inter-inspection time Δt has to be carefully chosen in order to be able to replace the system before failure. A preventive replacement restores the system to a "good as new" state. The maintenance decision scheme provides possible preventive replacements according to the measured amount of deterioration.

There are two possible maintenance actions at each inspection time tk:

• the system is preventively replaced if the deterioration level is close to L,
• if a failure has occurred (Xtk > L) then the system is correctively replaced.

A detection method can be used to identify the unknown time of change of degradation mode, T0. The on-line change detection algorithm presented in Section 4 is used to detect the change time T0 when the parameters after the change belong to the set presented in (2). The maintenance decision is based on a parametric decision rule according to the change detection result. As in (Saassouh, Dieulle, and Grall 2005), (Fouladirad, Dieulle, and Grall 2006) and (Fouladirad, Dieulle, and Grall 2007), the preventive maintenance decision is based on different preventive thresholds corresponding to each of the possible deterioration modes of the system. Such maintenance policies are extensions of inspection/replacement structures for single-mode deteriorating systems. Let Anom and Aac be the decision thresholds associated with the "limit" cases corresponding to the single-mode deterioration system. The decision threshold Anom (respectively Aac) is chosen in order to minimize the cost criterion of the nominal single-mode deteriorating system. In the nominal mode the threshold Anom is effective, and as soon as the system is supposed to have switched to the second (accelerated) mode, the threshold is adapted from Anom to Aac.

The possible decisions which can arise at each inspection time tk are as follows:

• If Xtk ≥ L, the system has failed and is correctively replaced.
• If no change of mode has been detected in the current cycle and Xtk ≥ Anom, then the system is preventively replaced.
• If a change of mode is detected at time tk, or has been detected earlier in the current cycle, and Xtk ≥ Aac, then the system is preventively replaced.
• In all other cases, the decision is postponed to time tk+1.

As a consequence of the previous decision rule, if the system is deteriorating according to mode i, i = 1, 2, and if a change of mode is detected at time tdetect = kΔt, k ∈ N, where tdetect ≥ T0, the two following scenarios can arise:

• If Xtdetect < Aac, then the system is left unchanged and a replacement is performed at time tn > tdetect such that Xtn−1 < Aac ≤ Xtn.
• If Xtdetect ≥ Aac, then the system is immediately replaced.

The parameters of the maintenance decision rule are the thresholds Aac and Anom, the inter-inspection interval Δt, and the parameters of the detection algorithm.

4 ON-LINE CHANGE DETECTION/ISOLATION

On-line abrupt change detection with a low false alarm rate and a short detection delay has many important applications, including industrial quality control and automated fault detection in controlled dynamical systems. The authors in (Basseville and Nikiforov 1993) present a large literature on detection algorithms
in complex systems. A first attempt to use optimal on-line abrupt change detection in the framework of a maintenance policy is presented in (Fouladirad, Dieulle, and Grall 2006), (Fouladirad, Dieulle, and Grall 2007). On-line change detection algorithms make it possible to use the on-line available information on the deterioration rate to detect the abrupt change time. These algorithms take into account the information collected through inspections, so they deal with on-line discrete observations (i.e. the system state at times (tk)k∈N). In (Fouladirad, Dieulle, and Grall 2007) it is supposed that prior information on the change time is available. In this case an adequate on-line detection algorithm which takes into account the available prior information on the change time is proposed.

The authors in (Fouladirad, Dieulle, and Grall 2006), (Fouladirad, Dieulle, and Grall 2007) considered the case of two deteriorating modes (one change time) and known parameters after the change. In this paper, the aim is to propose an adequate detection/isolation method when the accelerated mode parameters can take unknown values. These values belong to the known set defined in (2).

We collect observations (Xk)k∈N at inspection times k ∈ N. Let Yk, k ∈ N, be the increments of the degradation process. Then Yk follows a gamma law with density fθi = f_{αiΔt, βi} according to the degradation mode Mi, i = 1, 2. We shall denote by fl = f_{α2lΔt, β2l}, l = 1, . . . , K, the density function associated with the accelerated mode when (α2, β2) = (α2l, β2l).

We shall denote by N the alarm time at which a ν-type change is detected/isolated, and ν, ν = 1, . . . , K, is the final decision. A change detection/isolation algorithm should compute the couple (N, ν) based on Y1, Y2, . . . . We shall denote by Pr0 the probability given that no change of mode has occurred, and by PrlT0 the probability given that the change of mode has occurred at T0. Under PrlT0 the increments Y1, Y2, . . . , YT0−1 each have density function fθ1, a change has occurred at T0, and YT0 is the first observation with distribution fl, l = 1, . . . , K. E0 (resp. ElT0) is the expectation corresponding to the probability Pr0 (resp. PrlT0).

The mean time before the first false alarm of type j is defined as follows:

E0(N_{ν=j}) = E0( inf_{k≥1} {N(k) : ν(k) = j} )   (3)

The mean time before the first false isolation of type j is defined as follows:

ElT0(N_{ν=j}) = ElT0( inf_{k≥1} {N(k) : ν(k) = j} )   (4)

As initially proposed by (Lorden 1971), in on-line detection algorithms the aim is usually to minimize the worst-case mean delay for detection/isolation:

τ* = max_{1≤l≤K} τ*_l,   (5)

τ*_l = sup_{T0≥1} ess sup τ̄,

τ̄ = ElT0(N − T0 + 1 | N ≥ T0, X1, . . . , XT0−1),

for a given minimum false alarm rate or false isolation rate:

min_{0≤i≤K} min_{1≤j≠i≤K} EiT0(N_{ν=j}) = a   (6)

where E0T0 = E0.

Let us recall the detection/isolation algorithm initially proposed by (Nikiforov 1995). We define the stopping time N^{l*} in the following manner:

N^{l*} = inf_{k≥1} N^{l*}(k),   (7)

N^{l*}(k) = inf{ t ≥ k : min_{0≤j≠l≤K} S_k^t(l, j) ≥ h }   (8)

where S_k^t(l, j) is the log-likelihood ratio between fl and fj, S_k^t(l, j) = Σ_{i=k}^t log( fl(Yi)/fj(Yi) ), and f0(·) = fθ1(·).

The stopping time and the final decision of the detection/isolation algorithm are as follows:

N* = min{N^{1*}, . . . , N^{K*}}   (9)

ν* = argmin{N^{1*}, . . . , N^{K*}}   (10)

In (Nikiforov 1995) the author proved that the worst-case mean time to detection τ* defined by (5) satisfies the following relations:

τ* ≤ max_{1≤l≤K} ElT0(N*) ∼ ln(a)/ρ*  as a → ∞   (11)

ρ* = min_{1≤l≤K} min_{0≤j≠l≤K} ρlj   (12)

ρlj = ∫ fl ln(fl/fj) dμ,  0 ≤ j ≠ l ≤ K   (13)

where ρlj is the Kullback–Leibler distance. The detection/isolation algorithm presented in this section asymptotically reaches the lower bound ln(a)/ρ* initially proposed by (Lorden 1971).
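A brute-force sketch of the statistic S_k^t(l, j) and the stopping/isolation rule (7)–(10) for gamma increments is given below (Python; the function names, toy data and parameter values are ours, and the O(n³K²) scan is for illustration only — practical implementations use recursive forms):

```python
import math

def gamma_logpdf(y, shape, scale):
    # log of the gamma density with the given shape and scale, at y > 0
    return ((shape - 1.0) * math.log(y) - y / scale
            - math.lgamma(shape) - shape * math.log(scale))

def detect_isolate(Y, shapes, scales, h):
    """Naive evaluation of the rule (7)-(10).

    shapes[0]/scales[0] parametrize the nominal density f_0 = f_theta1 and
    entries 1..K the K candidate accelerated densities f_l.  S_k^t(l, j) is
    the cumulative log-likelihood ratio of f_l against f_j over Y_k..Y_t; a
    type-l alarm is raised at the first t such that, for some onset
    candidate k, min_{j != l} S_k^t(l, j) >= h, and l is the isolated mode.
    """
    K = len(shapes) - 1
    for t in range(len(Y)):
        for k in range(t + 1):                    # candidate change onset
            window = Y[k:t + 1]
            for l in range(1, K + 1):
                s_min = min(
                    sum(gamma_logpdf(y, shapes[l], scales[l])
                        - gamma_logpdf(y, shapes[j], scales[j])
                        for y in window)
                    for j in range(K + 1) if j != l)
                if s_min >= h:
                    return t, l                   # alarm time N, decision nu
    return None, None

# toy data: 3 nominal increments followed by clearly accelerated ones
N_alarm, nu = detect_isolate([1.0] * 3 + [6.0] * 6,
                             shapes=[1.0, 6.0, 2.0], scales=[1.0, 1.0, 1.0],
                             h=2.0)
# → (3, 1): alarm at the first accelerated increment, mode 1 isolated
```

With these toy densities the alarm fires immediately at the change and the correct mode is isolated; with closer candidate modes the delay/false-isolation trade-off discussed in Section 6 appears.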
5 EVALUATION OF THE MAINTENANCE POLICY

Each time a maintenance action is performed on the system, a maintenance cost is incurred. Each corrective (respectively preventive) replacement entails a cost Cc (respectively Cp). Since a corrective maintenance operation is performed on a more deteriorated system, it is generally more complex and consequently more expensive than a preventive one. Hence it is supposed that Cp < Cc. The cost incurred by any inspection is Ci. During the period of unavailability of the system (i.e. the time spent by the system in a failed state), an additional cost per unit of time Cu is incurred. All direct and indirect costs are already included in the unit costs Ci, Cc, Cp, Cu. The maintenance policy is evaluated using an average long-run cost rate taking into account the cost of each type of maintenance action. Let us denote by Np(t) the number of preventive replacements before t, Nc(t) the number of corrective replacements before t, du(t) the cumulative unavailability duration of the system before t, and Ni(t) the number of inspections before t. We know that Ni(t) = [t/Δt], where [x] denotes the integer part of the real number x. Let us denote by T the length of a life-time cycle and by TL the random time at which the system state exceeds the threshold L. The regeneration property of the process (Xt)t≥0 allows us to write:

C∞ = lim_{t→∞} E(C(t))/t = E(C(T))/E(T)   (14)

where

C(t) = Ci Ni(t) + Cp Np(t) + Cc Nc(t) + Cu du(t).

We know that

E(Np(T)) = Pr(cycle ends with a preventive replacement),

E(Nc(T)) = Pr(cycle ends with a corrective replacement),

E(Ni(T)) = E([T/Δt]),

E(du(T)) = E( (T − TL) 1{TL<T} ).

Let the "inspection scheduling function" introduced in (Grall, Dieulle, and Berenguer 2002) be constant; then the threshold Anom and the inter-inspection time Δt can be obtained by numerical minimization of the cost criterion of the single-mode deteriorating system in mode M1. The threshold Aac corresponds to the optimal threshold for the single-mode deteriorating system in mode M2.

In this work, the cost criterion is optimized as a function of the parameter of the considered maintenance policy: the detection threshold h defined in (7)–(8).

6 NUMERICAL IMPLEMENTATION

In this section we apply the maintenance policy presented in this paper to the case of a system with two degradation modes and four possible accelerated modes (K = 4).

The proposed maintenance policies are analyzed by numerical implementations. Throughout this section, the values of the maintenance costs are respectively Ci = 5, Cp = 50, Cc = 100 and Cu = 250. For the numerical calculations it is supposed that in the nominal mode M1, α1 = 1 and β1 = 1. Hence, the maintenance threshold Anom is equal to 90.2. This value is the optimal value which minimizes the long-run maintenance cost for a single-mode deteriorating system in mode M1. For this optimization (from Monte Carlo simulations), we use single degradation mode results with Δt = 4. The couple (Anom, Δt) = (90.2, 4) is the optimal couple which minimizes the long-run maintenance cost for a single-mode deteriorating system in mode M1. T0 is simulated by a uniform law in the Monte Carlo method. To evaluate each maintenance policy, four different accelerated modes are considered. So the parameters of the accelerated mode belong to the following set:

(α2, β2) ∈ {(2, 1), (1, 3), (2, 2), (1, 7)}

The properties of the different second modes are presented in Table 1. The threshold Aac corresponds to the optimal value which minimizes the long-run maintenance cost for a single-mode deteriorating system in mode M2.

Table 1. Characteristic data of the second degradation mode.

Real second mode cases    1      2      3      4
α2                        2      1      2      1
β2                        1      3      2      7
Aac                      85.6   74.6   73.7   51.6

6.1 Parameter optimization

In this section, the problem of parameter optimization for the considered maintenance policies is investigated. The parameter of interest is the detection threshold h defined in (7) and (8).
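The long-run cost rate (14) is estimated throughout this section by renewal-reward Monte Carlo, i.e. the ratio of the mean cycle cost to the mean cycle length. A minimal sketch for the single-mode (nominal) system follows; the half-interval downtime approximation is our shortcut, not the paper's exact evaluation of du(t):

```python
import random

random.seed(1)
Ci, Cp, Cc, Cu = 5.0, 50.0, 100.0, 250.0    # unit costs of this section

def one_cycle(alpha, beta, dt, L, A):
    """Cost and length of one regeneration cycle of the single-mode system.

    Inspections every dt cost Ci each; preventive replacement (Cp) when
    X >= A, corrective replacement (Cc) when X >= L.  The exact downtime
    within the last interval is unobserved, so it is approximated here by
    half an inspection interval whenever a failure is discovered.
    """
    x, t, cost = 0.0, 0.0, 0.0
    while True:
        t += dt
        cost += Ci
        x += random.gammavariate(alpha * dt, beta)   # gamma increment
        if x >= L:
            return cost + Cc + Cu * dt / 2.0, t      # corrective cycle end
        if x >= A:
            return cost + Cp, t                      # preventive cycle end

cycles = [one_cycle(1.0, 1.0, dt=4.0, L=100.0, A=90.2) for _ in range(5000)]
C_inf = sum(c for c, _ in cycles) / sum(t for _, t in cycles)   # estimate of (14)
```

With the nominal parameters above, the estimate lands in the same range (around 2) as the cost rates reported in Tables 2–5.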
The "optimal" value of h which leads to a minimal maintenance cost is numerically calculated. To define the optimal value of h, the maintenance cost, the false alarm rate and the isolation rate are obtained for different values of h in the interval [0, 15]. This interval is chosen because the cost remains stable around the same values after h = 15. The values of h corresponding to the lowest maintenance cost, the lowest false alarm rate and the highest correct isolation rate are determined. To study the impact of the variation of the threshold h on the properties of the maintenance policy, in addition to the maintenance cost, the probabilities of preventive and corrective maintenance for different values of h in the interval [0, 15] are calculated. The choice of h is not always based on the value which minimizes the maintenance cost or which optimizes the properties of the detection algorithm. For example, it is not sensible to take a value of h leading to the lowest maintenance cost if it corresponds to a false isolation rate close to 1.

We present in Table 2 the properties of the maintenance versus detection/isolation algorithm corresponding to the value of h which leads to the lowest maintenance cost. It can be noted that, except in case 1 (α2 = 2 and β2 = 1), the maintenance costs are very low in comparison with the conservative case when only one threshold is used, presented in Table 5.

Table 2. Optimal maintenance policy corresponding to the lowest maintenance cost.

Real second mode cases    1      2      3      4
Maintenance Cost         1.98   1.99   1.99   2.00
Detection threshold       1      1      1      1
False alarm rate         0.9    0.89   0.89   0.88
Correct isolation rate   0.87   0.03   0.04   0.05

Table 3. Optimal maintenance policy corresponding to the lowest false alarm rate.

Real second mode cases    1      2      3      4
Maintenance Cost         1.99   2.22   2.37   2.67
Detection threshold       5      7      6     15
False alarm rate         0.016  0.02   0.014  0.24
Correct isolation rate    1      0      0      0

Table 4. Optimal maintenance policy corresponding to the highest correct isolation rate.

Real second mode cases    1      2      3      4
Maintenance Cost         1.98   2.09   2.17   2.34
Detection threshold      12      2      2      2
False alarm rate         0.018  0.3    0.31   0.27
Correct isolation rate    1     0.19   0.2    0.3

Table 5. Optimal costs corresponding to the maintenance with one preventive threshold.

Case     1      2      3      4
Costs   1.97   2.21   2.36   2.66

In Table 3 the properties of the maintenance versus detection/isolation algorithm corresponding to the value of h which leads to the lowest false alarm rate are presented. It can be noticed that the maintenance costs are very close to the costs when only one threshold is used (without detection procedure). The use of the detection algorithms when a low false alarm rate is requested does not improve the quality of the maintenance policy. In this configuration, a maintenance policy without a detection procedure seems to be adequate. The results in Table 4 correspond to the properties of the maintenance versus detection/isolation algorithm for the value of h which leads to the highest correct isolation rate. In this table, except for case 1 (α2 = 2 and β2 = 1), the highest correct isolation rate is not very high, but the corresponding false alarm rate and maintenance costs are acceptable. The maintenance cost is still lower than for the maintenance policy without a detection procedure. In the first case (α2 = 2 and β2 = 1) the correct isolation rate is always very high. This should be due to the global optimization of the detection threshold h. This optimization is more sensitive to the properties of the first case, where the two modes are very close. It is possible that if, in the optimization procedure, a separate detection threshold hl in equation (7) were used for each second mode l = 1, . . . , K, the correct isolation results could be different. But this method requires a complex optimization procedure and its feasibility is arguable.

If the only criterion is the result of the maintenance (a low maintenance cost), we can neglect the false alarm rate and the false isolation rate. But if the properties of the detection algorithms are of great importance in the maintenance procedure, we cannot base our choice only on the maintenance cost and we should take into account the properties of the detection algorithm.

In Figure 1 the maintenance properties corresponding to the accelerated mode (α2 = 1, β2 = 3) are depicted. To illustrate the results the threshold h varies in [0, 15]. The maintenance cost is stable around 2.2 and reaches its minimum value 1.99 for h = 0. The probability of corrective maintenance is very low and the probability of preventive maintenance is very high. We can say that the policy is mostly preventive.

In Figure 2 the detection algorithm properties corresponding to the accelerated mode (α2 = 2, β2 = 2) are depicted. To illustrate the results the threshold h varies in [0, 15]. The false alarm rate is very high for small values of h and decreases as h grows. For
h > 4, the false alarm rate is close to 0. On the contrary, as h grows the detection delay also grows. It stands to reason that for a low detection threshold false alarms are very frequent, and that as the detection improves and the false alarm rate decreases, a detection delay appears. It is natural that with a 0.9 false alarm rate the detection delay is nonexistent.

In Figure 3 the correct isolation rate corresponding to the accelerated mode (α2 = 2, β2 = 1) is depicted. To illustrate the results the threshold h varies in [0, 15]. These results show the good quality of the detection/isolation algorithm.

Figure 1. Maintenance properties when α2 = 1, β2 = 3 (panels: cost versus h; probability of corrective maintenance versus h; probability of preventive maintenance versus h).

Figure 2. Detection algorithm properties when α2 = 2 and β2 = 2 (panels: false alarm rate versus h; mean delay versus h).

Figure 3. Correct isolation rate versus h when α2 = 2 and β2 = 1.

7 CONCLUSION

In this paper we have proposed a maintenance policy combined with an on-line detection method. This policy leads to a low maintenance cost. We took into account the possibility that the system could switch to different accelerated modes. By considering this possibility, the proposed detection algorithm is also an isolation algorithm. We have noticed that the lowest false alarm and false isolation rates do not always correspond to the lowest maintenance cost. The proposed algorithm generally has a low correct isolation rate and sometimes a high false alarm rate. The aim of future work is to improve these properties in order to obtain a low-cost maintenance versus detection policy which can easily isolate the real accelerated mode.

REFERENCES

Basseville, M. and I. Nikiforov (1993). Detection of Abrupt Changes: Theory and Application. Prentice Hall, www.irisa.fr/sigma2/kniga.
Bérenguer, C., A. Grall, L. Dieulle, and M. Roussignol (2003). Maintenance policy for a continuously monitored deteriorating system. Probability in the Engineering and Informational Sciences 17(2), 235–250.
Blain, C., A. Barros, A. Grall, and Y. Lefebvre (2007). Modelling of stress corrosion cracking with stochastic processes – application to steam generators. In T. Aven and J. Vinnem (Eds.), Risk, Reliability and Societal Safety, Proceedings of the European Safety and Reliability Conference 2007 (ESREL 2007), Stavanger, Norway, 25–27 June 2007, Volume 3, pp. 2395–2400. Taylor and Francis Group, London.
Çinlar, E., Z. Bažant, and E. Osman (1977). Stochastic process for extrapolating concrete creep. Journal of the Engineering Mechanics Division 103(EM6), 1069–1088.
Cooke, R., M. Mendel, and H. Vrijling (Eds.) (1997). Engineering Probabilistic Design and Maintenance for Flood Protection. Kluwer Academic Publishers.
Dieulle, L., C. Bérenguer, A. Grall, and M. Roussignol (2003). Sequential condition-based maintenance scheduling for a deteriorating system. European Journal of Operational Research 150(2), 451–461.
Fouladirad, M., L. Dieulle, and A. Grall (2006). The use of on-line change detection for maintenance of gradually deteriorating systems. In ESREL 2006 Congress, Estoril, Portugal, 18–22 September 2006.
Fouladirad, M., L. Dieulle, and A. Grall (2007). On the use of Bayesian on-line change detection for maintenance of gradually deteriorating systems. In ESREL 2007 Congress, Stavanger, Norway, 25–27 June 2007.
Frangopol, D., M. Kallen, and J. van Noortwijk (2004). Probabilistic models for life-cycle performance of deteriorating structures: review and future directions. Progress in Structural Engineering and Materials 6(4), 197–212.
Grall, A., L. Dieulle, and C. Berenguer (2002). Continuous-time predictive maintenance scheduling for a deteriorating system. IEEE Transactions on Reliability 51(2), 141–150.
Lorden, G. (1971). Procedures for reacting to a change in distribution. The Annals of Mathematical Statistics 42, 1897–1908.
Nikiforov, I. (1995). A generalized change detection problem. IEEE Transactions on Information Theory 41, 171–187.
Saassouh, B., L. Dieulle, and A. Grall (2005). Adaptive maintenance policy for a deteriorating system with random change of mode. In ESREL 2005 Congress, June 2005.
van Noortwijk, J. (2007). A survey of the application of gamma processes in maintenance. Reliability Engineering and System Safety.
van Noortwijk, J. M. and P. van Gelder (1996). Optimal maintenance decisions for berm breakwaters. Structural Safety 18(4), 293–309.
Wang, H. (2002). A survey of maintenance policies of deteriorating systems. European Journal of Operational Research 12, 469–489.

Opportunity-based age replacement for a system under two types of failures

F.G. Badía & M.D. Berrade
Department of Statistics, Centro Politécnico Superior, University of Zaragoza, Zaragoza, Spain

ABSTRACT: This paper is concerned with an opportunity-based age replacement policy for a system under two
types of failures, minor and catastrophic. We consider a general distribution for the time to the first opportunity,
dropping the usual assumption of exponentially distributed times between opportunities. Under this model
the system undergoes a minimal repair whenever a minor failure occurs whereas a perfect restoration follows
any catastrophic failure and after the Nth minor failure. The system is preventively replaced at maintenance
opportunities arising after instant S and also at the moment its age reaches T. We take into account the costs
due to minimal repairs, perfect repairs, opportunity-based replacements and preventive maintenance. We focus
on the optimum policy (T ∗ , N ∗ ) that minimizes the long-term cost per unit of time, providing conditions under
which such optimum policy exists.

1 INTRODUCTION

Preventing systems from failing constitutes a significant task for manufacturers because of the costs and dangers incurred due to failures. Such importance has made the design of maintenance policies an essential research area in reliability theory. The preventive activities carried out to reduce the breakdown risk are combined in practice with corrective actions that restore failed systems to the operating condition as soon as possible. Depending on the quality of the corrective actions, a repair can be classified as perfect or imperfect. The former brings the system back to an as-good-as-new condition, whereas the latter restores the system to somewhere between as-good-as-new and as-bad-as-old. The as-bad-as-old restoration is known as minimal repair.

Many maintenance policies consider that preventive replacements can be carried out at any moment. However, the work interruption time is likely to produce high costs. If so, taking advantage of inactivity periods is recommended, delaying the maintenance until the moment when the mechanism is not required for service. In general such opportunities are a type of event occurring at random, e.g., when another unit in the same system undergoes a failure. Repairs and replacements carried out outside these so-called "maintenance opportunities" can cause cost-ineffective maintenance and may even be responsible for high economic losses.

The following references are based on the principle that an event type can be considered as an opportunity. (Dekker and Smeitink 1991) consider a block replacement model in which a component can be replaced preventively at maintenance opportunities only. Maintenance opportunities occur randomly and are modeled through a renewal process. (Dekker and Dijkstra 1992) establish conditions for the existence of a unique average control limit policy when the opportunities arise according to a Poisson process. (Jhang and Sheu 1999) propose an opportunity-based age replacement with minimal repair for a system under two types of failure. (Iskandar and Sandoh 2000) present a maintenance model taking advantage or not of opportunities in a random way. (Satow and Osaki 2003) consider that the intensity rate of opportunities changes at a specific age. (Coolen-Schrijner et al., 2006) describe an opportunity-based age replacement using nonparametric predictive inference for the time to failure of a future unit. They provide an alternative to the classical approach, where the probability distribution of a unit's time to failure is assumed to be known. (Dohi et al., 2007) analyze this type of replacement policies, which turn out to be less expensive than those carried out at any time without taking opportunities into account.

In this work we present an opportunity-based maintenance model for a system that may undergo two types of failures, minor and catastrophic. Minor failures are followed by a minimal repair to bring the system back into use, whereas catastrophic failures are removed with a perfect repair. A perfect repair is also carried out after the Nth minor failure. From time S on, opportunity maintenances occur which are independent of the system time to failure. In addition the maintenance
policy is completed with a preventive replacement at age T. We assume costs derived from minimal and perfect repairs as well as both opportunity-based maintenance and preventive maintenance costs, aiming at minimizing the expected cost per unit of time. This work is concerned with the conditions under which there exists an optimum policy (T, N). To this end the model is described in Section 2, where the cost function is also derived. Sections 3 and 4 contain the results related to the existence of the optimal policy.

2 THE MODEL

Consider a system that may experience two types of failures. Provided that a failure occurs at time x, it belongs to the minor failure class with probability p(x) or it is a catastrophic failure with probability q(x) = 1 − p(x). The former are removed by means of a minimal repair, whereas the system is restored with a perfect repair after catastrophic failures. In addition, the system is preventively maintained at age T.

Let r(x) be the failure rate of the time to the first failure. Hence, the corresponding reliability function is given by

F̄(x) = e^{−∫_0^x r(u)du}

A failure happening at instant x belongs to the catastrophic class with probability q(x) = 1 − p(x). The reliability function of the time to the first catastrophic failure is

e^{−∫_0^x q(u)r(u)du}

The distributions corresponding to the first catastrophic failure and the Nth minor failure are independent. The reliability function corresponding to whichever occurs first is given by

H̄(x, N) = Σ_{k=0}^{N−1} Dk(x) e^{−∫_0^x r(u)du}

where

Dk(x) = ( ∫_0^x p(u)r(u)du )^k / k!,  k = 0, 1, . . .

and its density function is

h(x, N) = Σ_{k=0}^{N−2} q(x)r(x)Dk(x) e^{−∫_0^x r(u)du} + r(x)D_{N−1}(x) e^{−∫_0^x r(u)du}

Renewal opportunities independent of the system arise from time S on, with 0 < S < T. Let Ḡ(x) denote the reliability function of the time elapsed from S to the first maintenance opportunity. We assume that whenever the system is renewed, Ḡ(x) remains the same.

A cycle, that is, the total renewal of the system, is completed after one of the four events described next:

1. the renewal that follows a catastrophic failure or the Nth minor failure, whichever comes first, occurring before S.
2. the renewal that follows a catastrophic failure or the Nth minor failure, whichever comes first, occurring after S.
3. a maintenance opportunity.
4. a preventive maintenance.

Let p1 denote the probability of a cycle ending after the corrective maintenance that follows events 1) and 2), whereas p2 and p3 represent the corresponding probabilities of events 3) and 4). The foregoing probabilities are calculated next.

p1 = ∫_0^S dH(x, N) + ∫_S^T Ḡ(x − S) dH(x, N)
   = H(S, N) + ∫_S^T Ḡ(x − S) dH(x, N)

The previous integrals represent the probability of a corrective maintenance in [0, S] and [S, T] respectively.

p2 = ∫_0^{T−S} H̄(y + S, N) dG(y) = ∫_S^T H̄(x, N) dG(x − S)

Integrating by parts with u = H̄(x, N) and dv = dG(x − S), we arrive at

p2 = H̄(S, N) − H̄(T, N)Ḡ(T − S) − ∫_S^T Ḡ(x − S) dH(x, N)

p3 = H̄(T, N)Ḡ(T − S)

The following formula provides the mean length of a cycle
E(τ) = ∫_0^S x dH(x, N) + ∫_S^T x Ḡ(x − S) dH(x, N) + ∫_0^{T−S} (y + S) H̄(y + S, N) dG(y) + T H̄(T, N)Ḡ(T − S)

where the four foregoing terms correspond respectively to the mean length of a cycle ending after the events previously numbered 1) to 4). The third term in the formula can be rewritten as

∫_S^T x H̄(x, N) dG(x − S)

Integrating by parts with u = x H̄(x, N) and dv = dG(x − S) leads to

∫_S^T x H̄(x, N) dG(x − S) = −T H̄(T, N)Ḡ(T − S) + S H̄(S, N) + ∫_S^T H̄(x, N)Ḡ(x − S) dx − ∫_S^T x Ḡ(x − S) dH(x, N)

Hence,

E(τ) = ∫_0^S x dH(x, N) + S H̄(S, N) + ∫_S^T H̄(x, N)Ḡ(x − S) dx
     = ∫_0^S H̄(x, N) dx + ∫_S^T H̄(x, N)Ḡ(x − S) dx

Let MR1, MR2, MR3 and MR4 denote the mean number of minimal repairs in a cycle ending after each of the four events described above.

In what follows we denote by s(x) and m(x) the following functions:

s(x) = e^{−∫_0^x p(u)r(u)du}

m(x) = e^{−∫_0^x q(u)r(u)du}

MR1 = ∫_0^S Σ_{k=0}^{N−1} k Dk(x) s(x) q(x) r(x) m(x) dx + ∫_0^S (N − 1) m(x) p(x) r(x) D_{N−1}(x) s(x) dx
    = ∫_0^S ( ∫_0^x p(u)r(u)du ) h(x, N − 1) dx

MR2 = ∫_S^T Ḡ(x − S) Σ_{k=0}^{N−1} k Dk(x) s(x) q(x) r(x) m(x) dx + ∫_S^T (N − 1) Ḡ(x − S) m(x) p(x) r(x) D_{N−1}(x) s(x) dx
    = ∫_S^T Ḡ(x − S) ( ∫_0^x p(u)r(u)du ) h(x, N − 1) dx

MR3 = ∫_0^{T−S} ( ∫_{y+S}^∞ H̄(x, N − 1) q(x) r(x) dx ) ( ∫_0^{y+S} p(u)r(u)du ) dG(y)
    + ∫_0^{T−S} ( ∫_{y+S}^∞ p(x) r(x) D_{N−2}(x) e^{−∫_0^x r(u)du} dx ) ( ∫_0^{y+S} p(u)r(u)du ) dG(y)
    = ∫_0^{T−S} ( ∫_{y+S}^∞ h(x, N − 1) dx ) ( ∫_0^{y+S} p(u)r(u)du ) dG(y)
    = ∫_0^{T−S} H̄(y + S, N − 1) ( ∫_0^{y+S} p(u)r(u)du ) dG(y)
    = ∫_S^T H̄(x, N − 1) ( ∫_0^x p(u)r(u)du ) dG(x − S)
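As a numerical aside, the cycle probabilities p1, p2, p3 and the simplified two-integral expression of E(τ) obtained above can be cross-checked under simple assumed choices — a constant failure rate, a constant minor-failure probability and exponential opportunity delays; these toy values are ours, not part of the paper's general setting:

```python
import random
from math import exp, factorial

random.seed(7)
lam, p, mu, S, T, N = 0.5, 0.7, 0.3, 2.0, 6.0, 3   # assumed toy values

def Dk(x, k):
    return (p * lam * x) ** k / factorial(k)

def Hbar(x):
    # reliability of min(first catastrophic failure, Nth minor failure)
    return exp(-lam * x) * sum(Dk(x, k) for k in range(N))

def h(x):
    # its density, from Section 2
    return exp(-lam * x) * lam * ((1 - p) * sum(Dk(x, k) for k in range(N - 1))
                                  + Dk(x, N - 1))

def Gbar(x):
    return exp(-mu * x)

def integrate(f, a, b, n=10_000):
    step = (b - a) / n      # composite trapezoid rule
    return step * (0.5 * f(a) + 0.5 * f(b)
                   + sum(f(a + i * step) for i in range(1, n)))

# cycle-ending probabilities: they must sum to one
I = integrate(lambda x: Gbar(x - S) * h(x), S, T)
p1 = (1.0 - Hbar(S)) + I            # corrective maintenance
p3 = Hbar(T) * Gbar(T - S)          # preventive replacement at age T
p2 = Hbar(S) - p3 - I               # opportunity-based replacement

# mean cycle length: two-integral formula versus direct simulation
analytic = (integrate(Hbar, 0.0, S)
            + integrate(lambda x: Hbar(x) * Gbar(x - S), S, T))

def cycle_length():
    t, minors = 0.0, 0
    while True:                                   # failures arrive at rate lam
        t += random.expovariate(lam)
        if random.random() > p or minors == N - 1:
            break                                 # catastrophic or Nth minor
        minors += 1
    opportunity = S + random.expovariate(mu)      # first opportunity after S
    return min(t, opportunity, T)

mc = sum(cycle_length() for _ in range(100_000)) / 100_000
```

The identity p1 + p2 + p3 = 1 holds exactly by construction, and the Monte Carlo mean agrees with the analytic E(τ) to within sampling error.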
Integrating by parts with dv = dG(x − S) and u = H̄(x, N − 1) ( ∫_0^x p(u)r(u)du ), we get

MR3 = ( ∫_0^S p(u)r(u)du ) H̄(S, N − 1) − ( ∫_0^T p(u)r(u)du ) H̄(T, N − 1)Ḡ(T − S)
    + ∫_S^T p(x) r(x) H̄(x, N − 1)Ḡ(x − S) dx − ∫_S^T ( ∫_0^x p(u)r(u)du ) Ḡ(x − S) h(x, N − 1) dx

MR4 = Ḡ(T − S) ( ∫_0^T p(u)r(u)du ) ∫_T^∞ Σ_{k=0}^{N−2} Dk(x) q(x) r(x) e^{−∫_0^x r(u)du} dx
    + Ḡ(T − S) ( ∫_0^T p(u)r(u)du ) ∫_T^∞ p(x) r(x) D_{N−2}(x) e^{−∫_0^x r(u)du} dx
    = Ḡ(T − S) ( ∫_0^T p(u)r(u)du ) ∫_T^∞ h(x, N − 1) dx
    = Ḡ(T − S) ( ∫_0^T p(u)r(u)du ) H̄(T, N − 1)

Therefore the mean number of minimal repairs in a cycle is given by

MR = MR1 + MR2 + MR3 + MR4
   = ∫_0^S p(x) r(x) H̄(x, N − 1) dx + ∫_S^T p(x) r(x) Ḡ(x − S) H̄(x, N − 1) dx

Upon catastrophic failures the system is perfectly restored at cost c1, and so it is when a maintenance opportunity arises, with cost c2, and also when it is time for preventive maintenance, c3 being the corresponding cost. Each minimal repair costs c4. Hence the mean cost of a cycle is expressed by the following formula:

E[C(τ)] = c1 p1 + c2 p2 + c3 p3 + c4 MR
        = c1 H(S, N) + c2 H̄(S, N) + (c1 − c2) ∫_S^T Ḡ(x − S) dH(x, N) + (c3 − c2) H̄(T, N)Ḡ(T − S)
        + c4 ∫_0^S p(x) r(x) H̄(x, N − 1) dx + c4 ∫_S^T p(x) r(x) Ḡ(x − S) H̄(x, N − 1) dx

The key theorem of renewal reward processes ensures that the expected cost per unit of time can be expressed as the ratio of the expected cost of a cycle to its expected length, that is,

Q(T, N) = E[C(τ)] / E[τ]

3 OPTIMUM T WHEN N IS SET

The value of T minimizing the cost function is obtained by equating the corresponding derivative to zero, that is, dQ(T, N)/dT = 0, which is equivalent to the equation below:

M(T, N) = R(T, N) L(T, N) − C(T, N) = 0

where

L(T, N) = E[τ]

C(T, N) = E[C(τ)]

R(T, N) = c4 r(T) + (c2 − c3) l(T − S) + (c1 − c3 − c4) ( q(T) + p(T) Z(T, N) ) r(T)

with l(x) being the failure rate corresponding to the distribution G of the time to the first maintenance opportunity, and

Z(T, N) = D_{N−1}(T) / Σ_{k=0}^{N−1} Dk(T)
578

Z(T, N) verifies that 0 ≤ Z(T, N) ≤ 1 and is increasing with T provided that its corresponding derivative is positive:

dZ(T, N)/dT = p(T)r(T) D_{N−2}(T) / (Σ_{k=0}^{N−1} D_k(T))²
  + p(T)r(T) Σ_{j=1}^{N−1} [(N−2+j)! / ((j−1)!(N−2)!)] D_{N−2+j}(T) (1/j − 1/(N−1)) / (Σ_{k=0}^{N−1} D_k(T))²

Given that

d(q(T) + p(T) Z(T, N))/dT = (dq(T)/dT)(1 − Z(T, N)) + p(T) dZ(T, N)/dT,

whenever q(T) is an increasing function of T so is q(T) + p(T) Z(T, N). Assuming that r(t) is increasing and l(x) is decreasing, along with c_2 < c_3 and c_1 − c_3 − c_4 > 0, then R(T, N) is also increasing. Moreover the derivative of Q(T, N) keeps the same sign as M(T, N), which exhibits the same monotonic behavior as R(T, N). Therefore, as M(0, N) = −c_3 < 0, if lim_{T→∞} M(T, N) > 0 and under the foregoing assumptions, then Q(T, N) has a finite minimum T*_N. Such an optimum policy is the unique zero of the equation M(T, N) = 0. In addition, the corresponding optimum cost is Q(T*_N, N) = R(T*_N, N).

4 OPTIMUM N WHEN T IS SET

When the time of the preventive maintenance T is fixed in advance, the optimum N should verify both Q(T, N) ≤ Q(T, N + 1) and Q(T, N − 1) > Q(T, N). The foregoing conditions are equivalent to W(T, N) ≥ c_3 and W(T, N − 1) < c_3 respectively, W(T, N) being defined as follows:

W(T, N) = c_4 [Σ_{k=0}^{N−1} B(T, k) / B(T, N)] ∫_0^T p(x)r(x) D_{N−1}(x) b(x) dx
  + (c_1 − c_2) (Σ_{k=0}^{N−1} d(T, k) − [Σ_{k=0}^{N−1} B(T, k) / B(T, N)] d(T, N))
  − Σ_{k=0}^{N−2} c_4 ∫_0^T p(x)r(x) D_k(x) b(x) dx
  − Σ_{k=0}^{N−2} (c_1 − c_3) ∫_0^T q(x)r(x) D_k(x) b(x) dx
  − Σ_{k=0}^{N−2} (c_1 − c_3) ∫_0^T l(x − S) D_k(x) b(x) dx
  − (c_1 − c_3) ∫_0^T D_{N−1}(x)(r(x) + l(x − S)) b(x) dx
  − (c_1 − c_3) Σ_{k=0}^{N−1} B(T, k) e(T, N)

where b(x), B(T, k) and d(T, k) are given by

b(x) = e^{−∫_0^x (r(u) + l(u − S)) du}
B(T, k) = ∫_0^T D_k(x) b(x) dx,   k = 0, 1, . . .
d(T, k) = ∫_0^T l(x − S) D_k(x) b(x) dx,   k = 0, 1, . . .
e(T, k) = D_k(T) F(T) G(T − S) / B(T, k),   k = 0, 1, . . .

By means of the next identity,

∫_0^T p(x)r(x) D_{N−1}(x) b(x) dx = D_N(T) F(T) G(T − S) + ∫_0^T D_N(x)(r(x) + l(x − S)) b(x) dx,

which is obtained integrating by parts in the previous expression, we get

W(T, N + 1) − W(T, N) = Σ_{k=0}^{N} B(T, k) [c_4 (F(T, N + 1) − F(T, N)) + (c_1 − c_3 − c_4)(e(T, N) − e(T, N + 1))
  + (c_1 − c_2 − c_4)(d(T, N)/B(T, N) − d(T, N + 1)/B(T, N + 1))]   (1)

where

F(T, k) = [∫_0^T r(x) D_k(x) b(x) dx] / B(T, k),   k = 0, 1, . . .

Following similar strategies to those provided in (Nakagawa 2005) (Chapter 4, Theorems 4.3 and 4.4), we get that F(T, N) is increasing with N if r(x) is also increasing, whereas if l(x) is decreasing, so is d(T, N)/B(T, N).
Moreover it can be easily proved that e(T, N) increases with N. Then, whenever r(x) and l(x) are increasing and decreasing respectively, along with c_1 − c_3 − c_4 < 0 and c_1 − c_2 − c_4 > 0, it follows from (1) that W(T, N) is increasing. Hence, under the foregoing assumptions the optimum N when T is previously set, N*_T, is the minimum N satisfying W(T, N) ≥ c_3. In case W(T, N) < c_3 for all N, then N*_T = ∞.

5 CONCLUSIONS

The high cost incurred by some preventive maintenances motivates carrying out opportunity-based policies. This paper provides conditions under which an optimum opportunity-based policy exists in two cases: the optimum T* for a given N, and the optimum N* when T is fixed. Such conditions involve an increasing failure rate of the time to failure and a decreasing failure rate of the time to the first opportunity, apart from cost-related conditions. Concerning the simultaneous optimization of both T and N, we consider the use of the following algorithm proposed by (Zequeira and Bérenguer 2006) and (Nakagawa 1986):

1. Set N = 1.
2. If Q(T*_{N+1}, N + 1) < Q(T*_N, N), then go to step 3; otherwise go to step 4.
3. Set N = N + 1 and return to step 2.
4. Set N* = N.

The optimal policy turns out to be (T*, N*). Note that the foregoing algorithm does not ensure a global optimum but just a local one. Moreover, obtaining conditions that guarantee the existence of an optimum (T*, N*) seems to be a difficult task.

REFERENCES

Coolen-Schrijner, P., F. Coolen, and S. Shaw (2006). Nonparametric adaptive opportunity-based age replacement strategies. Journal of the Operational Research Society.
Dekker, R. and M. Dijkstra (1992). Opportunity-based age replacement: exponentially distributed times between opportunities. Naval Research Logistics (39), 175–190.
Dekker, R. and E. Smeitink (1991). Opportunity-based block replacement. European Journal of Operational Research (53), 46–63.
Dohi, T., N. Kaio, and S. Osaki (2007). Discrete time opportunistic replacement policies and their application. In: Recent Advances in Stochastic Operations Research. World Scientific.
Iskandar, B. and H. Sandoh (2000). An extended opportunity-based age replacement policy. RAIRO Operations Research (34), 145–154.
Jhang, J. and S. Sheu (1999). Opportunity-based age replacement policy with minimal repair. Reliability Engineering and System Safety (64), 339–344.
Nakagawa, T. (1986). Periodic and sequential preventive maintenance policies. Journal of Applied Probability (23), 536–542.
Nakagawa, T. (2005). Maintenance Theory of Reliability. Springer.
Satow, T. and S. Osaki (2003). Opportunity-based age replacement with different intensity rates. Mathematical and Computer Modelling (38), 1419–1426.
Zequeira, R. and C. Bérenguer (2006). Optimal scheduling of non-perfect inspections. IMA Journal of Management Mathematics (2), 187–207.
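The alternating search over T and N described in the Conclusions can be sketched generically. This is a sketch, not code from the paper: `Q` (the long-run cost rate) and `argmin_T` (a routine returning T*_N, e.g. the zero of M(T, N)) are assumed to be supplied by the modeller.

```python
def optimize_T_N(Q, argmin_T, N_max=100):
    """Coordinate-wise (T, N) search in the spirit of Nakagawa (1986):
    increase N while the cost at the T-optimum keeps decreasing.

    Q(T, N)     -- long-run expected cost per unit of time
    argmin_T(N) -- returns T*_N, the optimal T for a fixed N
    """
    N = 1
    T = argmin_T(N)
    while N < N_max:
        T_next = argmin_T(N + 1)
        if Q(T_next, N + 1) < Q(T, N):   # step 2: keep increasing N
            N, T = N + 1, T_next
        else:                            # step 4: stop
            break
    return T, N
```

As the paper notes, the returned pair is only a local optimum, not necessarily a global one.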

Optimal inspection intervals for maintainable equipment

O. Hryniewicz
Systems Research Institute, Warsaw, Poland

ABSTRACT: Maintainable technical systems are considered whose failures can be revealed only by special inspections. It is assumed that these inspections may be imperfect, i.e. their results may suggest wrong decisions. Two optimization models are considered: one in which the coefficient of availability is maximized, and a second in which the related costs are minimized. For both models approximately optimal solutions have been found. The presented examples show that these solutions are very close to the exact solutions when the time to failure is exponentially distributed. The paper is illustrated with two numerical examples.

1 INTRODUCTION

For some technical systems their reliability state (operation or failure) cannot be directly observed. For example, when a failure of a system is not of a catastrophic nature it can be recognized either by doing some measurements or by observing the system's performance. Technical equipment of a production line serves as a good example in such a case, when failures result rather in worse performance (measured by the observed percentage of produced nonconforming items) than in a complete stoppage of the production process. One-shot systems such as back-up batteries are examples of other systems of that kind. Their reliability state is not directly observable until the moment when they are needed. In all such cases there is a need to perform inspections in order to reveal the actual reliability state of the considered system. Inspection policies may vary from very simple ones, when times between consecutive inspections (inspection intervals) are constant, to very complicated ones, computed with the usage of additional information about the system.

The problem of finding optimal inspection intervals has attracted many researchers since the works of Savage (1956), Barlow et al. (1960) and some other authors published in the late 1950s and early 1960s. The first general model was presented by Barlow, Hunter and Proschan (1963), who assumed perfect inspections. Later generalizations of this model were proposed, i.a., by Menipaz (1978) and von Collani (1981). Among relatively recent papers devoted to this problem, presenting simple models applicable in practice, we may list the papers by Baker (1990), Chung (1993), Vaurio (1994), and Hariga (1996). More complex inspection models are presented in recent papers by Fung and Makis (1997), Chelbi and Ait-Kadi (1999), Wang and Christer (2003), and Berger et al. (2007). There also exist papers, such as those by Badia et al. (2001, 2002), Vaurio (1999), and Khan et al. (2008), presenting very complex models which, e.g., include the possibility of preventive maintenance actions, but in our opinion these models are too complicated to be used by the majority of practitioners.

In this paper we consider the problem of the optimal choice of inspection intervals for maintainable (renewable) equipment. We assume that the state of a system is evaluated periodically, and when its failure is revealed the system is renewed in the sense of reliability theory. We consider two types of optimization tasks:

1. Minimization of costs incurred by inspections and failures of the system, applicable e.g. for production equipment.
2. Maximization of availability, applicable e.g. for one-shot equipment.

The structure of the paper is the following. In the second section we present mathematical models describing periodically inspected maintainable equipment. Optimization tasks are formulated and solved in the third section. Simple practical applications are presented in the last section of the paper.

2 MATHEMATICAL MODEL

Consider renewable equipment whose state can be revealed only as a result of a special inspection. The considered equipment can be in two exclusive reliability states: the "up" state, denoted as state 1, and the "down" state, denoted as state 2. Let us assume that the random time to failure T of this equipment, i.e. the time of transition from state 1 to state 2, is distributed according to a known cumulative distribution function
F(t). The actual reliability state of this equipment is revealed by periodical inspections, performed every h time units. When a failure is revealed the system is renewed or replaced by a new one.

Let us assume that inspections may not be perfect. In such a case there exists a probability of a false alarm, α, which is the probability of revealing a non-existent failure of the inspected equipment, and a probability β of not revealing an actual failure. Such probabilities are always greater than zero when the actual reliability state of the considered equipment is evaluated as the result of a statistical analysis of measurement results.

It is easy to show that the expected number of inspections performed during the time when the system is "up" can be expressed as follows:

A_1(h) = Σ_{i=0}^{∞} i [P(T < (i + 1)h) − P(T < ih)] = Σ_{i=1}^{∞} R(ih),   (1)

where R(ih) = 1 − F(ih). Thus, the expected number of false alarms is equal to

E_I(h) = α A_1(h).   (2)

The expected number of inspections during the time when the system is "down" entirely depends upon the probability β, and is equal to

E_II(h) = A_2 = 1/(1 − β).   (3)

There are numerous examples of such systems. For instance, the state of a certain production process which produces items whose quality can be evaluated by destructive tests may be revealed by periodical sampling of its output. Another simple example of such a system is a battery which backs up the power supply of other equipment. Such a system is a typical one-shot system whose state is usually monitored during periodical inspections.

In our model we assume that all actions last some time during which the system is out of service. Let us assume that the expected time of an inspection is equal to μ_0, and does not depend upon the actual state of the inspected system. When the inspection reveals a false alarm the expected additional out-of-service time is equal to μ_a, and when it reveals an actual failure the expected repair (renewal) time is equal to μ_r. Therefore the expected time between consecutive renewals is given by

T_r = [A_1(h) + A_2] h + A_1(h)(μ_0 + α μ_a) + A_2 μ_0 + μ_r.   (4)

Hence, the stationary availability coefficient for this equipment is given by

K(h) = τ / T_r(h),   (5)

where τ = E(T) is the expected time to failure.

From the analysis of (4) and (5) it is obvious that when h tends to 0 the expected number of inspections, and thus the expected time devoted to inspections, tends to infinity. Hence, the availability in such a case tends to zero. Similarly, when h tends to infinity the expected duration of state 2 is infinite, and the availability also tends to zero. Hence, there exists an optimal inspection interval h_K, for which the availability K(h) attains its maximum. We will consider the problem of finding this optimal value in the next section of this paper.

The optimization of the inspection interval h can also be considered as the problem of the minimization of some costs. Let us assume that the average cost of each inspection is equal to c_0. Moreover, we assume that the average additional cost of a false alarm is equal to c_a, and the average cost of the system's renewal is equal to c_r. We can also distinguish the average cost related to a failure, c_f, but as this cost is unavoidable it can also be included in the cost of renewal c_r, which has the same nature. There exist also losses which depend upon the duration of state 2. We assume that this cost is proportional to the duration of state 2, i.e. the time between the failure and the moment of its detection, and that this cost calculated per time unit is equal to c_l.

Now we can calculate the expected costs of inspections C_1, the expected costs related to false alarms C_2, and the expected costs related to the failure C_3. They can be calculated from the following expressions:

C_1(h) = [A_1(h) + A_2] c_0   (6)

C_2(h) = α A_1(h) c_a   (7)

C_3(h) = c_f + c_r + {[A_1(h) + A_2] h − τ} c_l   (8)

The total expected cost related to one renewal period is now given by

C(h) = C_1(h) + C_2(h) + C_3(h).   (9)
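Formulas (1)–(9) are easy to evaluate numerically for an arbitrary lifetime distribution by truncating the series in (1). The sketch below is an illustration, not code from the paper; the exponential survival function and all parameter values are assumptions. For the exponential case the truncated sum can be checked against the closed form 1/(e^{λh} − 1) used later in the paper.

```python
import math

def A1(h, R, tol=1e-12):
    """Expected number of inspections while 'up', eq. (1): sum of R(i*h), i >= 1."""
    total, i = 0.0, 1
    while True:
        term = R(i * h)       # survival function evaluated at the i-th inspection
        total += term
        if term < tol:        # truncate once terms become negligible
            return total
        i += 1

def availability(h, R, tau, alpha, beta, mu0, mua, mur):
    """Stationary availability K(h) = tau / Tr(h), eqs. (3)-(5)."""
    a1 = A1(h, R)
    a2 = 1.0 / (1.0 - beta)                                          # eq. (3)
    Tr = (a1 + a2) * h + a1 * (mu0 + alpha * mua) + a2 * mu0 + mur   # eq. (4)
    return tau / Tr                                                  # eq. (5)

def expected_cost(h, R, tau, alpha, beta, c0, ca, cr, cf, cl):
    """Total expected cost per renewal period, eqs. (6)-(9)."""
    a1, a2 = A1(h, R), 1.0 / (1.0 - beta)
    return (a1 + a2) * c0 + alpha * a1 * ca + cf + cr + ((a1 + a2) * h - tau) * cl

# Illustrative check with an exponential lifetime (tau = 1000, h = 100):
tau = 1000.0
R = lambda t: math.exp(-t / tau)
assert abs(A1(100.0, R) - 1.0 / math.expm1(100.0 / tau)) < 1e-6
```

The truncation tolerance trades accuracy for run time; for heavy-tailed lifetime distributions the series converges slowly and a smarter cutoff would be needed.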
expected cost C(h) is infinite. Thus, there exists an optimal inspection interval h_C, for which the expected cost C(h) attains its minimum. We will consider the problem of finding this optimal value in the next section of this paper.

3 OPTIMAL INSPECTION INTERVALS

Optimal values of the inspection interval h can be found using a standard optimization procedure. First, we find the derivative with respect to h of the goal function K(h) or C(h), respectively. Then we equate these derivatives to zero, arriving at equations whose solutions are equal to the respective optimal values of h. In the case of the maximization of K(h) given by (5) this equation is expressed as follows:

h A_1'(h) + [A_1(h) + A_2] + A_1'(h)(μ_0 + α μ_a) = 0,   (10)

where

A_1'(h) = − Σ_{i=1}^{∞} i f(ih),   (11)

and f(.) is the probability density function of the time to failure T.

In the case of the minimization of C(h) given by (9) the respective equation is given by

A_1'(h)(h c_l + c_0 + α c_a) + A_1(h) c_l + c_l A_2 = 0.   (12)

The main computational problem related either to the maximization of K(h) given by (5) or the minimization of C(h) given by (9) is caused by difficulties with the calculation of A_1(h) according to (1), and of its derivative given by (11). This function can be expressed in a closed form only for specially defined probability distributions F(t). The only popular probability distribution for which A_1(h) is given by a closed formula is the exponential distribution with the hazard rate λ equal to the reciprocal of the expected time to failure, i.e. λ = 1/τ. In this case by simple calculations we can show that

A_1(h) = 1/(e^{λh} − 1).   (13)

Let us consider the problem of the maximization of (5) when the time to failure is exponentially distributed. By elementary calculations we can find that the objective function is in this case expressed as

K(h) = τ(e^{λh} − 1) / {[1 + A_2(e^{λh} − 1)] h + (e^{λh} − 1)(μ_r + A_2 μ_0) + μ_0 + α μ_a}.   (14)

In order to find the optimal value of h we have to find the derivative of (14) with respect to h, and then equate this derivative to zero. As the result of these calculations we arrive at the following equation:

−τ A_2 e^{2λh} + h e^{λh} + e^{λh} [μ_0 + α μ_a + τ(2A_2 − 1)] − τ(A_2 − 1) = 0.   (15)

This nonlinear equation can be solved numerically using, e.g., the Newton-Raphson method. Application of this method requires the computation of the derivative of the left-hand side of (15). This derivative is expressed as

D = −2A_2 e^{2λh} + λ h e^{λh} + e^{λh} [1 + λ(μ_0 + α μ_a) + (2A_2 − 1)].   (16)

When our objective is to minimize the cost given by (9), in the case of the exponentially distributed lifetime the goal function is expressed as follows:

C(h) = (e^{λh} − 1)^{−1} [h c_l + (c_0 + α c_a)] + c_l A_2 h + c_r + c_f − τ c_l.   (17)

When we equate to zero the derivative of (17) we arrive at the following nonlinear equation:

c_l A_2 e^{2λh} − λ c_l h e^{λh} − e^{λh} [λ(c_0 + α c_a) + c_l(2A_2 − 1)] + c_l(A_2 − 1) = 0.   (18)

This equation can be solved only numerically. When for this purpose we use the Newton-Raphson method we need to find the derivative of the left-hand side of (18). This derivative is given by the following expression:

D_C = 2λ c_l A_2 e^{2λh} − λ c_l e^{λh}(1 + λh) − λ e^{λh} [λ(c_0 + α c_a) + c_l(2A_2 − 1)].   (19)

Even in the simplest case of the exponentially distributed time to failure the solutions of the optimization problems require numerical computations. In both cases it is necessary to find an appropriate initial value of h.
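For the exponential lifetime, equation (15) can be solved by the Newton-Raphson iteration using the derivative (16). The sketch below is an illustration, not code from the paper; the parameter values are those of the example discussed later in the paper (τ = 1000, α = 0.01, β = 0.1, μ_a = 100, μ_0 = 60), for which the exact optimum is reported as 302.54.

```python
import math

tau, alpha, beta, mu0, mua = 1000.0, 0.01, 0.1, 60.0, 100.0
lam = 1.0 / tau
A2 = 1.0 / (1.0 - beta)

def f(h):
    """Left-hand side of eq. (15); its zero is the availability-optimal interval."""
    e = math.exp(lam * h)
    return (-tau * A2 * e * e + h * e
            + e * (mu0 + alpha * mua + tau * (2 * A2 - 1)) - tau * (A2 - 1))

def df(h):
    """Derivative of f, eq. (16)."""
    e = math.exp(lam * h)
    return (-2 * A2 * e * e + lam * h * e
            + e * (1 + lam * (mu0 + alpha * mua) + (2 * A2 - 1)))

# Newton-Raphson started from the approximate optimum, eq. (26): about 315.9
h = math.sqrt(tau * (alpha * mua + mu0) / (A2 - 0.5))
for _ in range(50):
    step = f(h) / df(h)
    h -= step
    if abs(step) < 1e-9:
        break
print(round(h, 1))   # about 302.5, close to the 302.54 reported in the paper
```

Starting the iteration from the approximate solution (26) makes convergence reliable here, since that guess already lies close to the root.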
This value can be found using the following approximate optimization procedure, which is based on the approximation proposed by Hryniewicz (1992):

A_1(h) ≈ τ/h − 0.5.   (20)

This approximation is valid when h is significantly smaller than τ, and, what is more important, it is valid for any probability distribution of the time to failure.

When we apply this approximation to the objective function given by (5) we obtain

K(h) = τh / (W_1 h² + W_2 h + W_3),   (21)

where

W_1 = A_2 − 0.5,   (22)

W_2 = τ − 0.5 α μ_a + μ_r + (A_2 − 0.5) μ_0,   (23)

W_3 = τ(α μ_a + μ_0).   (24)

The derivative of (21) with respect to h gives a simple equation for the approximately optimal value of h:

−W_1 h² + W_3 = 0.   (25)

Hence, the approximately optimal inspection interval is given by the following simple expression:

h_K = sqrt[τ(α μ_a + μ_0) / (A_2 − 0.5)].   (26)

It is interesting that the optimal solution does not depend on the expected time of renewal actions. However, this is hardly surprising, as the idle time of the system related to these actions is unavoidable, and is not influenced by the inspection policy. It is also interesting that in the case of perfect, error-free inspections the optimal inspection interval can be found from a very simple formula:

h_{K,p} = sqrt(2 τ μ_0).   (27)

Let us evaluate the accuracy of the approximate solution of the considered optimization problem. We compare the approximately optimal inspection intervals calculated from (26) with the optimal values calculated from (15) for the case of the exponential distribution of the time to failure. We fix the expected time to failure as 1000 time units, and vary the value of

Z_K = sqrt[(α μ_a + μ_0) / (A_2 − 0.5)],   (28)

which determines the relation between h_K and τ. The results of this comparison are presented in Table 1.

Table 1. Comparison of approximately optimal and optimal inspection intervals.

Z_K    h_K       h_{K,opt}
1      31.622    31.485
2      63.245    62.701
3      94.868    93.645
5      158.114   154.728
10     316.228   302.804

The results presented in Table 1 show that even when the inspection interval is comparable with the expected time to failure the difference between the approximately optimal solution and the exact optimal solution is negligible from a practical point of view. Moreover, even in the worst case, i.e. when Z_K = 10, the values of the availability coefficients are nearly equal for both optimal solutions. It is interesting, however, how the difference between the length of the approximately optimal inspection interval and the length of the exactly optimal inspection interval may influence the value of the availability coefficient. Let us consider an example where τ = 1000, α = 0.01, β = 0.1, μ_a = 100, μ_0 = 60, and μ_r = 200. The value of Z_K is equal to 9.991, and the length of the approximately optimal inspection interval is equal to h_K = 315.94. Under the assumption of the exponential distribution of the time to failure, by maximization of (14) we find that the length of the exactly optimal inspection interval is in this case equal to 302.54, i.e. it is about 5% shorter. However, when we compare the availability coefficients calculated for these two values we obtain 0.61266 and 0.61281, respectively. Thus, even in the case when the length of the optimal inspection interval is comparable to the expected time to failure, and the compared optimal values are visibly different, the difference between the availability coefficients may be neglected. In more realistic examples, when Z_K is much smaller, this difference is even smaller.
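Table 1 can be reproduced by solving (15) numerically (bisection is used here instead of Newton-Raphson) and comparing the root with the approximation (26). The parameterization below, fixing β = 0.1 and choosing μ_0 so that (28) yields the desired Z_K with α = 0, is an illustrative assumption, not the one used in the paper.

```python
import math

tau, beta = 1000.0, 0.1
lam, A2 = 1.0 / tau, 1.0 / (1.0 - beta)

def f(h, mu0, alpha=0.0, mua=0.0):
    """Left-hand side of eq. (15) for the exponential lifetime."""
    e = math.exp(lam * h)
    return (-tau * A2 * e * e + h * e
            + e * (mu0 + alpha * mua + tau * (2 * A2 - 1)) - tau * (A2 - 1))

def exact_hK(mu0):
    """Root of (15) by bisection; f is positive near 0 and negative for large h."""
    lo, hi = 1e-9, 2000.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid, mu0) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for ZK in (1, 2, 3, 5, 10):
    mu0 = ZK ** 2 * (A2 - 0.5)       # chosen so that eq. (28) gives this ZK
    approx = ZK * math.sqrt(tau)     # eq. (26) rewritten as hK = ZK * sqrt(tau)
    print(ZK, round(approx, 3), round(exact_hK(mu0), 3))   # approximate vs exact
```

The last row of the printed comparison matches the last row of Table 1: 316.228 for the approximation against roughly 302.8 for the exact root.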
Now let us find a similar approximate solution of the optimization problem defined as the minimization of the costs given by (9). The objective function is in this case given by

C(h) = (τ/h − 0.5)[h c_l + (c_0 + α c_a)] + c_l A_2 h + c_r + c_f − τ c_l.   (29)

The derivative of (29) with respect to h also gives a simple equation for the approximately optimal value of h:

c_l (A_2 − 0.5) h² − τ(c_0 + α c_a) = 0.   (30)

Hence, the approximately optimal inspection interval is given by the following simple expression:

h_C = sqrt[τ(c_0 + α c_a) / (c_l (A_2 − 0.5))].   (31)

Also in this case the optimal solution does not depend upon the costs of renewal, c_r, and failure, c_f, as these costs are in this model unavoidable, and are not influenced by the inspection policy. In the case of perfect inspections, the respective optimal inspection interval is now given by

h_{C,p} = sqrt(2 τ c_0 / c_l).   (32)

We may also note the similarity between the approximately optimal solutions in both considered models, suggesting the existence of a certain equivalence between both approaches.

In order to evaluate the accuracy of the approximate solution of this optimization problem we compare the approximately optimal inspection intervals calculated from (31) with the optimal values calculated from (18) for the case of the exponential distribution of the time to failure. As in the previous case we fix the expected time to failure as 1000 time units, and vary the value of

Z_C = sqrt[(c_0 + α c_a) / (c_l (A_2 − 0.5))],   (33)

which determines the relation between h_C and τ. The results of this comparison are presented in Table 2.

Table 2. Comparison of approximately optimal and optimal inspection intervals.

Z_C    h_C       h_{C,opt}
1      31.622    31.620
2      63.245    62.868
3      94.868    93.803
5      158.114   154.861
10     316.228   302.879

The results presented in Table 2 show exactly the same properties of the optimal inspection intervals as were presented in Table 1 for the previously considered model. What is more interesting, the accuracy of the approximate solutions, measured in terms of differences between optimal and approximately optimal results, is very similar for both models. This observation confirms our suggestion that the equality of Z_K and Z_C means that the consequences of making inspections, measured either in terms of availability or in terms of related costs, are equivalent.

4 NUMERICAL EXAMPLES

Let us consider two applications of the proposed solutions to the optimization problems. As the first example let us consider a production machine which produces metal cans. The performance of this machine is described by its ability to produce tight, non-leaky cans, and is measured by the probability of the production of a good can. There are two types of failures of this machine. The first one is catastrophic, and may be noticed immediately after its occurrence. We are interested, however, in failures of the second type, which are manifested by an increased probability of the production of potentially leaky cans.

Let us assume that the acceptable quality level, expressed in terms of the probability of the production of a leaky can, is equal to p_1 = 0.001, i.e. there is on average not more than one nonconforming can per one thousand produced cans. The non-acceptable level, whose occurrence is treated as the occurrence of the machine's failure, is equal to p_2 = 0.01, i.e. we consider the process faulty if there is on average one or more nonconforming cans per each one hundred produced cans. The current performance of the machine is measured using a sampling inspection plan taken from the international standard ISO 2859-1 (1999), for which the lot sample size is equal to n = 125 cans. The alarm is raised when at least one nonconforming can is found among the sampled cans. Thus, the probability of a false alarm is equal to α = 1 − (1 − 0.001)^125 = 0.1176 and the probability of not revealing the actual failure is equal to β = (1 − 0.01)^125 = 0.7153. Let us also assume that the cost of this sampling action is equal to c_0 = 1 unit (i.e. we relate all the considered costs to the cost of inspection). When the alarm is raised an additional 500 cans are tested, and when all of them are free from defects the alarm is considered as false. Otherwise, the machine is considered as failed, and renewal actions have to be undertaken.
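The false-alarm probability quoted for this zero-acceptance sampling plan is just the binomial probability of finding at least one nonconforming can at the acceptable level; a quick check (assuming independent items):

```python
n, p1 = 125, 0.001
alpha = 1 - (1 - p1) ** n   # P(at least one nonconforming item in the sample)
print(round(alpha, 4))      # 0.1176, the value used in the example
```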
Note that in this case the probability of not revealing a failure is equal to 0.0019, so this additional procedure may be considered as practically error-free. The additional cost of a false alarm in the considered case equals c_a = 4. If we assume that the expected time between consecutive failures is equal to τ = 1000 time units, and the loss per time unit caused by the work of a failed machine equals c_l = 10, then we have Z_C = 0.221, and the approximately optimal inspection interval, calculated according to (31), is equal to 6.986 time units. When the time to failure is exponentially distributed the optimal inspection interval, calculated as the solution of (18) with a precision of 3 decimal places, is exactly the same.

Let us now consider another practical example, where the evaluation of costs is difficult or even hardly possible. In this example we consider a UPS battery backup system which backs up the power supply of a continuously working computer system. The failure of the UPS system occurs when its batteries are discharged or/and its switching system is out of order. The state of the batteries can be evaluated immediately. However, the state of the switching system can be evaluated by tests which last on average 1 time unit. Let us assume now that the probability of a false alarm is equal to α = 0.05 (i.e. on average one of every 20 routine tests triggers an unnecessary alarm), and the average time necessary to reveal that this alarm is actually false is equal to 5 time units. Moreover, let us assume that the probability of not detecting an existing failure is equal to β = 0.1, and, as in the previous example, let us assume that the expected time between failures is equal to τ = 1000 time units. The approximately optimal inspection interval, calculated according to (26), is equal to 45.227 time units. When the time to failure is exponentially distributed the optimal inspection interval, calculated as the solution of (15), equals 44.948 time units, but the availability coefficients (calculated for different values of the renewal time) are nearly the same.

Both examples show that the proposed approximate method for the calculation of the optimal inspection interval allows finding the optimal solution with the help of a simple pocket calculator. What is more important, and perhaps even surprising, is that the approximate solutions are very close to the exact ones.

REFERENCES

Badia, F.G., Berrade, M.D., Campos, C.A. 2001. Optimization of inspection intervals based on costs. Journal of Applied Probability, 38: 872–881.
Badia, F.G., Berrade, M.D., Campos, C.A. 2002. Optimal inspection and preventive maintenance of units with revealed and unrevealed failures. Reliability Engineering & System Safety, 78: 157–163.
Baker, M.J.C. 1990. How often should a machine be inspected? International Journal of Quality and Reliability Management, 4(4): 14–18.
Barlow, R.E., Hunter, L.C., Proschan, F. 1960. Optimum checking procedures. In: Proc. of the Seventh National Symposium on Reliability and Quality Control, 9: 485–495.
Barlow, R.E., Hunter, L.C., Proschan, F. 1963. Optimum checking procedures. Journal of SIAM, 11: 1078–1095.
Berger, K., Bar-Gera, K., Rabinowitz, G. 2007. Analytical model for optimal inspection frequency with consideration of setup inspections. Proc. of IEEE Conference on Automation Science and Engineering: 1081–1086.
Chelbi, A., Ait-Kadi, D. 1999. Reliability Engineering & System Safety, 63: 127–131.
Chung, K.-J. 1993. A note on the inspection interval of a machine. International Journal of Quality and Reliability Management, 10(3): 71–73.
Collani von, E. 1981. On the choice of optimal sampling intervals to maintain current control of a process. In: Lenz, H.-J., et al. (Eds.) Frontiers in Statistical Quality Control: 38–44, Wuerzburg, Physica Verlag.
Fung, J., Makis, V. 1997. An inspection model with generally distributed restoration and repair times. Microelectronics and Reliability, 37: 381–389.
Hariga, M.A. 1996. A maintenance inspection model for a single machine with general failure distribution. Microelectronics and Reliability, 36: 353–358.
Hryniewicz, O. 1992. Approximately optimal economic process control for a general class of control procedures. In: H.-J. Lenz et al. (Eds.) Frontiers in Statistical Quality Control IV: 201–215, Heidelberg, Physica Verlag.
ISO 2859-1: 1989(E): Sampling procedures for inspection by attributes. Part 1. Sampling schemes indexed by acceptable quality level (AQL) for lot-by-lot inspection.
Khan, F.I., Haddara, M., Krishnasamy, L. 2008. A new methodology for Risk-Based Availability Analysis. IEEE Transactions on Reliability, 57: 103–112.
Menipaz, E. 1978. On economically based quality control decisions. European Journal of Operational Research, 2: 246–256.
Savage, I.R. 1956. Cycling. Naval Research Logistics Quarterly, 3: 163–175.
Vaurio, J.K. 1994. A note on optimal inspection intervals. International Journal of Quality and Reliability Management, 11(6): 65–68.
Vaurio, J.K. 1999. Availability and cost functions for periodically inspected preventively maintained units. Reliability Engineering & System Safety, 63: 133–140.
Wang, W., Christer, A.H. 2003. Solution algorithms for a nonhomogeneous multi-component inspection model. Computers & Operations Research, 30: 19–34.
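Both closing examples can be checked directly from (26) and (31). The sketch below plugs in the paper's printed values (including β = 0.7153 for the sampling plan, taken as given) and reproduces the reported intervals:

```python
import math

def approx_interval(tau, num, den):
    """Eqs. (26)/(31): approximately optimal interval sqrt(tau * num / den)."""
    return math.sqrt(tau * num / den)

tau = 1000.0

# Example 1, cost criterion, eq. (31): c0=1, ca=4, cl=10, alpha=0.1176, beta=0.7153
A2 = 1.0 / (1.0 - 0.7153)
hC = approx_interval(tau, (1.0 + 0.1176 * 4.0) / 10.0, A2 - 0.5)
print(round(hC, 3))   # 6.986 time units, as reported

# Example 2, availability criterion, eq. (26): alpha=0.05, mua=5, mu0=1, beta=0.1
A2 = 1.0 / (1.0 - 0.1)
hK = approx_interval(tau, 0.05 * 5.0 + 1.0, A2 - 0.5)
print(round(hK, 3))   # 45.227 time units, as reported
```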

Optimal periodic inspection of series systems with revealed


and unrevealed failures

M. Carvalho
University of Minho, Braga, Portugal

E. Nunes & J. Telhada


University of Minho, Braga, Portugal

ABSTRACT: Maintenance is one of the main tools used to assure the satisfactory functioning of components and equipment and the reliability of technological systems. The literature on maintenance policies and models is enormous, and there are a great number of described contexts where each required maintenance policy is selected to satisfy technical and financial restrictions. However, by assuming very simplified conditions, many studies have a limited applicability in reality. Considering a maintenance policy based on periodic inspections, a model is presented in this article that determines the time interval between inspections that minimizes the global cost of maintenance per unit of time. It is assumed that the system consists of n components in series. The model recognizes the occurrence of failures that are immediately revealed as well as failures that are only revealed at the first inspection after their occurrence. The model also incorporates repairing times of components, but both the duration of inspections and the duration of preventive maintenance are neglected. The analytical development of the model allows us to obtain a closed-form function to determine the optimal time period between inspections. This function is discussed and a numerical example is presented.

1 INTRODUCTION

From an operational perspective, the most critical phase in the life cycle of a technological system is the phase of operation and maintenance. This is also the phase that contributes the largest share of the Life Cycle Cost (LCC) of the system. This is why the problem of estimating the instants of time for inspections is considered of great interest and, often, of primordial importance. These instants are scheduled to carry out the necessary preventive maintenance of the system in order to keep it running normally at a pre-specified level of service.
In the last few decades, the problems of maintenance and replacement of systems have been extensively studied by many researchers. Optimizing the maintenance costs has clearly been the most commonly formulated objective function, and many related models have been proposed for those problems. Kallen & van Noortwijk (2006) and Badía et al. (2002) develop models of maintenance policies with imperfect inspections for single-component systems, whose failures are (or are not) revealed randomly at the instant of failure. Their models neglect the duration of maintenance actions and consider that those restore the component to a state of as good as new. Chiang & Yuan (2001) and Wang & Zhang (2006) also propose similar maintenance models, aiming at the optimization of the global cost, but consider that both the preventive and corrective maintenances are imperfect. Under the same optimizing objective, other authors propose models for multi-component systems. For example, Bris et al. (2003) present a plan of periodic maintenance for series and parallel systems, where the failures are only detected at inspections. Zequeira & Bérenguer (2005) develop an analogous maintenance policy for a parallel system with two components. Barros et al. (2006) add to the previous model the possibility of failures being detected at the instant of their occurrence. In all of these works, the repairing times are neglected.


2 THE MODEL

This paper develops a policy of periodic inspections comprising preventive maintenance actions for n independent components in series. It is considered that the failure of a component is immediately revealed with probability p and is not revealed with probability 1 − p. The term failure does not necessarily imply that the system stops working at such an occurrence, but it
NOTATION
C Total cost of maintenance per cycle
C1 Cost of maintenance of revealed failures per cycle
C2 Cost of maintenance of unrevealed failures per cycle
I1 Number of inspections until the occurrence of a revealed failure per cycle
I2 Number of inspections until the detection of an unrevealed failure per cycle
D Time of undetected malfunctioning per cycle
U Down-time per cycle
X Lifetime of the system; E[X] is the MTBF
n Number of components in the system
T Period of inspection
T∗ Optimal period of inspection
Pr(CMk) Probability of performing corrective maintenance on k components (k = 1, . . . , n) after an inspection
CI Cost of each inspection plus preventive maintenance
CD Cost of an undetected malfunction (per unit of time)
CU Cost per down-time unit
CR Cost of each corrective maintenance of a component
RS(t) Reliability of the system for a mission time t
Rk(t) Reliability of component k for a mission time t
τ1 Cycle of functioning with a revealed failure
τ2 Cycle of functioning with an unrevealed failure
τR Repairing time of a component (MTTR)

should be understood as the imperfect functioning of the system.
The basic considerations of the model are:

• Whenever a failure is revealed, a corrective maintenance is immediately carried out;
• The durations of corrective maintenances are taken into account, but assumed constant;
• Unrevealed failures are detected at inspections only;
• The inspections are periodic, perfect and do not have any effect on the reliability of the system;
• In a given inspection, if it is found that no failure has occurred in the system (since the previous inspection), only a preventive maintenance is performed; if it is found that one or more failures have occurred, then an identical number of corrective actions must be performed as well as the regular preventive maintenance. Both types of maintenance, corrective and preventive, restore the system (i.e., each of its components) to the condition of ''as good as new'';
• Durations of inspections and preventive maintenances are neglected (null values).

The structure of the series system determines that the failure of a component implies the failure of the system. If the failure of a component is revealed at the instant of its occurrence, it will be immediately repaired. In this case, it is considered that only one repair is conducted; unrevealed failures that may have occurred before will be repaired at the next scheduled inspection.
On the other hand, if a given failure (in a given component) is not detected immediately at its occurrence but only at the following inspection, failures in other components that may effectively occur in the meantime are considered as being of the unrevealed type, therefore supposing that the system continues to work uninterruptedly (but imperfectly) until the inspection. Finally, only one repair facility is admitted, so whenever two or more failures are detected at a given inspection, the down-time, U, will be the sum of the repairing times of the damaged components.
Considering then a maintenance policy based on periodic inspections and maintenances that are scheduled for the instants iT (i = 1, 2, . . .), it is intended to determine the optimum time interval or period between two consecutive inspections, T∗, that minimizes the average total cost of maintenance per unit of time, O[T]. This cost can be expressed as:

O[T] = E[C(τ)] / E[τ]   (1)

In the previous expression, τ represents the functioning cycle, which is defined as the time interval between two consecutive renewals of the system. The length of the cycle depends on the type of the failures occurred. The occurrence of a revealed failure determines the end of a cycle of functioning and the commencement of a new one (after repairing). In this case, the ending cycle, τ1, is estimated from the lifetime of the system and the down-time associated with the repairing of the failure. Thus, the average cycle of
functioning is given by:

E[τ1] = E[X] + τR   (2)

On the other hand, when an unrevealed failure occurs, it will be detected and repaired at the first inspection after its occurrence. In this case, supposing that no revealed failure occurs in the meantime, the cycle, τ2, extends to a larger time interval that comprises the expected lifetime of the component, the remaining time until the next inspection and the sum of all the repairing times needed to be covered by the inspection. That is, the average cycle of functioning becomes:

E[τ2] = E[X] + E[D] + E[U]   (3)

Since revealed (unrevealed) failures occur with probability p (1 − p), the average time of a functioning cycle is then given by:

E[τ] = E[τ1] × p + E[τ2] × (1 − p)   (4)

2.1 Average total cost of maintenance per functioning cycle

Concerning failures, it is assumed that two possible scenarios can occur in the system: (1) the failure is immediately detected at its occurrence (revealed failure), which implies the execution of an immediate corrective maintenance, and (2) the failure occurs but is only detected at the first inspection after the occurrence (unrevealed failure). In the first case, the maintenance costs per functioning cycle must include the cost of the expected number of inspections until the failure occurrence plus the cost of their implicit preventive maintenances, the down-time cost associated with the (unproductive) time that is necessary to complete the maintenance, and the cost of the corrective maintenance itself that is necessary to repair the damaged component. In this scenario, the expected cost per functioning cycle is then given by:

E[C(τ1)] = CI E[I1] + CU τR + CR   (5)

In scenario 2, the global cost must incorporate an additional parcel representing the cost associated with running the system under bad functioning conditions due to one or more unrevealed failures. In this case, the expected cost per cycle is given by:

E[C(τ2)] = CI E[I2] + CD E[D] + CU E[U] + CR E[CM]   (6)

where the average time of not detecting any failure (implying bad functioning) is given by:

E[D] = T · E[I2] − E[X]   (7)

The average time of unavailability, E[U], is:

E[U] = τR E[CM]   (8)

where E[CM] represents the average number of corrective maintenances per functioning cycle.

2.2 Probabilities of inspections (and preventive maintenances) and corrective maintenances per functioning cycle

We consider that down-times are considerably smaller than the interval of time between two inspections. In scenario 1, where an occurred failure is immediately detected and repaired, if the lifetime of the system is less than the interval of time until the first inspection, T, then no inspection per cycle is performed. If the lifetime extends beyond the instant of the first inspection, T, but not beyond the instant of the second inspection, 2T, then one inspection per cycle is counted. In general, if the lifetime of the system lies between the ith and the (i + 1)th inspection, then i inspections per cycle must be counted. Thus, the expected number of inspections per cycle, E[I1], for i = 1, 2, . . ., comes as:

E[I1] = 0 Pr(0 ≤ X < T) + 1 Pr(T ≤ X < 2T) + · · · + i Pr[iT ≤ X < (i + 1)T] + · · ·   (9)

That is,

E[I1] = Σ_{i=1}^{+∞} RS(iT)   (10)

The same reasoning applies to scenario 2 of unrevealed failures, implying that the average number of inspections per cycle, E[I2], with i = 1, 2, . . ., is given by:

E[I2] = 1 Pr(0 < X ≤ T) + 2 Pr(T < X ≤ 2T) + · · · + i Pr[(i − 1)T < X ≤ iT] + · · ·   (11)

That is,

E[I2] = Σ_{i=1}^{+∞} RS((i − 1)T)   (12)

The average number of corrective maintenances per functioning cycle, E[CM], is given by:

E[CM] = Σ_{k=1}^{n} k Pr(CMk)   (13)

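As a numerical sanity check on equations (9)–(12), the defining sums can be compared with the survival-function forms. The exponential lifetime and parameter values below are illustrative assumptions, not taken from the paper; this is a minimal Python sketch:

```python
import math

# Check of equations (9)-(12): for a system lifetime X with survival
# function R_S, E[I1] = sum_{i>=1} R_S(iT) (revealed-failure cycles) and
# E[I2] = sum_{i>=1} R_S((i-1)T) (unrevealed-failure cycles).
# Illustrative assumption: exponential lifetime with rate lam.
lam, T, N = 0.7, 0.5, 2000          # N truncates the infinite sums
R = lambda t: math.exp(-lam * t)    # survival function R_S(t)

# Defining sums (9) and (11): sum_i i * Pr(lifetime falls in the i-th bin)
E_I1_def = sum(i * (R(i * T) - R((i + 1) * T)) for i in range(N))
E_I2_def = sum(i * (R((i - 1) * T) - R(i * T)) for i in range(1, N))

# Closed forms (10) and (12)
E_I1 = sum(R(i * T) for i in range(1, N))
E_I2 = sum(R((i - 1) * T) for i in range(1, N))

assert abs(E_I1 - E_I1_def) < 1e-9
assert abs(E_I2 - E_I2_def) < 1e-9
assert abs(E_I2 - (E_I1 + 1)) < 1e-9   # E[I2] = E[I1] + R_S(0) = E[I1] + 1
```

The last assertion reflects that every unrevealed-failure cycle ends with the inspection that detects the failure, so E[I2] exceeds E[I1] by exactly RS(0) = 1.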
where Pr(CMk) represents the probability of occurring k corrective maintenances in a cycle, and is given by:

Pr(CMk) = Σ_{i1=1}^{n} Σ_{i2=i1+1}^{n} . . . Σ_{ik=ik−1+1}^{n} Π_{j∈{i1,i2,. . .,ik}} [1 − Rj(T)] Π_{t∉{i1,i2,. . .,ik}} Rt(T)   (14)

The average number of corrective maintenances per functioning cycle can also be written as:

E[CM] = n − Σ_{k=1}^{n} Rk(T)   (15)

2.3 Average total maintenance cost per unit of time

From the equations above, the average total cost per functioning cycle can be formulated by the following condensed equation:

E[C(τ)] = [CI Σ_{i=1}^{+∞} RS(iT) + CU τR + CR] × p + [(CI + CD T) Σ_{i=0}^{+∞} RS(iT) + (CU τR + CR)(n − Σ_{k=1}^{n} Rk(T)) − CD E[X]] × (1 − p)   (16)

Now, applying equation (1) and taking into account the relationships expressed in (2), (3), (4) and (16), the average total cost per unit of time comes as:

O[T] = { [CI Σ_{i=1}^{+∞} RS(iT) + CU τR + CR] p + [(CI + CD T) Σ_{i=0}^{+∞} RS(iT) + (CU τR + CR)(n − Σ_{k=1}^{n} Rk(T)) − CD E[X]] (1 − p) } / { p (E[X] + τR) + (1 − p) [T Σ_{i=0}^{+∞} RS(iT) + τR (n − Σ_{k=1}^{n} Rk(T))] }   (17)

It can be expressed as:

O[T] = CD + a(T) / b(T)   (18)

where

a(T) = CI Σ_{i=0}^{+∞} RS(iT) + p (CU τR + CR − CI − CD τR) + (1 − p)(CU τR + CR − CD τR)(n − Σ_{k=1}^{n} Rk(T)) − CD E[X]   (19)

and

b(T) = p (E[X] + τR) + (1 − p) [T Σ_{i=0}^{+∞} RS(iT) + τR (n − Σ_{k=1}^{n} Rk(T))]   (20)

Our aim is to determine the optimal time interval between inspections, T∗, that minimizes the average total maintenance cost per unit of time, O[T]. As we will show later, no such finite optimum exists under certain conditions. Basically, the existence of an absolute minimum for the function O[T] depends on the relative amplitude among the various costs integrated into the model. And, independently of such relationship among costs, it can be shown that O[T] → CD whenever T → +∞.
In the remaining part of this sub-section, we prove the sufficiency of a relationship among costs for the existence of an optimal period T∗. Badía et al. (2002) verified the validity of this relationship for the case of a single-component system following a maintenance policy with imperfect inspections.
The function a(T), as defined by equation (19), is a continuous function in R+, where:

lim_{T→0} a(T) > 0
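The reduction from equation (14) to equation (15) can be verified by direct subset enumeration for a small n. The component reliabilities below are illustrative assumptions; a minimal Python sketch:

```python
from itertools import combinations

# Equation (14): Pr(CM_k) sums, over all k-subsets of components, the
# probability that exactly those components have failed by time T.
# Equation (15): the resulting mean is E[CM] = n - sum_k R_k(T).
R = [0.9, 0.8, 0.95, 0.7, 0.85]   # assumed component reliabilities R_k(T)
n = len(R)

def pr_CM(k):
    total = 0.0
    for failed in combinations(range(n), k):
        p = 1.0
        for j in range(n):
            p *= (1 - R[j]) if j in failed else R[j]
        total += p
    return total

E_CM = sum(k * pr_CM(k) for k in range(n + 1))
assert abs(sum(pr_CM(k) for k in range(n + 1)) - 1.0) < 1e-12  # a distribution
assert abs(E_CM - (n - sum(R))) < 1e-12                        # equation (15)
```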
and

lim_{T→+∞} a(T) = CI + p (CU τR + CR − CI − CD τR) + (1 − p) n (CU τR + CR − CD τR) − CD E[X]

The expression defined by equation (20) shows that b(T) > 0, ∀T ∈ R+. Taking into account equation (19), if lim_{T→+∞} a(T) < 0, that is, if

E[X] > [CI + p (CU τR + CR − CI − CD τR) + (1 − p) n (CU τR + CR − CD τR)] / CD   (21)

then there exists T0 ∈ R+ such that a(T0) < 0. Applying equation (18), we have that:

O[T0] = CD + a(T0) / b(T0) < CD = lim_{T→+∞} O[T]

This result proves that the condition defined in (21) is sufficient for the existence of an absolute minimum for the function O[T].


3 NUMERICAL EXAMPLES

Table 1 analyzes three numerical cases differentiated by the probability (or percentage) of occurring unrevealed failures: p = 0.1, p = 0.5 and p = 0.9. In all cases, we consider the existence of 10 components in series, whose lifetimes are represented by independent Weibull distributions.
The reliability function of component k is then given by:

Rk(T) = e^{−(T/θ)^β}, θ > 0, β > 0, T ≥ 0

The reliability function of the system with 10 components in series is:

RS(T) = e^{−10 (T/θ)^β}, θ > 0, β > 0, T ≥ 0

The expected lifetime of the system is:

E[X] = (θ / 10^{1/β}) Γ(1 + 1/β), θ > 0, β > 0

Assuming θ = 1, the values of the optimal inspection periods and corresponding minimal costs are calculated for each case. The unit costs considered in the examples were CI = 10, CD = 100, CR = 25 and

Table 1. Optimum inspection time and optimum cost when the time to failure is a Weibull distribution.

p = 0.1 p = 0.5 p = 0.9

β E [X ] T∗ O [T ∗ ] T∗ O [T ∗ ] T∗ O [T ∗ ]

0.2 0.0012 ∞ 100 ∞ 100 ∞ 100


0.5 0.02 ∞ 100 ∞ 100 ∞ 100
0.8 0.0637134 ∞ 100 ∞ 100 ∞ 100
1 0.1 ∞ 100 ∞ 100 ∞ 100
1.1 0.118959 ∞ 100 ∞ 100 ∞ 100
1.2 0.138069 ∞ 100 ∞ 100 ∞ 100
1.3 0.157124 ∞ 100 ∞ 100 ∞ 100
1.4 0.175968 ∞ 100 ∞ 100 ∞ 100
1.5 0.194491 ∞ 100 ∞ 100 ∞ 100
2 0.28025 ∞ 100 ∞ 100 ∞ 100
2.1 0.295865 0.196586 97.7365 ∞ 100 0.414277 98.0627
2.2 0.31096 0.205087 93.1971 0.250725 98.4482 0.428473 93.6608
2.3 0.325544 0.213363 89.2353 0.259736 94.3313 0.442472 89.7803
2.4 0.339628 0.221413 85.7534 0.268439 90.6934 0.456246 86.3365
2.5 0.353226 0.229242 82.6741 0.276828 87.4593 0.469762 83.2616
3 0.414484 0.265415 71.4919 0.313843 75.5591 0.532297 71.8094
4 0.509708 0.327678 59.7013 0.362972 62.7978 0.628862 59.224
5 0.579325 0.369116 53.2903 0.390392 56.017 0.695681 52.4872
6 0.632048 0.394981 49.0747 0.408787 51.6696 0.743316 48.299
10 0.755685 0.444197 39.9996 0.449002 42.7585 0.843921 40.5767
20 0.867637 0.477394 31.8144 0.478998 35.5176 0.921921 35.3
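The entries of Table 1 can be approximated by evaluating O[T] from equation (17) on a grid and truncating the infinite sums. The sketch below uses Python rather than the Mathematica used by the authors; the cost values are those stated in the text (CI = 10, CD = 100, CR = 25, CU = 50, τR = 0.005) with θ = 1, and the grid and truncation choices are assumptions. The case β = 2.5, p = 0.1 is shown:

```python
import math

n, beta, p = 10, 2.5, 0.1
CI, CD, CR, CU, tauR = 10.0, 100.0, 25.0, 50.0, 0.005

Rk = lambda t: math.exp(-t ** beta)        # component reliability (theta = 1)
RS = lambda t: math.exp(-n * t ** beta)    # series-system reliability
EX = 10 ** (-1 / beta) * math.gamma(1 + 1 / beta)  # E[X] of the system

def O(T):
    # S1 = sum_{i>=1} R_S(iT), truncated once the terms become negligible
    S1, i = 0.0, 1
    while True:
        term = RS(i * T)
        S1 += term
        if term < 1e-15:
            break
        i += 1
    S0 = 1.0 + S1                          # the same sum starting at i = 0
    ecm = n - n * Rk(T)                    # E[CM] = n - sum_k R_k(T)
    num = p * (CI * S1 + CU * tauR + CR) \
        + (1 - p) * ((CI + CD * T) * S0 + (CU * tauR + CR) * ecm - CD * EX)
    den = p * (EX + tauR) + (1 - p) * (T * S0 + tauR * ecm)
    return num / den

T_star = min((0.01 + 0.001 * k for k in range(2000)), key=O)
print(T_star, O(T_star))  # should land close to T* = 0.2292, O[T*] = 82.67
```

For β ≤ 2 with these costs, the same grid search shows O[T] decreasing towards CD = 100 without an interior minimum, matching the "∞" rows of Table 1.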
CU = 50. It is also considered that τR = 0.005. The results were obtained by using Mathematica.
Table 1 shows that an optimal value for the inspection time T∗ does not always exist. Whenever it does not exist, that is, when T → +∞, it is verified that O[T] → CD. This means that, under such circumstances, it is not economically worthwhile to perform inspections of the system, and the cost of maintenance of the system per unit of time corresponds solely to the cost of the non-detection of damage per unit of time. In the previous examples this value has been fixed at CD = 100.
For the costs under consideration, whenever O[T] admits a minimum, it can be observed that the bigger the parameter β, the bigger the value of the optimum time (T∗) and the smaller the maintenance cost. Figure 1 illustrates this effect for p = 0.1. Also, it becomes clear that as the fraction of revealed failures decreases, the system must be inspected with an increasing frequency. In turn, the optimal maintenance cost per unit of time seems to be maximal when the fractions of revealed and unrevealed failures are equal (p = 0.5); this feature is illustrated by Figure 2 for the case of β = 2.5.

Figure 1. Optimum inspection time and optimum cost when the time to failure is a Weibull distribution.

Figure 2. Optimum inspection time and optimum cost when the time to failure is a Weibull distribution with β = 2.5.

The condition defined in equation (21) is indeed a sufficient condition for the existence of a (finite) optimal value T∗. However, it is clearly not necessary, as can be demonstrated by the values of Table 1.


4 CONCLUSIONS

This paper developed a model for a maintenance policy supported by periodic inspections, suitable for application to technological series systems of n independent components. The model tolerates the occurrence of both revealed and unrevealed failures, and the inspections are considered perfect and instantaneous. The repairing times are taken as constant values.
A relationship among the involved costs of maintenance, sufficient for the attainment of an optimal inspection period, was developed and analyzed. We are currently studying the possibility of developing a necessary and sufficient condition for the existence of optimal inspection periods. We are also extending the proposed model to integrate other types of uncertainty inherent to real systems by making use of fuzzy set theory. Moreover, the work presented in this paper can be extended in several directions that would enhance its applicability, such as k-out-of-n and parallel systems.


REFERENCES

Badía, F.G., Berrade, M.D. & Campos, C.A. 2002. Optimal inspection and preventive maintenance of units with revealed and unrevealed failures. Reliability Engineering & System Safety 78: 157–163.
Barros, A., Berenguer, C. & Grall, A. 2006. A maintenance policy for two-unit parallel systems based on imperfect monitoring information. Reliability Engineering & System Safety 91: 131–136.
Bris, R., Châtelet, E. & Yalaoui, F. 2003. New method to minimize the preventive maintenance cost of series-parallel systems. Reliability Engineering & System Safety 82: 247–255.
Chiang, J.H. & Yuan, J. 2001. Optimal maintenance policy for a Markovian system under periodic inspection. Reliability Engineering & System Safety 71: 165–172.
Kallen, M.J. & van Noortwijk, J.M. 2006. Optimal periodic inspection of a deterioration process with sequential condition states. International Journal of Pressure Vessels and Piping 83: 249–255.
Wang, G.J. & Zhang, Y.L. 2006. Optimal periodic preventive repair and replacement policy assuming geometric process repair. IEEE Transactions on Reliability 55(1): 118–122.
Zequeira, R.I. & Bérenguer, C. 2005. On the inspection policy of a two-component parallel system with failure interaction. Reliability Engineering & System Safety 71: 165–172.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Optimal periodic inspection/replacement policy for deteriorating systems with explanatory variables

Xuejing Zhao
Université de Technologie de Troyes, Troyes Cedex, France
Lanzhou University, Lanzhou, P. R. China

Mitra Fouladirad & Christophe Bérenguer
Université de Technologie de Troyes, Troyes Cedex, France

Laurent Bordes
Laboratoire de Mathématiques Appliquées, Université de Pau et des Pays de l'Adour, Pau Cedex, France

ABSTRACT: This paper discusses the problem of the optimization of a condition-based maintenance policy for a stochastically deteriorating system in the presence of covariates. The deterioration is modelled by a non-monotone stochastic process. The process of covariates is assumed to be a finite-state Markov chain. A model similar to the proportional hazards model is used to represent the influence of the covariates. In the framework of a non-monotone system, we derive the optimal maintenance threshold, the optimal inspection period and the optimal delay ratio that minimize the expected average maintenance cost. A comparison of the expected average costs under different conditions of covariates and different maintenance policies is given by numerical results of Monte Carlo simulation.

Keywords: condition-based maintenance, covariates, Markov chain, proportional hazards model, non-monotone system, maintenance, expected average cost

1 INTRODUCTION

Optimal replacement problems for deteriorating systems have been intensively studied in the past decades (see Wang (2002), Bérenguer et al. (2003)). Most of the attention has been focused on static environments and increasing degradation systems (Lawless and Crowder (2004), Kharoufeh and Cox (2005), van Noortwijk (2008), Dieulle et al. (2006)). Recently, more interest and attention have been given to two approaches. The first approach deals with degradation models including explanatory variables (covariates); these variables describe the dynamical environment in the experiments of life science and engineering, and they are often expressed through the proportional hazards model (Singpurwalla (1995), Newby (1994)). Bagdonavičius and Nikulin (2000) gave a method to model increasing degradation by a gamma process which includes time-dependent covariates. Makis and Jardine (1992) considered an optimal replacement problem for a system with stochastic deterioration which depends on its age and also on the value of covariates. The second approach consists in considering a non-monotone deteriorating system with an increasing tendency. The authors in Newby and Barker (2006) (also in Barker and Newby (2008)) studied the optimal inspection and maintenance policy for a non-monotone system, where the decisions are based on a non-monotone scale variable; they used the last exit time from a critical set, instead of the first hitting time, to determine the optimal policy.
Generally, the cost-based optimal policy depends on the inspection interval as well as on the preventive maintenance level. Some of the research focuses on the periodic inspection scheme: e.g., Kong and Park (1997) and Park (1988) used numerical examples in which the cost is considered as a function of the preventive maintenance level while the inspection interval is given. Newby and Dagg (2003) considered the cost as a function of the inspection interval while the preventive maintenance level is fixed. Jia and Christer (2002) considered the maintenance cost as a combination of the preventive maintenance level and the inspection interval.
It is also possible to consider a non-periodic inspection scheme. In Grall et al. (2002) the authors considered 'sequential' inspection/replacement policies where the inspection intervals depend on the information on the system deterioration (a function of the deterioration state). Failure is detected only by inspection, and a cost of 'inactivity of the system' per unit time is incurred as soon as failure occurs. In order to deal with aperiodic inspections, they used the long-term expected average cost per unit time, which is based on the semi-regenerative properties of the maintained system condition with respect to the steady-state stationary probability distribution of the maintained system state. Grall et al. (2002) considered a maintenance policy using a multi-level control-limit rule, where failures are detected immediately.
In this paper we study the optimal policy of periodic inspection/replacement for a non-monotone deteriorating system with covariates Zt. The stochastic deteriorating process Dt = D(t|Ft) represents the degradation level of the system given the history of the covariates {Zt}: Ft = σ{Zs : s ≤ t}. Dt is modelled by the difference of two conditionally independent stochastic processes (conditionally on the covariates Zt). We suppose that the covariates Zt ∈ {1, . . ., k} form a Markov chain with finite state space which describes a dynamical environment. Following an approach similar to the proportional hazards model proposed by Cox (see Cox (1972), also Gouno et al. (2004)), the covariates effect is modelled by a multiplicative exponential function.
A method similar to the one proposed by Barker and Newby (2008) is used to derive the optimal maintenance policy. Suppose that the system is inspected perfectly at the periodic times Π = {τ, 2τ, . . .}, that the system states are only known at inspection times, and that maintenance actions are instantaneous. We shall denote by Dkτ the process D(t) at t = kτ. We define a failure threshold L and a maintenance threshold Lp (Lp < L). Suppose that t = (m + 1)τ is the first inspection time where D(m+1)τ ≥ Lp; the system is maintained at the time t = (m + R + 1)τ if (1) Dmτ < Lp ≤ D(m+r+1)τ < L for r = 0, 1, 2, . . ., R, where R, defined as the delay ratio, is a decision variable to be determined, and (2) Dt < L for t ∈ [(m + 1)τ, . . . , (m + R + 1)τ). The system is considered to be failed at time GL = inf{t ∈ R+ : Dt ≥ L} and to be replaced at the first inspection time after its failure. The purpose is to propose an optimal maintenance policy for the considered system in order to minimize the global long-run expected average maintenance cost per time unit. In the framework of the non-monotone system presented previously, we derive the optimal preventive maintenance threshold Lp, the optimal inspection interval τ and the optimal delay ratio R which lead to a minimal expected average cost.
The other particularity of this paper is that we compare three different cases for the global expected average maintenance cost: (1) the optimization when the covariates are defined by a Markov chain; (2) the optimization when the covariates Zn = i (i = 1, 2, 3) are fixed; and (3) the weighted mean of the optimal results for each Z = i (i = 1, 2, 3), weighted by the stationary probabilities of the Markov chain. All results are illustrated by a Monte Carlo study.
The structure of the paper is as follows. We model the degradation processes by a stochastic univariate process in Section 2, where the influence of the covariates is modelled by a multiplicative exponential function. In Section 3 we study the maintenance optimization problem when there are two thresholds, a corrective replacement threshold and a preventive replacement threshold. For different maintenance cost units, we find the optimal preventive threshold, the optimal inspection period and the optimal delay ratio. Finally, we compare the expected average maintenance cost per unit time for the three cases mentioned above.


2 STOCHASTIC DETERIORATION PROCESS

In this section, we consider a single-unit replaceable system in which an item is replaced with a new one, either at failure or at a planned replacement. The degradation of the system is represented by a continuous-state univariate stochastic process D(t) with initial degradation level D(0) = 0. In this paper, without loss of generality, we suppose that the deteriorating system has an upward degradation trend, though it is not necessarily monotonically increasing.

2.1 Deterioration model without covariates

Suppose that the system is subject to stochastic deterioration. To describe the non-monotonicity of the system, we suppose that the variation of the degradation at t, ΔD(t), is represented by a stochastic process A(t) = X+(t) − X−(t), the difference of two independent stochastic processes, where X+(t) and X−(t) denote respectively the degradation and the improvement of the system. Suppose that the system can be observed at each time unit tk (k = 1, 2, . . .), so only the discrete stochastic processes D(tk), X+(tk) and X−(tk) can be observed, denoted respectively by Dk, X+_k and X−_k. The process Dn (n ≥ 1) is defined as:

Dn = max(Dn−1 + X+_{n−1} − X−_{n−1}, 0),   (1)
where X+_n, X−_n are independent random variables, exponentially distributed with means μ+_n and μ−_n respectively (without loss of generality, we assume that μ+_n ≥ μ−_n).
The distribution function and the density function of the variable X+_n − X−_n are given by

Fn(x) = [μ−_n / (μ+_n + μ−_n)] exp(x/μ−_n) 1(x≤0) + [1 − (μ+_n / (μ+_n + μ−_n)) exp(−x/μ+_n)] 1(x≥0),

fn(x) = [1 / (μ+_n + μ−_n)] [exp(x/μ−_n) 1(x≤0) + exp(−x/μ+_n) 1(x≥0)].

Conditionally on Dk = dk (k = 1, 2, . . .), for x > 0, the r.v. Dk+1 has the distribution

P(Dk+1 ≤ x|Dk = dk) = P(Ak + Dk ≤ x|Dk = dk) = P(Ak ≤ x − dk|Dk = dk) = P(Ak ≤ x − dk),   (2)

and for x = 0,

P(Dk+1 = 0|Dk = dk) = P(Ak + Dk ≤ 0|Dk = dk) = P(Ak ≤ −dk|Dk = dk) = P(Ak ≤ −dk) = [μ−_k / (μ+_k + μ−_k)] exp(−dk/μ−_k).   (3)

So conditionally on Dn = dn, the r.v. Dn+1 has a mixture distribution, with density fn(x − dn) on (0, +∞) and a mass [μ−_n / (μ+_n + μ−_n)] exp(−dn/μ−_n) at x = 0.

2.2 Modelling the influence of covariates on degradation

We are interested in the influence of covariates on degradation. The covariate process Z = {Z(t), t ≥ 0} is assumed to be a finite-state Markov chain with states {1, 2, . . ., K} which describe the states of the environment, such as normal, warning, dangerous, etc. The covariates are available only at the time points tk (k = 0, 1, 2, . . .), when we observe the degradation process. Let

pij(k) = P(Zk+1 = j|Zk = i)   (4)

be the transition probabilities of the process Z. The filtration Ft = σ{Zs : s ≤ t} denotes the history of the covariates.
We assume that the variation of the degradation at time tn only depends on the covariates at time tn. Let the stochastic process D(t|Ft) be the degradation level of the system given Ft. This process is observed at discrete times t = tn (n ∈ N). We shall denote by Dn the observed process at time t = tn, defined as:

Dn = max(Dn−1 + X+_{n−1}(Zn−1) − X−_{n−1}(Zn−1), 0),   (5)

for n ≥ 1, where X+_n(Zn), X−_n(Zn) are conditionally independent random variables (given Zn), with exponential distributions of mean parameters μ+_n(Zn) and μ−_n(Zn) (without loss of generality, we assume that μ+_n(Zn) ≥ μ−_n(Zn)).
The distribution function and the density function of the variation An+1(Zn) = X+_n(Zn) − X−_n(Zn) are given by

Fn(x, Zn) = [μ−_n(Zn) / (μ+_n(Zn) + μ−_n(Zn))] exp(x/μ−_n(Zn)) 1(x<0) + [1 − (μ+_n(Zn) / (μ+_n(Zn) + μ−_n(Zn))) exp(−x/μ+_n(Zn))] 1(x≥0),

fn(x, Zn) = [1 / (μ+_n(Zn) + μ−_n(Zn))] [exp(x/μ−_n(Zn)) 1(x<0) + exp(−x/μ+_n(Zn)) 1(x≥0)].

So the distribution of Dn = Σ_{k=1}^{n} ΔDk can be derived using the method of convolution and the total probability formula.
To describe precisely the influence of the covariates Zn = zn on An, similarly to the proportional hazards model proposed by Cox (1972), we suppose that the parameters μ+_n and μ−_n depend on zn as follows:

μ+_n(Zn) = μ+_n Ψ(Zn) = μ+_n exp(β+_1 1(Zn=1) + · · · + β+_K 1(Zn=K)) = μ+_n exp(β+_{Zn}),   (6)

μ−_n(Zn) = μ−_n Ψ(Zn) = μ−_n exp(β−_1 1(Zn=1) + · · · + β−_K 1(Zn=K)) = μ−_n exp(β−_{Zn}),   (7)
where μ_n^+ (μ_n^−) denotes the degradation rate (improvement rate) of the system when no covariates are considered, and β^+ = (β_1^+, β_2^+, . . . , β_K^+), β^− = (β_1^−, β_2^−, . . . , β_K^−); from (6) and (7), these parameters allow us to account for the influence of the covariates on the degradation rate.

Considering the symmetrical property of the β_i, without loss of generality, in what follows we assume that β_1^+ ≤ β_2^+ ≤ . . . ≤ β_K^+ and β_1^− ≤ β_2^− ≤ . . . ≤ β_K^−.

Example 2.1: An example of the degradation over 100 days (inspected every 5 days) is given in Figure 1, where Z_n is a 3-state Markov chain with transition matrix

        ⎛ 0.95  0.05  0.00 ⎞
    P = ⎜ 0.02  0.95  0.03 ⎟ ,
        ⎝ 0.00  0.05  0.95 ⎠

initial state Z_0 = 1, β^+ = (0.2, 0.5, 1) and β^− = (0.1, 0.1, 0.1), and baseline mean parameters μ_n^+ = 0.5 and μ_n^− = 0.3. Notice that the stationary distribution is Π = (π_1, π_2, π_3) = (0.2, 0.5, 0.3).

For simplification, in what follows we assume that μ_0 does not depend on n, and that Z ∈ {1, 2, 3} is the 3-state Markov chain described in Example 2.1. For covariates with initial state Z_0 = 1, denote by π^n = (π_1^n, π_2^n, π_3^n), with π_i^n = P(Z_n = i | Z_0 = 1) (i = 1, 2, 3), the distribution of Z_n; then

(π_1^n, π_2^n, π_3^n) = (P(Z_n = 1 | Z_0 = 1), P(Z_n = 2 | Z_0 = 1), P(Z_n = 3 | Z_0 = 1))
                      = (1, 0, 0)P^n,

and

    lim_{n→∞} π_i^n = π_i,                                                    (8)

where π_i is the stationary distribution of the Markov chain.

Furthermore, we shall denote by ΔD_n(Z) and ΔD_n(π) respectively the increments of the degradation with a covariate process Z that is a general Markov chain starting at Z_1 = 1, and with a steady-state Markov chain with stationary distribution π.

Let us recall that the covariates form a steady-state Markov chain. Each replacement makes the system restart from its initial state (D_0 = 0), while the covariates Z_n follow their own trajectory. Let us denote by τ_n the instant of replacement. Hence (D_t)_{t≥0} and (D_{t+τ_n})_{t≥0} have the same distribution. So the trajectory of the degradation does not depend on the history before the replacement, and henceforth the deterioration process is a renewal process.

Figure 1. An example of a non-maintained deteriorating system (a) and the covariate process (b).

3 CONDITION-BASED PERIODIC MAINTENANCE MODEL

In this section, we study the optimal periodic maintenance policy for the deteriorating system presented in Section 2. The system is assumed to be a non-monotone stochastically deteriorating system with initial state D_0 = 0, and its state can only be monitored by inspection at the periodic times t_k = kτ (k = 1, 2, . . . ). We now give the assumptions under which the model is studied.

1. Inspections are perfect in the sense that they reveal the true state of the system and the explanatory variables.
2. The system states are only known at inspection times; all maintenance actions take place only at inspection times and are instantaneous.
3. Two maintenance operations are available, only at inspection times: preventive replacement and corrective replacement.
4. The maintenance actions have no influence on the covariate process.

3.1 Maintenance decision

Suppose that the system starts at D_0 = 0 and is inspected perfectly at the periodic times {τ, 2τ, . . . }. The states are only known at inspection times and maintenance actions are instantaneous. We define a failure threshold L and a preventive maintenance threshold L_p (L_p < L).
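A short simulation makes Example 2.1 concrete. This sketch is ours, not the authors' code: it assumes the increment A_n = X_n^+ − X_n^− is the difference of two independent exponentials with means μ_n^+(Z_n) and μ_n^−(Z_n), which matches the stated distribution F_n, and it reflects D_n at 0 as in the mixture (2)–(3). It also checks the limit (8) by raising P to a large power.

```python
import numpy as np

# Parameters of Example 2.1 (the exponential form of X_n^± is an
# assumption of this sketch, consistent with F_n above).
P = np.array([[0.95, 0.05, 0.00],
              [0.02, 0.95, 0.03],
              [0.00, 0.05, 0.95]])
beta_plus = np.array([0.2, 0.5, 1.0])
beta_minus = np.array([0.1, 0.1, 0.1])
mu_plus, mu_minus = 0.5, 0.3
tau, horizon = 5, 100                    # inspection every 5 days, 100 days

rng = np.random.default_rng(0)

def simulate_path():
    """One non-maintained degradation path D_k observed at t_k = k*tau."""
    z, d, path = 0, 0.0, [0.0]           # state index 0 is Z = 1
    for _ in range(horizon // tau):
        mp = mu_plus * np.exp(beta_plus[z])      # mu_n^+(Z_n), eq. (6)
        mm = mu_minus * np.exp(beta_minus[z])    # mu_n^-(Z_n), eq. (7)
        a = rng.exponential(mp) - rng.exponential(mm)
        d = max(d + a, 0.0)              # density above 0, point mass at 0
        path.append(d)
        z = rng.choice(3, p=P[z])        # one covariate Markov step
    return path

# Distribution of Z_n and its limit, eq. (8): pi^n = (1,0,0) P^n -> pi.
pi_n = np.linalg.matrix_power(P, 2000)[0]
print(np.round(pi_n, 3))                 # -> [0.2 0.5 0.3]
```

The last line confirms the stationary distribution Π = (0.2, 0.5, 0.3) used as weights later in eq. (14).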
• The system is considered to be failed, and is correctively replaced, at the first hitting time

    G_L = inf{t ∈ R+ : D_t ≥ L | D_0 = 0}.                                    (9)

• The degradation of the system may recover back to the critical level or below it after exceeding it (see Fig. 1), so the system can still be used for a period of time, especially when the critical level is the preventive threshold. In practice it is useful to consider this information because it can reduce the maintenance cost. The preventive maintenance decision is therefore based on the last hitting time

    H_{Lp} = sup{t ∈ R+ : D_t < L_p | D_0 = 0}.                               (10)

Since the last hitting time is not a stopping time, an alternative way of dealing with this problem has to be found. In Barker and Newby (2008), for a multi-component system described by a Bessel process with drift, the maintenance decision is based on the probability that the system never recovers; in other words, they use the probability that the last up-crossing occurs between the current time and the next scheduled inspection. We deal with the problem as follows: considering the increasing tendency of the system degradation and the probability of recovery, at the time t when the degradation exceeds the preventive threshold we take no maintenance action and continue inspecting for a period of time, say Rτ time units, where R, called the delay ratio, is a decision variable to be determined.

At each inspection time t_k = kτ with D_{kτ} < L_p, one of three exclusive events occurs:

(E1) D_t ≥ L for some t ∈ [kτ, (k + R + 1)τ]. This is equivalent to ∪_{s=1}^{R+1} {(k + s − 1)τ < G_L ≤ (k + s)τ} ≡ ∪_{s=1}^{R+1} E1s, where E1s means that the system fails at a time t ∈ ((k + s − 1)τ, (k + s)τ]. In this case, a corrective replacement takes place at time T = (k + s)τ, and the system returns to the initial state 0. The maintenance cost includes the corrective replacement cost C_F and the cumulated unavailability cost C_d × d, where d = T − t is the cumulated unavailability time.

(E2) L_p < D_{(k+r)τ} < L for r = 1, 2, . . . , R, and D_t < L for t ∈ [kτ, (k + R + 1)τ]. The system crosses the threshold L_p (while remaining below L) and no recovery is observed during the time period Rτ. In this case, the system is replaced preventively at time (k + R + 1)τ with the preventive replacement cost C_p, and the system returns to the initial state 0.

(E3) D_{(k+r)τ} ≤ L_p for some r = 1, 2, . . . , R, and D_t < L for t ∈ [kτ, (k + R + 1)τ]. This means that the system does not cross the preventive threshold L_p, or that it crosses L_p (while remaining below L) but returns to [0, L_p) after an interval of time less than Rτ time units. We define I = max{s ∈ {1, . . . , R + 1} : D_{(k+s)τ} < L_p, D_t < L, t ∈ [kτ, (k + s)τ]}; then event E3 equals ∪_{s=1}^{R} {I = s} ≡ ∪_{s=1}^{R} E3s. In this case, no maintenance action takes place and the decision is postponed to (k + s)τ. The only cost incurred in the interval [kτ, (k + s)τ] is the inspection cost s × C_i.

An example of the maintained system is given in Figure 2, where the preventive threshold is L_p = 20, the corrective threshold is L = 30, the system is inspected every 5 days with delay ratio R = 3, and the other parameters are the same as in Example 2.1.

Figure 2. (a) Maintained deteriorating system with two thresholds: preventive and corrective; (b) the corresponding covariate process. +: inspection; : preventive replacement; •: corrective replacement.

The total maintenance cost in [0, t] is

    C(t) = C_i N_i(t) + C_p N_p(t) + C_F N_F(t) + C_d d(t),

where N_i(t) (resp. N_p(t), N_F(t)) is the number of inspections (resp. of preventive replacements, of corrective replacements) up to time t. The expected average cost is

    EC_∞ = lim_{t→∞} E(C(t)) / t.                                             (11)

When the stochastic process (D, Z) forms a regenerative process, we can calculate the expected cost per time unit as (Rausand and Høyland 2004)

    EC_∞(Z) = E(v(Z)) / E(l(Z)),                                              (12)
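The trichotomy (E1)–(E3) can be read as a small decision procedure on the inspection readings. The sketch below is our own discrete-readings paraphrase, with hypothetical names: it only sees D at inspection times, so the continuous-time conditions on D_t are approximated by the sampled values.

```python
def decide(d, Lp, L, R):
    """Classify the window following an inspection with D < Lp.

    d[s] is the degradation read at the s-th following inspection,
    s = 1, ..., R+1 (d[0] is unused).  Returns the action taken and
    the inspection index at which it is taken, or to which the
    decision is postponed.
    """
    for s in range(1, R + 2):
        if d[s] >= L:                    # E1: failure detected at step s
            return ("corrective", s)
    below = [s for s in range(1, R + 2) if d[s] < Lp]
    if below:                            # E3: recovery below Lp observed
        return ("postpone", max(below))  # I = max{s : D_(k+s)tau < Lp}
    return ("preventive", R + 1)         # E2: stuck in [Lp, L) for R steps

# With Lp = 20, L = 30, R = 3 as in Figure 2 (readings are made up):
print(decide([None, 25, 31, 22, 18], 20, 30, 3))   # -> ('corrective', 2)
print(decide([None, 25, 28, 22, 29], 20, 30, 3))   # -> ('preventive', 4)
print(decide([None, 25, 18, 22, 25], 20, 30, 3))   # -> ('postpone', 2)
```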
where E(v(Z)) and E(l(Z)) are respectively the expected cost and the expected length of a renewal cycle.

Considering the three exclusive events E1 to E3 above, and denoting by V_k (resp. L_k) the total cost (resp. the total length) from inspection time T_k to the time when the system is replaced, since the total cost V_k (the total length L_k) is the combination of the cost (the length) in the time interval [T_k, T_{k+s}) and the total cost (the length) after T_{k+s} (s = 1, 2, . . . , R + 1), we calculate the total maintenance cost and the length of a renewal cycle by the following iterative method:

    V_k = Σ_{s=1}^{R+1} (C_F + C_i s + C_d × ((k + s)τ − G_L)) 1_{(E1s)}
        + (C_p + RC_i) × 1_{(E2)} + Σ_{s=1}^{R+1} (C_i s + V_{k+s}) 1_{(E3s)},

    L_k = (Rτ) 1_{(E2)} + Σ_{s=1}^{R+1} sτ 1_{(E1s)} + Σ_{s=1}^{R+1} (sτ + L_{k+s}) × 1_{(E3s)},

and the expectations are

    E(V_k) = Σ_{s=1}^{R+1} (C_F P(E1s) + E((C_i s + C_d × ((k + s)τ − G_L)) 1_{(E1s)}))
           + (C_p + RC_i) × P(E2) + Σ_{s=1}^{R+1} E((C_i s + V_{k+s}) × 1_{(E3s)}),

    E(L_k) = (Rτ) P(E2) + Σ_{s=1}^{R+1} E(sτ 1_{(E1s)}) + Σ_{s=1}^{R+1} E((sτ + L_{k+s}) × 1_{(E3s)}).

The optimization problem is therefore to choose the values Lp∗, τ∗ and R∗ that minimize the expected long-run average maintenance cost:

    (Lp∗, τ∗, R∗) = arg min_{(Lp, τ, R) ∈ (0,L) × R+ × N} EC_∞(Z).            (13)

3.2 Numerical simulation

In this section we give some numerical simulations of our model; the deteriorating system is the same as in Example 2.1. We consider four different cases of unit maintenance costs:

• Case I (normal cost): C_i = 10, C_p = 60, C_F = 100, C_d = 250;
• Case II (expensive PR): C_i = 10, C_p = 100, C_F = 100, C_d = 250;
• Case III (expensive inspection): C_i = 100, C_p = 60, C_F = 100, C_d = 250;
• Case IV (inexpensive unavailability): C_i = 10, C_p = 60, C_F = 100, C_d = 100.

For each case of maintenance costs, we compare the following three quantities:

• the optimal maintenance cost when Z_n forms a general Markov chain;
• the optimal maintenance cost when Z_n is fixed to Z = i (i = 1, 2, 3) respectively;
• the weighted mean of the optimum costs for Z = i (i = 1, 2, 3), with the steady-state probabilities as weights:

    EC_∞ = Σ_{k=1}^{3} EC_∞(k) π_k.                                           (14)

1. Optimal maintenance with respect to the three parameters: preventive threshold Lp, inspection period τ and delay ratio R

A numerical optimization was used to obtain the optimal values of the decision variables, Lp∗ = 16, τ∗ = 12, R∗ = 0, for a deteriorating system with the system parameters of Example 2.1 when C_i = 10, C_p = 60, C_F = 100, C_d = 250. These optimal values lead to the optimal expected average cost EC_∞ = 2.8650. Figure 3 gives the iso-level curves of EC_∞ as a function of (Lp, τ), with R at its optimal value R∗ = 0.

Figure 3. Iso-level curves of expected average maintenance cost as a function of (Lp, τ).
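The expected average cost (11) can also be estimated by brute-force Monte Carlo. The sketch below is not the authors' numerical scheme (which rests on (12) and the recursions for E(V_k), E(L_k)); it assumes exponential parts for X_n^±, one degradation increment per inspection (so it is tied to τ = 5 as in Example 2.1), and it crudely charges the unavailability of each failure as half an inspection interval.

```python
import numpy as np

def cost_rate(Lp=20.0, L=30.0, R=3, tau=5.0,
              Ci=10.0, Cp=60.0, CF=100.0, Cd=250.0,
              n_steps=100_000, seed=1):
    """Crude Monte-Carlo estimate of EC_infinity, eq. (11).

    Assumptions (ours, for illustration): increments A_n = X+ - X- with
    exponential parts, one increment per inspection period tau, and the
    unavailability time of each failure approximated by tau / 2.
    """
    P = np.array([[0.95, 0.05, 0.00],
                  [0.02, 0.95, 0.03],
                  [0.00, 0.05, 0.95]])
    bp, bm = (0.2, 0.5, 1.0), (0.1, 0.1, 0.1)
    rng = np.random.default_rng(seed)
    z, d, over = 0, 0.0, 0               # covariate state, degradation,
    cost = 0.0                           # consecutive readings above Lp
    for _ in range(n_steps):
        a = rng.exponential(0.5 * np.exp(bp[z])) \
            - rng.exponential(0.3 * np.exp(bm[z]))
        d = max(d + a, 0.0)
        cost += Ci                       # inspection at every t_k
        if d >= L:                       # event E1: corrective replacement
            cost += CF + Cd * tau / 2.0
            d, over = 0.0, 0
        elif d >= Lp:
            over += 1
            if over > R:                 # event E2: preventive replacement
                cost += Cp
                d, over = 0.0, 0
        else:
            over = 0                     # event E3: recovery, no action
        z = rng.choice(3, p=P[z])
    return cost / (n_steps * tau)

print(round(cost_rate(), 2))             # cost per day, Figure 2 policy
```

Because of the downtime approximation, the printed value is only indicative of the order of magnitude of EC_∞ for the Figure 2 policy, not a substitute for the optimization (13).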
Table 1. The optimal preventive threshold, optimal inspection period, optimal delay ratio and expected average maintenance cost, reported as (Lp∗, τ∗, R∗, EC_∞), with periodic inspection; columns correspond to the unit costs (C_i, C_p, C_F, C_d).

Covariates    (10, 60, 100, 250)     (10, 100, 100, 250)

Z general     (16, 12, 0, 2.8650)    (20, 10, 0, 3.9037)
Z = 1         (14, 33, 0, 1.2658)    (16, 27, 0, 1.8097)
Z = 2         (12, 21, 0, 2.1614)    (16, 18, 0, 3.0943)
Z = 3         (14, 9, 0, 4.3149)     (16, 9, 0, 6.1939)
C̄             2.6283                 3.7673

Covariates    (100, 60, 100, 250)    (10, 60, 100, 100)

Z general     (8, 20, 0, 7.6926)     (16, 12, 0, 2.7734)
Z = 1         (2, 60, 0, 2.7868)     (14, 33, 0, 1.2194)
Z = 2         (4, 36, 0, 4.7785)     (6, 36, 0, 2.078)
Z = 3         (2, 18, 0, 9.536)      (14, 12, 0, 4.0442)
C̄             5.8076                 2.4963

Figure 4. The optimal cost as a function of β3 for β1 = 0.2, β2 = 0.5.

Figure 5. The expected average cost as a function of (a) Lp for optimal R, τ; (b) τ for optimal R, Lp; (c) R for optimal Lp, τ.

Table 1 summarizes the results of the optimization for the deteriorating system under the different maintenance costs.

In all cases of unit maintenance costs (expensive or inexpensive), the optimal expected average cost under the condition Z = 1 (β = β_1) is the smallest, because the degradation increments are smaller than in the other cases. The maintenance for Z = 2 (β = β_2) is more costly than that for Z = 1, and the maintenance for Z = 3 (β = β_3) is the most expensive. As a consequence, the parameter β can be used to express the influence of the dynamical environment on the deteriorating system.

In order to reveal how the maintenance cost is influenced by the system parameters β, using the symmetrical property of β, the optimal expected average cost is computed for various values of β_3 with β_1 and β_2 fixed. The results appear in Figure 4: the optimal expected average maintenance cost is an increasing function of the system parameter β_3. In fact, since β expresses the influence of the dynamic environment, the expected average maintenance cost under the worst environment is higher than that under a better environment.
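The C̄ rows of Table 1 follow from eq. (14). Recomputing them from the tabulated per-state optima, with the stationary distribution π = (0.2, 0.5, 0.3) of Example 2.1 as weights, reproduces the table to within the rounding of its entries:

```python
# Recompute the weighted means C-bar of Table 1 via eq. (14).
pi = (0.2, 0.5, 0.3)                    # stationary law of Example 2.1
ec = {                                  # optimal EC for Z fixed to 1, 2, 3
    "Case I":   (1.2658, 2.1614, 4.3149),
    "Case II":  (1.8097, 3.0943, 6.1939),
    "Case III": (2.7868, 4.7785, 9.5360),
    "Case IV":  (1.2194, 2.0780, 4.0442),
}
cbar = {k: sum(p * v for p, v in zip(pi, vals)) for k, vals in ec.items()}
for k, v in cbar.items():
    print(k, round(v, 4))
# Table 1 reports 2.6283, 3.7673, 5.8076, 2.4963 for Cases I-IV;
# small discrepancies come from the rounding of the tabulated optima.
```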
The expected average maintenance cost for the system with a Markov chain is always more than the weighted mean (C̄ in Table 1) of the optimal costs for the three statical cases, since with the Markov chain we have less information on the degradation system. As a consequence, the weighted mean of the optimal costs also gives a lower bound on the maintenance cost for the deteriorating system.

2. Comparison of the optimal expected average maintenance cost for different unit maintenance costs

The influence of each decision variable on the expected average maintenance cost function is given in Figure 5, through the curves of the expected average cost as functions of Lp, τ and R. All results show that, no matter what the choice of Lp, R and τ, the expected average maintenance cost for an expensive inspection (Case III) is always the highest, and that for inexpensive unavailability (Case IV) is always the cheapest.

For fixed τ = 12, R = 0, Figure 5(a) shows that for a relatively small preventive threshold, the preventive replacement cost C_p determines the optimal maintenance cost (with only a weak dependence on the cost of unavailability, as indicated by EC_∞(Case I) ≈ EC_∞(Case IV)); so the suitable maintenance policy is mainly preventive replacement, whereas corrective replacement takes over for a larger preventive threshold (this results in more failures with the cost C_d, as indicated by EC_∞(Case I) ≈ EC_∞(Case III) in Figure 5(a)).

For fixed Lp = 16, R = 0, Figure 5(b) shows that for an inter-inspection time small enough that no maintenance cost is paid for unavailability, the maintenance policy is mainly preventive replacement (EC_∞(Case I) ≈ EC_∞(Case IV)); however, corrective replacement actions take place for a larger inter-inspection time.

The optimal maintenance cost increases as R increases when Lp = 16 and τ = 12 are fixed, as Figure 5(c) shows.

4 CONCLUSION

In this paper we deal with a non-monotone deteriorating system with covariates; we use a method similar to the proportional hazards model to account for the influence of dynamical covariates, defined by a 3-state Markov chain.

The expected average cost is calculated, and optimal periodic inspection/replacement policies are derived for different unit maintenance costs, as functions of the preventive level Lp, the inspection interval τ and the delay ratio R. The results show that:

1. The optimal average cost is an increasing function of the parameters β.
2. The optimal inspection interval includes the information on R, so in order to optimize the average cost we can consider only the parameters (Lp, τ).
3. The expected average maintenance cost for the system with a covariate Markov chain is always greater than the weighted mean of the optimal costs for the three statical cases.

REFERENCES

Bagdonavičius, V. and Nikulin, M. 2000. Estimation in degradation models with explanatory variables. Lifetime Data Analysis 7(1): 85–103.
Barker, C.T. and Newby, M. 2008. Optimal non-periodic inspection for a multivariate degradation model. Reliability Engineering and System Safety (in press).
Bérenguer, C., Grall, A., Dieulle, L. and Roussignol, M. 2003. Maintenance policy for a continuously monitored deteriorating system. Probability in the Engineering and Informational Sciences 17(2): 235–250.
Cox, D.R. 1972. Regression models and life-tables. Journal of the Royal Statistical Society, Series B 34(2): 187–220.
Dieulle, L., Bérenguer, C., Grall, A. and Roussignol, M. 2006. Asymptotic failure rate of a continuously monitored system. Reliability Engineering and System Safety 91(2): 126–130.
Gouno, E., Sen, A. and Balakrishnan, N. 2004. Optimal step-stress test under progressive type-I censoring. IEEE Transactions on Reliability 53(3): 388–393.
Grall, A., Bérenguer, C. and Dieulle, L. 2002. A condition-based maintenance policy for stochastically deteriorating systems. Reliability Engineering and System Safety 76(2): 167–180.
Grall, A., Dieulle, L., Bérenguer, C. and Roussignol, M. 2002. Continuous-time predictive-maintenance scheduling for a deteriorating system. IEEE Transactions on Reliability 51(2): 141–150.
Jia, X. and Christer, A.H. 2002. A prototype cost model of functional check decisions in reliability-centred maintenance. Journal of the Operational Research Society 53(12): 1380–1384.
Kharoufeh, J.P. and Cox, S.M. 2005. Stochastic models for degradation-based reliability. IIE Transactions 37(6): 533–542.
Kong, M.B. and Park, K.S. 1997. Optimal replacement of an item subject to cumulative damage under periodic inspections. Microelectronics Reliability 37(3): 467–472.
Lawless, J. and Crowder, M. 2004. Covariates and random effects in a gamma process model with application to degradation and failure. Lifetime Data Analysis 10(3): 213–227.
Makis, V. and Jardine, A. 1992. Optimal replacement in the proportional hazards model. INFOR 30: 172–183.
Newby, M. 1994. Perspective on Weibull proportional hazards models. IEEE Transactions on Reliability 43(2): 217–223.
Newby, M. and Dagg, R. 2003. Optimal inspection and maintenance for stochastically deteriorating systems II: discounted cost criterion. Journal of the Indian Statistical Association 41(1): 9–27.
Newby, M.J. and Barker, C.T. 2006. A bivariate process model for maintenance and inspection planning. International Journal of Pressure Vessels and Piping 83(4): 270–275.
Park, K.S. 1988. Optimal continuous-wear limit replacement under periodic inspections. IEEE Transactions on Reliability 37(1): 97–102.
Rausand, M. and Høyland, A. 2004. System Reliability Theory: Models, Statistical Methods, and Applications (2nd ed.). New Jersey: John Wiley & Sons.
Singpurwalla, N.D. 1995. Survival in dynamic environments. Statistical Science 1(10): 86–103.
van Noortwijk, J.M. 2008. A survey of the application of gamma processes in maintenance. Reliability Engineering and System Safety (in press).
Wang, H. 2002. A survey of maintenance policies of deteriorating systems. European Journal of Operational Research 139(3): 469–489.

Optimal replacement policy for components with general failure rates submitted to obsolescence

Sophie Mercier
Université Paris-Est, Laboratoire d'Analyse et de Mathématiques Appliquées (CNRS UMR 5050), Champs-sur-Marne, Marne-la-Vallée, France

ABSTRACT: Identical components are considered, which become obsolete once new-type ones, more reliable and less energy consuming, are available. We envision different possible strategies for replacing the old-type components by new-type ones: purely preventive, purely corrective, and different mixtures of both types of strategies. To evaluate the respective value of each possible strategy, a cost function is considered which takes into account replacement costs, with economic dependence between simultaneous replacements, and energy consumption (and/or production) costs, with a constant rate per unit time. A full analytical expression is provided for the cost function induced by each possible replacement strategy. The optimal strategy is derived in the long-time run. Numerical experiments close the paper.

1 INTRODUCTION

Identical and independent components are considered, which may be part of a single industrial installation or dispatched over different locations, indifferently. These components degrade with time and their random life-times follow some common general distribution. At some fixed time, say time 0, new components appear on the market, issued from a new technology which makes them more reliable, less energy consuming and better performing. Such new-type components may be substituted for the older ones with no compatibility problem. There is no stock of old-type components, and after time 0 no old-type component is available any more (or the industrialist is not allowed to use old-type components any more, e.g. for safety reasons). After time 0, any failed component, either old-type or new-type, is instantaneously replaced by a new-type one. At time 0, each old-type component has been in use for some random time, with some random remaining life-time.

If the new-type components are much less energy consuming than the older ones and if the period of interest is very long, it may be expedient to remove all old-type components immediately at time 0 and replace them by new-type ones, leading to a so-called purely preventive replacement strategy. On the contrary, in case there is not much improvement between both technologies and if the period of interest is short, it may be better to wait for the successive failures of the old-type components and replace them by new-type ones only at failure, leading to a purely corrective replacement strategy. More generally, some mixture of both strategies, preventive and corrective, may also be envisioned (details below) and may lead to lower costs, as will be seen later. The point of the present paper is to look for the optimal replacement strategy with respect to a cost function which represents the mean total cost on some finite time interval [0, t]. This function takes into account replacement costs, with economic dependence between simultaneous replacements (Dekker, Wildeman, and van der Duyn Schouten 1997), and also energy consumption (and/or production) costs, with a constant rate per unit time.

A similar model has already been studied in (Elmakis, Levitin, and Lisnianski 2002) and (Mercier and Labeau 2004) in the case of constant failure rates for both old-type and new-type components. In those papers, all costs were additionally discounted at time 0, contrary to the present paper. In such a context, it had been proved in (Mercier and Labeau 2004) that, in the case of constant failure rates, the only possible optimal strategies were either purely corrective or nearly purely preventive (details further on), leading to a simple dichotomous decision rule.

A first attempt to see whether such a dichotomy is still valid in the case of general failure rates was made in (Michel, Labeau, and Mercier 2004) by Monte-Carlo (MC) simulations. However, the length of the MC simulations did not allow a sufficient range of the different parameters to be covered, making the answer difficult. Similarly, recent works such as (Clavareau and Labeau 2006a) and (Clavareau and Labeau 2006b) proposed complex models including the present one, which are
evaluated by MC simulations. Here again, the length of the MC simulations, added to the complexity of the models, does not allow the optimal replacement strategy to be determined from the data of the model.

The point of the present paper hence is to answer the following questions: is the dichotomy proved in the case of constant failure rates still valid in the case of general failure rates? If not (and it will not be), what are the possible optimal strategies? Finally, how can we find the optimal strategy?

This paper is organized as follows: the model is specified in Section 2. Section 3 presents the theoretical results, both for a finite and an infinite time horizon. Numerical experiments are carried out in Section 4. Concluding remarks end the paper in Section 5. This paper presents the results from (Mercier 2008), with different numerical experiments however. Due to the reduced size of the present paper, no proofs are provided here; they may be found in the quoted paper.

2 THE MODEL

We consider n identical and independent components (n ≥ 2), called old-type components in the following. At time 0, these old-type components are up, in activity. For each i = 1, . . . , n, the residual life-time of the i-th component is assumed to be some absolutely continuous random variable (r.v.) U_i, where the U_i's are not necessarily all identically distributed. The i-th (old-type) component is assumed to fail at time U_i. The successive times to failure of the n old-type components are the order statistics of (U_1, . . . , U_n). They are denoted by (U_{1:n}, . . . , U_{n:n}), where U_{1:n} < · · · < U_{n:n} almost everywhere (a.e.).

All preventive and corrective replacements (by new-type components) are instantaneous. The following replacement strategies are envisioned:

• strategy 0: the n old-type components are immediately replaced by n new-type ones at time 0. This is a purely preventive strategy. After time 0, there are exactly n new-type components and no old-type component any more.
• strategy 1: no replacement is performed before the first failure, which occurs at time U_{1:n}. At time U_{1:n}, the failed component is correctively replaced and the n − 1 non-failed old-type components are simultaneously preventively replaced. This hence is a nearly purely preventive strategy. Before time U_{1:n}, there are exactly n old-type components. After time U_{1:n}, there are exactly n new-type components.
• strategy K (1 ≤ K ≤ n): no preventive replacement is performed before the K-th failure, which occurs at time U_{K:n}. This means that only corrective replacements are performed up to time U_{K:n} (at times U_{1:n}, . . . , U_{K−1:n}). At time U_{K:n}, the failed component is correctively replaced and the n − K non-failed old-type components are simultaneously preventively replaced. Before time U_{1:n}, there are exactly n old-type components. After time U_{K:n}, there are exactly n new-type components. For K ≥ 2, between times U_{i:n} and U_{i+1:n} (1 ≤ i ≤ K − 1), there are i new-type components and n − i old-type ones (see Figure 1).
• strategy n: no preventive replacement is performed at all. Before time U_{1:n}, there are exactly n old-type components. Between times U_{i:n} and U_{i+1:n} (1 ≤ i ≤ n − 1), there are i new-type components and n − i old-type ones. After time U_{n:n}, there are exactly n new-type components.

Figure 1. Corrective and preventive replacements. (Timeline 0 → t: failures of old components at U_{1:n}, U_{2:n}, . . . , U_{K:n} trigger corrective replacements by new ones; afterwards, corrective replacements of new components.)

Once a new-type component is put into activity, at time 0 or at some time U_{i:n}, it is instantaneously replaced at failure by another new-type component. The successive life-times of such components are assumed to form a renewal process with possible delay U_{i:n}; the i.i.d. inter-arrival times are distributed as some non-negative r.v. V with P(0 ≤ V < ∞) = 1 and P(V > 0) > 0. The renewal function associated with the non-delayed process is then finite on R+. Let E stand for the expectation with respect to the probability measure P on (Ω, A), and for A ∈ A, let 1_A be the indicator function with 1_A(ω) = 1 if ω ∈ A and 1_A(ω) = 0 if ω ∈ Ω \ A. The renewal function is then denoted by ρ_V, with

    ρ_V(t) = E( Σ_{k∈N∗} 1_{V^(1) + ··· + V^(k) ≤ t} )

for t ≥ 0, where V^(1), . . . , V^(k), . . . are the successive inter-arrival times. We recall that ρ_V(t) corresponds to the mean number of renewals on [0, t] of the non-delayed process.

The envisioned cost function represents the mean total cost on some time interval [0, t]. It is denoted by C_K([0, t]) when strategy K is used. Two types of costs are considered:

• replacement costs, with economic dependence in case of simultaneous replacements: each solicitation of the repair team is assumed to entail a fixed
cost r (r ≥ 0). Each corrective and preventive replacement involves a supplementary cost, respectively c_f and c_p, to be added to r (0 < c_p ≤ c_f). For instance, the cost for the preventive replacement of i units (0 ≤ i ≤ n − 1) which comes along with the corrective replacement of one unit is r + c_f + i c_p.

• energy and/or production costs, with a constant rate per unit time (possibly negative, in case of a production rate higher than the energy cost rate). The rates for an old-type and a new-type unit respectively are η + ν and η, with ν ≥ 0, η ∈ R (the cost rate is higher for an older unit). The ''energy/production'' cost for j new-type units and k old-type units on some time interval [t_1, t_2] is (jη + k(η + ν))(t_2 − t_1), where 0 ≤ t_1 ≤ t_2 and j + k = n.

All components, both new-type and old-type, are assumed to be independent of each other.

Throughout the paper, if X is a non-negative random variable (r.v.), its cumulative distribution function (c.d.f.) is denoted by F_X, its survival function by F̄_X with F̄_X = 1 − F_X, and its probability density function (p.d.f.), when it exists, by f_X. For t ∈ R+, we also set X^t = min(X, t), and x_+ = max(x, 0) for any real x. Finally, we shall use the following notations:

    a = (r + c_f) / c_p ≥ 1,
    b = ν / c_p ≥ 0.

3 THEORETICAL RESULTS

3.1 Cost functions on [0, t]

We first give our results for a finite mission time t.

Theorem 1 Let t ≥ 0. For K = 0, we have:

    C_0([0, t]) = nηt + r + n c_p (1 + a ρ_V(t))

and, for 1 ≤ K ≤ n, we have:

    C_K([0, t]) = Σ_{i=1}^{K} [ (r + c_f)(F_{U_{i:n}}(t) + E(ρ_V((t − U_{i:n})_+))) + ν E(U_{i:n}^t) ]
                + (n − K) [ (r + c_f) E(ρ_V((t − U_{K:n})_+)) + c_p F_{U_{K:n}}(t) + ν E(U_{K:n}^t) ] + nηt.

Setting

    g_K(t) := (C_{K+1}([0, t]) − C_K([0, t])) / c_p                           (1)

for all 0 ≤ K ≤ n − 1, we easily derive the following corollary.

Corollary 2 Let t ≥ 0. For K = 0, we have:

    g_0(t) = (a − 1) F_{U_{1:n}}(t) − r/c_p
           + n [ b E(U_{1:n}^t) − F̄_{U_{1:n}}(t) − a E(ρ_V(t) − ρ_V((t − U_{1:n})_+)) ]

and, for 1 ≤ K ≤ n − 1, we have:

    g_K(t) = (a − 1) F_{U_{K+1:n}}(t)
           + (n − K) × [ b E(U_{K+1:n}^t − U_{K:n}^t) − (F_{U_{K:n}}(t) − F_{U_{K+1:n}}(t))
           − a E(ρ_V((t − U_{K:n})_+) − ρ_V((t − U_{K+1:n})_+)) ].

In order to find the optimal strategy according to the mission time t and to the data of the model, as in the case of constant failure rates (see (Mercier and Labeau 2004)), the point should now be to find out the sign of g_K(t) for 0 ≤ K ≤ n − 1. This actually seems to be impossible in the most general case. However, we are able to give some results in the long-time run, which is done in the next subsection.

3.2 Comparison between strategies 0, 1, . . . , n in the long-time run

We first compute the limit of g_K(t) when t → +∞.

Proposition 3 Assume the distribution of V to be non-arithmetic and E(U_i) < +∞ for all 1 ≤ i ≤ n. Setting g_K(∞) := lim_{t→+∞} g_K(t) for all 0 ≤ K ≤ n − 1, we then have:

    g_K(∞) = a − 1 + (b − a/E(V)) × (n − K) E(U_{K+1:n} − U_{K:n})

for all 1 ≤ K ≤ n − 1, and

    g_0(∞) = c_f/c_p − 1 + (b − a/E(V)) n E(U_{1:n} − U_{0:n}),

where we set U_{0:n} := 0.
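Proposition 3 is easy to exercise numerically: the only model-dependent quantities are the expected spacings E(U_{K+1:n} − U_{K:n}). The sketch below estimates them by Monte Carlo for an arbitrary Weibull choice of lifetimes and an exponential V (our illustration, not from the paper), then recovers the relative long-run costs lim(C_K − C_0) = c_p Σ_{j<K} g_j(∞). With the parameters chosen here, b − a/E(V) = 1 > 0, so every g_K(∞) is nonnegative and strategy 0 comes out optimal.

```python
import numpy as np

# Numerical illustration of Proposition 3 (our own example: Weibull
# lifetimes and an exponential V; neither choice comes from the paper).
rng = np.random.default_rng(3)
n = 5                                   # number of old-type components
r, cf, cp = 1.0, 2.0, 1.0               # set-up, corrective, preventive
nu, EV = 4.0, 1.0                       # extra consumption rate, E(V)
a, b = (r + cf) / cp, nu / cp           # a = 3, b = 4, so b - a/E(V) = 1

# Monte-Carlo estimate of E(U_{K+1:n} - U_{K:n}), K = 0..n-1, U_{0:n} = 0,
# with U_i i.i.d. Weibull(shape 2), an IFR distribution.
U = np.sort(rng.weibull(2.0, size=(200_000, n)), axis=1)
sp = np.diff(np.hstack([np.zeros((U.shape[0], 1)), U]), axis=1).mean(axis=0)

g = np.empty(n)                         # g_K(infinity), K = 0..n-1
g[0] = cf / cp - 1 + (b - a / EV) * n * sp[0]
for K in range(1, n):
    g[K] = a - 1 + (b - a / EV) * (n - K) * sp[K]

# lim (C_K - C_0) = cp * sum_{j<K} g_j(infinity): pick the minimizer.
rel_cost = np.concatenate([[0.0], cp * np.cumsum(g)])
print("optimal strategy:", int(rel_cost.argmin()))  # -> optimal strategy: 0
```

Changing the parameters so that b − a/E(V) < 0 makes the g_K(∞) change sign along K, which is exactly the situation analysed next.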
A first consequence is that, if b − a/E(V) ≥ 0 or alternatively ν ≥ (r + c_f)/E(V), we then have g_K(∞) ≥ 0 for all 0 ≤ K ≤ n − 1 (we recall that a ≥ 1 and c_f ≥ c_p). Consequently, if ν ≥ (r + c_f)/E(V), the best strategy among 0, . . . , n in the long-time run is strategy 0. Such a result conforms to intuition: indeed, let us recall that ν stands for the additional energy consumption rate of the old-type units compared to the new-type ones; also, observe that (r + c_f)/E(V) is the cost rate per unit time for replacements due to failures among new-type components in the long-time run. The result then means that if replacements of new-type components due to failures are less costly per unit time than the benefit due to a lower consumption rate, it is better to replace old-type components by new-type ones as soon as possible.

We now have to look at the case b − a/E(V) < 0 and, for that, we need to know something about the monotonicity of

  D_K := (n − K)(U_{K+1:n} − U_{K:n})

with respect to K, where D_K is the K-th normalized spacing of the order statistics (U_{1:n}, . . . , U_{n:n}); see e.g. (Barlow and Proschan 1966) or (Ebrahimi and Spizzichino 1997). With that aim, we have to put some assumption on the distributions of the residual lifetimes of the old-type components at time t = 0 (U_i for 1 ≤ i ≤ n): following (Barlow and Proschan 1966), we assume that U_1, . . . , U_n are i.i.d. IFR (Increasing Failure Rate), which implies that (D_K)_{0≤K≤n−1} is stochastically decreasing. A first way to meet this assumption is to assume that all old-type components have been put into activity simultaneously (before time 0), so that the residual lifetimes are i.i.d. (moreover assumed IFR). Another possibility is to assume that all units have already been replaced a large number of times. Assuming the replacement times of the i-th unit to form a renewal process with inter-arrival times distributed as some U^(0) (independent of i), the residual life at time 0 of the i-th unit may then be considered as the waiting time until the next arrival of a stationary renewal process with inter-arrivals distributed as U^(0). Such a waiting time is known to admit as p.d.f. the function f_U(t) such that:

  f_U(t) = ( F̄_{U^(0)}(t) / E(U^(0)) ) 1_{R+}(t),    (2)

assuming 0 < E(U^(0)) < +∞. Also, it is proved in (Mercier 2008) that if U^(0) is IFR, then U is IFR too. The r.v. U_1, . . . , U_n then are i.i.d. IFR, consequently meeting the required assumptions from (Barlow and Proschan 1966).

We are now ready to state our main result:

Theorem 4  If b − a/E(V) ≥ 0, the optimal strategy among 0, . . . , n in the long-time run is strategy 0.
In case b − a/E(V) < 0, assume that U_1, . . . , U_n are i.i.d. IFR r.v. (which may be realized by assuming that U_i stands for the waiting time until the next arrival of a stationary renewal process with inter-arrival times distributed as U^(0), where U^(0) is a non-negative IFR r.v. with 0 < E(U^(0)) < +∞). Assume too that the U_i's are not exponentially distributed. The sequence (E(D_K))_{0≤K≤n−1} is then strictly decreasing and, setting

  c := (a − 1) / ( a/E(V) − b )   and   d := ( c_f/c_p − 1 ) / ( a/E(V) − b ) ≤ c,

one of the following cases occurs:
• if c ≤ E(D_{n−1}): the optimal strategy among 0, . . . , n in the long-time run is strategy n;
• if c > E(D_1):
  – if d > E(D_0): the optimal strategy among 0, . . . , n in the long-time run is strategy 0;
  – if d ≤ E(D_0): the optimal strategy among 0, . . . , n in the long-time run is strategy 1;
• if E(D_{K_0}) < c ≤ E(D_{K_0−1}) for some 2 ≤ K_0 ≤ n − 1: the optimal strategy among 0, . . . , n in the long-time run is strategy K_0.

In (Mercier and Labeau 2004), we had proved the following ‘‘dichotomy'' property: in case of constant failure rates, only the purely preventive (0), nearly purely preventive (1) or purely corrective (n) strategies can be optimal for a finite horizon. We now know from the last point of Theorem 4 that such a property is no longer valid in case of general failure rates, at least for an infinite horizon and consequently for large t. We now look at some numerical experiments to check the validity of the dichotomy property in case of small t.

4 NUMERICAL EXPERIMENTS

We here assume that the U_i's are i.i.d. IFR random variables with known distribution. Examples are provided in (Mercier 2008) for the case where the data is the distribution of some U^(0) and the common p.d.f. f_U of the U_i is given by (2) (see Theorem 4). All the computations are made with Matlab.

All U_i's and V_i's are Weibull distributed according to W(α_U, β_U) and W(α_V, β_V), respectively (all independent), with survival functions:

  F̄_U(x) = exp(−α_U x^{β_U})   and   F̄_V(x) = exp(−α_V x^{β_V})

for all x ≥ 0.
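Once the expected normalized spacings E(D_K) are known, the case analysis of Theorem 4 is mechanical. A sketch in Python, where the spacings are estimated by Monte Carlo; the Weibull residual-life distribution and the constants c and d are illustrative, not taken from the paper:

```python
import random

def expected_spacings(sample_U, n, n_sims=4000, seed=7):
    """Monte Carlo estimate of E(D_K) = (n - K) E(U_{K+1:n} - U_{K:n}),
    the expected normalized spacings of the order statistics, K = 0..n-1
    (with the convention U_{0:n} := 0)."""
    rng = random.Random(seed)
    sums = [0.0] * n
    for _ in range(n_sims):
        u = sorted(sample_U(rng) for _ in range(n))
        prev = 0.0
        for k in range(n):
            sums[k] += (n - k) * (u[k] - prev)
            prev = u[k]
    return [s / n_sims for s in sums]

def optimal_long_run_strategy(ED, c, d):
    """Case analysis of Theorem 4 (case b - a/E(V) < 0): returns the
    optimal K among 0..n, given the decreasing sequence ED[K] = E(D_K)."""
    n = len(ED)
    if c <= ED[n - 1]:
        return n
    if c > ED[1]:
        return 0 if d > ED[0] else 1
    for K0 in range(2, n):
        if ED[K0] < c <= ED[K0 - 1]:
            return K0
    raise ValueError("no case matched; check that ED is decreasing and d <= c")

# Illustrative inputs: IFR Weibull residual lives, hypothetical c and d.
n = 10
ED = expected_spacings(lambda rng: rng.weibullvariate(12.0, 2.8), n)
K_opt = optimal_long_run_strategy(ED, c=5.0, d=1.0)
```

Because the Weibull with shape greater than 1 is IFR, the estimated sequence E(D_0), E(D_1), . . . should come out decreasing, as the theorem requires.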
  
We take:

  α_U = 1/10³;   α_V = 1/(2.25 × 10³)    (3)

  β_U = β_V = 2.8 > 1    (4)

(the U_i's are IFR), which leads to

  E(U) ≈ 10.5,  σ(U) ≈ 4.1,  E(V) ≈ 14,  σ(V) ≈ 5.4.

We also take:

  n = 10;  η = 0;  ν = 0.06;  c_p = 1;  c_f = 1.1;  r = 0    (5)

We compute F_{U_{K:n}} using:

  F_{U_{K:n}}(x) = n!/((K − 1)!(n − K)!) ∫_0^{F_U(x)} t^{K−1} (1 − t)^{n−K} dt = I_{F_U(x)}(K, n − K + 1)

for 1 ≤ K ≤ n, where I_x(n_1, n_2) is the incomplete Beta function (implemented in Matlab); see e.g. (Arnold, Balakrishnan, and Nagaraja 1992) for the results about order statistics used in this section.
We also use:

  F̄_{U_{K+1:n}}(t) − F̄_{U_{K:n}}(t) = C(n, K) F_U^K(t) F̄_U^{n−K}(t)

from which we derive E(U_{K+1:n}^t − U_{K:n}^t) due to:

  E(U_{K+1:n}^t − U_{K:n}^t) = ∫_0^t ( F̄_{U_{K+1:n}}(u) − F̄_{U_{K:n}}(u) ) du

for 0 ≤ K ≤ n − 1 (we recall U_{0:n} := 0).

Table 1. Optimal strategy according to t and α_V.

t \ (1/α_V) × 10³   1    1.2   1.5   1.75   2    2.25   2.5   3    3.5   4    5

5                   10   10    10    10     10   10     10    10   10    10   10
10                  10   10    10    10     10   10     10    10   10    10   10
15                  10   10    10    2      2    1      1     1    0     0    0
20                  7    7     6     5      5    4      4     3    2     1    1
25                  9    9     8     8      7    7      6     5    4     3    1
30                  10   10    9     9      8    7      6     4    3     1    0
35                  9    9     7     6      5    4      4     2    1     0    0
40                  10   9     8     7      6    5      4     3    2     1    0
45                  10   9     8     7      6    6      5     3    2     1    0
50                  10   9     8     7      6    5      4     3    2     1    0
75                  10   9     8     7      6    5      4     3    2     1    0
100                 10   9     8     7      6    5      4     3    2     1    0
∞                   10   9     8     7      6    5      4     3    2     1    0

We finally compute E(ρ_V((t − U_{K:n})_+)) with:

  E(ρ_V((t − U_{K:n})_+)) = ∫_0^t ρ_V(t − u) dF_{U_{K:n}}(u)
                          = n C(n − 1, K − 1) ∫_0^t ρ_V(t − u) F_U^{K−1}(u) F̄_U^{n−K}(u) f_U(u) du,

where the renewal function ρ_V is computed via the algorithm from (Mercier 2007).
For a finite horizon, the optimization on K is simply made by computing all C_K([0, t]) for K = 0, . . . , n and taking the smallest. For an infinite horizon, Theorem 4 is used.
The optimal strategy is given in Table 1 for different values of α_V and t, as well as the asymptotic results (all other parameters fixed according to (3)–(5)). We can see in this table that the optimal strategy quickly stabilizes with increasing t. More precisely, the optimal strategy for a finite horizon t is the same as the optimal strategy in the long-time run as soon as t is greater than about 3.5 mean lengths of life of a new-type component. For t about twice the mean life length, the finite-time optimal strategy is already very near the long-time run one. Also, any strategy may be optimal, even for small t.
We now plot in Figure 2 the optimal strategy with respect to t, for α_V fixed according to (3). We can see in this figure that the behavior of Kopt (the optimal K) with increasing t is not regular at all. There is consequently no hope to get any clear characterization of Kopt with respect to the different parameters in finite horizon as we had in the exponential case in
(Mercier and Labeau 2004) and as we have here in infinite horizon (Theorem 4).

We next plot Kopt in Figures 3–6 for t fixed (t = 25) with respect to the parameters c_f, ν, r and c_p (all other parameters fixed according to (3)–(5)), which shows that Kopt may vary a lot when changing one single parameter. Also, one may note that Kopt decreases with c_f (Fig. 3), ν (Fig. 4) and r (Fig. 5). Such observations are coherent with the intuition which says that preventive maintenance should be performed all the earlier (or, equivalently, new-type components should be introduced all the earlier) as failures are more costly, as the difference of costs between both generations of components is higher, or as the economic dependence between replacements is higher. Similarly, Figure 6 shows that Kopt increases with c_p, which means that the higher the cost of a preventive replacement, the later the preventive maintenance must be performed. This is coherent with intuition, too.

Figure 2. Optimal strategy with respect to t.

Figure 3. Optimal strategy with respect to c_f for t = 25.

Figure 4. Optimal strategy with respect to ν for t = 25.

Figure 5. Optimal strategy with respect to r for t = 25.

Figure 6. Optimal strategy with respect to c_p for t = 25.

5 CONCLUSIONS

In conclusion, we have considered here different replacement strategies for obsolescent components by others issued from a newer technology. A cost function on a finite and on an infinite time horizon has been considered, in order to sort the different strategies one
with each other. We have seen that the variation of the optimal strategy with respect to a finite horizon t is much less regular in the present case of general failure rates than in the case of constant failure rates as in (Elmakis, Levitin, and Lisnianski 2002) or (Mercier and Labeau 2004) (see Figure 2). Also, the main result from (Mercier and Labeau 2004), which stated that the optimal strategy could only be strategy 0, 1 or n, namely (nearly) purely preventive or purely corrective, is here false: any strategy among 0, 1, . . . , n may be optimal.
It does not seem possible here to give clear conditions on the data to foretell which strategy is optimal in finite horizon, as in the case of constant failure rates. We however obtained such conditions in the long-time run. A few numerical experiments (see others in (Mercier 2008)) seem to indicate that the optimal strategy in the long-time run actually becomes optimal quickly, namely for t not that large. The results for the long-time run then seem to give a good indicator for the choice of the best strategy, even for t not very large.

REFERENCES

Arnold, B. C., N. Balakrishnan, and H. N. Nagaraja (1992). A first course in order statistics. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons Inc.
Barlow, R. E. and F. Proschan (1966). Inequalities for linear combinations of order statistics from restricted families. Ann. Math. Statist. 37, 1574–1592.
Clavareau, J. and P.-E. Labeau (2006a). Maintenance and replacement policies under technological obsolescence. In Proceedings of ESREL'06, Estoril (Portugal), pp. 499–506.
Clavareau, J. and P.-E. Labeau (2006b). Maintenance et stratégies de remplacement de composants soumis à obsolescence technologique. In Proc. Lambda-Mu 15, Lille (France).
Dekker, R., R. E. Wildeman, and F. A. van der Duyn Schouten (1997). A review of multicomponent maintenance models with economic dependence. Math. Methods Oper. Res. 45(3), 411–435.
Ebrahimi, N. and F. Spizzichino (1997). Some results on normalized total time on test and spacings. Statist. Probab. Lett. 36(3), 231–243.
Elmakis, D., G. Levitin, and A. Lisnianski (2002). Optimal scheduling for replacement of power system equipment with new-type one. In Proc. of MMR'2002 (Mathematical Methods in Reliability 2002), Trondheim (Norway), pp. 227–230.
Mercier, S. (2007). Discrete random bounds for general random variables and applications to reliability. European J. Oper. Res. 177(1), 378–405.
Mercier, S. (2008). Optimal replacement policy for obsolete components with general failure rates. Appl. Stoch. Models Bus. Ind. 24(3), 221–235.
Mercier, S. and P.-E. Labeau (2004). Optimal replacement policy for a series system with obsolescence. Appl. Stoch. Models Bus. Ind. 20(1), 73–91.
Michel, O., P.-E. Labeau, and S. Mercier (2004). Monte Carlo optimization of the replacement strategy of components subject to technological obsolescence. In Proc. of PSAM 7 – ESREL'04, Berlin (Germany), pp. 3098–3103.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Optimization of the maintenance function at a company

Smail Adjabi, Karima Adel-Aissanou & Mourad Azi


Laboratory LAMOS, University of Bejaia, Bejaia, Algeria

ABSTRACT: The main objective of this work is to optimize the maintenance function at the foundry of the company BCR in Algeria. For this, we use a comprehensive approach involving two global areas: the organizational aspect and the technical aspect. As a first step, we analyse the reliability of certain equipment selected through a Pareto analysis. After that, we present the influence of repair times on the unavailability of this equipment. In order to calculate the optimal renewal times of some spare parts of an equipment (as these parts account for an important unavailability time), we first make an economic evaluation, which leads to expressing the direct and indirect costs of maintenance. Finally, in order not to replace an item that is still in good condition, we give an overview of the implementation of a ‘‘non destructive control''.

1 INTRODUCTION

The essential problem for a maintenance service is the optimization of its maintenance plan and of the costs caused by accidental failures, which may have serious consequences on technical, economic and organizational parameters. Industrial maintenance, which aims to ensure the good functioning of the production tools, is a strategic function in a company. This is linked to the appearance of new kinds of management, to technological development and to the necessity of reducing production costs. Today its sole objective is no longer to repair work tools, but also to anticipate and prevent failures, in order to maximize the overall performance of production and to meet the demands of the customers in terms of quality and quantity while respecting delivery times.
The objective of this work is to minimize the costs caused by accidental failures at this company. For this, we propose to put in place the tools to organize and optimize the maintenance function through reliability, availability and techno-economic studies. The main issues to which we will try, in this work, to provide some answers are:
– What equipments should be studied?
– What are the performances of these equipments?
– What is the most efficient method of maintenance according to the performance of these equipments?
– How to calculate the optimal time for preventive renewal?
To answer all these questions, we will determine the equipments to be studied. We will then prepare an analysis of the reliability and availability of these equipments, and make a techno-economic study which will determine the optimal renewal time. Finally, we propose methods for the implementation of the non-destructive control.

2 SELECTION OF EQUIPMENT

This study is directed to the foundry section of the subsidiary SANIAK (BCR), which presents one of the major concerns of its managers, because of its importance in the factory and its significant downtime.
Given the important number of equipments that make up the section, it is essential to target only the critical equipments. For this purpose, we conducted an ABC (Pareto) analysis to determine the class of equipments which, over a year, has generated the highest downtime. We proposed that class A is the one corresponding to the equipments with a percentage of 38.24% of the breakdowns which cause over 60% of the downtime, as illustrated in Figure 1. The equipments chosen are presented in Table 1.

3 ANALYSIS OF EQUIPMENT RELIABILITY

3.1 Parametric modeling of equipment reliability

For our study, this approach begins with the assumption that the random variable X, the ‘‘lifetime'', follows a common usage model: the Weibull or the exponential distribution. The estimation of the parameters of each model was produced by the method of maximum likelihood using the statistical software R. We can then validate the
models obtained through the classic ‘‘Kolmogorov-Smirnov'' adequacy test. The results are in Table 2, where:
n: size of the sample;
β: shape parameter of the Weibull law;
η: scale parameter of the Weibull law;
Dks: empirical statistic of the Kolmogorov-Smirnov test;
d(n,0.05): tabulated quantile of the Kolmogorov-Smirnov test at the significance level 0.05.
The results show that the two-parameter Weibull model is accepted at the significance level 0.05 for the facilities DECRT, ROVT021, ROVT041, ROVT042, GREN011, GREN021, NOYT020 and BASP, but for the NOYT018 machine the Weibull model is rejected.

Figure 1. Diagram ABC (Pareto).

In addition to the tendency of the facilities DECRT, ROVT041, ROVT042, GREN011, GREN021 and NOYT020 towards the Weibull law, the exponentiality of their lifetimes is validated as well. For the equipments BASP and ROVT021, the exponential model is rejected.

Table 1. Equipment chosen.

Code      Designation

ROVT021   Forging machine, mechanical, ROVETTA F180 (4602)
BASP      Casting machine with 2 heads BP 225.S A.O
DECRT     Machine for cutting bars
NOYT018   Machine to produce cores (H 2,5 Co)
GREN021   Sandblasting installation 200 kg / Grenailleuse GF
GREN011   Grenailleuse TAMBRO 1000 EXK
NOYT020   Automatic machine to produce cores
ROVT042   Forging machine, mechanical, S/PRESSE FO300 (4604)
ROVT041   Forging machine, mechanical, S/PRESSE FO300 (4603)

3.2 Non-parametric modeling

In order to have another support for the reliability of the equipments, we opted for non-parametric modeling, in particular the graphic test based on the TTT-statistic. This consists of rejecting the exponential model if one notices a meaningful gap between the curve and the first bisector, in favor of an IFR law if the curve is concave or of a DFR law if the curve is convex.
The diagrams in Figure 2 and Figure 3 illustrate the points (i/r, S(t_i)/S(t_r)) resulting from the TTT-statistic.
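The graphic test described above can be reproduced numerically: the scaled TTT points of an exponential sample stay close to the first bisector, while an IFR (Weibull, shape greater than 1) sample yields a concave curve above it. A self-contained sketch on synthetic samples, not on the company's data:

```python
import random

def scaled_ttt(sample):
    """Scaled total-time-on-test (TTT) points (j/n, S_j/S_n) for a complete
    ordered sample, with S_j = sum_{i<=j} (n - i + 1)(t_i - t_{i-1})."""
    t = sorted(sample)
    n = len(t)
    s_vals, s, prev = [], 0.0, 0.0
    for i, ti in enumerate(t, start=1):
        s += (n - i + 1) * (ti - prev)
        prev = ti
        s_vals.append(s)
    total = s_vals[-1]
    return [(0.0, 0.0)] + [(i / n, sj / total) for i, sj in enumerate(s_vals, start=1)]

rng = random.Random(3)
pts_exp = scaled_ttt([rng.expovariate(1.0) for _ in range(1000)])
pts_ifr = scaled_ttt([rng.weibullvariate(1.0, 3.0) for _ in range(1000)])

# Scaled TTT value at u = 0.5: near 0.5 for the exponential sample,
# well above 0.5 for the concave (IFR) one.
mid_exp = pts_exp[500][1]
mid_ifr = pts_ifr[500][1]
```

Plotting these point lists against the first bisector gives exactly the kind of diagram summarized in Figures 2–3 and Table 3.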
) exit of the TTT-Statistical.

Table 2. Results of the parametric reliability modeling.

Equip n Model adjusted Parameters Dks d(n,5%)

GREN011 197 Weibull β = 1.08, η = 79.12 0.068 0.097


Exponential λ = 0.013 0.057
GREN021 244 Weibull β = 1.07, η = 66.68 0.043 0.087
Exponential λ = 0.087 0.053
NOYT018 203 Weibull β = 1.05, η = 127.28 0.097 0.095
Exponential λ = 0.0080 0.088
NOYT020 195 Weibull β = 1.09, η = 138 0.070 0.097
Exponential λ = 0.0075 0.057
ROVT021 110 Weibull β = 1.25, η = 182.04 0.055 0.13
Exponential λ = 0.0059 0.135
ROVT041 165 Weibull β = 1.15, η = 176.77 0.092 0.108
Exponential λ = 0.0059 0.075
ROVT042 112 Weibull β = 1.14, η = 196.45 0.059 0.128
Exponential λ = 0.0053 0.063
DECRT 268 Weibull β = 0.97, η = 150.98 0.050 0.083
Exponential λ = 0.0065 0.005
BASP 388 Weibull β = 1.32, η = 55.66 0.034 0.069
Exponential λ = 0.0195 0.139
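The paper performs the maximum-likelihood fits of Table 2 in R; the same two-parameter Weibull estimator can be sketched in pure Python by solving the profile score equation for the shape parameter with bisection. The data below are synthetic, not the foundry's:

```python
import math, random

def weibull_mle(data, lo=0.05, hi=50.0, iters=100):
    """Maximum-likelihood estimates (shape beta, scale eta) of a
    two-parameter Weibull law. For a fixed beta the scale has a closed
    form; beta itself solves the profile score equation, found here by
    bisection (the score is increasing in beta)."""
    n = len(data)
    mean_log = sum(math.log(t) for t in data) / n
    def score(b):
        sb = sum(t ** b for t in data)
        sbl = sum((t ** b) * math.log(t) for t in data)
        return sbl / sb - 1.0 / b - mean_log
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    beta = 0.5 * (lo + hi)
    eta = (sum(t ** beta for t in data) / n) ** (1.0 / beta)
    return beta, eta

rng = random.Random(42)
synthetic = [rng.weibullvariate(150.0, 1.1) for _ in range(300)]
beta_hat, eta_hat = weibull_mle(synthetic)
```

On a simulated sample the estimates should recover the true shape and scale up to sampling error, which is the usual sanity check before fitting real lifetimes.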

Table 3. Graphic test.

Equip     n     Curve of tendency             Model         Rate

GREN011   197   Close to the first bisector   Exponential   Constant
GREN021   244   Close to the first bisector   Exponential   Constant
NOYT018   203   Close to the first bisector   Exponential   Constant
NOYT020   195   Close to the first bisector   Exponential   Constant
ROVT021   110   Concave                       IFR           Increasing
ROVT041   165   Close to the first bisector   Exponential   Constant
ROVT042   112   Close to the first bisector   Exponential   Constant
DECRT     268   Close to the first bisector   Exponential   Constant
BASP      388   Concave                       IFR           Increasing

Figure 2. Graphic test for: DECRT, ROVT041, ROVT042 and ROVT021.

Figure 3. Graphic test for: GREN011, GREN021, NOYT018, NOYT020 and BASP.

From the preceding graphs, Figure 2 and Figure 3, we obtain the results described in Table 3.

3.3 Comparison and interpretation of the results

According to the preceding results, one notices homogeneity between the results obtained by the parametric modeling and the non-parametric ones, because they agree in the description of the type of failure.
– For BASP and ROVT021, the shape parameter of the Weibull model is greater than 1. Their failure rates are therefore increasing with age, which corresponds to the IFR law found by the graphic test. It means that their breakdowns are due to ageing.
– For the rest of the equipment, the parametric modeling of their lifetimes gave a Weibull model with shape parameter β close to 1, together with the exponential model. That is to say that the failure rate of such equipment is constant, which is confirmed by the graphic test that validated the exponential model of their lifetimes.

4 AVAILABILITY OF EQUIPMENTS

The calculation of the operational availability involves:
• reliability, through the MUT (Mean Up Time);
• maintainability, through the MTTR (Mean Time To Repair);
• the annex repair times.
The operational availability is given by:

  D_op = MUT / (MUT + MDT) = MUT / MTBF,    (1)

with MDT (Mean Down Time) including the average repair time and the annex repair times.
When these annexes are negligible compared with the repair time, the operational availability becomes:

  D_op = MUT / (MUT + MTTR).    (2)

With this approximation, one can illustrate the fragility of the maintenance management system at this company.
In what follows we model the maintainability and the down time in order to estimate respectively the MTTR
Table 4. Results of the modeling of the repair times.

Equip n Law adjusted Parameters Dks d(n,0.05) MTTR

ROVT021 111 Log-normal m = 0.93, σ = 1.27 0.11 0.13 5.67


ROVT042 113 Log-normal m = 1.06, σ = 1.13 0.08 0.12 5.48
DECRT 269 Log-normal m = 0.74, σ = 1.18 0.05 0.08 4.23
ROVT041 166 Log-normal m = 0.75, σ = 1.14 0.07 0.10 4.12
BASP 389 Log-normal m = 0.71, σ = 1.07 0.06 0.06 3.62
GREN021 245 Log-normal m = 0.78, σ = 0.96 0.07 0.08 3.49
GREN011 198 Log-normal m = 0.66, σ = 0.96 0.03 0.09 3.07
NOYT020 197 Log-normal m = 0.59, σ = 0.88 0.06 0.09 2.68
NOYT018 204 Log-normal m = 0.55, σ = 0.83 0.04 0.09 2.49

Table 5. Results of the modeling of the immobilization times.

Equip n Law adjusted Parameters Dks d(n,0.05) MDT

GREN011 198 Log-normal m = 1.17 , σ = 0.93 0.05 0.09 5.00


GREN021 245 Log-normal m = 1.46, σ = 1.04 0.07 0.08 7.38
NOYT018 204 Log-normal m = 1.21, σ = 0.92 0.08 0.09 5.14
NOYT020 197 Log-normal m = 1.20, σ = 1.01 0.05 0.09 5.49
ROVT021 111 Log-normal m = 1.61, σ = 1.54 0.12 0.13 16.37
ROVT041 166 Log-normal m = 1.51, σ = 1.33 0.08 0.11 11.05
ROVT042 113 Log-normal m = 1.86, σ = 1.36 0.13 0.13 16.35
DECRT 269 Log-normal m = 1.31, σ = 1.49 0.08 0.08 11.24
BASP 389 Log-normal m = 1.47, σ = 1.10 0.06 0.07 7.96
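The MTTR and MDT columns of Tables 4 and 5 are simply the means of the fitted log-normal laws, exp(m + σ²/2), so they can be checked directly from the reported parameters:

```python
import math

def lognormal_mean(m, sigma):
    """Mean of a log-normal law with parameters (m, sigma)."""
    return math.exp(m + sigma * sigma / 2.0)

mttr_rovt021 = lognormal_mean(0.93, 1.27)   # Table 4 reports MTTR = 5.67 h
mttr_gren011 = lognormal_mean(0.66, 0.96)   # Table 4 reports MTTR = 3.07 h
mdt_gren011 = lognormal_mean(1.17, 0.93)    # Table 5 reports MDT = 5.00 h
```

The recomputed means agree with the published columns up to the rounding of m and σ.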

Table 6. Results of the modeling of the availability, where Ddif = Dops − Dopr and Dres = (Dops − Dopr)/(1 − Dopr).

Equip MUT(h) MTTR(h) MDT(h) Dopr Dops Ddif Dres

GREN011 76.75 3.07 5.00 0.93 0.96 0.03 0.43


GREN021 64.68 3.49 7.38 0.89 0.95 0.06 0.54
NOYT018 125.17 2.49 5.14 0.96 0.98 0.02 0.50
NOYT020 133.17 2.68 5.49 0.96 0.98 0.02 0.50
ROVT021 169.2 5.67 16.37 0.91 0.97 0.06 0.66
ROVT041 167.93 4.12 11.05 0.94 0.98 0.04 0.66
ROVT042 186.62 5.48 16.35 0.92 0.97 0.05 0.62
DECRT 153 4.23 11.24 0.93 0.97 0.04 0.57
BASP 51.20 3.62 7.96 0.86 0.93 0.07 0.50
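Equations (1) and (2) applied to the GREN011 row of Table 6 give, for instance (small differences with the table come from rounding of the published inputs):

```python
# Table 6 inputs for GREN011: MUT = 76.75 h, MTTR = 3.07 h, MDT = 5.00 h.
mut, mttr, mdt = 76.75, 3.07, 5.00

d_opr = mut / (mut + mdt)      # real availability, annex repair times included
d_ops = mut / (mut + mttr)     # availability if only the repair time counted
d_dif = d_ops - d_opr          # unavailability caused by the annex times
d_res = d_dif / (1.0 - d_opr)  # share of total unavailability due to annexes
```

Since MDT exceeds MTTR, d_ops is always the larger of the two availabilities, and d_res measures how much of the downtime the annex times are responsible for.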

(Mean Time To Repair) and the MDT (Mean Down Time), in order to evaluate D_opr and D_ops.
The results are reported in Table 4, Table 5 and Table 6. In an effort to highlight the potential impact of the annex repair times on the unavailability of the equipments, we have adopted the following steps:
– calculate, as a first step, the availabilities D_opr and D_ops;
– calculate, in a second stage, the unavailability (D_ops − D_opr) generated by the waiting times around the repairs;
– finally, estimate the rate of unavailability caused by the annex repair times.
Computing the ratio of the unavailability due to the annex repair times to the total operational unavailability shows that the annex repair times represent more than half of the unavailability. An analysis of the situation is therefore indispensable in order to reduce these immobilization times. It also invites a review of the policy of spare parts stock management and
to put the adequate means in place for a better handling of the repairs.

5 OPTIMIZATION OF THE RENEWAL

The BASP machine is important in the production chain and its unavailability rate is significant. It is on this account that we chose to study the optimal replacement of some of its components, in order to optimize its overall performance.

5.1 Selection of components

We have chosen components which have a fairly large frequency of use, in order to have enough data. These components are five in number, presented in Table 7.

Table 7. The chosen components.

Code     Designation

POT300   POTENTIOMETER LWG300 N° 014311
THER     THERMOCOUPLE NI CR AL 14X2
DETP     END OF COURSE 516326E
AXCL     AXIS FLUTED IMR
VISB     SCREW OF BLOCKAGE

5.2 Evaluation of maintenance costs

The costs involved in a maintenance policy can be separated into C_p (cost of preventive maintenance) and C_d (cost of failure in service). The first category involves the prices of spare parts and of labour; whatever the planned maintenance is, no gain is possible on it. For the second category, one variable is taken into account: the cost of the unavailability related to the maintenance actions to be carried out. The results are given in Table 8.

Table 8. Costs of maintenance.

Code     C_p (DA)    C_d (DA)

POT300   47036.96    24000
THER     3136.46     28800
DETP     3772.05     19200
AXCL     6813.46     48000
VISB     291.99      24000

5.3 Replacement depending on the age

The age-based replacement policy consists of replacing a component by a new one, at a cost C_p, as soon as it reaches the age T. A failure before the age T causes a breakdown cost C_d + C_p.
Under this policy, the average cost per usage unit is estimated by:

  γ(T) = ( (C_d + C_p)[1 − R(T)] + C_p R(T) ) / ∫_0^T R(t) dt    (3)

In the case where T tends to infinity, we find the average cost per usage unit of a purely corrective maintenance, γ_c:

  γ(∞) = (C_d + C_p) / ∫_0^∞ R(t) dt = (C_p + C_d) / MUT = γ_c.    (4)

The economic justification of the preventive maintenance comes from the calculation of the gain obtained by using a preventive maintenance policy:

  Gain = γ(∞) − γ(T_0),

where the optimum time T_0 is obtained by setting the derivative of γ(T) to zero, i.e. it is a solution of the equation:

  λ(T) ∫_0^T R(t) dt + R(T) = C_d / (C_d − C_p).    (5)

If C_p > C_d, the equation has no solution and the most economical renewal is the curative one.

5.3.1 Research of the optimum replacement: Kelly model
This model allows one to determine the optimum time of preventive replacement, based on the parameters of the Weibull law and on the Kelly charts (abaques). To use these charts it is necessary to determine the shape parameter of the Weibull law and the ratio r = C_d/C_p. For this we have modeled the lifetimes of the components with the Weibull law; the results are presented in Table 9. The results of the calculation of the optimum time T_0 of preventive replacement are given in Table 10.

5.3.2 Kay model
From equations (3) and (4), the preference relationship γ(T) < γ(∞) is expressed by the formula:

  ∫_0^T R(t) dt / ∫_0^∞ R(t) dt ≥ k + (1 − k) F(T),   k = C_p / (C_p + C_d)    (6)
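Equation (3) can also be minimized numerically instead of reading the Kelly charts. A sketch for the AXCL component, with R(t) the Weibull survival function fitted in Table 9 and the costs of Table 8; the grid search and trapezoidal integration are implementation choices here, not the paper's method:

```python
import math

beta, eta = 3.39, 5156.40        # AXCL Weibull fit (Table 9)
cp, cd = 6813.46, 48000.0        # AXCL costs (Table 8)

def reliability(t):
    return math.exp(-((t / eta) ** beta))

def integral_R(t, steps=2000):
    """Trapezoidal approximation of the integral of R over [0, t]."""
    h = t / steps
    acc = 0.5 * (reliability(0.0) + reliability(t))
    acc += sum(reliability(i * h) for i in range(1, steps))
    return acc * h

def gamma(t):
    """Equation (3): average cost per usage unit under age replacement at t."""
    f = 1.0 - reliability(t)
    return ((cd + cp) * f + cp * (1.0 - f)) / integral_R(t)

t0 = min(range(200, 8001, 50), key=gamma)     # crude grid search for T0
gamma_c = (cp + cd) / integral_R(40000.0)     # equation (4), corrective only
```

The grid minimum falls in the same range as the T_0 ≈ 2320–2400 h obtained for AXCL by the Kelly and Kay procedures (Tables 10 and 12), and its cost rate is well below the purely corrective rate γ_c.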
Table 9. Modeled lifetimes of the components.

Equip    n    Law       Parameters               D(ks)   d(n,5%)

POT300   16   Weibull   β = 3.86, η = 3284.45    0.143   0.328
THER     26   Weibull   β = 2.21, η = 1002.25    0.267   0.267
DETP     55   Weibull   β = 4.23, η = 2460.00    0.114   0.183
AXCL     17   Weibull   β = 3.39, η = 5156.40    0.168   0.318
VISB     28   Weibull   β = 2.33, η = 1968.67    0.075   0.257

Table 10. Optimal times of the renewals (Kelly charts).

Compo    Parameters of Weibull law    r = C_d/C_p    X      T_0 = X·η

POT300   β = 3.86, η = 3284.45        0.51           –      –
THER     β = 2.21, η = 1002.25        9.18           0.36   360.81
DETP     β = 4.23, η = 2460           5.18           0.52   1279.2
AXCL     β = 3.39, η = 5156.40        7.05           0.45   2320.38
VISB     β = 2.33, η = 1968.67        82.19          0.17   334.67

This relationship has an interesting graphical interpretation, exploited by Kay (1976) in the case of several criteria. The first member of the relationship is a strictly increasing function Φ of t = F(T), from the interval [0, 1] to [0, 1]. It is written in the form:

  Φ(t = F(T)) = ∫_0^{T = F^{−1}(t)} R(u) du / ∫_0^∞ R(u) du.    (7)

The time T_0 minimizes γ(T) defined by relation (3) exactly when t* = F(T_0) maximizes the expression Φ(t)/[t + k/(1 − k)]. The abscissa t* is the point of intersection of the curve y = Φ(t) with its tangent y = Φ′(t*)·[t + k/(1 − k)] passing through the point of abscissa t = t_k = −k/(1 − k) and ordinate 0. The optimal replacement age T_0 = F^{−1}(t*) thus comes from the graphical procedure on the TTT-transformation, illustrated for the component AXCL in Figure 4.
From the graph we have t* = i/r = 0.125 and, going back to Table 11, it corresponds to T_2 = 2400. The optimal renewal time for the component AXCL is then T_0 = 2400 (h). Results for the other components are in Table 12.

Figure 4. Graphic procedure of calculation of the optimal age for AXCL.

Table 11. Calculation of the points of the graph resulting from the TTT-transformation for AXCL.

Rang i   T_i    T_i − T_{i−1}   t_i = i/r   S(T_i)   S(T_i)/S(T_16)

0        0      –               –           0        0
1        1000   1000            0.062       24000    0.307
2        2400   1400            0.125       45000    0.575
3        2500   100             0.187       46400    0.593
4        3200   700             0.25        55500    0.709
5        3200   0               0.312       55500    0.709
6        4500   1300            0.375       69800    0.892
7        4500   0               0.437       69800    0.892
8        4588   88              0.5         70592    0.903
9        4800   212             0.562       72288    0.924
10       4840   40              0.625       72568    0.928
11       4900   60              0.687       72928    0.932
12       5240   340             0.75        74628    0.954
13       5290   50              0.812       74828    0.957
14       5900   610             0.875       76658    0.98
15       6400   500             0.937       77658    0.993
16       6912   512             1           78170    1

Table 12. Optimal time of preventive renewal.

Code     k       t_k       t*      T_0

POT300   1.96    2.04      −       −
THER     0.11    −0.12     0.038   350
DETP     0.19    −0.23     0.163   1282
AXCL     0.14    −0.16     0.125   2400
VISB     0.012   −0.012    0.036   345

5.3.3 Interpretation and comparison of the results
Based on the results, there is a similarity between the results found with the two models. They show that if C_p > C_d there is no optimal time for renewal, and that the most advantageous maintenance is the curative one.
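Kay's selection of t* can be carried out numerically on the Table 11 data instead of graphically; maximizing Φ(t_i)/(t_i + k/(1 − k)) over the TTT points, with k = 0.14 as in Table 12, reproduces the AXCL result t* = 0.125 and T_0 = 2400 h:

```python
# Scaled TTT data for AXCL transcribed from Table 11 (t_i = i/r and
# phi_i = S(T_i)/S(T_16)), and k = 0.14 from Table 12.
t_i = [0.062, 0.125, 0.187, 0.25, 0.312, 0.375, 0.437, 0.5,
       0.562, 0.625, 0.687, 0.75, 0.812, 0.875, 0.937, 1.0]
phi = [0.307, 0.575, 0.593, 0.709, 0.709, 0.892, 0.892, 0.903,
       0.924, 0.928, 0.932, 0.954, 0.957, 0.98, 0.993, 1.0]
T = [1000, 2400, 2500, 3200, 3200, 4500, 4500, 4588,
     4800, 4840, 4900, 5240, 5290, 5900, 6400, 6912]

k = 0.14
shift = k / (1.0 - k)    # equals -t_k, about 0.163
best = max(range(len(t_i)), key=lambda i: phi[i] / (t_i[i] + shift))
t_star, T0 = t_i[best], T[best]
```

Geometrically, this picks the TTT point where the line from (t_k, 0) to the curve is steepest, i.e. the tangency point of the graphical construction.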
The optimal times of preventive renewal obtained by the two methods are similar. The difference is due to the fact that the results found by the Kay model are based on the values of the whole sample, i.e. on discrete values, whereas those found by the Kelly model are obtained from the adjustment of a continuous law.

6 NON DESTRUCTIVE CONTROL

After the results of the optimal renewal times are obtained, and in order not to change a component in good condition, it is proposed to implement the procedures of organization of a non-destructive control (NDC), which consists of:
• defining the objectives and the equipments to be followed, and evaluating the causes of failures;
• studying the feasibility of the NDC;
• choosing the methods and techniques of the non-destructive inspections to be used;
• studying and establishing alarm guidelines;
• establishing economic assessments;
• training all the staff concerned.

7 CONCLUSION

In this study we used a global approach involving two main areas: the organizational aspect and the techno-economic aspect. In a first step, thanks to the ABC analysis, it was possible to identify the equipments that cause more than 60% of the immobilization of the foundry section. After that, we modeled the equipments' reliability using parametric and non-parametric approaches, which helped to highlight the types of failures of these equipments. We have seen that the majority of these are subject to random failures. Then we introduced a phenomenon found in the collection of data, which is the unavailability linked to the annex repair times, and which represents more than half of the operational unavailability of the equipments.
The evaluation of the economic consequences has led us to express the direct and indirect costs of maintenance in order to calculate the optimal renewal times of certain pieces of equipment that present an important unavailability time (cost). This shows that the decisions of preventive renewal are not only the results of a technical study.
For a better maintenance management system, it is proposed to revise the current management system, by investing in the implementation.

REFERENCES

Bunea, C. and T. Bedford (2002). The effect of model uncertainty on maintenance optimization. IEEE, 486–493.
Canfield, R. (1983). Cost optimisation of periodic preventive maintenance. IEEE, 78–81.
Cocozza-Thivent, C. (1997). Processus stochastiques et fiabilité des systèmes. Springer.
Gasmi, S., C. Love, and W. Kahle (2003). A general repair, proportional-hazards, framework to model complex repairable systems. IEEE, 26–32.
Lyonnet, P. (2000). La maintenance: mathématiques et méthodes. Tec and Doc.
Pellegrin, C. (1997). Fondements de la Décision Maintenance. Economica.
Pongpech, J. and D. Murthy. Optimal periodic preventive maintenance policy for leased equipment. Science Direct.
Priel, V. (1976). La maintenance: Techniques Modernes de Gestion. EME.
Yves, G., D. Richet, and A. Gabriel (1999). Pratique de la Maintenance Industrielle. Dunod.

Planning and scheduling maintenance resources in a complex system

Martin Newby & Colin Barker


Centre for Risk Management, Reliability and Maintenance
City University, London

ABSTRACT: The inspection and maintenance policy is determined by the crossing of a critical threshold
by an aggregate performance measure. Rather than examining the first hitting time of the level, we base our
decisions on the probability that the system will never return to the critical level. The inspection policy is state
dependent and we use a "scheduling function" to determine the time to the next inspection given the system
state. Inspection reveals the true state of the system and allows the determination of the appropriate action, do
nothing or repair, and the time of the next inspection. The approach is illustrated using a multivariate system
model whose aggregate measure of performance is a Bessel process.
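A quick way to see the behaviour of such an aggregate performance measure is to simulate the underlying multivariate Wiener process and track its radial norm. The sketch below is illustrative only; the function name, discretisation and parameter choices are ours, not the authors' (here N = 3 components so that ν = N/2 − 1 = 0.5, and drift norm 2, matching the Bes0(0.5, 2) process used in the paper's numerical results).

```python
import math
import random

def simulate_performance(mu, sigma=1.0, t_max=10.0, dt=0.01, seed=1):
    # Euler scheme for W_t = mu * t + sigma * B_t in R^N; the observed
    # performance measure is the radial norm R_t = ||W_t||_2, which for a
    # process started at the origin is a Bessel-type process with drift.
    rng = random.Random(seed)
    n = len(mu)
    w = [0.0] * n
    path = [0.0]
    for _ in range(int(t_max / dt)):
        for i in range(n):
            w[i] += mu[i] * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(math.sqrt(sum(c * c for c in w)))
    return path

path = simulate_performance(mu=[2.0, 0.0, 0.0])
```

Because the process is transient, a sampled path of R_t drifts upward towards the failure threshold rather than returning repeatedly to low levels.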

1 INTRODUCTION

The models derived in this paper are a natural extension of models which use the first hitting time of a critical level as a definition of failure (Barker and Newby 2006). Here we develop a model in which the system is repaired if the probability of returning to the critical level is small (a last exit time) and it has not crossed a second level which corresponds to catastrophic failure. The intention is to maintain a minimum level of performance. The approach is appropriate when the system is subject to relatively minor repairs until it begins to degrade faster and requires major repair. This is typically the behaviour of infrastructure and large capital items. It also captures the behaviour of systems which eventually become economically obsolete and not worth repairing. The system is complex in the sense that it consists of a number of components whose states evolve in time. The system state is summarized using a Bessel process R_t ∈ [0, ∞) which is transient and thus tends to increase. Transience ensures that the process will eventually escape to ∞. There are two critical levels, ξ and F > ξ. The system is repaired if on inspection it has a small probability of returning to ξ, and suffers a catastrophic failure if it reaches F. The time to inspection and repair is determined by a scheduling function (Grall et al. 2002) which gives the time until the next action as a function of the current state.

The threshold ξ defines the repair actions and is incorporated in a maintenance function r. The actions are determined by the probability that the process has escaped from [0, ξ), and F defines the failure of the system and hence its replacement. The sequence of failure and replacement times G^0_F constitutes a renewal process. This embedded renewal process is used to derive the expected cost per unit time over an infinite time horizon (for the periodic inspection policy) and the total expected cost (for the non-periodic inspection policy). The costs are optimized with respect to the system parameters.

1.1 Modelling degradation

The system is complex, consisting of N components, and its state is an N-dimensional Wiener process

  W_t = μt + σB_t,  W_0 = 0

with μ = [μ_1, . . . , μ_N]^T and B_t = [B_t^(1), . . . , B_t^(N)]^T, where B_t^(i) is a standard Brownian motion.

The individual processes are not observed and decisions are based on a performance measure R_t = ||W_t||_2, the L2-norm of W_t. Without loss of generality we assume that σ = 1. R_t is the radial norm of a drifting Brownian motion starting at the origin, a Bessel process Bes0(ν, μ) starting at the origin with parameter ν and drift μ (Rogers and Pitman 1980), where

  ν = N/2 − 1,  μ = ||μ||_2

The properties of the Bessel process entail some changes in the way the model is developed. Because the radial part of a Brownian motion with drift starting at x > 0 is not a Bessel process with drift (Rogers and
Pitman 1980), we handle repair by adjusting the thresholds. The difficulty is resolved by calculating the distance remaining between the observed state at repair and the threshold, and representing repair by restarting the process from the origin (in R^N) and lowering the threshold to that remaining distance. Extensive treatments of the Bessel process with drift and the radial Brownian motion are given in (Revuz and Yor 1991; Pitman and Yor 1981; Rogers and Pitman 1980).

2 PERIODIC INSPECTIONS

2.1 Features of the model

2.1.1 Model assumptions
The model assumes: a) the inspection policy is to inspect at fixed intervals τ; inspections are perfect and instantaneous; moreover, the system state is known only at inspection or failure; b) the system starts from new, at t = 0, R_0 = 0; c) the thresholds are F and ξ < F; d) each inspection incurs a fixed cost C_i; e) catastrophic failures are instantaneously revealed as the first hitting time of the threshold F; f) the system is then instantaneously replaced by a new one at cost C_f; g) each maintenance action incurs a cost determined by a cost function C_r; h) the transition density for R_t starting from x is f^x_τ(y) ≡ f(y | x, τ).

2.1.2 Settings for the model
The state space in which the process evolves is partitioned into a normal range [0, ξ), a deteriorated range [ξ, F) and a failed range [F, ∞):

  R+ = [0, ξ) ∪ [ξ, F) ∪ [F, +∞)

The decisions are based on the last exit time from the critical threshold ξ,

  H^0_ξ = sup_{t ∈ R+} {R_t ≤ ξ | R_0 = 0},

which is not a stopping time. Because H^0_ξ is not a stopping time, we work with the probability of not returning to the level ξ before the next inspection, P[H^0_{ξ−x} ≥ τ]. The catastrophic failure time is

  G^0_F = inf_{t ∈ R+} {R_t = F | R_0 = 0},

which is a stopping time. The density of G^x_F is g^x_F.

Inspection at time t = τ (immediately before any maintenance) reveals the system's performance measure R_τ. The level of maintenance (replacement or imperfect maintenance) is decided according to whether the system has failed, G^0_F ≤ τ, or is still working, G^0_F > τ. Replacement is determined by the first hitting time of threshold F.

Maintenance actions are modelled using a function r to specify the amount by which both of the threshold values are decreased. The maintenance function depends on the probability of return to ξ:

  r(x) = x   if P[H^0_{ξ−x} ≤ τ] ≤ 1 − ε
  r(x) = kx  if P[H^0_{ξ−x} ≤ τ] > 1 − ε

where 0 < ε < 1 and k ∈ [0, 1]. Standard models can be recovered: ε = 0 corresponds to no maintenance; k = 1 corresponds to minimal repair (as bad as old); and k = 0 corresponds to perfect repair (good as new).

The cost function depends on the amount, r, by which the threshold values are decreased:

  C_r(x) = 0      if P[H^0_{ξ−r(x)} ≤ τ] ≤ 1 − ε
  C_r(x) = C_rep  if P[H^0_{ξ−r(x)} ≤ τ] > 1 − ε

The transience of the Bessel process implies that C_r is well defined: ∀ ε ∈ (0, 1), ∃ τ* ∈ R+ such that

  ∀ τ ≤ τ*, P[H^0_{ξ−x} ≤ τ] ≤ 1 − ε
  ∀ τ > τ*, P[H^0_{ξ−x} ≤ τ] > 1 − ε

2.1.3 The framework
There are two possibilities: the system fails, G^0_F ≤ τ, or is still working, G^0_F > τ, after crossing ξ. At inspection time t_1 and before any maintenance action, the performance measure is R_{t_1} = x. Maintenance lowers the threshold values ξ → ξ − r(x) and F → F − r(x), so considering the next interval:

1. G^0_{F−r(x)} > τ: the system survives until the next planned inspection in τ units of time. The next inspection is at t_1 + τ with cost C_i. The cost of repair at this next inspection is C_r(r(x)). The performance measure at time t_1 + τ is R^0_τ and determines the reduction in the thresholds.
2. G^0_{F−r(x)} ≤ τ: the performance measure hits the threshold F − r(x) before the inspection at t_1 + τ. The system fails and is instantaneously replaced with cost of failure C_f. These failure times form a renewal process.

Each cycle consists of a sequence of occurrences of case 1 and ends with case 2 as the system fails and is replaced.

2.2 Optimal periodic inspection policy

2.2.1 Expected cost per cycle
If R_τ = x at time τ−, an inspection prior to any maintenance, we set R_{τ+} = x and the threshold values are adjusted to F − r(x) and ξ − r(x) at τ+, just after the action. A recursive argument yields an analytical
expression for the cost of inspection and maintenance per cycle. The cost per cycle is V^x_τ, given that R_τ = x:

  V^x_τ = C_f 1{G^0_{F−r(x)} ≤ τ} + [C_i + C_r(x) + V^{R^0_τ}_τ] 1{G^0_{F−r(x)} > τ}

where V^{R^0_τ}_τ is the future cost restarting from the renewed state 0. Taking the expectation:

  v^x_τ = E[V^x_τ] = A + B   (1)

  A = E[C_f 1{G^0_{F−r(x)} ≤ τ}] = C_f ∫_0^τ g^0_{F−r(x)}(y) dy

  B = E[(C_i + C_r(x) + V^{R^0_τ}_τ) 1{G^0_{F−r(x)} > τ}]
    = {C_i + C_r(x)} (1 − ∫_0^τ g^0_{F−r(x)}(y) dy)
      + (1 − ∫_0^τ g^0_{F−r(x)}(y) dy) ∫_0^{F−r(x)} v^y_τ f^0_τ(y) dy

We restructure the expected cost as

  v^x_τ = Q(x) + λ(x) ∫_0^{F−r(x)} v^y_τ f^0_τ(y) dy   (2)

with

  λ(x) = 1 − ∫_0^τ g^0_{F−r(x)}(y) dy

  Q(x) = (1 − λ(x)) C_f + λ(x) {C_i + C_r(r(x))}   (3)

2.2.2 Expected length of a cycle
The expected length of a cycle, l^x_τ, is obtained similarly. The length of a cycle L^x_τ is

  L^x_τ = G^0_{F−r(x)} 1{G^0_{F−r(x)} ≤ τ} + (τ + L^{R^0_τ}_τ) 1{G^0_{F−r(x)} > τ}

where L^{R^0_τ}_τ is the length of a cycle restarting in state 0. The expected value is

  l^x_τ = P(x) + λ(x) ∫_0^{F−r(x)} l^y_τ f^0_τ(y) dy   (4)

with λ defined in (3) and

  P(x) = ∫_0^τ y g^0_{F−r(x)}(y) dy + τ λ(x)   (5)

2.2.3 Expected cost per unit time
A standard renewal reward argument gives the cost per unit time,

  C^x_τ = v^x_τ / l^x_τ

with expressions for v^x_τ, l^x_τ given in (1) and (4) respectively.

2.2.4 Obtaining solutions
The density, g^0_F, for the first hitting time of a Bessel process with drift is known only through its Laplace transform (Pitman & Yor 1981; Yin 1999). The transform is, for ν > 0,

  E[exp(−(β²/2) G^0_F)] = (√(β² + μ²) / μ)^ν · I_ν(μF) / I_ν(F √(β² + μ²))

Solutions to (2) were obtained by performing numerical inversions of the Laplace transform using the EULER method (Abate & Whitt 1995).

The Volterra equations (2) and (4) are reformulated as Fredholm equations

  v^x_τ = Q(x) + λ(x) ∫_0^F K{x, y} v^y_τ dy

  l^x_τ = P(x) + λ(x) ∫_0^F K{x, y} l^y_τ dy

with Q, λ as in (3), P as in (5) and

  K{x, y} = 1{y ≤ F − r(x)} f^0_τ(y)

They are solved numerically using the Nystrom routine with an N-point Gauss-Legendre rule. For x_i ∈ (ξ, F], r(x_i) is not defined since ξ − x_i < 0. For such values we take r(x_i) = k x_i, i.e. repair is undertaken on the system. If ξ − r(x_i) < 0, a cost of repair is automatically included at the next inspection time. This seems to be a reasonable assumption since R_t is positive and will therefore always stay above such a threshold with negative value, meaning that the last exit time has already happened and hence that repair must be considered.

The optimal period of inspection and repair threshold can then be determined as:

  (τ*, ξ*) = argmin_{(τ, ξ) ∈ R+ × [0, F]} {C^0_τ}

3 NON-PERIODIC INSPECTIONS

3.1 Features of the model

The extension to non-periodic inspection policies shares many features with the periodic policy
described in 2.1. The complexities of a dynamic programming formulation are avoided by introducing a scheduling function τ = m(x) which determines the time τ to the next inspection based on the observed system state x. The scheduling function develops the sequence of inspections in the following way: an inspection at τ_i reveals R_{τ_i} = x, the repair is r(x), and the next inspection is scheduled at m(r(x)).

Different inspection policies are obtained through the use of three scheduling functions m_1, m_2 and m_3, modelled on the approach in (Grall et al. 2002):

  m_1[x | a, b] = max(1, a − ((a − 1)/b) x)

  m_2[x | a, b] = ((x − b)²/b²)(a − 1) + 1  for 0 ≤ x ≤ b;  1 for x > b

  m_3[x | a, b] = (√a − ((√a − 1)/b) x)²  for 0 ≤ x ≤ b;  1 for x > b

All the functions decrease from a to 1 on the interval [0, b] and then remain constant at 1. The scheduling function m_1 decays linearly; the function m_2 is concave, initially steeply declining and then slowing; the function m_3 is convex, initially slowly declining and then more rapidly. The different shapes reflect different attitudes to repair: m_3 tends to give longer intervals initially and m_2 gives shorter intervals more rapidly.

To avoid a circular definition, the maintenance function r employs the performance measure just before repair:

  r(x) = x   if P[H^0_{ξ−x} ≤ m(x)] ≤ 1 − ε
  r(x) = kx  if P[H^0_{ξ−x} ≤ m(x)] > 1 − ε

with 0 < ε < 1, k ∈ (0, 1].

The next inspection is scheduled in m(r(x)) units of time with cost C_r(r(x)), where

  C_r(r(x)) = 0      if P[H^0_{ξ−r(x)} ≤ m(r(x))] ≤ 1 − ε
  C_r(r(x)) = C_rep  if P[H^0_{ξ−r(x)} ≤ m(r(x))] > 1 − ε

3.2 Expected total cost

The expected total cost in the non-periodic case may be deduced with the use of the scheduling function.

  V^x = [C_f + V^0] 1{fails given r(x)} + [C_i + C_r(x) + V^{R^0_{m(r(x))}}] 1{survives given r(x)}
      = [C_f + V^0] 1{G^0_{F−r(x)} ≤ m(r(x))} + [C_i + C_r(x) + V^{R^0_{m(r(x))}}] 1{G^0_{F−r(x)} > m(r(x))}

Taking the expectation,

  v^x = E[V^x] = A + B + C

  A = [C_f + v^0] ∫_0^{m(r(x))} g^0_{F−r(x)}(y) dy

  B = {C_i + C_r(x)} (1 − ∫_0^{m(r(x))} g^0_{F−r(x)}(y) dy)

  C = (1 − ∫_0^{m(r(x))} g^0_{F−r(x)}(y) dy) ∫_0^F 1{y ≤ F−r(x)} v^y f^0_{m(r(x))}(y) dy

which may be re-arranged as

  v^x = Q(x) + (1 − λ(x)) v^0 + λ(x) ∫_0^F K{x, y} v^y dy   (6)

with

  λ(x) = 1 − ∫_0^{m(r(x))} g^0_{F−r(x)}(y) dy

  Q(x) = (1 − λ(x)) C_f + λ(x) {C_i + C_r(r(x))}

  K{x, y} = 1{y ≤ F−r(x)} f^0_{m(r(x))}(y)

3.3 Obtaining solutions

While (6) contains v^x and v^0, because we need only the value of v^0 we change the equation to

  v^x = Q(x) + (1 − λ(x)) v^x + λ(x) ∫_0^F K{x, y} v^y dy

and solve for v^x as for the periodic model, as in (Barker & Newby 2007). The solution is then obtained by setting x = 0.

4 NUMERICAL RESULTS

4.1 Periodic inspection policy

The numerical results are based on Bes0(0.5, 2) as R_t and a failure threshold F = 12. The choices for the costs and the maintenance function's parameters are (C_i, C_rep, C_f) = (50, 200, 500) and (k, ε) = (0.9, 0.5). They are varied to investigate the behaviour of the model. The choices were made arbitrarily to show some important features of both the inspection and maintenance policies.
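The three scheduling functions m_1, m_2 and m_3 can be sketched directly in code. This is an illustration, not code from the paper: the quadratic form used for m_3 is reconstructed from the stated endpoints (each function decreases from a at x = 0 to 1 at x = b and is constant at 1 beyond b), so its exact shape should be treated as an assumption.

```python
import math

def m1(x, a, b):
    # Linear decay from a at x = 0 to 1 at x = b, constant at 1 beyond b.
    return max(1.0, a - (a - 1.0) * x / b)

def m2(x, a, b):
    # Quadratic with vertex at x = b: steep initial decline that then slows.
    if x > b:
        return 1.0
    return ((x - b) ** 2 / b ** 2) * (a - 1.0) + 1.0

def m3(x, a, b):
    # Quadratic in square-root scale, reconstructed so that m3(0) = a and
    # m3(b) = 1 (assumed form).
    if x > b:
        return 1.0
    return (math.sqrt(a) - (math.sqrt(a) - 1.0) * x / b) ** 2
```

For example, with (a, b) = (2.4, 2.5), a pair of the order reported in the results tables, all three functions return 2.4 at x = 0 and 1 at any x ≥ 2.5, differing only in how the inspection interval shrinks in between.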
Table 1. Optimal parameters given k = 0.9, ε = 0.5.

(C_i, C_rep, C_f)   (τ*, ξ*)     v^0_τ*    l^0_τ*    C^0_τ*
(0.5, 200, 500)     (1.2, 9)     9255.1    20781     0.45
(50, 200, 500)      (1.6, 9.5)   6972      208.81    33.39
(500, 200, 500)     (max, any)   500       5.83      85.89
(50, 2, 500)        (1.8, 7.5)   2640.2    81.77     32.29
(50, 200, 500)      (1.6, 9.5)   6972      208.81    33.39
(50, 2000, 500)     (1.6, 10.5)  5542.3    163.69    33.86
(50, 200, 5)        (max, any)   5         5.83      0.86
(50, 200, 500)      (1.6, 9.5)   6972      208.81    33.39
(50, 200, 5000)     (1.4, 8.5)   90295     2350.5    38.42

Figure 1. Effect of parameter ξ on C^0_τ* with (k, ε, τ*) = (0.9, 0.5, 1.6). [Plot: expected cost per unit time against the repair threshold ξ, minimised near ξ = 9.5.]

Table 2. Optimal inspection period & expected cost per unit time for different values of k (ε = 0.5).

k     τ*    C^0_τ*
0     3.4   15.72
0.1   3.2   17.04
0.2   3.0   18.80
0.3   2.8   20.37
0.4   2.4   23.08
0.5   2.4   25.18
0.6   2.0   28.59
0.7   2.0   30.02
0.8   1.6   32.78
0.9   1.6   33.39
1     1.4   36.21

4.1.1 The influence of the costs
The response of the model to costs is examined with costs C_i ∈ {0.5, 50, 500}, C_rep ∈ {2, 200, 2000} and C_f ∈ {5, 500, 5000}. The optimal period of inspection τ*, repair threshold ξ* and expected cost per unit time C^0_τ* are summarized in Table 1. The expected cost and expected length per cycle at the optimum are v^0_τ* and l^0_τ*.

As C_i increases, the optimal expected cost per unit time and the optimal period of inspection increase. Increasing C_i makes inspection more expensive, resulting in less frequent inspection, and reduces l^0_τ* because there will be more failures.

Changing C_rep affects the optimal period of inspection and gives higher values for the optimal repair threshold. The higher threshold ξ* reduces the frequency of repairs, hence reducing costs. The optimal strategy is driven by the repair threshold, which determines the frequency of maintenance and thus the optimal expected total cost.

Increasing C_f increases τ* and ξ*. For a low cost of failure (i.e. C_f << C_i + C_rep) the optimal strategy is to let the system fail and then replace it, resulting in a lower cost than a repair or a simple inspection.

4.1.2 Investigating the maintenance actions
The maintenance function considered has parameters (k, ε) = (9/10, 1/2):

  r(x) = x     if P[H^0_{ξ−x} ≤ τ] ≤ 1/2
  r(x) = 0.9x  if P[H^0_{ξ−x} ≤ τ] > 1/2

The effects of parameters ξ, k and ε on the model with (C_i, C_rep, C_f) = (50, 200, 500) are examined. In Table 1 the optimal parameters are (τ*, ξ*) = (1.6, 9.5).

i. The repair threshold: The optimal solution depends strongly on ξ when the other parameters remain fixed, as is shown in Figure 1.
ii. Level of repair: The level of repair increases from perfect repair with k = 0 to minimal repair
with k = 1. Table 2 shows the uniform effect of the repair parameter: the cycle length decreases and the cost increases as k increases.
iii. Repair parameter: Repair is determined by the parameter ε and the probability P[H^0_{ξ−x} ≤ τ]. The different values for ε, 0.1, 0.5 and 0.9, reflect the decision maker's attitude towards repair. Values close to 1 correspond to almost certain repair, and as the value decreases to 0 repair occurs less frequently, a riskier position. The results in Table 3 show that only the threshold responds to ε.

Table 3. Optimal parameters for different ε.

ε        0.1     0.5     0.9
ξ*       8       9.5     11
τ*       1.6     1.6     1.6
C^0_τ*   33.39   33.39   33.39

Table 4. Optimal expected total cost and parameters a, b and ξ.

Repair       Scheduling
threshold    function     a*    b*    v*
ξ = 1        m1           2.2   1.5   1171.7
             m2           2.1   4.2   1169.5
             m3           2.1   0.9   1170.4
ξ = 2        m1           2.2   1.7   1189.1
             m2           2.2   2.9   1194
             m3           2.1   1     1189.9
ξ = 3        m1           2.4   2.5   1546.3
             m2           2.5   2.8   1572.1
             m3           2.4   1     1547.8
ξ = 4        m1           5.2   3.7   2.3283 × 10^5
             m2           5.2   3.8   2.34 × 10^5
             m3           5.2   1.9   2.3264 × 10^5
ξ = 5        m1           6.5   0.5   3.8437 × 10^6
             m2           6.5   0.7   3.8437 × 10^6
             m3           6.5   0.5   3.8437 × 10^6

Figure 2. Effect of parameter ε on the optimal solution C^0_τ* with parameters (k, ξ*, τ*) = (0.9, 9.5, 1.6).

The model adapts itself to the decision maker's attitude to repair (the value of ε) by moving the optimal repair thresholds. As ε increases, repairs will be considered more often, but ξ* increases to restrain the frequency of repairs. The optimal expected cost per unit time remains constant in the three cases studied. Figure 2 clearly shows that this is not the case for inspection periods τ ∈ (τ*, τ_{ε=0.1}], where τ_{ε=0.1} satisfies

  ∀ t > τ_{ε=0.1}: P[H^0_ξ < t] > 1 − 0.1

For most values in this interval, the expected cost per unit time increases with ε: the model penalizes a costly strategy that favours too many repairs. For a period of inspection greater than τ_{ε=0.1}, the expected costs per unit time are identical, since in all three cases the approach towards repair is similar: the system will be repaired with certainty,

  P[H^0_ξ < t] > 0.9 ⇒ P[H^0_ξ < t] > 0.5 ⇒ P[H^0_ξ < t] > 0.1.

4.2 Non-periodic inspection policy

The results are obtained using Bes0(0.5, 2) for R_t and with F = 5. The different costs and the maintenance function's parameters are (C_i, C_rep, C_f) = (50, 100, 200), (k, ε) = (0.1, 0.5).

The optimal solution is determined by the optimal parameter values (a*, b*) rather than the solution of a more complex dynamic programming problem to determine a policy {τ_1, τ_2, . . . , τ_n, . . .}.

4.2.1 The optimal maintenance policy
Table 4 reports the optimal solutions for a range of thresholds and for the different scheduling functions. An optimal solution was found in each case. The solutions are not particularly sensitive to (a, b), but range over several orders of magnitude as the threshold ξ varies.

4.2.2 The influence of costs
We take an example with scheduling function m_1 and ξ = 3. The optimal parameters (a*, b*) and total cost are summarized in Table 5. As C_i increases, the optimal values of a and b increase, making inspection less frequent when the cost of inspection increases.

4.2.3 Investigating the maintenance actions
i. Level of repair: The optimal costs for the three inspection scheduling functions with k ∈ [0, 1] and repair threshold ξ = 3 are shown in Figure 3. In all three cases the expected total cost increases with k, implying a reduction in the amount of maintenance undertaken on the system at each repair. The system will therefore require more frequent repairs or will fail sooner, implying an increase in the total expected cost value.
ii. Attitude to repair: The attitude of the decision maker towards repair is reflected in the parameter ε ∈ [0, 1]. The optimal expected costs obtained with corresponding optimal parameters with ε = 0.1, 0.5, 0.9 and ξ = 3 are summarized in Table 6. Letting ε approach zero means that the
decision maker tends to a safer maintenance approach. Changes in ε induce changes in the optimal inspection policy and the resulting optimal expected total cost (Table 6).

Table 5. Optimal expected total cost and parameters (a, b) for different values of the maintenance costs, ξ = 3.

(C_i, C_rep, C_f)   a*    b*    v*
(5, 100, 200)       2.4   2.4   1467.6
(50, 100, 200)      2.4   2.5   1546.3
(500, 100, 200)     2.5   2.6   2296.7
(50, 1, 200)        2.4   2.3   1377.3
(50, 100, 200)      2.4   2.5   1546.3
(50, 1000, 200)     2.6   2.5   2971.9
(50, 100, 2)        3.2   2.9   203.67
(50, 100, 200)      2.4   2.5   1546.3
(50, 100, 2000)     2.4   2.2   13134

Figure 3. Optimal expected total cost as a function of k for the three inspection strategies m_1, m_2 and m_3.

Table 6. Optimal expected total cost and parameters (a, b) for different values of ε, ξ = 3.

Repair       Inspection
parameter    policy    a*    b*    v*
ε = 0.1      m1        2.3   2.4   1535
             m2        2.3   3.1   1622.7
             m3        2.3   2.1   1560.3
ε = 0.5      m1        2.4   2.5   1546.3
             m2        2.5   2.8   1572.1
             m3        2.4   1     1547.8
ε = 0.9      m1        2.1   2.2   1169.1
             m2        2.1   4.1   1169.1
             m3        2.1   0.9   1169.9

5 SUMMARY

The models derived and investigated in the present paper extend an earlier paper (Barker & Newby 2006) by incorporating catastrophic failure of the system, represented by introducing a second threshold F. The threshold ξ is incorporated in the repair function r through the last exit time from the interval [0, ξ). The repair decision depends on the probability of occurrence of this last exit time before the next inspection. The models proposed hence include both a stopping time (the first hitting time) and a non-stopping time (the last exit time). The probability density function of the first hitting time for a Bessel process with drift not being known explicitly, the expression for the expected total cost was solved numerically (a numerical inversion of the Laplace transform of the first hitting time's density function was required).

The numerical results revealed a strong influence of the threshold value ξ and the parameter k on both the optimal period of inspection and the optimal expected cost per unit time. Letting the parameter ε vary produced changes in the optimal repair threshold only, suggesting that the optimal strategy aims at keeping a relatively constant frequency of repairs.

REFERENCES

Abate, J. and W. Whitt (1995). Numerical inversion of Laplace transforms of probability distributions. ORSA Journal on Computing 7(1), 36–43.
Barker, C.T. and M.J. Newby (2006). Optimal inspection and maintenance to meet prescribed performance standards for a multivariate degradation model. In C. Guedes Soares & E. Zio (Eds.), Safety And Reliability For Managing Risk, Volume 1, pp. 475–487. European Safety and Reliability Association: Taylor & Francis.
Barker, C.T. and M.J. Newby (2007). Optimal non-periodic inspections for a multivariate degradation model. To appear, Reliability Engineering and System Safety, DOI:10.1016/j.ress.2007.03.015.
Grall, A., L. Dieulle, C. Berenguer, and M. Roussignol (2002, June). Continuous-time predictive-maintenance scheduling for a deteriorating system. IEEE Transactions on Reliability 51, 141–150.
Pitman, J.W. and M. Yor (1981). Bessel processes and infinitely divisible laws. In Stochastic Integrals, Lecture Notes in Mathematics, Springer, Berlin.
Revuz, D. and M. Yor (1991). Continuous Martingales and Brownian Motion. Springer Verlag.
Rogers, L.C.G. and J.W. Pitman (1980). Markov functions. Annals of Probability 9, 573–582.
Yin, C. (1999). The joint distribution of the hitting time and place to a sphere or spherical shell for Brownian motion with drift. Statistics and Probability Letters 42, 367–373.


Preventive maintenance planning using prior expert knowledge and multicriteria method PROMETHEE III

F.A. Figueiredo, C.A.V. Cavalcante & A.T. de Almeida
Federal University of Pernambuco, Brazil

ABSTRACT: Preventive maintenance planning is one of the most common and significant problems faced by
the industry. It consists of a set of technical, administrative and management actions to decrease the component
ages in order to improve the system’s availability. There has been a great amount of research in this area
using different methods for finding an optimal maintenance schedule. This paper proposes a decision model,
integrating Bayesian approach with a multicriteria decision method (MCDM) based on PROMETHEE. This
model determines the best solution for the preventive maintenance policy taking both costs and availability as
objectives and considering prior expert knowledge regarding the reliability of the systems. Several building
considerations regarding the model are presented in order to justify the procedure and model proposed. Finally,
a numerical application to illustrate the use of the model was carried out.

1 INTRODUCTION

Maintenance seeks to optimize the use of assets during their life cycle, which means either preserving them or preserving their ability to produce something safely and economically. In recent decades, maintenance management has evolved and, nowadays, it is not just a simple activity, but a complex and important function. Its importance is due to the great number of operational and support costs incurred on systems and equipment. The United States, for example, used to spend approximately 300 billion dollars every year on the maintenance of plant and operations (Levitt, 1997).

Since maintenance planning and development are crucial, their best practice is important for any industry. Maintenance planning can be defined according to three types of maintenance. One type is corrective maintenance, which is performed after a failure occurs. The other types are preventive maintenance and predictive maintenance, which are considered to be proactive strategies, because they are performed before any failure has occurred and act in order to reduce the probability of a failure.

Preventive maintenance (PM) is deemed as actions that occur before the quality or quantity of equipment deteriorates. These actions include repair or replacement of items and are performed to extend the equipment's life, thus maximizing its availability. Predictive maintenance, on the other hand, concerns planned actions that verify the current state of pieces of equipment in order to detect failures and to avoid system breakdown, thus also maximizing the availability of equipment. However, in order to create a PM program, it is necessary to determine how often and when to perform PM activities (Tsai et al., 2001).

In this paper, we apply a multicriteria method to solve the problem of preventive maintenance, taking into consideration costs, downtime and reliability. The following section, Section 2, contains an overview of the PM problem and discusses previous PM scheduling work. Section 3 presents explanations of the PROMETHEE method and describes its procedures. In Section 4, the model applied is presented along with the results and discussions. Finally, the conclusions are summarized and possible future studies are suggested in Section 5.

2 THE PROBLEM

2.1 Preventive maintenance problem

This kind of maintenance is not known to be the dominant practice, because people resist spending money on efforts to forestall failures that might not happen. There are two situations where PM is important: when it reduces the probability or the risk of death, injuries and environmental damage, or when the task costs are smaller than the cost of the consequences (Levitt, 2003).

Of the many reasons to implement a PM policy in an industry, the greatest is the improvement of a system's availability. But that is not the only one. There are many others, such as: increasing automation, minimizing administrative losses due to delays in production,
reduction, minimizing energy consumption etc (Quan maintenance.
et al., 2007). Many advantages can be gained from a In summary, this paper proposes a multi-criteria
well developed PM program, although they depend on model in order to establish the age t of replacement,
the size and type of plant. The greater one’s asset val- taking into account not only the cost, as in classical
ues per square meter are, the greater will be the return models, but also reliability, as well as how the problem
on the PM program (Worsham, 2000). of data is overcome by means of a Bayesian integrated
The PM program will determine what maintenance approach.
activities should be performed on which items of each
piece of equipment, what resources are needed and
how often the activities should be scheduled. It will 2.2 The service context
only be effective if the tasks demanded for the system Mont (2001) has observed that the service sector
are scheduled at regular intervals (Quan et al., 2007). has undergone significant growth with strong reper-
Production must be considered while planning cussions for the economy. This has caused specific
maintenance, because production schedules may be changes in production systems, which increasingly
interrupted by failures or maintenance tasks. Thus depend on service for their processes, and which have
it is necessary to balance how many interventions resulted in the increasing flexibility of the production
will be made to production in order to make as few systems themselves.
interruptions as possible to perform PM and still main- According to Hirschl et al., (2003), there has to
tain a healthy system with few repairs or corrective be a societal change in consumer behavior. Thus, the
maintenance needs (Sortrakul et al., 2005). idea of keeping a particular piece of equipment for
Numerous studies have been conducted concerning the longest time possible is now an idea that may be
the problem of preventive maintenance. Bevilacqua imputed to consumer mentality. On the producer side,
and Braglia (2000) used AHP for the selection of the this means the longer a particular product is used, the
best maintenance strategies for an Italian oil refinery. more service work will be needed over the life-cycle
Wang et al., (2007) extended Bevilacqua and Braglia’s of the product.
studies by applying a fuzzy analytic hierarchy pro- The sustainable potential of a system of goods and
cess to deal with decision makers’ uncertainties as services does not only depend on this holistic vision
to their judgment. Samrout et al., (2005) applied a which embraces diverse criteria to define operational
heuristic method, called ant colony optimization to procedures, but is also based on the structure of new
a system of series-parallel aiming to minimize the kinds of relationships or partnerships between the
cost of preventive maintenance. A few studies tried actors of the process, leading to new convergences of
to solve the preventive maintenance problem by com- economic interests and a potential optimization of the
bining genetic algorithms and multicriteria, such as use of resources. Therefore the notion of a product is
Lapa et al., (2006), who used reliability and costs no longer just the physical result of a productive pro-
as objectives. Chareonsuk et al., (1997) presented a cess. Rather, it is an integrated concept where goods
PROMETHEE II model to determine the optimal inter- and services are mutually dependent; focusing on con-
val. However, in their model, the criteria were limited sumer needs and demands in an integrated way and at
to an interval of values for both criteria, thus excluding the same time increasing profit and diminishing the
the conflict regions. Cavalcante and Almeida (2005, environmental impact by reducing the volume of goods
2007) also applied the PROMETHE method in dif- manufactured (Manzini et al., 2003).
ferent papers. In one of them, a PROMETHEE II Given the importance of service production systems
methodology was applied, but in addition to not con- and their growth potential for participating in the econ-
sidering any limitation for the criteria values, the omy as well as the ever-stronger tendency towards the
lack or unreliability of data was also considered. In formation of chains, the distinctive features of service
the other paper, PROMETHEE III was used, which production systems are more than enough to justify the
permitted more rational planning for the problem of need for a serious study related to planning for their
preventive maintenance and also took into account maintenance.
external uncertainties.
In this paper, we propose an application based on the
early model (Cavalcante & Almeida, 2007), integrat-
3 THE DECISION MODEL
ing a Bayesian approach with a multi-criteria decision
method (MCDM) based on PROMETHEE, but, unlike
3.1 Multicriteria decision aid
the former, we consider some different aspects regard-
ing uncertainties about life distribution parameters. In According to Gomes et al., (2002), several methods for
addition, we higtilight some important particularities approaching complex processes of decision making
of service systems, which require some changes on have been developed over many years. Abstractions,

628
heuristics and deductive reasoning were developed, but only with multi-criteria or multi-objective methods was the decision maker's preference structure most faithfully represented.
Multi-criteria methods have been developed to support and lead the decision maker in evaluating and choosing the best solution. The problems analyzed by these methods can be discrete, with a finite number of alternatives, or continuous, with an infinite set of alternatives. It is very important always to remember that these methods do not indicate the solution of the problem; instead, they support the decision process through alternatives or recommendations for actions (Gomes et al., 2002).
There are two kinds of multi-criteria methods: compensatory and non-compensatory. Compensatory methods represent the American School and aggregate all the criteria into a single utility function, which brings in the notion of compensation between the criteria. Non-compensatory methods do not aggregate the criteria; for each criterion a weight is defined that represents its relative importance in the problem. The best known methods of this type are the ELECTRE and PROMETHEE families, which represent the French School (Almeida et al., 2002).

3.2 The PROMETHEE method

PROMETHEE (Preference Ranking Organization Method for Enrichment Evaluation), conceived by Brans, consists of two phases: the construction of an outranking relation and then the exploitation of the outranking relations (Almeida et al., 2003). These French School methods, which include the PROMETHEE family, like those of the American School, are used in multi-criteria problems of the type:

max {f1(x), f2(x), . . . , fk(x) | x ∈ A},   (1)

where A is a set of decision alternatives and fi(x), i = 1, . . . , k is a set of criteria against which each alternative is to be evaluated. Each criterion has its own unit (Chareonsuk et al., 1997).
The application of the PROMETHEE method consists of three steps: first, the definition of the generalized criteria; then the calculation of the multi-criteria index; followed by the determination and evaluation of an outranking relation (Dulmin et al., 2003).
The preference behavior of the decision maker determines a function Pj(a, b) that assumes values between zero and one, where a and b are alternatives. There are six generalized criteria to choose from when defining the preference function (Brans, 1985).
Depending on the pre-defined preference function chosen, it may be necessary to define certain parameters such as q, p and s. According to Brans (1985), the indifference threshold q is the largest deviation considered negligible by the decision maker, while the preference threshold p is the smallest deviation considered sufficient to generate a full preference. The parameter s is used in the case of the Gaussian criterion and defines the inflection point of the preference function. It is recommended that p and q be defined first, in order to choose the s value within that interval.
The decision maker then assigns to each criterion a weight wj that increases with the importance of the criterion, and the weights are normalized to sum to unity (Brans et al., 2002).
The preference function Pj(a, b), defined on the difference dj(a, b) = fj(a) − fj(b) and assuming values between 0 and 1, evaluates how the preference of the decision maker changes with the difference between the performances of the alternatives on each criterion j (Almeida et al., 2002). Its behavior can be represented as follows:

Pj(a, b) = 0: indifference between a and b,
Pj(a, b) ≈ 0: mild preference for a in relation to b,
Pj(a, b) ≈ 1: strong preference for a in relation to b,   (2)
Pj(a, b) = 1: strict preference for a in relation to b.

Once the generalized criteria are defined for each criterion of the problem, the multi-criteria preference index or aggregate ranking (3) of every pair of alternatives must be calculated (Almeida et al., 2002):

π(a, b) = (1/W) Σ_{j=1..n} wj Pj(a, b),   where W = Σ_{j=1..n} wj   (3)

Finally, the next step is carried out, which involves the determination and evaluation of the outranking relations.
For each alternative a ∈ A, two outranking flows are determined with respect to all the other alternatives x ∈ A.
The leaving flow or positive outranking flow (4) expresses how an alternative a is outranking all the
others, so the higher Φ+(a), the better the alternative (Brans, 1985):

Φ+(a) = (1/(n−1)) Σ_{b∈A} π(a, b)   (4)

The entering flow or negative outranking flow (5) expresses how an alternative a is outranked by all the others, so the lower Φ−(a), the better the alternative (Brans, 1985):

Φ−(a) = (1/(n−1)) Σ_{b∈A} π(b, a)   (5)

The complete ranking, which is often requested by the decision maker, considers the net outranking flow (6). This flow is the balance between the positive and negative outranking flows; thus the higher Φ(a), the better the alternative (Brans, 1985):

Φ(a) = Φ+(a) − Φ−(a)   (6)

After finding the net flow (6), it is possible to apply the PROMETHEE II method, which results in a total pre-order, as follows (Brans, 1985):

a P b (a outranks b) iff Φ(a) > Φ(b),
a I b (a is indifferent to b) iff Φ(a) = Φ(b).   (7)

PROMETHEE III associates an interval [xa, ya] with each action a, as follows (Brans and Mareschal, 2002):

xa = φ̄(a) − α·σa,
ya = φ̄(a) + α·σa,   (8)

where
n is the number of actions,
φ̄(a) = (1/n) Σ_{b∈A} [π(a, b) − π(b, a)] = (1/n) Φ(a),
σa² = (1/n) Σ_{b∈A} [π(a, b) − π(b, a) − φ̄(a)]²,
α > 0.

Then the method defines a complete interval order, as represented below:

a P(III) b (a outranks b) iff xa > yb,
a I(III) b (a is indifferent to b) iff xa ≤ yb and xb ≤ ya.   (9)

For the problem presented, the use of a non-compensatory method was indicated as the most suitable solution; this kind of method favors the alternatives with the best average performance. The PROMETHEE method was chosen because of its many advantages, such as the ease and speed with which decision makers understand the method itself and the concepts and parameters involved, since these have a more meaningful significance for them to relate to, such as physical and economic values (Cavalcante et al., 2005).

3.3 Decision model structure

3.3.1 Preventive maintenance policy in the service context
Concerning the question of failure control, in particular the breakdown of equipment parts: while the paradigm of optimization is well suited to the context of the production of goods, it is not efficient in the context of services, since the quality of a service is intrinsically related to the perception of the consumer, for whom the translation into monetary value has little or no meaning. On the other hand, cost, in a competitive environment, is of great importance when decisions are to be made, and cannot be ignored.
As a result, the choice of a preventive maintenance policy, which in practice refers to the intervals or frequency of component replacement, represents a balance between the risk of having a breakdown, keeping in mind the negative perception that this can cause, and the cost related to the policy adopted. Optimization, then, no longer makes sense, as there is more than one criterion to be observed.
What is needed, then, is a more appropriate approach, for example multi-criteria decision aid, capable of dealing with conflicts among the criteria involved in the problem, as well as of including the decision-maker's preferences in the decision-making process.

3.3.2 Bayesian analyses
Bayesian analyses are used in cases where there are insufficient data to obtain the distribution parameters of the model directly (Cavalcante et al., 2007).
According to Procaccia et al. (1997), feedback from experience can yield objective factual data. These data allow the estimation of failure rates in operation, or of probabilities of starting on demand, when the equipment is initially sound. So the combination of subjective and objective data (the prior probability function and the likelihood provided by feedback from experience, respectively) helps to obtain a richer posterior probability density.
It is extremely important to know that questioning experts can produce much information, since they have different experiences and may even come from different departments or industries, so their sensitivity to a specific problem will not necessarily be the same (Procaccia et al., 1997).
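As a concrete illustration of the outranking calculations of section 3.2 (equations (3)–(6)), the sketch below recomputes the flows using the data of the numerical application presented later (the R(t) and Cm(t) performances of Table 4 and the type V preference parameters of Table 5). This is an independent re-implementation written for this illustration, not the authors' code, and the helper names are our own.

```python
# Sketch of PROMETHEE II flows (equations (3)-(6)) with type V (linear)
# generalized criteria. Data follow Tables 4 and 5 of the numerical
# application; this is an illustrative re-implementation only.

alts = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10"]
R  = [0.986, 0.941, 0.877, 0.800, 0.717, 0.631, 0.548, 0.469, 0.398, 0.334]
Cm = [2.117, 1.244, 1.025, 0.958, 0.945, 0.954, 0.971, 0.992, 1.013, 1.033]

def p_linear(d, q, p):
    """Type V generalized criterion: 0 up to q, linear between q and p, then 1."""
    if d <= q:
        return 0.0
    if d >= p:
        return 1.0
    return (d - q) / (p - q)

def pref_index(i, j):
    """Aggregated preference index pi(i, j) of equation (3); weights sum to 1."""
    # R is to be maximized, Cm minimized (Table 5), hence the sign conventions.
    return (0.55 * p_linear(R[i] - R[j], q=0.02, p=0.05)
            + 0.45 * p_linear(Cm[j] - Cm[i], q=0.08, p=0.5))

n = len(alts)
phi_plus  = [sum(pref_index(i, j) for j in range(n) if j != i) / (n - 1)
             for i in range(n)]                                   # eq. (4)
phi_minus = [sum(pref_index(j, i) for j in range(n) if j != i) / (n - 1)
             for i in range(n)]                                   # eq. (5)
net = [pp - pm for pp, pm in zip(phi_plus, phi_minus)]            # eq. (6)

for a, pp, pm, f in zip(alts, phi_plus, phi_minus, net):
    print(f"{a}: phi+ = {pp:.2f}  phi- = {pm:.2f}  net = {f:.2f}")
```

Run as-is, this reproduces the flows reported in Table 6 to within rounding (e.g. A1: Φ+ = 0.54, Φ− = 0.45) and ranks A3 first by net flow; the PROMETHEE III intervals of equation (8) can then be built on top of these preference indices with α = 0.23.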
Earlier literature often takes the classical approach or the combined classical Bayesian approach, which presumes true parameters such as average costs per unit time and failure rates (Apeland et al., 2003).
In the classical approach it is assumed that true parameters exist, like failure rates related to component types, or means and standard deviations of failure costs. The presumed true values of the parameters are unknown, and estimating them is based only on historical ''hard data''. The focus of a combined classical Bayesian approach is also on the presumed true probabilities and probability distributions; however, the meaning of uncertainty is different. In this case, the analyst group is uncertain of the true values and, based on available hard data, suitable models and engineering judgment, elaborates a distribution that expresses the uncertainty (Apeland & Scarf, 2003).
In the fully subjective approach, the focus, unlike the other approaches, is on observable quantities, like the number of failures or the production loss over a period of time, and probabilities and probability distributions are used to describe the uncertainty related to these quantities (Apeland & Scarf, 2003).
In this paper, the model presented uses the combined classical Bayesian approach, because the data are assumed to follow a distribution whose parameters are unknown, so the uncertainty is related to the distribution's parameters.
Therefore, these parameters will be determined on the evidence of prior expert knowledge based on previous experience. When a model is based on the a priori knowledge of specialists, it means that it is based on the hypothesis that a specialist has a reasonable idea about the subjective probability distribution of a variable θ, which represents the state of nature and can have a random or non-random nature (Cavalcante & Almeida, 2005).
In the Weibull distribution, the lack of data will not allow knowledge of both parameters η and β; instead, we may have uncertainty about one or both of them. In this case the unknown parameter(s) are considered random variable(s) and, after elicitation from a specialist, are treated through prior distribution(s) π(η) and/or π(β) (Cavalcante & Almeida, 2007).

3.3.3 Criteria
A Weibull distribution was assumed for equipment which deteriorates with time. Besides being widely used, it is also flexible and has proven to be a good fit in many cases.
The Weibull distribution has two parameters in this model: β and η, the shape and scale parameters respectively. The probability density function (10) and the reliability function (11) obtained are:

f(t) = (β/η) (t/η)^(β−1) e^(−(t/η)^β)   (10)

R(t) = e^(−(t/η)^β)   (11)

For the preventive maintenance problem tackled, a replacement policy was considered, because it guarantees lower costs in replacements. There are a great number of proposed models, varying in cost, functional performance and importance in the process. For every one of these models several criteria can be identified to describe the preventive maintenance activities.
The replacement policy used is age replacement, which replaces an item at the moment it fails or periodically, whichever comes first. This policy is only effective when the replacement cost before failure is smaller than after failure. Therefore, the objective is to find the interval t that represents the smallest cost per unit time.
The costs involved are ca, the replacement cost after failure, and cb, the replacement cost before failure. To express the expected cost of using the replacement policy (12) as a mathematical expression, it is necessary to associate costs and time, as shown below:

Cm(t) = [ca (1 − R(t)) + cb R(t)] / [∫_0^t x f(x) dx + t R(t)]   (12)

where
R(t) is the reliability at time t;
f(x) is the probability density function of the time to failure.
Since in real situations this variable is not continuous, a set of possible solutions or alternatives is given in the numerical application to find the best solution within the set.

4 NUMERICAL APPLICATION

The criteria have already been described and the uncertainties discussed. The simulation presented in this section was based on data taken from the context of electrical energy devices, and was found to be highly adequate for preventive replacement.
For the replacement policy it was necessary to determine some parameters, as shown in Table 1. After the determination of the parameters, or equipment characteristics, the alternatives for the simulation are defined (see Table 2). The alternatives are not chosen randomly, but according to opportune times that allow preventive maintenance.
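Equations (10)–(12) are straightforward to evaluate numerically. Note that integration by parts gives ∫_0^t x f(x) dx + t R(t) = ∫_0^t R(x) dx, so the denominator of (12) is simply the expected cycle length. The sketch below uses β = 2.8, ca = 10 and cb = 2 from Table 1, but fixes a hypothetical value η = 11 purely for illustration; since the paper treats η as uncertain, this sketch does not reproduce the Bayesian predictive values of Table 4.

```python
import math

# Age-replacement cost rate, equations (10)-(12), for a two-parameter
# Weibull life distribution. beta, ca, cb follow Table 1; eta = 11 is a
# hypothetical fixed value (in the paper eta is treated as uncertain).
BETA, ETA = 2.8, 11.0
CA, CB = 10.0, 2.0  # replacement cost after / before failure

def reliability(t):
    """R(t) = exp(-(t/eta)^beta), equation (11)."""
    return math.exp(-((t / ETA) ** BETA))

def cost_rate(t, steps=2000):
    """Cm(t), equation (12). The denominator int_0^t x f(x)dx + t R(t)
    equals int_0^t R(x)dx (integration by parts), done by trapezoid rule."""
    h = t / steps
    xs = [i * h for i in range(steps + 1)]
    cycle = h * (sum(reliability(x) for x in xs)
                 - 0.5 * (reliability(0.0) + reliability(t)))
    return (CA * (1.0 - reliability(t)) + CB * reliability(t)) / cycle

# Scan candidate replacement ages (in years) for the cheapest policy.
grid = [i / 10 for i in range(5, 301)]
t_best = min(grid, key=cost_rate)
print(f"cheapest replacement age ~ {t_best:.1f} years, "
      f"Cm = {cost_rate(t_best):.3f} per year")
```

For these illustrative values the cost rate is minimized at a replacement age of roughly five to six years; as t grows the policy degenerates to purely corrective replacement and Cm(t) approaches ca divided by the mean time to failure, η·Γ(1 + 1/β).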
Table 1. Input data.

Distribution    Parameter    Value
Weibull         β            2.8
Weibull         η            Unknown
–               ca           10
–               cb           2

Table 2. Set of actions.

Alternative    Time (years)
A1             1
A2             2
A3             3
A4             4
A5             5
A6             6
A7             7
A8             8
A9             9
A10            10

Since the multi-criteria model proposed assumes uncertainty about the parameter η of the Weibull distribution of lifetime, the next step must be the elicitation of the unknown parameter by applying the Bayesian analyses, in order to obtain its a priori distribution. The parameters of the distribution that deals with the uncertainty of η, π(η), are presented in Table 3.

Table 3. Weibull distribution parameters for unknown η.

Parameter    Value
Neta2        11
Beta2        2.34

After that it is possible to calculate the performance of the alternatives for each criterion according to the expressions presented in section 3.3.3. The results obtained are shown in the decision evaluation matrix (Table 4).

Table 4. Decision evaluation matrix.

Alternative    R(t)     Cm(t)
A1             0.986    2.117
A2             0.941    1.244
A3             0.877    1.025
A4             0.800    0.958
A5             0.717    0.945
A6             0.631    0.954
A7             0.548    0.971
A8             0.469    0.992
A9             0.398    1.013
A10            0.334    1.033

The generalized criteria for the pairwise comparison must be determined, and with them the weights for each criterion and their parameters, depending on the choice of generalized criteria. Since the PROMETHEE III method is used, the parameter α must also be defined. The values are presented in Table 5.

Table 5. Preference function and criteria characteristics.

Characteristic         R       Cm
Max/min                Max     Min
Weight                 0.55    0.45
Preference function    V       V
Indifference limit     0.02    0.08
Preference limit       0.05    0.5
α                      0.23

Once all the parameters are defined and the decision evaluation matrix obtained, the aggregation process starts. In this process the alternatives are evaluated in respect of all criteria and the decision-maker's preferences. As explained in section 3.2, the PROMETHEE method will determine the multi-criteria preference index, the input and output flows and finally the net flow (see Table 6).

Table 6. Ranking of alternatives.

Alternative    Time (years)    Φ+      Φ−      Φ
A1             1               0.54    0.45    0.09
A2             2               0.54    0.22    0.31
A3             3               0.49    0.12    0.37
A4             4               0.44    0.18    0.26
A5             5               0.38    0.24    0.14
A6             6               0.32    0.31    0.01
A7             7               0.26    0.37    −0.11
A8             8               0.19    0.43    −0.24
A9             9               0.13    0.49    −0.36
A10            10              0.07    0.55    −0.48

Based on these results we can obtain the partial ranking by PROMETHEE III. The ranking can be visualized in Figure 1.
Thus, the multi-criteria model applied resulted in the ranking of the alternatives as follows, ordered from
the best to worst:

A3 > A2 > A4 > A5 = A1 > A6 > A7 > A8 > A9 > A10

Therefore, A3 is the best alternative, since it is first in the ranking; it indicates that the replacement of the equipment should take place at the interval corresponding to A3, i.e. every 3 years.
Besides, we can highlight that some alternatives were considered indifferent, since their overall performances were judged almost the same. This results from enlarging the notion of indifference, which mirrors more suitably the behavior of the decision maker when analyzing similar alternatives, especially when it is admitted that there are uncertainties about some criteria of the problem.
In order to verify the robustness of some parameters of the model, a sensitivity analysis was carried out. Through this analysis it is possible to obtain more interpretative results, which enhance the decision-maker's understanding of the maintenance problem.
Therefore, variations of up to +/−15% were made on the values of the weights. As a result, in spite of the fact that there were changes in the ranking, the first three alternatives remained the same.
Regarding the parameters called preference limits and indifference limits for both criteria, there were no changes to the best alternative for variations of up to +/−15% on these parameters. The most significant change occurred for a variation of −15% on the preference limit for the Reliability criterion. In this case, one tie was introduced, as we can see:

A3 > A2 = A4 > A5 = A1 > A6 > A7 > A8 > A9 > A10

In summary, even with substantial changes to the parameters that form part of the decision model, the best alternative remains the same and no critical change was observed in the ranking of alternatives.

5 CONCLUSIONS

In this study, an MCDM model is put forward in order to determine the frequency and the periods at which preventive maintenance for a specific item should be performed, in a service maintenance context, in the face of a lack of failure data or external uncertainties.
Based on the results of the numerical application, we can conclude that the decision model structure was used effectively, providing a result that can support the decision as to replacement periods, considering the criteria involved, which can be conflicting with each other in some intervals.

Figure 1. Ranking of alternatives based on the net flow (the net flow Φ, from about −0.6 to 0.4, plotted against alternatives A1–A10).

The multi-criteria method applied, PROMETHEE III, allows not only a better understanding of the parameters, the concepts and the method by the decision makers, but also an amplification of the notion of indifference between the alternatives. In this way, for practical purposes where not merely one criterion is important for the establishment of preventive periods (for example, in the service context), the decision maker can make use of this model, even if some problems with the data have been detected.
Future studies suggested include the application of the model to enrich other problems in the maintenance context, where decision criteria beyond cost are very important for the decision-making process.

ACKNOWLEDGEMENTS

This work has received financial support from CNPq (the Brazilian Research Council).

REFERENCES

Almeida, A.T. de & Costa, A.P.C.S. 2003. Aplicações com Métodos Multicritério de Apoio a Decisão. Recife: Ed. Universitária.
Apeland, S. & Scarf, P.A. 2003. A fully subjective approach to modeling inspection maintenance. European Journal of Operational Research: p. 410–425.
Bevilacqua, M. & Braglia, M. 2000. The analytical hierarchy process applied to maintenance strategy selection. Reliability Engineering & System Safety: p. 71–83.
Brans, J.P. & Vincke, P. 1985. A Preference Ranking Organization Method (The PROMETHEE Method for Multiple Criteria Decision-Making). Management Science Vol. 31, No. 6: p. 647–656.
Brans, J.P. & Mareschal, B. 2002. Promethee-Gaia, une Méthodologie d'Aide à la Décision en Présence de Critères Multiples. Bruxelles: Éditions Ellipses.
Cavalcante, C.A.V. & Almeida, A.T. de. 2005. Modelo multicritério de apoio a decisão para o planejamento de manutenção preventiva utilizando PROMETHEE II em situações de incerteza. Pesquisa Operacional: p. 279–296.
Cavalcante, C.A.V. & Almeida, A.T. de. 2007. A multicriteria decision-aiding model using PROMETHEE III for preventive maintenance planning under uncertain conditions. Journal of Quality in Maintenance Engineering: p. 385–397.
Chareonsuk, C. & Nagarur, N. & Tabucanon, M.T. 1997. A multicriteria approach to the selection of preventive maintenance intervals. International Journal of Production Economics: p. 55–64.
Dulmin, R. & Mininno, V. 2004. Standardized project management may increase development project success. International Journal of Project Management.
Gomes, L.F.A.M. & Gomes, C.F.S. & Almeida, A.T. de. 2002. Tomada de Decisão Gerencial. São Paulo: Editora Atlas.
Hirschl, B. & Konrad, W. & Scholl, G. 2003. New concepts in product use for sustainable consumption. Journal of Cleaner Production: p. 873–881.
Lapa, C.M.F. & Pereira, C.M.N.A. & Barros, M.P. de. 2006. A model for preventive maintenance planning by genetic algorithms based in cost and reliability. Reliability Engineering & System Safety: p. 233–240.
Levitt, J. 1997. The Handbook of Maintenance Management. New York: Industrial Press INC.
Levitt, J. 2003. The complete guide to preventive and predictive maintenance. New York: Industrial Press INC.
Manzini, E. & Vezzoli, C. 2003. A strategic design approach to develop sustainable product service systems: examples taken from the 'environmentally friendly innovation'. Journal of Cleaner Production: p. 851–857.
Mont, O.K. 2002. Clarifying the concept of product-service system. Journal of Cleaner Production: p. 237–245.
Procaccia, H. & Cordier, R. & Muller, S. 1997. Application of Bayesian statistical decision theory for a maintenance optimization problem. Reliability Engineering and System Safety: p. 143–149.
Quan, G. & Greenwood, G. & Liu, D. & Hu, S. 2007. Searching for multiobjective preventive maintenance schedules: Combining preferences with evolutionary algorithms. European Journal of Operational Research: p. 1969–1984.
Samrout, M. & Yalaoui, F. & Châtelet, E. & Chebbo, N. 2005. New methods to minimize the preventive maintenance cost of series-parallel systems using ant colony optimizations. Reliability Engineering & System Safety: p. 346–354.
Sortrakul, N. & Nachtmann, H. & Cassady, C.R. 2005. Genetic algorithms for integrated preventive maintenance planning and production scheduling for a single machine. Computers in Industry: p. 161–168.
Tsai, Y.-T. & Wang, K.-S. & Teng, H.-Y. 2001. Optimizing preventive maintenance for mechanical components using genetic algorithms. Reliability Engineering and System Safety: p. 89–97.
Wang, L. & Chu, J. & Wu, J. 2007. Selection of optimum maintenance strategies based on a fuzzy analytic hierarchy process. International Journal of Production Economics: p. 151–163.
Worsham, W.C. 2000. Is preventive maintenance necessary? Maintenance Resources On-Line Magazine.

Profitability assessment of outsourcing maintenance from the producer (big rotary machine study)

P. Fuchs & J. Zajicek
Technical University of Liberec, Liberec, Czech Republic

ABSTRACT: Most companies now have their maintenance plans, and the method of conducting them, in place. The maintenance plan follows the producer's recommendations or is changed by the maintenance technician on the basis of his experience. The producer's recommendations for the warranty period need not, of course, be the best for the user: they are best for the producer and its warranty. These maintenance strategies may be optimal for commonly available equipment, but big rotary machines are often unique. Expensive maintenance done by the producer could be more profitable in the long term because of the producer's greater experience. This paper concerns the total cost assessment of a big rotary machine (here, a hydrogen compressor used in the refinery industry) in relation to the selected maintenance. Reliability data on MTBF and MTTR, the economic evaluation of the maintenance conducted and, last but not least, the production lost through scheduled shutdowns or equipment failures are all considered, as all of these affect the cost. Companies that own one or more big rotary machines should carry out a study of operational costs and a profitability assessment of outsourcing maintenance from the producer. Such an economic model can also reveal other problems, for example a wrong list of spare parts. The model proposed in this paper could help companies to carry out such studies for their big rotary machines.

1 INTRODUCTION

Effective control of production risks consists of understanding the risk character in operating practice and of finding suitable tools for evaluating and optimizing the risk. One of the most important risks is associated with failures of expensive and unique equipment.
The producer's experience and design knowledge can decrease the failure risk through changes in the mean time between failures and the mean time to repair, or through a shorter delivery period for spares. Better maintenance can also affect the useful lifetime.
This paper is about the profitability evaluation of a maintenance contract with the producer and about finding the profitability boundary values of the new MTBF, MTTR and delivery period. The warranty, warranty period and additional warranty cost are not included, because the paper deals with important equipment whose costs/losses come mainly from production losses. In case of failure during the warranty period, producers pay for the spares and the repair, but the production losses are on the user's side.
This is a practical application of known procedures, and the model can easily be used for other suitable equipment.

2 NOTATIONS

CONx = annual maintenance contract cost [EUR/year]
DP = delivery period of spare parts [day]
LT = useful lifetime [year]
MTBF_o = MTBF; needed parts have to be ordered [year]
MTBF_s = MTBF; needed parts in store [year]
MTTR_o = MTTR; needed parts have to be ordered (includes delivery period) [day]
MTTR_s = MTTR; needed parts in store [day]
PL = production losses [EUR/h]
PUR = purchase cost [EUR]

3 APPLICATION FIELD OF THIS STUDY

The equipment researched for this study, which measures the profitability of outsourcing maintenance, is a compressor. Other suitable equipment types would be big pumps, turbines, etc. For this application it is necessary to highlight the following presumptions.

Equipment is unique
– The user does not have the same or similar equipment which can perform the same function.
– The delivery period of spare parts for repairs and scheduled maintenance is weeks or months.
– The equipment or its spare parts are expensive.

Equipment performs an important function
– Equipment downtime causes a stoppage or reduction of production. It generates large production losses.
Maintenance significantly affects the equipment availability
– Preventive maintenance tasks change the reliability, availability, etc.
– Maintenance tasks are difficult because of the equipment structure.
– It is possible to effectively apply condition monitoring.

4 REQUIRED INPUTS

Ideally, all possible costs should be inputs to the model. In practice it is more useful to focus only on the main cost items. It may not be possible to evaluate all values exactly; insufficient statistical data (about failures, maintenance tasks and costs) have to be replaced by estimations.
The inaccuracy of some inputs with large costs can be so important that these inaccuracies are larger than some of the smaller input costs (e.g. administration, land rent). It is recommended that such small cost items be ignored.

5 BIG ROTARY MACHINE STUDY - COMPRESSOR

A compressor in a refinery was chosen as the example for this study. Its function is compressed hydrogen production. A great deal of the company's production depends on this equipment, because the compressed hydrogen is necessary for the technological process.
The user and the producer would not disclose the input data because of private data protection; the data have therefore been changed, but this does not affect the method.
Equipment basic information:
– input power: 3 MW
– purchase cost (including installation and liquidation): 3 million EUR
– condition monitoring is necessary (vibrodiagnostics, tribology)
– recommended period of inspections and overhaul: 4 years
– recommended period of general overhaul: 12 years
– supposed useful lifetime: 30 years
Remark: The recommended periods of inspections and overhaul follow from the scheduled outages of the refinery.

5.1 Current cost associated with the equipment

Cost associated with the equipment can be split into purchase cost, installation and liquidation cost, operating cost, purchase of spare parts, storage and maintenance cost of spare parts, production losses, administration and labour cost.
Estimations of these costs are:
– on the order of 10^3 EUR/year: administration, installation, liquidation, storage and maintenance of spare parts
– on the order of 10^4 EUR/year: labour cost, purchase cost of spare parts, condition monitoring
– on the order of 10^5 EUR/year: purchase cost of equipment
– on the order of 10^5 to 10^6 EUR/year: production losses, operating cost

Using a more exact approach, the values used in the study are given in Table 1.

Table 1. Current cost.

Cost                       [EUR]        [year]    [EUR/year]
Purchase cost              3 000 000    30        100 000
Spare parts cost
– bearings                 50 000       4         12 500
– rotary                   30 000       12        2 500
– seals system             30 000       12        2 500
– seals                    6 000        4         1 500
– total                    116 000                19 000
Spare parts storage
and maintenance                                   1 500
Labour cost                                       10 000
Monitoring
– vibrodiagnostics                                3 000
– tribology                                       600
– total                                           3 600
Operating cost (energy)                           1 000 000

Production losses per year are not in the table above. To evaluate this cost it is necessary to have some reliability quantities (MTTR & MTBF) and the production losses per hour during downtime. Production losses per hour are specified in Table 2.

Table 2. Production losses.

Production losses         PL [EUR/hour]
Tasks during outage       0
Long-term downtime        15 000

5.2 Reliability quantities

This study is economically oriented, therefore only the most likely events are sufficient. The reliability quantities are expert estimations because of the low amount of statistical data about failures, repairs and preventive maintenance tasks. Obtaining the required statistical data is often a problem for reliable and unique
equipments; therefore the estimations are the best way forward.
Undesirable events, which stop this equipment and hence also stop the production process, are necessary to investigate or eradicate. These events are mostly bearing failure or sealing system failure. It is supposed that the mean time to repair (MTTR) is 10 days if the needed parts are in storage. If the needed parts are not in storage, it is necessary to add the delivery period. The delivery period is approximately 10 months for this equipment type (including ordering, the administrative delay, etc.). The mean times between these events (MTBF) and the MTTR are given in table 3.

Table 3. MTBF, MTTR.

Reliability quantities    MTBF [year]   MTTR [day]
Parts in storage          15            10
Parts not in storage      50            10 + 300

MTBF, when the parts are not in storage, includes the possibility that the machine crashes promptly after a repair, so that new parts are not yet available.

5.3 Producer services

The producer offers a wide range of services and is able to ensure full outsourcing of maintenance. The offered services are summarized as:

1. Training course
2. Condition monitoring
   – online vibrodiagnostics
   – tribology
   – technology parameters monitoring
   – annual reports
3. Delivery, storage and maintenance of spare parts
4. Maintenance tasks scheduling and conducting
   – according to machine condition
   – periodic inspections
   – periodic overhauls and general overhauls
5. 24h technical support

5.4 Outsourcing maintenance varieties

The offered services for outsourcing can be summarized into two packages (variety 2 & variety 3). The current method of maintenance is described in variety 1.

Variety 1
The first possibility is to continue with the current maintenance strategy without using the standard service offer of the producer (section 5.3). The other varieties will be compared with this variety 1.

Variety 2
The maintenance contract includes these points:

– Training course
– Condition monitoring
– Delivery, storage and maintenance of spare parts
– Technical support

The annual contract price is 60 000 EUR. The estimated benefits for the user are an MTBF and useful lifetime increase to 105% and a repair time decrease to 95%. Due to the existence of the contract, ordering and administration will not cause delays, so the delivery period could be about 8 months.

Variety 3
The maintenance contract includes these points:

– Training course
– Condition monitoring
– Delivery, storage and maintenance of spare parts
– Technical support
– Inspection and overhaul conducting (every 4 years)
– General overhaul conducting (every 12 years)

The annual contract price is 75 000 EUR. The estimated benefits for the user are an MTBF and useful lifetime increase to 110% and a repair time decrease to 90%. Due to the existence of the contract, ordering and administration will not cause delays, so the delivery period could be about 8 months.
The benefits of the varieties are summarized in table 4.

5.5 Varieties evaluation – effectiveness index

The varieties differ in annual purchase cost (including installation and liquidation), monitoring cost and production losses.
Purchase cost is the same, but the annual purchase cost depends on the useful lifetime (equation 1):

Annual_purchase_cost = Purchase_cost / Useful_lifetime   [EUR/year]   (1)

Condition monitoring cost appears only in variety 1; the other varieties include it in the contract price.
Production losses are divided into three possibilities according to the maintenance tasks. Tasks conducted during the scheduled stop period don't generate production losses because the whole plant is in outage. The second (third) possibility enables maintenance tasks during common

Table 4. Benefits of varieties.

Variety   Lifetime extension [%]   MTBF extension [%]   Time to repair reduction [%]   Delivery period [month]
1         0                        0                    0                              10
2         5                        5                    5                              8
3         10                       10                   10                             8

Variety   Lifetime [year]   MTBF SP in storage [year]   MTBF SP not in storage [year]   Time to repair [day]   Delivery time [month]
1         30                15                          50                              10                     10
2         31.5              15.75                       52.5                            9.5                    8
3         33                16.5                        55                              9                      8

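The lower half of Table 4 follows from applying the percentage benefits in the upper half to the variety 1 baseline. A minimal sketch of that arithmetic (function and variable names are illustrative, not from the paper):

```python
# Derived reliability quantities for each variety (lower part of Table 4),
# obtained by applying the percentage benefits to the variety 1 baseline.
BASE = {"lifetime": 30.0, "mtbf_stored": 15.0, "mtbf_not_stored": 50.0, "mttr": 10.0}

def variety_quantities(ext_pct: float, mttr_red_pct: float) -> dict:
    """Scale the baseline: lifetime/MTBF grow by ext_pct %, repair time shrinks by mttr_red_pct %."""
    k_up = 1.0 + ext_pct / 100.0
    k_dn = 1.0 - mttr_red_pct / 100.0
    return {
        "lifetime": BASE["lifetime"] * k_up,
        "mtbf_stored": BASE["mtbf_stored"] * k_up,
        "mtbf_not_stored": BASE["mtbf_not_stored"] * k_up,
        "mttr": BASE["mttr"] * k_dn,
    }

v2 = variety_quantities(5, 5)    # variety 2: +5 % lifetime/MTBF, -5 % repair time
v3 = variety_quantities(10, 10)  # variety 3: +10 % / -10 %
```

This reproduces, for example, the 31.5-year lifetime and 15.75-year MTBF of variety 2.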
operational time when the spares are (or are not) in storage. These possibilities have different MTBF and MTTR for each variety, and that is why the losses differ. Total annual production losses are evaluated according to formula (2):

Annual_PL = (MTTR_s / MTBF_s + MTTR_o / MTBF_o) × 24 × PL   [EUR/year]   (2)

Concrete values of direct cost and risks for each variety are given in table 5.
At first sight the profitability of making the contract is evident from the savings. The average annual savings are generated mostly by a decrease of the catastrophic failure risk. The catastrophic failure risk mainly comes from the spare parts delivery period. The producer was not able to provide spare parts prices dependent on the delivery period, although the user was willing to pay a multiple of the standard prices. The solution is hence to make the contract and thereby eliminate the administrative delay.
Another indicator (besides savings) could be the effectiveness index of the contract. The index indicates the valuation of the investment. The formula for this index follows:

IEx = (TC1 − TCx) / CONx   (3)

where IEx = effectiveness index of variety x; TC1 = total cost of variety 1 (current strategy); TCx = total cost of variety x without the annual contract price; CONx = annual contract price of variety x.
The investment is profitable if the index is greater than 1. The values after evaluation are IE2 = 9.2 & IE3 = 8.8. Both of the possibilities are much better than the current state. Determining the best variety is difficult, as variety 2 has the better effectiveness index while variety 3 generates more savings. This is caused by the non-linear dependence between the benefits and the contract costs.

5.6 Varieties evaluation – profitability boundary

Another approach is finding boundary values of some parameters which can be modified by maintenance. The parameters are the spare parts delivery period, the repair time, the mean time between failures and the useful lifetime. The contract is profitable if at least one of these watched parameters achieves its boundary value. Evaluation of these indicators corresponds to formulas (5)–(8). All formulas compare the contract price with the difference between the current cost and the cost after one parameter change, with the other parameters unchanged. The general formula for all parameters is:

CONx = risk(DP) + risk(MTBFs) + risk(MTTR) + annual_cost(LT)   [EUR/year]   (4)

C_DP = boundary value of delivery period
C_RT = boundary value of repair time
C_BF = boundary value of MTBF
C_LT = boundary value of useful lifetime

CONx = (DP − C_DP) × 24 × PL / MTBF_o   ⇒   C_DP = DP − CONx × MTBF_o / (24 × PL)   [day]   (5)

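The Table 5 figures can be reproduced from equations (1)–(3). A sketch under one labelled assumption: the production-loss rate PL comes from Table 2, which is not reproduced in this excerpt, so PL = 15 000 EUR/h is assumed here because it recovers the published numbers; delivery periods of 10 and 8 months are taken as 300 and 240 days (30-day months, as implied by the text):

```python
# Reproducing the variety comparison of Table 5 from formulas (1)-(3).
# PL = 15 000 EUR/h is an ASSUMED production-loss rate (Table 2 is not shown
# in this excerpt); fixed annual costs are taken from Table 5 itself.
PL = 15_000.0          # production losses [EUR/h] during downtime (assumption)
PUR = 3_000_000.0      # purchase cost incl. installation and liquidation [EUR]

# (lifetime [y], MTBF_s [y], MTTR_s [d], MTBF_o [y], MTTR_o [d], contract [EUR/y])
varieties = {
    1: (30.0, 15.0, 10.0, 50.0, 10.0 + 300.0, 0.0),
    2: (31.5, 15.75, 9.5, 52.5, 9.5 + 240.0, 60_000.0),
    3: (33.0, 16.5, 9.0, 55.0, 9.0 + 240.0, 75_000.0),
}

def annual_cost(v):
    lt, mtbf_s, mttr_s, mtbf_o, mttr_o, _ = varieties[v]
    purchase = PUR / lt                              # equation (1)
    loss_s = mttr_s / mtbf_s * 24 * PL               # equation (2), spares in storage
    loss_o = mttr_o / mtbf_o * 24 * PL               # equation (2), spares not in storage
    fixed = 19_000 + 1_500 + 10_000 + 1_000_000      # spare parts, storage, labour, operating
    monitoring = 3_600 if v == 1 else 0              # otherwise included in the contract
    return purchase + fixed + monitoring + loss_s + loss_o

tc1 = annual_cost(1)
ie = {v: (tc1 - annual_cost(v)) / varieties[v][5] for v in (2, 3)}       # equation (3)
savings = {v: tc1 - annual_cost(v) - varieties[v][5] for v in (2, 3)}
```

With these inputs the totals (3 606 100 / 3 053 738 / 2 947 591 EUR/y), the savings and the indexes IE2 = 9.2, IE3 = 8.8 all match the paper's values.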
CONx = (1 − C_RT) × MTTR_s × 24 × PL / MTBF_s + [MTTR_o − ((MTTR_o − DP) × C_RT + DP)] × 24 × PL / MTBF_o
  ⇒   C_RT = 1 − CONx × MTBF_o × MTBF_s / [(MTBF_o × MTTR_s + MTBF_s × MTTR_o − MTBF_s × DP) × 24 × PL]   (6)

CONx = MTTR_s × 24 × PL / MTBF_s − MTTR_s × 24 × PL / (MTBF_s × C_BF) + MTTR_o × 24 × PL / MTBF_o − MTTR_o × 24 × PL / (MTBF_o × C_BF)
  ⇒   C_BF = (MTTR_s / MTBF_s + MTTR_o / MTBF_o) × 24 × PL / [(MTTR_s / MTBF_s + MTTR_o / MTBF_o) × 24 × PL − CONx]   (7)

CONx = PUR / LT − PUR / (LT × C_LT)   ⇒   C_LT = PUR / (PUR − LT × CONx)   (8)

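Formulas (5)–(8) can be checked numerically against Table 6, using the study's data (MTTR_s = 10 d, MTBF_s = 15 y, MTTR_o = 310 d, MTBF_o = 50 y, DP = 300 d, PUR = 3 000 000 EUR, LT = 30 y). The production-loss rate PL = 15 000 EUR/h is an assumption consistent with Table 5; variable names are illustrative:

```python
# Boundary (break-even) values of formulas (5)-(8) for a given contract price.
MTTR_S, MTBF_S = 10.0, 15.0     # repair time [d] / MTBF [y], spares in storage
MTTR_O, MTBF_O = 310.0, 50.0    # repair time [d] / MTBF [y], spares not in storage
DP, PUR, LT, PL = 300.0, 3_000_000.0, 30.0, 15_000.0  # PL is an assumption

def boundaries(con: float) -> dict:
    """Boundary values for an annual contract price `con` [EUR/year]."""
    c_dp = DP - con * MTBF_O / (24 * PL)                              # (5) [day]
    c_rt = 1 - con * MTBF_O * MTBF_S / (
        (MTBF_O * MTTR_S + MTBF_S * MTTR_O - MTBF_S * DP) * 24 * PL)  # (6) [-]
    a = (MTTR_S / MTBF_S + MTTR_O / MTBF_O) * 24 * PL
    c_bf = a / (a - con)                                              # (7) [-]
    c_lt = PUR / (PUR - LT * con)                                     # (8) [-]
    return {"C_DP": c_dp, "C_RT": c_rt, "C_BF": c_bf, "C_LT": c_lt}

b2 = boundaries(60_000)   # variety 2 contract
b3 = boundaries(75_000)   # variety 3 contract
```

Rounded, these give exactly the Table 6 values: 292/290 days, 81/76 %, 102.5/103.1 % and 250/400 %.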
After substitution of the real values into formulas (5)–(8), the boundary values are as given in table 6.

Table 5. Annual cost of each variety.

                                 Variety 1   Variety 2   Variety 3
                                 [EUR/y.]    [EUR/y.]    [EUR/y.]
Purchase cost, installation,
  liquidation                    100 000     95 238      90 909
Spare parts cost                 19 000      19 000      19 000
Storage and maintenance
  of spare parts                 1 500       1 500       1 500
Labour cost                      10 000      10 000      10 000
Monitoring                       3 600       0           0
Operating cost                   1 000 000   1 000 000   1 000 000
Production losses
  – tasks during stop period     0           0           0
  – failure; SP in storage       240 000     217 143     196 364
  – failure; SP not in storage   2 232 000   1 710 857   1 629 818
Total                            3 606 100   3 053 738   2 947 591
Contract                         0           60 000      75 000
Savings                          0           492 362     583 509

Table 6. Profitability boundary.

             Variety 2   Variety 3
C_DP [day]   292         290
C_RT [%]     81          76
C_BF [%]     102.5       103.1
C_LT [%]     250         400

5.7 Optimal list of spare parts

The number of spare parts has an indispensable effect on the equipment total cost. Spare parts generate additional cost from their purchase, storage and maintenance, but they can sharply change the ratio between the MTBF values when the parts are or are not in storage. This ultimately affects the annual risk.
For every reasonable option of the spare parts list, a cost analysis including risks should be done. The lowest total cost indicates the best spare parts list and, if needed, whether to make the maintenance contract.

5.8 Summary of results

The presented economical model compares two varieties of outsourcing maintenance with the current state. The effectiveness indexes confirm the profitability of both varieties. Since both indexes are very high (they value the investment at nearly 10 times), the recommended possibility is the variety with more savings – variety 3.
In case of input uncertainty it is suitable to evaluate the boundary values of profitability and then consider whether at least one of these parameters can be achieved. Concretely, for the contract of variety 3 it is sufficient, for example, to decrease the delivery period from 300 days to 290 days or to increase the MTBF by 3%.

6 CONCLUSION

Most companies have their maintenance plans and methods of conducting them in place now. The maintenance plan is in agreement with the producer's recommendations or is changed by the maintenance technician on the basis of his experience. Maintenance tasks are conducted by company employees or by contractors for special branches (electricians, machine staff, etc.).
These maintenance strategies could be optimal for commonly available equipment, but big rotary machines (especially compressors and turbines), which are made individually based on clients' requirements,

are often unique. Expensive maintenance done by the producer could be more profitable in the long term because of the producer's greater experience.
The presented economical model targets the risks related to the operational and maintenance costs of unique equipment and can be used as a guide for risk optimization.

ACKNOWLEDGMENT

This research was supported by the Ministry of Education, Youth and Sports of the Czech Republic, project No. 1M06047 – Research Center for Quality and Reliability of Production.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Simulated annealing method for the selective maintenance optimization of multi-mission series-parallel systems

A. Khatab & D. Ait-Kadi


Mechanical Engineering Department, Faculty of Science and Engineering, Interuniversity Research Center
on Entreprise Networks, Logistics and Transportation (CIRRELT), Université Laval, Quebec, Canada

A. Artiba
Institut Supérieur de Mécanique de Paris (Supmeca), France

ABSTRACT: This paper addresses the selective maintenance optimization problem for a multi-mission series-
parallel system. Such a system experiences several missions with breaks between successive missions. The
reliability of the system is given by the conditional probability that the system survives the next mission given
that it has survived the previous mission. The reliability of each system component is characterized by its hazard
function. To maintain the reliability of the system, preventive maintenance actions are performed during breaks.
Each preventive maintenance action is characterized by its age reduction coefficient. The selective maintenance
problem consists in finding an optimal sequence of maintenance actions, to be performed within breaks, so as
to minimize the total maintenance cost while providing a given required system reliability level for each mission.
To solve such a combinatorial optimization problem, an optimization method is proposed on the basis of the
simulated annealing algorithm. In the literature, this method has been shown to be suitable for solving such a
problem. An application example with numerical results is given for illustration.

1 INTRODUCTION

For repairable systems, several mathematical models have been developed in the literature for the optimal design of maintenance policies (for a survey, see for example (Cho and Parlar 1991; Dekker 1996)). Nevertheless, most of these models do not take into account the limitations on the resources required to perform maintenance actions. This drawback has motivated the development of a relatively new concept called selective maintenance. The objective of selective maintenance consists in finding, among all available maintenance actions, the set of appropriate actions to be performed, under some operation constraints, so as to maximize the system reliability or to minimize the total maintenance cost or the total maintenance time. Selective maintenance, as a maintenance policy, is relevant to systems that are required to operate a sequence of missions such that, at the end of a mission, the system is put down for a finite length of time to provide an opportunity for equipment maintenance. Such systems may include for example manufacturing systems, computer systems and transportation systems.
Dealing with selective maintenance, the first work is reported in (Rice, Cassady, and Nachlas 1998). In this work, Rice et al. consider a series-parallel system which operates a series of missions. Replacement of failed components is performed during breaks between successive missions. Since breaks are of finite length, it may be difficult to replace all failed components. In (Rice, Cassady, and Nachlas 1998), to determine the optimal replacement to be performed, a maintenance optimization problem is derived to maximize the system reliability for the next mission. To solve this problem, a heuristic-based method is proposed to overcome the limitations of the total enumeration method. The selective maintenance optimization problem has also been addressed in (Cassady, Pohl, and Murdock 2001), (Cassady, Murdock, and Pohl 2001), (Chen, Meng, and Zuo 1999), (Khatab, Ait-Kadi, and Nourelfath 2007b) and (Khatab, Ait-Kadi, and Nourelfath 2007a).
Selective maintenance optimization problems handled by the above mentioned works are limited to the treatment of only one next mission. However, since the system is required to perform a sequence of missions, while requiring short development schedules and very

high reliability, it becomes increasingly important to develop appropriate approaches to manage selective maintenance decisions when the planning horizon considers more than a single mission. In (Maillart, Cassady, Rainwater, and Schneider 2005), the authors consider a series-parallel system where each subsystem is composed of identical components whose time to failure is exponentially distributed. The system is assumed to operate a sequence of identical missions such that breaks between two successive missions are of equal durations. At the end of a given mission, the only available maintenance action is the replacement of failed components. At a given time, the average number of successful missions remaining in the planning horizon is defined. To maximize this number, a stochastic programming model is then proposed. Numerical experiments are conducted to compare the results obtained for three maintenance optimization problems. Nevertheless, the approach proposed in (Maillart, Cassady, Rainwater, and Schneider 2005) merely relies on a series-parallel system with few subsystems, each composed of components with identical constant failure rates. Furthermore, replacement of failed components is the only available maintenance action, missions are of identical duration and breaks are also of identical durations.
This paper solves the selective maintenance optimization problem proposed in (Khatab, Ait-Kadi, and Artiba 2007). In (Khatab, Ait-Kadi, and Artiba 2007) the authors consider a system composed of a series of subsystems, each composed of parallel, and possibly different, components the lifetimes of which are generally distributed. The system operates a sequence of missions with possibly different durations such that nonidentical breaks are allowed between successive missions. During a given mission, a component that fails undergoes minimal repair, while at the end of a mission several preventive maintenance actions are available. Each preventive maintenance action is characterized by its ability to affect the effective age of system components. In (Khatab, Ait-Kadi, and Artiba 2007), the proposed selective maintenance optimization problem consists in finding an optimal sequence of preventive maintenance actions the cost of which minimizes the total maintenance cost while providing the desired system reliability level for each mission. In the present work, to solve this problem, we present an optimization method based on the simulated annealing algorithm (Kirkpatrick, Gelatt-Jr., and Vecchi 1983). The advantage of such an algorithm over other methods is its ability to avoid becoming trapped at local minima during the search process.
The remainder of this paper is organized as follows. The next section gives some notations and definitions related to the studied multi-mission series-parallel system. Section 3 addresses the proposed selective maintenance optimization model and the problem formulation. The optimization method is presented in Section 4, and an application example with numerical results is provided in Section 5. The conclusion is drawn in Section 6.

2 MULTI-MISSION SERIES-PARALLEL SYSTEM DESCRIPTION

Consider a series-parallel system S composed of n subsystems Si (i = 1, . . . , n). Each subsystem Si is composed of Ni independent, and possibly non-identical, components Cij (j = 1, . . . , Ni). Components, subsystems and the system are assumed to experience only two possible states, namely a functioning state and a failure state.
Assume that the system is initially new and required to perform a sequence of M missions, each with known duration U(m), m = 1, . . . , M. Between two successive missions there are breaks of known length of time D(m, m + 1) for m = 1, . . . , M − 1. Namely, the system operates according to two successive states: Up state −→ Down state −→ Up state . . . . In the Up state the system is operating, while in the Down state the system is not operating but is available for any maintenance actions. Such a scenario may arise for systems that operate for some time each day and are then put into the down state for the rest of the day.
Let Aij(m) and Bij(m) be the ages of component Cij, respectively, at the beginning and at the end of a given mission m (m = 1, . . . , M). Clearly, one may write Bij(m) as:

Bij(m) = Aij(m) + U(m).   (1)

If Xij denotes the lifetime of component Cij, then the reliability Rij(m) of component Cij to survive mission m is given such that:

Rij(m) = Pr(Xij > Bij(m) | Xij > Aij(m))
       = Pr(Xij > Bij(m)) / Pr(Xij > Aij(m))
       = R(Bij(m)) / R(Aij(m)),   (2)

where R is the survival time distribution function of the random variable Xij.
If component Cij is characterized by its corresponding hazard function hij(t), then the conditional reliability Rij(m) can be written as:

Rij(m) = exp( ∫0^Aij(m) hij(t)dt − ∫0^Bij(m) hij(t)dt )
       = exp( Hij(Aij(m)) − Hij(Bij(m)) ),   (3)

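Equations (1)–(3) can be sketched for a single component with the Weibull cumulated hazard H(t) = (βt)^γ used later in the application example; the numeric inputs below are illustrative only:

```python
import math

# Conditional mission reliability of one component, equations (1)-(3),
# with the Weibull cumulated hazard H(t) = (beta*t)**gamma of Section 5.
def cumulated_hazard(t: float, beta: float, gamma: float) -> float:
    return (beta * t) ** gamma

def mission_reliability(age_start: float, duration: float,
                        beta: float, gamma: float) -> float:
    """R_ij(m) = exp(H(A) - H(B)) with B = A + U, equation (3)."""
    age_end = age_start + duration                       # equation (1)
    return math.exp(cumulated_hazard(age_start, beta, gamma)
                    - cumulated_hazard(age_end, beta, gamma))

# A new component (A = 0) reduces to R = exp(-H(U)):
r_new = mission_reliability(0.0, 84.0, beta=0.006, gamma=1.36)
# With gamma > 1 the hazard increases, so an aged component is less reliable:
r_old = mission_reliability(200.0, 84.0, beta=0.006, gamma=1.36)
```

The age reduction of Section 3 acts on exactly this quantity, by shrinking the age at which the next mission starts.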
where Hij(t) = ∫0^t hij(x)dx is the cumulated hazard function of component Cij.
From the above equation, it follows that the reliabilities of subsystem Si and of the system S, respectively denoted by Ri(m) and R(m), are given as:

Ri(m) = 1 − ∏_{j=1..Ni} (1 − Rij(m)),   and   (4)

R(m) = ∏_{i=1..n} Ri(m).   (5)

3 SELECTIVE MAINTENANCE MODEL AND PROBLEM FORMULATION

In this paper two types of maintenance are considered, namely corrective maintenance (CM) and preventive maintenance (PM). CM by means of minimal repair is carried out upon component failures during a given mission, while PM is a planned activity conducted at the end of missions (i.e. within breaks) to improve the overall system mission reliability. It is assumed that component failure is operation dependent and that the time in which a given component undergoes minimal repair is negligible compared to the mission duration.
Each component Cij of the system is characterized by its hazard function hij(t) and its minimal repair cost cmrij. The preventive maintenance model is given on the basis of the age reduction concept initially introduced by Nakagawa (Nakagawa 1988). According to this concept, the age of a given component is reduced when a PM action is performed on this component. In this paper, the vector VPM = [a1, . . . , ap, . . . , aP] represents the P PM actions available for a given multi-mission system. To each PM action ap (p = 1, . . . , P) are assigned the cost cpm(ap) and the time duration dpm(ap) of its implementation, the age reduction coefficient α(ap) ∈ [0, 1] and the set Comp(ap) of components that may undergo PM action ap. Regarding the values taken by a given age reduction coefficient α(ap), two particular cases may be distinguished. The first case corresponds to α(ap) = 1, which means that the PM action ap has no effect on the component age (the component status becomes as bad as old), while the second case is α(ap) = 0 and corresponds to the case where the component age is reset to the null value (i.e. the component status becomes as good as new).
The selective maintenance model attempts to specify which PM action should be performed, on which component, and at the end of which mission it has to be performed. To construct such a model, the following decision variable is introduced:

ap(Cij, m) = { 1 if Cij undergoes PM action ap at the end of mission m,
             { 0 otherwise.
(m = 1, . . . , M − 1)   (6)

In this paper, the selective maintenance problem consists in finding an optimal sequence of maintenance actions the cost of which minimizes the total maintenance cost while providing the desired system reliability level for each mission. The total maintenance cost is composed of the minimal repair cost CMRij(m) induced by the repair of each component Cij during each mission m, and the preventive maintenance cost CPMij(m) of each component Cij that undergoes preventive maintenance at the end of mission m.
The cost induced by minimal repairs is a function of the components' failure rates. Following the work of Boland (Boland 1982), for a given component Cij, the expected minimal repair cost in an interval [0, t] is:

cmrij ∫0^t hij(x)dx.   (7)

According to the above equation, the minimal repair cost CMRij(m) induced by component Cij during mission m is such that:

CMRij(m) = cmrij ∫Aij(m)^Bij(m) hij(x)dx,   (8)

where Aij(m) and Bij(m) represent the ages of component Cij, respectively, at the beginning and at the end of a given mission m (m = 1, . . . , M), and Aij(1) = 0 by definition. If component Cij undergoes preventive maintenance action ap (p = 1, . . . , P) at the end of mission m, then the value of the component age Bij(m) is reduced by the age reduction coefficient α(ap). In this case, the minimal repair cost CMRij(m) assigned to Cij becomes:

CMRij(m) = cmrij ∫Aij(m)^g(α(ap))×Bij(m) hij(x)dx,   (9)

where the function g is related to the value taken by the decision variable ap(Cij, m) and is defined such that:

g(α(ap)) = { α(ap) if ap(Cij, m) = 1,
           { 1 otherwise.   (10)

According to the above equation, the total minimal repair cost CMRij assigned to Cij, which undergoes

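Equations (8)–(10) admit a closed form for the Weibull case, since the integral of the hazard is a difference of cumulated hazards. A sketch (parameter values are taken from Tables 1–2 for illustration; this is not the authors' implementation):

```python
# Minimal repair cost of one component over one mission, equations (8)-(10),
# for a Weibull cumulated hazard H(t) = (beta*t)**gamma.
def H(t, beta, gamma):
    return (beta * t) ** gamma

def minimal_repair_cost(cmr, a, b, beta, gamma, alpha=None):
    """cmr * [H(upper) - H(a)], where upper = alpha*b if a PM action with age
    reduction coefficient alpha is applied at the end of the mission (eq. 10)."""
    g = alpha if alpha is not None else 1.0   # g(alpha) per equation (10)
    return cmr * (H(g * b, beta, gamma) - H(a, beta, gamma))

beta, gamma, cmr = 0.006, 1.36, 1.0           # component C11 (Table 1)
no_pm = minimal_repair_cost(cmr, 0.0, 84.0, beta, gamma)
with_pm = minimal_repair_cost(cmr, 0.0, 84.0, beta, gamma, alpha=0.51)  # PM action 1
```

Reducing the effective age (alpha < 1) lowers the expected minimal repair charge for the mission, which is what the optimization trades against the PM cost cpm(ap).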
preventive maintenance actions at the end of missions 1, . . . , M − 1, is given such that:

CMRij = cmrij ∫Aij(M)^Bij(M) hij(x)dx + Σ_{p=1..P} Σ_{m=1..M−1} cmrij ∫Aij(m)^g(α(ap))×Bij(m) hij(x)dx.   (11)

By using the components' accumulated hazard rates, Equation (11) may be written as:

CMRij = cmrij ( ΔHij(M) + Σ_{p=1..P} Σ_{m=1..M−1} ΔHij(m, p) ),   (12)

where ΔHij(M) = Hij(Bij(M)) − Hij(Aij(M)) and ΔHij(m, p) = Hij(g(α(ap)) × Bij(m)) − Hij(Aij(m)).
From Equation (12), it follows that the total cost CMR of minimal repair, induced by all components during all missions, is given by:

CMR = Σ_{i=1..n} Σ_{j=1..Ni} CMRij.   (13)

The total preventive maintenance cost CPMij assigned to component Cij, which undergoes preventive maintenance actions at the end of missions, is given by:

CPMij = Σ_{p=1..P} Σ_{m=1..M−1} cpm(ap) × ap(Cij, m).   (14)

It follows, from the above equation, that the total preventive maintenance cost CPM induced by all system components is:

CPM = Σ_{i=1..n} Σ_{j=1..Ni} CPMij.   (15)

Finally, the total maintenance cost Ctotal to be minimized is given from Equations (13) and (15) such that:

Ctotal = CMR + CPM.   (16)

To complete the selective maintenance optimization problem, note that, due to the limited time (break) between missions, it may not be possible for all preventive maintenance actions to be performed at the end of a given mission. Therefore, the time between missions should be taken into account as an operation constraint. The total duration DPM(m) spent on preventive maintenance actions at the end of a given mission m is given by the following formula:

DPM(m) = Σ_{p=1..P} Σ_{i=1..n} Σ_{j=1..Ni} dpm(ap) × ap(Cij, m).   (17)

Assume that the system has just achieved the first mission and will operate the remaining missions, and let R0 denote the required reliability level of the system at the beginning of each mission m (m = 2, . . . , M). The selective maintenance problem is then formulated as follows: from the vector VPM, find the optimal sequence of PM actions which minimizes the total maintenance cost Ctotal while providing the desired reliability level R0. To derive the mathematical programming model corresponding to such a problem, let the vector S = [s1, . . . , sK] be the sequence of PM actions performed so as to keep the system reliability at the desired level R0. Roughly speaking, the vector S is of dimension K ≤ P and is composed of elements of the vector VPM. At the end of a given mission, if preventive maintenance is required, then the first PM action to be performed corresponds to the first element s1 of S. Whenever action s1 is not sufficient to guarantee the system reliability level, PM actions s1 and s2 should be performed simultaneously, and so on. The mathematical programming model corresponding to the selective maintenance optimization problem is:

Minimize Ctotal(S) = CMR(S) + CPM(S),   (18)

Subject to:

R(m + 1) ≥ R0,   (19)

DPM(m) ≤ D(m, m + 1),   (20)

Σ_{p=s1..sK} ap(Cij, m) ≤ 1,   (21)

ap(Cij, m) ∈ {0, 1},   (22)

i = 1, . . . , n; j = 1, . . . , Ni; p = s1, . . . , sK,   (23)

m = 1, . . . , M − 1; and K ≤ P.   (24)

where constraint (20) states that PM actions undertaken at the end of a given mission should be completed within the allotted time, constraint (21) imposes that each component may receive at most one PM action at the end of each mission, and constraint (22) is a {0, 1}-integrality constraint.

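A candidate plan can be checked against the PM cost terms (14)–(15) and constraints (20)–(21) as follows. This is a simplified sketch: the data structures and the few cost/duration values are illustrative stand-ins (a subset of Table 2), and the reliability constraint (19) is omitted:

```python
# Evaluating a candidate PM plan: PM part of the objective (14)-(15) and
# constraints (20)-(21). `plan[m]` maps a component name to a PM action index.
cpm = {1: 10.01, 4: 3.3, 28: 1.44}      # cost of PM action p (Table 2 subset)
dpm = {1: 0.66, 4: 0.22, 28: 0.07}      # duration of PM action p
breaks = {1: 9.0, 2: 9.0}               # D(m, m+1): break length after mission m

def evaluate(plan):
    """Return (total PM cost, feasible?) for a plan {mission: {component: action}}."""
    total_cpm, feasible = 0.0, True
    for m, actions in plan.items():
        duration = sum(dpm[p] for p in actions.values())
        if duration > breaks[m]:        # constraint (20): fit within the break
            feasible = False
        total_cpm += sum(cpm[p] for p in actions.values())
        # constraint (21), one action per component per break, is enforced
        # structurally: each component key appears at most once per mission.
    return total_cpm, feasible

cost, ok = evaluate({1: {"C11": 1, "C12": 4}, 2: {"C34": 28}})
```

In the paper, a feasibility test of exactly this kind is what drives the construction of the initial solution and the acceptance of neighbors in Section 4.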
4 OPTIMIZATION METHOD

4.1 The simulated annealing algorithm

Simulated annealing (SA) is one of the most widely studied local search metaheuristics and has been shown to be suitable for a wide range of combinatorial optimization problems, as in the case of the selective maintenance optimization problem formulated in this paper. The SA principle exploits an analogy between the way a metal cools and freezes into a minimum-energy crystalline structure and the search for a minimum in a more general system. The application of this algorithm to solve combinatorial optimization problems was initiated by Kirkpatrick et al. (Kirkpatrick, Gelatt-Jr., and Vecchi 1983). The major advantage of the SA algorithm over other methods is its ability to avoid becoming trapped at local minima during the search process. Figure 1 presents an overview of the SA algorithm.
The algorithm starts with an initial solution s, generated either randomly or heuristically, and with an initial temperature Ti. Then, a solution s′ is randomly sampled from the neighborhood V(s) of the current solution s. The solution s′ is then accepted or rejected depending on the current temperature Tc and the values of the objective function f at points s and s′ (i.e. f(s) and f(s′)). Since the selective maintenance optimization problem consists in maximizing the system reliability, it follows that s′ will be accepted with probability 1 as a new solution whenever f(s′) > f(s). However, in the case where f(s′) ≤ f(s), s′ will be accepted with a probability which is a function of Tc and Δf = f(s) − f(s′). This probability follows the Boltzmann distribution p = exp(−Δf / Tc). The temperature is decreased following the progression formula Tc = ηTc, where η represents the cooling schedule. The search process is continued until the termination condition Tc = Tf holds, where Tf is the minimal temperature.

Figure 1. An overview of the simulated annealing algorithm.

4.2 Simulated annealing algorithm for the selective maintenance optimization problem

In this paper, the simulated annealing algorithm is used as an optimization technique to solve the selective maintenance optimization problem. The solution representation is inspired from that of (Levitin and Lisnianski 2000) (see also (Nahas, Khatab, Ait-Kadi, and Nourelfath 2007)). The elements of the vector VPM of available PM actions are numbered from 1 to P. The maintenance plan, as a solution, is represented by a vector S = [s1, . . . , sK] with finite length K ≤ P and where sp ∈ {1, 2, . . . , P}, for p = 1, . . . , K. The length of a given solution depends on its feasibility. The initial feasible solution is derived on the basis of the following procedure.

4.2.1 Initial solution construction
1. Set the length of S to a constant number Kmax
2. Generate the elements of S from a random permutation of the set {1, . . . , P}
3. Set K = 1
4. Calculate the objective function and the constraint values by using the K first elements (i.e. PM actions) of S
5. If (K = Kmax) and (S is not feasible) then return to step 2
6. If (K < Kmax) and (S is not feasible) then set K = K + 1 and proceed from step 4

To define the appropriate neighborhood, several structures were investigated. The following procedure provides the neighbor solution.

4.2.2 Neighboring solution construction
1. Generate randomly a number x from the interval [0, 1]
2. If (x ≥ 0.5) then choose randomly an element S(i) with 1 ≤ i ≤ N, and randomly increase or decrease by 1 the content of S(i), i.e. S(i) = S(i) + 1 or S(i) = S(i) − 1
3. If (x < 0.5) then choose randomly two elements S(i) and S(j) with 1 ≤ i ≤ K and K + 1 ≤ j ≤ P, and exchange the contents of S(i) and S(j).

It is worth noticing that, in order to ensure the feasibility of a given solution, one needs to evaluate the objective function and the constraints of the optimization problem. To this end, a procedure was developed for the feasibility test of a given solution vector S = [s1, . . . , sK].

5 APPLICATION EXAMPLE

The test problem used in this paper is based on a series-parallel system composed of n = 4 subsystems Si (i = 1, . . . , 4), as shown in Figure 2. Subsystems S1,

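The loop of Figure 1 can be sketched generically: Boltzmann acceptance with geometric cooling, applied here to a toy minimization rather than the maintenance problem. The temperature defaults mirror the values reported in Section 5; the toy objective and neighborhood are illustrative, not the paper's solution encoding:

```python
import math
import random

# Generic simulated annealing loop per Figure 1: accept improvements always,
# accept worse moves with Boltzmann probability exp(-delta_f / Tc), and cool
# geometrically with Tc = eta * Tc until Tc reaches the minimal temperature.
def simulated_annealing(f, s0, neighbor, t_init=80.0, t_final=1e-5, eta=0.99,
                        rng=random.Random(0)):
    s, best = s0, s0
    tc = t_init
    while tc > t_final:
        s_new = neighbor(s, rng)
        delta = f(s_new) - f(s)
        if delta < 0 or rng.random() < math.exp(-delta / tc):
            s = s_new                  # move to the sampled neighbor
        if f(s) < f(best):
            best = s                   # keep the best solution seen so far
        tc *= eta                      # cooling schedule
    return best

# Toy usage: minimize (x - 3)^2 over the integers with +/-1 moves.
best = simulated_annealing(lambda x: (x - 3) ** 2, 50,
                           lambda x, rng: x + rng.choice((-1, 1)))
```

For the maintenance problem, `neighbor` would be the procedure of Section 4.2.2 and `f` the total cost of equation (18), with infeasible neighbors rejected by the feasibility test.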
S2, S3 and S4 are, respectively, composed of N1 = 5, N2 = 3, N3 = 4 and N4 = 2 components. The reliability function of each component is given by a Weibull hazard function h(t) = γ β^γ t^(γ−1), where β and γ are, respectively, the scale and the shape parameters. The accumulated hazard function is then H(t) = (βt)^γ.

For each component Cij (i = 1, . . . , 4; j = 1, . . . , Ni), the scale βij and shape γij parameters and the minimal repair cost cmrij are randomly generated and reported in Table 1. The vector VPM of available PM actions is also randomly generated and given in Table 2. The system is assumed to experience M = 10 missions. Durations of missions and breaks are given in Table 3.

Figure 2. The series-parallel system for the application example.

Table 1. Parameters of system components.

Component Cij    βij      γij     cmrij
C11              0.006    1.36    1
C12              0.01     2.19    2.3
C13              0.006    1.66    1.5
C14              0.002    1.93    1
C15              0.008    1.41    1.9
C21              0.002    1.58    2.6
C22              0.01     2.57    1.6
C23              0.011    2.84    1.3
C31              0.002    2.01    1.2
C32              0.007    1.34    2.4
C33              0.009    1.13    1.9
C34              0.006    1.7     0.7
C41              0.002    1.27    0.7
C42              0.002    2.09    0.4

Table 2. Parameters of preventive maintenance actions.

PM action p    Comp(p)    α(p)    cpm(p)    dpm(p)
1              C11        0.51    10.01     0.66
2              C11        0.41    14.33     1.02
3              C12        0.00    15.08     1.50
4              C12        0.48    3.3       0.22
5              C13        0.27    6.28      0.49
6              C13        1.00    5.93      0.30
7              C14        1.00    7.38      0.37
8              C14        0.12    18.19     1.62
9              C15        0.15    16.4      1.43
10             C15        0.00    20.83     2.08
11             C21        0.00    19.6      1.96
12             C21        1.00    7.73      0.39
13             C22        0.37    6.92      0.51
14             C22        0.18    16.67     1.41
15             C22        0.00    16.99     1.70
16             C23        0.42    10.54     0.74
17             C23        0.38    13.22     0.96
18             C23        0.68    5.04      0.30
19             C31        0.48    11.83     0.80
20             C31        0.34    13.96     1.04
21             C32        0.42    12.25     0.86
22             C32        0.71    4.49      0.26
23             C32        0.34    13.33     0.99
24             C33        0.13    18.55     1.64
25             C33        1.00    9.70      0.49
26             C33        0.43    1.68      0.12
27             C34        0.63    5.45      0.33
28             C34        1.00    1.44      0.07
29             C34        0.75    8.71      0.50
30             C41        0.78    3.68      0.21
31             C41        0.52    6.54      0.43
32             C41        0.00    17.66     1.77
33             C42        1.00    2.53      0.13
34             C42        0.58    9.51      0.6

Table 3. Durations of missions and breaks between successive missions.

U    84    57    59    54    78    68    70    76    84    49
D     9     9     9    11     7     9     9     7    11
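The Weibull component model above is easy to evaluate numerically. The following is a small illustrative sketch (function names are ours) computing the accumulated hazard H(t) = (βt)^γ and the survival probability of an unmaintained component over a mission interval, using component C11 of Table 1 as an example:

```python
import math

def cumulative_hazard(beta, gamma, t):
    """Accumulated Weibull hazard H(t) = (beta * t) ** gamma."""
    return (beta * t) ** gamma

def mission_reliability(beta, gamma, start, end):
    """Survival probability over [start, end] for a component that
    receives no maintenance in between: exp(-(H(end) - H(start)))."""
    dH = cumulative_hazard(beta, gamma, end) - cumulative_hazard(beta, gamma, start)
    return math.exp(-dH)

# component C11 from Table 1 (beta = 0.006, gamma = 1.36), first mission of
# duration 84 time units
r = mission_reliability(0.006, 1.36, 0, 84)
```

For an aged component, passing a non-zero `start` gives the conditional mission reliability that the selective maintenance model trades off against the PM actions of Table 2.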
The proposed algorithm is implemented using the MATLAB software tool on a 1.86 GHz Intel Core Duo processor. For the required reliability level R0 = 0.90, the algorithm was tested for several values of the parameters Ti, Tf and η. The most appropriate values are found to be Ti = 80, Tf = 10^−5 and η = 0.99. The length assigned to the solution vector S is set to Kmax = 30. The best selective maintenance plan obtained for the mission reliability level R0 = 0.90 is presented in Table 4.

According to the available PM actions given in Table 2, Table 4 gives the components and the time at which they should receive a specific type of PM action so as to ensure the system reliability level R0 for each mission. Figure 3 presents the system reliability versus the number of missions. The best total maintenance cost Ctotal, induced by the obtained selective maintenance plan and minimal repairs, is Ctotal = 400.64, while the execution time is about 76 sec.

Table 4. The best selective maintenance plan obtained for the required reliability level 0.90.

Mission m    PM actions p          Cij ∈ Comp(p)
2            11                    C21
4            15, 30, 26, 20, 8     C22, C41, C33, C31, C14
5            11, 34                C21, C42
6            15, 26, 34, 30, 20    C22, C33, C42, C41, C31
7            20, 30, 15, 8         C31, C41, C22, C14
8            11, 20, 8, 26, 15     C21, C31, C14, C33, C22

Figure 3. Mission reliability of the system in the planning horizon.

6 CONCLUSION

In this paper, we proposed a selective maintenance optimization model for a multi-mission series-parallel system. The lifetime of each system component is generally distributed. The system operates on a planning horizon composed of several missions, such that between successive missions a break of finite length is allotted to perform maintenance actions. Missions as well as breaks may have different durations, and during breaks a list of preventive maintenance actions is available for the maintenance of system components. A combinatorial optimization problem is formulated, the objective of which is to find, over the planning horizon, an optimal sequence of preventive maintenance actions that minimizes the total maintenance cost while providing, for each mission, the desired system reliability level. To solve this problem, an optimization method is proposed on the basis of the simulated annealing algorithm.

REFERENCES

Boland, P. (1982). Periodic replacement when minimal repair costs vary with time. Naval Research Logistics Quarterly 29(4), 541–546.
Cassady, C.R., W.P. Murdock, and E.A. Pohl (2001). Selective maintenance for support equipment involving multiple maintenance actions. European Journal of Operational Research 129, 252–258.
Cassady, C.R., E.A. Pohl, and W.P. Murdock (2001). Selective maintenance modeling for industrial systems. Journal of Quality in Maintenance Engineering 7(2), 104–117.
Chen, C., M.Q.-H. Meng, and M.J. Zuo (1999). Selective maintenance optimization for multistate systems. In Proc. of the IEEE Canadian Conference on Electrical and Computer Engineering, Edmonton, Canada, 1477–1482.
Cho, D.I. and M. Parlar (1991). A survey of maintenance models for multi-unit systems. European Journal of Operational Research 51, 1–23.
Dekker, R. (1996). Application of maintenance optimization models: a review and analysis. Reliability Engineering and System Safety 51(3), 229–240.
Khatab, A., D. Ait-Kadi, and A. Artiba (2007). Selective maintenance optimization for multimission series-parallel systems. European Journal of Operational Research (submitted).
Khatab, A., D. Ait-Kadi, and M. Nourelfath (2007a). Algorithme du recuit simulé pour la résolution du problème d'optimisation de la maintenance sélective des systèmes série-parallèle. Seventh International Conference on Industrial Engineering, Trois-Rivières, QC, Canada.
Khatab, A., D. Ait-Kadi, and M. Nourelfath (2007b). Heuristic-based methods for solving the selective maintenance problem in series-parallel systems. International Conference on Industrial Engineering and Systems Management, Beijing, China.
Kirkpatrick, S., C.D. Gelatt Jr., and M.P. Vecchi (1983). Optimization by simulated annealing. Science 220(4598), 671–680.
Levitin, G. and A. Lisnianski (2000). Optimization of imperfect preventive maintenance for multi-state systems. Reliability Engineering and System Safety 67, 193–203.
Maillart, L., C.R. Cassady, C. Rainwater, and K. Schneider (2005). Selective maintenance decision-making over extended planning horizons. Technical Memorandum Number 807, Department of Operations, Weatherhead School of Management, Case Western Reserve University.
Nahas, N., A. Khatab, D. Ait-Kadi, and M. Nourelfath (2007). Extended great deluge algorithm for the imperfect preventive maintenance optimization of multi-state systems. Reliability Engineering and System Safety (submitted).
Nakagawa, T. (1988). Sequential imperfect preventive maintenance policies. IEEE Transactions on Reliability 37(3), 295–308.
Rice, W.F., C.R. Cassady, and J. Nachlas (1998). Optimal maintenance plans under limited maintenance time. In Proceedings of Industrial Engineering Conference, Banff, BC, Canada.


Study on the availability of a k-out-of-N system given limited spares under (m, NG) maintenance policy

T. Zhang, H.T. Lei & B. Guo
National University of Defense Technology, Changsha, People's Republic of China

ABSTRACT: This paper considers a k-out-of-N system with identical, repairable components under a so-called (m, NG) maintenance policy. Under this policy, maintenance is initiated when the number of failed components exceeds a critical level identified by m. After a possible set-up time for spare replacement, at least NG components must be good before the k-out-of-N system is sent back to the user. A multi-server repair shop repairs the failed components. The operational availability of such a system depends not only on the spare part stock level and the repair capacity, but also on the two parameters m and NG of the maintenance policy. This paper presents a mathematical model of the operational availability of a repairable k-out-of-N system given limited spares under the (m, NG) maintenance policy. Using this model, we can make trade-offs between the spare part stock level, the number of repairmen and the two parameters of the maintenance policy. From the analysis of an example, we obtain some valuable conclusions.

1 INTRODUCTION

Thanks to their high reliability, k-out-of-N systems are more and more used in modern high-technology equipment, such as aircraft, warship, or missile systems. An example is the Active Phased Array Radar (APAR). This radar has a cubical shape. On each of the four sides it has a so-called face, consisting of thousands of transmit and receive elements. A certain percentage of the total number of elements per face is allowed to fail without losing the function of the specific radar face. Say that this percentage is 10% and that the total number of elements is 3000; then we have a 2700-out-of-3000 system. The operational availability of these systems is affected by the number of spare parts, the number of repair teams, and so on. By analyzing the operational availability for different numbers of spare parts and repair teams under different maintenance policies, we can support maintenance decisions for these systems. For these systems, Ref.[1] provides a simple policy of condition-based maintenance (we call it the m-maintenance policy here), and gives a model of the operational availability of a k-out-of-N system under this policy. The m-maintenance policy is the policy under which the system asks for repair when there are m failed components in the system (the available ones are reduced to N − m). In practice, however, because of the limited number of spare parts, the system may only be restored to a state in which the number of available components is less than N, in order to rapidly return the equipment to service. This can be called the (m, NG) maintenance policy. The (m, NG) maintenance policy means that replacement of components is only initiated when the number of available components is reduced to N − m, and that after the repair the number of available components is at least NG (N − m < NG ≤ N) before the system is transited back to the user. When NG = N, the (m, NG) maintenance policy is just the m-maintenance policy.

Some work is available on the operational availability model of k-out-of-N systems. Ref.[2] analyses the availability of a parallel repairable system. Ref.[3] analyses the operational availability of a 1-out-of-N cold standby redundancy system under a cyclic replacement policy. Ref.[4] analyses the operational availability when the failed components are repaired immediately and there is just one repair team, which carries out both the replacement and the repair of spare parts. Ref.[5] provides an operational availability model of a k-out-of-N system with both cold standby and hot standby redundancy; it also assumes that failed components are repaired immediately. Ref.[1] provides an operational availability model of the hot standby redundancy k-out-of-N system; this model considers the m-maintenance policy. This paper studies the operational availability model of a k-out-of-N system under the (m, NG) maintenance policy.

2 DESCRIPTION OF PROBLEM

In this section, we describe the k-out-of-N system with hot standby redundancy. At the start of a system uptime, all N components are as good as new. The failure process of each component is characterized by a negative exponential distribution with rate λ, where we assume that the component failure processes are mutually independent. The system functions properly as long as at least k (k > 0) components are available. The system is maintained under the (m, NG) maintenance policy. There are c repair teams, which repair the replaced failed components in the repair shop. Each team can repair one failed component at a time. The repair time has a negative exponential distribution with rate μ, where we assume that the time of replacement is short enough to be neglected and that each repair team can work independently without stopping. A failed component is as good as new after repair. It takes time Td to send the system to the repair shop after it is dismantled, and time Ts to fix the system and transit it back to the user after the replacement of spare parts. If insufficient spares are available, the maintenance completion is delayed until the number of available components is at least NG. Now, we want to obtain the operational availability Ao of this system given the initial number X of spare parts.

3 RESOLVING PROCESS

The system cycle starts with all N components as good as new. We define the system operation cycle, which runs from one system start time to the next, and the spare replacement cycle, which runs from one spare replacement start time to the next. The system operation cycle thus includes 4 processes: the period of system up, the period of system down and in transit to depot, the period of component replacement, and the period of system in transit back. See Figure 1.

The operational availability equals the expected uptime during a cycle divided by the expected cycle length. So, we find

Ao = E(To) / [E(To) + E(Td) + E(Tr) + E(Ts)]   (1)

where E(To) is the expected uptime until system down and maintenance, E(Td) is the expected time during which the system is down and in transit to the depot, E(Tr) is the expected duration of the component replacement period, and E(Ts) is the expected time during which the system is restored and in transit back. Because we assume the time of replacement can be neglected, E(Tr) is the time waiting for sufficient repaired spare parts; Tr is zero when there are sufficient spare parts.

Given E(Td) = Td and E(Ts) = Ts, the key point in computing Ao is to compute the expected operational time of the system, E(To), and the expected lingering time E(Tr) produced by waiting for the demanded spare parts.

3.1 Definition

The state of the system is defined as (n, s). It consists of the number of operational components n and the number of available spare parts s. The n and s are constrained by NG ≤ n ≤ N, 0 ≤ s ≤ N + X − NG and NG ≤ n + s ≤ N + X. Before discussing the formulas of E(To) and E(Tr), we define the following symbols, which will be used in the resolving process.

Pr(a, b, c, t) is the probability that the number of failed components is reduced from a to b by the repair of c repair teams in the time t.

Pa(n) is the steady probability that the number of operational components is n at the starting time of an operational period.

Pb(n, s) is the steady probability that the initial state of an operational period of the system is (n, s).

p(ns,ss),(ne,se) is the state translation probability that the state of the system is translated from (ns, ss) to (ne, se) during the system operation cycle.

Ps(s) is the steady probability that there are just s available spare parts at the starting time of the spare replacement period.

NM is the number of operational components when the system asks for repair; NM = N − m is a fixed value.

Z is the maximal number of available spare parts. When the number of components in the system is NG and there are no components needing repair, the number of available spare parts is maximal. So Z = X + N − NG, and the maximal number of components waiting to be repaired is also Z.

3.2 The expression of Pr(a, b, c, t)

Because of the frequent use of Pr(a, b, c, t), we give its formula first.

If we treat the c repair men as c servers, and the a components which are being repaired or waiting for repair as a guests which are being served or waiting for service, then, since the repair time of failed components has an exponential distribution with rate μ, the resolution of Pr(a, b, c, t) can be treated as a queueing model with several servers and a finite number of guests. Assume that there are c independent servers with serving rate μ. The number of guests which are being served or waiting for service is a at the starting time. There are no new guests. A waiting guest can enter any free

Figure 1. The cycles and periods of the system. The system operation cycle comprises the operational period (To), the period in which the system is down and in transit to the depot (Td), the spare replacement period (Tr), and the period in which the system is restored and in transit back (Ts); the spare replacement cycle runs from one spare replacement start time to the next.

server, and guests leave when they finish their service. So, Pr(a, b, c, t) equals the probability that the number of guests which are being served or waiting for service is b after time t. The formula of Pr(a, b, c, t) can be treated according to the following conditions.

1. When b > a: as there are no new guests, the number of guests which are being served or waiting for service cannot increase. This condition cannot occur, and Pr(a, b, c, t) = 0.

2. When a = b: this means that no guest finishes service and leaves within time t. When a > c, the number of guests is larger than the number of servers, and all servers are busy, with c guests served at the same time; thus the time in which one guest finishes service has an exponential distribution with rate cμ. When a ≤ c, the number of guests is smaller than the number of servers, all guests can be served, and the rate is aμ. Synthesizing these two conditions, the serving rate is min{a, c}μ. When a = b, no guest has finished service in time t. So we get the following equation:

Pr(a, b, c, t) = e^(−min{a,c}μt)   (2)

3. When b < a and a ≤ c: all a guests are being served. The number of guests who have finished their service in time t is a − b, and b guests have not finished in time t. So there is the following equation:

Pr(a, b, c, t) = C_a^b e^(−bμt) (1 − e^(−μt))^(a−b)   (3)

4. When 0 ≤ b ≤ a and a > c: as a > c, at most c guests are served at the same time. In this condition, Pr(a − 1, b, c, τ) is the probability that the number of failed components is reduced from a − 1 to b in time τ by repairing, and cμe^(−cμ(t−τ)) is the probability density that one guest finishes his service in the remaining time t − τ. So, Pr(a, b, c, t) is the integral of cμe^(−cμ(t−τ)) Pr(a − 1, b, c, τ) for τ over [0, t], and we can get the recursive Equation (4):

Pr(a, b, c, t) = cμ e^(−cμt) ∫_0^t e^(cμτ) Pr(a − 1, b, c, τ) dτ   (4)

According to Equation (4), when a > c > b ≥ 0,

Pr(a, b, c, t) = cμ e^(−cμt) ∫_0^t e^(cμτ) Pr(a − 1, b, c, τ) dτ
 = cμ e^(−cμt) ∫_0^t e^(cμτ) [ cμ e^(−cμτ) ∫_0^τ e^(cμλ) Pr(a − 2, b, c, λ) dλ ] dτ
 = (cμ)^2 e^(−cμt) ∫_0^t e^(cμτ) Pr(a − 2, b, c, τ)(t − τ) dτ
 = · · · = [(cμ)^(a−c) e^(−cμt) / (a − c − 1)!] ∫_0^t e^(cμτ) Pr(c, b, c, τ)(t − τ)^(a−c−1) dτ   (5)

where Pr(c, b, c, τ) = C_c^b e^(−bμτ) (1 − e^(−μτ))^(c−b).

According to Equation (4), when c ≤ b < a, by the same deduction as for Equation (5), we can get:

Pr(a, b, c, t) = [(cμ)^(a−b) e^(−cμt) / (a − b − 1)!] ∫_0^t e^(cμτ) Pr(b, b, c, τ)(t − τ)^(a−b−1) dτ   (6)

where Pr(b, b, c, τ) = e^(−cμτ).

Synthesizing all the above conditions, we can get the following computing equation for Pr(a, b, c, t):


Pr(a, b, c, t) =

  0,   when a < b or a < 0 or b < 0;

  e^(−min{a,c}μt),   when a = b;

  C_a^b e^(−bμt) (1 − e^(−μt))^(a−b),   when 0 ≤ b ≤ a ≤ c;

  C_c^b Σ_{i=0}^{c−b−1} C_{c−b}^i (−1)^i { [c/(c − b − i)]^(a−c) (e^(−(b+i)μt) − e^(−cμt))
      − Σ_{j=1}^{a−c−1} [c^(a−c)/(c − b − i)^j] · [(μt)^(a−c−j)/(a − c − j)!] · e^(−cμt) }
      + C_c^b (−1)^(c−b) [(cμt)^(a−c)/(a − c)!] e^(−cμt),   when a > c > b ≥ 0;

  [(cμt)^(a−b)/(a − b)!] e^(−cμt),   when a > b ≥ c ≥ 0.   (7)
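Since Pr(a, b, c, t) is the transient distribution of a pure-death Markov chain whose death rate in state n is min{n, c}μ, it can also be computed numerically by uniformization. The following sketch (function name and truncation depth are ours, not from the paper) provides an independent cross-check of the closed-form cases above:

```python
import math

def pr_repair(a, b, c, mu, t, terms=400):
    """Probability that a failed components are reduced to b by c repair
    teams within time t (each repairing at rate mu), computed by
    uniformization of the pure-death chain with rate min(n, c) * mu."""
    if b > a or a < 0 or b < 0:
        return 0.0
    lam = c * mu                     # uniformization rate, >= every state's rate
    p = [0.0] * (a + 1)              # p[n]: prob. that n components remain failed
    p[a] = 1.0
    dist = [0.0] * (a + 1)
    poisson = math.exp(-lam * t)     # Poisson(lam*t) weight for k = 0 steps
    for k in range(terms):
        for n in range(a + 1):
            dist[n] += poisson * p[n]
        # one step of the uniformized chain: n -> n-1 w.p. min(n, c)*mu/lam
        q = [0.0] * (a + 1)
        for n in range(a + 1):
            r = min(n, c) * mu / lam
            if n > 0:
                q[n - 1] += r * p[n]
            q[n] += (1.0 - r) * p[n]
        p = q
        poisson *= lam * t / (k + 1)
    return dist[b]
```

For a ≤ c this reproduces the binomial case of Equation (3), and for a > b ≥ c it reproduces the Poisson form (cμt)^(a−b)/(a − b)! · e^(−cμt) of Equation (7).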

3.3 The expression of E(To)

In an operational period of the system, we use To(n) to express the operational time of the system when the number of available components is n at the starting time of the operational period. It is the expected interval in which the state of the system with n available components translates to the state in which n − NM components have failed. Consequently, there is the following equation:

To(n) = Σ_{i=0}^{n−NM−1} 1/((n − i)λ)   (8)

So, we can also get the following equation for E(To):

E(To) = Σ_{n=NG}^{N} Pa(n) · To(n) = Σ_{n=NG}^{N} Pa(n) · Σ_{i=0}^{n−NM−1} 1/((n − i)λ)   (9)

The equation of Pa(n) is the next thing we are going to discuss. We know that the state of the system consists of the number of available components and the number of available spare parts. Because the failure time and the repair time of components both have exponential distributions, and the probability of the system state at the end of an operational period is related only to the system state at the starting time of that operational period, the states form a Markov chain. Before resolving Pa(n), we compute Pb(n, s) first.

To get Pb(n, s), the key point is to get the translating matrix of states Q = [p(ns,ss),(ne,se)]. Below we discuss p(ns,ss),(ne,se).

We divide (ns, ss) → (ne, se) into two phases. The first phase is (ns, ss) → (NM, sm). It begins at the starting time of the operational period, and ends at the starting time of the component replacement period. The second phase is (NM, sm) → (ne, se). It begins at the starting time of the component replacement period, and finishes at the end time of the component replacement period. The time of the first phase is t1; t1 equals the sum of the time during which the system is operational and the time used to ask for repair and send the system to the repair shop. So we can get:

t1 = To(ns) + Td = Td + Σ_{i=0}^{ns−NM−1} 1/((ns − i)λ)   (10)
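Equations (8) and (10) are straightforward to evaluate. A small sketch (using the example's λ = 0.005 and Td = 5 only as placeholder values in the test):

```python
def expected_uptime(n, N_M, lam):
    """To(n) of Eq. (8): expected time for the number of working components
    to fall from n to N_M, as a sum of exponential stage means."""
    return sum(1.0 / ((n - i) * lam) for i in range(n - N_M))

def first_phase_time(n_s, N_M, lam, T_d):
    """t1 of Eq. (10): expected uptime plus the transit time T_d."""
    return T_d + expected_uptime(n_s, N_M, lam)
```

Each term 1/((n − i)λ) is the mean sojourn time while n − i components are still working, which is why To(n) telescopes into a simple sum.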

sm is the number of available spare parts at the starting time of the component replacement period. sm cannot be less than ss, which is the number of available spare parts at the starting time of the operational period. Nor can it be larger than the number (N + X − ns) of failed components that are being repaired or waiting for repair. Thus, we get the constraint ss ≤ sm ≤ N + X − ns.

Consequently, the translating probability of the system state (ns, ss) → (ne, se) equals the sum, over all possible sm, of the probabilities that the state of the system translates from (ns, ss) to (NM, sm), and then from (NM, sm) to (ne, se):

p(ns,ss),(ne,se) = Σ_{sm=ss}^{X+N−ns} p1(ns, ss, NM, sm) · p2(NM, sm, ne, se)

where p1(ns, ss, NM, sm) is the translating probability when the state of the system changes from (ns, ss) to (NM, sm) in the first phase, and p2(NM, sm, ne, se) is the translating probability when the state of the system changes from (NM, sm) to (ne, se) in the second phase.

p1(ns, ss, NM, sm) equals the probability that the number of components repaired in time t1 is sm − ss. So we get the following equation:

p1(ns, ss, NM, sm) = Pr(L1, L2, c, t1)   (11)

where L1 = N + X − ns − ss and L2 = N + X − ns − sm.

According to the (m, NG) maintenance policy, the second phase goes through the replacement of failed components and the waiting for spare parts. Therefore, we can discuss it according to the following conditions.

1. When ne = NG and sm ≤ NG − NM, the number of available components goes back to NG, and the number of spare parts is reduced to 0 at the end time of the replacement process. p2(NM, sm, ne, se) equals the probability that the number of failed components is reduced from Z to Z − se by the c repair men:

p2(NM, sm, ne, se) = Pr(Z, Z − se, c, ts)   (12)

2. When NG < ne < N and sm = ne − NM, sm can fill the demand of repair, and the system is restored to the state of ne available components. The number of spare parts is also reduced to 0 at the end time of the replacement process:

p2(NM, sm, ne, se) = Pr(N + X − ne, N + X − ne − se, c, ts)   (13)

3. When ne = N, sm ≥ N − NM and se ≥ sm − N + NM, sm can fill the maximal demand of replacement, and the number of available spare parts is sm − N + NM at the end time of replacement:

p2(NM, sm, ne, se) = Pr(N + X − NM − sm, X − se, c, ts)   (14)

According to the (m, NG) maintenance policy, any combination of ne and sm that does not satisfy the above conditions is impossible. Thus, in all other cases, p2(NM, sm, ne, se) = 0.

Synthesizing Equations (12), (13) and (14), we get the equation of p2(NM, sm, ne, se):

p2(NM, sm, ne, se) =
  Pr(Z, Z − se, c, ts),                        when ne = NG and sm ≤ NG − NM;
  Pr(N + X − ne, N + X − ne − se, c, ts),      when NG < ne < N and sm = ne − NM;
  Pr(N + X − NM − sm, X − se, c, ts),          when ne = N and sm ≥ N − NM and se ≥ sm − N + NM;
  0,                                           otherwise.

If we substitute Equation (11) and this expression for p2 into the sum above, we obtain p(ns,ss),(ne,se). According to the theory of Markov chains, there is the following equation:

Pb(n, s) = Σ_{i=NG}^{N} Σ_{j=0}^{X+N−i} Pb(i, j) · p(i,j),(n,s)   (15)

Define

Π = [Pb(NG, 0) · · · Pb(NG, Z) · · · Pb(NG + i, 0) · · · Pb(NG + i, Z − i) · · ·]^T,


and let Q = [p(ns,ss),(ne,se)] denote the matrix of state translation probabilities, with rows and columns ordered over the states (NG, 0), . . . , (NG, Z), . . . , (NG + i, 0), . . . , (NG + i, Z − i), . . . , (N, 0), . . . , (N, X).   (16)
Then we can get

Π = Q · Π   (17)

Pa(n) = Σ_{s=0}^{Z} Pb(n, s)   (18)

Then, according to Equations (15), (17) and (18), we can get Pb(n, s) and Pa(n).

3.4 The expression of E(Tr)

At the starting time of component replacement, when the number sm of available spare parts satisfies the least demand (NG − NM) of repair, the process of replacement and repair of system components can be completed at once, and Tr can be treated as 0. When sm does not satisfy the least demand (NG − NM) of repair, the system must wait for the lacking components to be repaired in order to finish the replacement of system components. So Tr is the lingering time produced by waiting for the demanded spare parts, and it relates only to sm. Assume that Tc(s′, s″, c) is the lingering time produced by waiting for the demanded spare parts, when the number of demanded replacement spare parts is s′, the number of components waiting for repair is s″, and the number of repair teams is c at the starting time of the component replacement period. Therefore, we can obtain the computing formula of E(Tr):

E(Tr) = Σ_{sm=0}^{NG−NM−1} E[Tc(NG − NM − sm, N + X − NM − sm, c)] · Ps(sm)   (19)

When s″ ≤ c, the number of components waiting for repair is less than the number of repair teams, and the interval to the next repaired component is 1/(s″μ). When s″ > c, the number of components waiting for repair is larger than the number of repair teams, and the interval to the next repaired component is 1/(cμ). And Tc(s′, s″, c) equals the sum of the

time to repair one component and the lingering time Tc(s′ − 1, s″ − 1, c) when the number of demanded spare parts is s′ − 1 and the number of components waiting for repair is s″ − 1. Thus, there is the following recursive equation:

E[Tc(s′, s″, c)] = 1/(min{s″, c}μ) + E[Tc(s′ − 1, s″ − 1, c)]   (20)

On the basis of Equation (20), we can deduce

E[Tc(s′, s″, c)] =
  Σ_{h=0}^{s′−1} 1/((s″ − h)μ),                           when 0 < s′ ≤ s″ ≤ c;
  s′/(cμ),                                                when s″ > c and s′ ≤ s″ − c;
  (s″ − c)/(cμ) + Σ_{h=0}^{s′−s″+c−1} 1/((c − h)μ),       when s″ > c and s″ − c < s′ ≤ s″;
  0,                                                      otherwise.   (21)

For Ps(sm), because the number of available components must be NM at the starting time of the component replacement period, Ps(sm) equals the steady probability that the state of the system is (NM, sm) at the starting time of the component replacement period. And we know 0 ≤ ss ≤ sm, so we can get

Ps(sm) = Σ_{ns=NG}^{N} Σ_{ss=0}^{sm} Pb(ns, ss) · p1(ns, ss, NM, sm)   (22)

In the end, by Equations (19)–(22) we can compute E(Tr).

4 ANALYSIS OF AN EXAMPLE

Assume that there is a 2-out-of-7 hot standby redundancy system. The failure time of components has the exponential distribution with rate λ = 0.005. There is only one repair man (c = 1) to repair the replaced failed components, and the repair time also has the exponential distribution, with rate μ = 0.1. The time of system down and in transit to depot is Td = 5, and the time of system in transit back is Ts = 5. We set the initial number X of spare parts from 1 to 5 and let the parameters of the (m, NG) maintenance policy correspond to (3,6), (4,6), (3,7) and (4,7); we want to know the resulting operational availability. Following the above method, Table 1 and Figure 2 show the data table and curve of operational availability when we set m = 3 and change NG in the (m, NG) maintenance policy, and Table 2 and Figure 3 show the data table and curve of operational availability when we set m = 4 and change NG in the (m, NG) maintenance policy.

Table 1. The values of the availability for a combination of m and NG. m is chosen equal to 3.

Initial number of    Parameter of policy
spare parts          (3,5)     (3,6)     (3,7)
X = 1                0.8797    0.8797    0.7726
X = 2                0.8974    0.8974    0.8359
X = 3                0.9106    0.9106    0.8907
X = 4                0.9106    0.9106    0.9106

Figure 2. The values of the availability for a combination of m and NG. m is chosen equal to 3.

When NG = N = 7, the parameter is (3,7) or (4,7) in the (m, NG) maintenance policy. From Figure 2 and Figure 3 we find that, for a fixed number of initial spare parts, no matter whether m = 4 or m = 3, the operational availability under the (4,6) and (3,6) maintenance policies is larger than the operational availability under the m = 4 and m = 3 maintenance policies.

But when the parameter m is fixed, a smaller parameter NG is not always better. In the example, when the initial number is X = 2, the (4,6) maintenance policy is superior to the (4,5) maintenance policy. And when the initial number X < 4, the (3,6) maintenance policy is superior to the (3,5) maintenance policy.

The following shows the influence of m in the (m, NG) maintenance policy on the operational availability when the initial number changes. Table 3 and Figure 4 show the data table and curve of operational availability when NG = 6 is fixed and m changes.

From Figure 4, we find that when the initial number X = 1, the parameter of the best maintenance

policy is m = 3. When X = 2, the parameter of the best maintenance policy is m = 4. And when X ≥ 3, the parameter of the best maintenance policy is m = 5.

From the above analysis of the example, we can draw the following conclusions.

1. Given the initial number of spare parts, choosing a reasonable (m, NG) maintenance policy can improve the operational availability of the k-out-of-N system. When we consider only the availability as the optimization target, the (m, NG) maintenance policy is superior to the m-maintenance policy.
2. When the parameter m of the (m, NG) maintenance policy is fixed, a smaller parameter NG is not always better; the best choice relates to the initial number of spare parts.
3. When the parameter NG of the (m, NG) maintenance policy is fixed, the parameter m of the best maintenance policy increases as the initial number of spare parts increases. A bigger m is also not always better.

To obtain the best maintenance policy directly, we can build the curve relating the initial number of spare parts, for the different maintenance policies, to the operational availability. Figure 5 shows this curve. From the figure, we can see that when the initial number is 1, the (3,6) maintenance policy is the best maintenance policy. When the initial number is larger than 1, the (4,6) maintenance policy is the best.

Table 2. The values of the availability for a combination of m and NG. m is chosen equal to 4.

Initial number of    Parameter of policy
spare parts          (4,5)     (4,6)     (4,7)
X = 1                0.9142    0.8605    0.7916
X = 2                0.9249    0.9250    0.8351
X = 3                0.9323    0.9322    0.8837
X = 4                0.9382    0.9382    0.9268

Table 3. The values of the availability for a combination of m and NG. NG is chosen equal to 6.

Initial number of    Parameter of policy
spare parts          (2,6)     (3,6)     (4,6)     (5,6)
X = 1                0.8238    0.8797    0.8605    0.8636
X = 2                0.8605    0.8974    0.9250    0.9048
X = 3                0.8609    0.9106    0.9322    0.9500
X = 4                0.8609    0.9106    0.9382    0.9533

Figure 3. The values of the availability for a combination of m and NG. m is chosen equal to 4.

Figure 4. The values of the availability for a combination of m and NG. NG is chosen equal to 6.

Figure 5. The values of the availability for combinations of m and NG, for different values of the initial number of spares.

656
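The policy selection reported above can be checked directly against Table 3. The short sketch below (plain Python; the dictionary simply transcribes the Table 3 values) picks, for each initial number of spare parts X, the (m, NG) policy with the highest tabulated operational availability for NG = 6:

```python
# Operational availability transcribed from Table 3 (NG fixed at 6, m = 2..5).
availability = {
    1: {(2, 6): 0.8238, (3, 6): 0.8797, (4, 6): 0.8605, (5, 6): 0.8636},
    2: {(2, 6): 0.8605, (3, 6): 0.8974, (4, 6): 0.9250, (5, 6): 0.9048},
    3: {(2, 6): 0.8609, (3, 6): 0.9106, (4, 6): 0.9322, (5, 6): 0.9500},
    4: {(2, 6): 0.8609, (3, 6): 0.9106, (4, 6): 0.9382, (5, 6): 0.9533},
}

# For each X, the policy maximizing availability among those tabulated.
best = {x: max(policies, key=policies.get) for x, policies in availability.items()}
print(best)
# For X = 1 the best tabulated policy is (3,6); for X = 2 it is (4,6);
# for X >= 3 it is (5,6), i.e. the best m grows with the number of spares,
# consistent with conclusion 3 above.
```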
5 CONCLUSIONS

This paper provides an operational availability model for the case in which a k-out-of-N system follows the (m, NG) maintenance policy with a given number of spare parts. Through the analysis of an example, the model expresses clearly the influence on operational availability of the initial number of spare parts, the repair capacity, and the selection of m and NG. At the same time, the model can provide decision support for the trade-off between the maintenance policy and the number of spare parts under different conditions.

REFERENCES

[1] Karin S. de Smidt-Destombes, Matthieu C. van der Heijden, Aart van Harten. On the availability of a k-out-of-N system given limited spares and repair capacity under a condition based maintenance strategy. Reliability Engineering and System Safety, 2004, 83: 287–300.
[2] Deng Zhong-yi. Applicability analysis for N various components parallel connection and maintainability systems. Journal of Hunan Institute of Engineering, 2003, 13(2) (in Chinese).
[3] Smith MAJ, Dekker R. Preventive maintenance in a 1-out-of-n system: the uptime, downtime and costs. European Journal of Operational Research, 1997, 99: 565–583.
[4] Fawzi BB, Hawkes AG. Availability of an R-out-of-N system with spares and repairs. Journal of Applied Probability, 1991, 28: 397–408.
[5] Frostig E, Levikson B. On the availability of R out of N repairable systems. Naval Research Logistics, 2002, 49(5): 483–498.
[6] Karin S. de Smidt-Destombes, Matthieu C. van der Heijden, Aart van Harten. On the interaction between maintenance, spare part inventories and repair capacity for a k-out-of-N system with wear-out. European Journal of Operational Research, 2006, 174: 182–200.
[7] Karin S. de Smidt-Destombes, Matthieu C. van der Heijden, Aart van Harten. Availability of k-out-of-N systems under block replacement sharing limited spares and repair capacity. Int. J. Production Economics, 2007, 107: 404–421.


System value trajectories, maintenance, and its present value

K.B. Marais
Department of Industrial Engineering, Stellenbosch University, South Africa

J.H. Saleh
School of Aerospace Engineering, Georgia Institute of Technology, USA

ABSTRACT: Maintenance planning and activities have grown dramatically in importance and are increasingly
recognized as drivers of competitiveness. While maintenance models in the literature all deal with the cost
of maintenance (as an objective function or a constraint), only a handful addresses the notion of value of
maintenance, and seldom in an analytical or quantitative way. We propose that maintenance has intrinsic value
and argue that existing cost-centric models ignore an important dimension of maintenance, its value, and in so
doing, can lead to sub-optimal maintenance strategies. We develop a framework for capturing and quantifying
the value of maintenance activities. The framework presented here offers rich possibilities for future work in
benchmarking existing maintenance strategies based on their value implications, and in deriving new maintenance
strategies that are ‘‘value-optimized’’.

1 INTRODUCTION

Maintenance planning and activities have grown dramatically in importance across many industries. This importance is manifested by both the significant material resources allocated to maintenance departments as well as by the substantial number of personnel involved in maintenance activities in companies—for example over a quarter of the total workforce in the process industry is said to deal with maintenance work [Waeyenbergh and Pintelon, 2002]. This situation, coupled with an increasingly competitive environment, creates economic pressures and a heightened need to ensure that these considerable maintenance resources are allocated and used appropriately, as they can be significant drivers of competitiveness—or lack thereof if mismanaged.
In response to these pressures, the notion of ‘‘optimality’’ and the mathematical tools of optimization and Operations Research (OR) have seeped into maintenance planning, and resulted in the proliferation of ‘‘optimal’’ maintenance models (see the reviews by Pham and Wang (1996) and Wang (2002), for example). In each ‘‘optimal’’ maintenance model developed, an objective function is first posited, then analytical tools are used to derive a maintenance policy that maximizes or minimizes this objective function subject to some constraints. For example, the objective function can be the minimization of cost (cost rate, or life cycle cost) of maintenance given a system reliability and/or availability constraint; conversely, the objective function can be the maximization of reliability or availability, given a cost constraint. In addition to varying the objective function, different ‘‘optimal’’ maintenance models are obtained by: (1) varying for example the system configuration (e.g., single-unit systems versus multi-unit systems, k-out-of-n systems); (2) by including several degrees of maintenance (e.g., minimal, imperfect, perfect); (3) by varying the planning horizon; (4) by using different analytical tools; or (5) by positing different types of dependencies between the various units in a multi-unit system.
Yet, while all these models deal with the cost of maintenance (as an objective function or a constraint), only a handful of models touches on the notion of value of maintenance, and seldom in an analytical or quantitative way (e.g., Macke and Higuchi, 2007). Wang (2002) highlights a critical idea for the development of a value-based perspective on maintenance when he suggests that the cost of maintenance as well as the resulting system reliability should be considered together when developing optimal maintenance strategies. Where the benefits of maintenance are considered, it is usually in the sense of avoiding the costs of failure. Interestingly, it is only within the civil engineering community that the benefits in the sense of service delivery are considered and cost-benefit considerations explicitly taken into account in the development of maintenance strategies (e.g., Macke and Higuchi, 2007).

The argument for dismissing or not focusing on the value of maintenance, when it is made, goes along these lines: while it is easy to quantify the (direct) cost of maintenance, it is difficult to quantify its benefits. Dekker (1996) for example notes ‘‘the main question faced by maintenance management, whether maintenance output is produced effectively, in terms of contribution to company profits, [. . .] is very difficult to answer’’. Therefore maintenance planning is shifted from a value maximization problem formulation to a cost minimization problem (see Saleh (2008) for a discussion of why these two problems are not the same and do not lead to similar decisions in system design and operation). Incidentally, in many organizations, maintenance is seen as a cost function, and maintenance departments are considered cost centers whose resources are to be ‘‘optimized’’ or minimized. In short, as noted by Rosqvist et al. (2007), a cost-centric mindset prevails in the maintenance literature for which ‘‘maintenance has no intrinsic value’’.
In this paper, we propose that maintenance has intrinsic value and that one aspect of this value, the net present value, can be captured. We argue that existing cost-centric optimizations ignore an important dimension of maintenance, namely its value, and in so doing, they can lead to sub-optimal maintenance strategies. We therefore develop a framework built on aspects of existing maintenance models for capturing and quantifying the value of maintenance activities by connecting an engineering and operations research concept, system state, with a financial and managerial concept, the present value (PV). Note that we consider ‘‘value’’ as the net revenue generated by the system over a given planning horizon. We do not consider additional dimensions of value such as the potential positive effects of maintenance on environmental or health impacts. Such effects can be incorporated in future work; see, for example, Marais et al. (2008) for a discussion of the quantification of environmental and health impacts of aviation. The system state refers to the condition of the system and hence its ability to perform and thereby provide a flow of service (hence generate revenue, or ‘‘quasi-rent’’). In order to build this connection, we first explore the impact of a system’s state on the flow of service the system can provide over time—for a commercial system, this translates into the system’s revenue-generating capability. Next we consider the impact of maintenance on system state evolution and hence value generation capability over time. We then use traditional discounted cash flow techniques to capture the impact of system state evolution with and without maintenance on its financial worth, or PV. For simplification, we call the results of our calculations the ‘value of maintenance’. Finally, we discuss the advantages and limitations of our framework. This work offers rich possibilities for assessing and benchmarking the value implications of existing maintenance policies, and deriving new policies based on maximizing value, instead of minimizing the cost of maintenance.

2 BACKGROUND

This section provides a brief overview of various maintenance models. The reader interested in extensive reviews of the subject is referred to the survey papers by Dekker (1996), Pham and Wang (1996) and Wang (2002). In the following, we discuss (1) maintenance classification, (2) maintenance models, and (3) maintenance policies.

2.1 Types and degrees of maintenance

Maintenance refers to the set of all technical and administrative actions intended to maintain a system in or restore it to a state in which it can perform at least part of its intended function(s) [Dekker, 1996].
Maintenance type can be classified into two main categories: corrective maintenance and preventive maintenance [Pham and Wang, 1996]. Corrective maintenance (CM), also referred to as repair or run-to-failure (RTF), refers to maintenance activities performed after a system has failed in order to restore its functionality.
Preventive maintenance (PM) refers to planned maintenance activities performed while the system is still operational. Its aim is to retain the system in some desired operational condition by preventing (or delaying) failures. Preventive maintenance is further sub-divided into clock-based, age-based, and condition-based, according to what triggers maintenance activities [Rausand and Høyland, 2004]:

– Clock-based maintenance is scheduled at specific calendar times; its periodicity is preset irrespective of the system’s condition (e.g., every Tuesday).
– Age-based maintenance is performed at operating time intervals or operating cycles of the system (e.g., every 500 on/off cycles, or every 4,000 hours of flight).
– Condition-based maintenance is triggered when the measurement of a condition or state of the system reaches a threshold that reflects some degradation and loss of performance of a system (but not yet a failure). Condition-based maintenance is also referred to as predictive maintenance.

Opportunistic maintenance encompasses both corrective and preventive maintenance and is relevant for multi-unit systems with economic and functional dependencies in which the failure of one unit, and hence its corrective maintenance, offers an opportunity to perform preventive maintenance on other still functional units.
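The three preventive-maintenance triggers differ only in the signal they monitor. A minimal sketch (function names and threshold values are ours, purely illustrative) makes the distinction explicit:

```python
# Sketch: the three PM triggers of Section 2.1 as predicates.
# All names and numeric thresholds are illustrative, not from the paper.

def clock_based_due(calendar_day: int, period_days: int = 7) -> bool:
    """Scheduled at fixed calendar times (e.g. every Tuesday),
    irrespective of the system's condition."""
    return calendar_day % period_days == 0

def age_based_due(operating_hours: float, interval_hours: float = 4000.0) -> bool:
    """Triggered by accumulated operating time or cycles
    (e.g. every 4,000 hours of flight)."""
    return operating_hours >= interval_hours

def condition_based_due(degradation: float, threshold: float = 0.8) -> bool:
    """Triggered when a measured condition crosses a degradation
    threshold (but before outright failure)."""
    return degradation >= threshold

print(clock_based_due(14), age_based_due(3500.0), condition_based_due(0.85))
# True False True
```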
Each type of maintenance can be further classified according to the degree to which it restores the system [Pham and Wang, 1996]. At one end of the spectrum, perfect maintenance restores the system to its initial operating condition or renders it ‘‘as good as new’’. At the other end of the spectrum, minimal repair returns the system to the condition it was in immediately prior to failing (in the case of corrective maintenance), or ‘‘as bad as old’’. In between these extremes lies imperfect maintenance, which returns the system to a condition somewhere in between as good as new and as bad as old. Finally, there is also the possibility that maintenance leaves the system in a worse condition than before the failure, through, for example, erroneous actions such as damaging adjacent parts while replacing a faulty unit.

2.2 Maintenance models

Models used to derive optimal maintenance policies generally cover four main aspects [Dekker, 1996]: (1) a description of the system being maintained; (2) a model of how the system deteriorates and the consequences thereof; (3) a description of the available information on the system and the available response options; and (4) an objective function and an analytical framework (or tools) according to which the optimal maintenance policy is to be derived. This section reviews the four main classes of maintenance models, following the reviews in Pham and Wang (1996), Doyen and Gaudoin (2004), Tan and Raghavan (in press), and Wang (2002).
The first class of models developed considered the possibility only for perfect or minimal repair [Nakagawa, 1979a, b; Pham and Wang 1996]. Thus, following maintenance, the system is returned to as good as new with some repair probability p, or to as bad as old with probability (1 − p). This basic concept is then expanded to take into account time-dependent repair probabilities, the possibility that maintenance causes the system to be scrapped or to transition to some intermediate state, and non-negligible repair times (and hence non-negligible downtime losses).
The second class of models considers maintenance as improving the failure rate or intensity, and thus allows the possibility of imperfect maintenance [Block et al., 1985; Pham and Wang 1996]. It is assumed that maintenance provides a fixed or proportional reduction in failure rate, or that it returns the system to the failure rate curve at some time prior to the maintenance activity. Perfect maintenance returns the failure rate to that of a new system; minimal maintenance returns it to that of the system immediately prior to the failure. The improvement factor is the degree of improvement of the failure rate; it is determined based on historical data, experiment, expert judgment, or by assuming it correlates with maintenance cost, system age, or the number of prior maintenance activities [Malik, 1979; Pham and Wang 1996].
The third class of models views maintenance as reducing the virtual age of the system [Kijima et al., 1988]. It is assumed that maintenance reduces the age of the system by some proportion (assuming increasing failure rate, which implies among other things that the system exhibits no infant mortality). Perfect maintenance returns the system virtual age to zero, while minimal maintenance returns the virtual age to the age immediately prior to the failure. Kijima et al.’s (1988) original model allowed only a reduction to the virtual age of the system following the previous repair effort, though larger reductions in virtual age can be seen as resulting from more extensive maintenance efforts. Pham and Wang (1996a) assume that maintenance time increases with subsequent repairs and consider the reduction in virtual age as decreasing over time—that is, repairs become successively less effective over time.
The fourth class of models considers system failures as manifesting as some level of damage or degradation in response to a shock. These models are therefore referred to as shock models. Perfect maintenance then reduces the damage to zero, minimal maintenance returns the damage level to that immediately prior to the failure, and imperfect maintenance reduces the damage by some factor greater than zero and less than 100%. These models also allow the possibility for less-effective and more expensive repairs over time by making the reduction in damage a decreasing function of time and by successively increasing the duration of maintenance activities over time [Wang and Pham, 1996a, b].
In each case these models have been used primarily to derive maintenance policies that minimize cost or downtime, or that maximize system availability, as we discuss in the next sub-section. In Sections 3 and 4 we show how a simple model based on aspects of these models can be used to quantify the value of maintenance.

2.3 Maintenance policies

Maintenance policies describe what types of maintenance (repair, replacement, etc.) are considered in response to what types of events (failure, calendar time, machine cycles, etc.). In the following, we confine our discussion to maintenance policies for single-unit systems with increasing failure rates.
One popular maintenance policy is age-dependent preventive maintenance where a system is repaired or replaced at a pre-determined ‘‘age’’ [Wang, 2002]. The triggering of maintenance in this case may be predetermined based on machine time (e.g., every 10,000 cycles) or on time elapsed since the last maintenance
activity. Under a random age-dependent maintenance policy, maintenance is performed based on age and system availability. This policy takes account of the fact that systems may not be available for maintenance in the middle of a production run, for example. A further extension of age-dependent replacement is failure-dependent replacement where the system is repaired in response to failures and replaced when a given number of failures has occurred, or at a given time, whichever occurs first [Nakagawa, 1984]. Many other variations on the theme of age-dependent maintenance have been proposed; see Wang (2002) for an extensive review.
An alternative family of maintenance policies, referred to as periodic preventive maintenance, is based on calendar time. Here maintenance occurs on failure and periodically regardless of the failure or operating history of the system [Wang, 2002]. Variations on this theme are developed by selecting degrees of repair from minimal to perfect at specific times or in response to failures. Further variations are developed by incorporating the failure or operating history of the system. For example, the level of maintenance may be dependent on the number of previous repairs [Wang and Pham, 1999].
Sequential preventive maintenance can be seen as a variation of periodic PM where the interval between PM activities is not constant. For example, the PM interval may be decreased as the system ages, so that the system does not exceed a certain operating time without maintenance [Wang, 2002; Nguyen and Murthy, 1981].

3 THE VALUE PERSPECTIVE IN DESIGN, OPERATIONS AND MAINTENANCE: A QUALITATIVE DISCUSSION

The present work builds on the premise that engineering systems are value-delivery artifacts that provide a flow of services (or products) to stakeholders. When this flow of services is ‘‘priced’’ in a market, this pricing or ‘‘rent’’ of these system’s services allows the assessment of the system’s value, as will be discussed shortly. In other words, the value of an engineering system is determined by the market assessment of the flow of services the system provides over its lifetime. We have developed this perspective in a number of previous publications; for further details, the interested reader is referred to for example Saleh et al. (2003), Saleh and Marais (2006), or Saleh (2008).
In this paper, we extend our value-centric perspective on design to the case of maintenance. Our argument is based on four key components:

1. First, we consider systems that deteriorate stochastically and exhibit multi-state failures, and we model their state evolution using Markov chains and directed graphs.
2. Second, we consider that the system provides a flow of service per unit time. This flow in turn is ‘‘priced’’ and a discounted cash flow is calculated, resulting in a Present Value (PV) for each branch of the graph—or ‘‘value trajectory’’ of the system.
3. Third, given our previous two points, it is straightforward to conceive of the following: as the system ages or deteriorates, it migrates towards lower PV branches of the graph, or lower value trajectories.
4. Finally, we conceptualize maintenance as an operator (in a mathematical sense) that raises the system to a higher PV branch in the graph, or to a higher value trajectory. We refer to the value of maintenance, or more specifically the Present Value (PV) of maintenance, as the incremental Present Value between the pre- and post-maintenance branches of the graphs.

In the following section, we set up the analytical framework that corresponds to this qualitative discussion.

4 MAINTENANCE AND PRESENT VALUE BRANCHES

In developing our value model of maintenance, we make a number of simplifying assumptions to keep the focus on the main argument of this work. These assumptions affect the particular mechanics of our calculations but bear no impact on the main results, as will be shown shortly. Our assumptions are the following:

1. We restrict ourselves to the case of perfect maintenance; in addition we assume that maintenance does not change the system’s deterioration mechanism.
2. We restrict ourselves to the case of single-unit systems.
3. We only consider systems that exhibit an increasing failure rate. In other words, as our systems age, they become more likely to deteriorate in the absence of perfect maintenance.
4. The systems in our model can be in a finite number of discrete states, and the current state depends only on the prior state, though the state transition probabilities may be time-dependent. This assumption allows us to model the state evolution of the system as a Markov process.
5. The systems in our model have no salvage value at replacement or end of life.
6. Finally, for simulation purposes, we consider discrete time steps, and assume that the duration of maintenance activities is negligible compared with the size of these time steps.
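The first two components above can be sketched concretely. The snippet below (our illustration; it borrows the three-state transition matrix used later in the numerical example of Section 4.1) enumerates the branches of the directed graph, that is, the candidate ‘‘value trajectories’’, and checks that their probabilities form a distribution:

```python
# Sketch of components 1-2: enumerate the branches ("value trajectories")
# of a three-state deteriorating system over N transitions, and their
# probabilities. States are 0 = new, 1 = deteriorated, 2 = failed.

P = [[0.95, 0.04, 0.01],   # upper-triangular: no self-healing
     [0.00, 0.90, 0.10],
     [0.00, 0.00, 1.00]]   # failed state is absorbing
N = 4

def branches(state=0, path=(0,), depth=N):
    """Yield every state sequence reachable in `depth` transitions."""
    if depth == 0:
        yield path
        return
    for nxt in range(3):
        if P[state][nxt] > 0.0:          # only reachable transitions
            yield from branches(nxt, path + (nxt,), depth - 1)

def prob(path):
    """Probability of a branch = product of its transition probabilities."""
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= P[a][b]
    return p

all_b = list(branches())
total = sum(prob(b) for b in all_b)
print(len(all_b), round(total, 10))  # 15 1.0
```

Because the matrix is upper-triangular, every branch is a non-decreasing state sequence; the branch probabilities necessarily sum to one.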
These assumptions will be relaxed in future work. In the following, we consider first how a system deteriorates under no-maintenance and introduce the concept of ‘‘value trajectories’’ of the system. Next, we show how maintenance moves the system onto higher present value trajectories. Finally, we develop expressions for the present value of maintenance. We illustrate the framework by means of numerical examples.

4.1 Deterioration under no-maintenance, and value trajectories

We consider a k-state discrete-time Markov deteriorating system with time-dependent transition probabilities as shown in Figure 1, for the no-maintenance case with three states. The states are numbered from 1 through k in ascending order of deterioration, where state 1 is the ‘‘new’’ state and state k is the failed state. The time-dependence allows us to take account of the fact that a new (or deteriorated) system will become more likely to transition to the deteriorated (or failed) state as it ages (time-dependence implies dependence on the virtual age of the system). With no maintenance the failed state is an absorbing state whence it is not possible to transition to either of the other states. Further, it is not possible to transition from the deteriorated state to the new state without performing maintenance. In other words, the system can only transition in one direction, from new to failed, perhaps via the deteriorated state (but the system has no self-healing properties).

Figure 1. Three-state model of system with no maintenance. [States new, deteriorated and failed, with transition probabilities pnn(i), pnd(i), pnf(i), pdd(i) and pdf(i).]

Figure 2. Three-state system evolution over time with no maintenance. [Directed graph of the branches from the new state, with time-dependent transition probabilities p11(i), p12(i), p13(i), p22(i), p23(i); the branches run from B1 down to Bworst.]

The transition matrix for a system with k states and no self-healing is given by:

       ⎡ p11(i)  p12(i)  · · ·  p1k(i) ⎤
P(i) = ⎢ 0       p22(i)  · · ·  p2k(i) ⎥    (1)
       ⎢ ..      ..      ..     ..     ⎥
       ⎣ 0       0       · · ·  1      ⎦

i is the index of the time step considered, and P(i) is in effect P(ti) in which ti = iΔT. For simplification purposes, we retain only the index i in our notation.
The transition probabilities can be derived from failure rates as shown by [Macke and Higuchi, 2007].
We represent the evolution of the system over time using a directed graph, as shown in Figure 2 for a three-state system. This representation expands on Figure 1 and allows in effect an easy read of the time-dependent transition probabilities, which is difficult to visualize using the traditional Markov chain representation (Figure 1).
We assume that the probability of transitioning to a lower state increases over time, and correspondingly that the probability of remaining in a given state (other than the failed state) decreases over time:

p11(0) ≥ p11(i) ≥ p11(i + j)    i, j ≥ 1
pmn(0) ≤ pmn(i) ≤ pmn(i + j)    1 ≤ m < n ≤ k    (2)

If we define π0 to be the initial probability distribution of the system, the probability distribution after j state transitions is:

πj = Pj · · · P2 P1 π0    (3)
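Equations 1–3 can be exercised with a short sketch (ours; the numeric transition probabilities are made up, chosen only to satisfy the monotonicity conditions of Eq. 2):

```python
# Sketch of the deterioration model of Eqs. 1-3 for k = 3 states.
# The transition probabilities below are illustrative only; they are
# upper-triangular (no self-healing) and worsen with i, as in Eq. 2.

def P(i):
    """Time-dependent transition matrix P(i); row m is the 'from' state."""
    p11 = max(0.95 - 0.01 * i, 0.0)          # p11 non-increasing in i
    p22 = max(0.90 - 0.02 * i, 0.0)          # p22 non-increasing in i
    return [[p11, 0.8 * (1 - p11), 0.2 * (1 - p11)],
            [0.0, p22, 1 - p22],
            [0.0, 0.0, 1.0]]                 # failed state is absorbing

def propagate(pi0, j):
    """pi_j = P_j ... P_2 P_1 pi_0 (Eq. 3), applied step by step."""
    pi = pi0
    for i in range(1, j + 1):
        M = P(i)
        pi = [sum(pi[n] * M[n][m] for n in range(3)) for m in range(3)]
    return pi

pi4 = propagate([1.0, 0.0, 0.0], 4)
print(pi4)  # probability mass gradually migrates toward the failed state
```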
For convenience we assume that the system is initially in the new state, that is:

π0 = [1  0  · · ·  0]    (4)

Next, we consider that the system can generate um(t) revenue per unit time when it is in state m; a degraded system having lower capacity to provide services (hence generate revenues) than a fully functional system. This um(t) is the expected utility model of the system or the price of the flow of service it can provide over time. We discretize time into small ΔT bins over which um(t) can be considered constant. Therefore

um(i) = um(iΔT) ≈ um[(i + 1)ΔT]    (5)

To simplify the indexing, we consider that the revenues the system can generate between (i − 1)ΔT and iΔT are equal to um(i) · ΔT.
Each branch of the graph represents a particular value trajectory of the system, as discussed below. Each time step, the system can remain fully operational, or it can transition to a degraded or failed state. A branch in the graph is characterized by the set of states the system can ‘‘visit’’ for all time periods considered. For example, the branch B1 = {1, 1, 1, 1, . . ., 1} represents the system remaining in State 1 throughout the N periods considered, whereas Bj = {1, 1, 1, 2, 2, . . ., 2} represents a system starting in State 1 and remaining in this state for the first two periods, then transitioning to a degraded state (here State 2) at the third period and remaining in this particular degraded state. Notice that the branch Bworst = {1, k, k, . . ., k} represents a new system transitioning to the failed state at the first transition.
Since each state has an associated utility um(i), a Present Value can be calculated for each branch over N periods as follows:

PV(N, Bj) = Σ_{i=1..N} uBj(i) / (1 + rΔT)^i    (6)

uB(i) is shorthand for the revenues along the branch B. rΔT is the discount rate adjusted for the time interval ΔT. Notice that Branch 1 has the highest Present Value since the system remains fully functional throughout the N periods (in State 1), and branch Bworst has the lowest Present Value (PVworst = 0) since the system starts its service life by failing!
The likelihood of the system following a particular branch over N steps is given by the product of the transition probabilities along that branch:

p(Bj) = Π_{i=1..N} pi(Bj)    (7)

The right side of Eq. 7 is shorthand for the product of the transition probabilities along the branch Bj.
Finally, the expected Present Value of the system over all the branches is calculated by weighting the Present Value of each branch by its likelihood:

PV(N) = Σ_{all branches} p(Bj) · PV(N, Bj)    (8)

Using the Markov chain terminology, this present value can also be expressed as:

PV(N) = Σ_{i=1..N} Σ_{m=1..k} um(i) πi(m) / (1 + rΔT)^i    (9)

The following simple numerical example will help clarify this discussion and these equations.
Numerical example. Consider a three-state system with the following transition matrix:

    ⎡ .95  .04  .01 ⎤
P = ⎢ 0    .9   .1  ⎥
    ⎣ 0    0    1   ⎦

The system has a constant revenue model over N periods, and can generate $100,000/year in State 1, $60,000/year in the degraded State 2, and $0 in the failed State 3. Assume a yearly discount rate of 10%. Next, consider the branches of the system evolution defined by the transitions in Table 1.
Applying Eq. 6 to this simple case, we obtain the value trajectories and probabilities shown in Figure 3. The expected Present Value of this system across all branches is given by Eq. 8:

PV(N) = $296,672

In the next sub-section, we show how maintenance moves the system onto higher-value trajectories and therefore increases the expected present value of the system.

4.2 Deterioration, Maintenance, and Present Value

Maintenance compensates for the lack of self-healing in the system, and, from a modeling perspective, given a three-state system, maintenance allows the system to
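This result can be checked with a short script. The sketch below (plain Python; all numbers are those of the example above, and we assume N = 4 yearly periods so that r·ΔT = 0.10 per period) evaluates Eq. 9 by propagating the state distribution, and Eqs. 6 and 7 for individual branches of Table 1:

```python
# Sketch of Eqs. 6, 7 and 9 for the three-state numerical example.
# Assumption (ours): N = 4 one-year periods, discount rate 0.10 per period.

P = [[0.95, 0.04, 0.01],     # transition matrix of the example
     [0.00, 0.90, 0.10],
     [0.00, 0.00, 1.00]]
u = [100_000, 60_000, 0]     # revenue per year in States 1, 2, 3
r, N = 0.10, 4

# Expected PV via the state distribution (Eq. 9).
pi = [1.0, 0.0, 0.0]         # pi_0: system starts in the "new" state (Eq. 4)
pv = 0.0
for i in range(1, N + 1):
    pi = [sum(pi[n] * P[n][m] for n in range(3)) for m in range(3)]
    pv += sum(u[m] * pi[m] for m in range(3)) / (1 + r) ** i
print(round(pv))             # 296672, the value quoted in the text

# PV and likelihood of a single branch (Eqs. 6 and 7).
def branch_pv_and_prob(states):
    """states = {s0, s1, ..., sN} with 1-based state labels as in Table 1."""
    pv = sum(u[s - 1] / (1 + r) ** i for i, s in enumerate(states[1:], start=1))
    prob = 1.0
    for a, b in zip(states, states[1:]):
        prob *= P[a - 1][b - 1]
    return pv, prob

pv1, p1 = branch_pv_and_prob([1, 1, 1, 1, 1])  # B1
pv4, p4 = branch_pv_and_prob([1, 1, 1, 2, 2])  # B4
print(round(p1, 4), round(p4, 4))  # 0.8145 0.0325 (cf. 81.4% and 3.2% in Fig. 3)
```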
Table 1. Branches for a three-state system evolution over four periods.

Branch Transitions Comment

B1 {1, 1, 1, 1, 1} The system starts in state 1 and remains in this state throughout the four periods.
B4 {1, 1, 1, 2, 2} The system starts in state 1; it remains in State 1 for two periods, then transitions to
the degraded State 2 in the third period and remains in this State 2.
B8 {1, 1, 2, 2, 3} The system starts in State 1; it remains in State 1 for the first period, then transitions
to the degraded State 2 in the second period; it remains in this degraded state for
the third period, then transitions to the failed State 3 in the fourth period.

PV of Branch 1 PV of Branch 4 PV of Branch 8

$350,000 p(B 1 )=81.4% Perfect


$300,000
maintenance
p(B 4 )=3.2%
Present Value

$250,000
p(B 8 )=0.3%
$200,000
$150,000
$100,000
New
$50,000
$-
Period 1 Period 2 Period 3 Period 4

Perfect
Figure 3. Illustrative value trajectories for a three-state maintenance
system (branches defined in Table 1).

move back, (1) from a failed state to a deteriorated state (imperfect corrective maintenance), (2) from a failed state to a new state (perfect corrective maintenance), and (3) from a deteriorated state to a new state (perfect preventive maintenance). These maintenance-enabled transitions are shown in Figure 4.

Figure 4. Performing perfect maintenance returns the system to the NEW state.

In addition to returning the system to a higher functional state, maintenance provides another advantage: it modifies the time-dependent transition probabilities. In particular, the transition probabilities from the new state after perfect maintenance are equal to those of a new system. That is, performing perfect maintenance returns the system to the initial transition probabilities pij(0). Seen from a reliability viewpoint, perfect maintenance returns the system to the initial reliability curve, as shown in Figure 5: perfect maintenance shifts the reliability curve to the right. In the remainder of this work, we focus only on the case of perfect maintenance.
Figure 5. Impact of maintenance on reliability.

Figure 6 shows the Markov model for our simple system under the option of perfect maintenance. The dotted lines indicate the ''new'' transition probabilities assuming that perfect maintenance occurred at the end of the previous time step.
Between maintenance activities the system deteriorates according to:

πj = Pj · · · P2 P1 π0                (10)

where πj is the vector of probabilities of being in states 1 through k. The effect of perfect maintenance is to return the system to the initial probability distribution π0 and to the initial transition probability matrix P.

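The deterioration dynamics of Eq. 10, and the reset effected by perfect maintenance, can be sketched as follows; the three-state transition matrix is a hypothetical example, not the paper's:

```python
# Hypothetical one-period transition matrix for the three-state system
# (states: 1 = new, 2 = deteriorated, 3 = failed); each row sums to 1.
P = [
    [0.90, 0.08, 0.02],
    [0.00, 0.85, 0.15],
    [0.00, 0.00, 1.00],  # failed is absorbing in the absence of maintenance
]

def step(pi, P):
    """One application of Eq. 10: propagate the state distribution one period."""
    k = len(P)
    return [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]

pi = [1.0, 0.0, 0.0]  # pi_0: the system starts in the new state
for _ in range(4):    # four periods, as in Table 1
    pi = step(pi, P)

# Perfect maintenance restores the initial distribution pi_0 (and, in the
# time-dependent model, the initial transition matrices P1, P2, ...).
pi_after_perfect_maintenance = [1.0, 0.0, 0.0]
```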
Figure 6. Performing maintenance introduces new transition possibilities.

Figure 7. Maintenance moves the system to a higher PV branch and, in the case of perfect maintenance, restores the initial transition probabilities.

Thus we can model the transition assuming maintenance at time tm > t0 as:

πm−1 = Pm−1 · · · P2 P1 π0
πm = π0                (11)
πm+k = Pk · · · P2 P1 π0

In short, maintenance acts on two levers of value: (1) it lifts the system to a higher value trajectory, and (2) in the case of perfect maintenance, it restores the initial (more favorable) transition probabilities, which in effect ensures that the system is more likely to remain in the ''new'' state post-maintenance than pre-maintenance. The two effects of maintenance on the value of a system are shown in Figure 7 for a three-state system. The following simple examples will help further clarify this discussion.

Example 1: Perfect maintenance and the restoration of the initial transition probabilities

Consider Branch B1 in the state evolution of the system (see Figure 3 and Table 1); B1 as defined previously corresponds to a system remaining in State 1 (or the new state) throughout the time periods considered. Under no maintenance, as the system ages, the likelihood that it will remain on B1 decreases. For example, a system that remained on B1 until the end of the second period, as shown in Figure 8, will find its conditional probability of remaining on B1 (given that it was on B1 at the end of the second period) to be:

pnoM_2(B1) = p11(2) · p11(3) · ... · p11(n)                (12)

Whereas if perfect maintenance were performed on the system at the end of the second period, the initial transition probabilities are restored, and the conditional probability that the system will remain on B1 increases and is given by:

pM_2(B1) = p11(0) · p11(1) · ... · p11(n − 2)  and  pM_2(B1) > pnoM_2(B1)                (13)

This result is easily extended for perfect maintenance occurring at the end of any period j:

pM_j(B1) = p11(0) · p11(1) · ... · p11(n − j)  and  pM_j(B1) > pnoM_j(B1)                (14)

Figure 8 illustrates the system evolution over time under the maintenance and no-maintenance policies.

Example 2: The present value of maintenance

Consider the three-state system example discussed in Section 4.1, and assume that the system transitions to a degraded state at the end of the first period. If no maintenance is performed, the system will remain on the lower value branches shown in Figure 9 (lower box). The expected Present Value of these lower branches, calculated at the end of the first period, is given by

Figure 8. Perfect maintenance and the restoration of the initial transition probabilities. Without maintenance, Branch 1 ages through p11(1), p11(2), . . . , p11(n); with perfect maintenance occurring at the end of the second period, the branch restarts at p11(0) and runs through p11(n − 2).

Eq. 8. Using the numerical values provided in the previous example, we find:

PVno_maintenance_1 = $165,600

Next, assume that perfect maintenance is carried out at the end of period 1 after the system has degraded to State 2. Perfect maintenance lifts the system up to higher value trajectories, shown in the upper box in Figure 9. The expected Present Value of these upper branches, calculated at the end of the first period, after the occurrence of perfect maintenance, is given by Eq. 8. Using the numerical values provided in the previous example, we find:

PVmaintenance_1 = $242,500

Finally, consider the difference between the expected Present Value of the branches enabled by maintenance and the branches without maintenance:

ΔPV = PVmaintenance − PVno_maintenance                (15)

This incremental Present Value, which in this case is calculated at the end of period 1, is what we define as the Present Value of maintenance:

ΔPV1 = PVmaintenance_1 − PVno_maintenance_1 = $76,900

Since maintenance is the only cause of the system accessing these higher value trajectories after it has degraded, it is appropriate to ascribe this incremental Present Value to maintenance and term it ''the Present Value of maintenance''.

Figure 9. The incremental Present Value provided by maintenance. Maintenance lifts a degraded system from the lower value branches, with expected value PVno_maintenance_1, to the higher value (and more likely) branches, with expected value PVmaintenance_1; the difference is the PV provided by maintenance.

5 CONCLUSIONS

In this paper, we proposed that maintenance has intrinsic value and argued that existing cost-centric models ignore an important dimension of maintenance, namely its value, and in so doing, they can lead to sub-optimal maintenance strategies. We argued that the process for determining a maintenance strategy should involve both an assessment of the value of maintenance—how much is it worth to the system's stakeholders—and an assessment of the costs of maintenance.

We developed a framework for capturing and quantifying the value of maintenance activities by connecting an engineering and operations research concept, system state, with a financial and managerial concept, the Present Value (PV).

By identifying the impact of system state or condition on PV, the framework and analyses developed in this paper provide (financial) information for decision-makers to support in part the maintenance strategy development. Maintenance, as a consequence, should not be conceived as ''just an operational matter'' guided by purely operational concerns, but by multidisciplinary considerations involving the marketing, finance, and operations functions within a company. We are currently expanding this framework to derive the Net Present Value of maintenance, defined as the PV of maintenance minus the cost of maintenance.

In this work we made several simplifying assumptions. In particular, we considered ''value'' as the net revenue generated by the system over a given planning horizon. We did not include additional dimensions of value such as the potential positive effects of maintenance on environmental or health impacts. Such effects can be incorporated in future work. Further, we did not include factors such as safety or regulatory requirements. Such factors can be easily included as constraints on the optimisation of value in future work.

One important implication of this work is that the maintenance strategy should be tied to market conditions and the expected utility profile (or revenue

generating capability of the system). In other words, unlike traditional maintenance strategies, which are ''fixed'' once devised, a value-optimal maintenance strategy is dynamic and can change with environmental and market conditions.
Finally, we believe that the framework presented here offers rich possibilities for future work in benchmarking existing maintenance strategies based on their value implications, and in deriving new maintenance strategies that are ''value-optimized.''

REFERENCES

Block, H.W., Borges, W.S. and Savits, T.H. 1985. Age-dependent minimal repair. Journal of Applied Probability 22(2): 370–385.
Brealy, R. and Myers, C. 2000. Fundamentals of Corporate Finance. 6th Ed. New York: Irwin/McGraw-Hill.
Dekker, R. 1996. Applications of maintenance optimization models: a review and analysis. Reliability Engineering and System Safety 51: 229–240.
Doyen, L. and Gaudoin, O. 2004. Classes of imperfect repair models based on reduction of failure intensity or virtual age. Reliability Engineering and System Safety 84: 45–56.
Hilber, P., Miranda, V., Matos, M.A. and Bertling, L. 2007. Multiobjective optimization applied to maintenance policy for electrical networks. IEEE Transactions on Power Systems 22(4): 1675–1682.
Kijima, M., Morimura, H. and Suzuki, Y. 1988. Periodical replacement problem without assuming minimal repair. European Journal of Operational Research 37(2): 194–203.
Malik, M.A.K. 1979. Reliable preventive maintenance scheduling. AIIE Transactions 11(3): 221–228.
Marais, K., Lukachko, S., Jun, M., Mahashabde, A. and Waitz, I.A. 2008. Assessing the impact of aviation on climate. Meteorologische Zeitschrift 17(2): 157–172.
Nakagawa, T. 1979a. Optimum policies when preventive maintenance is imperfect. IEEE Transactions on Reliability 28(4): 331–332.
Nakagawa, T. 1979b. Imperfect preventive-maintenance. IEEE Transactions on Reliability 28(5): 402–402.
Nakagawa, T. 1984. Optimal policy of continuous and discrete replacement with minimal repair at failure. Naval Research Logistics Quarterly 31(4): 543–550.
Nguyen, D.G. and Murthy, D.N.P. 1981. Optimal preventive maintenance policies for repairable systems. Operations Research 29(6): 1181–1194.
Pham, H. and Wang, H. 1996. Imperfect maintenance. European Journal of Operational Research 94: 425–438.
Rausand, M. and Høyland, A. 2004. System Reliability Theory: Models, Statistical Methods, and Applications. 2nd Ed. New Jersey: Wiley–Interscience.
Rosqvist, T., Laakso, K. and Reunanen, M. 2007. Value-driven maintenance planning for a production plant. Reliability Engineering & System Safety. In Press, Corrected Proof.
Saleh, J.H. 2008. Flawed metrics: satellite cost per transponder and cost per day. IEEE Transactions on Aerospace and Electronic Systems 44(1).
Saleh, J.H. 2008. Durability choice and optimal design lifetime for complex engineering systems. Journal of Engineering Design 19(2).
Saleh, J.H., Lamassoure, E., Hastings, D.E. and Newman, D.J. 2003. Flexibility and the value of on-orbit servicing: a new customer-centric perspective. Journal of Spacecraft and Rockets 40(1): 279–291.
Saleh, J.H. and Marais, K. 2006. Reliability: how much is it worth? Beyond its estimation or prediction, the (net) present value of reliability. Reliability Engineering and System Safety 91(6): 665–673.
Sinden, J.A. and Worrell, A.C. 1979. Unpriced Values: Decisions Without Market Prices. New York: Wiley–Interscience.
Tan, C.M. and Raghavan, N. In Press. A framework to practical predictive maintenance modeling for multi-state systems. Reliability Engineering & System Safety. Corrected Proof, available online 21 September 2007. DOI: 10.1016/j.ress.2007.09.003.
Waeyenbergh, G. and Pintelon, L. 2002. A framework for maintenance concept development. International Journal of Production Economics 77(3): 299–313.
Wang, H. 2002. A survey of maintenance policies of deteriorating systems. European Journal of Operational Research 139: 469–489.
Wang, H. and Pham, H. 1996. Optimal maintenance policies for several imperfect repair models. International Journal of Systems Science 27(6): 543–549.
Wang, H. and Pham, H. 1996. Optimal age-dependent preventive maintenance policies with imperfect maintenance. International Journal of Reliability, Quality and Safety Engineering 3(2): 119–135.
Wang, H. and Pham, H. 1999. Some maintenance models and availability with imperfect maintenance in production systems. Annals of Operations Research 91: 305–318.


The maintenance management framework: A practical view to maintenance management

A. Crespo Márquez, P. Moreu de León, J.F. Gómez Fernández, C. Parra Márquez & V. González
Department Industrial Management, School of Engineering, University of Seville, Spain

ABSTRACT: The objective of this paper is to define a process for maintenance management and to classify
maintenance engineering techniques within that process. Regarding the maintenance management process, we
present a generic model proposed for maintenance management which integrates other models found in the
literature for built and in-use assets, and consists of eight sequential management building blocks. The different
maintenance engineering techniques play a crucial role within each one of those eight management building blocks. Following this path we characterize the ''maintenance management framework'', i.e. the supporting
structure of the management process.
We offer a practical vision of the set of activities composing each management block, and the result of the
paper is a classification of the different maintenance engineering tools. The discussion can also classify them as qualitative or quantitative. At the same time, some tools are highly analytical while
others are highly empirical. The paper also discusses the proper use of each tool or technique according to
the volume of data/information available.

1 THE MAINTENANCE MANAGEMENT PROCESS

The maintenance management process can be divided into two parts: the definition of the strategy, and the strategy implementation. The first part, definition of the maintenance strategy, requires the definition of the maintenance objectives as an input, which will be derived directly from the business plan. This initial part of the maintenance management process conditions the success of maintenance in an organization, and determines the effectiveness of the subsequent implementation of the maintenance plans, schedules, controls and improvements. Effectiveness shows how well a department or function meets its goals or company needs, and is often discussed in terms of the quality of the service provided, viewed from the customer's perspective. This will allow us to arrive at a position where we will be able to minimize the maintenance indirect costs [3], those costs associated with production losses and, ultimately, with customer dissatisfaction. In the case of maintenance, effectiveness can represent the overall company satisfaction with the capacity and condition of its assets [4], or the reduction of the overall company cost obtained because production capacity is available when needed [5]. Effectiveness concentrates then on the correctness of the process and whether the process produces the required result.

The second part of the process, the implementation of the selected strategy, has a different significance level. Our ability to deal with the maintenance management implementation problem (for instance, our ability to ensure proper skill levels, proper work preparation, suitable tools and schedule fulfilment) will allow us to minimize the maintenance direct cost (labour and other required maintenance resources). In this part of the process we deal with the efficiency of our management, which is of lesser importance than effectiveness. Efficiency is acting or producing with minimum waste, expense, or unnecessary effort; it is then understood as providing the same or better maintenance for the same cost.

In this paper we present a generic model for maintenance management which integrates other models found in the literature (see for instance [6,7]) for built and in-use assets, and consists of eight sequential management building blocks, as presented in Figure 1. The first three building blocks condition maintenance effectiveness, the fourth and fifth ensure maintenance efficiency, blocks six and seven are devoted to maintenance and assets life cycle cost assessment, and, finally, block number eight ensures continuous maintenance management improvement.

Figure 1. Maintenance management model. Eight sequential phases: Phase 1, definition of the maintenance objectives and KPIs; Phase 2, assets priority and maintenance strategy definition; Phase 3, immediate intervention on high impact weak points; Phase 4, design of the preventive maintenance plans and resources; Phase 5, preventive plan, schedule and resources optimization; Phase 6, maintenance execution assessment and control; Phase 7, asset life cycle analysis and replacement optimization; Phase 8, continuous improvement and new techniques utilization.

Figure 2. Sample of techniques within the maintenance management framework. Phase 1: Balanced Scorecard (BSC); Phase 2: Criticality Analysis (CA); Phase 3: Failure Root Cause Analysis (FRCA); Phase 4: Reliability-Centred Maintenance (RCM); Phase 5: Optimization (RCO); Phase 6: Reliability Analysis (RA) and Critical Path Method (CPM); Phase 7: Life Cycle Cost Analysis (LCCA); Phase 8: Total Productive Maintenance (TPM) and e-maintenance.
2 MAINTENANCE MANAGEMENT
FRAMEWORK

In this section, we will briefly introduce each block and discuss methods that may be used to improve each building block's decision making process (see Figure 2).
Regarding the Definition of Maintenance Objectives and KPIs (Phase 1), it is common that the operational objectives and strategy, as well as the performance measures, are inconsistent with the declared overall business strategy [8]. This unsatisfactory situation can indeed be avoided by introducing the Balanced Scorecard (BSC) [9]. The BSC is specific to the organization for which it is developed and allows the creation of key performance indicators (KPIs) for measuring maintenance management performance which are aligned to the organization's strategic objectives (see Figure 3).

Figure 3. From KPIs to functional indicators [2]. A top-level KPI such as maintenance cost per unit produced (7%) is decomposed into functional indicators such as PM accomplishment compliance (98%), data integrity (95%) and criticality analysis carried out every 6 months.

Unlike conventional measures, which are control
oriented, the Balanced Scorecard puts overall strategy
and vision at the centre and emphasizes achieving performance targets. The measures are designed
to pull people toward the overall vision. They are identified and their stretch targets established through a participative process which involves the consultation of internal and external stakeholders, senior management, key personnel in the operating units of the maintenance function, and the users of the maintenance service. In this manner, the performance measures for the maintenance operation are linked to the business success of the whole organization [10].
Once the Maintenance Objectives and Strategy are defined, there are a large number of quantitative and qualitative techniques which attempt to provide a systematic basis for deciding what assets should have priority within a maintenance management process (Phase 2), a decision that should be taken in accordance with the existing maintenance strategy. Most of the quantitative techniques use a variation of a concept known as the ''probability/risk number'' (PRN) [11]. Assets with the higher PRN will be analysed first. Often, the number of assets potentially at risk outweighs the resources available to manage them. It is therefore extremely important to know where to apply

available resources to mitigate risk in a cost-effective and efficient manner. Risk assessment is the part of the ongoing risk management process that assigns relative priorities for mitigation plans and implementation. In professional risk assessments, risk combines the probability of an event occurring with the impact that event would cause. The usual measure of risk for a class of events is then R = P × C, where P is probability and C is consequence. The total risk is therefore the sum of the individual class-risks (see risk/criticality matrix in Figure 4).
Risk assessment techniques can be used to prioritize assets and to align maintenance actions to business targets at any time. By doing so we ensure that maintenance actions are effective and that we reduce the indirect maintenance costs, the most important maintenance costs, those associated with safety, environmental risk, production losses and, ultimately, customer dissatisfaction.

Figure 4. Criticality matrix and assets location. Frequency is plotted against consequence, and regions of the matrix mark critical, semi-critical and non-critical assets.

Figure 5. Qualitative criticality assessment [2]. Assets are scored qualitatively on factors such as maintainability, reliability, delivery, working time, quality, safety and environment.
The procedure to follow in order to carry out an assets criticality analysis following risk assessment techniques could then be depicted as follows:

1. Define the purpose and scope of the analysis;
2. Establish the risk factors to take into account and their relative importance;
3. Decide on the number of asset risk criticality levels to establish;
4. Establish the overall procedure for the identification and prioritization of the critical assets.

Notice that assessing criticality will be specific to each individual system, plant or business unit. For instance, the criticality of two similar plants in the same industry may be different, since risk factors for both plants may vary or have different relative importance. On some occasions there is no hard data about historical failure rates, but the maintenance organization may require a certain gross assessment of assets priority to be carried out. In these cases, qualitative methods (see example in Figure 5) may be used and an initial assets assessment, as a way to start building maintenance operations effectiveness, may be obtained.
Once there is a certain definition of assets priority, we have to set up the strategy to be followed with each category of assets. Of course, this strategy will be adjusted over time, but an initial starting point must be stated. As mentioned above, once there is a certain ranking of assets priority, the strategy to follow with each category of assets will be adjusted over time, and will consist of a course of action to address specific issues for the emerging critical items under the new business conditions (see Figure 6).
Once the assets have been prioritized and the maintenance strategy to follow defined, the next step would be to develop the corresponding maintenance actions associated with each category of assets. Before doing so, we may focus on certain repetitive—or chronic—failures that take place in high priority items (Phase 3).

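The PRN-style prioritization described above can be sketched as follows; the asset names, scores and criticality thresholds are hypothetical:

```python
# Risk-based prioritization using a probability/risk number: R = P x C.
# The asset data, scores and thresholds below are hypothetical.
assets = {
    "pump_A":     {"probability": 0.30, "consequence": 40},
    "reactor_B":  {"probability": 0.05, "consequence": 90},
    "conveyor_C": {"probability": 0.60, "consequence": 10},
}

def risk_number(asset):
    """R = P x C: probability of the event times its consequence."""
    return asset["probability"] * asset["consequence"]

# Assets with the higher risk number are analysed first.
ranked = sorted(assets, key=lambda name: risk_number(assets[name]), reverse=True)

def criticality(r):
    """Map a risk number onto hypothetical criticality levels (step 3)."""
    if r >= 10:
        return "critical"
    if r >= 5:
        return "semi-critical"
    return "non-critical"

for name in ranked:
    print(name, risk_number(assets[name]), criticality(risk_number(assets[name])))
```

The total plant risk, in this view, is simply the sum of the individual class risk numbers.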
Figure 6. Example of maintenance strategy definition for different category assets [2]. Category A: reach optimal reliability, maintainability and availability levels; category B: ensure certain equipment availability levels; category C: sustain or improve the current situation.

Figure 7. RCM implementation process. Initial phase: RCM team conformation, operational context definition, criticality analysis and asset selection. Implementation phase: functions, functional failures, failure modes and effects of failure modes are analysed with a Failure Mode and Effects Analysis (FMEA), the tool to answer the first five RCM questions, and the RCM logic is applied to answer the last two. Final phase: maintenance plan documentation.

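The RCM flow of Figure 7 can be sketched as FMEA records feeding a task-selection rule; the records and the two-question logic below are hypothetical simplifications, not the full RCM decision diagram:

```python
# Hypothetical FMEA records feeding a simplified RCM task-selection logic.
fmea = [
    {"function": "pump water at 10 m3/h", "failure_mode": "bearing seizure",
     "effect": "total loss of flow", "hidden": False, "predictable_wear": True},
    {"function": "isolate line on demand", "failure_mode": "valve stuck open",
     "effect": "no isolation on demand", "hidden": True, "predictable_wear": False},
]

def select_task(entry):
    """Toy stand-in for the RCM decision logic (the 'last two questions')."""
    if entry["hidden"]:
        return "failure-finding inspection"
    if entry["predictable_wear"]:
        return "scheduled restoration/condition monitoring"
    return "run-to-failure (corrective)"

plan = {e["failure_mode"]: select_task(e) for e in fmea}
print(plan)
```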
Figure 8. Obtaining the PM schedule. A Monte Carlo model combines equipment status and functional dependencies, optimality criteria, failure dynamics, system constraints and work in process to produce the preventive maintenance plan and PM schedule.

Finding and eliminating, if possible, the causes of those failures could be an immediate intervention providing a fast and important initial payback of our maintenance management strategy. The entire and detailed equipment maintenance analysis and design could be accomplished, reaping the benefits of this intervention if successful.
There are different methods developed to carry out this weak point analysis, one of the most well known being root-cause failure analysis (RCFA). This method consists of a series of actions taken to find out why a particular failure or problem exists and to correct those causes. Causes can be classified as physical, human or latent. The physical cause is the reason why the asset failed, the technical explanation of why things broke
or failed. The human cause includes the human errors
(omission or commission) resulting in physical roots.
Finally, the latent cause includes the deficiencies in the management systems that allow the human errors to continue unchecked (flaws in the systems and procedures). Latent failure causes will be our main concern at this point of the process.
Designing the preventive maintenance plan for a certain system (Phase 4) requires identifying its functions, the way these functions may fail, and then establishing a set of applicable and effective preventive maintenance tasks, based on considerations of system safety and economy. A formal method to do this is Reliability Centred Maintenance (RCM), as in Figure 7.
Optimization of maintenance planning and scheduling (Phase 5) can be carried out to enhance the effectiveness and efficiency of the maintenance policies resulting from an initial preventive maintenance plan and program design.
Models to optimize maintenance plans and schedules will vary depending on the time horizon of the analysis. Long-term models address maintenance capacity planning, spare parts provisioning and the maintenance/replacement interval determination problems; mid-term models may address, for instance, the scheduling of the maintenance activities in a long plant shut down; while short term models focus on resources allocation and control [13]. Modelling approaches, analytical and empirical, are very diverse. The complexity of the problem is often very high and forces the consideration of certain assumptions in order to simplify the analytical resolution of the models, or sometimes to reduce the computational needs.
For example, the use of Monte Carlo simulation modelling can improve preventive maintenance scheduling, allowing the assessment of alternative scheduling policies that could be implemented dynamically on the plant/shop floor (see Figure 8).
Using a simulation model, we can compare and discuss the benefits of different scheduling policies on the status of current manufacturing equipment and several operating conditions of the production materials flow. To do so, we estimate measures of performance by treating simulation results as a series of realistic

experiments and using statistical inference to identify reasonable confidence intervals.

Figure 9. Life cycle cost analysis. Life cycle costs comprise capital costs (CAPEX: development and investment, including design, acquisition and construction) and operational costs (OPEX: operation costs over the years), where the non-reliability (risk) costs, i.e. corrective maintenance plus security, environment and production impacts, add to the planned maintenance costs.

Figure 10. Implementing e-maintenance (http://www.devicesworld.net). Conventional maintenance reporting based on inspections and complaints flowing from the assets through the maintenance department and middle management to top management is replaced by precise and concise online information available to all levels.
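The Monte Carlo comparison of alternative PM scheduling policies described above (see Figure 8) can be sketched as below; the failure model, costs and the two candidate policies are hypothetical stand-ins for a plant-specific model:

```python
import random

# Hypothetical single-machine model: each period the machine may fail with a
# probability that grows with its age; any maintenance visit resets the age.
def simulate(pm_interval, horizon=100, cm_cost=500, pm_cost=100, seed=None):
    """Total maintenance cost of one simulated run of the policy."""
    rng = random.Random(seed)
    age, cost = 0, 0
    for t in range(horizon):
        if rng.random() < min(0.02 * age, 1.0):   # aging failure probability
            cost += cm_cost    # corrective maintenance after a failure
            age = 0
        elif t > 0 and t % pm_interval == 0:
            cost += pm_cost    # planned preventive maintenance
            age = 0
        else:
            age += 1
    return cost

def expected_cost(pm_interval, runs=2000):
    """Treat simulation runs as repeated experiments and average their cost."""
    return sum(simulate(pm_interval, seed=r) for r in range(runs)) / runs

# Compare two alternative preventive scheduling policies.
for interval in (5, 20):
    print("PM every", interval, "periods:", expected_cost(interval))
```

Averaging many seeded runs treats the simulation results as repeated experiments, from which confidence intervals for each policy's cost can be inferred.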
The execution of the maintenance activities—once designed, planned and scheduled using the techniques described for the previous building blocks—has to be evaluated and deviations controlled to continuously pursue business targets and approach stretch values for key maintenance performance indicators as selected by the organization (Phase 6). Many of the high level maintenance KPIs are built or composed using other basic level technical and economical indicators. Therefore, it is very important to make sure that the organization captures suitable data and that that data is properly aggregated/disaggregated according to the required level of maintenance performance analysis.
A life cycle cost analysis (Phase 7) calculates the cost of an asset for its entire life span (see Figure 9). The analysis of a typical asset could include costs for planning, research and development, production, operation, maintenance and disposal. Costs such as up-front acquisition (research, design, test, production, construction) are usually obvious, but life cycle cost analysis crucially depends on values calculated from reliability analyses such as failure rate, cost of spares, repair times, and component costs. A life cycle cost analysis is important when making decisions about capital equipment (replacement or new acquisition) [12]; it reinforces the importance of locked-in costs, such as R&D, and it offers three important benefits:
• All costs associated with an asset become visible, especially upstream (R&D) and downstream (maintenance) costs;
• It allows an analysis of business function interrelationships; low R&D costs may lead to high maintenance costs in the future;
• Differences in early stage expenditure are highlighted, enabling managers to develop accurate revenue predictions.
Continuous improvement of maintenance management (Phase 8) will be possible due to the utilization of emerging techniques and technologies in areas that are considered to be of higher impact as a result of the previous steps of our management process. Regarding the application of new technologies to maintenance, the ''e-maintenance'' concept (Figure 10) is put forward as a component of the e-manufacturing concept [14], which profits from the emerging information and communication technologies to implement a cooperative and distributed multi-user environment. E-maintenance can be defined [10] as maintenance support which includes the resources, services and management necessary to enable proactive decision process execution.
This support includes not only e-technologies (i.e. ICT, Web-based, tether-free, wireless and infotronic technologies) but also e-maintenance activities (operations or processes) such as e-monitoring, e-diagnosis, e-prognosis, etc. Besides new technologies for maintenance, the involvement of maintenance people within the maintenance improvement process will be a critical factor for success. Of course, higher levels of knowledge, experience and training will be required, but at the same time, techniques covering the involvement of operators in performing simple maintenance tasks will be extremely important to reach higher levels of maintenance quality and overall equipment effectiveness.

3 CONCLUSIONS

This paper summarizes the process (the course of action and the series of stages or steps to follow) and the framework (the essential supporting structure and the basic system) needed to manage maintenance. A set of models and methods to improve maintenance management decision making is presented. Models are then classified according to their most suitable utilization within the maintenance management process. For further discussion of these topics the reader is addressed to a recent work of one of the authors [2].

REFERENCES

[1] EN 13306:2001, (2001) Maintenance Terminology. European Standard. CEN (European Committee for Standardization), Brussels.
[2] Crespo Márquez, A, (2007) The Maintenance Management Framework. Models and Methods for Complex Systems Maintenance. London: Springer Verlag.
[3] Vagliasindi F, (1989) Gestire la manutenzione. Perche e come [Managing maintenance: why and how]. Milano: Franco Angeli.
[4] Wireman T, (1998) Developing Performance Indicators for Managing Maintenance. New York: Industrial Press.
[5] Palmer RD, (1999) Maintenance Planning and Scheduling. New York: McGraw-Hill.
[6] Pintelon LM, Gelders LF, (1992) Maintenance management decision making. European Journal of Operational Research, 58: 301–317.
[7] Vanneste SG, Van Wassenhove LN, (1995) An integrated and structured approach to improve maintenance. European Journal of Operational Research, 82: 241–257.
[8] Gelders L, Mannaerts P, Maes J, (1994) Manufacturing strategy, performance indicators and improvement programmes. International Journal of Production Research, 32(4): 797–805.
[9] Kaplan RS, Norton DP, (1992) The Balanced Scorecard—measures that drive performance. Harvard Business Review, 70(1): 71–9.
[10] Tsang A, Jardine A, Kolodny H, (1999) Measuring maintenance performance: a holistic approach. International Journal of Operations and Production Management, 19(7): 691–715.
[11] Moubray J, (1997) Reliability-Centred Maintenance (2nd ed.). Oxford: Butterworth-Heinemann.
[12] Campbell JD, Jardine AKS, (2001) Maintenance Excellence. New York: Marcel Dekker.
[13] Duffuaa SO, (2000) Mathematical models in maintenance planning and scheduling. In Maintenance, Modelling and Optimization. Ben-Daya M, Duffuaa SO, Raouf A, Editors. Boston: Kluwer Academic Publishers.
[14] Lee J, (2003) E-manufacturing: fundamental, tools, and transformation. Robotics and Computer-Integrated Manufacturing, 19(6): 501–507.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Workplace occupation and equipment availability and utilization
in the context of maintenance float systems

I.S. Lopes
School of Engineering, University of Minho, Braga, Portugal

A.F. Leitão
Polytechnic Institute of Bragança, Bragança, Portugal

G.A.B. Pereira
School of Engineering, University of Minho, Guimarães, Portugal
ABSTRACT: In industry, spare equipments are often shared by many workplaces with identical equipments to assure the production rate required to fulfill delivery dates. These types of systems are called ''Maintenance Float Systems''. The main objective of managers that deal with these types of systems is to assure the required capacity to deliver orders on time and at minimum cost. Not delivering on time often has important consequences: it can cause loss of customer goodwill, loss of sales, and can damage the organization's image. Maintenance cost is the indicator most frequently used to configure maintenance float systems and to invest in maintenance workers or spare equipments. Once the system is configured, other performance indicators must be used to characterize and measure the efficiency of the system. Different improvement initiatives can be performed to enhance the performance of maintenance float systems: performing preventive maintenance actions, implementation of autonomous maintenance, improvement of equipments' maintainability, increase of maintenance crews' efficiency, etc. ''Carrying out improvement based on facts'' is a principle of Total Quality Management (TQM) on the path to business excellence; it requires monitoring processes through performance measures. This work aims to characterize and highlight the differences and relationships between three types of performance measures—equipment availability, equipment utilization and workplace occupation—in the context of maintenance float systems. Definitions and expressions of these three indicators are developed for maintenance float systems. The relationship between maintenance float systems' efficiency and the referred indicators is shown. Other indicators are also proposed and compared with the first ones (number of standby equipments, queue length, etc.).

1 INTRODUCTION

In the past, market demand was very high. Producers were required to deliver products as quickly as possible and without defects. Nowadays, within a competitive market, producers must deliver on time products that satisfy client requirements. Since client requirements are always changing, producers have to constantly redesign and innovate their products. Small product series are more frequent and require flexible and reliable technologies. The production dependence on robots, automatic systems and transport systems makes maintenance an important and key function in production systems. Some authors show that maintenance is no longer a cost center, but should be regarded as a profit generation function (Alsyouf 2006, Alsyouf 2007) and must be integrated in the organization's strategic planning (Madu 2000).

Alsyouf (2007) illustrates how an effective maintenance policy could influence the productivity and profitability of a manufacturing process through its direct impact on quality, efficiency and effectiveness of operation. Madu (2000) aimed to demonstrate that reliability and maintainability are crucial to the survival and competitiveness of an organization, especially with the rapid proliferation of technologies.

Several methodologies and models concerning maintenance management can be found in the literature, such as: Reliability Centered Maintenance (RCM), Total Productive Maintenance (TPM), and preventive maintenance models including predictive maintenance and inspection models.

Whenever breakdowns are inevitable, the utilization of spare equipments is frequent to minimize

undesirable effects in production caused by downtimes. Spares can be efficiently managed when identical equipments are operating in parallel in the workstation. This type of system is called a ''Maintenance Float System''. ''Float'' designates equipments in standby and equipments waiting for maintenance actions in the maintenance center. An equipment or unit involved in a Maintenance Float System (MFS) switches among different states: operating in the workstation, waiting in the queue to be repaired, being repaired in the maintenance center, and waiting until required by the workstation (see fig. 1).

Figure 1. Maintenance Float System representation (units in standby, units operating in the workstation, units in queue, and units being attended by maintenance servers).

Some studies present mathematical and simulation models to configure MFS. One of the first attempts to determine the number of float units was proposed by Levine (1965), who uses an analytical method based on traditional reliability theory. The author introduced a reliability factor based on the ratio MTTR/MTBF. Gross et al. (1983), Madu (1988) and Madu & Kuei (1996) use Buzen's algorithm. Zeng & Zhang (1997) consider a system where a key unit keeps the workstation functioning and a set of identical units are kept in a buffer to replace units sent for repairing. The system is modeled as a closed queue (an M/M/S/F queue), and the idle probability of the system is obtained. The optimal values of the capacity of the inventory buffer (F), the size of the repair crew (S) and the mean repair rate are determined by minimizing the total cost. Shankar & Sahani (2003) consider a float system whose failures are classified as sudden and wear-out. Units subject to wear-out failures are replaced and submitted to preventive maintenance actions after a specific time period. Based on the reliability function of the system, the authors find the number of floating units needed to support the active units such that the number of active units does not change. Most of the studies based on simulation methods approach the design problem of MFS through the development of meta-models (models that express the input-output relationship in the form of a regression equation). Madu (2000) used Taguchi's techniques to construct the meta-model. Chen & Tseng (2003) used neural networks.

Maintenance cost is the indicator most frequently used to configure MFS and to hire maintenance workers or to invest in spare equipments (Zeng & Zhang 1997; Madu & Kuei 1996; Madu 1999; Chen & Tseng 2003). Lopes et al. (2006) present a cost model to determine the number of float units, the number of maintenance crews in the maintenance center and the time between periodic overhauls. Periodic overhauls are performed to improve equipments' reliability and to minimize the number of breakdowns.

Once the system is configured, other performance indicators must be used to characterize and measure the efficiency of the system and to identify the potential improvement initiatives that can be implemented. Gupta & Rao (1996) present a recursive method to obtain the steady-state probability distribution of the number of down machines of a MFS. An M/G/1 queue is considered with only one repairman. Gupta & Rao (1996) use several performance measures to evaluate the efficiency of MFS: the average number of down machines, the average number of machines waiting in queue for repair, the average waiting time in queue, the average number of operating machines, the machine availability and the operator utilization. Each state of the system is characterized by the number of failed units. Gupta (1997) deals with the same queue model but considers that the server takes a vacation of random duration every time the repair facility becomes empty.

Lopes et al. (2007) also determine state probabilities of a float system submitted to preventive maintenance at periodic overhauls and show the effect of performing overhauls on equipments involved in a MFS. Each state of the system is defined by the number of failed units (i) and by the number of equipments submitted to or waiting for an overhaul (j). Lopes et al. (2007) conclude that periodic overhauls optimize the efficiency of maintenance crews in the maintenance center. Like Gupta & Rao (1996), several maintenance indicators are defined and determined (average queue length, probability of waiting queue occurrence, average number of units not yet replaced, etc.).

This work presents and defines three types of maintenance indicators for MFS: equipment availability, equipment utilization and workplace occupation. Workplace occupation and other performance measures are determined for a MFS submitted to periodic overhauls.

This paper is organized as follows: section 2 discusses equipment availability; section 3 studies equipment utilization; in section 4, workplace

occupation is defined. Sections 5 and 6 present the indicators for a MFS submitted to periodic overhauls. Section 7 presents the conclusions.

2 EQUIPMENT AVAILABILITY

Availability represents the fraction of time that a repairable equipment is operational. An equipment can be operational and not be active (operating). The expression often used is given by equation 1:

A = T_up / (T_up + T_down),   (1)

where T_up represents the total time the equipment is up (operational) and T_down represents the total time the equipment is down (not operational), meaning the equipment cannot be used. For equipment in the steady state, equation 2 is also frequently used:

A = MTBF / (MTBF + MTTR)   (2)

Equation 2 assumes that whenever the equipment is up (operational) it is operating, and that when it fails the repair action is immediately started. If the repair does not start immediately, MTTR must include the waiting time until the repair action starts.

Equation 2 is often misused: MTBF and MTTR do not always use the same measure units. Usually MTTR is associated with time to repair, but it is very frequent that MTBF uses measures associated with the phenomena of equipment degradation, such as number of kilometers or number of utilizations. Availability is a performance indicator related to equipment reliability and maintainability. A higher availability can be achieved if the equipment reliability increases or if the repair process becomes faster. It can involve enhancing the repair crew capacity by adopting new work methods or improving skills through training, or by introducing changes in the equipment that improve its maintainability or reliability.

In the context of MFS, the time spent in the maintenance queue needs to be incorporated in the expression for availability. In this environment, equipment availability is influenced by the efficiency of the maintenance center, which can be characterized by the respective queue length. The up time (fig. 2) is the sum of the equipment operating time in the workstation and the standby time.

Availability = (time in workstation + standby time) / (time in workstation + time in queue + time to repair + standby time)

Figure 2. Cycle time of a unit involved in MFS (states: active in workstation; waiting in the queue; being attended in the maintenance center; in standby).

Figure 3. Equipment utilization in MFS (the same states of figure 2 shown over a time axis from 0 to t).

In MFS, if the availability increases, production capacity increases. For the same level of capacity, spare equipments will not be required so often. The realization of preventive actions reduces downtime and thus increases the availability of equipments in the workstation. The equipment availability is also influenced by the time to repair. In this respect, different techniques could be used to reduce the time to perform repairs or to improve the maintainability of equipments.

Availability determination requires quantifying the standby time, which is complex to determine analytically. Instead of this indicator, the average number of non-operational equipments can be used to provide similar information. In section 6.1, the average number of equipments waiting for or under maintenance actions is determined for a MFS submitted to periodic overhauls.

3 EQUIPMENT UTILIZATION

Equipment utilization characterizes the efficiency of a MFS. A high utilization of an equipment means that spare equipments are often used, and thus cost-effective. It means that the equipment spent a relatively short time in the standby state, in the maintenance queue and in the repair station. If the system has been well configured, equipments involved in a MFS will have a utilization value very close to their availability value. Figure 3 illustrates equipment utilization in a MFS.

Utilization rate = (time in workstation) / (time in workstation + time in queue + time to repair + standby time)

The efficiency of MFS could also be assessed through the average number of equipments in standby and through the average number of equipments in queue. These last indicators are easier to determine analytically and provide similar information. They also allow the identification of the most critical source

for time loss (standby time and time spent in the queue).

4 WORKPLACE OCCUPATION

Workplace occupation depends on the time between failures and the time until replacement of the failed unit (fig. 4).

Workplace occupation = time in workplace / operating cycle

Figure 4. Replacing units in the workstation (a workstation comprising workplace 1, workplace 2, ..., workplace M).

This operating cycle (D_w), however, does not correspond to the cycle defined for equipment availability and utilization calculation purposes. This operating cycle (fig. 5) begins when a unit starts operating in the workstation and ends when another unit takes its place. For MFS, the new unit is different from the failed one and the replacement is performed before the conclusion of the initial unit's repair.

D_w = time in workstation + time until replacement of failed unit occurs

Figure 5. Cycle time for a workplace (a new unit starts operating at time 0; the operating unit fails; after the ''time until replacement'', another new unit starts operating).

If the replacement of failed equipment is always immediate, then the workplace occupation will be 1, meaning that the workplace is always available when required. The time to replace the equipment is neglected.

The time until a new unit starts operating in the workplace depends on the time to repair and on the number of spare equipments. Workplace occupation is related with equipment availability and utilization. Equipment utilization can be low due to time to repair or due to standby time. In the first case, workplace occupation is also low. In the second case, the workplace occupation is high. In this case, the average number of equipments in standby can give an idea of the equipment utilization and also of the workplace occupation. If the average number of equipments in standby is high, then the workplace occupation will also be high, since spare equipments' availability is high.

5 WORKPLACE OCCUPATION FOR A MFS SUBMITTED TO PERIODIC OVERHAULS

Based on the model developed in Lopes et al. (2007), workplace occupation is determined. Lopes et al. (2007) present a preventive maintenance model with replacement at constant time intervals or upon failure for a network with load-dependent maintenance service. Each system state is represented by (i, j), where i is the number of failed equipments and j is the number of equipments waiting for or under overhaul. P_{i,j} denotes the state probability for the MFS. The queue for the maintenance center follows a FIFO (First In, First Out) discipline. Failed units form a queue and are replaced as and when spare units are available. Equipment that needs an overhaul is kept in operation waiting for its turn to be attended, and the replacement takes place when the overhaul is initiated. If a unit that needs an overhaul fails before being attended, it will be replaced if a spare unit is available. In this case, this unit is treated like a failed unit.

Lopes et al. (2007) address both problems:

• the number of spare equipments (R) is greater than or equal to the number of maintenance crews (L): R ≥ L;
• the number of spare equipments (R) is lower than the number of maintenance crews (L): R < L.

In this work, the expression for workplace occupation will be determined for the first problem above.

5.1 Distribution of time to failure

The workstation comprises M identical workplaces, and the time to failure for an active equipment in the workstation follows an Exponential distribution f(t) with mean 1/λ_f, where λ_f designates the equipment failure rate.

5.2 Distribution of time to replace an equipment

Consider that the time interval between two successive conclusions of maintenance actions follows the Exponential distribution:

g(t) = μ e^(−μt),   (3)

where μ is the service rate of the maintenance crews, whose activity is divided into repair and overhaul

actions with a service rate of 'a' and 'b', respectively. μ can be determined based on the repair rate (μ_rep), the overhaul rate (μ_rev) and the number of maintenance crews (L); see equation 4:

μ = L·a·μ_rep + L·b·μ_rev   (4)

The time distribution for an equipment that needs to wait for r + 1 releases of the maintenance center to be replaced is a Gamma distribution: the sum of independent random variables following Exponential distributions follows a Gamma distribution (see equation 5).

W_{r+1}(t) = μ^(r+1) · t^r e^(−μt) / Γ(r + 1)   (5)

The average waiting time is given by equation 6:

∫₀^∞ t · μ^(r+1) · t^r e^(−μt) / Γ(r + 1) · dt = (r + 1)/μ,   (6)

which is the mean of the Gamma distribution. When r + 1 is an integer, W(t) is designated the Erlang distribution and Γ(r + 1) = r!.

5.3 Average occupation rate of a workplace

The average occupation rate of a workplace (Q) is determined based on the time interval an equipment is operating in the workplace (T_up). Considering a cycle with a duration D, Q is given by equation 7:

Q = T_up / D   (7)

Based on figure 6, the expressions of T_up and D can be deduced—up time is represented by a continuous line and downtime by a dashed line. The left side of figure 6 represents the cycle time for the workplace when the operating equipment reaches the end of the overhaul interval and is ready to be overhauled; the right side represents the cycle time when the operating equipment fails.

Some possible events are considered:

• A_R – at least one spare equipment is available;
• Ā_R – no spare equipment is available;
• A_L – at least one maintenance crew is available;
• Ā_L – no maintenance crew is available;
• F – the equipment which is waiting for an overhaul fails before replacement;
• NF – the equipment which is waiting for an overhaul does not fail.

Figure 6. Cycle time for L < R. Panel (a): machine ready to be overhauled at the end of the interval T—cases (A_L ∩ A_R); (Ā_L ∩ A_R) ∩ NF and (Ā_L ∩ A_R) ∩ F, with times τ_a1, t_1, z_1; (Ā_L ∩ Ā_R) ∩ NF and (Ā_L ∩ Ā_R) ∩ F, with times τ_a2, t_2, z_2. Panel (b): failed machine—cases A_R and Ā_L ∩ Ā_R, with time τ_b2. Legend: active machine (continuous line); failure (X); failed machine in queue (dashed line).

5.3.1 Cycle time

a) Cycle time for the workplace when the operating equipment is overhauled. At the end of the overhaul interval [0, T), the MFS can be in several different states. Depending on the system state, different situations can arise:

– There is at least one spare unit available to perform the replacement and a crew to attend it (A_L ∩ A_R): the replacement is made immediately and the overhaul is initiated.
– There is at least one spare unit available to perform the replacement, but there is no maintenance crew available to initiate the overhaul, so the equipment waits, active, until being replaced (Ā_L ∩ A_R): if the equipment does not fail while waiting for the overhaul (NF), the replacement is made only when a maintenance crew becomes available, following the first come, first served discipline. If the equipment fails (F) meanwhile, it is treated as a failed equipment and will be replaced as soon as possible.
– There is no spare unit and no maintenance crew available (Ā_L ∩ Ā_R): this situation is addressed like the previous one.

b) Cycle time when the operating equipment fails. When the operating equipment in the workplace fails, the following situations need to be considered to identify the time to replace the equipment:

– There is at least one spare unit available ((A_L ∩ A_R) ∪ (Ā_L ∩ A_R) = A_R): the time to replace the equipment is null (neglecting the time to make the change in the workplace).
– There is no spare unit available (Ā_R): the average time to replace the equipment can be determined based on the Gamma distribution.

The average duration of a cycle, D, is given by the following equation:

D = F(T) · { P(A_L ∩ A_R)·T + P(Ā_L ∩ A_R)·P_NF·[T + τ_a1] + P(Ā_L ∩ A_R)·P_F·[T + t_1 + z_1] + P(Ā_L ∩ Ā_R)·P_NF·[T + τ_a2] + P(Ā_L ∩ Ā_R)·P_F·[T + t_2 + z_2] }
    + ∫₀^T f(t) · { P(A_L ∩ A_R)·t + P(Ā_L ∩ A_R)·t + P(Ā_L ∩ Ā_R)·[t + τ_b2] } dt

Simplifying:

D = F(T) · { T + P(Ā_L ∩ A_R)·P_NF·τ_a1 + P(Ā_L ∩ A_R)·P_F·[t_1 + z_1] + P(Ā_L ∩ Ā_R)·P_NF·τ_a2 + P(Ā_L ∩ Ā_R)·P_F·[t_2 + z_2] }
    + ∫₀^T f(t) · [ t + P(Ā_L ∩ Ā_R)·τ_b2 ] dt

where

P(A_L ∩ A_R) = Σ_{i+j+1 ≤ L} P_{i,j}
P(Ā_L ∩ A_R) = Σ_{L < i+j+1 ≤ R} P_{i,j}
P(Ā_L ∩ Ā_R) = Σ_{i+j+1 > R} P_{i,j}

and F(T) is the probability of the equipment reaching the end of the overhaul interval without a failure.

5.3.2 Up time in workstation

T_up = F(T) · { P(A_L ∩ A_R)·T + P(Ā_L ∩ A_R)·P_NF·[T + τ_a1] + P(Ā_L ∩ A_R)·P_F·[T + t_1] + P(Ā_L ∩ Ā_R)·P_NF·[T + τ_a2] + P(Ā_L ∩ Ā_R)·P_F·[T + t_2] }
    + ∫₀^T f(t) · { P(A_L ∩ A_R)·t + P(Ā_L ∩ A_R)·t + P(Ā_L ∩ Ā_R)·t } dt

Simplifying:

T_up = F(T) · { T + P(Ā_L ∩ A_R)·P_NF·τ_a1 + P(Ā_L ∩ A_R)·P_F·t_1 + P(Ā_L ∩ Ā_R)·P_NF·τ_a2 + P(Ā_L ∩ Ā_R)·P_F·t_2 }
    + ∫₀^T f(t)·t dt

5.4 Average time to replace an equipment waiting for an overhaul

Before replacing an equipment that is waiting for an overhaul, all equipments that are already in the queue have to be replaced first. The number of equipments in the queue is equal to i + j − L, i.e., the number of failed equipments plus the number of equipments waiting for an overhaul, minus the number of equipments being attended. Then, the average time to replace an equipment requiring an overhaul is given by equation 8:

τ_a1 = τ_a2 = ((i + j − L) + 1)/μ   (8)

5.5 Time to replace a failed equipment

For L < i + j < R:

• (i + j − L)·a is the average number of failed equipments in the queue;
• (R − L) is the number of available spare equipments at the instant when the maintenance crews all become occupied.

If (i + j − L)·a + 1 ≤ R − L, the failed equipment i + j + 1 can be replaced immediately, since there is spare equipment available.

If (i + j − L)·a + 1 > R − L, the failed equipment i + j + 1 cannot be replaced immediately (there are no spare equipments available). Released equipments will replace failed equipments or equipments waiting for an overhaul, depending on the maintenance action that is initiated. If the maintenance action is an overhaul, the released equipment will replace the equipment submitted to the overhaul. If the maintenance action is a repair, the released equipment will replace the next non-replaced failed equipment in the queue.
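The spare-availability test of section 5.5 can be sketched as a small helper function (an illustrative sketch, not from the paper; `pending_releases` is a hypothetical name, and i, j, L, R and a follow the paper's notation):

```python
def pending_releases(i, j, L, R, a):
    """Maintenance-center releases that failed unit i+j+1 must wait for
    before a spare becomes available (0.0 means immediate replacement).

    i: failed units; j: units waiting for or under overhaul;
    L: maintenance crews; R: float (spare) units, with R >= L;
    a: fraction of failed units among the i+j-L queued units.
    """
    # Non-replaced failed units once the new failure is counted in.
    lacking = (i + j - L) * a + 1 - (R - L)
    if lacking <= 0:
        return 0.0  # (i+j-L)*a + 1 <= R-L: a spare unit is on hand
    # Divide by the fraction of failed units among the queued units.
    return lacking / a
```

For example, with i = 6 failed units, j = 3 units awaiting overhaul, L = 4 crews, R = 6 float units and a = 0.5, the new failure waits for (2.5 + 1 − 2)/0.5 = 3 releases of the maintenance center.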

Since max[0; (i + j − L)·a + 1 − (R − L)] is the number of non-replaced failed equipments, the failed equipment i + j + 1 will be replaced after the release of max[0; (i + j − L)·a + 1 − (R − L)]/a equipments from the maintenance center (the number of non-replaced failed equipments divided by the fraction of failed equipments in the queue). The time elapsed until the replacement is given by equation 9:

τ_b2 = max[0; (i + j − L)·a + 1 − (R − L)] / (a·μ)   (9)

5.6 Probability of failure of an equipment waiting for an overhaul

The probability of failure P_F is determined based on the Erlang distribution—the time distribution for replacing an equipment waiting for an overhaul—and on the Exponential distribution—the time distribution for failures of an active equipment:

P_F = ∫₀^∞ μ^(r+1) · (t^r e^(−μt)/r!) · (∫₀^t λ_f e^(−λ_f·t₂) dt₂) · dt   (10)
    = ∫₀^∞ μ^(r+1) · (t^r e^(−μt)/r!) · (1 − e^(−λ_f·t)) · dt
    = (μ^(r+1)/r!) · [ ∫₀^∞ t^r e^(−μt) dt − ∫₀^∞ t^r e^(−(μ+λ_f)t) dt ]
    = (μ^(r+1)/r!) · [ r!/μ^(r+1) − r!/(μ + λ_f)^(r+1) ]
    = 1 − μ^(r+1)/(μ + λ_f)^(r+1)

P_NF = 1 − P_F

Equations 10 and 11 depend on r, the number of equipments to be replaced, given by r = i + j − L.

5.7 Mean time until failure of an active equipment waiting for an overhaul

The mean time until failure t_v of an active equipment waiting for an overhaul is obtained based on the Erlang distribution and on the Exponential distribution (see equations 11 and 12):

t_v = [ ∫₀^∞ μ^(r+1) · (t^r e^(−μt)/r!) · (∫₀^t τ·λ_f e^(−λ_f·τ) dτ) · dt ] / P_F   (11)

t_1 and t_2 have the same expression, since the maintenance rates and the expressions for the number of releases of the maintenance center until the overhaul is initiated (r = i + j − L) are identical. Simplifying:

t_v = μ^(r+1) · [ (1/λ_f)·(1/μ^(r+1) − 1/(μ + λ_f)^(r+1)) − (r + 1)/(μ + λ_f)^(r+2) ] / P_F   (12)

5.8 Mean time to replace a failed equipment (initially waiting for an overhaul)

Failed equipments are replaced as soon as possible, and equipments that need an overhaul continue operating until being attended. Equipments waiting for an overhaul which fail are treated as failed equipments. Therefore, these equipments are replaced after the replacement of the lacked equipments in the workstation (failed equipments not yet replaced).

It is considered that the maintenance system is in the steady state (state probabilities are time independent), and it is assumed that the system state at the moment of the failure occurrence is independent of the system state when the overhaul is required.

Then, the average number of lacked equipments when the failure occurs can be calculated using equation 13:

r_f = Σ_{i+j ≥ L} P_{i,j} · max[0; (i + j − L)·a − (R − L)] / Σ_{i+j ≥ L} P_{i,j}   (13)

where max[0; (i + j − L)·a − (R − L)] represents the average number of lacked equipments in the workstation for the state (i, j).

Using the same logic as used for determining τ_b2 (presented above), z_1 and z_2 are given by equation 14:

z_1 = z_2 = (r_f + 1)/(a·μ)   (14)
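The closed form for P_F in equation 10, P_F = 1 − (μ/(μ + λ_f))^(r+1), can be checked by integrating the Erlang density directly. The sketch below is an illustrative numeric verification, not part of the paper; the parameter values and function names are arbitrary:

```python
import math

def p_f_closed(r, mu, lam):
    # Simplified equation 10: probability that an Exp(lam) failure
    # occurs before the Gamma(r+1, mu) replacement time.
    return 1.0 - (mu / (mu + lam)) ** (r + 1)

def p_f_numeric(r, mu, lam, steps=100_000, t_max=200.0):
    # Midpoint-rule integration of the integrand in equation 10:
    # mu^(r+1) * t^r * exp(-mu*t)/r! * (1 - exp(-lam*t)) over [0, t_max].
    dt = t_max / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        dens = mu ** (r + 1) * t ** r * math.exp(-mu * t) / math.factorial(r)
        total += dens * (1.0 - math.exp(-lam * t)) * dt
    return total

# Arbitrary example: r = i + j - L = 3, so r + 1 = 4 releases are needed.
r, mu, lam = 3, 0.8, 0.1
diff = abs(p_f_closed(r, mu, lam) - p_f_numeric(r, mu, lam))
# The two estimates agree (difference well below 1e-4).
```

The same pattern extends to equation 12: replacing the factor (1 − e^(−λ_f·t)) by the inner integral of equation 11 reproduces the closed form for t_v.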

6 OTHER MAINTENANCE INDICATORS FOR MFS SUBMITTED TO OVERHAULS

6.1 Average number of equipments waiting for and under maintenance actions

For a MFS submitted to preventive maintenance, the equipment availability determination needs to incorporate the time to perform an overhaul and the time to repair. Then, the number of non-operational equipments (the related indicator) includes failed equipments and equipments waiting for and under overhaul; see equation 15:

Σ_{(i,j)} P_{i,j} · (i + j)   (15)

6.2 Average number of equipments in the queue

The average number of equipments in the queue (equation 16) allows the identification of the need for maintenance center improvement:

Σ_{i+j ≥ L} P_{i,j} · (i + j − L)   (16)

6.3 Average number of equipments in standby

The average number of lacked equipments in the workstation determined by Lopes et al. (2007) is directly related to the average number of equipments in standby (equation 17) and can also be used to assess MFS efficiency:

Σ_{i+j ≤ L} P_{i,j} · [R − (i + j)] + Σ_{i+j > L} P_{i,j} · max[0; (R − L) − (i + j − L)·a]   (17)

7 CONCLUSIONS

The three indicators addressed in this work—equipment availability, equipment utilization and workplace occupation—are important and need to be used in order to prioritize improvement initiatives and monitor the efficiency of MFS. Analytically, it seems complex to determine equipment availability and equipment utilization, since it involves quantifying the time in standby. However, as shown, some other indicators provide similar information and are easier to determine.

Decision makers use several kinds of indicators to define and identify improvement initiatives. However, one of the most important indicators for production processes in the context of MFS is the workplace occupation. It gives information about the workstation efficiency.

Improvement in the occupation rate can be achieved by:

• Increasing the number of spare equipments. This allows increasing the workplace occupation, but it also decreases the equipment utilization rate. Utilization needs to be as close as possible to equipment availability. However, the holding cost of spare equipments and the investment made need to be balanced against the cost of lost production.
• Increasing the number of maintenance servers. The maintenance center will deliver repaired equipments more frequently, and then equipments will have a higher availability.
• Improving maintenance efficiency. Maintenance efficiency can be enhanced by performing preventive maintenance actions or equipment improvement (enhancing reliability or maintainability), by changing work procedures, and by training operators. Improving maintenance efficiency has the same effect as increasing the number of maintenance servers, but it often requires less investment.

Decision makers also frequently use Overall Equipment Effectiveness (OEE), an indicator associated with the TPM methodology, to assess the efficiency of production lines. OEE quantifies three types of losses: performance losses, quality losses, and the loss associated with reliability and maintainability. As far as the loss associated with reliability and maintainability is concerned, an availability indicator is usually used. However, as shown in this work, workplace occupation is a better indicator than availability to assess MFS efficiency. Therefore, we propose a new formula for OEE to be used in the context of MFS, replacing availability by workplace occupation.

REFERENCES

Alsyouf, I. 2006. Measuring maintenance performance using a balanced scorecard approach. Journal of Quality in Maintenance Engineering 12(2):133–149.
Alsyouf, I. 2007. The role of maintenance in improving companies' productivity and profitability. International Journal of Production Economics 105(1):70–78.
Chen, M.-C. & Tseng, H.-Y. 2003. An approach to design of maintenance float systems. Integrated Manufacturing Systems 14(3):458–467.
Gross, D., Miller, D.R. & Soland, R.M. 1983. A closed queueing network model for multi-echelon repairable item provisioning. IIE Transactions 15(4):344–352.
Gupta, S.M. 1997. Machine interference problem with warm spares, server vacations and exhaustive service. Performance Evaluation 29:195–211.
Gupta, U.C. & Rao, T.S. 1996. On the M/G/1 machine interference model with spares. European Journal of Operational Research 89:164–171.
Levine, B. 1965. Estimating maintenance float factors on the basis of reliability theory. Industrial Quality Control 4(2):401–405.
Lopes, I., Leitão, A. & Pereira, G. 2006. A maintenance float system with periodic overhauls. In Guedes Soares & Zio (eds), ESREL 2006. Portugal.
Lopes, I., Leitão, A. & Pereira, G. 2007. State probabilities of a float system. Journal of Quality in Maintenance Engineering 13(1):88–102.
Madu, C.N. 1988. A closed queuing maintenance network with two repair centers. Journal of the Operational Research Society 39(10):959–967.
Madu, C.N. 2000. Competing through maintenance strategies. International Journal of Quality and Reliability Management 17(7):937–948.
Madu, C.N. & Kuei, C.-H. 1996. Analysis of multiechelon maintenance network characteristic using implicit enumeration algorithm. Mathematical and Computer Modelling 24(3):79–92.
Madu, I.E. 1999. Robust regression metamodel for a maintenance float policy. International Journal of Quality & Reliability Management 16(3):433–456.
Shankar, G. & Sahani, V. 2003. Reliability analysis of a maintenance network with repair and preventive maintenance. International Journal of Quality & Reliability Management 20(2):268–280.
Zeng, A.Z. & Zhang, T. 1997. A queuing model for designing an optimal three-dimensional maintenance float system. Computers & Operations Research 24(1):85–95.

683
Monte Carlo methods in system safety and reliability
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Availability and reliability assessment of industrial complex systems:
A practical view applied on a bioethanol plant simulation

V. González, C. Parra & J.F. Gómez
Industrial Management PhD Program at the School of Engineering, University of Seville, Spain

A. Crespo & P. Moreu de León
Industrial Management School of Engineering, University of Seville, Spain

ABSTRACT: This paper presents a practical view of the behaviour of an industrial assembly in order to assess its availability and reliability. For that purpose, a complex system, a Bioethanol Plant, will be used. A computerized model will help to create a realistic scenario of the Bioethanol Plant life cycle, obtaining an estimation of the most important performance measures through real data and statistical inference. In this way, it will be possible to compare and discuss the profit of different plant configurations using the model and following the initial technical specifications. Basically, for these purposes the Bioethanol Plant will be divided into functional blocks, defining their tasks and features, as well as their dependencies according to the plant configuration. Additionally, maintenance information and databases will be required for the defined functional blocks. Once these data have been compiled, and using any commercial software, it will be possible to build a model of the plant and to simulate scenarios and experiments for each considered configuration. Parameters of availability and reliability will be obtained for the most important functions under different plant configurations. From their interpretation, it will be interesting to consider actions that improve the availability and reliability of the system under different plant functional requirements. Among other important aspects, a sensitivity analysis will also be of interest, i.e., exploring how parameter modifications influence the result or final goal.

1 INTRODUCTION

Nowadays, investors, engineers, etc. have to take into consideration many requirements and conditions in order to avoid risks and hazards in industrial systems. Components or subsystems have potential failure modes which have to be kept in mind from the system's initial state, according to its operation modes, environmental conditions, failure times, etc.
The failure modelling can be very complicated because of dependencies or interferences among components and, at the same time, because of the great amount of required data. The intention here is to generate a random number of events with a computerized model which simulates the plant life scenario. There are simulation methods that allow us to take into consideration important aspects of the system operation, such as redundancies, stand-by nodes, preventive maintenance, repair priorities, etc.
The use of such methods is growing when one wants to predict the general availability or perform the economic assessment of a plant. Once the operational cycles of each component have been simulated, through their combination and dependencies we will be able to obtain the whole system operation cycle. The scope of this research is the availability assessment of alternative configurations in a plant, according not only to predetermined maintenance strategies but also to new maintenance policies.

2 PROCEDURE

2.1 Definition of the system configuration

Description: Functional blocks and their dependencies need to be defined according to the plant configuration, together with each block's tasks or purposes.
Result: Functional block list/diagram (function, inputs, outputs, etc.); functional chart of the system containing the relations among blocks and their reliability features.
2.2 Data compiling

Description: Maintenance information and databases will be required for each considered block.
Result: Schedule with preventive tasks, times, failure ratios, etc.

2.3 Model construction

Description: Using any commercial software, it will be possible to build a model of the plant.
Result: Plant computerized model.

2.4 Simulation

Description: Using the model and the above-mentioned data, we will simulate scenarios and experiments for each considered configuration.
Result: Scenarios list; replication of real and hypothetical events.

2.5 Results and analysis

Description: Calculation and discussion of the simulation results.
Results: Parameters of availability and reliability will be obtained for the most important functions under different plant configurations. From their interpretation, we will consider actions to improve the availability and reliability of the system under the plant functional requirements. Among other important aspects, we will also carry out the sensitivity analysis, i.e., the exploration of how parameter modifications influence the result or final goal.

Figure 1. General diagram of the Bioethanol obtaining process (grain and water in; milling, hydrolysis, saccharification, fermentation, distillation, dehydration; centrifugation, evaporation and drying for the by-products; outputs: ethanol, CO2, DDG and syrup).

Figure 2. Diagram of the grain reception, cleaning, milling and storage system.

3 SHORT DESCRIPTION OF THE BIOETHANOL PROCESS AND ITS RELATED BLOCK DIAGRAMS

Ethanol is commonly produced by the fermentation of sugars. The conversion of starch to sugar, called saccharification, significantly expands the choice of feedstock to include cereals such as corn, wheat, barley, etc. Another alternative route to bioethanol involves the enzymatic conversion of cellulose and hemi-cellulose. Completing the picture, ethanol is also commercially synthesised from petrochemical sources, but this does not fall into the renewable category, which is the frame of this paper.
Basically, the chemical reactions involved in bioethanol production can be simplified as follows: starch is hydrolysed to glucose, (C6H10O5)n + n H2O → n C6H12O6, and glucose is fermented to ethanol and carbon dioxide, C6H12O6 → 2 C2H5OH + 2 CO2.
Conversion of starch into bioethanol involves several more process steps. It starts by making a "beer" from the milled grain, then distilling off the alcohol, followed by recovery of the residual solids and recycling of water.
It is necessary to take care with each step of the process to assure an efficient conversion, particularly because it is a biological process where unsuitable reactions can happen, causing loss in yield, and also because different grains have different process requirements. Some of the most important matters to take into account in the sequence of process steps are mentioned below.
Together with this description, Block Diagrams of the main systems in a Bioethanol plant are also included. These diagrams are only an approximation of an ideal plant; they briefly show the main equipment and devices, as well as the material flows inside the whole plant.

3.1 Slurry preparation

The feed grain is milled to a sufficient fineness to allow water access to all the starch inside each grain. The meal is then mixed with warm water to a specific concentration without generating excessive viscosities downstream.
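The simulation step of the procedure above (Section 2.4) can be illustrated with a minimal Monte Carlo routine. This is a hedged Python sketch, not the commercial package the paper refers to; the exponential failure/repair distributions and all numbers are invented for the example. It alternates sampled times-to-failure and times-to-repair for one functional block over a mission horizon and estimates availability as the fraction of uptime.

```python
import random

def simulate_block_availability(mtbf, mdt, horizon, n_runs=2000, seed=1):
    """Estimate the availability of one functional block by Monte Carlo.

    Alternates exponential times-to-failure (mean mtbf) and exponential
    times-to-repair (mean mdt) until the mission horizon is reached;
    availability = uptime / horizon, averaged over n_runs life cycles.
    """
    rng = random.Random(seed)
    total_up = 0.0
    for _ in range(n_runs):
        t, up = 0.0, 0.0
        while t < horizon:
            ttf = rng.expovariate(1.0 / mtbf)   # sampled time to failure
            up += min(ttf, horizon - t)         # count uptime inside horizon
            t += ttf
            if t >= horizon:
                break
            t += rng.expovariate(1.0 / mdt)     # sampled repair time
        total_up += up / horizon
    return total_up / n_runs

# Illustrative figures: MTBF = 500 h, MDT = 20 h,
# so the steady-state availability is about 500/520 ≈ 0.96.
A = simulate_block_availability(mtbf=500.0, mdt=20.0, horizon=10000.0)
print(round(A, 3))
```

Combining such per-block cycles through the series/parallel dependencies of the functional chart yields the whole-system operation cycle described in the paper.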
Figure 3. Diagram of the mashing, cooking, liquefaction and fermentation system.

Figure 4. Diagram of the distillation and dehydration system.

Figure 5. Diagram of the centrifugation, evaporation and drying system.

3.2 Hydrolysis

The slurry temperature is raised in order to accelerate the hydrolysis of the grain's starch into solution. Again there is an optimum depending on the grain type: if the slurry is too hot, the viscosity is excessive, and if it is too cool, the required residence time for effective hydrolysis is too long.

3.3 Saccharification

With enzymes, the dissolved starch is converted to sugars by saccharification, but at a reduced temperature, which again is selected to achieve a balance between a satisfactory reaction rate and avoiding the promotion of unsuitable side reactions and a subsequent loss in yield.

3.4 Fermentation

The slurry is cooled to a fermentation temperature and held in large batch fermentation tanks for a specific time. Fresh yeast is prepared in parallel and added at the beginning of each batch fermentation cycle. The fermenting slurry is agitated and also circulated through external exchangers to remove the heat generated by the fermentation process. On completion of fermentation, the batch is transferred to the Beer Well, and the particular fermentation tank and associated equipment are cleaned by the plant's CIP system.

3.5 Distillation

The "beer" contains about 8–12% ethanol. It is continuously pumped to the distillation unit, which produces an overhead stream of about 90% ethanol and water. Ethanol and water form a 95% azeotrope, so it is not possible to reach 100% by simple distillation.

3.6 Dehydration

The 90% overhead stream is passed through an absorber containing a molecular sieve which traps the ethanol while letting the water pass through. Once the bed in the absorber is full, the feed is switched to a parallel absorber and the almost pure ethanol is sent to product storage. The cycle is then repeated.

3.7 Centrifugation

The residual slurry left after distillation is called "Whole Stillage", and it contains all the insoluble and soluble non-starch components from the feed grain, as well as the yeast which has grown during fermentation. The bulk of the solids, termed "Wet Distillers Grains" (WDG), is removed by centrifuge, leaving a "Thin Stillage".

3.8 Evaporation

To minimise water consumption, a large portion of the Thin Stillage is recycled to the front of the process. Basically, by effect of the evaporation a portion of water is recycled, and the residual "Syrup", which contains around 30–35% solids, is either blended with the WDG or sold separately as animal feed.

3.9 Drying

WDG is commonly dried as a mixture with the syrup to about 10% moisture in a gas-fired rotary dryer. This by-product, called "Dried Distillers Grains and Solubles" (DDGS), is sold as animal feed.

3.10 CO2 Recovery

Fermentation simultaneously generates carbon dioxide, which is collected and scrubbed to recover any
ethanol. It can either be processed into a by-product or released to atmosphere.

Figure 6. Diagram of the pelletizing system.

4 SUMMARY OF FORMULAS AND CONCEPTS REGARDING RELIABILITY AND AVAILABILITY

The intention here is to summarise the formulation of the reliability and availability concepts for two (non-identical) subsystems in series or in parallel. Additionally, a system with n identical subsystems in parallel is also taken into consideration, where the system is declared failed if m or more subsystems fail (the m_out_of_n case).
The system characteristics to be formulated here are the "system failure rate", the "system Mean Time Between Failures", the "system availability and unavailability", and the "system mean down time". We understand by "Failure (f)" the termination of the ability of a component or system to perform a required function. Therefore, the "Failure Rate (λ)" is the arithmetic average of failures of a component and/or system per unit exposure time. The most common unit in reliability analyses is hours (h); however, some industries use failures per year.

4.1 Two subsystems in series

4.1.1 System failure rate
Just for one subsystem, the failure rate is λ1. The probability of failure in dt is λ1 dt. For two subsystems in series, the probability of failure in dt is (λ1 dt + λ2 dt). The system failure rate is thus (λ1 + λ2):

  λseries = λ1 + λ2

The reliability function is:

  R(t) = exp[−(λ1 + λ2) t]

4.1.2 System MTBF
From the exponential form of the reliability function, it is obvious that:

  MTBFseries = 1/(λ1 + λ2) = MTBF1 · MTBF2 / (MTBF1 + MTBF2)

4.1.3 System availability and unavailability
For the system to be available, each subsystem should be available. Thus:

  Aseries = A1 · A2

Conversely, the unavailability is:

  UAseries = 1 − Aseries = 1 − (1 − UA1) · (1 − UA2) = UA1 + UA2 − UA1 · UA2

4.1.4 System mean down time for repairable subsystems
If two subsystems are both repairable, one with mean down time MDT1 and the other MDT2, the mean down time for the two subsystems in series is obtained as follows. At any instant in time, the system is in one of 4 states:

• Both subsystems functional.
• Only subsystem #1 is non-functional.
• Only subsystem #2 is non-functional.
• Both subsystems are non-functional.

The last 3 cases are responsible for the system being non-functional. It is assumed that the 4th case has negligible probability. Given that the system is down, the probability that it is because subsystem #1 is non-functional is obviously:

  λ1 / (λ1 + λ2)

Since subsystem #1 needs MDT1 to repair, the repair time associated with repairing subsystem #1 is then:

  [λ1 / (λ1 + λ2)] · MDT1

A similar expression holds for subsystem #2. Summing them up, one gets:

  MDTseries = (MTBF1 · MDT2 + MTBF2 · MDT1) / (MTBF1 + MTBF2)

4.2 Two subsystems in parallel

Here the two subsystems are repairable. The mean down times are MDT1 and MDT2.

4.2.1 System failure rate
If the system just consists of subsystem #1, then the system failure rate is λ1. The probability of failure in dt is λ1 dt. Adding subsystem #2 in parallel, the probability of system failure in dt is λ1 dt reduced by the probability that subsystem #2 is in the failure
state. The probability of finding subsystem #2 in the failure state is given by:

  MDT2 / (MTBF2 + MDT2)

Assuming that:

  MDT2 << MTBF2

and using:

  MTBF2 = 1/λ2

the reduced failure rate for subsystem #1 is then given by: λ1 · λ2 · MDT2. Likewise, the reduced failure rate for subsystem #2 is: λ1 · λ2 · MDT1. Consequently:

  λparallel = λ1 · λ2 · (MDT1 + MDT2)

4.2.2 System MTBF
Taking the approach that the inverse of the failure rate is the MTBF (true for the exponential distribution), one gets:

  MTBFparallel = 1/λparallel = MTBF1 · MTBF2 / (MDT1 + MDT2)

It is noted that if the two subsystems are not repairable, then the MTBF for the parallel case is the sum of the individual MTBFs.

4.2.3 System availability and unavailability
For the system to be available, either subsystem should be available. Thus:

  Aparallel = A1 + A2 − A1 · A2

Conversely, the unavailability is:

  UAparallel = 1 − Aparallel = 1 − (A1 + A2 − A1 · A2) = (1 − A1) · (1 − A2) = UA1 · UA2   (1)

4.2.4 System mean down time for repairable subsystems
From the definition of:

  Unavailability = MDT / (MTBF + MDT) ≈ MDT / MTBF

it is possible to get the MDT for the parallel case by using Eq. (1) above:

  UAparallel = MDTparallel / MTBFparallel = MDTparallel · (MDT1 + MDT2) / (MTBF1 · MTBF2)

  UA1 · UA2 = (MDT1 / MTBF1) · (MDT2 / MTBF2)

Consequently:

  MDTparallel = MDT1 · MDT2 / (MDT1 + MDT2)

4.3 M out of N parallel subsystems

If a system consists of n parallel, identical subsystems and the system is down if there are m or more subsystems down, then the formulas for system failure rate, system MTBF, system availability, and system mean down time are the following.

4.3.1 System failure rate
If the system just consists of subsystem #1, then the system failure rate is λ. The probability of failure in dt is λ dt. To have a system failure, we need to have another (m − 1) subsystems in the failure state. The chance that any one subsystem is in the failure state is given by MDT/(MTBF + MDT), or MDT/MTBF if we assume MDT << MTBF. To find (m − 1) subsystems in the failure state, the probability is:

  (MDT / MTBF)^(m−1)

There are C(n−1, m−1) ways to group (m − 1) subsystems out of the other (n − 1) subsystems. Also, it is possible to choose any subsystem to be the #1 subsystem in the analysis. Putting it all together, one has:

  λm_out_of_n = λ · (MDT/MTBF)^(m−1) · C(n−1, m−1) · n = [n! / ((n − m)! (m − 1)!)] · λ^m · MDT^(m−1)   (2)

This is the failure rate for exactly m subsystem failures; the failure rate for more than m subsystem failures is smaller by a factor of (λ · MDT). For a consistency check, we consider n = m = 2. This is a system consisting of two parallel, identical subsystems; when m = 2 subsystems fail, the system fails. Eq. (2) for this case gives λsystem = λ² · (2 · MDT), which agrees with the formula obtained above.

4.3.2 System MTBF
Taking the approach that the inverse of the failure rate is the MTBF (true for the exponential distribution), one gets:

  MTBFm_out_of_n = 1/λm_out_of_n = MTBF^m / {[n! / ((n − m)! · (m − 1)!)] · MDT^(m−1)}
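The two-subsystem formulas and Eq. (2) can be collected into a few helper functions and cross-checked numerically. This is an illustrative sketch; the function names and the numerical values are arbitrary, not from the paper.

```python
from math import factorial, isclose

def lambda_parallel(l1, mdt1, l2, mdt2):
    """Failure rate of two repairable subsystems in parallel
    (valid for MDT << MTBF): λ1 · λ2 · (MDT1 + MDT2)."""
    return l1 * l2 * (mdt1 + mdt2)

def mdt_series(mtbf1, mdt1, mtbf2, mdt2):
    """Mean down time of two repairable subsystems in series (Section 4.1.4)."""
    return (mtbf1 * mdt2 + mtbf2 * mdt1) / (mtbf1 + mtbf2)

def mdt_parallel(mdt1, mdt2):
    """Mean down time of two repairable subsystems in parallel (Section 4.2.4)."""
    return mdt1 * mdt2 / (mdt1 + mdt2)

def lambda_m_out_of_n(lam, mdt, n, m):
    """Eq. (2): failure rate of an m-out-of-n system of n identical units."""
    return factorial(n) / (factorial(n - m) * factorial(m - 1)) \
        * lam ** m * mdt ** (m - 1)

# Consistency check stated in Section 4.3.1: for n = m = 2 identical units,
# Eq. (2) must reduce to the two-unit parallel formula λ² · (2 · MDT).
lam, mdt = 1.0 / 1000.0, 12.0
print(isclose(lambda_m_out_of_n(lam, mdt, n=2, m=2),
              lambda_parallel(lam, mdt, lam, mdt)))   # True
```

The same helpers can be composed pairwise to evaluate any series/parallel block diagram, as the paper does in Section 5.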
4.3.3 System availability and unavailability
For the system to be available, at least (n − m + 1) subsystems should be available. Thus:

  Am_out_of_n = Σ(i = n−m+1 to n) [n! / ((n − i)! · i!)] · A^i · (1 − A)^(n−i)

Using the following equality:

  1 = [A + (1 − A)]^n = Σ(i = 0 to n) [n! / ((n − i)! · i!)] · A^i · (1 − A)^(n−i)

it is possible to rewrite the availability as:

  Am_out_of_n = 1 − Σ(i = 0 to n−m) [n! / ((n − i)! · i!)] · A^i · (1 − A)^(n−i) ≈ 1 − [n! / (m! · (n − m)!)] · (1 − A)^m

And the unavailability is given (again, for MDT << MTBF) by:

  UAm_out_of_n = [n! / (m! · (n − m)!)] · UA^m

4.3.4 System mean down time for repairable subsystems
From the definition of:

  UAm_out_of_n = MDTm_out_of_n / MTBFm_out_of_n

it is possible to get the MDT for the m_out_of_n case by using the above-mentioned formulas for UAm_out_of_n and MTBFm_out_of_n. Consequently:

  MDTm_out_of_n = MDT / m

5 FORMULATION APPLIED TO THE BLOCK DIAGRAMS

Taking into consideration the general diagram of the Bioethanol obtaining process included at the beginning of Section 3, it is possible to summarize the reliability characteristics of the Bioethanol plant in the following formulas for a configuration in series.

5.1 System failure rate

• λCO2 = λMilling + λFermentation
• λEthanol = λMilling + λFermentation + λDistillation + λDehydration
• λDDGS = λMilling + λFermentation + λDistillation + λCentrifugation + λDrying
• λSyrup = λMilling + λFermentation + λDistillation + λCentrifugation + λEvaporation

Here, Hydrolysis and Saccharification have been included in the same process area as Fermentation.

5.2 System MTBF

• MTBFCO2 = MTBFMilling · MTBFFermentation / (MTBFMilling + MTBFFermentation)
• MTBFEthanol = MTBFCO2 · MTBFDis+Deh / (MTBFCO2 + MTBFDis+Deh), where:
  MTBFDis+Deh = MTBFDistillation · MTBFDehydration / (MTBFDistillation + MTBFDehydration)
• MTBFDDGS = MTBFCO2 · MTBFDis+Cen+Dry / (MTBFCO2 + MTBFDis+Cen+Dry), where:
  MTBFDis+Cen+Dry = MTBFDis+Cen · MTBFDrying / (MTBFDis+Cen + MTBFDrying), and
  MTBFDis+Cen = MTBFDistillation · MTBFCentrifugation / (MTBFDistillation + MTBFCentrifugation)
• MTBFSyrup = MTBFCO2 · MTBFDis+Cen+Eva / (MTBFCO2 + MTBFDis+Cen+Eva), where:
  MTBFDis+Cen+Eva = MTBFDis+Cen · MTBFEvaporation / (MTBFDis+Cen + MTBFEvaporation), and, as above:
  MTBFDis+Cen = MTBFDistillation · MTBFCentrifugation / (MTBFDistillation + MTBFCentrifugation)

5.3 System availability (A)

• ACO2 = AMilling · AFermentation
• AEthanol = ACO2 · ADistillation · ADehydration
• ADDGS = ACO2 · ADistillation · ACentrifugation · ADrying
• ASyrup = ACO2 · ADistillation · ACentrifugation · AEvaporation
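The m-out-of-n unavailability approximation in Section 4.3.3 can be checked against the exact binomial expression it is derived from. This is a hedged sketch; the pump example and all numbers are invented for the illustration.

```python
from math import comb

def ua_m_out_of_n_approx(ua, n, m):
    """UA_m_out_of_n ≈ n!/(m!(n−m)!) · UA^m, valid for UA = MDT/MTBF << 1."""
    return comb(n, m) * ua ** m

def ua_m_out_of_n_exact(ua, n, m):
    """Exact: probability that m or more of n identical subsystems are down,
    i.e. that at most (n − m) subsystems are available."""
    a = 1.0 - ua
    return sum(comb(n, i) * a ** i * ua ** (n - i) for i in range(0, n - m + 1))

# Example: n = 4 identical pumps, system down if m = 2 or more are down,
# per-pump unavailability UA = 0.01.
ua = 0.01
approx = ua_m_out_of_n_approx(ua, n=4, m=2)   # 6 · 0.01² = 6.0e-4
exact = ua_m_out_of_n_exact(ua, n=4, m=2)
print(approx, exact)   # the two agree to within ~2% at this UA
```

The gap between the two grows as the per-unit unavailability grows, which is why the text repeatedly flags the MDT << MTBF assumption.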
5.4 System Mean Down Time (MDT)

• MDTCO2 = (MTBFMilling · MDTFermentation + MTBFFermentation · MDTMilling) / (MTBFMilling + MTBFFermentation)
• MDTEthanol = (MTBFCO2 · MDTDis+Deh + MTBFDis+Deh · MDTCO2) / (MTBFCO2 + MTBFDis+Deh), where:
  MDTDis+Deh = (MTBFDistillation · MDTDehydration + MTBFDehydration · MDTDistillation) / (MTBFDistillation + MTBFDehydration)
• MDTDDGS = (MTBFCO2 · MDTDis+Cen+Dry + MTBFDis+Cen+Dry · MDTCO2) / (MTBFCO2 + MTBFDis+Cen+Dry), where:
  MDTDis+Cen+Dry = (MTBFDis+Cen · MDTDrying + MTBFDrying · MDTDis+Cen) / (MTBFDis+Cen + MTBFDrying), and
  MDTDis+Cen = (MTBFDistillation · MDTCentrifugation + MTBFCentrifugation · MDTDistillation) / (MTBFDistillation + MTBFCentrifugation)
• MDTSyrup = (MTBFCO2 · MDTDis+Cen+Eva + MTBFDis+Cen+Eva · MDTCO2) / (MTBFCO2 + MTBFDis+Cen+Eva), where:
  MDTDis+Cen+Eva = (MTBFDis+Cen · MDTEvaporation + MTBFEvaporation · MDTDis+Cen) / (MTBFDis+Cen + MTBFEvaporation), and, as above:
  MDTDis+Cen = (MTBFDistillation · MDTCentrifugation + MTBFCentrifugation · MDTDistillation) / (MTBFDistillation + MTBFCentrifugation)

Once these formulas have been developed for the main parts of the process, it is possible to continue breaking them down to any desired depth. It is clear that each area in the plant (milling, fermentation, distillation, etc.) actually has its own configuration for its different devices and equipment, with its specific combination of subsystems in series or in parallel (see for instance the different block diagrams included in Section 3). Therefore, as just mentioned, the reliability characteristics can also be broken down to a level of detail where it is possible to apply real values for such units. Many databases have been published which include real values for process equipment and which can be applied to analyze the reliability characteristics of the whole complex system, in this case a Bioethanol Plant.

6 CONCLUSION

With this research we aim to improve the estimations, demonstrating as well how requirements expressed in initial technical specifications can be incompatible or even impossible to fulfil for certain plant configurations. That is, the availability expected from a proposed configuration of the whole plant can turn out lower than specified, even with higher reliability or maintainability in each functional block, following the technical requirements in effect.
Additionally, reasonable estimations will be provided for the production availability, which can be delivered to the final customer in a more realistic engineering project. These estimations will be based on validated calculations of the functional blocks considered for the model simulation, showing moreover the importance and opportunity of a sensitivity analysis. It can also be decisive for the final selection of the plant technical configuration.
At the same time, this study can also be used to adjust some initial requirements in the plant technical specification. Once the data have been introduced in the model, they can be adjusted according to the real equipment included in the offer. In this way, it is also possible to study logistical aspects such as the amount of spare parts in stock.
Finally, not only are availability and reliability important; cost estimation is also a key factor. Therefore, an extension of this study could be to transfer the information provided in this research to a life cycle cost model, with the intention of assessing the plant globally.

ACKNOWLEDGEMENTS

The authors would like to thank the reviewers of the paper for their contribution to the quality of this work.
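As a worked illustration of the Section 5 chain formulas, the sketch below evaluates availability, MTBF and MDT for the four product chains. The per-block MTBF/MDT figures are invented for the example, not data from a real plant; the chain MDT is computed as the λ-weighted mean of the block MDTs, which is equivalent to recursively applying the two-subsystem formula of Section 5.4.

```python
def availability(mtbf, mdt):
    """Steady-state availability of one block: MTBF / (MTBF + MDT)."""
    return mtbf / (mtbf + mdt)

# Invented per-block figures in hours: block -> (MTBF, MDT)
blocks = {
    "Milling":        (800.0,  8.0),
    "Fermentation":   (1200.0, 24.0),
    "Distillation":   (1500.0, 12.0),
    "Dehydration":    (2000.0, 10.0),
    "Centrifugation": (1000.0, 6.0),
    "Evaporation":    (1800.0, 9.0),
    "Drying":         (900.0,  15.0),
}

# Series chains of Section 5 (hydrolysis/saccharification folded into Fermentation)
chains = {
    "CO2":     ("Milling", "Fermentation"),
    "Ethanol": ("Milling", "Fermentation", "Distillation", "Dehydration"),
    "DDGS":    ("Milling", "Fermentation", "Distillation", "Centrifugation", "Drying"),
    "Syrup":   ("Milling", "Fermentation", "Distillation", "Centrifugation", "Evaporation"),
}

results = {}
for product, chain in chains.items():
    lam = sum(1.0 / blocks[b][0] for b in chain)                  # λ = Σ λ_i
    mtbf = 1.0 / lam
    mdt = mtbf * sum(blocks[b][1] / blocks[b][0] for b in chain)  # λ_i-weighted MDT
    a = 1.0
    for b in chain:
        a *= availability(*blocks[b])                             # A = Π A_i
    results[product] = (a, mtbf, mdt)
    print(f"{product}: A = {a:.4f}, MTBF = {mtbf:.0f} h, MDT = {mdt:.1f} h")
```

As expected from the series structure, every downstream product (Ethanol, DDGS, Syrup) shows a lower availability and a shorter MTBF than the shorter CO2 chain.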

Handling dependencies between variables with imprecise probabilistic models

Sebastien Destercke & Eric Chojnacki
Institut de Radioprotection et de Sûreté Nucléaire, Cadarache, France

ABSTRACT: Two problems often encountered in uncertainty processing (and especially in safety studies)
are the following: modeling uncertainty when information is scarce or not fully reliable, and taking account
of dependencies between variables when propagating uncertainties. To solve the first problem, one can model
uncertainty by sets of probabilities rather than by single probabilities, resorting to imprecise probabilistic models.
The Iman and Conover method is an efficient and practical means to solve the second problem when uncertainty is
modeled by single probabilities and dependencies are monotonic. In this paper, we propose to combine
these two solutions by studying how the Iman and Conover method can be used with imprecise probabilistic
models.

1 INTRODUCTION

Modeling available information about input variables and propagating it through a model are the two main steps of uncertainty studies. The former step consists in choosing a representation fitting our current knowledge or information about the input variables or parameters, while the latter consists in propagating this information through a model (here functional) with the aim of estimating the uncertainty on the output(s) of this model. In this paper, we consider that uncertainty bears on N variables X1, ..., XN defined on the real line. For all i = 1, ..., N, we note xi a particular value taken by Xi.

Sampling methods such as Monte-Carlo sampling or Latin Hypercube sampling (Helton and Davis 2002) are very convenient tools to simulate and propagate random variables X1, ..., XN. Most of the time, they consist in sampling M realizations (x_1^j, ..., x_N^j), j = 1, ..., M, of the N random variables, thus building an M × N sample matrix S. Each line of the matrix S can then be propagated through a model T : R^N → R. When using such sampling techniques, it is usual to assume:

1. That the uncertainty on each Xi is representable by a unique probability density pi associated to a unique cumulative distribution Fi, with

   Fi(x) = Pi([−∞, x]) = ∫_{−∞}^{x} pi(u) du.

2. That the variables X1, ..., XN are independent, that is, that their joint probability distribution is provided by the product of the marginal probability distributions.

In real applications, both assumptions can be challenged in a number of practical cases: the first when the available information is scarce, imprecise or not fully reliable, and the second when independence between variables cannot be proved or is clearly unrealistic. As shown in (Ferson and Ginzburg 1996), making such assumptions when they are not justified can lead to underestimations of the final uncertainty on the output, possibly leading to bad decisions.

Although there exist some very practical solutions to overcome either the scarceness of the information or the dependencies between variables, there are few methods treating both problems at the same time. In this paper, we propose and discuss such a method, which combines the use of simple imprecise probabilistic representations with a classical technique used to model monotonic dependencies between variables (namely, the Iman and Conover method). The paper is divided in two main sections: Section 2 is devoted to the basics needed to understand the paper, and Section 3 explains and discusses the proposed method.

2 PRELIMINARIES

This section recalls the main principles of the Iman and Conover (Iman and Conover 1982) method to integrate monotonic dependencies in a sampling matrix, and introduces possibility distributions (Baudrit and Dubois 2006) and probability boxes (p-boxes for short) (Ferson, Ginzburg, Kreinovich, Myers, and Sentz 2003), the two practical probabilistic models we are going to consider. More details can be found in the references.
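The sampling-and-propagation scheme described in the introduction can be sketched as follows (a minimal Python sketch; the model T and the input distributions are arbitrary illustrative choices, not taken from the paper):

```python
import numpy as np

def propagate(samplers, T, M, seed=None):
    """Build the M x N sample matrix S column by column, then push each
    line of S through the model T: R^N -> R."""
    rng = np.random.default_rng(seed)
    S = np.column_stack([draw(rng, M) for draw in samplers])  # M x N matrix
    return np.apply_along_axis(lambda row: T(*row), 1, S)

# Illustrative model with two independent inputs: T(x1, x2) = x1 + x2**2
out = propagate(
    [lambda rng, M: rng.normal(0.0, 1.0, M),    # X1 ~ N(0, 1)
     lambda rng, M: rng.uniform(0.0, 1.0, M)],  # X2 ~ U(0, 1)
    T=lambda x1, x2: x1 + x2 ** 2, M=10_000, seed=0)
```

The empirical distribution of `out` then estimates the output uncertainty under the two assumptions (precise marginals, independence) discussed above.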
2.1 Integrating monotonic dependencies in sampling procedures

The first problem we deal with is the integration of dependencies into sampling schemes. In the sequel, Si,j denotes the matrix element in the ith line and jth column of S, while S·,j and Si,· respectively denote the jth column and ith line of S.

Suppose we consider two variables X, Y and a sample (xj, yj) of size M of these two variables. Then, if we replace the values xj and yj by their respective ranks (the lowest value among the xj receives rank 1, the second lowest rank 2, ..., and similarly for the yj), their Spearman rank correlation coefficient rs, which is equivalent to the Pearson correlation computed with ranks, is given by

   rs = 1 − (6 Σ_{j=1}^{M} dj²) / (M(M² − 1))

with dj the difference of rank between xj and yj. The Spearman correlation rs has various advantages:

i. it allows one to measure or characterize monotonic (not necessarily linear) dependencies between variables;
ii. it depends only on the ranks, not on the particular values of the variables (i.e. it is distribution-free).

Although Spearman rank correlations are not able to capture all kinds of dependencies, they remain nowadays one of the best ways to elicit dependency structures (Clemen, Fischer, and Winkler 2000).

Given a sample matrix S and an N × N target rank correlation matrix R (e.g. elicited from experts), Iman and Conover (Iman and Conover 1982) propose a method to transform the matrix S into a matrix S* such that the rank correlation matrix R* of S* is close to the target matrix R. This transformation consists in re-ordering the elements in each column S·,j of S, without changing their values, so that the result is the matrix S*. The transformation consists in the following steps:

1. Build an M × N matrix W whose N columns are random re-orderings of the vector (a1, ..., aM), where ai = φ⁻¹(i/(M + 1)), φ⁻¹ being the inverse of the standard normal cumulative distribution, that is

   ∀x, φ(x) = (1/√(2π)) ∫_{−∞}^{x} exp(−u²/2) du

   Let C be the N × N correlation matrix associated to W.
2. Build a lower triangular N × N matrix G such that G C Gᵀ = R, with Gᵀ the transpose of G. This can be done in the following way: use the Cholesky factorization procedure to decompose C and R into C = C′C′ᵀ and R = R′R′ᵀ, with both C′, R′ lower triangular matrices (due to the fact that the correlation matrices C, R are, by definition, positive definite and symmetric). Then G is given by G = R′C′⁻¹, and the transpose follows. Note that G is still a lower triangular matrix.
3. Compute the M × N matrix W* = W Gᵀ.
4. In each column S·,j of the original sample matrix, re-order the sampled values so that they are ranked as in the column W*·,j, thus obtaining a matrix S* whose rank correlation matrix R* is close to R (but not necessarily equal, as for a given number M of samples, rank correlation coefficients can only assume a finite number of distinct values).

This method makes it possible to take account of monotonic dependencies between the variables in sampling schemes (and, therefore, in the subsequent propagation), without making any assumption about the shape of the probability distributions and without changing the sampled values (it just rearranges their pairings in the sample matrix). It is also mathematically simple, and applying it does not require complex tools, as would other approaches involving, for example, copulas (Nelsen 2005).

2.2 Modeling uncertainty with sets of probabilities

The second problem concerns situations where the available information is scarce, imprecise or not fully reliable. Such information can come, for instance, from experts, from few experimental data, from sensors, etc. There are many arguments converging to the fact that, in such situations, a single probability distribution is unable to account for the scarcity or imprecision present in the available information, and that such information would be better modeled by sets of probabilities (see (Walley 1991, Ch. 1) for a summary and review of such arguments).

Here, we consider two such models: p-boxes and possibility distributions. They are both popular and simple, and are instrumental to represent or elicit information from experts (for more general models and a longer discussion, see (Destercke, Dubois, and Chojnacki 2007)).

P-boxes (short name for probability boxes) are the imprecise counterparts of cumulative distributions. They are defined by an upper (F̄) and a lower (F̲) cumulative distribution forming a pair [F̲, F̄] describing the uncertainty: the information only allows us to state that the true cumulative distribution lies between F̲ and F̄, and any cumulative distribution F such that F̲ ≤ F ≤ F̄ is coherent with the available information.
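Steps 1–4 of the Iman and Conover method recalled in Section 2.1 can be sketched as follows (a minimal NumPy sketch; the function and variable names are ours):

```python
import numpy as np
from statistics import NormalDist

def iman_conover(S, R, seed=None):
    """Re-order each column of the M x N sample matrix S so that its rank
    correlation matrix gets close to the N x N target matrix R."""
    rng = np.random.default_rng(seed)
    M, N = S.shape
    # Step 1: W holds random re-orderings of a_i = phi^{-1}(i / (M + 1))
    a = np.array([NormalDist().inv_cdf(i / (M + 1)) for i in range(1, M + 1)])
    W = np.column_stack([rng.permutation(a) for _ in range(N)])
    C = np.corrcoef(W, rowvar=False)
    # Step 2: G = R' C'^{-1}, with C = C'C'^T and R = R'R'^T (Cholesky)
    G = np.linalg.cholesky(R) @ np.linalg.inv(np.linalg.cholesky(C))
    # Step 3: W* = W G^T
    W_star = W @ G.T
    # Step 4: give each column of S the ranks of the matching column of W*
    S_star = np.empty_like(S)
    for j in range(N):
        ranks = np.argsort(np.argsort(W_star[:, j]))  # 0-based ranks
        S_star[:, j] = np.sort(S[:, j])[ranks]
    return S_star

# Illustrative use: induce a target rank correlation of 0.7 on 400 pairs
rng = np.random.default_rng(42)
S = rng.standard_normal((400, 2))
R = np.array([[1.0, 0.7], [0.7, 1.0]])
S_star = iman_conover(S, R, seed=7)
```

Each column of `S_star` is a permutation of the corresponding column of `S`, so the marginal samples are untouched; only their pairings change.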
Figure 1. Illustration of a p-box.

A p-box induces a set P[F̲,F̄] of probabilities, such that

   P[F̲,F̄] = {P | ∀x ∈ R, F̲(x) ≤ P([−∞, x]) ≤ F̄(x)}.

P-boxes are appropriate models when experts provide a set of (imprecise) percentiles, when considering the error associated to sensor data, when we have only few experimental data, or when we have only information about some characteristics of a distribution (Ferson, Ginzburg, Kreinovich, Myers, and Sentz 2003). Consider the following expert opinion about the temperature of a fuel rod in a nuclear reactor core during an accidental scenario:

• Temperature is between 500 and 1000 K
• The probability to be below 600 K is between 10 and 20%
• The probability to be below 800 K is between 40 and 60%
• The probability to be below 900 K is between 70 and 100%

Figure 1 illustrates the p-box resulting from this expert opinion.

Possibility distributions correspond to information given in terms of confidence intervals, and thus correspond to a very intuitive notion. A possibility distribution is a mapping π : R → [0, 1] such that there is at least one value x for which π(x) = 1. Given a possibility distribution π, the possibility Π and necessity N measures of an event A are respectively defined as:

   Π(A) = max_{x∈A} π(x)   and   N(A) = 1 − Π(Aᶜ)

with Aᶜ the complement of A. For any event A, N(A) ≤ Π(A), and the possibility and necessity measures are respectively interpreted as upper and lower confidence levels given to an event, defining a set of probabilities Pπ (Dubois and Prade 1992) such that

   Pπ = {P | ∀A ⊆ R, N(A) ≤ P(A) ≤ Π(A)}

with P a probability distribution. For a given possibility distribution π and a given value α ∈ [0, 1], the (strict) α-cut of π is defined as the set

   πα = {x ∈ R | π(x) > α}.

Note that α-cuts are nested (i.e. for two values α < β, we have πβ ⊂ πα). An α-cut can then be interpreted as an interval to which we give confidence 1 − α (the higher α, the lower the confidence). α-cuts and the set of probabilities Pπ are related in the following way:

   Pπ = {P | ∀α ∈ [0, 1], P(πα) ≥ 1 − α}.

Possibility distributions are appropriate when experts express their opinion in terms of nested confidence intervals, or more generally when information is modeled by nested confidence intervals (Baudrit, Guyonnet, and Dubois 2006). As an example, consider an expert opinion, still about the temperature of a fuel rod in a nuclear reactor core, but this time expressed by nested confidence intervals:

• Probability to be between 750 and 850 K is at least 10%
• Probability to be between 650 and 900 K is at least 50%
• Probability to be between 600 and 950 K is at least 90%
• Temperature is between 500 and 1000 K (100% confidence)

Figure 2 illustrates the possibility distribution resulting from this opinion.

Figure 2. Illustration of a possibility distribution.

3 PROPAGATING WITH DEPENDENCIES AND IMPRECISE MODELS

Methods presented in Section 2 constitute very practical solutions to two different problems often encountered in applications. As both problems can be encountered in a same application, it would be interesting to blend these two tools. Such a blending is proposed in this section.
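The two expert opinions above can be encoded directly as step functions (a minimal sketch; the container and function names are ours, the numbers are those of the bullet lists):

```python
# Expert bounds on P(T <= t): t -> (lower, upper), from the first opinion.
PCTL = {500.0: (0.0, 0.0), 600.0: (0.10, 0.20),
        800.0: (0.40, 0.60), 900.0: (0.70, 1.00), 1000.0: (1.0, 1.0)}

def F_lower(x):
    """Lower cdf bound: tightest lower bound implied at or below x."""
    return max([lo for t, (lo, _) in PCTL.items() if t <= x], default=0.0)

def F_upper(x):
    """Upper cdf bound: tightest upper bound implied at or above x."""
    return min([up for t, (_, up) in PCTL.items() if t >= x], default=1.0)

# Nested intervals (a, b, confidence) of the second opinion.
CUTS = [(750.0, 850.0, 0.10), (650.0, 900.0, 0.50),
        (600.0, 950.0, 0.90), (500.0, 1000.0, 1.00)]

def pi(x):
    """Possibility degree: min of (1 - confidence) over the stated
    intervals that do NOT contain x (1.0 if every interval contains x)."""
    outside = [1.0 - conf for a, b, conf in CUTS if not (a <= x <= b)]
    return min(outside) if outside else 1.0
```

Evaluating these on a grid reproduces the step shapes of Figures 1 and 2; e.g. `pi` equals 1 on the core [750, 850] and drops to 0 outside [500, 1000].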
3.1 Sampling with imprecise probabilistic models

When the uncertainty on a (random) variable X is modeled by a precise cumulative distribution FX, simulating this variable X by sampling methods usually consists of drawing values α from a uniform law on [0, 1], and then associating the (precise) value F⁻¹(α) to each value α (see Figure 3.A). In the case of an N-dimensional problem simulated by M samples, the jth sample consists of a vector (α_1^j, ..., α_N^j), to which is associated the realization (F1⁻¹(α_1^j), ..., FN⁻¹(α_N^j)) = (x_1^j, ..., x_N^j). Let us now detail what would be the result of such a sampling with imprecise models.

P-boxes: since a p-box is described by (lower and upper) bounds on cumulative distributions, to each value α no longer corresponds a unique inverse value, but a set of possible values. This set of possible values corresponds to the interval bounded by the upper (F̄⁻¹) and lower (F̲⁻¹) pseudo-inverses, defined, for all α ∈ (0, 1], as follows:

   F̄⁻¹(α) = sup{x ∈ R | F̄(x) < α}
   F̲⁻¹(α) = inf{x ∈ R | F̲(x) > α}

See Figure 3.B for an illustration. Thus, given a p-box [F̲, F̄], to a sampled value α ∈ [0, 1] we associate the interval Γα such that

   Γα := [F̄⁻¹(α), F̲⁻¹(α)]

Possibility distributions: in the case of a possibility distribution, it is natural to associate to each value α the corresponding α-cut (see Figure 3.C for an illustration). Anew, this α-cut πα is, in general, not a single value but an interval.

We can see that, by admitting imprecision in our uncertainty representation, usual sampling methods no longer provide precise values but intervals (which are effectively the imprecise counterparts of single values). With such models, elements of the matrix S can be intervals, and propagating them through a model T will require the use of interval analysis techniques (Moore 1979). Although achieving such a propagation is more difficult than single-point propagation when the model T is complex, it can still remain tractable, even for high-dimensional problems (see (Oberguggenberger, King, and Schmelzer 2007) for example). Nevertheless, propagation is not our main concern here, and the sampling scheme can be considered independently of the subsequent problem of propagation.

Also note that the above sampling procedures have been considered by Alvarez (Alvarez 2006) in the more general framework of random sets, of which p-boxes and possibility distributions constitute two particular instances. Let us now see how the Iman and Conover method can be extended to such models.

3.2 Extension of Iman and Conover method

We first recall some notions coming from order theory. Let P be a set and ≤ a relation on the elements of this set. Then, ≤ is a partial order if it is reflexive, antisymmetric and transitive, that is, if for every triplet a, b, c of elements in P

   a ≤ a (reflexivity)  (1)
   if a ≤ b and b ≤ a, then a = b (antisymmetry)  (2)
   if a ≤ b and b ≤ c, then a ≤ c (transitivity)  (3)

If for two elements a, b, neither a ≤ b nor b ≤ a, then a and b are said to be incomparable. A partial order ≤ is total, and is called an order (or a linear order), if for every pair a, b in P, we have either a ≤ b or b ≤ a.

When uncertainty is modeled by precise probabilities, sampled values are precise, and the main reason for being able to apply the Iman and Conover method in this case is that there is a natural complete ordering between real numbers, so that to any set of values corresponds a unique ranking. This is no longer the case when realizations are intervals, since in most cases only partial orderings can be defined on sets of intervals (due to the fact that they can be overlapping, nested, disjoint, ...). Given two intervals [a, b], [c, d], it is common to consider the partial ordering such that [a, b] < [c, d] if and only if b < c, and to consider that two intervals are incomparable as soon as they overlap. This partial order is commonly called the interval order. Adapting the Iman and Conover method when samples are general intervals thus seems difficult, and would result in a not very convenient tool, since one would have to consider every possible extension of the partial ordering induced by the interval order.

To circumvent this problem and to be able to apply the Iman and Conover method in an easy way on p-boxes and possibility distributions, we have to define a complete ordering on the elements sampled from these two representations.

First, note that when the uncertainty on a variable X is modeled by a single (invertible) cumulative distribution FX, there is a one-to-one correspondence between the ranking of sampled values α^j ∈ [0, 1] and the ranking of the corresponding values of X, in the sense that, for two values α^i, α^j, we have

   α^i < α^j ⟺ F⁻¹(α^i) < F⁻¹(α^j).  (4)
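The interval realizations of Section 3.1 admit closed forms for simple representations (an illustrative sketch: the p-box bounds and the triangular possibility distribution are arbitrary choices echoing the earlier fuel-temperature examples, not models from the paper):

```python
from statistics import NormalDist

# P-box bounded by two shifted normal cdfs: the upper cdf is the
# stochastically smaller bound, the lower cdf the larger one.
F_up = NormalDist(mu=700.0, sigma=50.0)   # upper cdf
F_lo = NormalDist(mu=800.0, sigma=50.0)   # lower cdf

def pbox_realization(alpha):
    """Interval associated to a sampled alpha: pseudo-inverse of the
    upper cdf gives the left endpoint, of the lower cdf the right one."""
    return (F_up.inv_cdf(alpha), F_lo.inv_cdf(alpha))

def alpha_cut(alpha, a=500.0, c=750.0, b=1000.0):
    """Alpha-cut of a triangular possibility distribution with support
    [a, b] and core {c}: it shrinks linearly toward c as alpha grows."""
    return (a + alpha * (c - a), b - alpha * (b - c))

lo, hi = pbox_realization(0.5)   # medians of the two bounds
cut = alpha_cut(0.5)
```

Both constructions return intervals that only shift (p-box) or only shrink (possibility) as α increases, which is exactly the structure exploited in Section 3.2.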
Figure 3. Sampling from precise and imprecise probabilistic models: illustration (3.A: precise probability; 3.B: p-box; 3.C: possibility distribution).

We will use this property to extend the Iman and Conover technique when realizations are either intervals Γα coming from a p-box or α-cuts πα of a possibility distribution.

P-boxes: consider first a p-box [F̲, F̄]. For such a p-box, an ordering similar to Equation (4) means that for two values α, β in [0, 1], we have α < β → Γα < Γβ, which is equivalent to imposing a complete ordering between the ‘‘cuts’’ Γ of the p-box. We note this ordering ≤[F̲,F̄]. Given two intervals [a, b], [c, d], this definition is equivalent to stating that [a, b] ≤[F̲,F̄] [c, d] if and only if a ≤ c and b ≤ d. Roughly speaking, taking such an ordering means that the rank of an interval increases as it ‘‘shifts’’ towards higher values. Note that the ordering ≤[F̲,F̄] is complete only when the intervals are sampled from a p-box, and that incomparability can appear in more general cases (e.g. when intervals are nested, or when they come from general random sets).

Possibility distributions: given a possibility distribution π, the ordering similar to Equation (4) is equivalent to considering a complete ordering on α-cuts induced by inclusion: for two values α, β in [0, 1], we have α < β → πα ⊃ πβ. We note ≤π the ordering such that

   [a, b] ≤π [c, d] if and only if [a, b] ⊃ [c, d]

with [a, b], [c, d] two intervals. Here, the rank of an interval increases as it gets more precise (narrower). Again, the ordering ≤π is complete only when the intervals are sampled from a possibility distribution.

Now that we have defined complete orderings on intervals sampled from p-boxes and possibility distributions, we can apply the Iman and Conover method without difficulty to these models. Nevertheless, one must pay attention to the fact that a same value of rank correlation will have different meanings and interpretations, depending on the chosen representation (and, consequently, on the chosen ordering).

In the case of p-boxes defined on multiple variables, a positive (negative) rank correlation always means that to higher values are associated higher (lower) values. The main difference with single cumulative distributions is that samples are now intervals instead of single values. In the case of p-boxes, the application of the Iman and Conover method can then be seen as a ‘‘simple’’ extension of the usual method, with the benefit that the imprecision and scarceness of the information is now acknowledged in the uncertainty model. As a practical tool, it can also be seen as a means to achieve a robustness study (concerning either the distribution shapes or the correlation coefficients). Since correlation coefficients can seldom be exactly known and are often provided by experts, such a robustness interpretation appears appealing. Also note that the ordering ≤[F̲,F̄] is a refinement of the classical ordering considered on intervals, and reduces to the classical ordering between numbers when samples are single values. All this indicates that using the Iman and Conover method on p-boxes is also equivalent to inducing monotonic dependencies between variables.

Contrary to p-boxes, rank correlations related to possibility distributions and to the ordering ≤π cannot be considered as an extension of the classical Spearman rank correlations. To see this, simply note that the ordering ≤π, based on inclusion between sets, is not a refinement of the classical ordering considered on intervals, and does not reduce to the classical ordering of numbers when samples are single values. In the case of possibility distributions, a positive (negative) rank correlation between two variables X, Y means that to more precise descriptions of the uncertainty on X will be associated more (less) precise descriptions of the uncertainty on Y, i.e. that to narrower intervals will correspond narrower (broader) intervals. Such dependencies can be used when sensors or experts are likely to be correlated, or in physical models where knowing a value with more precision means knowing another one with less precision (of which the Heisenberg principle constitutes a famous example).
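The two complete orderings just defined can be written as simple predicates (a minimal sketch; the function names are ours):

```python
def leq_pbox(I, J):
    """I <= J in the p-box ordering: both endpoints shift upward."""
    (a, b), (c, d) = I, J
    return a <= c and b <= d

def leq_pi(I, J):
    """I <= J in the possibility ordering: J is included in I, i.e. the
    second interval is at least as precise as the first."""
    (a, b), (c, d) = I, J
    return a <= c and d <= b

# Shifted intervals are comparable under leq_pbox, nested ones under leq_pi.
assert leq_pbox((1.0, 3.0), (2.0, 4.0)) and not leq_pbox((2.0, 4.0), (1.0, 3.0))
assert leq_pi((1.0, 5.0), (2.0, 4.0)) and not leq_pi((2.0, 4.0), (1.0, 5.0))
```

Sorting a sample of intervals with a comparator built on these predicates (e.g. via `functools.cmp_to_key`) yields the complete ranking needed in step 4 of the Iman and Conover method.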
Such kinds of dependencies bear little relation to monotonic dependencies, meaning that using the proposed extension for possibility distributions is NOT equivalent to assuming monotonic dependencies between variables, but rather to assuming a dependency between the precision of the knowledge we have on the variables. Nevertheless, if monotonic dependencies have to be integrated and if information is modeled by possibility distributions, it is always possible to extract a corresponding p-box from a possibility distribution, and then to sample from this corresponding p-box (see (Baudrit and Dubois 2006)).

4 CONCLUSIONS

Integrating known correlations between variables and dealing with scarce or imprecise information are two problems that coexist in many real applications. The use of rank correlation through the means of the Iman and Conover method and the use of simple imprecise probabilistic models are practical tools to solve these two problems. In this paper, we have proposed an approach to blend these two solutions, thus providing a practical tool to cope (at the same time) with monotonic dependencies between variables and with scarceness or imprecision in the information.

Sampling methods and complete orderings related to possibility distributions and p-boxes have been studied and discussed. They allow the application of the Iman and Conover method to these two models without additional computational difficulties. We have argued that, in the case of p-boxes, rank correlations can still be interpreted in terms of monotonic dependencies, thus providing a direct extension of the Iman and Conover method, with the advantage that it can be interpreted as an integrated robustness study. The interpretation concerning possibility distributions is different, as it is based on set inclusion, and describes dependencies between the precision of the knowledge we can acquire on different variables. We suggest that such correlations can be useful in some physical models, or when sources of information (sensors, experts) are likely to be correlated.

In our opinion, the prime interest of the suggested extensions is practical, as they allow the use of very popular and efficient numerical techniques, such as Latin Hypercube Sampling and the Iman and Conover method, with imprecise probabilistic models. Moreover, the proposed extensions can benefit from all the results concerning these numerical techniques (for instance, see (Sallaberry, Helton, and Hora 2006)).

REFERENCES

Alvarez, D.A. (2006). On the calculation of the bounds of probability of events using infinite random sets. I. J. of Approximate Reasoning 43, 241–267.
Baudrit, C. and D. Dubois (2006). Practical representations of incomplete probabilistic knowledge. Computational Statistics and Data Analysis 51 (1), 86–108.
Baudrit, C., D. Guyonnet, and D. Dubois (2006). Joint propagation and exploitation of probabilistic and possibilistic information in risk assessment. IEEE Trans. Fuzzy Systems 14, 593–608.
Clemen, R., G. Fischer, and R. Winkler (2000, August). Assessing dependence: some experimental results. Management Science 46 (8), 1100–1115.
Destercke, S., D. Dubois, and E. Chojnacki (2007). Relating practical representations of imprecise probabilities. In Proc. 5th Int. Symp. on Imprecise Probabilities: Theories and Applications.
Dubois, D. and H. Prade (1992). On the relevance of non-standard theories of uncertainty in modeling and pooling expert opinions. Reliability Engineering and System Safety 36, 95–107.
Ferson, S., L. Ginzburg, V. Kreinovich, D. Myers, and K. Sentz (2003). Constructing probability boxes and Dempster-Shafer structures. Technical report, Sandia National Laboratories.
Ferson, S. and L.R. Ginzburg (1996). Different methods are needed to propagate ignorance and variability. Reliability Engineering and System Safety 54, 133–144.
Helton, J. and F. Davis (2002). Illustration of sampling-based methods for uncertainty and sensitivity analysis. Risk Analysis 22 (3), 591–622.
Iman, R. and W. Conover (1982). A distribution-free approach to inducing rank correlation among input variables. Communications in Statistics 11 (3), 311–334.
Moore, R. (1979). Methods and Applications of Interval Analysis. SIAM Studies in Applied Mathematics. Philadelphia: SIAM.
Nelsen, R. (2005). Copulas and quasi-copulas: An introduction to their properties and applications. In E. Klement and R. Mesiar (Eds.), Logical, Algebraic, Analytic, and Probabilistic Aspects of Triangular Norms, Chapter 14. Elsevier.
Oberguggenberger, M., J. King, and B. Schmelzer (2007). Imprecise probability methods for sensitivity analysis in engineering. In Proc. of the 5th Int. Symp. on Imprecise Probabilities: Theories and Applications, pp. 317–326.
Sallaberry, C., J. Helton, and S. Hora (2006). Extension of Latin hypercube samples with correlated variables. Tech. rep. SAND2006-6135, Sandia National Laboratories, Albuquerque. http://www.prod.sandia.gov/cgibin/techlib/accesscontrol.pl/2006/066135.pdf.
Walley, P. (1991). Statistical Reasoning with Imprecise Probabilities. New York: Chapman and Hall.
Monte Carlo simulation for investigating the influence of maintenance strategies on the production availability of offshore installations

K.P. Chang, D. Chang & T.J. Rhee
Hyundai Industrial Research Institute, Hyundai Heavy Industries, Ulsan, Korea

Enrico Zio
Energy Department, Politecnico di Milano, Milan, Italy

ABSTRACT: Monte Carlo simulation is used to investigate the impact of the maintenance strategy on the
production availability of offshore oil and gas plants. Various realistic preventive maintenance strategies and
operational scenarios are considered. The reason for resorting to Monte Carlo simulation is that it provides
the necessary flexibility to describe realistically the system behavior, which is not easily captured by analytical
models. A prototypical offshore production process is taken as the pilot model for the production availability
assessment by Monte Carlo simulation. The system consists of a separator, compressors, power generators,
pumps and dehydration units. A tailor-made computer program has been developed for the study, which makes it
possible to account for the operational transitions of the system components as well as the preventive and corrective
maintenance strategies for both the power generation and compressor systems.

1 INTRODUCTION

By definition, production availability is the ratio between actual and planned production over a specified period of time [NORSOK Z-016, 1998]. Production availability is considered an important indicator of the performance of offshore installations, since it describes how far the system is capable of meeting the demand for deliveries.

Recently, the demands of the offshore industry for production availability analysis have not been limited to knowing the average production evolution, but also extend to the optimization of component and system maintenance strategies, e.g., the maintenance intervals and spare parts holdings.

Indeed, production availability is affected by the frequencies of corrective and preventive maintenance tasks. Furthermore, the spare parts holding requirements must comply with the limits of space and weight. Items with long lead times for replacement can also have a serious negative effect on the production availability.

The following are typical results expected from a production availability analysis:

– To identify the critical items which have a dominant effect on the production shortfall
– To verify the intervention and maintenance strategies planned during production
– To determine the (minimum) spare part holdings

To satisfy these requirements, stochastic simulation models, such as Monte Carlo simulation, are increasingly being used to estimate the production availabilities of offshore installations. They allow accounting for realistic maintenance strategies and operational scenarios [E. Zio et al., 2006].

The purpose of the present study is to develop a Monte Carlo simulation method for the evaluation of the production availability of offshore facilities while accounting for the realistic aspects of system behavior. A Monte Carlo simulation model has been developed to demonstrate the effect of maintenance strategies on the production availability, e.g., by comparing the system performance without and with preventive maintenance, and of delays of spare parts for critical items.

2 SYSTEM DESCRIPTION

A prototypical offshore production process is taken as the pilot model for the production availability assessment by Monte Carlo simulation (Figure 1).

2.1 Functional description

The three-phase fluid produced in the production well enters a main separation system, which is a single-train three-stage separation process. The well fluid is separated into oil, water and gas by the separation process.
Gas
Oil
Export Gas Compression Water
Electricity

Export Gas Compression Dehydration Gas Export

Export Gas Compression

Lift Gas Compression

Power Generation

Power Generation

Production Three-Phase
Export Oil Pumping Oil Export
Well Separation

Injection Water Pumping

Figure 1. Functional block diagram.

The well produces at its maximum 30,000 m³/d of oil, which is the amount of oil that the separator can handle. The separated oil is exported by the export pumping unit, also with a capacity of 30,000 m³/d of oil.

Off-gas from the separator is routed to the main compressor unit, with two compressors running and one on standby in a 2oo3 voting configuration. Each compressor can process a maximum of 3.0 MMscm/d. The nominal gas throughput for the system is assumed to be 6.0 MMscm/d, and the system performance will be evaluated at this rate. Gas dehydration is required for the lift gas, the export gas and the fuel gas. The dehydration is performed by a 1 × 100% glycol contactor on the total gas flowrate, based on gas saturated with water at conditions downstream of the compressor. The total maximum gas processing throughput is assumed to be 6.0 MMscm/d, limited by the main compression and dehydration trains.

To ensure the nominal level of production of the well, the lift gas is supplied from the discharge of the compression, after dehydration, and routed to the lift gas risers under flow control on each riser. An amount of 1.0 MMscm/d is compressed by the lift gas compressor and injected back into the production well.

Water is injected into the producing reservoirs to enhance oil production and recovery. The water separated in the separator and treated seawater are injected in the field. The capacity of the water injection system is assumed to be 5,000 m³/d.

The 25 MW power requirement of the production system will be met by 2 × 17 MW gas turbine-driven power generation units.

Table 1. Transition rates of the components.

                         Transition rate (1/hr)
Component                Failure        Repair
Dehydration              3.49 × 10⁻⁴    8.33 × 10⁻²
Lift gas compressor      6.57 × 10⁻⁴    6.98 × 10⁻²
Export oil pump          7.06 × 10⁻⁴    3.66 × 10⁻²
Injection water pump     2.27 × 10⁻⁴    1.33 × 10⁻²
Three-phase separator    4.25 × 10⁻⁴    19.6 × 10⁻²
Export gas compressor    6.69 × 10⁻⁴    4.29 × 10⁻²
Power generation         1.70 × 10⁻³    3.24 × 10⁻²

2.2 Component failures and repair rates

For simplicity, the study considers in detail the stochastic failure and maintenance behaviors of only the 2oo3 compressor system (one in standby) for the gas export and the 2oo2 power generation system; the other components have only two states, ''functioning'' and ''failed''.

The transition rates of the components with only two transition states are given in Table 1.
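For the two-state components, the constant rates of Table 1 imply exponentially distributed times to failure and to repair, so each component's long-run availability is μ/(λ + μ). This can be checked with a minimal alternating-renewal sketch (the function, its defaults and the choice of the export oil pump as example are ours, not the paper's code):

```python
import random

def two_state_availability(lam, mu, horizon_hr=1_000_000.0, seed=0):
    """Alternating renewal process with exponential times to failure (rate
    lam, 1/hr) and to repair (rate mu, 1/hr); returns the fraction of the
    mission time spent in the functioning state."""
    rng = random.Random(seed)
    t = up_time = 0.0
    functioning = True
    while t < horizon_hr:
        dwell = rng.expovariate(lam if functioning else mu)
        if functioning:
            up_time += min(dwell, horizon_hr - t)   # truncate at the horizon
        t += dwell
        functioning = not functioning
    return up_time / horizon_hr

# Export oil pump row of Table 1: lam = 7.06e-4 /hr, mu = 3.66e-2 /hr.
lam, mu = 7.06e-4, 3.66e-2
est = two_state_availability(lam, mu)
exact = mu / (lam + mu)     # steady-state availability for these rates
assert abs(est - exact) < 0.01
```

The simulated fraction converges to the analytic steady-state value as the mission horizon grows, which is a quick sanity check on the sampling of the exponential holding times.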

Table 2. Summary of different production levels with the component failures.

Production level
(system capacity, %)   Example of failure events                        Oil (km³/d)   Gas (MMscm/d)   Water injection (km³/d)
100%                   None                                             30            6               5
70%                    Lift gas compressor                              20            4               4
70%                    Water injection pump                             20            4               0
50%                    Two export gas compressors; one power
                       generator; or two export gas compressors
                       and one power generator together                 15            3               5
50%                    Two export gas compressors and
                       injection water pumping                          15            3               0
30%                    Lift gas compressor and
                       injection water pump                             10            2               0
0%                     Dehydration unit; all three export gas
                       compressors; or both power generators            0             0               0
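The reconfiguration rules of Table 2 amount to a lookup from the set of failed units to a capacity level. A possible encoding, with our own shorthand unit names (the paper does not publish its implementation):

```python
def production_level(failed):
    """Map the set of failed units to system capacity (% of nominal oil
    production), following Table 2. Unit names are our own shorthand:
    egc1-3 = export gas compressors, pg1-2 = power generators,
    lift = lift gas compressor, wip = water injection pump."""
    failed = set(failed)
    if ("dehydration" in failed
            or {"egc1", "egc2", "egc3"} <= failed
            or {"pg1", "pg2"} <= failed):
        return 0                                   # total system shutdown
    if (len({"egc1", "egc2", "egc3"} & failed) >= 2
            or len({"pg1", "pg2"} & failed) >= 1):
        return 50                                  # compression flow halved
    if "lift" in failed and "wip" in failed:
        return 30
    if "lift" in failed or "wip" in failed:
        return 70
    return 100

assert production_level([]) == 100
assert production_level(["wip"]) == 70
assert production_level(["pg1"]) == 50
assert production_level(["lift", "wip"]) == 30
assert production_level(["dehydration"]) == 0
```

The branch order matters: the 50% condition is tested before the lift-gas/water-injection conditions so that, as in Table 2, two failed export gas compressors dominate a simultaneous water-injection failure.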

The compressor and power generation systems are subjected to stochastic behavior patterns due to their voting configurations. The failure and repair events for both the compressor and power generation systems are described in Section 3.1 in detail.

2.3 Production re-configuration

The failure of the components and systems is assumed to have the following effects on the production level:

– Failure of any component immediately causes the production level to decrease by a step.
– Failure of the lift gas compression or of the water injection pump reduces the oil production by 10,000 m³/day (30% of the total oil production rate) and the gas production by 2.0 MMscm/day.
– Failure of both the lift gas compression and the injection water pumping reduces the oil production by 20,000 m³/day and the gas production by 4.0 MMscm/day.
– Failure of two export gas compressors or of one generator forces the compression flow rate to decrease from 6.0 MMscm/day to 3.0 MMscm/day, forcing the oil production rate to reduce accordingly from 30,000 m³/day to 15,000 m³/day.
– Failure of the dehydration unit, of all three export gas compressors, or of both power generators results in total system shutdown.

The strategy of production reconfiguration against the failure of the components in the system is illustrated in Table 2.

Figure 2. State diagram of export compression system. [States 0–3; failure transitions at rates 2λi, λi, plus a common-cause rate λc from state 0 directly to state 2; repair transitions at rates μi, 2μi, μtotal (3μi).]

3 MAINTENANCE STRATEGIES

3.1 Corrective maintenance

Once failures happen in the system, it is assumed that corrective maintenance is immediately implemented by a single team apt to repair the failures. In case two or more components fail at the same time, the maintenance tasks are carried out according to the sequence of occurrence of the failure events.

The failure and repair events of the export gas compressor system and of the power generation system are more complex than those of the other components. Figure 2 shows the state diagram of the export compression system. As shown in Figure 2, common cause failures which would result in the entire system shutdown are not considered in the study. The compressors in the export compression system are considered identical. The transitions from one state to another are assumed to be exponentially distributed.

The export compression system can be in four different states.

The state 0 corresponds to two active compressors running at 100% capacity. The state 1 corresponds to one of the two active compressors being failed and the third (standby) compressor being switched on while the repair task is carried out; the switch is considered perfect and therefore the state 1 produces the same capacity as the state 0. State 2 represents operation with only one active compressor or one standby compressor (two failed compressors), i.e., 50% capacity; the export compression system can transfer to state 2 either from the state 0 directly (due to common cause failure of 2 of the three compressors) or from state 1 (due to failure of an additional compressor). The state 3 corresponds to the total system shutdown, due to failure of all three compressors.

The same assumptions of the export compression system apply to the power generation system, although there are only 3 states, given the parallel system logic. The state diagram is shown in Figure 3. Repairs allow returning from lower-capacity states to higher-capacity ones.

Figure 3. State diagram of power generation system. [States 0–2; failure transitions at rates 2λi, λi, plus a common-cause rate λc; repair transitions at rates μi, 2μi.]

3.2 Preventive maintenance

The following is assumed for the preventive maintenance tasks considered in the study:

– Scheduled preventive maintenance is only implemented for the compressor system for the gas export and for the power generation system.
– Scheduled maintenance tasks of the compressors and of the power generation system are carried out at the same time, to minimize downtime.
– The well should be shut down during preventive maintenance.

The scheduled maintenance intervals for both systems are given in Table 3.

Table 3. Scheduled maintenance intervals for compressors and power generators.

Period, month (year)   Maintenance action                       Downtime, hr (day)
2 (1/6)                Detergent wash                           6 (0.25)
4 (1/3)                Service/Cleaning                         24 (1.0)
12 (1)                 Boroscopic inspection/Generator check    72 (3.0)
60 (5)                 Overhaul or replacement                  120 (5.0)

Figure 4. Flow chart for developed simulation program. [Input the system configuration and component information → estimate the next transition time for each component → determine the shortest transition time → perform the transition of the component with the shortest transition time → evaluate the system capacity and production availability.]

4 MONTE CARLO SIMULATION MODEL

The system stochastic failure/repair/maintenance behavior described in Sections 2 and 3 has been modeled by Monte Carlo simulation and quantified by a dedicated computer code.

4.1 Model algorithm

Figure 4 illustrates the flowchart of the Monte Carlo simulator developed in the study.

First of all, the program imports the system configuration with the detailed information of the components, including the failure rates, repair times, preventive maintenance intervals and required downtimes. Then, the simulator estimates the next transition time for all the components. Note that the next transition time relies on the current state of each component. When the component is under corrective or preventive maintenance, the next transition occurs after the time required for the maintenance action. This maintenance time is predetermined.

When the component is in normal operating condition (not necessarily at 100% capacity), the next transition time is determined by direct Monte Carlo simulation (Marseguerra & Zio 2002). Table 4 summarizes how the transition time is estimated, depending on the initial state.

Table 4. Estimation of transition time depending on initial state.

Initial state                      Transition time
Corrective maintenance             Time required for corrective maintenance
Preventive maintenance             Time required for preventive maintenance (MTTR)
Normal (including partial load)    To be estimated by the direct Monte Carlo method

Figure 5. Spare holding decision algorithm (ABS 2004). [Part list → Step 1: Will the stock-out have a direct effect on the offshore production? No → no spares holding. Yes → Step 2: Can the part requirement be anticipated? No → order parts before a demand occurs. Yes → Step 3: Can the part be held on offshore platforms? Yes → hold parts on offshore platforms; No → review/revise maintenance strategies.]
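The simulation scheme of Figure 4 and Table 4 is a classical next-event Monte Carlo loop: sample a candidate transition time for every component, advance the clock to the shortest one, apply that transition, and re-evaluate the system capacity. A stripped-down sketch for independent two-state components (rates, names and defaults are illustrative, not taken from the study's code):

```python
import random

def next_event_availability(components, horizon=2_000_000.0, seed=1):
    """components: list of (failure_rate, repair_rate) pairs in 1/hr.
    Simulates independent two-state components with a next-event scheme
    and returns the fraction of time during which all components are up."""
    rng = random.Random(seed)
    n = len(components)
    up = [True] * n

    def dwell(i):                 # exponential holding time in current state
        lam, mu = components[i]
        return rng.expovariate(lam if up[i] else mu)

    next_t = [dwell(i) for i in range(n)]   # next transition time, per component
    t = all_up = 0.0
    while True:
        i = min(range(n), key=lambda k: next_t[k])   # shortest transition time
        t_new = min(next_t[i], horizon)
        if all(up):                                   # evaluate system capacity
            all_up += t_new - t
        t = t_new
        if t >= horizon:
            return all_up / horizon
        up[i] = not up[i]                             # perform the transition
        next_t[i] = t + dwell(i)

# Two illustrative components (rates are ours, not the paper's).
comps = [(1e-4, 1e-2), (2e-4, 2e-2)]
a = next_event_availability(comps)
exact = (1e-2 / 1.01e-2) * (2e-2 / 2.02e-2)   # product of steady availabilities
assert abs(a - exact) < 0.01
```

Because the holding times are exponential, only the transitioning component's clock needs resampling after each event; the others remain valid by memorylessness, which is what makes this scheme efficient.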

5 NUMERICAL RESULTS

To investigate the influence of maintenance strategies on the production availability, some cases have been simulated and compared. A mission time of 30 years has been considered, as typical for offshore installations.

5.1 Without vs. with preventive maintenance

Firstly, the case in which components can fail and are repaired following the rules of Sections 2 and 3 has been considered, but with no preventive maintenance performed. The average availability in this case turns out to be 90.03%.

The value of production availability reduces to 88.6% with preventive maintenance. This reduction is due to the unavailability of the components during the maintenance actions.

5.2 Effects of spare parts delay

Spare parts optimization is one of the most important maintenance issues in offshore production operation. Spare parts optimization shall give answers to the following questions:

– Shall spare parts be stored offshore or not?
– How many spare parts should be held?

Monte Carlo simulation is an effective tool for the optimization and determination of spare parts requirements. An example of a spare holding decision algorithm generally used to determine spare holdings in offshore or marine facilities is illustrated in Figure 5. Monte Carlo simulation can be used to give a quantitative solution to the questions of each step explained in Figure 5. For example, the effect of the stock-out on the offshore installations is easily estimated by comparing production availability results for two cases, without or with delay of spare parts. Also, it is possible to determine the required number of parts which must be held to reduce the consequence of stock-out to an acceptable level.

The results of the effect of delay of spare parts on the production availability of the pilot model are summarized in Table 5. It is assumed that corrective or preventive maintenance tasks are not immediately implemented due to stock-out on offshore installations. This situation, although unrealistic, is simulated to understand the consequences of delay on the production availability.

Table 5. Comparison of production availability results for the delay scenarios.

Delay scenario description                                      (Average) Production availability
No delay for any preventive and corrective maintenance          8.86 × 10⁻¹
Periodical overhaul or replacement (preventive maintenance)
tasks for the gas export compressor system delayed by 7 days    8.36 × 10⁻¹
Corrective maintenance tasks for the power generation
system delayed by 7 days                                        7.78 × 10⁻¹
Corrective maintenance tasks for the dehydration
system delayed by 7 days                                        8.37 × 10⁻¹

Usually, the rotating equipment installed on offshore platforms is expected to have a higher frequency of failures than other equipment.
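The delay scenarios of Table 5 can be mimicked in a simulation by adding a fixed logistic delay to every repair. A hedged single-component sketch, using the power-generation rates of Table 1 and a 7-day delay as in Table 5 (the study's full multi-component model is not reproduced here):

```python
import random

def availability_with_delay(lam, mu, delay_hr, horizon=2_000_000.0, seed=2):
    """Two-state component whose every repair is preceded by a fixed
    spare-part delivery delay of delay_hr (0 = part held on board)."""
    rng = random.Random(seed)
    t = up_time = 0.0
    while t < horizon:
        ttf = rng.expovariate(lam)                  # time to failure
        up_time += min(ttf, horizon - t)
        t += ttf
        if t < horizon:
            t += delay_hr + rng.expovariate(mu)     # stock-out delay + repair
    return up_time / horizon

# Power generation unit rates from Table 1; 7-day delay mirrors Table 5.
lam, mu = 1.70e-3, 3.24e-2
a0 = availability_with_delay(lam, mu, 0.0)
a7 = availability_with_delay(lam, mu, 7 * 24.0)
assert a7 < a0 < 1.0    # delayed corrective maintenance degrades availability
```

Comparing the two runs gives exactly the kind of with/without-delay contrast that Table 5 reports at system level.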

And the failure effect of the compressor and power generation systems on production is classified as significant. Considering frequencies and consequences together, the effect of stock-out for such systems on production availability should be estimated with priority during the determination of maintenance strategies.

6 CONCLUSIONS

A Monte Carlo simulation model for the evaluation of the production availability of offshore facilities has been developed.

A prototypical offshore production process is taken as the pilot model for the production availability assessment. A tailor-made computer program has been developed for the study, which enables accounting for the operational transitions of the system components and the preventive and corrective maintenance strategies.

The feasibility of applying Monte Carlo simulation to investigate the effect of maintenance strategies on the production availability of offshore installations has been verified.

As a future study, it is of interest to formalize the preventive maintenance interval optimization and the spare parts optimization process with the Monte Carlo simulation. To this aim, it will be necessary to combine the results of the availability assessment based on the Monte Carlo simulation with the cost information. The optimization of preventive maintenance intervals should be determined by an iterative process in which the overall availability acceptance criteria and costs fall within the optimal region; the spare parts optimization will consider the cost of holding different numbers of spare parts and that of not holding any.

REFERENCES

ABS. 2004. Guidance notes on reliability-centered maintenance. ABS.
Marseguerra, M. & Zio, E. 2002. Basics of the Monte Carlo method with application to system reliability. Hagen, Germany: LiLoLe-Verlag GmbH.
NORSOK Standard Z-016. 1998. Regularity management & reliability technology. Oslo, Norway: Norwegian Technology Standards Institution.
Zio, E., Baraldi, P. & Patelli, E. 2006. Assessment of availability of an offshore installation by Monte Carlo simulation. International Journal of Pressure Vessels and Piping, 83: 312–320.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Reliability analysis of discrete multi-state systems by means of subset simulation

E. Zio & N. Pedroni


Department of Energy, Polytechnic of Milan, Milan, Italy

ABSTRACT: In this paper, the recently developed Subset Simulation method is considered for improving the
efficiency of Monte Carlo simulation. The method, originally developed to solve structural reliability problems,
is founded on the idea that a small failure probability can be expressed as a product of larger conditional failure
probabilities for some intermediate failure events: with a proper choice of the conditional events, the conditional
failure probabilities can be made sufficiently large to allow accurate estimation with a small number of samples.
The method is here applied to a system of discrete multi-state components in a series-parallel configuration.

1 INTRODUCTION

Let us consider a system made of n components which are subject to stochastic transitions between different states and to uncertain loading conditions and mechanisms of deterioration. The resulting stochastic system life process can be adequately modelled within a probabilistic approach (Schueller 2007). Within this framework, the probability of system failure is expressed as a multi-dimensional integral of the form

P(F) = P(x ∈ F) = ∫ I_F(x) q(x) dx    (1)

where x = {x1, x2, …, xj, …, xn} ∈ ℝⁿ is the vector of the random states of the components, i.e. the random configuration of the system, with multidimensional probability density function (PDF) q : ℝⁿ → [0, ∞), F ⊂ ℝⁿ is the failure region and I_F : ℝⁿ → {0, 1} is an indicator function such that I_F(x) = 1 if x ∈ F and I_F(x) = 0 otherwise.

In practical cases, the multi-dimensional integral (1) cannot be easily evaluated by analytical methods nor by numerical schemes. On the other hand, Monte Carlo Simulation (MCS) offers an effective means for estimating the integral, because the method does not suffer from the complexity and dimension of the domain of integration, albeit it implies the nontrivial task of sampling from the multidimensional PDF. Indeed, the MCS solution to (1) entails that a large number of samples of the values of the component states vector be drawn from q(·); an unbiased and consistent estimate of the failure probability is then simply computed as the fraction of the number of samples that lead to failure. However, a large number of samples (inversely proportional to the failure probability) is necessary to achieve an acceptable estimation accuracy: in terms of the integral in (1), this can be seen as due to the high dimensionality n of the problem and the large dimension of the relative sample space compared to the failure region of interest (Schueller 2007).

To overcome the rare-event problem, an efficient approach is offered by Subset Simulation (SS), originally developed to tackle the multidimensional problems of structural reliability (Au & Beck 2001). In this approach, the failure probability is expressed as a product of conditional failure probabilities of some chosen intermediate failure events, whose evaluation is obtained by simulation of more frequent events. The problem of evaluating small failure probabilities in the original probability space is thus replaced by a sequence of simulations of more frequent events in the conditional probability spaces. The necessary conditional samples are generated through successive Markov Chain Monte Carlo (MCMC) simulations (Metropolis et al. 1953), gradually populating the intermediate conditional failure regions until the final target failure region is reached.

In this paper, SS is used for evaluating the reliability of a discrete multi-state system of literature made of n components in a series-parallel logic (Zio & Podofillini 2003). The benefits gained by the use of SS are demonstrated by comparison with respect to a standard MCS; finally, an analysis of the bias associated to the estimates provided by SS is also provided.

The remainder of the paper is organized as follows. In Section 2, a detailed description of the SS procedure is provided. In Section 3, the application to the series-parallel, discrete multi-state system is illustrated. Finally, some conclusions are proposed in the last Section.

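The plain MCS estimator recalled in the Introduction is just the empirical fraction of sampled configurations that fall in F, and its accuracy degrades as P(F) shrinks. A minimal sketch with an illustrative one-dimensional failure region (the function names and the toy distribution are ours):

```python
import random

def mcs_failure_probability(sample_x, in_failure_region, N, seed=0):
    """Unbiased MCS estimate of P(F) = E[I_F(x)]: the fraction of N
    i.i.d. samples drawn from q(.) that fall in the failure region F."""
    rng = random.Random(seed)
    hits = sum(in_failure_region(sample_x(rng)) for _ in range(N))
    return hits / N

# Illustrative toy: x ~ Uniform(0, 1) and F = {x < 0.05}, so P(F) = 0.05.
p_hat = mcs_failure_probability(lambda r: r.random(),
                                lambda x: x < 0.05, 100_000)
assert abs(p_hat - 0.05) < 0.005
```

For a target probability of 10⁻⁵, the same estimator would need on the order of millions of samples for a comparable relative error, which is precisely the rare-event problem that SS addresses.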
2 SUBSET SIMULATION

2.1 Basics of the method

For a given target failure event F of interest, let F1 ⊃ F2 ⊃ … ⊃ Fm be a sequence of intermediate failure events, so that Fk = ∩_{i=1..k} Fi, k = 1, 2, …, m. By sequentially conditioning on the events Fi, the failure probability P(F) can be written as

P(F) = P(Fm) = P(F1) ∏_{i=1..m−1} P(Fi+1|Fi)    (2)

Notice that even if P(F) is small, the conditional probabilities involved in (2) can be made sufficiently large by appropriately choosing m and the intermediate failure events {Fi, i = 1, 2, …, m − 1}.

The original idea of SS is to estimate the failure probability P(F) by estimating P(F1) and {P(Fi+1|Fi) : i = 1, 2, …, m − 1}. Considering for example P(F) ≈ 10⁻⁵ and choosing m = 5 intermediate failure events such that P(F1) and {P(Fi+1|Fi) : i = 1, 2, 3, 4} ≈ 0.1, the conditional probabilities can be evaluated efficiently by simulation of the relatively frequent failure events (Au & Beck 2001).

Standard MCS can be used to estimate P(F1). On the contrary, computing the conditional failure probabilities in (2) by MCS entails the non-trivial task of sampling from the conditional distributions of x given that it lies in Fi, i = 1, 2, …, m − 1, i.e. from q(x|Fi) = q(x) I_Fi(x)/P(Fi). In this regard, Markov Chain Monte Carlo (MCMC) simulation provides a powerful method for generating samples conditional on the failure region Fi, i = 1, 2, …, m − 1 (Au & Beck 2001). The related algorithm is presented in the next Section 2.2.

2.2 Markov Chain Monte Carlo (MCMC) simulation

Markov Chain Monte Carlo (MCMC) simulation comprises a number of powerful simulation techniques for generating samples according to any given probability distribution (Metropolis et al. 1953).

In the context of the reliability assessment of interest in the present work, MCMC simulation provides an efficient way for generating samples from the multidimensional conditional PDF q(x|F). The distribution of the samples thereby generated tends to the multidimensional conditional PDF q(x|F) as the length of the Markov chain increases. In the particular case of the initial sample x¹ being distributed exactly as the multidimensional conditional PDF q(x|F), so are the subsequent samples, and the Markov chain is always stationary (Au & Beck 2001).

Furthermore, since in practical applications dependent random variables may often be generated by some transformation of independent random variables, in the following it is assumed without loss of generality that the components of x are independent, that is, q(x) = ∏_{j=1..n} qj(xj), where qj(xj) denotes the one-dimensional PDF of xj, j = 1, 2, …, n (Au & Beck 2001).

To illustrate the MCMC simulation algorithm with reference to a generic failure region Fi, let x^u = {x1^u, x2^u, …, xj^u, …, xn^u} be the u-th Markov chain sample drawn and let p*j(ξj|xj^u), j = 1, 2, …, n, be a one-dimensional 'proposal PDF' for ξj, centered at the value xj^u and satisfying the symmetry property p*j(ξj|xj^u) = p*j(xj^u|ξj). Such a distribution, arbitrarily chosen for each element xj of x, allows generating a 'precandidate value' ξj based on the current sample value xj^u. The following algorithm is then applied to generate the next Markov chain sample x^{u+1} = {x1^{u+1}, x2^{u+1}, …, xj^{u+1}, …, xn^{u+1}}, u = 1, 2, …, Ns − 1 (Au & Beck 2001):

1. Generate a candidate sample x̃^{u+1} = {x̃1^{u+1}, x̃2^{u+1}, …, x̃j^{u+1}, …, x̃n^{u+1}}: for each parameter xj, j = 1, 2, …, n, sample a precandidate value ξj^{u+1} from p*j(·|xj^u); compute the acceptance ratio rj^{u+1} = qj(ξj^{u+1})/qj(xj^u); set x̃j^{u+1} = ξj^{u+1} with probability min(1, rj^{u+1}) and x̃j^{u+1} = xj^u with probability 1 − min(1, rj^{u+1}).
2. Accept/reject the candidate sample vector x̃^{u+1}: if x̃^{u+1} = x^u (i.e., no precandidate values have been accepted), set x^{u+1} = x^u. Otherwise, check whether x̃^{u+1} is a system failure configuration, i.e. x̃^{u+1} ∈ Fi: if it is, then accept the candidate x̃^{u+1} as the next state, i.e., set x^{u+1} = x̃^{u+1}; otherwise, reject the candidate x̃^{u+1} and take the current sample as the next one, i.e., set x^{u+1} = x^u.

The proposal PDFs {p*j : j = 1, 2, …, n} affect the deviation of the candidate sample from the current one, thus controlling the efficiency of the Markov chain samples in populating the failure region. In particular, the spreads of the proposal PDFs affect the size of the region covered by the Markov chain samples. Small spreads tend to increase the correlation between successive samples due to their proximity to the conditioning central value, thus slowing down the convergence of the failure probability estimators. Indeed, it can be shown that the coefficient of variation (c.o.v.) of the failure probability estimates, defined as the ratio of the standard deviation to the mean of the estimate, increases as the correlation between the successive Markov chain samples increases.
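Steps 1 and 2 above can be sketched for independent standard-normal components with a uniform symmetric proposal; the intermediate region, seed and proposal spread below are illustrative choices, not taken from the paper:

```python
import math, random

def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def mmh_step(x, in_Fi, rng, spread=1.0):
    """One component-wise (modified) Metropolis transition conditional on
    F_i, for i.i.d. standard-normal components and a uniform proposal."""
    cand = []
    for xj in x:                                  # step 1: per-component move
        xi = xj + rng.uniform(-spread, spread)    # symmetric precandidate
        r = normal_pdf(xi) / normal_pdf(xj)       # acceptance ratio q(xi)/q(xj)
        cand.append(xi if rng.random() < min(1.0, r) else xj)
    if cand == x:                                 # no precandidate accepted
        return list(x)
    return cand if in_Fi(cand) else list(x)       # step 2: keep only if in F_i

# Illustrative run: 2 components, intermediate region F_i = {x1 + x2 > 2}.
rng = random.Random(3)
in_Fi = lambda v: v[0] + v[1] > 2.0
chain = [[1.5, 1.0]]                              # a seed already inside F_i
for _ in range(1000):
    chain.append(mmh_step(chain[-1], in_Fi, rng))
assert all(v[0] + v[1] > 2.0 for v in chain)      # the chain never leaves F_i
```

The two-stage accept/reject is what keeps the acceptance rate workable in high dimension: each coordinate is filtered against its own one-dimensional density before the whole vector is tested against F_i.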

710
the ratio of the standard deviation to the mean of the
estimate, increases as the correlation between the suc-
cessive Markov chain samples increases. On the other
hand, excessively large spreads may reduce the accep-
tance rate, increasing the number of repeated Markov
chain samples, still slowing down convergence (Au &
Beck 2003).

2.3 The subset simulation algorithm


In the actual SS implementation, with no loss of gener-
ality it is assumed that the failure event of interest can
be defined in terms of the value of a critical response
variable Y of the system under analysis (e.g., its output
performance) being lower than a specified threshold
level y, i.e., F = {Y < y}. The sequence of interme-
diate failure events {Fi : i = 1, 2, . . . , m} can then
be correspondingly defined as Fi = {Y < yi }, i =
1, 2, . . . , m, where y1 > y2 > . . . > yi > . . . > ym =
y > 0 is a decreasing sequence of intermediate thresh-
old values (Au & Beck 2001). Notice that since these
intermediate threshold values (i.e., failure regions) are
introduced purely for computational reasons in SS,
they may not have a strict physical interpretation and
may not be connected to known degradation processes.
The choice of the sequence {yi : i = 1, 2, . . . , m}
affects the values of the conditional probabilities
{P(Fi+1 |Fi ) : i = 1, 2, . . . , m−1} in (2) and hence the
efficiency of the SS procedure. In particular, choosing
the sequence {yi : i = 1, 2, . . . , m} arbitrarily a priori
makes it difficult to control the values of the condi-
tional probabilities {P(Fi+1 |Fi ) : i = 1, 2, . . . , m − 1}
in the application to real systems. For this reason, in
this work, the intermediate threshold values are chosen
adaptively in such a way that the estimated conditional
failure probabilities are equal to a fixed value p0 (Au
& Beck 2001).
The SS algorithm proceeds as follows (Figure 1).
First, N vectors {x k0 : k = 1, 2, . . ., N } are sampled by
standard MCS, i.e., from the original probability den-
sity function q(·). The subscript ‘0’ denotes the fact Figure 1. Illustration of the SS procedure: a) Conditional
that these samples correspond to ‘Conditional Level level 0: Standard Monte Carlo simulation; b) Conditional
0’. The corresponding values of the response vari- level 0: adaptive selection of y1 ; c) Conditional level 1:
able {Y (x k0 ) : k = 1, 2, . . ., N } are then computed Markov Chain Monte Carlo simulation; d) Conditional level
(Figure 1a) and the first intermediate threshold value 1: adaptive selection of y2 .
y1 is chosen as the (1 − p0 )N th value in the decreas-
ing list of values {Y (x k0 ) : k = 1, 2, . . ., N }. By so
doing, the sample estimate of P(F1 ) = P(Y < y1 ) is conditional samples {x k1 : k = 1, 2, . . ., N } ∈ F1 , at
equal to p0 (note that it has been implicitly assumed ‘Conditional level 1’ (Figure 1c). Then, the intermedi-
that p0 N is an integer value) (Figure 1b). With this ate threshold value y2 is chosen as the (1−p0 )N th value
choice of y1 , there are now p0 N samples among in the descending list of {Y (x k1 ) : k = 1, 2, . . ., N }
{x k0 : k = 1, 2, . . ., N } whose response Y lies in to define F2 = {Y < y2 } so that, again, the sam-
F1 = {Y < y1 }. These samples are at ‘Conditional ple estimate of P(F2 |F1 ) = P(Y < y2 |Y < y1 ) is
level 1’ and distributed as q(·|F1 ). Starting from each equal to p0 (Figure 1d). The p0 N samples lying in
one of these samples, MCMC simulation is used to F2 are conditional values from q(·|F2 ) and function
generate (1 − p0 )N additional conditional samples as ‘seeds’ for sampling (1 − p0 )N additional con-
distributed as q(·|F1 ), so that there are a total of N ditional samples distributed as q(·|F2 ), making up a

This procedure is repeated for the remaining conditional levels until the samples at 'Conditional level (m − 1)' are generated to yield ym < y as the (1 − p0)N-th value in the descending list of {Y(x_{m−1}^k) : k = 1, 2, …, N}, so that there are enough samples for estimating P(Y < y) (Au et al. 2007).

3 APPLICATION TO A SERIES-PARALLEL DISCRETE MULTI-STATE SYSTEM

In this Section, SS is applied for performing the reliability analysis of a series-parallel discrete multi-state system of literature (Zio & Podofillini 2003).

Let us consider a system made up of a series of η = 2 macro-components (nodes), each one performing a given function, e.g. the transmission of a given amount of gas, water or oil flow. Node 1 is constituted by n1 = 2 components in parallel logic, whereas node 2 is constituted by a single component (n2 = 1), so that the overall number of components in the system is n = Σ_{b=1..2} nb = 3.

For each component j = 1, 2, 3 there are zj possible states, each one corresponding to a different hypothetical level of performance, vj,o, o = 0, 1, …, zj − 1. Each component can randomly occupy the discrete states, according to properly defined probabilities qj,o, j = 1, 2, 3, o = 0, 1, …, zj − 1.

In all generality, the output performance Wo associated to the system state o = {o1, o2, …, oj, …, on} is obtained on the basis of the performances vj,o of the components j = 1, 2, …, n constituting the system. More precisely, we assume that the performance of each node b constituted by nb elements in parallel logic is the sum of the individual performances of the components, and that the performance of the node series system is that of the node with the lowest performance, which constitutes the 'bottleneck' of the system (Levitin & Lisnianski 1999).

The system is assumed to fail when its performance W falls below some specified threshold value w, so that its probability of failure P(F) can be expressed as P(W < w). During the simulation, the intermediate failure events {Fi : i = 1, 2, …, m} are adaptively generated as Fi = {W < wi}, where w1 > w2 > … > wi > … > wm = w are the intermediate threshold values (see Section 2.3).

3.1 Case 1: 11 discrete states for each component

For each component j = 1, 2, 3 there are zj = 11 possible states, each one corresponding to a different hypothetical level of performance vj,o, o = 0, 1, …, 10; thus, the number of available system states is 11³ = 1331. The probabilities qj,o associated to the performances vj,o, j = 1, 2, 3, o = 0, 1, …, 10, are not reported for brevity; however, for clarity's sake, the synthetic parameters of the performance distributions (i.e., the mean vj and the standard deviation σvj, j = 1, 2, 3) are summarized in Table 1. Finally, it is worth noting that the probability of the system having performance W equal to 0, i.e. being in state o* = {0, 0, 0}, is 1.364 · 10⁻³ (this value has been analytically obtained by calculating the exact probabilities of all the 1331 available system states).

3.2 Case 2: 21 discrete states for each component

For each component j = 1, 2, 3 there are now zj = 21 possible states, each one corresponding to a different hypothetical level of performance vj,o, o = 0, 1, …, 20; thus, the number of available system states is now 21³ = 9261. For clarity's sake, the synthetic parameters of the performance distributions (i.e., the mean vj and the standard deviation σvj, j = 1, 2, 3) are summarized in Table 2. Finally, in this case, the probability of the system having performance W equal to 0, i.e. being in state o* = {0, 0, 0}, is 1.671 · 10⁻⁴.

3.3 Subset simulation parameters

In the application of SS to both Case 1 and Case 2, the conditional failure regions are chosen such that a conditional failure probability of p0 = 0.1 is attained at all conditional levels.

In Case 1, the simulations are carried out for m = 3 conditional levels, thus covering the estimation of failure probabilities as small as 10⁻³.

Table 1. Parameters of the probability distributions of the components' performances for Case 1.

               Performance distributions' parameters
Component, j   Mean     Standard deviation
1              56.48    25.17
2              58.97    23.11
3              92.24    11.15

Table 2. Parameters of the probability distributions of the components' performances for Case 2.

               Performance distributions' parameters
Component, j   Mean     Standard deviation
1              58.17    24.35
2              60.66    22.32
3              93.55    10.02

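The structure function just described (performances add within a parallel node; the series system is limited by its bottleneck node) can be checked by direct enumeration, as the paper does analytically for P(W = 0). The three-level performance values and probabilities below are illustrative stand-ins, since the paper does not list its qj,o tables:

```python
from itertools import product

def system_performance(v1, v2, v3):
    """Node 1 = components 1 and 2 in parallel (performances add);
    node 2 = component 3 alone; the series of nodes is limited by the
    bottleneck, i.e. the minimum of the node performances."""
    return min(v1 + v2, v3)

# Illustrative 3-state components with a common marginal PMF.
levels = [0, 50, 100]
q = [0.05, 0.25, 0.70]

# Exact P(W = 0) by enumerating all 3^3 = 27 system states.
p_zero = sum(q[i] * q[j] * q[k]
             for i, j, k in product(range(3), repeat=3)
             if system_performance(levels[i], levels[j], levels[k]) == 0)

# W = 0 iff (v1 = 0 and v2 = 0) or v3 = 0:
expected = 0.05 + 0.95 * (0.05 * 0.05)
assert abs(p_zero - expected) < 1e-12
```

For the paper's Cases 1 and 2 the same enumeration runs over 11³ = 1331 and 21³ = 9261 states respectively, which is exactly the exhaustive computation that SS is meant to avoid for larger systems.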
At each conditional level, N = 300 samples are generated. The total number of samples is thus N_T = 300 + 270 + 270 = 840, because p_0·N = 30 conditional samples from one conditional level are used to start the next conditional level and generate the missing (1 − p_0)·N = 270 samples at that level.

The failure probability estimates corresponding to the intermediate thresholds {w_i : i = 1, 2, 3}, i.e. 10^−1, 10^−2 and 10^−3, are computed using a total number of samples equal to N_T = 300, 570 and 840, respectively. It is worth noting that the number of samples employed for estimating the failure probabilities of the system is about 2 times lower than the total number of available system states, i.e. 1331 (Section 3.1); thus, the computational time required for estimating the failure probabilities by SS is substantially lower than that necessary for analytically computing them (i.e., for calculating the exact probabilities of all the 1331 system states).

Differently, in Case 2 the simulations are carried out for m = 4 conditional levels, thus covering the estimation of failure probabilities as small as 10^−4. Also in this case, at each conditional level N = 300 samples are generated, such that the total number of samples is now N_T = 1110. The failure probability estimates corresponding to the intermediate thresholds {w_i : i = 1, 2, 3, 4}, i.e. 10^−1, 10^−2, 10^−3 and 10^−4, are computed using a total number of samples equal to N_T = 300, 570, 840 and 1110, respectively. Notice that the number of SS samples used for estimating the failure probabilities of the system is about 9 times lower than the total number of available system states, i.e. 9261 (Section 3.2).

In both cases, for each component performance v_{j,o}, j = 1, 2, 3, o = 0, 1, ..., z_j − 1, the one-dimensional discrete 'proposal PDF' p*_{j,o}(ξ_{j,o} | ν_{j,o}), adopted to generate by MCMC simulation the random 'pre-candidate value' ξ_{j,o} based on the current sample component ν_{j,o} (Section 2.2), is chosen as a symmetric uniform distribution, that is, p*_{j,o}(ξ_{j,o} | ν_{j,o}) = 1/(2l_j + 1) if |o′ − o| ≤ l_j, and p*_{j,o}(ξ_{j,o} | ν_{j,o}) = 0 otherwise, with j = 1, 2, 3 and o′ = o − l_j, o − l_j + 1, ..., o + l_j − 1, o + l_j. Notice that l_j is the maximum allowable number of discrete steps that the next sample can depart from the current one. In both cases, the choice l_1 = 2, l_2 = 2 and l_3 = 1 empirically turned out to offer the best trade-off between estimation accuracy and relatively low correlation among successive conditional failure samples.

3.4 Discussion of the results

In this Section, the results of the application of SS to the performance analysis of the system described in Section 3 are illustrated with reference to both Case 1 and Case 2.

3.4.1 Failure probability estimation

3.4.1.1 Comparison with standard Monte Carlo Simulation (MCS)
Figure 2 shows the failure probability estimates for different threshold levels w, obtained in a single simulation run, for both Case 1 (top) and Case 2 (bottom). The results produced by SS with a total of 840 samples (i.e., three simulation levels, each with N = 300 samples) and 1110 samples (i.e., four simulation levels, each with N = 300 samples) are shown as solid lines. Note that a single SS run yields failure probability estimates for all threshold levels w up to the smallest one considered (i.e. 10^−3 and 10^−4 for Cases 1 and 2, respectively). For comparison, the analytical failure probabilities (dashed lines) and the results using standard MCS with 840 and 1110 samples (dot-dashed lines) are shown in the same Figures for Cases 1 (top) and 2 (bottom), respectively.

[Figure 2: two panels plotting the failure probability P(F) versus the failure threshold w, with analytical, SS and MCS curves for N_T = 840 (top) and N_T = 1110 (bottom).]
Figure 2. Analytical failure probabilities (dashed lines) and their corresponding estimates obtained by SS (solid lines) and standard MCS (dot-dashed lines) in a single simulation run, for Case 1 (top) and Case 2 (bottom).
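The procedure described above (adaptive intermediate thresholds at p_0 = 0.1, N samples per level, and a symmetric uniform Metropolis proposal) can be illustrated on a toy continuous problem. The sketch below is not the paper's discrete multi-state model: it estimates the tail probability P(X ≥ 3) of a standard normal variable, for which the exact value is about 1.35·10^−3; the function name and all parameter values are illustrative.

```python
import numpy as np

def subset_sim_normal_tail(t_fail=3.0, n=300, p0=0.1, spread=1.0, seed=42):
    """Estimate P(X >= t_fail) for X ~ N(0,1) by subset simulation."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)          # level 0: plain Monte Carlo
    p = 1.0
    for _ in range(20):                 # cap on the number of conditional levels
        order = np.argsort(x)[::-1]     # sort samples by decreasing performance
        n_seed = int(round(p0 * n))
        thresh = x[order[n_seed - 1]]   # adaptive intermediate threshold
        if thresh >= t_fail:            # target region reached at this level
            return p * float(np.mean(x >= t_fail))
        p *= p0                         # accumulate the conditional probabilities
        seeds = x[order[:n_seed]]       # conditional samples seed the next level
        steps = n // n_seed - 1         # chain length that restores n samples
        new = []
        for cur in seeds:
            new.append(cur)
            for _ in range(steps):
                # symmetric uniform proposal, as in the MCMC step above
                cand = cur + rng.uniform(-spread, spread)
                # Metropolis acceptance for the N(0,1) target, then rejection
                # of candidates that leave the current conditional region
                if rng.random() < np.exp(0.5 * (cur**2 - cand**2)) and cand >= thresh:
                    cur = cand
                new.append(cur)
        x = np.array(new)
    return p * float(np.mean(x >= t_fail))

est = subset_sim_normal_tail()
```

A single run returns an estimate of the order of 10^−3 from only a few hundred samples per level, whereas standard MCS would need far more samples to see any failure at all.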
Table 3. Mean relative absolute errors δ̄[P(F)] made by both SS and standard MCS with 840 samples in the estimation of the failure probability P(F) = 1.364·10^−3 (Case 1); these values have been computed for three batches of S = 200 simulations each.

            Mean relative absolute errors
            SS            Standard MCS
P(F)        1.364·10^−3   1.364·10^−3
Batch 1     0.4327        0.7265
Batch 2     0.4611        0.7530
Batch 3     0.4821        0.6656

Table 4. Mean relative absolute errors δ̄[P(F)] made by both SS and standard MCS with 1110 samples in the estimation of the failure probabilities P(F) = 1.942·10^−3 and P(F) = 1.671·10^−4 (Case 2); these values have been computed for three batches of S = 200 simulations each.

            Mean relative absolute errors
            SS                          Standard MCS
P(F)        1.94·10^−3    1.67·10^−4    1.94·10^−3    1.67·10^−4
Batch 1     0.4181        0.6409        0.6983        1.6670
Batch 2     0.4425        0.5611        0.7793        1.8915
Batch 3     0.4960        0.6826        0.6112        1.6190

In order to properly represent the randomness of the SS and MCS procedures and provide a statistically meaningful comparison between the performances of SS and standard MCS in the estimation of a given failure probability P(F) of interest, S = 200 independent runs of each method have been carried out. In each simulation s = 1, 2, ..., S the relative absolute error δ_s[P(F)] between the exact (i.e., analytically computed) value of the failure probability P(F) and the corresponding estimate P̃_s(F) (obtained by SS or standard MCS) is computed as |P(F) − P̃_s(F)|/P(F). The performances of SS and standard MCS in the estimation of P(F) are then compared in terms of the mean relative absolute error δ̄[P(F)] over the S = 200 runs, computed as δ̄[P(F)] = (1/S)·Σ_{s=1}^{S} δ_s[P(F)]. This quantity gives an idea of the relative absolute error made on average by the simulation method in the estimation of a given failure probability P(F) of interest in a single run.

Table 3 reports the values of the mean relative absolute errors δ̄[P(F)] made by both SS and standard MCS with 840 samples in the estimation of the failure probability P(F) = 1.364·10^−3 in Case 1 (Section 3.1): this value has been chosen as target because it corresponds to the probability of the system having performance W equal to 0, i.e. being in state o* = (0, 0, 0), which is the most critical for the system and also the least likely one.

Table 4 presents the values of the mean relative absolute errors made by both SS and standard MCS with 1110 samples in the estimation of the failure probabilities P(F) = 1.942·10^−3 and P(F) = 1.671·10^−4 in Case 2 (Section 3.2). Only for illustration purposes, the results obtained in three batches of S = 200 simulations each are reported for both Cases 1 and 2.

It can be seen from these Tables that in all cases the mean relative absolute errors made by SS are significantly (2–3 times) lower than those provided by standard MCS using the same number of samples. In particular, it is evident from Table 4 that as the target probability of failure gets smaller, SS becomes more and more efficient than standard MCS: for example, in the estimation of P(F) = 1.671·10^−4 SS provides mean relative absolute errors which are even four times lower than those produced by standard MCS (for instance, see Batch 2 of Table 4). This result is quite reasonable: in fact, the estimation of failure probabilities near 10^−4 by means of standard MCS with 1110 samples is not efficient, since on average only 1110·10^−4 ≈ 0.1 failure samples are available in the failure region of interest. In contrast, due to successive conditioning, SS guarantees that there are 840, 570, 300 and 30 conditional failure samples at probability levels P(F) = 10^−1, 10^−2, 10^−3 and 10^−4, thus providing sufficient information for efficiently estimating the corresponding failure probabilities.

Finally, the computational efficiency of SS can be compared with that of standard MCS in terms of the coefficient of variation (c.o.v.) of the failure probability estimates computed from the same number of samples.

The sample c.o.v. of the failure probability estimates obtained by SS in S = 200 independent runs are plotted versus different failure probability levels P(F) (solid line) in Figure 3, for both Case 1 (top) and Case 2 (bottom). Recall that the numbers of samples required by SS at the probability levels P(F) = 10^−1, 10^−2, 10^−3 and 10^−4 are N_T = 300, 570, 840 and 1110, respectively, as explained in Section 3.3. The exact c.o.v. of the Monte Carlo estimator using the same number of samples at probability levels P(F) = 10^−1, 10^−2, 10^−3 and 10^−4 is computed as sqrt[(1 − P(F))/(P(F)·N_T)], which holds for N_T independent and identically distributed (i.i.d.) samples: the results are shown as squares in Figure 3, for Case 1 (top) and Case 2 (bottom). It can be seen that while the c.o.v. of the standard MCS grows exponentially with decreasing failure probability, the c.o.v. of the SS estimate grows approximately in a logarithmic manner: this empirically proves that SS can lead to a substantial improvement in efficiency over standard MCS when estimating small failure probabilities.
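The two figures of merit used above — the mean relative absolute error over S runs and the exact c.o.v. of the i.i.d. Monte Carlo estimator — are straightforward to reproduce for the standard-MCS column. The snippet below simulates S = 200 binomial MCS runs at the Case 1 target; it is an illustrative check, not the paper's system model.

```python
import numpy as np

P_F, N_T, S = 1.364e-3, 840, 200     # Case 1 target probability and sample budget
rng = np.random.default_rng(1)

# each standard-MCS run counts failures among N_T i.i.d. samples
estimates = rng.binomial(N_T, P_F, size=S) / N_T

# mean relative absolute error over the S runs
delta_bar = np.mean(np.abs(P_F - estimates) / P_F)

# exact c.o.v. of the MCS estimator: sqrt((1 - P(F)) / (P(F) * N_T))
cov_mcs = np.sqrt((1 - P_F) / (P_F * N_T))

print(delta_bar, cov_mcs)
```

With these numbers the exact c.o.v. is about 0.93, and the simulated mean relative absolute error comes out around 0.7, of the same size as the standard-MCS column of Table 3.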
[Figure 3: two panels plotting the coefficient of variation (c.o.v.) versus the failure probability P(F), with curves for SS (average of 200 runs), standard MCS, and the uncorrelated (lower) and fully correlated (upper) limits, annotated at N_T = 300, 570, 840 and 1110.]
Figure 3. Coefficient of variation (c.o.v.) versus different failure probability levels P(F) for Cases 1 (top) and 2 (bottom). Solid line: sample average over 200 SS runs; dashed lines: sample average of the lower bound over 200 SS runs; dot-dashed lines: sample average of the upper bound over 200 SS runs; squares: standard MCS (i.i.d. samples).

[Figure 4: two panels plotting the failure probability P(F) versus the failure threshold w, with the analytical curve and the SS average over 200 runs.]
Figure 4. Analytical failure probabilities (dashed lines) and sample averages of the failure probability estimates over 200 SS runs (solid lines) for Case 1 (top) and Case 2 (bottom).

3.4.1.2 SS estimates: Bias due to the correlation among conditional failure samples
To assess quantitatively the statistical properties of the failure probability estimates produced by SS, the sample means of the failure probability estimates obtained in S = 200 independent runs have been computed. For a given failure probability level P(F) of interest, the sample mean P̄(F) of the corresponding estimates P̃_s(F), s = 1, 2, ..., S, is P̄(F) = (1/S)·Σ_{s=1}^{S} P̃_s(F).

Figure 4 shows the sample means of the failure probability estimates obtained by SS, for both Case 1 (top) and Case 2 (bottom) (solid lines); a comparison with the exact (i.e., analytically computed) failure probabilities is also given (dashed lines).

The sample means of the failure probability estimates almost coincide with the analytical results, except at small failure probabilities, near 10^−3 and 10^−4, where the estimates appear to be quite biased.

A quantitative indicator of the bias associated to the estimate of a given failure probability P(F) can be computed as the relative absolute deviation Δ[P(F)] between the exact value of the failure probability P(F) and the sample average P̄(F) of the corresponding estimates, Δ[P(F)] = |P(F) − P̄(F)|/P(F). Table 5 reports the values of the sample means P̄(F) and the corresponding biases Δ[P(F)] produced by SS in the estimation of P(F) = 1.364·10^−3 in Case 1 (Section 3.1); Table 6 presents the values of the same indicators referred to the estimation of the failure probabilities P(F) = 1.942·10^−3 and P(F) = 1.671·10^−4 in Case 2 (Section 3.2). Only for illustration purposes, the results obtained in three batches of S = 200 simulations each are reported for both Cases 1 and 2.

It is evident from Table 6 that the bias of the estimates significantly increases as the target probability of failure decreases: for instance, in Batch 2 the bias associated to the estimate of P(F) = 1.942·10^−3 is 0.1865, whereas the one related to the estimate of P(F) = 1.671·10^−4 is 0.2928. This leads to conclude
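The bias indicator Δ[P(F)] is simply the relative deviation of the sample mean from the exact value; plugging in the Case 2, Batch 2 values quoted in the text:

```python
# Bias indicator Delta[P(F)] = |P(F) - P_mean| / P(F),
# evaluated with the Case 2, Batch 2 values quoted in the text.
P_exact = 1.671e-4   # analytical failure probability
P_mean = 1.181e-4    # sample mean of the SS estimates over 200 runs
bias = abs(P_exact - P_mean) / P_exact
print(round(bias, 4))  # ~0.293, matching Table 6 up to rounding of the inputs
```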
Table 5. Sample means P̄(F) of the failure probability estimates over 200 SS runs and the corresponding biases Δ[P(F)] produced by SS in the estimation of P(F) = 1.364·10^−3 (Case 1); these values have been computed for three batches of S = 200 simulations each.

            Subset simulation
            P(F) = 1.364·10^−3
            Sample mean     Bias
Batch 1     1.136·10^−3     0.1672
Batch 2     1.145·10^−3     0.1606
Batch 3     1.065·10^−3     0.2192

Table 6. Sample means P̄(F) of the failure probability estimates over 200 SS runs and the corresponding biases Δ[P(F)] produced by SS in the estimation of P(F) = 1.942·10^−3 and P(F) = 1.671·10^−4 (Case 2); these values have been computed for three batches of S = 200 simulations each.

            Subset simulation
            P(F) = 1.942·10^−3          P(F) = 1.671·10^−4
            Sample mean     Bias        Sample mean     Bias
Batch 1     1.714·10^−3     0.1170      1.374·10^−4     0.1769
Batch 2     1.579·10^−3     0.1865      1.181·10^−4     0.2928
Batch 3     1.715·10^−3     0.1164      1.347·10^−4     0.1934

that the bias due to the correlation between the conditional probability estimators at different levels is not negligible (Au & Beck 2001).

This finding is also confirmed by the analysis of the sample c.o.v. of the failure probability estimates, which are plotted versus different failure probability levels P(F) (solid line) in Figure 3, for both Case 1 (top) and Case 2 (bottom). In these Figures, the dashed lines show a lower bound on the c.o.v., which would be obtained if the conditional probability estimates at different simulation levels were uncorrelated; on the contrary, the dot-dashed lines provide an upper bound on the c.o.v., which would be obtained in case of full correlation among the conditional probability estimates. From these Figures, it can be seen that the trend of the actual c.o.v. estimated from 200 runs follows more closely the upper bound, confirming that the conditional failure probability estimates are almost completely correlated in both Case 1 and Case 2. The high correlation between conditional probability estimates may be explained as follows: differently from continuous-state systems, whose stochastic evolution is modeled in terms of an infinite set of continuous states, discrete multi-state systems can only occupy a finite number of states; as a consequence, the generation of repeated (thus, correlated) conditional failure samples during MCMC simulation may be significant.

4 CONCLUSIONS

In this paper, SS has been applied for the reliability assessment of a system of discrete multi-state components connected in a logic structure. An example of a simple series-parallel system from the literature has been taken for reference.

The results of SS have been compared to those of standard Monte Carlo Simulation (MCS) in the estimation of failure probabilities as small as 10^−4. The results have demonstrated that as the target probability of failure gets smaller, SS becomes more and more efficient over standard MCS.

Finally, a word of caution is in order with respect to the fact that the estimates produced by SS when applied to discrete multi-state systems may be quite biased if the number of discrete states is low. This is due to the correlation between the conditional probability estimators at different levels: in fact, differently from continuous-state systems, whose stochastic evolution is modeled in terms of an infinite set of continuous states, discrete multi-state systems can only occupy a finite number of states; as a consequence, the number of repeated (thus, correlated) conditional failure samples generated during MCMC simulation may be high. Further research is underway on attempting to estimate the bias.

REFERENCES

Au, S. K. & Beck, J. L. 2001. Estimation of small failure probabilities in high dimensions by subset simulation. Probabilist. Eng. Mech. 16(4): 263–277.
Au, S. K. & Beck, J. L. 2003. Subset Simulation and its application to seismic risk based on dynamic analysis. J. Eng. Mech.-ASCE 129(8): 1–17.
Au, S. K., Wang, Z. & Lo, S. 2007. Compartment fire analysis by advanced Monte Carlo simulation. Eng. Struct., in press (doi: 10.1016/j.engstruct.2006.11.024).
Levitin, G. & Lisnianski, A. 1999. Importance and sensitivity analysis of multi-state systems using the universal generating function method. Reliab. Eng. Syst. Safe. 65: 271–282.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N. & Teller, A. H. 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6): 1087–1092.
Schueller, G. I. 2007. On the treatment of uncertainties in structural mechanics and analysis. Comput. Struct. 85: 235–243.
Zio, E. & Podofillini, L. 2003. Monte Carlo simulation analysis of the effects of different system performance levels on the importance of multi-state components. Reliab. Eng. Syst. Safe. 82: 63–73.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

The application of Bayesian interpolation in Monte Carlo simulations

M. Rajabalinejad, P.H.A.J.M van Gelder & N. van Erp


Department of Hydraulic Engineering, Faculty of Civil Engineering, TUDelft, The Netherlands

ABSTRACT: To reduce the cost of Monte Carlo (MC) simulations for time-consuming processes (like Finite Elements), a Bayesian interpolation method is coupled with the Monte Carlo technique. It is, therefore, possible to reduce the number of realizations in MC by interpolation. Besides, prior knowledge about the problem can be taken into account. In other words, this study tries to speed up the Monte Carlo process by incorporating the prior knowledge about the problem and reducing the number of simulations. Moreover, the information from previous simulations helps to judge the accuracy of the prediction at every step. As a result, a narrower confidence interval comes with a higher number of simulations. This paper shows the general methodology, the algorithm, and the results of the suggested approach in the form of a numerical example.

1 INTRODUCTION

The so-called Monte Carlo (MC) technique helps engineers to model different phenomena by simulations. However, these simulations are sometimes expensive and time-consuming. This is due to the fact that the more accurate models, usually defined by finite elements (FE), are time-consuming processes themselves. To overcome this problem, cheaper methods are generally used in the simulation of complicated problems and, consequently, less accurate results are obtained. In other words, implementing more accurate models in the Monte Carlo simulation technique provides more accurate and reliable results; by the reduction of the calculation cost to a reasonable norm, more accurate plans for risk management become possible.

To reduce the cost of Monte Carlo simulations for a time-consuming process (like FE), numerous research projects have been carried out, primarily in structural reliability, to get the benefits not only of a probabilistic approach but also of accurate models. For instance, importance sampling and directional sampling are among the approaches implemented to reduce the cost of the calculations. But this coupling is still a time-consuming process for practical purposes and should be further improved. This research tries to speed up the Monte Carlo process by considering the assumption that the information of every point (pixel) can give an estimation of its neighboring pixels. Taking advantage of this property, the Bayesian interpolation technique (Bretthorst 1992) is applied to our requirement of randomness of the generated data. In this study, we present a brief review of the method and its important formulas. The application of the Bayesian interpolation into the MC for the estimation of randomly generated data in the unqualified area is presented by a numerical example.

2 GENERAL OUTLINES

In the interpolation problem, there is a signal U which is to be estimated at a number of discrete points. These discrete points will be called pixels, presented by u_i. These pixels are evenly spaced on a grid of pixels u ≡ (u_0, ..., u_{v+1}). Therefore, there are in total v + 2 pixels. The first and last pixels are called boundary pixels and are treated separately; they are presented by u_0 and u_{v+1}. As a result, v presents the number of interior pixels. The total number of observed data points is equal to n; these are distributed in arbitrary (or random) locations among the pixels. Therefore, the maximum value of n is equal to v + 2, when there is an observed data point for each pixel (n ≤ v + 2). The locations of the observed data points are collected in a vector c, so this vector has n elements, which are presented by c_i, i = 1, 2, ..., n. The vector of observed data points is called d ≡ (d_1, ..., d_n), and its elements are presented by d_i. Figure 1 presents an illustration of the internal and boundary pixels as well as the data points. According to this figure, c ≡ (1, v − 1, v + 2).

3 BAYESIAN INTERPOLATION

The univariate posterior probability density function (PDF) for an arbitrary pixel u_j, given the data d and the prior information I, will be found by integrating out all pixels. In this case the sum rule is applied and the
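The notation of Section 2 maps naturally onto arrays. The sketch below sets up a toy grid with v interior pixels, two boundary pixels and n observations; the sizes, locations and values are arbitrary illustrations, and pixels are 0-indexed so that u[0] and u[v+1] are the boundary pixels.

```python
import numpy as np

v = 8                               # number of interior pixels
u = np.zeros(v + 2)                 # pixel vector u = (u_0, ..., u_{v+1})
c = np.array([1, 4, v])             # locations of the observed data points
d = np.array([0.3, -0.1, 0.7])      # observed values d = (d_1, ..., d_n)
n = len(d)

assert n <= v + 2                   # at most one observation per pixel
u[c] = d                            # data values as best estimates (Section 5)
```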
Figure 1. An illustration of the pixels which data points are assigned to.

product is integrated over the multivariate posterior PDF of all pixels of u except the required pixel u_j:

P(u_j | d, I) = ∫ P(u | d, I) Π_{i≠j} du_i    (1)

Also, according to Bayes' rule we have:

P(u | d, I) = P(d | u, I) P(u | I) / P(d | I)    (2)

where P(d | I) is a normalization constant called the evidence. Therefore, the combination of Equations 1 and 2 produces the following equation:

P(u_j | d, I) ∝ ∫ P(d | u, I) P(u | I) Π_{i≠j} du_i    (3)

This equation shows that to obtain the posterior, we need to define our likelihood function and the prior. The likelihood, or in this case more appropriately the PDF of the data (d) conditional on the pixels (u), is constructed by making the standard assumptions about the noise. Therefore, according to the Bayesian interpolation technique, there are three main steps to be taken into account:

1. All the pixels are connected to each other, so each pixel is defined as a function of its neighboring pixels. This is the prior information, which is formulated in Section 4.
2. For the pixels which take the corresponding data values, the data values are considered the best estimates. This is described in Section 5.
3. Then the outcomes of the previous steps are combined so as to get an estimation of every pixel in the grid, based on the data. In this case, Equation 3 is used and the result is presented in Section 6.

4 THE PRIOR

We expect some logical dependence between neighboring pixels, and this expectation is translated in the following model, f, for an arbitrary pixel u_i:

u_i = f(u_{i−1}, u_{i+1}) = (u_{i−1} + u_{i+1})/2    (4)

Having the model defined, the error e_i is also implicitly defined, by Equation 5:

e_i = u_i − f(u_{i−1}, u_{i+1}) = u_i − (u_{i−1} + u_{i+1})/2    (5)

The only thing we know about this error is that it has a mean of zero (the error is either positive or negative) with some unknown variance φ². Using the principle of Maximum Entropy (Jaynes 2003), we find the well-known Gaussian probability distribution function of e_i presented in Equation 6:

P(e_i | φ) = 1/(√(2π) φ) · exp[−e_i²/(2φ²)]    (6)

Substituting Equation 5 into Equation 6 and making the appropriate change of variable from e_i to u_i, the PDF of the pixel u_i can be obtained by Equation 7:

P(u_i | u_{i−1}, u_{i+1}, φ) = 1/(√(2π) φ) · exp[−(u_i − (u_{i−1} + u_{i+1})/2)²/(2φ²)]    (7)

Assuming that there is no logical dependence between the errors e_1, ..., e_v, the multivariate PDF of all the errors is a product of the univariate PDFs. Then, by making the change of variable from e_i to u_i, we find the following multivariate PDF for the pixels u_1, ..., u_v:

P(u_1, ..., u_v | u_0, u_{v+1}, φ) = 1/((2π)^{v/2} φ^v) · exp[−1/(2φ²) · Σ_{i=1}^{v} (u_i − (u_{i−1} + u_{i+1})/2)²]    (8)

The boundary pixels are treated separately. In fact, these two pixels are assigned to the first and last positions and presented as u_0 = v_1 and u_{v+1} = v_{v+2}. As a result of using the principle of Maximum Entropy, the PDF of the boundary pixel u_0 is obtained in Equation 9, and a similar equation can be established for
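Equation 8 is easy to evaluate directly: the interior errors are half second differences of the pixel vector, so a straight-line signal has zero error and maximal prior density. A minimal sketch (the value of φ is chosen arbitrarily for illustration):

```python
import numpy as np

def log_prior_interior(u, phi):
    """log P(u_1..u_v | u_0, u_{v+1}, phi) of Equation 8."""
    e = u[1:-1] - 0.5 * (u[:-2] + u[2:])   # e_i = u_i - (u_{i-1} + u_{i+1})/2
    v = e.size
    return -0.5 * v * np.log(2 * np.pi) - v * np.log(phi) - e @ e / (2 * phi**2)

u_lin = np.linspace(0.0, 1.0, 7)     # linear signal: all interior errors vanish
print(log_prior_interior(u_lin, phi=0.5))
```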
the pixel uv+1 . We have derived the above equation which provides
the PDF for the pixels u0 , . . . , uu+1 using the assumed
P(u0 |u1 , φ) prior model presented in Equation 4.
  If φ = 0, we get to the conclusion that our model
1 1 (Equation 4) holds exactly. So setting φ = 0 produces
= √ exp − 2 [u0 − u1 ]2 (9)
2π φ 2φ an extremely informative prior which determines the
values of the pixels. On the other hand, if φ → inf
Combining Equations 8 and 7 using Bayes’ Theo- then the prior relaxes to an extremely uninformative
rem, the next equation will be obtained. This equation distribution which lets the values of the pixels totally
is written in a matrix form where u is vector of pixel free. So in a sense φ ‘regulates’ the freedom allowed
positions, to the pixels u0 , . . . , uv+1 .

P(u0 , u1 , . . . , uv+1 |φ)


  5 THE LIKELIHOOD
1 Q
= exp − 2 (10)
(2π )(v+2)/2 φ v+2 2φ Apart from our model and prior, we also have n+2 non-
overlapping data points , n ≤ v . These data points can
where be assigned arbitrarily to any pixel uc where c is an ele-
ment of the vector c described in Section 2. The value
of c corresponds with the location of the observed data
Q = uT Ru
regarding the pixel numbers (see Figure 2). The error
of the model at the location of any observed data point
and is defined as:
⎛ ⎞
1 −1.5 0.5 0 ··· ··· 0 ec = uc − dc (11)
⎜ .. .. ⎟
⎜−1.5 3 −2 0.5 0 . . ⎟
⎜ ⎟
⎜ .. .. ⎟ Assuming that this error has a mean of zero (the
⎜ 0.5 −2 −2 ⎟ error is either positive or negative) with some unknown
⎜ 3 0.5 . . ⎟
⎜ . ⎟ variance σ 2 and using the principle of Maximum
R≡⎜ .. .. .. .. .. ..⎟
⎜ .. . . . . . .⎟ Entropy we find that this error has the following
⎜ ⎟
⎜ .. .. ⎟ probability distribution function:
⎜ .
⎜ . 0.5 −2 3 −2 0.5 ⎟

⎜ . .. .. ⎟  
⎝ .. −1.5⎠
1 1
. . 0.5 −2 3 P(ec |σ ) = √ exp − 2 ec2 (12)
0 ··· ··· 0 0.5 −1.5 1 2πσ 2σ

Figure 2. An illustration of the pixels which data points are assigned to. The ’-’ is a representation of the evaluated values
in the pixels.

719
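The banded matrix R of Equation 10 can be reconstructed from a quadratic form in the second differences plus the two boundary terms. The builder below reproduces the printed entries (1, −1.5, 0.5 in the corners; 3, −2, 0.5 along the interior band) for any v; the factorized construction R = ½·AᵀA is our own illustrative reading of the printed matrix, not taken from the paper.

```python
import numpy as np

def prior_precision_R(v):
    """Build the (v+2)x(v+2) matrix R of Equation 10.

    Rows of A encode the interior second differences and the two
    boundary differences; R = 0.5 * A^T A reproduces the printed band.
    """
    m = v + 2
    A = np.zeros((m, m))
    for i in range(1, v + 1):                     # interior smoothness rows
        A[i - 1, i - 1:i + 2] = [-1.0, 2.0, -1.0]
    A[v, 0], A[v, 1] = 1.0, -1.0                  # boundary row for u_0
    A[v + 1, v + 1], A[v + 1, v] = 1.0, -1.0      # boundary row for u_{v+1}
    return 0.5 * A.T @ A

R = prior_precision_R(5)
print(R[:3, :5])   # rows start [1, -1.5, 0.5, ...], [-1.5, 3, -2, 0.5, ...]
```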
Substituting Equation 11 into Equation 12 and making a change of variable from the error e_c to the data d_c, the likelihood function can be obtained according to Equation 13:

P(d_c | u_c, σ) = 1/(√(2π) σ) · exp[−(d_c − u_c)²/(2σ²)]    (13)

Again, by assuming logical independence between the errors and making the appropriate substitutions and changes of variables, the following likelihood function can be obtained:

P(d_1, ..., d_n | u_0, u_1, ..., u_{v+1}, σ) = 1/((2π)^{n/2} σ^n) · exp[−1/(2σ²) · Σ_{c∈c} (d_c − u_c)²]    (14)

Equation 14 can be rewritten in matrix form as presented in Equation 15:

P(d_1, ..., d_n | u_1, ..., u_n, σ) = 1/((2π)^{n/2} σ^n) · exp[−(d − Su)^T (d − Su)/(2σ²)]    (15)

where d is a padded vector of length v + 2, in which the data points have been coupled with their corresponding pixels, and S is a diagonal matrix with entry 1 for the pixels to which data points are assigned and 0 everywhere else.

For example, the S matrix for the grid in Figure 1, with d = (0, d_1, 0, ..., 0, d_2, 0, d_3), becomes:

S ≡
[ 0   0   0   0  ···  ···  0 ]
[ 0   1   0   0   0   ···  · ]
[ 0   0   0   0   0   ···  · ]
[ ·   ·   ·   ·   ·    ·   · ]
[ ·  ···   0   0   1   0   0 ]
[ ·  ···   ·   0   0   0   0 ]
[ 0  ···  ···  0   0   0   1 ]    (16)

6 THE POSTERIOR

Combining the prior in Equation 10 with the likelihood presented in Equation 15, we get a function which is proportional to the posterior PDF of all the pixels (Sivia 1996). Equation 17 presents the matrix form of this function:

P(u | d, σ, φ) ∝ P(u | φ) P(d | u, σ) = 1/(φ^{v+2} σ^n) · exp[−(d − Su)^T (d − Su)/(2σ²) − u^T R u/(2φ²)]    (17)

Equation 17 is conditional on the unknown parameters φ and σ, but since we do not know these parameters, we will eventually want to integrate them out as 'nuisance' parameters. We first assign Jeffreys' prior to these unknown parameters:

P(φ) = 1/φ,   P(σ) = 1/σ    (18)

Using Bayes' theorem, we can combine the priors in Equation 18 with Equation 17 to get the following equation:

P(u, σ, φ | d) = P(σ) P(φ) P(u | d, σ, φ) ∝ 1/(φ^{v+3} σ^{n+1}) · exp[−(d − Su)^T (d − Su)/(2σ²) − u^T R u/(2φ²)]    (19)

By integrating over all pixels except the target pixel u_j, the probability distribution function of just one pixel (u_j) is given as:

P(u_j | d, σ, φ) = ∫ P(u_0, ..., u_{v+1} | d, σ, φ) Π_{i≠j} du_i
= ∫ 1/(φ^{v+3} σ^{n+1}) · exp[−(d − Su)^T (d − Su)/(2σ²) − u^T R u/(2φ²)] Π_{i≠j} du_i    (20)

For the actual evaluation of Equation 20, we refer the interested reader to (Bretthorst 1992).

7 ALGORITHM

To couple the Bayesian interpolation approach with Monte Carlo techniques, the following algorithm is suggested:

1. Define the interval of variation and the length of the pixels for the variable X. In total, v + 2 pixels are to be defined in this interval.
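For fixed σ and φ, the exponent of Equation 17 is quadratic in u, so the most probable pixel vector solves the linear system (S/σ² + R/φ²)·u = S·d/σ². The sketch below rebuilds R from the printed pattern of Equation 10 and solves this system for a toy grid; the sizes, data locations and parameter values are arbitrary illustrations.

```python
import numpy as np

v, sigma, phi = 8, 0.05, 1.0
m = v + 2

# prior precision R of Equation 10 (rebuilt from its printed band)
A = np.zeros((m, m))
for i in range(1, v + 1):
    A[i - 1, i - 1:i + 2] = [-1.0, 2.0, -1.0]   # interior smoothness rows
A[v, 0], A[v, 1] = 1.0, -1.0                    # boundary rows
A[v + 1, v + 1], A[v + 1, v] = 1.0, -1.0
R = 0.5 * A.T @ A

# padded data vector d and selection matrix S of Equations 15-16
c = [1, 4, 8]                        # pixels carrying observations
d = np.zeros(m)
d[c] = [0.2, 1.0, 0.4]
S = np.zeros((m, m))
S[c, c] = 1.0

# most probable pixel vector: maximize the exponent of Equation 17
u_map = np.linalg.solve(S / sigma**2 + R / phi**2, S @ d / sigma**2)
print(np.round(u_map, 3))
```

With σ much smaller than φ, the solution passes essentially through the observations and interpolates smoothly in between; increasing φ relaxes the smoothness prior, as discussed at the end of Section 4.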
2. A random number is generated according to the PDF of the variable X and, according to its value, it is assigned to a certain location. This location, which is the j-th pixel (as presented in Figure 1), is called u_j.
3. According to the information of the other pixels and our assumed model, the PDF of u_j is calculated by Equation 20.
4. According to the accepted tolerance criteria, it is decided whether there is a need to calculate the limit state equation for the j-th point, or whether the accuracy is sufficient.
5. The calculations are iterated from step 3 and continue until the simulation criteria are met.

8 NUMERICAL EXAMPLE

One of the important research topics in hydraulic engineering focuses on the impact of water waves on walls and other coastal structures, which creates velocities and pressures with magnitudes much larger than those associated with the propagation of ordinary waves under gravity. The impact of a breaking wave can generate pressures of up to 1000 kN/m², which is equal to 100 meters of water head. Although many coastal structures are damaged by breaking waves, very little is known about the mechanism of the impacts. Insight into the wave impacts has been gained by investigating the role of entrained and trapped air in wave impacts. In this case, a simplified model of the maximum pressure of ocean waves on coastal structures is presented by Equation 21:

P_max = C × (ρ × k × u²)/d    (21)

where ρ is the density of water, k is the length of the hypothetical piston, d is the thickness of the air cushion, u is the horizontal velocity of the advancing wave, and C is a constant parameter equal to 2.7 s²/m. Having this knowledge, we want to find the probability of the event in which the maximum impact pressure exceeds 5·10⁵ N/m² for a specific case. The one-dimensional limit state function (LSF) can be defined by Equation 22, where the velocity parameter is assumed to be normally distributed as N(1.5, 0.45):

G(u) = 5 − 0.98280 × u²    (22)

We consider the variation of the variable u in the interval [μ − 3σ, μ + 3σ], where μ is the mean value and σ is the standard deviation of the variable u. This interval is divided into finite pixels with an equal distance of 0.01. As a result, there are in total 270 internal pixels defined in this interval. A schematic view of all the pixels is presented in Figure 2. Pixel 210 is considered as a sample pixel in which we are going to monitor the change of its PDF during the Monte Carlo process. In this figure, the measured (or observed) data points are assigned to the first and last internal pixels.

Before we proceed to the simulation process, we would like to present the probable values of pixel 210, or u_210, with the suggested model. Therefore, we need to use Equation 20 in order to get the required PDF. Nevertheless, this equation contains σ and φ. As a matter of fact, σ can be integrated out of the equation, but we need to estimate a value for φ. In this case, we define ε = 1/φ, which is called the regularizer. Then we can get the PDF of our regularizer to find its optimal value, which leads to the narrowest PDF. The reader who is interested in this process is referred to (Bretthorst 1992). The most probable value of ε is estimated to be 2.6, and we use this value during the rest of this work. As a result, Equation 20 leads to Equation 23 for pixel number 210, given two data points d_1 and d_2:

P(u_{j=210} | d_1, d_2) = 0.3126·10⁹ / (0.5897·10¹⁰ + 0.5265·10¹⁰ u_j + 0.1339·10¹⁰ u_j²)    (23)

The PDF of u_210 given d_1 and d_2 is depicted in Figure 3; this figure is a plot of Equation 23. The mean value of this PDF is −1.97, assuming a symmetrical PDF. Besides, the 95% accuracy, assuming a symmetrical distribution, leads to values in the interval [−11.28, 7.35]. This interval was obtained by solving the equation which states that the integral over a symmetrical area around the mean value should be equal to 0.95. It is a wide PDF, and its tails are much more informative than the Gaussian. In other words, we expect the value of this pixel to vary within this interval, having the prior information about the model and just 2 data points.

It is useful to compare this result with the traditional interpolation problem. In fact, with only two data points there is no other way than to assume a linear relationship, which leads to the value of −1.21 for this pixel, while we do not have any estimation of the uncertainty. Now the distinction between the two methods is obvious: the applied method enables us to get a criterion for the uncertainty of the estimated value of each pixel. This is a huge advantage in the simulation process. This comparison is depicted in Figure 4. In this figure there are two data points, called A and B. These two points are the only information which provides point e, using a linear interpolation for pixel 210, where e = −1.21. This is not close to the real value of the limit state function, g = 0.0246. Nevertheless, there is no information on the certainty of the estimated point e from the interpolation. On the other hand, point f is the mean value of the PDF calculated by the
Bayesian technique (f = −1.97). The uncertainty is
shown by its PDF. Looking at the presented PDF, a
rather wide PDF can be seen, and both positive and
negative values are expected for this pixel.

Figure 3. This figure presents the probability distribution
function (PDF) of u210 given 2 measured data points: d1
and d270.

Figure 4. A comparison between linear interpolation and
the Bayesian technique for the pixel u210, given 2 measured
data points: d1 and d270. The exact value of the function
(Equation 22) is depicted by a dashed line.

From now on we start the Monte Carlo simulation by
generating random numbers. However, before we run
the limit state equation for each random number which
is assigned to a pixel uj, we check whether it is necessary
to run the limit state equation, or whether we can assign
its value within our tolerance. To investigate the changes,
we monitor u210 after 20 realizations of the LSE
(or 20 data points), which are assigned to their locations.
As a result, the calculated PDF of u210 given 20 data
points is obtained and depicted in Figure 5. The mean
value of this PDF is 0.013, and the 95% credible
interval, assuming a symmetrical distribution, is
[−0.16, 0.19]. This shows that by implementing more
data points, we get a more precise PDF.

Figure 5. This figure presents the probability distribution
function (PDF) of u210 given 20 measured data points at
random locations.

The difference between the results of linear and Bayesian
interpolation in this case is due to the value of
the regularizer ε. In this case study its value is set
to ε = 2.60. The effect of epsilon (or φ, which
is inversely related to it) was previously described. In
fact, we have two extreme situations when we consider
the two extreme values of φ, namely 0 and infinity.
In the first case we just stick to our data values, and
in the second case we just consider our model assumption
and discard the other information. Therefore, the
difference between e and f should be related to the
value of the regularizer.

Since we are not satisfied with the accuracy, we
continue to generate more data points. Figure 6 presents
the PDF of u210 with 90 data points measured or
calculated. The mean value of this PDF is 0.025, and
the 95% credible interval, assuming a symmetrical
distribution, is [0.014, 0.035]. This shows that by
implementing more data
points, we get a more precise PDF. Since this interval
is small enough, we can assume that we have reached
sufficient accuracy. Therefore, the simulation effort
has been reduced by 67% for the presented numerical
example.

In fact, the number of simulations in the Monte
Carlo technique depends on several factors. The most
important ones are the tolerance and the distance
between pixels defined for the analysis. In other words,
to get a more precise result we need to implement more
data points. Meanwhile, a higher number of pixels leads
to a higher accuracy.

It is useful to compare the calculated PDFs in
another figure with the same scale. Figure 7 provides
this comparison, in which panel (a) presents the PDF
of the pixel at the beginning of the simulation, when there
are just two data points. Figure 7(b) presents the PDF
of the same pixel, u210, when there are 20 data points
randomly generated and assigned to the related pixels.
Figure 7(c) again presents the PDF of the same
pixel when the information of ninety pixels is
implemented. In this figure, the same axis scale is selected
to clarify the change of the PDF during the simulation
process.
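The numerical example above can be reproduced with a short, self-contained sketch. The pixel coordinates used below (u_j = 0.15 + 0.01·j, so that d1 sits at u = 0.16 and d270 at u = 2.85), the sample size, and the seed are our assumptions inferred from the grid description, not values stated in the text.

```python
import random

def g(u):
    # Limit state function of Equation 22 (pressure in units of 1e5 N/m^2);
    # failure, i.e. Pmax > 5e5 N/m^2, corresponds to g(u) < 0.
    return 5.0 - 0.98280 * u ** 2

mu, sigma = 1.5, 0.45                 # u ~ N(1.5, 0.45)

# Pixel grid: [mu - 3*sigma, mu + 3*sigma] with spacing 0.01 gives 270
# internal pixels; pixel 210 then sits at u = 0.15 + 210 * 0.01 = 2.25.
u210 = (mu - 3 * sigma) + 210 * 0.01
g_true = g(u210)                      # ~0.0246, the "real value" in the text

# Linear interpolation through the two end data points d1 and d270
# (assumed pixel coordinates u1 = 0.16 and u270 = 2.85).
u1, u270 = 0.16, 2.85
e = g(u1) + (u210 - u1) / (u270 - u1) * (g(u270) - g(u1))   # ~ -1.21

# Crude Monte Carlo estimate of the exceedance probability P(g(u) < 0).
random.seed(0)
n = 200_000
pf = sum(g(random.gauss(mu, sigma)) < 0 for _ in range(n)) / n
```

With these assumptions the script recovers g ≈ 0.0246 at pixel 210 and e ≈ −1.21 for the linear estimate, matching the values quoted above; the crude Monte Carlo exceedance probability comes out at a few percent.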

9 DISCUSSION

Bayesian interpolation is a technique which can be
neatly coupled with Monte Carlo simulation. In this
study, an attempt is made to incorporate the prior
information of the model into the current level of the
analysis. This is a step forward in Monte Carlo
simulations. In fact, the matrix R in Equation 10 provides
a link between the information of each pixel and its
neighborhood. In other words, information from each
point passes through this link and affects the others.
Besides, this approach provides a convenient tool for
incorporating other priors into Monte Carlo simulations.
For instance, the limit bounds method (Rajabalinejad
et al. 2007) assumes some other prior information which
can be implemented in this approach.

Nevertheless, the approach presented in this paper
has two main limitations. The first is that we need
to use a grid and divide the interval of variation of
the variable into a finite number of pixels. The second
limitation is that the pixels are evenly spaced. These
conditions impose large matrices for a large interval,
a small step size, or higher dimensions.

Figure 6. This figure presents the probability distribution
function (PDF) of u210 given 90 measured data points at
random locations.
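The grid-based mechanism discussed here can be illustrated with a minimal numpy sketch. This is not the paper's exact formulation: Equation 10 and its R matrix are not reproduced in this excerpt, so a second-difference smoothness operator is used as a stand-in, the observation noise level is invented, and only ε = 2.60 is taken from the text. The resulting numbers will therefore differ from those reported; the point is the mechanism, namely that every pixel gets a posterior mean and standard deviation, and that the uncertainty shrinks as more limit-state evaluations are assigned to pixels.

```python
import numpy as np

def g(u):
    # Limit state function of Equation 22
    return 5.0 - 0.98280 * u ** 2

def bayesian_interpolation(n_pix, data_idx, data_val, eps=2.60, noise=1e-4):
    # Posterior mean/std of all pixel values under a second-difference
    # smoothness prior (an illustrative stand-in for the R of Equation 10).
    R = np.zeros((n_pix - 2, n_pix))
    for i in range(n_pix - 2):
        R[i, i:i + 3] = (1.0, -2.0, 1.0)   # links each pixel to its neighbours
    prior_prec = eps ** 2 * R.T @ R + 1e-9 * np.eye(n_pix)
    H = np.zeros((len(data_idx), n_pix))   # picks out the observed pixels
    H[np.arange(len(data_idx)), data_idx] = 1.0
    post_prec = prior_prec + H.T @ H / noise
    cov = np.linalg.inv(post_prec)
    mean = cov @ (H.T @ np.asarray(data_val) / noise)
    return mean, np.sqrt(np.diag(cov))

u = 0.15 + 0.01 * np.arange(271)           # assumed pixel coordinates

# Two data points (d1 and d270) versus twenty spread over the grid.
idx2 = np.array([1, 270])
m2, s2 = bayesian_interpolation(271, idx2, g(u[idx2]))
idx20 = np.linspace(1, 270, 20).astype(int)
m20, s20 = bayesian_interpolation(271, idx20, g(u[idx20]))

# A symmetric 95% interval for pixel 210, as in the text: mean +/- 1.96 std.
lo, hi = m20[210] - 1.96 * s20[210], m20[210] + 1.96 * s20[210]
```

With only the two end points, the posterior mean at pixel 210 falls close to the linear-interpolation value, while the posterior standard deviation is large; with twenty evaluations it narrows sharply, mirroring the shrinking intervals reported above.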

Figure 7. This figure shows the probability distribution function of the variable assigned to the pixel j = 210. In
panel (a) only the information of 2 data points is considered, while in panels (b) and (c) the information of 20 and 90
pixels is considered, respectively.
10 CONCLUSION

The suggested procedure can speed up Monte Carlo
simulations integrated with finite elements or other
highly complicated and time-consuming processes.
However, in this paper we have limited ourselves to a
finite number of pixels. The proposed method also
provides a tool for implementing informative priors
regarding the considered model. The extension of this
work to an arbitrary length and location of pixels
would provide a more powerful tool and is recommended
for future research projects.

REFERENCES

Bretthorst, G. L. (1992, July). Bayesian interpolation and
deconvolution. CR-RD-AS-92-4, The Advanced Sensor
Directorate Research, Alabama 35898–5000.
Jaynes, E. T. (2003). Probability Theory, the Logic of Science.
Cambridge University Press.
Rajabalinejad, M., P. H. A. J. M. van Gelder, and J. Vrijling
(2007). Dynamic limit boundaries coupled with Monte
Carlo simulations. Submitted to the Journal of Structural
Safety.
Rajabalinejad, M., P. H. A. J. M. van Gelder, J. K. Vrijling,
W. Kanning, and S. van Baars (2007). Probabilistic
Assessment of the Flood Wall at 17th Street Canal, New
Orleans. In Risk, Reliability, and Social Safety, Volume
III, pp. 2227.
Sivia, D. S. (1996). Data Analysis: A Bayesian Tutorial.
Clarendon Press.
van Gelder, P. H. A. J. M. (1999). Risks and safety of flood
protection structures in the Netherlands. In Participation
of Young Scientists in the Forum Engelberg, pp. 55–60.
Occupational safety
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Application of virtual reality technologies to improve occupational &
industrial safety in industrial processes

Jokin Rubio, Benjamín Rubio, Celina Vaquero, Nekane Galarza, Alberto Pelaz & Jesús L. Ipiña
Industrial Safety Unit, Fundación LEIA CDT, Spain

Diego Sagasti
División de Realidad Virtual, EUVE- European Virtual Engineering, Spain

Lucía Jordá
Unidad de Ingeniería de Producto, AIMME-Instituto Tecnológico Metalmecánico, Spain

ABSTRACT: Virtual Reality (VR) is emerging as an important tool in the industry sector to simulate human-
machine interaction providing significant findings to improve occupational and industrial safety. In this paper
several VR applications and tools developed for industries in the manufacturing and chemical sector are presented.
These developments combine VR simulations, immersive 3D dynamic simulation and motion capture, addressing
risk assessments in the design process, in normal operation and training.

1 INTRODUCTION

Virtual Reality is commonly understood as a computer
simulation that uses 3D graphics and devices to
provide an interactive experience. It is nowadays used
in multiple sectors such as entertainment (video games),
industry, marketing, military and medicine. In the
industry sector, this technology is emerging as an
important tool to simulate and evaluate human-machine
interaction, especially when this interaction may have
important consequences for the safety of people,
processes and facilities, or for the environment.

VR training applications are probably the most
broadly used. The military, aeronautic and nuclear
industries develop virtual scenarios in order to train
operators and so avoid risky and costly operations in
the real world. Also, simulators of fork-lift trucks and
gantry cranes have been developed to help workers
reduce labour risks (Helin K. 2005, Iriarte et al., 2005).

Beyond training applications, some research has
been done on the capabilities of VR as a technology
to study different safety aspects (risk identification,
machinery design, layout and training of operators) in
a manufacturing factory (Määttä T., 2003). Probably
the sector that has shown the greatest interest in these
technologies in recent years has been the automotive
industry, where VR is being used not only as a technology
oriented to the final user (ergonomic and comfortable
cars) (Monacelli, G., 2003), but also to reduce the costs
and time of fabrication and to develop safe and
ergonomic assembly lines for the workers.

Nevertheless, despite these examples and other
initiatives oriented to improving industrial and
occupational safety, VR is still an expensive technology,
mainly restricted to research and not accessible to the
majority of companies.

The Industrial Safety Unit (UDS) in Fundación
LEIA has an emerging research line focused on VR
technologies, combining virtual reality simulation of
workplaces and processes, immersive 3D dynamic
simulation and operator motion capture. Until now,
several applications and tools, addressed to companies
in the manufacturing and chemical sectors, have
been developed: 1) risk assessment (machinery safety,
task performance and ergonomic assessments); 2) risk
control (implementation & verification of industrial
safety & ergonomic requirements in the design and
redesign of machinery, devices, workplaces and
processes); and 3) operator risk communication,
education and training (especially workplace postural
training tools).

In this paper, the work performed by this group and
the results obtained are presented, as well as future
research on the application of VR technologies to
other sectors.
2 METHODOLOGY: IMMERSIVE VIRTUAL
REALITY SYSTEM

The Industrial Safety Unit (UDS) is provided with a
fully equipped virtual reality laboratory. The capabilities
of this laboratory can be summarized as follows
(see Figure 1).

• Biomechanical Motion Analysis, provided with IR
video cameras, optical markers, a data glove, and
motion capture software tools. This system allows
optical capture of movement and biomechanical
analysis of those movements, and provides the virtual
reality system with machine vision.
• CAD Design and Dynamic Simulation (IBM
Work Station + CATIA/DELMIA software tools).
The CATIA/DELMIA work-station is connected to the
Virtual Reality Equipment (next item) and supplies
the initial virtual environment in order to build
the Virtual Universe.
• Immersive Virtual Reality, the VirTools v4.0 software
tool (the virtual reality engine, the ''brain'' of the
virtual reality system) and hardware elements: HMD
helmet, data glove, projectors, screens, glasses.

Moreover, as a result of R&D projects (made jointly
with the companies ABGAM and STT), this system
has been optimized with modules to export motion
captures to DELMIA and to perform automatic
ergonomic risk assessments.

This virtual reality laboratory is being used by this
research group mainly for three safety-related
applications: workplace risk assessment, design
improvement, and the development of education and
training applications.

Figure 1. Schematic illustration of the UDS-Lab
capabilities.

3 RESULTS

3.1 Workplace risk assessment

3D simulation technologies and immersive virtual
reality are being used in UDS to assess process
and facility risks. In this sense, evaluations of machinery
safety, task performance and ergonomics assessments
are being developed (Figure 2).

Workers performing their tasks are simulated in
the Catia-Delmia software. Mainly, safety distances and
worker postures when performing tasks are analysed to
achieve an evaluation of the workplace, using the
modules of the software, which include several ergonomic
standards and tools.

Figure 2. Example of workplace assessment.

3.2 Industrial process design improvement

Immersive VR is being used in the design phase of
industrial processes to simulate labour tasks that
imply safety and ergonomic risks. This is nowadays
being carried out in a collaborative project (TIES) with
another two research centres, AIMME and EUVE.

A system based on CAD design, biomechanical
motion analysis and immersive virtual reality has been
developed (Figure 1). The basic approach followed
in this research is to ''immerse'' final users
in interactive virtual work scenarios developed from
3D machinery designs (Catia, Solid Edge). The user-
virtual scenario interaction is analysed in order to
identify the safety/ergonomic risks of those work
environment designs. In this way, designs can be
modified prior to the development of real machinery
prototypes/workplaces, offering great flexibility in
the design process.

A real example has been developed in order to test
the performance of this system. A specific workplace,
including a hydraulic press, has been modelled.
Experimental testing with users has provided preliminary
results regarding the usefulness of the system to
improve industrial process design.

The specific scenario and machinery addressed in
this experience were selected in view of their extensive
use in the metallic transformation sector (a key sector
in the aforementioned TIES project) and the high risks
involved in the manipulation of this specific machinery.
Three main steps have been carried out in this
work: virtual scenario development, ergonomic and
safety module programming, and initial testing.

As a first step, virtual scenarios were developed.
This activity implied mainly the programming of the
avatar and the machinery. The avatar was modelled
to move in the virtual world following the data from
the motion capture system, that is, following the real
movements of the user. The user is provided with a Helmet
Mounted Display (HMD), so he can observe the virtual
world while moving within it. The user is also provided
with a data glove to monitor hand movements, allowing
the user to interact with the virtual world and to trigger
actions like catching and releasing a piece (manipulation
of workloads). On the other side, the machinery
(a hydraulic press) was imported into the graphical engine
from its original design in Solid Edge. The mechanical
behaviour was modelled according to the press
specifications (the scope of this project focused only
on the mechanical characteristics; others, such as
electrical or hydraulic behaviour, were not considered
because of the difficulty of modelling them in our virtual
system). Different operating modes and safety measures
were programmed in order to achieve different work
scenarios (some examples can be seen in Figure 3).

Figure 3. Operating modes and safety measures of the
modelled press.

As a second step, software modules incorporating
ergonomic and safety requirements were developed
and integrated in the virtual system. In the interaction
of the user with the virtual world, those modules
provide the deviation from recommended standard
values. Specifically, the standard ergonomic methods
RULA, REBA, NIOSH and Snook & Ciriello have
been implemented, allowing an automatic ergonomic
evaluation from the motion capture system data (see
Figure 4). Similarly, safety standards with a focus on
the mechanical risks of the press were reviewed. The
standard UNE-EN 294:1993 (Safety distances to prevent
danger zones being reached by the upper limbs)
has been programmed so that distances among danger
zones, safety measures and the user can be checked.
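To make the nature of such a check concrete, here is a minimal sketch of a distance check driven by a tracked hand position. Every name and number in it is a hypothetical placeholder: the real required distances come from the tables of UNE-EN 294:1993 and from the modelled press geometry, neither of which is reproduced here.

```python
import math

# Hypothetical danger zones with required safety distances in metres.
# These numbers are placeholders, not values from the standard.
DANGER_ZONES = {
    "press_tools": {"centre": (0.0, 1.1, 0.4), "required_distance": 0.85},
}

def distance(p, q):
    # Euclidean distance between two 3D points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def check_safety_distances(hand_position, zones=DANGER_ZONES):
    # Return the zones for which the tracked hand is closer than the
    # required safety distance, together with the measured distance.
    violations = []
    for name, zone in zones.items():
        d = distance(hand_position, zone["centre"])
        if d < zone["required_distance"]:
            violations.append((name, round(d, 3)))
    return violations
```

In the system described above, the hand position would come from the data glove and optical markers at each frame; here it is simply a coordinate tuple.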
Finally, preliminary testing has been performed
with the developed system (Figures 5–6). Users provided
with input/output devices have been ''immersed'' in
the virtual world, where they have performed their
work tasks, loading/unloading pieces and operating
the press.

Figure 4. Example of a result of an ergonomic evaluation.

The initial experiences showed that the system
allows the identification of the safety and ergonomic
risks of an industrial process in the design phase.
Mainly, the automatic ergonomic evaluation seems
to be of special relevance and usefulness; these
evaluations have been validated through comparison
with standard methodologies (video recording plus
application of the ergonomic standards). On the other
side, the programming of safety standards for
mechanical risks into the virtual
system offered some assessment results, although the
experience showed that it seems more practical
to include this kind of module in the CAD systems
used in the machinery design process.

Nevertheless, some difficulties have arisen in the
use of immersive VR, mainly derived from the visual
technical limitations of the HMD used (not stereo),
leading users to have precision problems when virtually
interacting with the work environment/machinery. These
problems were partially corrected using mixed virtual
reality, providing tactile feedback to users. It is also
planned to carry out new experiences using a stereo HMD.

Figure 5 & 6. Immersive virtual reality for the design of a
workplace.

3.3 Workforce education and training

Tools to train workers in the ergonomic performance
of tasks have been developed and are being used by
the UDS. Specifically, a postural training tool was
developed for an industrial process where training
was identified as a key factor for the occupational
health and safety improvement. The tool was prepared
to be used directly by industry trainers and was
based on a mixture of real video records, 3D simulation
(CATIA/DELMIA) and recognised ergonomic evaluation
methodologies. With these software tools, workers
are shown the bad body positions in the execution
of their tasks, the ergonomic risks derived from them,
the correct body positions they should adopt, and the
improvement derived (see Figure 7).

Figure 7. Software tool for training.

4 FUTURE PERSPECTIVES

Virtual Reality is currently being used as a promising
technology to improve industrial and occupational
health and safety.

In this paper, some applications of this technology
developed by this research group have been presented.
In the near future, our research group plans three
main activities. Firstly, to perform more experiences
with immersive VR, analysing different alternatives to
overcome the technical difficulties outlined above, and
developing/testing more work environment scenarios.
Secondly, to export the background gained to other
industrial sectors, mainly in education and training
applications. Finally, it should be mentioned that the
experience obtained from these developments is being
used to initiate experiences in the use of VR technology
to improve the labour integration of physically and
cognitively disabled people (López de Ipiña, J. 2007).

ACKNOWLEDGMENTS

The TIES project was funded by the Spanish Ministry of
Industry, FIT-020600-2006-29.
REFERENCES

Helin K. et al., Exploiting Virtual Environment Simulator
for the Mobile Working Machine User Interface Design.
VIRTSAFE Seminar, 04-06.07.2005, CIOP-PIB, Warsaw.
Iriarte Goñi, X. et al., Simulador de carretillas elevadoras para
la formación y disminución en riesgos laborales: motor
gráfico y simulación dinámica. 7º Congreso Iberoamericano
de Ingeniería Mecánica, México D.F., 12 al 14 de
Octubre de 2005.
López de Ipiña, J.M., Rubio J., Rubio B., Viteri A.,
Vaquero C., Pelaz A., Virtual Reality: A Tool for the
Disabled People Labour Integration. Challenges for
Assistive Technology AATE 07, European Conference for the
Advancement of Assistive Technology, San Sebastián, 2007.
Määttä T., Virtual environments in machinery safety analysis.
VTT Industrial Systems, 2003.
Marc J., Belkacem N., Marsot J., Virtual reality: a design
tool for enhanced consideration of usability ''validation
elements''. Safety Science 45 (2007) 589–601.
Monacelli G., Elasis S.C.P., VR Applications for reducing
time and cost of Vehicle Development Process. 8th
Conference ATA on Vehicle Architecture: Products,
Processes and Future Developments, Florence, 2003.05.16.

Applying the resilience concept in practice: A case study from the oil
and gas industry

Lisbeth Hansson & Ivonne Andrade Herrera


SINTEF, The Foundation for Scientific and Industrial Research at the University of Trondheim

Trond Kongsvik
NTNU Social Studies LTD, Norway

Gaute Solberg
StatoilHydro

ABSTRACT: This paper demonstrates how the resilience concept (Hollnagel et al., 2006) can be used as a
perspective for reducing occupational injuries. The empirical background for the paper is a case study on an
oil and gas installation in the North Sea that had a negative trend in LTI (Lost Time Injury) rates. The HSE
(Health, Safety, Environment) administration initiated a broad process that included the crew on the installation,
the onshore administration and a group of researchers to improve the situation. Instead of focusing the analysis
on incident reports, we applied a proactive view. Thus, we adapted a model for resilience that was used in a
development process. In the context of occupational accidents, we focused on the following factors: sufficient
time, knowledge and competence, resources, and an inclusive working environment. These factors have been
identified as important for coping with complexity and necessary for the organization to be able to anticipate, perceive and
respond to different constellations of conditions.
was fruitful analytically and as a reflection tool in the development of new HSE measures that are now being
implemented. The links between the resulting HSE measures and the qualities of the resilience concept are
discussed.

1 INTRODUCTION

1.1 Background

The empirical background for this paper is a case study
on an oil and gas installation in the North Sea. The
starting point for the study was historical data that
showed a negative trend in LTI rates. We, as an outside
research group, were engaged in an effort to improve
their prevention efforts. During our involvement we
tried to shift the focus in the organization from a
reactive to a more proactive view on safety. By using an
action research approach (Greenwood & Levin 1998)
we tried to disengage from the focus on the negative
safety results and introduced resilience as an alternative
concept for reflecting upon safety. More concretely,
the research question we will try to illuminate in this
paper is: How can resilience be built in practice in
organizations?

Many different strategies are used in safety work
in the oil and gas industry. What many of them have
in common is that history is used to learn about the
future. One example is accident investigations that are
carried out to identify causes and implement measures
to avoid similar incidents occurring. This is in line with
an ''engineering'' approach to safety that Hollnagel
(2008) names ''Theory W''. Basic assumptions here are
that systems are tractable, and that systems should be
designed to avoid variability, which in itself is regarded
as a threat to safety.

One problem with using historical data as an
approach is that the conditions that produce dangerous
situations and safe conditions are both complex and
dynamic. The complexity refers to the interdependency
of causes and the fact that a specific constellation of
conditions often has to be present for accidents to
occur. Thus, it is often difficult to reduce accidents
to simple cause and effect relationships. The social
and technological context in which accidents occur is
also in constant flux and can be regarded as a ''moving
target'' that is difficult to grasp through traditional
accident investigations.

The resilience concept (Hollnagel et al., 2006)
represents an approach that implies building mechanisms
that make the organization prepared for the unexpected.
This can be regarded as a proactive approach
to safety. In a proactive view, individuals and
organizations must adjust to cope with current conditions.
These adjustments handle different constellations of
conditions that can produce accidents and also
successes. Thus a resilient organization (or system) can
adjust its functioning prior to or following changes
and disturbances, to continue working in the face of
continuous stresses or major mishaps. Here, variability is
regarded as potentially positive for safety, in line with
what Hollnagel (2008) labels ''Theory Z''.

The study is limited to one installation, and can be
regarded as a case study. This implies the exploration
of a ''bounded system'' over time involving several
data sources rich in context (Creswel 1998).

1.2 The case: Heidrun TLP

The case study was conducted on Heidrun TLP, which
is an oil and gas producing installation in the North
Sea. It is a large tension leg platform (TLP) built in
concrete that has operated since 1995. When Heidrun
was designed and built it was regarded as a North Sea
''Rolls Royce'' with sufficient space and high quality
equipment. The personnel working on Heidrun are
proud of the installation and feel a strong ownership
towards their working place.

Heidrun TLP is operated by the largest petroleum
company in Norway, StatoilHydro. Approximately
50% of the workers on the installation are employed by
StatoilHydro, while the rest are employed by companies
responsible for drilling and production and companies
responsible for maintenance and modification work;
another company within StatoilHydro is responsible
for the catering activities onboard. The personnel on
the installation follow the ordinary work schedule used
in the offshore business: two weeks on and four weeks
off duty.

The offshore industry has a high reputation regarding
safety. Safety awareness is high among the employees,
and compared to other industries the statistics on
occupational accidents are low. Still, the frequency of
occupational accidents has increased during the last
years, also on Heidrun TLP. This may be a consequence
of multiple causes, but some of those mentioned are
increased personnel turnover, more inexperienced
personnel, and a higher activity level caused by
extended modification work. The contract personnel
are more exposed to accidents than the StatoilHydro
employees. One explanation for this is that the
contractors are engaged in the most risky and complex
operations. In general, the work schedule in the
offshore business is regarded as challenging for the
workers; being away from normal life and the family
for two weeks is one challenge, and it is also challenging
to change into a work mindset after four weeks off.

The approach chosen for the case study was action
research combined with use of the resilience concept.
In practice, this implied a mutual reflection, in
a search conference, upon how to strengthen certain
qualities that could make Heidrun TLP more resilient
as an organization. The negative safety results were not
addressed at all in the workshop. Instead we focused
on the positive aspects of the organization and on how
to strengthen these even further.

In the next part of the paper the resilience concept
will be further explored, followed by a description of
how the concept was used in our case. In part four
the results will be presented, while in part five we will
give an overall discussion of the findings, followed by
the conclusions.

2 THEORY: THE RESILIENCE CONCEPT
IN SAFETY RESEARCH

The term resilience is not a new concept within
safety. Foster (1993) defined resilience as an ability
to accommodate change without catastrophic
failure, or the ability to absorb shock gracefully.
Rosness et al. (2004) adapted the resilience definition
to the capacity of an organization to accommodate
failures and disturbances without producing serious
accidents. Resilience has also been defined as the
properties of an organization that make it more resistant
to its operational hazards (Reason and Hobbs 2003).
The resilience engineering term has been discussed
as a transition from a ''traditional view'' to a ''systemic
view'' of safety (Hollnagel, 2007a).

In the traditional view, accidents are seen as a result
of failure or malfunction of components (humans or
machines). These failures follow a predefined path of
cause and effect. Accident models and their metaphors
provide the basis for prevention alternatives. Accident
prevention recommendations based on the Domino
Model (Heinrich, 1930) find and eliminate the causes
of specific accidents, and allow responding to a
specific unwanted event. In the same way, accident
prevention recommendations based on the Swiss
cheese model (Reason, 1990) focus on strengthening
barriers and defences. In the traditional view, risks are
seen as a linear combination of cause and effect, and
safety is achieved by constraining variability.

Due to technological improvements there has been
a transition from tractable systems to intractable
systems. Sociotechnical systems are becoming so
complex that work situations are always underspecified.
The designers cannot anticipate every contingency.
These systems are no longer bimodal, and normal
performance is variable. In accordance with Hollnagel
(2007b), performance variability is related to
technological system malfunctions and imperfections, and
to the humans that have the tendency to adjust to current
conditions. Therefore, the performance variability of
sociotechnical systems is normal and necessary,
resulting in both successes and failures. In the systemic
view, accidents and incidents are the result of
unexpected combinations of normal performance
variability. Accidents are prevented by monitoring and
damping variability. In this view, risks emerge from
non-linear combinations of performance variability.

Figure 1. Required qualities of a resilient system
(Hollnagel, Woods and Leveson, 2006).

Hollnagel (2008) formally defines resilience
engineering as the intrinsic ability of an organization (or
system) to adjust its functioning prior to or following
changes and disturbances, to continue working in the
face of continuous stresses or major mishaps. It is
not a surprise that there is no unique way to define
resilience. While the majority of definitions focus
on the capability to cope with failures, providing a
reactive approach, resilience engineering focuses on
the ability to adjust prior to or following a failure.
Resilience engineering explores ways to enhance the
ability of organizations to be robust and flexible
and to make organizations prepared to cope with
the unexpected. This definition focuses on variability,
adaptability and unpredictability.

We explore resilience engineering, and the premises
of resilience engineering will have an influence on
the understanding of the phenomena that we studied
and the solutions that we identified (Hollnagel, 2007).
These premises are:

– Since it is not possible to describe all operations
in detail, and resources are limited, performance
variability is necessary.
– Many adverse events could contribute to a success
or to a failure. These adverse events are the result
of adaptations to cope with complexity.
– Safety management must be reactive and proactive.
Safety management shall take into account
both hindsight and the ability of the organisation
(system) to make proper adjustments to anticipate and
monitor the external conditions that may affect the
operation.
– Anticipate risks and opportunities. At this point it is
required to go beyond risk analysis and to have the
imagination to see what may happen and to see key
aspects of the future (Westrum, 1993). It is not only a
matter of identifying single events, but of how they may
interact and affect each other.
– Learn from experience; this implies learning from
actual events, not only the collection of data in
databases.

In resilience engineering, safety is not seen as
the absence of accidents but as a dynamic non-event
(Weick and Sutcliffe, 2001) and the capability of the
system to handle unexpected situations. Resilience
acknowledges that individuals and organizations must
adjust to cope with current conditions. These
adjustments are always approximate due to current working
conditions, where there is a limited amount of
information, resources and time. Resilience Engineering
is about increasing the ability of the organisation to
make correct adjustments. The adjustments are
influenced by a number of conditions; these conditions are
potential threats, monitor risk, revise risk models lack of time, lack of knowledge, lack of competence
and to use resources proactively. and lack of resources (Hollnagel and Woods, 2005).
These conditions will facilitate the system to cope with
In this context, the qualities required for a system the unexpected event.
to be resilient are illustrated in Figure 1. Unexpected events require more time to understand
These qualities are related to the ability to: the situation and decide the proper action. If unex-
pected events occur in several occasions, they will
– Respond to regular and irregular threats in a robust affect other activities and there is a possibility of loose
and flexible manner. This is the reactive part of of control. The focus in relation to time should be
safety management. The system is designed to a when time demands are real and have consequences
limited range of responses. There is still a necessity for individuals. Knowledge is required to understand
to adjust responses in a flexible way to unexpected the event and ‘‘what happened’’ and competence is
demands. related to ‘‘knowing what to do’’ even if the unexpected
– Monitor in a flexible way own performance and event has gone beyond design limits. An unexpected
external conditions. This monitoring focused on event will require the use of resources to regain control.
what it is essential to the operation. In a dynamic Finally, the experienced learned from the management
and unpredictable environment, it is required for the of the unexpected events need to go back to the system
system to be able to have internal monitoring and in order to augment response capacity.

735
The resilience engineering concept presented in this section is adapted to the oil and gas case to analyse the ability of the organisation to anticipate, monitor, respond and learn, together with the conditions that influence this ability.


3 METHOD: ADAPTATION AND USE OF THE RESILIENCE MODEL

Based on the increasing number of occupational accidents, StatoilHydro initiated a project in cooperation with SINTEF. The scope was to reverse the negative trend by identifying safety measures dedicated to the pilot installation, Heidrun TLP.

The action research approach that was used on Heidrun TLP is part of an organizational development (OD) process (Greenwood & Levin 1998). The goal of this process was to increase the quality of adjustments and to improve occupational safety in the organization. In the OD process we focused on working conditions that influence the ability to make proper adjustments: sufficient time, knowledge, competence and resources (Hollnagel and Woods, 2005). In our early information gathering we saw that the psychosocial work environment on Heidrun TLP could also have a significant influence on the ability to make proper adjustments, and we therefore added this as a condition.

With this background we developed a working model that was used to structure the interviews and as a starting point for the discussions in the search conference.

''Knowledge'' and ''Competence'' were merged in the model for pedagogical reasons (although the difference was explained orally) and the factor ''psychosocial work environment'' was added, resulting in four main factors as indicated in Figure 2 below.

Based on initial discussions with the safety staff, some external factors were identified that were anticipated to influence the safety level on Heidrun TLP. These external factors are ''the safe behaviour programme'' (a large safety campaign), ''open safety talk'', ''cost & contracts'', ''organisational changes'', ''onshore support'' and, last but not least, ''management''.

Figure 2. The adapted resilience model.

All factors in the resilience model were discussed through approximately 40 semi-structured interviews, mostly with offshore workers. All kinds of positions were covered, among both internal StatoilHydro and contractor employees. The main findings were extracted from the interviews and sorted by the factors in the resilience model.

The findings from the interviews were the main input to the creative search conference, which also gathered around 40 persons. The participants represented both onshore and offshore personnel, and both internal and external StatoilHydro personnel. The two-day conference was arranged as a mix of plenary and group sessions to discuss and suggest safety measures. The second day of the conference was dedicated to the identification of measures, i.e. how Heidrun could become a more resilient organization.


4 RESULTS

The safety measures identified in the search conference were sorted and defined as HSE activities. These were presented and prioritized in a management meeting in the Heidrun organization, and the end result from the project was nine HSE activities:

• Safety conversations
• Buddy system
• Collaboration in practice
• The supervisor role
• Consistent management
• Clarification of the concept ''visible management''
• Meeting/session for chiefs of operation
• Risk comprehension course
• Visualisation of events

These activities have now been put into action. ''Safety conversations'' cover both formal and informal conversations where safety is a topic, either explicitly or indirectly. The aim of this measure is first of all to enhance the quality of these conversations by observing good and less good practice and applying training when necessary. In the ''Buddy system'', colleagues are assigned to take care of new colleagues. This HSE activity will contribute to an enhanced quality of this system through observation of practice, exchange of experience and training. In the ''Collaboration in practice'' activity, different work groups are brought together to become more familiar with their own work in relation to the work of others. The aim is to clarify roles and responsibilities in the common work processes and to increase knowledge about each other's work.

''The supervisor role'' is a role that needs to be developed and clarified, as this role has changed. The supervisor is the daily manager for the work force on
the installation, has direct contact with the crew and has a thorough knowledge about the operations. This activity will aim at clarifying this role and identifying the need for enhanced competence. ''Consistent management'' will help the managers to agree on a common practice for reacting to insecure behaviour.

The crew onboard the installation request more ''visible management'', but at the same time the management claim that they have too little time to be visible. It is however rather diffuse what is meant by this expression, and the activity will help to clarify this. ''Meeting/session for the chiefs of operation'' shall be an arena for good and constructive discussions about safety related topics. This activity will define topics to be addressed and will contribute to the design of fruitful processes for these meetings. The ''Risk comprehension course'' activity shall develop different courses with the aim of enhancing the comprehension of risk. Finally, the ''Visualisation of events'' activity will follow up and extend the visualisation of events through animations and video, and will also encourage the use of drawings in the reporting of events.


5 DISCUSSION

Three main qualities are required for a resilient organization: anticipation, attention and response. These qualities are described in a theoretical way in the theory section, but as an introduction to the discussion we will give a practical example related to occupational accidents.

If a group of people onboard an oil installation shall install a heavy valve together, they need to be well coordinated. They need knowledge about how to carry out the operation, including who is responsible for what. Competence on the risky situations they go through in this operation is also essential. This knowledge represents ''anticipation'', knowing what to expect. As the operation proceeds they also need competence on how to interpret the situation and what to look for to be aware of the risky situation; ''attention'' is needed. When a risky situation is observed it is crucial that they ''respond'' to it, and respond in a correct way. It is not unusual that an employee does not respond if he sees that an employee in a higher position does not follow safety procedures. Trust is essential to secure response. Time and resources are also important, to avoid critical situations going unaddressed because workers want to ''get the job done'' in due time.

How can the identified HSE activities potentially influence attention, anticipation and response? Table 1 shows how we interpret this.

Table 1. Activities influencing anticipation, attention and response.

Activity                                   Anticipation   Attention   Response
Safety conversation                        x              x           (x)
Buddy system                               x              (x)         (x)
Collaboration in practice                  x
The supervisor role                        x              (x)         (x)
Consistent management                      x              (x)
Clarification of ''visible management''                   (x)
Session for chiefs of operation            x
Risk perception course                     (x)            (x)
Visualization                              (x)            (x)

The activity ''Safety conversation'' covers all conversations where safety is an issue, and the purpose is to enhance the quality of these conversations. When safety issues are treated in a proper way they will increase the knowledge about safety, and clarify anticipations and what to expect. Safety conversations can also influence attention, e.g. what to look for in terms of hazards in daily work. One purpose of safety conversations between employees and managers is also to increase the awareness of how to respond to critical situations.

The ''Buddy system'' will in itself contribute to making newcomers to the installation more familiar and to increasing their competence, both about the installation and about how work is performed. Increasing the quality of this system and giving the ''buddies'' support so that they can be more prepared for this role may improve the newcomer's anticipation, attention and response.

''Collaboration in practice'' will especially give a better clarification of what to expect (anticipation) regarding how the work is carried out in a safe manner.

The supervisors are close to the daily operations. Increasing their knowledge and skills may therefore have an important effect on anticipation. Indirectly, and dependent on the skills the supervisors acquire, both attention and the quality of response may increase.

The goal of the activity ''Consistent management'' is to give managers a common understanding of how to respond to safe and non-safe behavior. Consistent positive and negative feedback that is regarded as fair can potentially increase both anticipation and attention. Response regarded as unfair can worsen the psychosocial working environment and thereby decrease the two qualities.

A management that is visible to the employees in terms of safety issues can in itself have a positive effect on attention. The activity ''Clarification of 'visible management''' will in the first stage only define the meaning of the term, and will thereby not contribute to resilience before something is done to make the managers more visible.
Introducing safety issues in meetings for chiefs of operations in a positive way can increase the managers' knowledge about safety, i.e. anticipation.

Both the ''risk comprehension course'' and ''Visualization of events'' can increase knowledge about safety (anticipation) and also competence in how to be aware of risky situations (attention), but this effect is dependent on high quality and proper use.

We see that most of the activities can potentially improve the anticipation of risks and opportunities. More uncertain are the influences on appropriate responses to threats and also on attention, the monitoring of performance and conditions. Attention and response are the two qualities that are most difficult to change or improve. Both attention and response can be regarded as behavior; thus a change in these two qualities requires a behavioral change. Anticipation can be regarded as a cognitive process, and is as such easier to change than behavior. Still, behavior change is crucial in the building of resilience. How the nine activities actually contribute to behavior change is still an open question, as the effects have not yet been evaluated.


6 CONCLUSION

The research question for this paper was how resilience can be built in practice in organizations. We have illustrated that the use of an action research approach, using search conferences, potentially could have a positive influence on the qualities that are required for resilient organizations: anticipation, attention and response. Our focus has been occupational injuries, but the approach could be valid for safety work in general.

The approach and process used in the case study demonstrate that a proactive approach to safety issues is motivating for the personnel involved. Statistics and reports on accidents are widely used to improve safety, but some fatigue can be observed among the personnel in relation to safety work using that approach. The feedback from this project was that the personnel had no difficulties dealing with the resilience concept as it was used in the project. Resilience was a construct that the offshore personnel liked to be associated with. One of the participants in the search conference, a safety delegate, expressed that this was the most interesting HSE meeting he had participated in during the last 25 years. The terms from the resilience model have been adapted and used during daily safety work on the installation. We may conclude that it is more motivating to use the proactive approach in practical safety improvement work.


REFERENCES

Creswell, J.W. 1994. Research design: Qualitative & quantitative approaches. Thousand Oaks, California: Sage Publications.
Greenwood, D.J. & Levin, M. 1998. Introduction to action research: social research for social change. Thousand Oaks, California: Sage Publications.
Heinrich, H.W. 1931. Industrial accident prevention. New York: McGraw-Hill.
Hollnagel, E. & Woods, D. 2005. Joint Cognitive Systems: Foundations of Cognitive Systems Engineering. USA: Taylor and Francis.
Hollnagel, E., Leveson, N. & Woods, D. 2006. Resilience Engineering: Concepts and Precepts. Aldershot: Ashgate.
Hollnagel, E. 2007a. Resilience Engineering: Why, What and How. Viewgraphs presented at the Resilient Risk Management Course, Juan les Pins, France.
Hollnagel, E. 2007b. Principles of Safety Management Systems: The Nature and Representation of Risk. Viewgraphs presented at the Resilient Risk Management Course, Juan les Pins, France.
Hollnagel, E. 2008. Why we need Resilience Engineering. Sophia Antipolis, France: Ecole des Mines de Paris.
Reason, J. & Hobbs, A. 2003. Managing Maintenance Error. Aldershot: Ashgate.
Weick, K. & Sutcliffe, M. 2001. Managing the Unexpected: Assuring High Performance in the Age of Complexity. University of Michigan Business School Management Series. USA: John Wiley & Sons, Inc.
Westrum, R. 1993. Cultures with Requisite Imagination. In Wise, J., Hopkin, D. & Stager, P. (eds), Verification and Validation of Complex Systems: Human Factors Issues. New York: Springer-Verlag, pp 401–416.

Development of an assessment tool to facilitate OHS management based upon the safe place, safe person, safe systems framework

A.-M. Makin & C. Winder
School of Risk and Safety Science, The University of New South Wales, Sydney, Australia

ABSTRACT: A model of OHS management was developed using the safe place, safe person and safe systems
framework. This model concentrates on OHS being a collective responsibility, and incorporates three different
perspectives—an operational level (safe place), an individual level (safe person) and a managerial level (safe
systems). This paper describes the qualitative methodology used in the development of the assessment tool,
including the lessons learnt from the pilot study and preliminary results. This research also promotes the use of a
new style of reporting that identifies areas of strengths as well as vulnerabilities, and uses discreet, non-emotive,
neutral language to encourage an objective, constructive approach to the way forward. The preliminary results
from the pilot study and peer review using the Nominal Group Technique were very encouraging, suggesting
that this technique would be useful in directing a targeted approach to systematic OHS management, and that
the safe place, safe person, safe system framework was suitable to be taken to the next stage of wider case study
application.

1 INTRODUCTION

Hazards in the workplace may arise in many areas: for example from the infrastructure, hardware and operating environment; from issues related to the individual workers' skills and strengths and the complexity of human behaviour; or even from the management strategies and methodologies used to direct the production of goods or services.

A review of the literature suggests that three main approaches have emerged to deal with the corresponding areas from which hazards may derive, namely safe place, safe person and safe systems (Makin and Winder 2006).

Safe place strategies focus on the existing physical workplace environment and may include the use of safe plant and equipment; preventive maintenance, monitoring and inspection; site security; and contingency plans for emergency situations and ongoing hazard reviews.

Safe place strategies often involve extensive capital, as it is more difficult to retrofit safety if it has not been included in the design phase. Changes necessary to the existing environment are often identified using the risk assessment and control process. Solutions may be more reliable when the hazards are predictable and there is widespread knowledge available. However, when this is not the case and the hazards are unique to the particular scenario, or the risks are unforeseen, it may be necessary to dedicate considerable time and resources to the development of a customised solution.

Safe person strategies can include the incorporation of safety into position statements; selection criteria to protect vulnerable workers; inductions for employees, visitors and contractors; training and skills development; methods for obtaining feedback; the inclusion of safe working attitudes into performance appraisals; and the review of reasons for personnel turnover. The use of personal protective equipment and provisions for first aid, injury management, workers' compensation and rehabilitation are also included here as contingency measures for when prevention and control measures have failed. These strategies focus on the individual worker.

Safe person strategies may be particularly useful when the workplace is not fixed, for example in call out work where the physical environment is unknown and there is heavy reliance on skills and expertise to manage the existing situation. These strategies may be limited by the complexities of human behaviour, such as fatigue and times of job or life stress. These must be appreciated at the time of application, and allowances or additional controls used to counteract these parameters.

Safe system strategies are frequently found in templates for occupational health and safety management systems (OHS MS) and deal with the concept or design phase, as well as providing guidelines on gathering
data to understand the work processes and consider the current control measures fail (the ‘‘raw’’ hazard pro-
bigger picture in review activities. Included here are file); secondly, an assessment was to be made on the
provisions for safe design, safe supply and procure- risk remaining once existing prevention and control
ment; competent supervision; good communication; strategies had been applied (the residual risk profile).
use of consultation; incident management; and means This was to give an indication of the amount of risk
of self checking via specialist audits or system reviews. reduction that had been achieved and to help identify
The focus here is on management and systems to opportunities for improvement. This was performed
promote a safe working environment. by using a risk ranking matrix factoring in a combi-
In order to determine the usefulness of the above nation of both severity and likelihood and a resulting
approach, an assessment tool was developed so that allocation of either high, medium-high, medium or
the safe place, safe person, safe system framework low. It should be noted that the use of non-emotive lan-
could be applied to determine the hazard profile of an guage was deliberately selected for providing feedback
organisation. The assessment tool comprised of sup- about hazard profiles as this was considered an impor-
porting material for each framework element to allow tant step in breaking down barriers to the improvement
a risk ranking exercise to be conducted. This included process and avoiding blame. For example words such
a definition and scope for each element and risk out- as ‘‘catastrophic’’ or ‘‘extreme’’ were not used in risk
comes if the element was overlooked. The individual ranking labels. Where elements were handled with
assessments for each element were aided by the devel- expertise this was recognised and fed back to organisa-
opment of a series of prompts that considered possible tion by giving it a risk ranking of zero—or ‘‘well done’’.
risk factors and possible prevention and control strate- Also, an assessment was made on the level of for-
gies for each of the elements (see Figure 1 for elements mality applied to the systems invoked and whether
for ‘‘Electrical’’ and ‘‘Stress Awareness’’). or not all the elements proposed by the safe place,
The risk ranking exercise was conducted as a two safe person, safe system framework had in fact been
stage process: firstly without taking into account inter- addressed by the organisation. The level of formality
ventions that were already in place so that areas of was also assessed to recognize where informal systems
vulnerability could be identified should any of the were used to manage risks, but did not contain a high

Electrical
All electrical equipment should be handled appropriately by those who are suitably qualified and kept in good working order. Other electrical hazards include
electric shock; static electricity; stored electrical energy, the increased dangers of high voltage equipment and the potential for sparks in flammable/explosive
atmospheres. Where live testing is necessary, only appropriately trained and qualified personnel should do so in compliance with relevant legislation and codes.
The risk is that someone may be injured or fatally electrocuted or cause a fire/explosion by creating sparks in a flammable or explosive atmosphere.

Possible Risk Factors Possible Prevention and Control Strategies


Electrical Equipment/Goods in Use Worn Cables Isolation Procedures Use of Authorised/ Qualified Repairers
High Voltage Equipment In Use Overloaded Power Points Lead/Cable Checks Circuit Breakers
Use of Residual Current Detectors High Voltage Procedures
Messy Power Leads Static Electricity Generated
Static Electricity – Use of Earthing Devices or Non-Conducting Materials
Use of Unqualified Personnel for Electrical Work Live Testing Necessary
Lightning Protection
Breakdowns out of Normal Business Hours Difficulty in Isolating Circuits

Stress Awareness
Personal skills, personality, family arrangements, coverage of critical absences, resourcing levels and opportunities for employees to have some control over
work load are factored into ongoing work arrangements so as not to induce conditions that may be considered by that particular employee as stressful. Plans are
available for dealing with excessive emails and unwelcome contacts.
The risk is that the employee becomes overwhelmed by the particular work arrangements, and is unable to perform competently or safely due to the particular
circumstances.

Possible Risk Factors Possible Prevention and Control Strategies


Shift Work Night Shifts Training in Necessary Skills Prior to Promotion/ Transfer/ Increased Responsibilities
High Level of Responsibility Long Shifts Consultation Before Changes Made Plans to Cover Critical Absences
Frequent Critical Deadlines Young Families Authority Commensurate with Responsibility Access to Necessary Resources
No Control Over Work Load No Consultation Prior to Changes Consideration of Individual Family Circumstances Adequate Staffing Levels
Lack of Necessary Skills Covering of Absences/ Unforseen Staff Shortages Authority Commensurate with Responsibility Negotiated Workloads

Figure 1. Examples of supporting material and prompts for elements in the revised assessment tool.

740
level of documentation to record actions. This was Ultimately, the Nominal Group Method was
to explore whether there was a connection between selected over the Delphi Method for a number of
the level of risk reduction achieved and the use of reasons:
formal systems, as well as to highlight areas for further
growth. • the extensive nature of the literature review which
A pilot study was conducted with a prelimi- formed the primary source of input into the process;
nary assessment tool to trial the technique and case • the potential for the Delphi Method to become
study protocol. The methodology for this qualitative overly extended if difficulty is encountered reaching
approach was conducted according to Yin (1989). a consensus, or if the return of the reviewed material
A qualitative approach was selected because previous is delayed; and
research investigating means of assessing the effec- • the synergistic effect when working together in a
tiveness of OHS MS had shown that ‘‘one size did workshop setting whilst using the Nominal Group
not fit all’’ and there was a need for a more cus- Technique was thought not only to streamline the
tomised approach (Gallagher, 1997). Hence, there process, but also enrich the final outcome.
would be difficulty in evaluating the results of the
same tool being applied to different organisations The Nominal Group Technique is not without limi-
when each organisation has unique needs and an tations, namely the potential for domination of certain
unpredictable hazard profile. The qualitative method- members, and the possibility of groupthink (Petersen,
ology would allow the variables to be better under- 1975). In order to counteract this potential bias,
stood and addressed before undertaking a larger scale the guidelines offered by Delbeqc, Van de Ven and
quantitative study. Gustafson (1975a; b) for conflict resolution and the
constructive use of diverging opinions were observed.
After the Nominal Group Technique was selected
2 METHOD for internal validation by a panel of experts from
academia and industry, a letter of invitation was sent
The initial development of the assessment tool was out. The final panel members included representa-
based on an extensive review of the literature. Meth- tion by a manual handling expert; a psychologist; an
ods to provide internal construct validity such as the occupational hygienist; a dangerous goods expert; a
Delphi Method and the Nominal Group Method were human factors analyst; three members with chemical
considered as most appropriate for this particular engineering experience; an occupational toxicologist;
research as there were no studies known to the authors and industry representatives with experience in manu-
where a statistically significant quantitative approach facturing and design. Three academics were involved
had been used successfully, and so there were no stud- including the chairman of the Standards committee
ies available for comparison of results. Hence the for the development of AS/NZS 4804/4801: 2001
above mentioned methods were both indicated for the Occupational Health and Safety Management Systems
development of new knowledge where comparison with other studies was not available, and success in both cases hinged on a balanced and vigorous peer review process.

The Delphi Method involves the use of an expert panel, with each member receiving the document for review. The review is performed independently and returned. This is to avoid the potential for domination or influence from other members. Multiple rounds of the review then take place until convergence is reached. The issue of convergence forms the critical point where the application of the Delphi Method is challenged. Without skillful determination of the end point, there is the possibility of creating a very long, drawn out process (Landeta, 2005; Linstone and Turoff, 1975).

The Nominal Group Technique is similar to the Delphi Method, although instead of the information for review being sent out with a questionnaire, the panel members are collected together and changes or improvements are brainstormed in a workshop setting (Delbecq, 1975b).

(Standards Australia, 2001a, 2001b). All but two of the invited members were able to attend on the day, but all provided input to the process.

The actual Nominal Group Technique session was conducted by assembling the members of the panel together after the first stage of the pilot study was completed and the preliminary report on the pilot study produced to demonstrate the application of the assessment tool and findings of the risk ranking exercises (see Table 1, and Figures 2–4).

The members of the panel were each given a copy of the assessment tool and the preliminary report one month prior to the review date so there was ample time to read through the information. The Nominal Group Technique was carried out in two stages—one for brainstorming of ideas for improvements and the second for voting on the top five ideas from the improvements suggested. The brainstorming stages were split into four sessions so that equal time was allocated for each of the three sections—safe place, safe person and safe systems; as well as time to consider improvements to the format of the final report.
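The voting stage described in this paper scores each panel member's ranked ballot with five points for their most important idea down to one point for the fifth, which amounts to a simple Borda-style tally. A minimal sketch in Python; the ballots and idea names below are invented for illustration only:

```python
from collections import Counter

def tally_votes(ballots):
    """Score ranked ballots: a member's 1st choice earns 5 points, 5th earns 1."""
    scores = Counter()
    for ballot in ballots:  # ballot = list of up to five ideas, best first
        for rank, idea in enumerate(ballot, start=1):
            scores[idea] += 6 - rank
    return scores.most_common()  # ideas sorted by total score, highest first

# Hypothetical ballots from three panel members (not the study's real data)
ballots = [
    ["merge assessments", "colour coding", "risk graph", "PPE element", "despatch"],
    ["risk graph", "merge assessments", "PPE element", "colour coding", "despatch"],
    ["merge assessments", "risk graph", "despatch", "colour coding", "PPE element"],
]
print(tally_votes(ballots))
```

Because votes are cast in confidence and only the totals are shared, a tally of this kind lets unresolved disagreements be expressed without attribution, as the paper describes.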

Table 1. Framework model used for OHS MS for pilot study.

Safe place                              | Safe person                        | Safe systems
----------------------------------------|------------------------------------|-------------------------------------
Baseline Risk Assessment                | Equal Opportunity/Anti-Harassment  | OHS Policy
Ergonomic Assessments                   | Training Needs Analysis            | Goal setting
Access/Egress                           | Inductions—Contractors/Visitors    | Accountability/Due Diligence
Plant/Equipment                         | Skill acquisition                  | Review/Gap Analysis
Storage/Handling/Disposal               | Work Organisation                  | Resource Allocation/Administration
Amenities/Environment                   | Accommodating Diversity            | Procurement with OHS Criteria
Electrical                              | Job Descriptions                   | Supply with OHS consideration
Noise                                   | Training                           | Competent Supervision
Hazardous Substances                    | Behaviour Modification             | Safe Working Procedures
Biohazards                              | Health Promotion                   | Communication
Radiation                               | Further Education                  | Networking, Mentoring, Consultation
Installations/Demolition                | Conflict Resolution                | Legislative Updates
Preventive Maintenance                  | Employee Assistance Programs       | Procedural Updates
Modifications—Peer Review/Commissioning | First Aid/Reporting                | Record Keeping/Archives
Security—Site/Personal                  | Rehabilitation                     | Customer Service—Recall/Hotlines
Emergency Preparedness                  | Health Surveillance                | Incident Management
Housekeeping                            | Performance Appraisals             | Self Assessment Tool
Plant Inspections/Monitoring            | Feedback Programs                  | Audits
Risk Review                             | Review of Personnel Turnover       | System Review
Figure 2. Examples of hazard distributions without interventions (left) and with interventions (right). [Pie charts: without interventions: Safe Place 39%, Safe Person 29%, Safe Systems 32%; with interventions: Safe Place 37.3%, Safe Person 33.3%, Safe Systems 30%.]

The top five ideas were voted upon using a weighting system—five for the most important idea down to one for the least important of the five ideas selected. The votes were made in confidence and collected for counting. The purpose of having all of the members together was to provide synergy and the opportunity for explanation by the authors of the assessment tool as well as to share any lessons learnt from the pilot study.

The external validation of the methodology was provided by the triangulation of data during the assessment stage—seeking to find convergence of observations; questionnaire and interview responses; and objective documentary evidence. The purpose of this external validity was not for statistical inference or to establish causation, but to ensure the coherence of theoretical reasoning for the development of emergent

Figure 3. An example of the risk reduction graph in the revised report (dark columns—without interventions; light columns—with interventions). [Bar chart: risk ranking (0–80) for Safe Place, Safe Person and Safe Systems.]

Inductions—Contractors/Visitors: All visitors and contractors to the workplace are made aware of any hazards that they are likely to encounter and understand how to take the necessary precautions to avoid any adverse effects. Information regarding the times of their presence at the workplace is recorded to allow accounting for all persons should an emergency situation arise. Entry on site is subject to acceptance of site safety rules where this is applicable.
The risk is that people unfamiliar with the site may be injured because they were unaware of potential hazards.
Procedures for contractors and visitors are working well.

Incident Management: A system is in place to capture information regarding incidents that have occurred to avoid similar incidents from recurring in the future. Attempts are made to address underlying causes, whilst also putting in place actions to enable a quick recovery from the situation. Root causes are pursued to the point where they are within the organisation's control or influence. Reporting of incidents is encouraged with a view to improve rather than to blame. Near misses/hits are also reported, and decisions to investigate are based on likelihood or potential for more serious consequences. Investigations are carried out by persons with the appropriate range of knowledge and skills.
The risk is that information that could prevent incidents from recurring is lost and employees and others at the workplace continue to contract illnesses or be injured.
This is used more as a database for reporting rather than as a problem solving tool. Selective application of root cause analysis, corrective action and evaluation may yield significant improvements in this area.

Figure 4. Excerpts from the preliminary report illustrating the style of reporting used (above).

themes and paradigm shifts. In this sense, the validation process for qualitative research does not try to achieve the same goals as quantitative research; the aim is instead to provide multiple perspectives, and in doing so overcome the potential for bias in each individual method. A collage is then formed to give depth and breadth to the understanding of complex issues (Yin, 1989).


3 RESULTS

After the peer review process a number of changes were made to the final framework which represented the building blocks of systematic occupational health and safety management, and there was a minor modification to the format of the final report. A letter was sent out to each of the panel members explaining the changes made and a copy of the final assessment tool. Whilst all the votes were carefully considered, the final decision on the layout was made by the authors. The final framework comprised twenty elements for each aspect of the model, making a sixty element matrix, three more than the original model. A Risk Reduction graph was added to the final report to increase clarity and assist in the interpretation of the final results (see Figure 3).

Three new elements were added: Receipt/Despatch to cover OHS issues associated with transportation of materials to and from the workplace; Personal Protection Equipment to address issues related to the safe use of PPE; and Contractor Management to ensure that all the lines of responsibility are well understood and that all the information necessary has been exchanged.

Details of changes within elements in the framework model were:

• Training Needs Analysis was incorporated into Training.
• Work Organisation—Fatigue and Stress Awareness was modified to remove stress awareness, which became its own element.
• Noise included more information on vibration.

• Access/Egress included a reference to disabled access/egress.
• Risk Review was renamed Operational Risk Review.
• Plant Inspections/Monitoring was renamed Inspections/Monitoring and the explanatory information modified to reflect the intention that this was not the walk around inspections associated with housekeeping but more to do with understanding the process.
• Storage/Handling/Disposal had Storage/Handling removed so the element just refers to Disposal.
• Ergonomic Assessment was renamed Ergonomic Evaluation and had any references to Occupational Hygiene removed from the explanatory material.

Not many changes were necessary to the format of the final report, although it was found that the two pie charts in Figure 2 were difficult to interpret, so the bar graph in Figure 3 was added to illustrate the relative risk reduction that had taken place.

A number of other changes were incorporated into the format of the revised OHS assessment tool after the pilot study was conducted. The most significant of these was to merge the two assessments (one without interventions and the other with interventions in place) into the one section. This was to enhance ease of use and reduce the time taken to conduct the actual assessments. Also, the tool was colour coded to assist navigation of the information whilst on site.

Once the changes to the revised OHS assessment tool were finalised, advertisements were placed in an Australasian local OHS journal, an OHS email alert as well as on the home page for the UNSW School of Safety Science's website, to attract participants into the study. A brochure was also produced to give details of the study to those interested in a hard copy and electronic format. This ensured that all participants received the same information. A total of eight organisations were identified for the next phase of the project with the revised assessment tool and improved reporting format.

The size of the organisation to be used for the case studies was not specified in the advertisements as the authors were interested to find out what sized organisations would respond. The pilot study was conducted successfully on a medium sized manufacturer in the plastics industry. Of the eight organisations that responded and were eventually selected for the case studies, one was a small retail business; two were small to medium sized family businesses and the remainder were larger organisations. Where larger organisations were involved, it was considered that, as the OHS assessment tool was originally intended for use in small to medium sized enterprises, it could be suitable if the scope of the assessment was limited to a smaller, defined division of the organisation.

The pilot study was then taken to a second stage that was completed after the Nominal Group Technique review had been performed, to investigate whether the risk ranking exercise could be used to make targeted improvements in the workplace. This was conducted by asking the organisation to select three elements that it would like to improve, then choosing three questions about each element (making a total of nine) that would be asked each month for a period of four months. The questions were to be phrased so that they would generate a simple yes or no answer, and one person was asked to be responsible for the actions nominated. The purpose was to target visible outcomes that would improve the final risk rating for the particular elements selected. To improve the objectivity of the controlled self assessment exercise, these answers were to be independently spot checked each month. Only three elements were targeted in recognition of the time it takes to implement changes to a safety program and the desire to keep the task manageable and hopefully to obtain some early wins that might encourage management to continue with the improvement program.

The organisation in which the pilot assessment was conducted was also asked to identify some means of measuring whether or not OHS performance was improving during the implementation of the study. Guidance material was provided on the strengths and limitations of various commonly encountered measurement indicators in the project field kit supplied at the onset of the pilot study (Makin and Winder, 2007). The pilot organisation opted to continue measuring injuries and incidents, and the three elements targeted were: work organisation; access/egress and incident management. At the end of the four months the organisation was asked to complete a short evaluation survey. The outcomes of this monitoring process are shown in Figure 5.


4 DISCUSSION

The use of the Nominal Group Technique was found to be of great benefit to the development of the OHS assessment tool by offering an opportunity for vigorous peer review by a group of experts of varying backgrounds. Not only was this necessary to support the internal validity of the assessment tool developed, but it was also found to greatly enrich the final version of the tool for later use in the multiple case study exercise. Further, each expert was able to bring in detailed knowledge of their interpretation of the OHS elements, and the final product could not be said to be a reflection of any one viewpoint.

The difficulties encountered with the application of the Nominal Group Technique were mainly related to the logistical problems of availability of panel members. As a result two of the members were unable

Figure 5. Injury and incident results after completion of phase 2 of the pilot study. [Chart: numbers of reports (0–10) per month from June to January, for LTIs, medical treatments, first aid treatments and reports.]
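The second-stage monitoring used in the pilot (three yes/no questions for each of three targeted elements, answered monthly and independently spot checked) can be sketched as a small compliance tracker. The element names come from the text; the monthly answers below are invented for illustration only:

```python
# Hypothetical monthly yes/no answers (1 = yes, 0 = no) for the nine questions:
# three questions per targeted element, tracked over the four-month period.
answers = {
    "work organisation":   {"month 1": [1, 0, 0], "month 2": [1, 1, 0],
                            "month 3": [1, 1, 1], "month 4": [1, 1, 1]},
    "access/egress":       {"month 1": [0, 0, 1], "month 2": [1, 0, 1],
                            "month 3": [1, 1, 1], "month 4": [1, 1, 1]},
    "incident management": {"month 1": [0, 0, 0], "month 2": [0, 1, 0],
                            "month 3": [1, 1, 0], "month 4": [1, 1, 1]},
}

def monthly_compliance(answers):
    """Fraction of 'yes' answers per element per month."""
    return {element: {month: sum(a) / len(a) for month, a in months.items()}
            for element, months in answers.items()}

rates = monthly_compliance(answers)
print(rates)
```

A rising yes-rate per element over the four months is the kind of visible, targeted improvement the exercise was designed to surface.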

to attend the workshop on the day, but were able to offer their comments for review at a later date. Also, this task was performed as an honorary role, so it was necessary to ensure that the half day was well organised. To assist this process, agendas were sent out in advance and the program steered and facilitated by an academic who was not involved in the actual development of the tool itself, to maintain objectivity. The workshop was able to adhere to the timelines suggested and the process was considered to run very smoothly. Although there were clearly differing views from some of the panel members, this was not unexpected as each brought their own perspective and experience, and sharing this was in itself a worthwhile exercise. Where differing opinions remained unresolved, the members were directed to express their views at the confidential voting stage, and to cast their votes on the balance of information available. Once the votes were tallied and the feedback worked into the final version of the assessment tool, the panel members were given another opportunity to express any concern with the final outcome by feeding information back to the authors within a reasonable time period. No changes were requested and most panel members expressed satisfaction with the final outcome.

Clearly the success of the Nominal Group Technique would be heavily influenced by the range, breadth and depth of the experience of the panel members selected, and this selection process was considered to be the most important stage. In this particular scenario, the process could have been enhanced by the inclusion of an electrical engineer and a mechanical/structural engineer, although a number of very experienced chemical engineers were present.

Furthermore, the success of the workshop would not have been possible without the ability to feed back the experience of the pilot study. This was found to be crucial in terms of assessing the proposed framework's initial viability, and many lessons were learnt along the way, so the development of the preliminary assessment tool was a very dynamic process. During the pilot study, methods that did not appear to be workable or were too cumbersome were quickly modified from the feedback received at the time. For example, the two stage assessments were taking too much time when performed in isolation so they were combined in the final assessment tool. Furthermore, the triangulation of data involving observation, interviews and a review of documentation was found to produce the most fruitful and reliable data when information was sought from layers of the organisation. It was found that it was very important to collect the differing perceptions from management, operations personnel and the individual workers, and that these were all slightly different. Management's opinions were found to be more optimistic, whilst sometimes the individual workers were more skeptical (and perhaps more realistic) and operations tended to be somewhere in-between. Where there was gross misalignment of opinions, these areas were studied in more depth until a clearer picture of the situation emerged. Sometimes this would involve some external research to verify the facts, for example where this involved the use of hazardous substances, to ensure that the correct requirements were in fact being used.

article (Makin and Winder, 2008). During the pilot study the broader context offered by the safe place, safe person and safe systems model was able to highlight areas of vulnerability that had perhaps been disguised by a focus on production and on hazards to do with the physical plant environment such as noise and manual handling. Prior to the study there were significant issues related to work organisation and the use of inappropriately long shifts that were unresolved. The pilot study was able to highlight how this was potentially exposing workers to unnecessarily high levels of risk, and the longer shift hours were accentuating the problems with noise, fatigue and manual handling as well as solvent exposure.

The second stage of the pilot study involving a monthly questionnaire was found to be more difficult to implement and depended on having someone within the organisation who was highly motivated to see the process through. Fortunately, the preliminary results of the OHS assessment on the pilot study were considered to be very worthwhile by the organisation and this generated enough enthusiasm to proceed to the second stage. However, the organisation was undergoing a period of flux and management turnover, so this second stage was delayed until the situation settled. Although the OHS assessment was conducted with the preliminary tool in March and April 2007, the second stage wasn't fully underway until the following October even though the agreed follow up actions were decided in May. Despite these delays, a clear improvement in OHS performance was observed (see Figure 5), and it appeared that some time lag was involved until the effects of the program came to fruition—such as discussing the outcomes of two incident investigations per month at regular, but informal, operational meetings. Whilst no statistical correlation was attempted due to the qualitative nature of the study, a further explanation of the improved trend was the increased focus and attention on safety and health promoted by the study and the use of a targeted approach that was realistic. The follow up actions had been set by the organisation themselves and it was important that they were in full control of the process. The preliminary assessment was also able to feed back positive information in areas where they had excelled, for example in inductions for visitors and contractors, and this was well received and helpful for encouraging their co-operation with the study.

Finally, the reporting style utilised was very well received and the report was able to be widely distributed. The pictorial representation of key information and colour coding was found to be useful in the quick dissemination of main points and was considered to facilitate the interpretation of material to a wider audience, from individual workers in a safety committee setting to operations personnel and management.


5 CONCLUSION

The use of a pilot study and the Nominal Group Technique to trial the application of the safe place, safe person, safe system model through the development of an assessment tool was found to be very rewarding and worthwhile, and essential to the integrity of the research being undertaken. The results have enabled this research to be taken to the next level—multiple case studies which are currently in progress and near completion. This qualitative approach is highly recommended for this particular field of research, and preliminary results from the case studies suggest that there is much scope for future development and further work, in particular for customising the current OHS assessment tool for specific industry fields. Furthermore, the application of this tool was not limited to small to medium enterprises as originally thought, and may provide a useful benchmarking exercise across larger organisations where they are comprised of smaller subsidiary groups.


ACKNOWLEDGEMENTS

The authors gratefully acknowledge Dr. Carlo Caponecchia, who was facilitator of the nominal group session, and all contributors to the session.


REFERENCES

Delbecq, A.L., Van de Ven, A., Gustafson, D.H. (1975a) Profile of Small Group Decision Making. In: Group Techniques for Program Planning. Glenview, Illinois: Scott, Foresman and Company. pp. 15–39.
Delbecq, A.L., Van de Ven, A., Gustafson, D.H. (1975b) Group Decision Making in Modern Organisations. In: Group Techniques for Program Planning. Glenview, Illinois: Scott, Foresman and Company. pp. 1–13.
Gallagher, C. (1997) Health and Safety Management Systems: An Analysis of System Effectiveness. A Report to the National Occupational Health and Safety Commission. National Key Centre in Industrial Relations.
Landeta, J. (2005) Current validity of the Delphi method in social sciences. Technological Forecasting and Social Change. In press, corrected proof.
Linstone, H.A., Turoff, M. (1975) Introduction. In: The Delphi Method: Techniques and Applications. Reading, Massachusetts: Addison-Wesley Publishing Company. pp. 1–10.
Makin, A.-M., Winder, C. (2006) A new conceptual framework to improve the application of occupational health and safety management systems. In: Proceedings of the European Safety and Reliability Conference 2006 (ESREL 2006), Estoril, Portugal. Taylor and Francis Group, London.
Makin, A.-M., Winder, C. (2007) Measuring and evaluating safety performance. In: Proceedings of the European Safety and Reliability Conference 2007 (ESREL 2007), Stavanger, Norway. Taylor and Francis Group, London.
Makin, A.-M., Winder, C. (2008) A new conceptual framework to improve the application of occupational health and safety management systems. Safety Science. In press, corrected proof. doi:10.1016/j.ssci.2007.11.011
Petersen, D. (1975) Coping with the Group. In: Safety Management: A Human Approach. Deer Park, New York: Aloray. pp. 205–215.
Standards Australia. (2001a) AS/NZS 4801:2001 Occupational Health and Safety Management Systems—Specification with Guidance for Use. Sydney: Standards Australia International Ltd.
Standards Australia. (2001b) AS/NZS 4804:2001 Occupational Health and Safety Management Systems—General Guidelines on Principles, Systems and Supporting Techniques. Sydney: Standards Australia International Ltd.
Yin, R. (1989) Case Study Research: Design and Methods. Newbury Park, US: Sage Publications.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Exploring knowledge translation in occupational health using the mental models approach: A case study of machine shops

A.-M. Nicol & A.C. Hurrell
University of British Columbia Centre for Health and Environment Research, Vancouver, British Columbia, Canada

ABSTRACT: The field of knowledge translation and exchange is growing, particularly in the area of health ser-
vices. Programs that advance ‘‘bench-to-bedside’’ approaches have found success in leveraging new research into
a number of medical fields through knowledge translation strategies. However, knowledge translation remains
an understudied area in the realm of occupational health, a factor that is interesting because workplace health
research is often directly applicable to risk reduction activities. This research project investigated knowledge
translation in one occupational setting, small machine shops, where workers are exposed to Metal Working
Fluids (MWF) which are well established dermal and respiratory irritants. Using the mental models approach,
influence diagrams were developed for scientists and were compared with qualitative interview data from
workers. Initial results indicated that the sphere of influence diagrams would benefit from the inclusion of other
stakeholders, namely policy makers and product representatives. Overall, findings from this research suggest
that there is only minimal transfer of scientific knowledge regarding the health effects of metal working to
those at the machine shop level. A majority of workers did not perceive metal working fluids to be hazardous
to their health. Of note was the finding that MWF product representatives were rated highly as key sources
of risk information. The translation of scientific knowledge to this occupational setting was poor, which may
be due to varying perceptions and prioritizations of risk between stakeholders, lack of avenues through which
communication could occur, an absence of accessible risk information and the small size of the workplaces. The
mental models approach proved successful for eliciting information in this occupational context.

1 INTRODUCTION

Work is a central feature of life for most adults, providing our livelihood and sense of place, and occupying at least a third of our waking hours. Not surprisingly, work can have a profound impact on health. There is growing recognition that many diseases (e.g., asthma and chronic obstructive lung disease, joint and tendon disorders, stress-related mental health problems, and some cancers and communicable diseases) can be caused or augmented by workplace exposures.

Because of its vital role in adult life, the workplace provides a valuable opportunity for promoting occupational health. A key feature of occupational health research is that it is often possible to translate research results into direct prevention activities in the workplace. Thus, the communication or transfer of occupational health risk information has the potential to have an immediate and profound effect on work-related morbidity and mortality.

The process of communicating risk information to workplaces involves workers, managers, engineers, researchers, health experts, decision-makers, regulatory bodies and governments. Although most occupational health researchers are aware of the need to communicate with front-line workers and decision-makers, there has been little research on knowledge translation or risk communication processes within occupational settings. To date, understanding the factors affecting the application of knowledge in this field remains extremely limited.

The primary objective of this research was to investigate how health information is translated in one occupational setting, machine shops, where workers are exposed to complex mixtures of chemical and biological hazards in the form of metal working fluids (MWFs). These exposures have been linked to occupational asthma, increased non-allergic airway responsiveness to irritants, chronic airflow obstruction and bronchitis symptoms (Cox et al., 2003) and have been the focus of considerable recent regulatory attention in Canada and the US. This project is linked to an ongoing study of risk factors for lung disease among tradespeople in British Columbia, Canada, which has

found machinists to be at higher levels of risk for lung problems than three other trade groups (Kennedy, Chan-Yeung, Teschke, & Karlen, 1999). Using interview data from experts, workers and managers, this project aimed to identify knowledge gaps and misperceptions about metal working fluid exposure, to determine socio-cultural and organizational factors that influence how knowledge is transferred in an occupational context, and to determine preferred channels or pathways for health risk communication.


2 METHODS

2.1 Data collection

Data was collected for this project using the Mental Models methodology developed by Morgan et al. at Carnegie Mellon University (Morgan, 2002). This method has been previously applied in an occupational context (Cox et al., 2003; Niewohner, Cox, Gerrard, & Pidgeon, 2004b). The data was collected in two phases, beginning with interviews with scientific experts in the field of MWF exposure and effects, followed by interviews with workers employed in machine shops.

2.1.1 Expert interviews

A member of the study team who is an academic expert on the health effects of MWF compiled a list of experts on MWF and health effects. The list comprised primarily academic researchers from the US and Europe, but also included US government researchers and occupational health professionals. Of this list of experts, the study team was able to contact 16, and 10 of these consented to participate.

The interviews, which were carried out by a single trained research assistant, were conducted over the phone and lasted for an average of 30 minutes. The first two interviews were used to pilot test the survey instrument and were therefore not included in the final analysis. The respondents were asked open-ended questions about exposures, health effects, and mitigation strategies relating to MWF in the workplace. They were also asked about their attitudes and practices relating to the communication of their research results to decision-makers in industry and regulatory agencies.

2.1.2 Worker interviews

To recruit machinists, introductory letters were sent to 130 machine shops in the province of British Columbia, Canada, and were followed up with at least one phone call. Twenty-nine workers from 15 different machine shops agreed to participate in an interview. The interviews were conducted by four different trained interviewers. Twenty of the interviews were done at a private location at the worksite, and nine were done over the phone. Each interview took approximately 20–35 minutes. The respondents were asked open-ended questions that were created using guidance from the expert's Mental Model (see 2.2). Workers were queried about their work history and habits, as well as their knowledge of MWF exposure and the personal protection strategies they undertook in the workplace. They were also asked questions about health effects associated with MWFs, including where they would look for information on health effects, and what steps they would take to mitigate these effects. These open-ended questions were often followed up by probes designed to elicit more information on a particular subject matter.

2.2 Mental model development

All interviews were audio-taped, transcribed and entered into NVivo. To construct the expert model, transcripts from the expert interviews were coded with a coding schema developed through an iterative process of fitting the codes to the data, based on grounded theory (Strauss & Corbin, 1998) and informed by previous mental models work (Cox et al., 2003; Niewohner, Cox, Gerrard, & Pidgeon, 2004a). The coding schema was also informed by a literature review of existing chemical information regarding MWFs, which aided in the development of the model categories included in the expert mental model (i.e. exposure processes, machine types, etc.). The initial coding schema was reviewed and validated by an expert who was part of the research team.

The expert mental model covered five main domains: shop health and safety factors, MWF exposure factors, regulatory and economic factors, health effects, and exposure modes. Within these five broad categories, related topics, such as information sources, reactive behaviors, and physical safety barriers, emerged as sub-categories.

The transcripts from the worker interviews are currently being analyzed using a similar grounded theory-informed method. A worker mental model is currently under construction.

2.3 Data analysis

For the health effects and information sources analysis, data for each of these constructs was abstracted from NVivo and reviewed by two members of the research team who had expertise in the areas of health effects and risk communication. Data from the workers were compared and contrasted with the expert model, and areas of both congruence and disconnect were identified. Results were entered into tables to present comparisons.

750
3 RESULTS

3.1 Demographics
Respondents from the 15 participating machine shops were all male, and represented a range of ages and levels of experience in the trade. The demographic details of the machinist respondents can be found in Table 1.

Table 1. Machinist demographics.

Characteristic           #    %
Age
  20–29                  3   10%
  30–39                 10   34%
  40–49                 12   41%
  50+                    2    7%
  Unknown                2    7%
# of years in trade
  5 to 10                7   24%
  11 to 15               7   24%
  16 to 20               6   21%
  21 plus                7   25%
  Unknown                2    7%
Shop size
  <10 people             5   17%
  11–50 people          13   45%
  50 plus               10   35%
  Unknown                1    3%
Types of machines
  Manual                 4   14%
  CNC                    5   17%
  Both                  15   52%
  Unknown                4   14%
  Other                  1    3%

3.2 Knowledge and description of health effects
Differences were found between the experts' and workers' descriptions of the health effects that can be caused by MWF exposure in the workplace (see Table 2). In particular, only 28% of the workers were able to describe symptoms that could occur in the lungs as a result of MWF exposure (such as cough, asthma, bronchitis, difficulty breathing). The majority of the experts described respiratory issues in detail, providing a description of symptoms and specific medical terminology of diseases associated with MWF exposure such as hypersensitivity pneumonitis (HP), occupational asthma and decreased lung function. Only two of the workers were able to describe asthma as a potential condition from MWF exposure, one mentioned decreased lung function, and none mentioned HP.
While unable to provide any specific terms or symptoms, a further 45% of workers were able to identify "the lungs" as a potential site of health problems. Of note, nine of the workers (31%) described having personal experience with either a lung effect from MWF exposure or "feeling" MWF mists in their lungs.
There was greater concurrence between experts' and workers' awareness of specific dermal conditions that can occur as a result of MWF exposure, including rash, dry hands, itchy hands, dermatitis and eczema. Sixty-two percent of workers could identify a specific dermal health effect such as eczema, although a further 31% were only able to identify "the skin" in general as a potential site for health effects. Forty percent of the workers said that they had experienced adverse effects on their hands from MWF exposure.
Four of the experts (44%) discussed the association between cancer and MWF exposure, although proportionally fewer (17%) of the workers described MWFs as cancer-causing agents. Of the workers who described cancer, there was a general tendency to mention smoking and its carcinogenic potential in the same discussion.
There were health effects that workers described that experts did not, particularly irritation that could occur in the eyes. Two workers also suggested that MWF could affect the blood.
Within the cohort of workers, 21% stated that MWFs were not harmful to health, even though in some cases these workers did note that MWF exposure could cause skin problems. Finally, there were two people in the worker group who stated that they were unaware of any potential health effects of MWF exposure.

Table 2. Description of health effects, experts and workers.

                                           Workers     Experts
                                           (n = 29)    (n = 10)
Health effects                             %           %
Described specific health effects
  that can occur in the lungs              28%         70%
Described specific health effects
  that can occur on the skin               62%         70%
Described a relationship between
  MWF exposure and cancer                  17%         40%
Central nervous system depression           3%         10%
Eye irritation                             17%          0%
Problems with blood                         7%          0%
Poisonous                                   3%          0%
Stated that MWFs do not
  cause health effects                     21%          0%

3.3 Sources of information
Experts and workers were asked slightly different questions regarding sources of health and safety
information in the workplace. Table 3 presents responses to the open-ended questions "How do you think that workers learn about new scientific advances in MWFs?" and "How about safety issues around MWFs?". While 40% of workers noted that occupational health and safety training was a source of information, the same number of experts did not think that workers received such information at all. Material Safety Data Sheets (MSDS) were ranked fairly low as information sources for workers amongst the scientific experts.

Table 3. Expert answers to: How do workers learn about new health and safety issues and advances? (n = 10).

Expert answer                                %
"They don't"                                40
Occupational health and safety training     40
Trade media (magazines, pamphlets)          30
Union                                       30
General news media                          10
MSDS                                        10
Gov't agencies                              10

Table 4 shows the responses to the following open-ended question that was posed to workers: "If you were going to learn about the health effects of metal working fluid exposure, or maybe some of the hazards that you are exposed to in your shop, where would you go for this sort of information?" Suppliers and manufacturers were the most referred-to sources, followed by MSDSs, which is in sharp contrast to the responses of the experts. Other workers and the internet were also major sources for workers not described by the experts.

Table 4. Worker answers to question regarding main sources of information used.

Worker answer (n = 29)            %
Suppliers and manufacturers      86
MSDSs                            69
Managers or other workers        66
The internet                     48
Health and safety committee      41
Government organizations         34
Container labels                 28

Workers were also asked "what person, agency or group would you trust the most for information on MWFs, either about the product or possible associated health effects?" The responses to this question, shown in Table 5, indicate that most workers trust WorkSafeBC, British Columbia's workers compensation board. Various levels and departments within government were the next most trusted, followed by MSDSs and manufacturers/suppliers.

Table 5. Workers' trusted sources.

Worker answer (n = 29)        %
WorkSafeBC                   31
Government                   17
MSDS                         14
Manufacturer/supplier        14
Other workers                 3
Union                         3
Researchers                   3
Don't know                    3

Table 6 presents results of the questions asked to experts on how they had attempted to communicate the results of their MWF research. Most had provided information to either workplace management or to a government health and safety agency. Half reported that they had talked to workers directly about MWF health effects, and only one reported talking to physicians.

Table 6. Expert communication pathways.

Expert answer (n = 10)                        %
Workplace management                         70
Government health and safety agencies        60
Workers                                      50
Industry/Suppliers                           40
Unions                                       40
Government (other than safety agency)        10
Physician                                    10

4 DISCUSSION

4.1 Health effects
Good workplace risk communication requires that workers receive information about the chemicals that they use and that workers understand the potential health effects that these chemicals can cause. As Schulte et al. (2003) state, "effective transfer, receipt and utilization of occupational health and safety information will only be realized to the extent to which recipients actually can understand the information transmitted" (p. 522). The results of this research suggest that while workers are aware that they are being exposed to MWFs during the course of their jobs, most have only a generalized understanding of how these compounds may affect the body. Such results are not unique to this research and have been
found in other occupational health research, such as that conducted by Sadhra et al. (Sadhra, Petts, McAlpine, Pattison & MacRae, 2002).
Of concern were the findings that three quarters of the workers queried were unable to provide any detail about the effects that MWF might have on their respiratory tract. In addition, they did not link symptoms such as cough, difficulties breathing, phlegm, asthma, and bronchitis to MWF exposure. Researchers such as Nowak & Angerer (2007) indicate that one of the problems of identifying occupational disease is the fact that the symptoms aren't necessarily correlated directly in time with an exposure and, as such, may appear after a worker has left the workplace. This mechanism, coupled with a lack of awareness on the part of the workers about the types of symptoms that MWFs can cause, makes the correct diagnosis of occupational respiratory disease very challenging. Gupta & Rosenman (2006) have suggested that hypersensitivity pneumonitis (HP) rates in the US are most likely under-reported due to factors such as inadequate disease recognition. Without information about workers' occupational and environmental exposures, doctors may misdiagnose conditions like HP as atypical pneumonia. The review by Santos et al. (2007) of diagnostic factors for occupational asthma found that lack of awareness of the association between symptoms and workplace exposures was one of the most significant factors contributing to diagnostic delays.
The workers' descriptions of dermal effects were markedly different from those of respiratory problems, with a majority of workers being able to describe distinct symptoms of MWF exposure such as rash, itchiness and dry hands. These results may be due to the fact that many of the workers had actually experienced these problems personally, or had known others who had these types of skin problems. Indeed, occupational skin diseases are the most commonly reported workplace-related conditions (Lushniak, 2003). Research by Sadhra et al. (2002) found that workers tended to talk more easily and automatically about more common health problems rather than those that were considered more serious. Indeed, many workers in this study noted that they had skin effects, yet these weren't necessarily considered serious, or even real "health" effects, even though they were eligible for compensation. For example, when asked about the effects of short-term exposure, one worker replied:

"As far as sick . . . I wouldn't call what I had being sick. It's just, you know, you have a rash on your hand and I did have time off because of that."

There were relatively few workers who described health effects that were erroneous, indicating that most were at least aware of the correct areas of the body that MWFs could affect. Of particular interest from the worker data was the issue of eye irritation. This effect was not noted by any of the experts, even though MSDSs for MWF list eye irritation as a potential health effect. In fact, a review of MSDSs for metal working fluids found that there was more information about eye irritation on some sheets than about potential respiratory effects. A review of the published literature revealed no research focused specifically on eye irritation and MWF exposure.

4.2 Information sources
The flow or transfer of information between the "expert" realm and the "workplace" realm appeared to be hampered by a number of barriers in this study. In particular, the responses from experts and workers on the topic of where to find information on MWFs showed a significant disconnect between the groups. None of the workers were ignorant of potential sources of information on the health effects of MWFs, although 40% of experts believed that workers did not learn about health and safety information. Workers also identified MSDSs as important information sources, while only 10% of experts believed that workers learned from MSDSs (this finding is in keeping with the earlier discussion of the effects of MWF on eyes). Suppliers and manufacturers were the information source most commonly mentioned by workers, with 86% of workers stating that they would go to suppliers and manufacturers for information. In contrast, none of the experts mentioned suppliers and manufacturers. These results are consistent with a study by Sadhra et al. (2002), which found considerable differences between the information sources mentioned by experts and by workers in the electroplating industry.
These results suggest that many MWF experts perceive knowledge translation processes as "broken" or non-existent, even though experts did report efforts to communicate their research results to audiences beyond the academic/scientific community. The majority of experts reported that they communicated research results to workplace management; however, most experts were disillusioned about their communication efforts and the potential of these processes to be translated to those at risk. Experts expressed a variety of opinions as to why they felt that their efforts to send risk messages to workers were ineffective. A number of experts directed frustration at workplace managers and regulatory bodies for seemingly not heeding scientific advice:

". . . the communication [with workplace management] was unsuccessful in that I didn't get any feedback [. . .] on what happened next."
"I think we presented [the regulatory body] with what we thought were positive findings but, I think, since then we've been a little disappointed that they haven't really capitalized on the research as much as they might have done."

"I have to say that we have been very disappointed with the way that the [regulatory agency] have failed to publish the reports that we did in a lot of this work."

This perceived lack of responsiveness from decision-makers who are actually in a position to effect changes in the workplace was problematic for experts. This frustrating situation may cause them to abandon their efforts to communicate with decision-makers. There is evidence that research dissemination and uptake is hampered both by researchers' imperfect understanding of decision-making contexts and by the organizational and/or political pressures facing decision-makers such as regulatory bodies. Lomas (1996) suggests that structures to support and improve ongoing knowledge translation and exchange between researchers and decision-makers are needed to speed the process of research dissemination and uptake. He suggests a cultural shift involving new organizational models for both decision-makers and researchers, as well as enhanced funding to support ongoing knowledge translation between both groups.
Other experts, when discussing their communication activities, expressed a level of discomfort with, or lack of knowledge of, appropriate lay communication techniques.

"I think [communication to workers] is something that scientists overall have to do a lot more of. They have to interest a lay audience in what they do, and it's an area, I think, we all need to do a lot more in."

"I'd like to know a way of getting [health and safety] information over to people so they actually took heed of advice before they actually got problems."

"Expert: I figure the way that I am doing it, I would admit, is not the best. I think a program to directly give your results to the labourers would be an ideal pathway to go. It is not something that our department routinely does – if ever – except for communities. Talking to the workers – that's not something I have ever done and I'm not familiar with that many people who have.

Interviewer: I see. Do you have an idea of how you would go about developing such a program?

Expert: No. Being an academic, unless there was funding, I wouldn't know."

These comments demonstrate experts' recognition of their own role in the risk communication process and their awareness that different communication techniques are necessary to reach worker audiences. This suggests a need for appropriate training, resources, and incentives to participate in non-traditional knowledge translation efforts. Some funding agencies, such as Canada's Canadian Institutes of Health Research, are now actively promoting such efforts by requiring academic proposals to have knowledge translation plans, and by providing funding for research into effective knowledge translation practices (Canadian Institutes of Health Research, 2003).

4.2.1 Trust in information sources
The role of trust in mediating how risk messages are perceived, attended to, and acted upon has been widely acknowledged in risk perception and communication research. Studies have found that distrust heightens lay concerns and responses to risk messages, and leads to questioning of the actions of risk regulators and authorities (Cvetkovich & Löfstedt, 1999). Lay perceptions of trust and credibility in risk messengers depend on three factors: perceptions of knowledge and expertise; perceptions of openness and honesty; and perceptions of concern and care (Kasperson, Golding & Tuler, 1992). These factors were evident in workers' discussion of trusted information sources.
Worker responses to the question about what person, agency or group they would trust the most for information on MWF reveal a further disconnect between the most used and the most trusted sources of information. Manufacturers and suppliers were mentioned most often as a source of information, yet the provincial workers' compensation board was reported to be the most trusted source. Indeed, many workers specifically noted that they did not trust manufacturers and suppliers as an information source even though they used it. The reason for this distrust is apparent in the comment of one worker that "they're just trying to sell you something." Since trust is seen as a prerequisite to effective risk communication (Kasperson et al., 1992), this demonstration of distrust in manufacturers is problematic. Although workers may receive information on MWFs and their potential impacts on health from these sources, they may disregard recommendations or ignore precautionary steps due to a perception of dishonesty in the risk messengers.
Workers identified the provincial workers compensation board as their most trusted source of information. Workers described the board as "unbiased," "independent," and "non-profit." Many workers
pointed out that it was in the board's best (financial) interest to prevent workers from becoming sick, and thus it was also in their best interest to provide accurate and balanced information. A number of workers also mentioned that the board had the resources to conduct research and to make evidence-informed decisions. Thus, the board fits the factors of expertise, honesty, and concern put forward by Peters et al. (Peters, Covello & McCallum, 1997). These results suggest that risk messages delivered by the workers compensation board may be more likely to be trusted, and thus acted upon. However, not one worker mentioned the board as a source of information that they would use to learn about hazards associated with MWF. Thus, efforts would need to be made to actively disseminate information from this source to workers.

5 STRENGTHS AND LIMITATIONS

A strength of this mental models approach rests on the ability to develop detailed representations of occupational hazards from different perspectives. However, these representations rest on interview data that cannot be assumed to be a direct reflection of participants' conceptual understandings. Participants may not mention implicit knowledge of hazards or workplace behaviour, or may edit their comments based on what they consider appropriate for a conversation with a researcher.
This study benefited from interviews with a comprehensive range of experts who have studied MWFs from various disciplinary angles, including occupational hygiene, respiratory medicine, and environmental toxicology. The research community in this area is small, resulting in a small sample size drawn from around the world.
In contrast, the sample of workers was relatively large, but was drawn only from the province of British Columbia, Canada. Thus, the workers' conceptual representations of MWF regulation and use may differ from the experts' due to geographic specificities in regulation, economics, and use. In addition, some of the workers were also participants in an ongoing study of the respiratory health of tradespeople. Thus, these respondents might have been more aware of the respiratory health effects of MWF (although this hypothesis is not supported by the results of this study).

6 CONCLUSION

The results of this research have implications not only for workers but also for the broader fields of occupational health and safety, occupational medicine and disease surveillance, and occupational knowledge translation. Through this mental models process we have determined that there is some overlap between how workers and experts understand the effects of MWFs, particularly in the area of dermal exposure, but that much more attention needs to be paid to providing workers with a more comprehensive understanding of the effects of MWF on the respiratory tract.
This study has also illuminated a number of important disconnects between how workers do receive information as opposed to how they would like to receive information, an important distinction that may be impeding the awareness and management of workplace risks. Additionally, this study uncovered a degree of frustration on the part of experts in their attempts to communicate their findings, and a relatively bleak view of the current workplace communication milieu for the worker. Neither of these conditions, as they stand, will enhance the communication and exchange of MWF exposure data in the occupational context.
At the outset of this study, manufacturers and suppliers were not expected to play such a key role in the dissemination of health and safety information on MWFs. These unexpected findings have led to a third phase of interviews with a selection of manufacturers and suppliers. The results of these interviews are expected to shed additional light on the role that this group plays in the communication of health and safety issues relating to MWF.

ACKNOWLEDGEMENTS

The authors would like to thank Dr. Susan Kennedy, Emily Carpenter, Reid Chambers and Natasha McCartney for their assistance with this paper. This project was funded in part by the Canadian Institutes of Health Research.

REFERENCES

Canadian Institutes of Health Research. (2003). Knowledge translation overview. Retrieved June 29, 2006, from http://www.cihr-irsc.gc.ca/e/7518.html
Cox, P., Niewöhner, J., Pidgeon, N., Gerrard, S., Fischhoff, B. & Riley, D. (2003). The use of mental models in chemical risk protection: Developing a generic workplace methodology. Risk Analysis, 23(2), 311–324.
Cvetkovich, G.T. & Löfstedt, R. (Eds.). (1999). Social trust and the management of risk. London: Earthscan.
Gupta, A. & Rosenman, K.D. (2006). Hypersensitivity pneumonitis due to metal working fluids: Sporadic or under-reported? American Journal of Industrial Medicine, 49(6), 423–433.
Kasperson, R.E., Golding, D. & Tuler, S. (1992). Social distrust as a factor in siting hazardous facilities and communicating risk. Journal of Social Issues, 48(4), 161–187.
Kennedy, S.M., Chan-Yeung, M., Teschke, K. & Karlen, B. (1999). Change in airway responsiveness among apprentices exposed to metalworking fluids. American Journal of Respiratory and Critical Care Medicine, 159(1), 87–93.
Lomas, J. (1996). Improving research dissemination and uptake in the health sector: Beyond the sound of one hand clapping. Centre for Health Economics and Policy Analysis, Department of Clinical Epidemiology and Biostatistics, McMaster University.
Lushniak, B.D. (2003). The importance of occupational skin diseases in the United States. International Archives of Occupational and Environmental Health, 76(5), 325–330.
Morgan, M.G. (2002). Risk communication: A mental models approach. Cambridge University Press.
Niewöhner, J., Cox, P., Gerrard, S. & Pidgeon, N. (2004). Evaluating the efficacy of a mental models approach for improving occupational chemical risk protection. Risk Analysis, 24(2), 349–361.
Nowak, D. & Angerer, P. (2007). Work-related chronic respiratory diseases – current diagnosis. [Arbeitsbedingte chronische Atemwegserkrankungen. "Bekommen Sie am Wochenende besser Luft"?] MMW Fortschritte der Medizin, 149(49–50), 37–40.
Peters, R.G., Covello, V.T. & McCallum, D.B. (1997). The determinants of trust and credibility in environmental risk communication: An empirical study. Risk Analysis, 17(1), 43–54.
Sadhra, S., Petts, J., McAlpine, S., Pattison, H. & MacRae, S. (2002). Workers' understanding of chemical risks: Electroplating case study. Occupational and Environmental Medicine, 59(10), 689–695.
Santos, M.S., Jung, H., Peyrovi, J., Lou, W., Liss, G.M. & Tarlo, S.M. (2007). Occupational asthma and work-exacerbated asthma: Factors associated with time to diagnostic steps. Chest, 131(6), 1768–1775.
Schulte, P.A., Okun, A., Stephenson, C.M., Colligan, M., Ahlers, H., Gjessing, C., et al. (2003). Information dissemination and use: Critical components in occupational safety and health. American Journal of Industrial Medicine, 44(5), 515–531.
Strauss, A.L. & Corbin, J.M. (1998). Basics of qualitative research: Techniques and procedures for developing grounded theory. Sage Publications.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Mathematical modelling of risk factors concerning work-related traffic accidents

C. Santamaría, G. Rubio, B. García & E. Navarro
Instituto de Matemática Multidisciplinar, Universidad Politécnica de Valencia, Valencia, Spain
ABSTRACT: Work-related traffic accidents pose an important economic and public health problem. In this context, it is important to improve our knowledge of the main factors that influence these accidents, in order to design better preventive activities. A database coming from an insurance company is analyzed, which contains information on personal characteristics of the individual as well as characteristics of the labor situation. The Cox model is used to construct a predictive model of the risk of work-related traffic accident. From the obtained model we study whether personal or labor characteristics act as predictive factors of work-related traffic accidents.

1 INTRODUCTION

Work-related traffic accidents pose an important economic and public health problem. For instance, it is reported that they are the largest cause of occupational fatality in the United Kingdom (Clarke et al., 2005). In Boufous & Williamson (2006) it is argued that traffic crashes are one of the leading causes of occupational fatalities in various parts of the industrialized world. Clarke et al. (2005) review research carried out over recent years suggesting that drivers who drive for business purposes are at an above-average risk of accident involvement relative to the general driving population.
Existing research on this matter has led to a number of suggestions for managing the risk to drivers' health and safety (DfT 2003), (Bomel 2004), (WRSTG 2001). In this context, it is important to improve our knowledge of the main factors that influence these accidents, in order to design better preventive activities.
Different studies give a great weight to the human factor in the origin of accidents (Lawton & Parker 1998), (Lewin 1982). In Cellier et al. (1995) it is argued that age and professional experience are related to the frequency and gravity of accidents, in the sense that the youngest and the oldest, as well as those with less professional experience, have the highest indices in both respects. A low educational level is also related to the most serious accidents, according to Híjar et al. (2000) (these and other references are reported in López-Araujo & Osca (2007)).
Some factors can be connected to the professional situation of the person, such as the type of company or work, type of contract, schedule, and others. The aim of this paper is to construct a predictive model based on these types of variables, using a database coming from an insurance company. Further research is needed to extend the types of data used, looking for other databases and perhaps designing suitable questionnaires.
The paper is organized as follows. In Section 2 we present the data. In Section 3 we state the predictive model by means of the proportional hazards Cox model. In Section 4 the model is represented by means of a useful graphic tool. Finally, we give a brief discussion in Section 5.

2 DATA

The sample comprises 22249 individuals who suffered a work accident between January 2006 and February 2008 and who joined their companies from 1-1-2003 onwards. Of these accidents, 2190 were traffic accidents.
The variables considered for this study were age, sex, professional situation (waged or autonomous, labeled Situapro2cat), and several company features, which we describe in the following.
Pertemp: relationship between the individual's company and the work centre connected with the accident. There are four possibilities: 1. the work centre belongs to the company; 2. the work centre does not belong to the company and the relationship is of contract/subcontract; 3. the work centre does not belong to the company and the relationship is of company of temporary work; 4. the work centre does not belong
to the company and the relationship is different from the previous ones.
The following six variables refer to the preventive organization of the company.
Topap: preventive organization personally assumed by the employer (no/yes).
Topni: there is no preventive organization (no/yes).
Topspa: external prevention service (no/yes).
Topspm: common prevention service (no/yes).
Topspp: own prevention service (no/yes).
Toptd: appointed worker or workers (no/yes).
Age values are between 16 and 77 years, with mean 33.39 and standard deviation 10.79. The preceding variables are summarized in Table 1.

Table 1. Categorical variables.

Variable          N        (%)
Sex
  Men            16954     76.2
  Women           5295     23.8
Situapro2cat
  Waged          22053     99.1
  Autonomous       196      0.9
Pertemp
  1              20905     94.0
  2                760      3.4
  3                364      1.6
  4                220      1.0
Topap
  No             21590     97.0
  Yes              659      3.0
Topni
  No             22034     99.0
  Yes              215      1.0
Topspa
  No              3802     17.1
  Yes            18447     82.9
Topspm
  No             20859     93.8
  Yes             1390      6.2
Topspp
  No             19803     89.0
  Yes             2446     11.0
Toptd
  No             21916     98.5
  Yes              333      1.5

3 A PREDICTIVE MODEL FOR WORK-RELATED TRAFFIC ACCIDENTS

In order to study predictive factors for work-related traffic accidents, a possible approach is time-to-event analysis. We record the follow-up of an individual from the time he/she joins the company to the time of a traffic accident, this being a censored time if the event has not yet occurred when the study finishes. Then we investigate the influence on the risk of accident of the individual's features, as well as of features related to the company.
This approach would let us detect the workers with a higher risk of traffic accident at a given time t, say one or two years after their entry into the company. This would help to adopt suitable preventive measures. Moreover, predictive factors closely related to the company could suggest the advisability of structural reforms.
A first step could be to use the Cox proportional hazards model (Cox 1972), and then analyze whether it fits the data accurately. The focus is on modelling the risk of traffic accident at time t, which is obtained from the hazard function h(t). It is well known that in the Cox regression model the hazard for individual i at time t is given by

hi(t) = exp(β1 x1i + β2 x2i + · · · + βp xpi) h0(t)    (1)

where h0(t) is the baseline hazard function, and xki are the explanatory variables of individual i.
Parameter estimates for the Cox regression model are presented in Table 2.

Table 2. Cox model.

Variable    β̂          exp(β̂)   se(β̂)      z
Age        −0.0537     0.948    0.00252    −21.347
Sex         0.4774     1.612    0.04511     10.584
Pertemp2   −1.1896     0.304    0.23726     −5.014
Pertemp3    1.3117     3.713    0.17971      7.299
Pertemp4   −0.0707     0.932    0.23686     −0.299
Topap       0.4566     1.579    0.10238      4.460

Variable    exp(β̂)   p-value    lower .95   upper .95
Age         0.948    0.0e+00    0.943       0.952
Sex         1.612    0.0e+00    1.475       1.761
Pertemp2    0.304    5.3e−07    0.191       0.485
Pertemp3    3.713    2.9e−13    2.610       5.280
Pertemp4    0.932    7.7e−01    0.586       1.482
Topap       1.579    8.2e−06    1.292       1.929

We will show in the next section a useful way of treating the information obtained from the model. Nevertheless, we can already extract some conclusions. For example, keeping the rest of the variables equal, workers with Pertemp = 2 have a risk 0.304 times the risk of workers with Pertemp = 1. In other words, compared with the situation where the work centre belongs to the company, the risk is lower if the work centre where the accident took place does not belong to the company and the relationship is of contract/subcontract.

Figure 1. Nomogram.

with respect to the situation of the work centre belonging to the company. On the other hand, workers with Pertemp = 3 have a risk 3.713 times the risk of workers with Pertemp = 1. That is to say, the risk increases considerably if the work centre does not belong to the company and the relationship is through a company of temporary work. With respect to the preventive organization, one conclusion is the following: the risk when the preventive organization is personally assumed by the employer is 1.579 times the risk when it is not.

Of course, these and other conclusions that may be extracted from the model are provisional, and further analysis is necessary, by improving the model, by handling the information with other models, and also by looking for other databases.

4 NOMOGRAM

The model may be represented by means of nomograms. A nomogram is an easily interpretable graphic tool, so it is an interesting way of taking advantage of the model. Figure 1 depicts a nomogram for predicting the probability of no occurrence of a work-related traffic accident at one year and two years after the individual joins the company. Typographical reasons force us to rotate the figure, but the natural way of looking at it is obvious.

To read the nomogram, draw a vertical line from each tick mark indicating predictor status to the top axis (Points). Calculate the total points and find the corresponding number on the Total Points axis. Draw a vertical line down to the lower axes to find the worker's probabilities of remaining traffic-accident free for one and two years.

For example, a worker with Topap = 1 contributes approximately 14 points. This is determined by comparing the location of the value 1 on the Topap axis to the points scale above and drawing a vertical line between the two axes. In a similar manner, the point values for the rest of the predictor variables are determined and summed to arrive at a total points value. For example, Pertemp = 1 would give 35 points, Sex = 2 gives 14 points, and Age = 35 gives about 71 points, which produces 134 points for a worker with these predictors. This value is marked on the Total Points axis, and drawing a vertical line down we obtain a probability of about 0.77 of being free of traffic accidents in the first year, and about 0.66 in the second year.

We assess the accuracy of the nomogram (and of the model) using the concordance index c (Harrell 2001), which is similar to an area under the receiver operating characteristic curve, and applicable to time-until-event data. On a 0.5 to 1.0 scale, c provides the probability that, in a randomly selected pair of individuals in which one of them suffers an accident before the other, the individual who had the accident first had the worse predicted outcome from the nomogram; c = 0.5 represents agreement by chance, and c = 1.0 represents perfect discrimination. A total of 200 bootstrap resamples were used to reduce overfit bias and for internal validation (Harrell et al., 1982). We obtained c = 0.68.
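Under the proportional-hazards form of equation (1), a worker's probability of remaining accident free satisfies Si(t) = S0(t)^exp(β'xi), which is exactly what the nomogram encodes graphically. The sketch below is illustrative only: Table 2 is not reproduced here, so the coefficients are backed out of the hazard ratios quoted in the text (0.304, 3.713, 1.579), the reference categories are assumed, and the baseline survival value is hypothetical.

```python
import math

# Cox coefficients backed out of the hazard ratios quoted in the text
# (reference categories assumed: Pertemp = 1, Topap = "no").
beta = {
    ("Pertemp", 2): math.log(0.304),
    ("Pertemp", 3): math.log(3.713),
    ("Topap", 1): math.log(1.579),
}

def relative_hazard(profile):
    """exp(beta_1 x_1i + ... + beta_p x_pi): hazard relative to the baseline."""
    return math.exp(sum(beta.get(term, 0.0) for term in profile.items()))

def accident_free_prob(profile, baseline_survival):
    """S_i(t) = S_0(t) ** exp(beta'x_i) under proportional hazards."""
    return baseline_survival ** relative_hazard(profile)

# A contract/subcontract worker (Pertemp = 2) has 0.304 times the baseline
# hazard, hence a higher accident-free probability than the baseline worker,
# while a temporary-work-company worker (Pertemp = 3) has a lower one.
# With a hypothetical baseline one-year survival S_0(1) = 0.80:
p_contract = accident_free_prob({"Pertemp": 2}, 0.80)   # above 0.80
p_temp = accident_free_prob({"Pertemp": 3}, 0.80)       # below 0.80
```

In practice the coefficients and the baseline survival are estimated together; the authors use the S-Plus Design library (Harrell 2001), whose successor in R is the rms package, where cph() and nomogram() cover both the fit and the graphic.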

These statistical analyses and the nomogram were performed using S-Plus software (PC Version 2000 Professional; Insightful Corp, Redmond, WA) with additional functions (the Design library; Harrell 2001) added.

5 DISCUSSION

The presented approach might be improved with richer databases. The concordance index could go up if we took into account additional features of individuals and companies. It would be very interesting, for instance, to include information about recurrent traffic accidents of an individual. Several extensions of the Cox model designed to deal with recurrent events have become popular (Andersen & Gill 1982, Prentice et al. 1981, Wei et al. 1989), among many other useful methods.

The model includes one of the variables referring to the preventive organization of the company (Topap). This suggests a connection between the organizational safety culture of companies and work-related traffic accidents. In fact, there are studies on this issue; see Bomel (2004) and references therein. It is pointed out in Bomel (2004) that key components of organizational safety culture are training, procedures, planning, incident feedback, management and communications. Among the conclusions reached, it is found that the most critical factors for the management of car driver occupational road risk (ORR) are fatigue, pressure, training, incident management and communications. It would be interesting to explore these and related factors within the time-to-event framework.

REFERENCES

Andersen, P.K. & Gill, R.D. 1982. Cox's regression model for counting processes: a large sample study. Annals of Statistics 10, 1100–1120.
Bomel 2004. Safety culture and work related road accidents. Road Safety Research Report 51, Department for Transport, London.
Boufous, S. & Williamson, A. 2006. Work-related traffic crashes: A record linkage study. Accident Analysis and Prevention 38(1), 14–21.
Cellier, J., Eyrolle, H. & Bertrand, A. 1995. Effects of age and level of work experience on occurrence of accidents. Perceptual and Motor Skills 80(3, Pt 1), 931–940.
Clarke, D., Ward, P., Bartle, C. & Truman, W. 2005. An in-depth study of work-related road traffic accidents. Road Safety Research Report 58, Department for Transport, London.
Cox, D.R. 1972. Regression models and life tables (with discussion). Journal of the Royal Statistical Society Series B 34, 187–220.
DfT 2003. Driving at Work: Managing work-related road safety. Department for Transport, HSE Books.
Harrell, F.E. 2001. Regression Modeling Strategies. With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer.
Harrell, F.E., Califf, R.M. & Pryor, D.B. 1982. Evaluating the yield of medical tests. JAMA 247, 2543–2546.
Híjar, M., Carrillo, C. & Flores, M. 2000. Risk factors in highway traffic accidents: a case control study. Accident Analysis & Prevention 32(5), 703–709.
Lawton, R. & Parker, D. 1998. Individual differences in accident liability: A review and integrative approach. Human Factors 40, 655–671.
Lewin, I. 1982. Driver training: a perceptual motor skill approach. Ergonomics 25, 917–925.
López-Araujo, B. & Osca, A. 2007. Factores explicativos de la accidentalidad en jóvenes: un análisis de la investigación. Revista de Estudios de Juventud 79.
Prentice, R.L., Williams, B.J. & Peterson, A.V. 1981. On the regression analysis of multivariate failure time data. Biometrika 68, 373–389.
Wei, L.J., Lin, D.Y. & Weissfeld, L. 1989. Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. Journal of the American Statistical Association 84, 1065–1073.
WRSTG 2001. Reducing at-work road traffic incidents. Work Related Road Safety Task Group, HSE Books.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

New performance indicators for the health and safety


domain: A benchmarking use perspective

H.V. Neto, P.M. Arezes & S.D. Sousa


DPS, School of Engineering of the University of Minho, Guimarães, Portugal

ABSTRACT: Given the need of organisations to express their performance in the Health and Safety domain by positive indicators, such as the gains within that domain and the proactive actions carried out to improve work conditions, a research effort has been carried out to respond to this particular need, or at least to make a valid contribution to that purpose. As a result of this effort, a performance scorecard on occupational Health and Safety was developed and is briefly presented in this paper.

1 INTRODUCTION

The contemporary systemic focus adopted in the domain of Health and Safety (H&S) at work is the consequence of an entire evolutionary process, both in terms of how these issues have developed within organisations and at the level of operational management models. More than a simple added-value contribution in terms of risk management, a H&S management system is a philosophical and operational challenge for the organization that attempts to implement it, because it requires a structured approach to the identification, evaluation and control of the risk factors of its activity, and an ongoing effort to continuously improve its efficiency and effectiveness.

A management system, despite being structured according to an international or national standard, will only be effective if its implementation results in a clear improvement in a significant number of indicators, typically designated as ''performance indicators''. While most of the indicators used to evaluate organisations' management are ''positive'' indicators, i.e., they express gains rather than losses (company profits, number of customers, market share, etc.), the indicators used in the H&S domain are, traditionally, ''negative'', i.e., related to issues that companies intend to minimize (accident rates, economic and material losses, costs, etc.). Despite the significant importance of these issues, it is clear that they do not fit the proactive profile required by a management system, because they reflect, and are supported by, predominantly reactive behaviour.

The current management models assume systematic monitoring (Pinto, 2005), privileging proactive indicators that promote the need for continuous and updated information, and a preventive approach regarding the risk factors in occupational environments. Therefore, it is urgent that organisations' H&S performance is also characterized by ''positive'' indicators, reflecting gains associated with that particular domain, as well as by a proactive approach in seeking the most adequate working conditions. Accordingly, the main aim of this study was to identify a possible H&S performance ''key-indicators'' scheme, able to play an important role in defining a specific organization's position, in terms of H&S performance, when compared with other companies (Benchmarking). Several scenarios were also considered (external competitors, other delegations or units of the same company, or other subsidiary organizations).

2 CRITICAL FACTORS OF SUCCESS

2.1 Structured Matrix of Performance Indicators

Benchmarking is one of the best-known tools used to support the process of continuous improvement. It meets the need for a continuous and systematic analysis and measurement process (Camp, 1993), using a specific baseline, or gold standard, as reference. This allows, at least theoretically, the comparison of, for example, practices, procedures and performance standards between different organizations and within the same organization. From a methodological point of view, it can be stated that Benchmarking is simultaneously an approach, considering that it seeks organizational improvement and the identification of best practices at the product and organizational-process level, and a process of

continuous and systematic search, since it implies the development of an original idea, strongly based on mechanisms of ownership, adaptation (or adjustment), optimization and development, but also a technique for data collection and knowledge generation.

Benchmarking is, in fact, a very successful methodology in organizational contexts, but it requires a definition of critical factors of success that emphasise the organisation's strategy and mission for improving some key indicators. Based on this assumption, and on the relevance of having H&S performance indicators that comply with contemporary organizational needs, an inventory of the H&S critical factors of success found in the literature (technical, scientific and normative documents) was developed; after this, the performance indicators listed as being used to monitor and/or measure those factors were collected. Finally, the selected elements were systematized in a structured matrix of performance indicators. This matrix will allow establishing a proposal for a performance scorecard.

Scorecarding is one of the contemporary archetypes of management clearly framed by the principles of continuous improvement. Its aim is the development of systematic organizational processes of monitoring and measuring that provide continuous information and allow the implementation of a continuous improvement process (Armitage & Scholey, 2004). This improvement is reached because it favours reporting based on key indicators, which will be (or at least might be) representative of the performance on critical factors of success for one or more of the organisational domains.

2.2 Performance Scorecard for H&S Management Systems

The need to establish a matrix of performance results that considers both proactive and reactive practices, and that fits within the Portuguese and European standards, led to a proposal for a structured matrix of performance results to be used in the scope of H&S management systems. The international vision on which this proposal was supposed to be based was considered not only because of the references arising from the technical and scientific framework, but also due to the requirements imposed at the European and international level regarding the H&S regulatory and legal framework. This international vision is also stated in the designation selected for the proposal, SafetyCard: Performance Scorecard for Occupational Safety and Health Management Systems.

However, this was not the only challenge we tried to address; many others were considered in the attempt to structure a performance scorecard of this type. The complexity and multidisciplinarity of the H&S domain meant that the developed model had to reflect the major carriers operating in this area, to promote a consistent diagnosis of the target fields of the analysis, in order to provide the identification of real performance improvements, and, simultaneously, to be applicable to the largest number of organizations. Some of the aspects that confirm the need for operational ability come from the fact that the structured matrix contains:

– critical success factors and key elements that can, naturally, evolve into a more minimalist matrix with a broader scope;
– extremely strategic elements, both for the H&S domain and for the success of the organization as a whole, leading to the possibility of being integrated in an overall performance matrix, such as the Balanced Scorecard;
– great operational flexibility, leading to a model that is adaptable to different organizations and/or temporalities, i.e., that can be used entirely or segmented according to users' needs;
– the main principles of scorecarding, as well as the main requirements of Performance Benchmarking.

Due to restraints associated with the dimension of this paper, it is not possible to explain here the entire conceptual model and the operational considerations foreseen by the SafetyCard. Therefore, we will try to identify the main elements of the model and synthesize them in a few points, such as:

– Organizational design – considers aspects related to the H&S services organisation model, the coverage assured by technical elements and the systemic approach of the activities carried out. Accordingly, the indicators used relate to the type of technical coverage and the systemic focus regarding H&S operation and management;
– Organizational culture – considers aspects related to the beliefs, rules and standards of behaviour set by the organization on H&S matters. Therefore, it considers indicators that refer to organisational and individual values, rules and codes of conduct, and to the basic assumptions, description and evaluation of the H&S services performance;
– Occupational health services – considers aspects related to surveillance and health promotion; thus, it contemplates performance indicators structured on the basis of the two considered segments: health surveillance and promotion;
– Operational service of occupational hygiene & safety – considers indicators related to the statistical control of work accidents, the development and design of training programmes, and the planning and implementation of prevention and protection measures. Therefore, the segments of analysis refer to the organization and operability of the Hygiene & Safety services, accident control and analysis, training, and prevention and protection actions;

– Internal emergency plan – the definition of a functional organisation, based on structural planning, accountability, and the selection, preparation and testing of means and devices able to respond to emergency situations, assumes here the role of the main vector of performance. Therefore, the performance indicators were organized according to three analytical segments: (i) Planning, (ii) Attributes and Responsibilities, and (iii) Devices;
– Monitoring and/or measurement services – considers analytical segments such as (i) the control of workplace environmental conditions, (ii) the mechanisms of monitoring and/or measurement, and (iii) the implementation of corrective actions. Therefore, the selection of performance indicators intends to evaluate the organisation's practices in the field of environmental control, monitoring and/or measurement;
– Work safety equipments – considers issues related to organizational practices concerning the integration of H&S requirements into the process of selection, acquisition and maintenance of work safety equipment.

2.3 Performance Weighting and Classification

One of the objectives of the analytic model was to encourage a global, but also partial, screening of the H&S performance level; hence it assumed both a quantitative and a qualitative character. It was intended to establish a normalized system of performance indicators, so that all performance measures could be compared. This normalization is grounded on a numerical basis, where each indicator always assumes a result between zero and one; in some cases a traditional binary score (zero or one) is assumed.

Since the entire structure of weights is built on this basis, from the application of the first coefficients of importance at the first level of the structure, all the scores assume a continuous distribution between the limits associated with each analytic element. The maximum amount possible at each of the stages is equal to the value of the multiplier (M), so that each stage, and even the final classification, always varies between zero and one. The sum of the scores obtained in each domain represents a total value of performance, both in quantitative and in qualitative terms, since the final numerical value can be analysed through a traditional discrete scale of classification (Very good, Good, ...).

3 SAFETYCARD: RESULTS OF A CASE STUDY

At this point, the aim is to present the most important results obtained from a pilot implementation of the SafetyCard (Neto, 2007). The organization studied operates in the Construction branch, specifically in the Construction of Buildings, and had 134 employees, 6 of whom are H&S practitioners (4 technicians of level V and 2 of level III).

As some indicators had to be standardised to integrate the matrix of performance results, and considering that we did not have any reference or Benchmarking partner, a previous period (the past year) was used to allow the comparison. The SafetyCard was not applied entirely, since 6 performance indicators were not applicable to the considered organization and 2 did not have the data needed for computation. However, given the flexibility of the model, the matrix of performance results was modelled to that specific company without losing quality or empirical relevance. To illustrate this exercise, an example of the scorecard for the considered company is presented in Table 1. As this table shows, the overall performance result was 0.740, which, according to the scale considered, reflects a good performance. Thus, we conclude that this organization had a good performance in matters of H&S at Work.

The previous result is the overall assessment of performance, but we can use this analysis to detail the evaluation. Accordingly, and based on the analytical domains previously stated, we can mention the following aspects:

– Organizational design – consistent organizational structure, partly a result of excellent technical coverage. The weakest point is the reduced systemic approach, but that is due to the fact that the H&S management system is still being prepared and implemented. This system is being prepared under the guidelines of OHSAS 18001:1999/NP 4397:2001, which will certainly bring, in the short term, benefits to the organization, both in terms of performance and at a practical level.
– Organizational culture – characteristic traits of institutional values, norms and standards of behaviour, and basic assumptions of description and evaluation in H&S matters have been identified, which are transposed into a strong organizational culture focused on the protection and maintenance of acceptable working conditions.
– Occupational health services – great operational and structuring sense. The organization assures all the procedures to promote health in the workplace and implements a few mechanisms for monitoring workers' health.
– Operational service of occupational health & safety – inadequate monitoring and documentation at the risk-prevention level. The organization had to pay heavy fines, both in monetary terms and in terms of the severity and absenteeism induced by accidents.

Table 1. Example of the Performance Benchmarking Scorecard (summary) for the case study.

                                                                  No. of      Baseline   Weighted segment    Weighted domain
Analytic domain            Analytic segment                       indicators  weight     M (a)    Partial    M (b)    Partial

Organizational design      Technical covering                      4          1.00       0.70     0.70       0.05     0.04
                           Systemic focus                          2          0.25       0.30     0.08
Organizational culture     Values                                  3          1.00       0.50     0.50       0.20     0.20
                           Behaviour standards                     3          1.00       0.20     0.20
                           Basic assumptions                       4          0.92       0.30     0.28
Occupational health        Surveillance                            6          0.85       0.75     0.64       0.10     0.09
services                   Promotion                               2          1.00       0.25     0.25
Operational service of     Organization                            3          0.40       0.05     0.02       0.25     0.16
occupational safety        Accidents incidence                    10          0.67       0.15     0.10
& hygiene                  Formation                               6          0.76       0.25     0.19
                           Prevention                              5          0.49       0.40     0.19
                           Protection                              3          1.00       0.15     0.15
Internal emergency         Planning                                5          0.90       0.40     0.36       0.15     0.12
plan                       Attributes and responsibilities         7          1.00       0.25     0.25
                           Mechanisms                             10          0.60       0.35     0.21
Monitoring and/or          Control of environmental
measurement services       work conditions                         4          0.13       0.55     0.07       0.20     0.09
                           Mechanisms of monitoring
                           and/or measurement                      4          0.69       0.30     0.21
                           Corrective action                       2          1.00       0.15     0.15
Safety equipments          Maintenance                             4          1.00       0.50     0.50       0.05     0.05
                           Safety instructions                     3          0.84       0.50     0.42

Total                                                             90                                                  0.740

(a) The letter M represents the multiplier associated with the baseline weight, i.e. the maximum score that can be obtained in a specific segment. (b) The letter M represents the multiplier associated with the segment, i.e. the maximum score that can be obtained in a specific domain.
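The weighting scheme described in section 2.3 can be reproduced from Table 1. The sketch below is a reconstruction, not the authors' code: it reads the table's "Baseline weight" column as each segment's normalized score, multiplies it by the segment multiplier M(a), sums per domain, and weights the result by the domain multiplier M(b).

```python
# Reconstruction of the Table 1 aggregation: each segment's baseline score
# (in [0, 1]) is multiplied by its segment weight M(a); the weighted segment
# scores are summed per domain and multiplied by the domain weight M(b);
# the domain contributions sum to the overall performance result.
def scorecard_total(domains):
    total = 0.0
    for m_b, segments in domains:
        domain_score = sum(m_a * score for m_a, score in segments)
        total += m_b * domain_score
    return total

# Figures taken from Table 1, as (M(b), [(M(a), baseline score), ...]):
case_study = [
    (0.05, [(0.70, 1.00), (0.30, 0.25)]),                # Organizational design
    (0.20, [(0.50, 1.00), (0.20, 1.00), (0.30, 0.92)]),  # Organizational culture
    (0.10, [(0.75, 0.85), (0.25, 1.00)]),                # Occupational health services
    (0.25, [(0.05, 0.40), (0.15, 0.67), (0.25, 0.76),
            (0.40, 0.49), (0.15, 1.00)]),                # Operational H&S service
    (0.15, [(0.40, 0.90), (0.25, 1.00), (0.35, 0.60)]),  # Internal emergency plan
    (0.20, [(0.55, 0.13), (0.30, 0.69), (0.15, 1.00)]),  # Monitoring/measurement
    (0.05, [(0.50, 1.00), (0.50, 0.84)]),                # Safety equipments
]

total = scorecard_total(case_study)
```

Run on these figures, the function yields about 0.74, matching the reported overall result of 0.740 up to the rounding of the intermediate partials shown in the table.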

– Internal emergency plan – excellent basis of structuring and planning, with the organization ensuring the main procedural mechanisms of response to emergencies (plans, responsibilities and devices). However, there is an operational weakness, because the organization has no evidence of its operational ability (for example, no fire drills were carried out).
– Monitoring and/or measurement service – low level of monitoring of the environmental conditions arising from the adopted processes. The organization acknowledged the existence of some ergonomic risk factors and occupational exposure to harmful agents, such as noise, but has not developed specific procedures for evaluating the exposure levels of the workers. This becomes a critical segment, since it penalizes the organization.
– Work safety equipments – great strategic importance is given to the acquisition and maintenance, and to the prescription of elements relating to the work safety equipment. This was the critical success area where the organization obtained the best score.

4 CONCLUSIONS

The best way to conclude a ''journey'' is to go back to the starting point. From the literature review, it seems consensual that organizations need a structured matrix of positive indicators that goes beyond the assessment of isolated organizational attributes, which do not favour the idea of a whole and do not fully reflect overall H&S performance. So it is important to have a scorecard able to reflect the structural H&S performance of an organization, and to allow internal and external comparisons (Benchmarking in the various possible scenarios).

The performance scorecard that has been developed and implemented shows applicability and technical-scientific relevance, allowing the diagnosis of a structural H&S Management System. This diagnosis can be carried out both in terms of work conditions and organizational values, and in terms of H&S performance monitoring and/or measurement.

Finally, it is also necessary to highlight that there is still some work to be done, but it is expected that the presented tool can be improved and refined into a reliable and useful tool for performance assessment.

REFERENCES

Armitage, H. & Scholey, C. 2004. Hands-on scorecarding: How strategy mapping has helped one organization see better its successes and future challenges. CMA Management 78(6): 34–38.
Camp, R. 1993. Benchmarking: O Caminho da Qualidade Total—Identificando, Analisando e Adaptando as Melhores Práticas da Administração Que Levam à Maximização da Performance Empresarial. São Paulo: Livraria Pioneira Editora (in Portuguese).
Neto, H.V. 2007. Novos Indicadores de Desempenho em Matéria de Higiene e Segurança no Trabalho: perspectiva de utilização em Benchmarking. MSc dissertation in Human Engineering, School of Engineering. Guimarães: University of Minho (in Portuguese).
Pinto, A. 2005. Sistemas de Gestão da Segurança e Saúde no Trabalho—Guia para a sua implementação, 1a Edição. Lisbon: Edições Sílabo (in Portuguese).


Occupational risk management for fall from height

O.N. Aneziris & M. Konstandinidou


NCSR ‘‘DEMOKRITOS’’ Greece

I.A. Papazoglou
TU Delft, Safety Science Group, Delft, The Netherlands

M. Mud
RPS Advies BV, Delft, The Netherlands

M. Damen
RIGO, Amsterdam, The Netherlands

J. Kuiper
Consumer Safety Institute, Amsterdam, The Netherlands

H. Baksteen
Rondas Safety Consultancy, The Netherlands

L.J. Bellamy
WhiteQueen, The Netherlands

J.G. Post
NIFV NIBRA, Arnhem, The Netherlands

J. Oh
Ministry Social Affairs & Employment, The Hague, The Netherlands

ABSTRACT: A general logic model for fall from height, developed under the Workgroup Occupational Risk Model (WORM) project financed by the Dutch government, is presented. Risk has been quantified for the specific cases of falls from placement ladders, fixed ladders, step ladders, fixed scaffolds, mobile scaffolds, (dis)assembling scaffolds, roofs, floor openings, fixed platforms, holes, moveable platforms and non-moving vehicles. A sensitivity analysis assessing the relative importance of measures affecting risk is presented, and risk-increase and risk-decrease measures are assessed. The most important measure for decreasing fatality risk owing to falls from fixed ladders is the way of climbing; for step ladders, their location; for roofs, floors and platforms, not working on them while they are being demolished; for mobile scaffolds, the existence of safety lines; for fixed scaffolds, protection against hanging objects; for work near holes and while (de)installing scaffolds, the use of fall arrest; and for moveable platforms and non-moving vehicles, the existence of edge protection.

1 INTRODUCTION

Occupational fatalities and injuries caused by falls from height pose a serious public problem and are the leading type of occupational accident in the Netherlands. Falls from height constitute 27% of the 12000 accidents reported in the Netherlands between 1998 and 2004, while the number of deaths is on average 25 per year.

Several studies have examined the causes of injuries and deaths from falls, such as the National Institute of Occupational Safety and Health (NIOSH) Fatal Accident Circumstances and Epidemiology (FACE) reports (NIOSH, 2000), the OSHA

report on falls from scaffolds (OSHA 1979), the OSHA report on falls from elevated platforms (OSHA 1991), the study of McCann (2003) on deaths in construction related to personnel lifts, and the study of HSE (2003) on falls from height in various industrial sectors.

The Dutch government has chosen the quantitative risk approach in order to determine the most important paths of occupational accidents and to optimize risk reduction efforts. It has embarked on the Workgroup Occupational Risk Model (WORM) project, as presented by Ale et al. (2008). A major part of the WORM project is the quantification of occupational risk, according to the bowtie methodology developed within the project and presented by Papazoglou & Ale (2007).

Of the 9000 analyzed GISAI (2005) occupational accidents, which occurred in the Netherlands between 1998 and 2004, 805 have been classified as falls from placement ladders, 70 from fixed ladders, 187 from step ladders, 245 from mobile scaffolds, 229 from fixed scaffolds, 78 as falls while installing or de-installing scaffolds, 430 as falls from roofs, 415 from floors, 235 from fixed platforms, 74 as falls into holes, 205 from moveable platforms and 206 from non-moving vehicles. Logical models for fall from height have been presented by Aneziris et al. (2008).

This paper presents the overall quantified risk, the specific causes and their prioritization for the following occupational hazards: a) fall from placement ladders, b) fall from fixed ladders, c) fall from step ladders, d) fall from fixed scaffolds, e) fall from mobile scaffolds, f) fall while (dis)assembling scaffolds, g) fall from roofs, h) fall from floor openings, i) fall from fixed platforms, j) fall into holes, k) fall from moveable platforms, l) fall from non-moving vehicles.

The paper is organized as follows. After the introduction of section 1, section 2 presents a general logic model for fall from height and risk results for all fall from height cases. Section 3 presents the ranking of the various working conditions and/or safety measures in terms of their contribution to the risk. Finally, section 4 offers a summary and the conclusions.

2 LOGICAL MODEL FOR FALL FROM HEIGHT

In this section a general model for fall from height is presented, which may be applied to all fall from height cases; more detailed models for falls from ladders, scaffolds, roofs, holes, moveable platforms and non-moving vehicles are described by Aneziris et al. (2008). Figure 1 presents the fall from height bowtie.

Figure 1. General bowtie for fall from height.

The centre event represents a fall, or not, from the structure (ladder, scaffold, roof, hole, moving platform, or non-moving vehicle), and it is decomposed into the initiating event and the safety measures aiming at preventing a fall. The initiating event represents working on the high structure, while the primary safety measures preventing a fall are strength and stability of the structure, user stability and edge protection.

Strength of structure: The structure should be able to support the load imposed by the user and the associated loads (persons or equipment). It is applicable to all fall from height cases, with the exception of fall into a hole. It is defined as a two-state event with the following cases: success or loss of strength.

Stability of structure: The structure itself, through its design and material, provides the necessary stability so that it does not tip over. It is applicable to all fall from height cases, with the exception of falls from roofs, floors and platforms, and falls into a hole. It is defined as a two-state event with the following cases: success or loss of structure stability.

User stability: Given a strong and stable structure, the user should be able to remain on the structure without losing his stability. This measure is applicable to all fall from height cases. It is defined as a two-state event with the following cases: success or loss of user stability.

Table 1. Support safety barriers affecting primary safety barriers, for all fall from height
accidents.

Type of fall: Support barriers

STRENGTH
LADDER: • Type or condition of ladder
SCAFFOLD: • Structural design and construction
ROOF/FLOOR/FIXED PLATFORM: • Roof surface condition
MOVEABLE PLATFORM: • Condition of lift/support
NON MOVING VEHICLE: • Loading

STRUCTURE STABILITY
LADDER: • Placement and protection • Type or condition of ladder
SCAFFOLD: • Anchoring • Foundation • Scaffold protection
MOVEABLE PLATFORM: • Foundation/anchoring • Position of machinery/weight
NON MOVING VEHICLE: • Foundation • Load handling

USER STABILITY
LADDER: • User ability
HOLE IN GROUND: • User ability
ROOF/FLOOR/FIXED PLATFORM: • User ability
SCAFFOLD: • User ability • Floor condition
MOVEABLE PLATFORM: • User ability • External conditions • Movement control • Position of machinery or weight on platform
NON MOVING VEHICLE: • Ability • Load handling • Working surface
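Read as data, Table 1 is a two-level lookup from primary barrier to structure-specific support barriers. A sketch encoding it (the keys and strings are our lowercase shorthand for the table's labels):

```python
# Table 1 re-encoded as a lookup: primary barrier -> structure -> support
# barriers. Key spellings are informal shorthand, not official labels.
support_barriers = {
    "strength": {
        "ladder": ["type or condition of ladder"],
        "scaffold": ["structural design and construction"],
        "roof/floor/fixed platform": ["roof surface condition"],
        "moveable platform": ["condition of lift/support"],
        "non moving vehicle": ["loading"],
    },
    "structure stability": {
        "ladder": ["placement and protection", "type or condition of ladder"],
        "scaffold": ["anchoring", "foundation", "scaffold protection"],
        "moveable platform": ["foundation/anchoring",
                              "position of machinery/weight"],
        "non moving vehicle": ["foundation", "load handling"],
    },
    "user stability": {
        "ladder": ["user ability"],
        "hole in ground": ["user ability"],
        "roof/floor/fixed platform": ["user ability"],
        "scaffold": ["user ability", "floor condition"],
        "moveable platform": ["user ability", "external conditions",
                              "movement control",
                              "position of machinery or weight on platform"],
        "non moving vehicle": ["ability", "load handling", "working surface"],
    },
}

# Example query: which support barriers affect scaffold structure stability?
print(support_barriers["structure stability"]["scaffold"])
# ['anchoring', 'foundation', 'scaffold protection']
```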

Edge Protection: This measure includes the provision of guardrails that enhance the stability of the user. It is applicable to all fall from height cases with the exception of ladders and non moving vehicles. It can be in one of the following states: present, failed or absent.

2.1 Support safety barriers

A Support Safety Barrier (SSB) contributes to the adequate function of the Primary Safety Barriers and influences the probability with which the primary safety barrier states occur. There are three types of support barriers: those affecting structure strength, structure stability and user stability.
Measures that affect strength are structure specific and are presented in Table 1. Condition of the structure surface affects the strength of ladder, roof and moveable platform. Structural design and construction affects scaffold strength, and loading affects non moving vehicles.
Measures that affect the stability of the structure are structure specific and are presented in Table 1. Foundation is a measure affecting all structures with the exception of ladders. Ladder stability is affected by its condition, placement and protection.
User ability is a measure affecting user stability in all fall accidents. Other measures, such as working surface, depend on the type of fall accident and are presented in Table 1.

2.2 Probability influencing entities (PIEs)

In several instances the safety barriers of the model are simple enough to link directly to easily understood working conditions and measures, as in the barrier ‘‘Anchoring’’, which affects the stability of a fixed scaffold. Assessing the frequency with which anchoring exists is straightforward.
In other instances, however, this is not possible. For example, the support barrier ‘‘Floor surface condition’’ may be analysed into more detailed and more

Table 2. PIEs characteristics and frequencies. For each barrier the barrier success probability is given in parentheses after the barrier name; each PIE is listed with its frequency.

ROOF PROTECTION (0.265)
• Edge protection absent (0.36)
• No edge protection next to non-supporting parts (0.17)

ROOF SURFACE CONDITION (0.3)
• Roof/working platform/floor (parts) being built or torn down (0.43)
• Roof/working platform/floor (parts) not intended to support exerted weight (0.17)

ABILITY (0.205)
• Capacity to keep balance on roof (0.21)
• Walking backwards (0.19)
• Unfit: unwell (0.09)
• Hands not free (0.35)
• Overstretching (0.19)
• Substandard movement (slip, trip) (0.2)
• Outside edge protection (0.17)
• Weather (0.13)
• Slope (0.32)

PFA (0.118)
• Collective fall arrest (CFA) (0.18)
• Condition of CFA (0.05)
• Personal fall arrest (0.16)
• Anchorpoints FA (0.16)
• State of maintenance (0.04)
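As described in section 2.2, with equal weights the probability of a barrier state reduces to the arithmetic mean of its PIE frequencies, and Table 2's barrier probabilities can be reproduced that way. A minimal sketch (the function name is ours, and Table 2's decimal commas are written here as decimal points):

```python
# Probability of a support-barrier state as a weighted sum of PIE
# frequencies; equal weights are the default, as stated in the text.
def barrier_probability(pie_frequencies, weights=None):
    if weights is None:  # equal weights
        weights = [1.0 / len(pie_frequencies)] * len(pie_frequencies)
    return sum(w * f for w, f in zip(weights, pie_frequencies))

# PIE frequencies for fall from roof (Table 2)
ability_pies = [0.21, 0.19, 0.09, 0.35, 0.19, 0.20, 0.17, 0.13, 0.32]
pfa_pies = [0.18, 0.05, 0.16, 0.16, 0.04]

print(round(barrier_probability(ability_pies), 3))  # 0.206 (Table 2: 0.205)
print(round(barrier_probability(pfa_pies), 3))      # 0.118, as in Table 2
```

The PFA and roof protection rows of Table 2 reproduce exactly under equal weights; the ABILITY row differs only in the last rounded digit.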

concrete measures that affect its quality. Such specific measures are: (i) floor which is being demolished; (ii) floor not able to support weight. Similarly the barrier ‘‘Fall Arrest’’ may be analysed into the following measures: i) use of Collective Fall arrest; ii) maintenance of Collective Fall arrest; iii) use of Personal Fall arrest; iv) maintenance of Personal Fall arrest. Such factors have the name of Probability Influencing Entities (PIEs). Each influencing factor (PIE) is assumed to have two possible levels, ‘‘Adequate’’ and ‘‘Inadequate’’. The quality of an influencing factor is then set equal to the frequency with which this factor is at the adequate level in the working places. Then the quality of the barrier is given by a weighted sum of the influencing factor qualities. The weights reflect the relative importance of each factor and are assessed by the analyst on the basis of expert judgement. Currently equal weights have been used. This way the probability of a support barrier to be in one of its possible states is given by the weighted sum of the frequencies of the influencing factors (RIVM 2008).
PIEs and their frequencies, as well as the failure probability for the barriers they influence, for fall from roofs are presented in Table 2. All other PIEs for fall from height models are presented in RIVM (2008). Frequencies of PIEs have been assessed through surveys of the working conditions in the Dutch working population and reflect the Dutch national average (RIVM 2008).

2.3 Right hand side (RHS)

The right hand side of the fall bowties in combination with the outcome of the centre event determines the consequences of the falls. Four levels of consequences are used: C1: No consequence; C2: Recoverable
injury; C3: Permanent injury; C4: Death. Events of the RHS are the height of the fall, the type of the surface and the medical attention. More details on these events are presented by Aneziris et al (2008).
Individual risk of death, permanent injury and recoverable injury per hour have been assessed according to the methodology presented by Aneziris (2008) and Papazoglou et al (2008), and are presented in Figure 2. Fall from roof has the highest fatality risk, 2.2 × 10−7/hr, followed by fall from moveable platform, 8.06 × 10−8/hr.

Figure 2. Individual risk per hour for fall from height hazards (recoverable injury/hr, permanent injury/hr and fatality/hr for each of the twelve hazards).

3 IMPORTANCE ANALYSIS

To assess the relative importance of each factor influencing the risk from fall, two importance measures have been calculated.

1. Risk decrease: This measure gives the relative decrease of risk, with respect to the present state, if the barrier (or PIE) achieves its perfect state with probability equal to unity.
2. Risk increase: This measure gives the relative increase of risk, with respect to the present state, if the barrier (or PIE) achieves its failed state with probability equal to unity.

Risk decrease prioritizes the various elements of the model for the purposes of possible improvements. It is more risk-effective to try to improve first a barrier with a higher risk decrease effect than another with a lower one.
Risk increase provides a measure of the importance of each element in the model to be maintained at its present level of quality. It is more important to concentrate on the maintenance of a barrier with a high risk increase importance than on one with a lesser one. The effect each PIE has on the overall risk is presented in Figures 3–14.

Figure 3. Risk fatality increase and risk decrease for fall from placement ladder bowtie, for various working conditions (PIEs).

Figure 4. Risk fatality increase and decrease for fall from fixed ladder bowtie, for various working conditions (PIEs).

3.1 Fall from height-ladders

Placement ladder: The most important measures in order to decrease fatality risk are the location of signs to prevent impact and the use of the right type of ladder which, if used 100% of the time a placement ladder is used, will decrease risk by 21% and 20% respectively. The most important measure in order to maintain risk is to keep the ladder in good condition. If this is not done risk increases by 54%.

Figure 5. Risk fatality increase and risk decrease for fall from step ladder bowtie, for various working conditions (PIEs).

Figure 6. Risk fatality increase and risk decrease for fall from mobile scaffold, for various working conditions (PIEs).

Figure 7. Risk fatality increase and risk decrease for fall from fixed scaffold bowtie, for various working conditions (PIEs).

Figure 8. Risk fatality increase and risk decrease for fall from scaffold while (de)installing it, for various working conditions (PIEs).

Fixed ladder: The most important measures in order to decrease fatality risk are to use both hands for climbing and to be in the right position (not on top and not overreaching). If these measures are used 100% of the time a fixed ladder is used, risk decreases by 34% and 21% respectively. The most important measures to maintain risk are the good physical condition of the worker and avoidance of external force exerted on him. If this is not
done fatality risk increases by 116% and 108% respectively.
Step ladder: The most important measure in order to decrease fatality risk is the location of the ladder, so that it cannot be hit by a falling object. This measure decreases risk by 26%, if used 100% of the time a step ladder is used. The most important measure in order to maintain risk is the placement of the step ladder, so that it is fully extended. If this is not done risk increases by 85%.

Figure 9. Risk fatality increase and risk decrease for fall from roof, for various working conditions (PIEs).

Figure 10. Risk fatality increase and risk decrease for fall from floor, for various working conditions (PIEs).

Figure 11. Risk fatality increase and risk decrease for fall from platform, for various working conditions (PIEs).

Figure 12. Risk fatality increase and risk decrease for fall in hole, for various working conditions (PIEs).

3.2 Fall from height-scaffolds

Mobile scaffolds: The most important measure in order to decrease fatality risk is the use of a safety line or a harness belt, which decreases risk by 60%, if used
100% of the time a mobile scaffold is used. In case they do not exist risk will increase by 88%.
Fixed scaffolds: The most important measure in order to decrease fatality risk is the existence of protection against hanging/swinging objects, which decreases risk by 75%, if used 100% of the time a fixed scaffold is used. In case it does not exist fatality risk increases by 122%.
(De)Installing scaffolds: The most important measure in order to decrease fatality risk is the use of fall arrestors and safety nets, which decrease risk by 48%, if used 100% of the time a scaffold is installed or deinstalled. In case they do not exist risk will be increased by 93%.

3.3 Fall from roofs

The most important measure in order to decrease fatality risk is to avoid work on a roof that is being demolished, which decreases risk by 58% if used 100% of the time while working on a roof. The most important measures to maintain risk are not to walk on weak spots of roofs and to maintain roof edge protection. If they are not followed risk will increase by 70%.

3.4 Fall from floors or balconies

The most important measure in order to decrease fatality risk is to avoid work on a floor that is being demolished, which decreases risk by 21% if used 100% of the time while working on a floor or balcony. The most important measures in order to maintain risk are to keep safety nets in a good condition, to have sufficient anchoring points for fall arrests and to maintain fall arrest equipment. If each of them is not followed risk will increase by 116%.

3.5 Fall from fixed platforms

The most important measure in order to decrease fatality risk is to avoid working on a platform that is being demolished, which decreases risk by 23% if used 100% of the time while working on a fixed platform. The most important measure to maintain risk is to maintain the edge protection of platforms. If this is not done risk increases by 70%.

3.6 Fall from hole in ground

The most important measure in order to decrease fatality risk is the existence of fall arrest, which decreases risk by 45% if used 100% of the time while working near a hole. The most important measure in order to maintain risk is the existence of edge protection. If it is absent risk will increase by 88%.

Figure 13. Risk fatality increase and risk decrease for fall from moveable platform, for various working conditions (PIEs).

Figure 14. Risk fatality increase and risk decrease for fall from non moving vehicle, for various working conditions (PIEs).

3.7 Fall from moveable platform

The most important measure in order to decrease fatality risk is the existence of edge protection, which decreases risk by 65% if used 100% of the time while working on a moveable platform. The most important measure in order to maintain risk is the fixation of the platform, since its absence will increase risk 9 times.

3.8 Fall from non moving vehicle

The most important measure in order to decrease fatality risk is the existence of edge protection, which decreases risk by 39% if used 100% of the time while climbing on a non moving vehicle. The most important measures in order to maintain risk are securing and balancing the load, since their absence will increase risk by 139% and 137% respectively.

4 CONCLUSIONS

A general logical model has been presented for quantifying the probability of fall from height and the various types of consequences following all fall from height accidents. The model has been used for the prioritization of risk reducing measures, through the calculation of two risk importance measures: the risk decrease and the risk increase. The calculations were made for fatality risk.

REFERENCES

Ale B.J.M., Baksteen H., Bellamy L.J., Bloemhof A., Goossens L., Hale A.R., Mud M.L., Oh J.I.H., Papazoglou I.A., Post J. and Whiston J.Y., 2008. Quantifying occupational risk: The development of an occupational risk model. Safety Science, Volume 46, Issue 2: 176–185.
Aneziris O.N., Papazoglou I.A., Baksteen H., Mud M.L., Ale B.J.M., Bellamy L.J., Hale A.R., Bloemhoff A., Post J., Oh J.I.H., 2008. Quantified risk assessment for fall from height. Safety Science, Volume 46, Issue 2: 198–220.
GISAI, 2005. Geintegreerd Informatie Systeem Arbeids Inspectie: Integrated Information System of the Labor Inspection in the Netherlands.
HSE, 2003. Fall from Height—Prevention and risk control effectiveness, ISBN 07176221 5, http://www.hse.gov.uk/research/rrpdf/rr116.pdf.
McCann M., 2003. ‘‘Deaths in construction related to personnel lifts, 1992–1999’’, Journal of Safety Research, 34: 507–514.
NIOSH, 2000. ‘‘Worker Death by Falls’’, US Department of Health and Human Services, www.cdc.gov/elcosh/docs/d0100/d000057/d000057.html.
OSHA, 1979. ‘‘Occupational fatalities related to scaffolds as found in reports of OSHA fatality/catastrophe investigations’’, Washington DC.
OSHA, 1991. ‘‘Selected occupational fatalities related to vehicle-mounted elevating and rotating work platforms as found in reports of OSHA fatality/catastrophe investigations’’, Washington DC.
Papazoglou I.A., Ale B.J.M., 2007. A logical model for quantification of occupational risk. Reliability Engineering & System Safety 92 (6): 785–803.
Papazoglou I.A., Bellamy L.J., Leidelmeijer K.C.M., Damen M., Bloemhoff A., Kuiper J., Ale B.J.M., Oh J.I.H., 2008. ‘‘Quantification of Occupational Risk from Accidents’’, submitted in PSAM 9.
RIVM, 2008. WORM Metamorphosis Consortium. The Quantification of Occupational Risk. The development of a risk assessment model and software. RIVM Report 620801001/2007, The Hague.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Occupational risk management for vapour/gas explosions

I.A. Papazoglou
TU Delft, Safety Science Group, Delft, The Netherlands

O.N. Aneziris & M. Konstandinidou


NCSR ‘‘DEMOKRITOS’’, Aghia Paraskevi, Greece

M. Mud
RPS Advies BV, Delft, The Netherlands

M. Damen
RIGO, Amsterdam, The Netherlands

J. Kuiper & A. Bloemhoff


Consumer Safety Institute, Amsterdam, The Netherlands

H. Baksteen
Rondas Safety Consultancy, The Netherlands

L.J. Bellamy
White Queen, The Netherlands

J.G. Post
NIFV NIBRA, Arnhem, The Netherlands

J. Oh
Ministry Social Affairs & Employment, The Hague, The Netherlands

ABSTRACT: Chemical explosions pose a serious threat for personnel in sites producing or storing dangerous
substances. The Workgroup Occupational Risk Model (WORM) project financed by the Dutch government
aims at the development and quantification of models for a full range of potential risks from accidents in
the workspace. Sixty-three logical models have been developed each coupling working conditions with the
consequences of accidents owing to sixty-three specific hazards. The logical model for vapour/gas chemical
explosions is presented in this paper. A vapour/gas chemical explosion resulting in a consequence reportable
under Dutch law constitutes the centre event of the model. The left hand side (LHS) of the model comprises
specific safety barriers, that prevent the initiation of an explosion and specific support barriers that influence the
adequate functioning of the primary barriers. The right hand side (RHS) of the model includes the consequences
of the chemical explosion. The model is quantified and the probability of three types of consequences of an
accident (fatality, permanent injury, recoverable injury) is assessed. A sensitivity analysis assessing the relative
importance of each element or working conditions to the risk is also presented.

1 INTRODUCTION

Chemical explosions pose a very serious threat for personnel in sites handling hazardous substances. According to OSHA, 201 fire and explosion related accidents occurred in 2006 in the US (OSHA 2008). In the Netherlands alone, 25 explosions occur per year, of which at least 20 can be characterized as chemical explosions (Baksteen et al., 2008). Chemical explosions may be due to different causes such as the creation of explosive atmospheres and mixtures, sudden ignition of flammable substances, ignition of explosives or violent exothermic reactions. Their consequences are usually serious and may involve multiple victims.

777
The Workgroup Occupational Risk Model (WORM) project has been launched by the Dutch government in order to manage and reduce occupational risk. The aim of the project is the quantification of occupational risk through logical models for a full range of potential risks from accidents in the workspace (Ale et al., 2008). Data for the development of these models are derived from the GISAI database (GISAI 2005) of the Netherlands Ministry of Work, which includes approximately 12500 accident cases reported between January 1998 and February 2004.
Of the analysed GISAI accident cases, 126 cases have been classified as vapour/gas chemical explosion accidents with reportable consequences. The modelling and the quantification of these cases are described in this paper. Quantification for other types of occupational accidents, such as falls from ladders (Aneziris et al., 2008a) or crane activities occupational accidents (Aneziris et al., 2008b), has already been performed within the WORM project. An overall assessment of the risk from 63 specific occupational hazards is given in Papazoglou et al (2008).
From the observed accident cases, scenario-models have first been developed to capture the sequence of events leading to the accident (Bellamy et al., 2007). The scenario-model is the basis for the logical modelling in the WORM project (Papazoglou 2007). This logical model consists of the successive decomposition of the overall accident consequence into simpler and simpler events until a final level of event resolution is achieved. Each level of events is logically interconnected with the more general events of the immediately upper level. The events of the lower level of decomposition form an influence diagram consisting of two parts connected by a main event called the Centre Event (CE) and representing the occurrence of an accident resulting in a reportable consequence (here a vapour/gas explosion). This is a very important characteristic of the model. Owing to the nature of the available data, which correspond to joint events of explosions resulting in reportable consequences, the Centre Event refers to events that either result in a reportable consequence or not (i.e. no explosion or an explosion without reportable consequences). Usually all events to the left of this event represent events aiming at preventing the CE from occurring, and the corresponding part of the diagram is called the Left Hand Side (LHS). All events to the right of the CE correspond to events aiming at mitigating the consequences of the CE, and this part of the model is called the Right Hand Side (RHS) (Papazoglou 2007). In the core of the WORM project, however, the events to the left are events that influence the probability of the Centre Event occurring, the latter being an accident with a reportable consequence. The events to the right of the Centre Event simply condition the severity of the reportable consequence. For communication purposes with safety engineers not familiar with logical models, this influence diagram is called, within the WORM project, a bowtie model.
The logical model provides a way of organising the various events from a root cause via the centre event, ending up with a reportable damage to the health of the worker. The use of such a model is twofold. On the one hand it provides the accident sequences, that is, the sequences of events that lead from a fundamental or root cause to the final consequence. On the other hand, it provides a way of quantifying the risk (Papazoglou 2007).
The structure of the paper is as follows: Sections 2 and 3 illustrate the general concept of the logical model for vapour/gas chemical explosions along with specific details. Section 4 presents the quantification of the bowtie. The results and the ranking of the various working conditions and/or safety measures in terms of their contribution to the risk are presented in section 5. Finally section 6 concludes the paper.

2 LOGICAL MODEL FOR VAPOUR/GAS CHEMICAL EXPLOSIONS

Occupational explosion accidents are accidents where the injuries are the result of the effects of an explosion. For the purpose of modelling, explosions are characterised by pressure (wave) effects and sometimes the launching of fragments. A distinction has been made between physical explosions and chemical explosions. Physical explosions are explosions which are caused by an over-pressurisation of containment for any reason other than a chemical explosion inside the containment. Chemical explosions are explosions caused by a) vapour or gas mixtures; b) dust; c) the ignition of (solid) explosives and d) explosive reactions (explosive run-away reactions, auto-ignition reactions, combustion reactions).
Four different models have been developed to cover those four types of explosions.
This paper presents the model and its quantification for vapour/gas chemical explosions.

2.1 Left hand side of the model (LHS)

The left hand side of the model consists of an initiating event and corresponding safety measures (technical or procedural) aiming at preventing a vapour/gas explosion with reportable consequences.

2.1.1 Initiating events

The initiating event represents activities where workers are adding substances to a containment (filling, feeding, pressurising); venting, discharging, releasing, emptying of a containment/substance; opening a containment (e.g. a valve of an oxygen gas cylinder);
closing a containment; working with chemicals (performing experiments, surface treating, putting objects in galvanic baths); working with explosives; using an ignition source (heating, hot work activities, switching on equipment); manually moving a containment; cleaning a containment; disconnecting a battery; and fighting a fire.

2.1.2 Primary and support safety barriers

A safety barrier is a physical entity: a technical, hardware, procedural or organisational element in the working environment that aims either at preventing something from happening (e.g. the CE) or at mitigating the consequences of something that has happened. Safety barriers can be distinguished into Primary and Support Barriers. A Primary Safety Barrier (PSB), either alone or in combination with other PSBs, may prevent the initiation of an explosion. A Support Safety Barrier (SSB) sustains the adequate function of the PSB and influences the probability with which the primary safety barrier states occur.

2.2 Right hand side (RHS)

The right hand side of the chemical explosions model, in combination with the outcome of the centre event, determines the consequences of the chemical explosions. Four levels of consequences are used: C1: No consequence; C2: Recoverable injury; C3: Permanent injury; C4: Death. The quantification of occupational risk for chemical explosions will be presented in the form of probabilities for the three levels of possible consequence severity.

3 SAFETY BARRIERS FOR VAPOUR/GAS EXPLOSIONS

A vapour/gas explosion occurs when an ‘‘explosive mixture’’ is formed and this mixture comes into contact with an ignition source. Consequently the general objective of the safety functions in relevant situations is to avoid the simultaneous occurrence of ‘‘ignition sources’’ and ‘‘explosive mixtures’’. This can be done by preventing the formation of an ‘‘explosive mixture’’ and, if this is not possible, either by keeping the ‘‘explosive mixture’’ isolated from the space where ignition sources exist (or are likely to exist), or by keeping ignition sources isolated from spaces where an explosive mixture exists.

Based on the physicochemical characteristics of explosive mixtures and process industry experience, along with information from the accidents that occurred in the Netherlands, four different situations where an explosive mixture can come into contact with an ignition source and pose an explosion hazard for the operator are distinguished:

1. Working (at locations) with systems with enclosed flammable substances.
2. Working (at locations) where ventilation is the suitable measure for preventing the creation of explosive vapours/gases.
3. Working (at locations) in which explosive atmospheres are normally present.
4. Working (at locations) with flammable substances which can vaporize resulting in explosive vapour mixtures.

Chemical explosions from vapour/gas mixtures were modelled according to the model shown in Figure 1. The first block in the model is the ‘‘Mission split’’. This block splits the initial mission into four mutually exclusive working environments, each with the potential of one of the four types of explosion. The mission split values are {48%, 25%, 15%, and 12%} for the four explosion types respectively. The meaning of these values is twofold: either they express the percentage of time a worker spends in activities related to each explosion type (for a single-worker assessment) or, in a multi-worker assessment, they express the percentage of workers working exclusively in the environment related to each explosion type.

Safety barriers to prevent the four different types of vapour/gas explosion are presented in the following sections.

3.1 Prevention of uncontrolled substance release

This barrier (PSB1 in Figure 1) models explosions taking place due to uncontrolled flammable substance release and the introduction or existence of ignition sources in the same space. This barrier belongs to type 1 explosion safety barriers and has one success state and three failure states:

State 1: Success state corresponding to no explosion since substance release has been prevented (no release of flammable substance).
State 2: Failure state that models the release of flammable substance and subsequent explosion given that an ignition source will be introduced by a human activity.
State 3: Failure state that models the release of flammable substance and subsequent explosion due to an ignition source introduced by an equipment malfunction. This state models the joint event of flammable substance release and the introduction of an ignition source because of equipment malfunction (e.g. sparks, shorts).
State 4: Failure state that models the release of flammable substance and subsequent explosion due to failure to separate the released flammable vapour from

Figure 1. Logical model for vapour or gas explosions.

(normally) existing ignition sources. This state models the joint event of flammable substance release and the failure to isolate this vapour from existing ignition sources.

When this barrier is in any of the three failure states an explosion of type 1 may occur.

3.2 Prevention of explosion of flammable atmosphere in closed space

This barrier (PSB2 in Figure 1) models explosions taking place in closed spaces where flammable vapours are produced and are supposed to be removed by a ventilation system. Absence or failure of the ventilation system allows the build-up of the explosive vapour, and the explosion takes place either because of the erroneous introduction of an ignition source (by human activity or equipment malfunction) or because of failure to detect the presence of the vapour and to separate it from normally present ignition sources. This barrier prevents type 2 explosions and has six states (1 success state and 5 failure states), as follows:

State 1: Success state resulting in no explosion since no ignition sources have been introduced or separation barriers are in full function.
State 2: Explosion takes place given that flammable vapours exist and an ignition source is introduced by human activity.
State 3: Explosion takes place given that flammable vapours exist and an ignition source is introduced owing to equipment failure. This involves equipment which fails and forms an ignition source (e.g. not possible to turn it off, electrical defect, missing insulation) or which is the wrong type of equipment (use of non-explosion-proof equipment).
State 4: Explosion takes place given normally present ignition sources. Flammable vapours are generated and remain undetected because no provisions or possibilities for the indication/detection of the presence of explosive mixtures have been taken.
State 5: Explosion takes place given normally present ignition sources. Indication/detection provisions are present but have failed, or diagnosis/response has failed, so flammable vapours have been generated.
State 6: Explosion takes place given normally present ignition sources. Flammable vapours are introduced where the ignition sources are, from other areas, through the removal of barriers.

When this barrier is in any of the states 2–6 above an explosion of type 2 may occur if flammable vapours are created.

3.3 Prevention of explosion of flammable atmosphere in or near a system that normally produces such atmosphere

This barrier (PSB4 in Figure 1) models explosions where the explosive atmosphere is always present due to a badly designed process. An explosion occurs when an ignition source is introduced. No separation of the flammable atmosphere from the ignition source is possible in this case. This barrier models explosions of type 3 and has one success and two failure states:

State 1: Success state resulting in no explosion because the process is designed in a way that no explosive atmosphere is generated.
State 2: Failure state corresponding to generation of explosive atmosphere and subsequent explosion given that an ignition source will be introduced by a human activity.
State 3: Failure state corresponding to generation of explosive atmosphere and subsequent explosion due to an ignition source introduced by an equipment malfunction.

When this barrier is in any of the two failure states an explosion of type 3 may occur.

3.4 Prevention of explosion of a flammable atmosphere created by evaporation of flammable material

This barrier (PSB5 in Figure 1) models explosions at locations with flammable substances which can vaporize suddenly, resulting in explosive vapour mixtures (explosion type 4). This barrier has one success and three failure states.

State 1: Success state resulting in no explosion since no explosive atmosphere is generated.
State 2: Failure state that models the generation of explosive atmosphere and a subsequent explosion given that an ignition source will be introduced by a human activity.
State 3: Failure state that models the generation of explosive atmosphere and a subsequent explosion due to an ignition source introduced by an equipment malfunction.
State 4: Failure state that models the generation of explosive atmosphere and a subsequent explosion due to failure to separate the explosive atmosphere from (normally) existing ignition sources.

When this barrier is in any of the three failure states an explosion of type 4 may occur.

3.5 Barriers preventing the introduction of an ignition source through human activities

These barriers (PSB6, PSB8, PSB9 in Figure 1) represent the introduction of an ignition source through various human activities (e.g. hot works, human errors, smoking). Three barriers have been introduced for the three types of explosions that are due to human ignition causes (type 1, type 3 and type 4 explosions). The barriers have two states, namely ‘‘success’’ and ‘‘failure’’, which correspond to the successful and failed prevention, respectively, of the introduction of ignition sources by human activities.

Failure of such a barrier can result in an explosion if combined with the corresponding ‘‘flammable atmosphere’’ barriers in state 2 (e.g. PSB6 with PSB1, PSB8 with PSB4 and PSB9 with PSB5 in Figure 1). Introduction of ignition sources due to equipment malfunction is included in the states of the safety barriers (see sections 3.1–3.4 above).

3.6 Ventilation systems

Adequate ventilation systems ensure that an explosive atmosphere will not be created in confined spaces. This support barrier has two states: ‘Exists’ and ‘Absent’. ‘Exists’ means that a ventilation system exists in the working place, but an explosion still may occur because the existing ventilation system either fails or is inadequate and an explosive atmosphere has been created. ‘Absent’ means that no ventilation system exists, so an explosive atmosphere may be created.

This barrier influences the barrier ‘‘Prevention of explosion of flammable atmosphere in closed space’’, as shown in Figure 1. Existence of a ventilation system implies a lower probability of chemical explosion due to prevention failure in closed spaces than when a ventilation system is ‘Absent’.

3.7 Personal protective equipment (PPE), protection other than PPE and emergency response

Personal Protective Equipment (e.g. helmets and safety glasses to protect from fragments, and ear muffs to avoid eardrum rupture), protective barriers (such as explosion-proof areas or doors) and emergency response with prompt medical attention in the workplace may mitigate the effects of an explosion. Those barriers have two states, ‘Exists’ and ‘Absent’, meaning that the barrier either exists or does not exist (or is not used) in the working place. ‘‘PPE’’, ‘‘Other protection than PPE’’ and ‘‘Emergency response’’ influence all prevention barriers (see also quantification issues in section 4).

3.8 Age of operator

The age of the person injured by the explosion may have an influence on the consequences of the explosion for the victim. This event has two states: ‘Age of operator <= 50 years old’ and ‘Age of operator > 50 years old’. ‘‘Age of operator’’ influences all prevention barriers.

3.9 Probability influencing entities (PIEs)

In several instances the safety barriers of the model are simple enough to link directly to easily understood working conditions and measures, as in the barrier ‘‘Ventilation’’. Assessing the frequency with which

ventilation systems exist in the working environment is straightforward, since either those systems exist or not.

In other instances, however, this is not possible. For example, the support barrier ‘‘Other protection than PPE’’ may be analysed into more detailed and more concrete measures that affect its quality. Such specific measures are: (i) explosion suppression systems (explosion-proof areas such as control rooms, or explosion-resistant doors); and (ii) the distance between the explosion point and the location of the worker. Similarly the barrier ‘‘Emergency response’’ may be analysed into the following measures: (i) supervision or monitoring at the workplace; (ii) an emergency team which is always present or on standby; (iii) first aid and decompression facilities; (iv) multiple rescue possibilities (direct access); and (v) professional medical assistance. Such factors are called Probability Influencing Entities (PIEs). Each influencing factor (PIE) is assumed to have two possible levels, ‘‘Adequate’’ and ‘‘Inadequate’’. The quality of an influencing factor is then set equal to the frequency with which this factor is at the adequate level in the working places. Then the quality of the barrier is given by a weighted sum of the influencing factor qualities. The weights reflect the relative importance of each factor and are assessed by the analyst on the basis of expert judgement. Currently equal weights have been used. In this way the probability of a barrier being in one of its possible states is given by the weighted sum of the frequencies of the influencing factors (RIVM 2008).

The PIEs and their values, as well as the failure probability of the barrier they influence, for the logic model of vapour/gas chemical explosions are presented in Table 1.

4 QUANTIFICATION PROCESS

In general the level of resolution of a logical model used in ORM was driven by the available data. A logic model provides a collection of event outcomes or barrier states which may lead to an accident when they coexist in particular states. These accidents have specific consequences. The general form of such a sequence is:

C = {S1, S2, . . ., Sn, B1, B2, . . ., Bm}    (1)

Analysis of available accident data allowed the assessment of the number of times such accident sequences occurred during a given period of time. Surveys of the Dutch working population assessed the exposure of the workers to the specific hazards over the same period of time. Consequently it was possible to assess the probability P(C) of the various accident sequences. Surveys of the Dutch working places and of the corresponding conditions allowed the assessment of the overall probability of some individual barriers (e.g. see Table 1). If such an assessment is made, then

Table 1. PIE characteristics and values.

Barrier  PIEs  PIE characteristics                                     PIE value  Barrier failure

PSB6     1     Prevention of introduction of an ignition source
               (human activity - type 1 explosion)                     0.17       0.17
PSB8     1     Prevention of introduction of an ignition source
               (human activity - type 3 explosion)                     0.08       0.08
PSB9     1     Prevention of introduction of an ignition source
               (human activity - type 4 explosion)                     0.06       0.06
SSB1     1     Ventilation systems                                     0.06       0.06
SSB2     1     Personal Protective Equipment                           0.3        0.3
SSB3     2     Explosion suppression systems
               (control room or doors)                                 0.26       0.25
               Distance to explosion                                   0.24
SSB4     5     Supervision/monitoring at the location                  0.2        0.23
               Emergency team present or standby                       0.23
               First aid/decompression facilities                      0.24
               Rescue possibilities (access)                           0.24
               Professional medical assistance                         0.24
SSB5     1     Population share exposed >= 50 years                    0.17       0.17
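The equal-weights scheme of section 3.9 can be checked directly against Table 1: with equal weights, a barrier's failure probability is simply the average of its PIE values. A minimal sketch of that calculation (the function name and layout are illustrative, not taken from the WORM/ORM software):

```python
def barrier_failure(pie_values, weights=None):
    """Barrier failure probability as a weighted sum of PIE frequencies.

    With no weights supplied, equal weights are used, as the paper
    states is currently done in the model.
    """
    if weights is None:
        weights = [1.0 / len(pie_values)] * len(pie_values)
    return sum(w * p for w, p in zip(weights, pie_values))

# SSB3 (suppression systems, distance to explosion): average of 0.26 and 0.24
ssb3 = barrier_failure([0.26, 0.24])                    # 0.25, as in Table 1

# SSB4 (emergency response): five PIEs with equal weights
ssb4 = barrier_failure([0.20, 0.23, 0.24, 0.24, 0.24])  # 0.23, as in Table 1
```

Reproducing the SSB3 and SSB4 rows of Table 1 this way confirms that equal weights were used for the multi-PIE barriers.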

probabilities of the form P(S1, S2, . . ., B1, . . ., Bi, . . .) can be estimated, where (S1, S2, . . ., B1, . . ., Bi, . . .) are the barriers that can be quantified independently of the accident data. Then equation (1) can be written as:

P(C) = P(S1, . . ., Bi) × P(Bi+1, . . ., Bj | S1, . . ., Bi)    (2)

In the above equation P(C) and P(S1, . . ., Bi) are known; hence P(Bi+1, . . ., Bj | S1, . . ., Bi) can be calculated. The overall objective of ORM was then to be able to calculate the new value of P(C) given a number of safety measures that have a specific effect and consequently change P(S1, . . ., Bi) to a new value P′(S1, . . ., Bi).

As a result, events that could not be independently quantified had to be grouped together as (Bi+1, . . ., Bj), since only the probability of the joint event could be quantified. In particular, within the group of events (Bi+1, . . ., Bj) there were always the events of an accident with specific consequences. For example, the probability of the joint event P(Failure to detect explosive atmosphere, explosion, fatality) could be assessed, and hence P(Failure to detect explosive atmospheres, explosion) could also be assessed, but not the independent probability P(Failure to detect explosive atmospheres). Based on these constraints the model presented in section 3 has been developed and quantified.

5 RESULTS

Quantification of the model for vapour/gas explosions, given a total exposure of 1.62 × 10^9 hours, resulted in the following probabilities:

Probability of Explosion with Reportable Consequence (hr^-1) = 7.78 × 10^-8
Probability of Lethal Injury (hr^-1) = 9.26 × 10^-9
Probability of Permanent Injury (hr^-1) = 2.38 × 10^-8
Probability of Recoverable Injury (hr^-1) = 4.47 × 10^-8

5.1 Importance analysis

To assess the relative importance of each factor influencing the risk from vapour/gas chemical explosions, two importance measures have been calculated.

1. Risk decrease: this measure gives the relative decrease of the risk if the barrier (or PIE) achieves its perfect state with probability equal to unity.
2. Risk increase: this measure gives the relative increase of the risk if the barrier (or PIE) achieves its failed state with probability equal to unity.

Risk decrease prioritizes the various elements of the model for the purposes of possible improvements. It is more risk-effective to try to improve first a barrier with a higher risk decrease percentage than another with a lower risk decrease.

Risk increase provides a measure of the importance of maintaining each element of the model at its present level of quality. It is more important to concentrate on the maintenance of a barrier with a high risk increase importance than on one with a lesser one.

The effect each barrier has on the risk rate per type of explosion and on the overall risk is presented in Tables 2 and 3.

From the risk decrease shown in Table 2 it follows that ‘‘Personal Protective Equipment’’ and ‘‘Other protection than PPE’’ are the barriers with the most significant effect on the overall risk reduction (higher values of risk decrease) if made perfect, that is, if they are used 100% of the time. On the other hand, the results in Table 3 indicate that the barriers with the highest levels of risk increase are the ‘‘Ventilation Systems’’ and the ‘‘Protection other than PPE’’. This means that these two are the most important barriers to maintain, because if failed they cause the highest risk increase. It should be emphasized that the results shown in Tables 2 and 3 and the corresponding conclusions are valid for the particular case modelled. This is a worker or a group of workers exposed to the hazards of the four types of vapour/gas explosions with the quantitative mission split discussed in

Table 2. Risk rates per type of explosion and overall risk decrease for each safety barrier.

Safety barrier                                     Type 1     Type 2     Type 3     Type 4     Overall    Risk
                                                   explosion  explosion  explosion  explosion  risk rate  decrease

Base case                                          2.65E-08   1.79E-08   1.05E-08   2.28E-08   7.78E-08
Prevention of ignition source (type 1 explosion)   2.22E-08                                    7.35E-08   −5.53%
Prevention of ignition source (type 3 explosion)                         5.13E-09              7.24E-08   −6.94%
Prevention of ignition source (type 4 explosion)                                    2.16E-08   7.65E-08   −1.58%
Ventilation systems                                           6.17E-09                         6.60E-08   −15.08%
Personal Protective Equipment                      1.04E-08   1.28E-08   7.50E-09   1.63E-08   4.69E-08   −39.66%
Other protection than PPE                          1.60E-08   1.19E-08   7.00E-09   1.52E-08   5.02E-08   −35.49%
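The rates quoted in section 5 can be cross-checked: the three consequence severities are mutually exclusive outcomes of a reportable explosion, so their rates should sum to the overall rate up to the rounding of the published three-digit figures. A quick sketch:

```python
# Per-hour probabilities quoted in section 5 for vapour/gas explosions
lethal      = 9.26e-9
permanent   = 2.38e-8
recoverable = 4.47e-8
overall     = 7.78e-8   # explosion with reportable consequence

# Mutually exclusive severities: their sum should match the overall
# rate to within the rounding of the published values.
relative_gap = abs(lethal + permanent + recoverable - overall) / overall
assert relative_gap < 0.01
```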

Table 3. Risk rates per type of explosion and overall risk increase for each safety barrier.

Safety barrier                                     Type 1     Type 2     Type 3     Type 4     Overall    Risk
                                                   explosion  explosion  explosion  explosion  risk rate  increase

Base case                                          2.65E-08   1.79E-08   1.05E-08   2.28E-08   7.78E-08
Prevention of ignition source (type 1 explosion)   4.76E-08                                    9.89E-08   27.19%
Prevention of ignition source (type 3 explosion)                         7.22E-08              1.39E-07   79.35%
Prevention of ignition source (type 4 explosion)                                    4.22E-08   9.72E-08   25.00%
Ventilation systems                                           2.02E-07                         2.61E-07   235.91%
Personal Protective Equipment                      6.43E-08   2.98E-08   1.75E-08   3.81E-08   1.50E-07   92.52%
Other protection than PPE                          5.80E-08   3.58E-08   2.10E-08   4.57E-08   1.60E-07   106.15%
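The risk decrease and risk increase measures defined in section 5.1 can be reproduced from the per-type rates of Tables 2 and 3. A sketch using the ventilation-system rows (this barrier only affects type 2 explosions); the small differences from the printed percentages (−15.08% and 235.91%) come from rounding of the tabulated rates:

```python
def overall_rate(per_type_rates):
    # The four explosion types are mutually exclusive branches of the
    # mission split, so the overall risk rate is the sum over types.
    return sum(per_type_rates)

base = [2.65e-8, 1.79e-8, 1.05e-8, 2.28e-8]        # types 1-4, base case

# Ventilation made perfect: the type 2 rate falls to 6.17e-9 (Table 2)
perfect = [2.65e-8, 6.17e-9, 1.05e-8, 2.28e-8]
risk_decrease = overall_rate(perfect) / overall_rate(base) - 1.0   # about -0.151

# Ventilation failed with certainty: the type 2 rate rises to 2.02e-7 (Table 3)
failed = [2.65e-8, 2.02e-7, 1.05e-8, 2.28e-8]
risk_increase = overall_rate(failed) / overall_rate(base) - 1.0    # about +2.37
```

The same recipe applied to any other row of the tables reproduces the corresponding importance percentage.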

Table 4. Risk decrease for the three levels of consequences.

Safety barrier                                     Recoverable  Risk       Permanent  Risk       Lethal     Risk
                                                   injuries     decrease   injuries   decrease   injuries   decrease
                                                                (%)                   (%)                   (%)

Base case                                          4.47E-08                2.38E-08              9.26E-09
Prevention of ignition source (type 1 explosion)   4.14E-08     −7.38      2.31E-08   −2.94      8.96E-09   −3.24
Prevention of ignition source (type 3 explosion)   4.19E-08     −6.26      2.12E-08   −10.92     9.26E-09   −0.001
Prevention of ignition source (type 4 explosion)   4.41E-08     −1.34      2.33E-08   −2.10      9.13E-09   −1.40
Ventilation systems                                3.90E-08     −12.75     2.10E-08   −11.76     6.03E-09   −34.88
Personal protective equipment                      2.53E-08     −43.40     1.56E-08   −34.45     6.02E-09   −34.99
Other protection than PPE                          2.85E-08     −36.24     1.56E-08   −34.45     6.06E-09   −34.56
Emergency response                                 3.73E-08     −16.55     1.82E-08   −23.53     8.64E-09   −6.70

Table 5. Risk increase for the three levels of consequences.

Safety barrier                                     Recoverable  Risk       Permanent  Risk       Lethal     Risk
                                                   injuries     increase   injuries   increase   injuries   increase
                                                                (%)                   (%)                   (%)

Base case                                          4.47E-08                2.38E-08              9.26E-09
Prevention of ignition source (type 1 explosion)   6.09E-08     36.24      2.73E-08   14.71      1.07E-08   15.55
Prevention of ignition source (type 3 explosion)   7.61E-08     70.25      5.41E-08   127.31     9.26E-09   0.01
Prevention of ignition source (type 4 explosion)   5.34E-08     19.46      3.24E-08   36.13      1.14E-08   23.11
Ventilation systems                                1.33E-07     197.54     6.82E-08   186.55     6.00E-08   547.95
Personal protective equipment                      8.99E-08     101.12     4.30E-08   80.67      1.68E-08   81.43
Other protection than PPE                          9.31E-08     108.28     4.85E-08   103.78     1.87E-08   101.94
Emergency response                                 6.95E-08     55.48      4.27E-08   79.41      1.14E-08   23.11

section 3. For this specific case the most effective measure for reducing risk is an increase in the percentage of time that PPE is used.

This conclusion might change if a different mission split is considered or if an explosion type is considered in isolation. For explosion type 2, for example, presence of ‘‘Ventilation systems’’ in 100% of the cases decreases the risk rate by 65.5%, whereas ‘‘PPE’’ and ‘‘Other protection’’ decrease it by 28.5% and 33.5% respectively. Thus the risk of a type 2 explosion is reduced more by increasing, from the present level, the use of ventilation systems than by increasing the PPE or other protective measures.

Risk importance measures can be calculated not only with respect to the probability of an explosion with reportable consequences but also with respect to a specific type of consequence, as shown in Tables 4 and 5. As can be seen from Table 4 (last column), ‘‘Ventilation Systems’’, ‘‘Personal Protective Equipment’’ and ‘‘Other protection measures’’ are practically equally important in terms of their improvement with respect to fatalities. This is due to the fact that the conditional probability of a fatal injury given a type 2 explosion is much higher than for the other types of explosions. Thus even if type 2 explosions participate only by 25% in the mission split, the increase in the presence of ventilation systems is very important for decreasing the risk of fatality.

As for the risk increase results of Table 5: at all consequence levels ‘‘Ventilation systems’’ have the greatest risk increase percentages, while ‘‘Other protection than PPE’’ is second in ranking for recoverable and lethal injuries, and ‘‘Prevention of introduction of ignition sources for type 3 explosion’’ is second for permanent injuries. The third barrier in the risk increase ranking is ‘‘PPE’’ for lethal and recoverable injuries and ‘‘Other protection than PPE’’ for the permanent ones.

In this way all safety barriers can be ranked according to the effect they induce on the overall risk as well as on the risk of lethal, permanent or recoverable injuries.

6 CONCLUSIONS

A logical model has been presented for quantifying the probability of vapour/gas chemical explosions and the various types of consequences following these types of accidents. The model includes primary and support safety barriers aiming at preventing chemical explosions. For the quantification of the model, exposure rates (total time spent in an activity involving each hazard, per hour) have been used, which were estimated with user (operator) surveys and real accident data coming from the reported accident database GISAI. The probability of the consequences of such accidents is presented at three levels: fatalities, permanent injury and non-permanent injury. Surveys also provided data on the working places and the corresponding conditions, allowing in this way the assessment of the overall probability of some individual barriers. The model has been used for the prioritization of risk reducing measures through the calculation of two risk importance measures: the risk decrease and the risk increase. The calculations were made for the overall risk and for the risk at three levels of consequence severity. ‘‘Personal Protective Equipment’’ and ‘‘Ventilation Systems’’ are the barriers with the most important risk values in the overall risk ranking analysis.

REFERENCES

Ale B.J.M., Baksteen H., Bellamy L.J., Bloemhof A., Goossens L., Hale A.R., Mud M.L., Oh J.I.H., Papazoglou I.A., Post J. and Whiston J.Y., 2008. Quantifying occupational risk: The development of an occupational risk model. Safety Science, Volume 46, Issue 2: 176–185.
Aneziris O.N., Papazoglou I.A., Baksteen H., Mud M.L., Ale B.J.M., Bellamy L.J., Hale A.R., Bloemhoff A., Post J., Oh J.I.H., 2008a. Quantified risk assessment for fall from height. Safety Science, Volume 46, Issue 2: 198–220.
Aneziris O.N., Papazoglou I.A., Mud M.L., Damen M., Kuiper J., Baksteen H., Ale B.J.M., Bellamy L.J., Hale A.R., Bloemhoff A., Post J.G., Oh J.I.H., 2008b. Towards risk assessment for crane activities. Safety Science. doi:10.1016/j.ssci.2007.11.012
Baksteen H., Samwe M., Mud M., Bellamy L., Papazoglou I.A., Aneziris O., Konstandinidou M., 2008. Scenario—Bowtie modeling BT 27 explosions, WORM Metamorphosis Report.
Bellamy L.J., Ale B.J.M., Geyer T.A.W., Goossens L.H.J., Hale A.R., Oh J.I.H., Mud M.L., Bloemhoff A., Papazoglou I.A., Whiston J.Y., 2007. Storybuilder—A tool for the analysis of accident reports. Reliability Engineering and System Safety 92: 735–744.
GISAI, 2005. Geintegreerd Informatie Systeem Arbeids Inspectie: Integrated Information System of the Labor Inspection in the Netherlands.
OSHA 2008. Bureau of Labor Statistics. http://data.bls.gov/GQT/servlet/InitialPage.
Papazoglou I.A., Ale B.J.M., 2007. A logical model for quantification of occupational risk. Reliability Engineering & System Safety 92 (6): 785–803.
Papazoglou I.A., Bellamy L.J., Leidelmeijer K.C.M., Damen M., Bloemhoff A., Kuiper J., Ale B.J.M., Oh J.I.H. Quantification of Occupational Risk from Accidents, submitted to PSAM 9.
RIVM 2008. WORM Metamorphosis Consortium. The Quantification of Occupational Risk. The development of a risk assessment model and software. RIVM Report 620801001/2007, The Hague.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Occupational risk of an aluminium industry

O.N. Aneziris & I.A. Papazoglou


National Center ‘‘DEMOKRITOS’’, Athens, Greece

O. Doudakmani
Center for Prevention of Occupational Risk, Hellenic Ministry of Employment and Social Affairs, Thessaloniki, Greece

ABSTRACT: This paper presents the quantification of occupational risk in an aluminium plant producing
profiles, located in Northern Greece. Risk assessment is based on the Workgroup Occupational Risk Model
(WORM) project, developed in the Netherlands. This model can assess occupational risk at hazard level, activity
level, job level and overall company risk. Twenty-six job positions have been identified for this plant, such as
operators of press extruders, forklift operators, crane operators, painters, and various other workers across the
process units. All risk profiles of workers have been quantified and jobs have been ranked according to their
risk. Occupational risk has also been assessed for all plant units and the overall company.

1 INTRODUCTION

Occupational safety and health is a major concern in many countries, and the traditional way to deal with it is legislation, regulation, standards and safety guidelines. The Ministry of Social Affairs and Employment in the Netherlands developed the Workgroup Occupational Risk Model (WORM) project, a large-scale project during 2003-2007, to improve the level of safety at the workplace by introducing quantitative occupational risk. WORM is presented in (RIVM 2008) and its main achievement is the development of the occupational risk model, which is built on the detailed analysis of 9000 accident reports in the Netherlands.

Occupational risk assessment is performed for an aluminium plant located in Greece. The aluminium industry is the most dynamic branch of the non-ferrous metal working industry in Europe, with significant activity in Greece. Data regarding workers' jobs, activities and hazards were available, as well as general accident data of Greek aluminium industries.

This paper is organized as follows. After the introduction of section 1, section 2 presents briefly the methodology of occupational risk and section 3 the description of the aluminium plant. Section 4 presents the job positions of all workers and section 5 occupational risk quantification and results.

2 OCCUPATIONAL RISK

In the framework of the WORM project a model for the quantification of occupational risk has been developed. According to this model, occupational risk in a company is calculated by assessing the hazards to which the workers in this company are exposed, the duration of the exposure, and the integration of the risk over all hazards and all workers.

A tree-like structure is used to develop the composite model of ORM, as depicted in Figure 1.

The Top Level of the tree corresponds to the entity under analysis.

The second level provides the type of ‘‘Company position’’ corresponding to a specific type of job, along with the number of people in each position type. There are n = 1, 2, . . ., N company positions, each occupied by T1, . . ., TN employees, respectively.

Figure 1. Composite ORM model structure.
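The position → activity → hazard hierarchy described above lends itself to a direct representation. A minimal sketch with invented example data (the class names, hazard names and figures are illustrative, not taken from the WORM software):

```python
from dataclasses import dataclass, field

@dataclass
class Activity:
    name: str
    annual_frequency: float       # f(n, m): performances per year
    hazard_hours: dict = field(default_factory=dict)  # hazard -> hours per performance

@dataclass
class Position:
    name: str
    workers: int                  # T_n: employees holding this position
    activities: list = field(default_factory=list)    # A(n, 1), ..., A(n, Mn)

def annual_exposure(position):
    """Hours of exposure per hazard, per worker, per year."""
    totals = {}
    for act in position.activities:
        for hazard, hours in act.hazard_hours.items():
            totals[hazard] = totals.get(hazard, 0.0) + act.annual_frequency * hours
    return totals

# Invented example: a forklift operator performing two activities
forklift = Position("forklift operator", workers=3, activities=[
    Activity("transport profiles", 400, {"struck by vehicle": 1.5}),
    Activity("refuel vehicle", 50, {"fire/explosion": 0.2, "struck by vehicle": 0.1}),
])
```

Combining these per-hazard exposure hours with the per-hour hazard risks assessed at the level described in section 2.1 then gives each hazard's contribution to the position's annual risk.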

The third level of the tree describes for each position-type the activities required to perform the corresponding job, along with the respective frequencies. This means that a particular job is described in terms of a number of activities, each of which is performed a specific number of times over a given period. Thus the nth job position is characterized by Mn activities A(n, 1), . . ., A(n, m), . . ., A(n, Mn), each performed with annual frequency f(n, m).

Finally, performance of a specific activity is associated with a number of single hazards (out of the sixty-three single hazards) and a corresponding duration of exposure to each and every hazard. Thus activity A(n, m) is associated with hazards h(n, m, 1), h(n, m, 2), . . ., h(n, m, Knm). Risk is calculated as a combination of the contributions of jobs, activities and bowties.

2.1 Calculation at the hazard level

WORM has assessed the risk per hour of exposure for 63 hazards on the basis of the characteristics of the average Dutch worker. These characteristics express the working conditions and are quantified as the percentage of time that the worker is working under specified levels or types of those conditions. Case-specific analyses can be made by adjusting these characteristics to the specific conditions. These calculations provide the risk as the probability of one of three possible consequences (recoverable injuries, permanent injuries, and fatalities) for the duration of the exposure, as presented by Papazoglou et al. (2009) and Ale et al. (2008).

2.2 Calculation at activity level

Next the risk at the activity level is calculated. A general assumption is that if during one of the actions an accident occurs resulting in a consequence (recoverable injury, permanent injury or death), the activity is interrupted and the overall consequence of the activity is the same. That is, no more exposure to the same or additional hazards is possible.

Let

n = 1, 2, . . ., Nm be an index over all the dangers of the mth activity;
p1,n: probability of no consequence at the nth hazard;
p2,n: probability of recoverable injury at the nth hazard;
p3,n: probability of permanent injury at the nth hazard;
p4,n: probability of death at the nth hazard;
P1,m: probability of no consequence for activity m;
P2,m: probability of recoverable injury for activity m;
P3,m: probability of permanent injury for activity m;
P4,m: probability of death for activity m.

Then

P1,m = Π_{n=1}^{Nm} p1,n

P2,m = Σ_{n=1}^{Nm} p2,n Π_{r=1}^{n−1} p1,r

P3,m = Σ_{n=1}^{Nm} p3,n Π_{r=1}^{n−1} p1,r

P4,m = Σ_{n=1}^{Nm} p4,n Π_{r=1}^{n−1} p1,r        (1)

where it has been assumed that any consequence (recoverable injury, permanent injury, death) happening at the nth hazard interrupts the activity, and hence successful completion of the activity requires successful (no-consequence) completion of the preceding (n − 1) dangers. Any other outcome interrupts the stream of dangers and the activity results in this consequence.

2.3 Calculation at the job level

A worker's job in a given period of time comprises a number of activities, where each activity consists of a number of dangers (bowties). There are M activities Am, m = 1, 2, . . ., M. Each activity consists of Nm dangers dn, n = 1, 2, . . ., Nm, and each activity is repeated fm times a year, m = 1, 2, . . ., M (frequencies). Then the risk for the period of interest (a year) is calculated as follows.

For each activity (m) we calculate the risk per activity as in section 2.2.
Assumption: recoverable injury, permanent injury and fatality interrupt the activity and no additional exposure is possible.
For each activity (m), given the annual frequency fm, we calculate the annual risk per activity.
Assumption: a recoverable injury during the f th undertaking of the activity does not preclude undertaking the same activity again for the (f + 1)th up to the fm th time.

aP1,m = (P1,m)^fm

aP2,m = 1 − aP1,m − aP3,m − aP4,m

aP3,m = P3,m [1 − (1 − P3,m − P4,m)^fm] / (P3,m + P4,m)

aP4,m = P4,m [1 − (1 − P3,m − P4,m)^fm] / (P3,m + P4,m)        (2)

where aP1,m, aP2,m, aP3,m and aP4,m are the annual probabilities of no consequence, recoverable injury, permanent injury and death for activity m.

For the M activities we calculate the total annual risk as follows:

R1 = Π_{m=1}^{M} aP1,m

R2 = 1 − R1 − R3 − R4

R3 = Σ_{m=1}^{M} aP3,m Π_{r=1}^{m−1} (aP1,r + aP2,r)

R4 = Σ_{m=1}^{M} aP4,m Π_{r=1}^{m−1} (aP1,r + aP2,r)        (3)

where R1, R2, R3 and R4 are the total annual probabilities of no consequence, recoverable injury, permanent injury and death. Again the assumption is made that a recoverable injury during activity m does not preclude undertaking the remaining activities during the year.

2.4 Overall risk

Given a company with N jobs and Tn workers performing the nth job, the overall risk is approximated by the expected number of workers to suffer each of the three consequences:

R2,o = Σ_n R2,n Tn

R3,o = Σ_n R3,n Tn

R4,o = Σ_n R4,n Tn        (4)

where R2,o, R3,o and R4,o are the overall risks of recoverable injury, permanent injury and death, and R2,n, R3,n, R4,n are the job-level probabilities of equation (3) for the nth job.

3 PLANT DESCRIPTION

The aluminium plant produces profiles for various applications in industry and building construction. The heart of this industry is the extrusion press section, where the raw material, arriving in the form of billets, is transformed into aluminium profiles. Next the profiles are transferred to the surface treatment section, so as to acquire the required aesthetic and anti-corrosion properties. Four additional sections are required to support profile production: the die section, storage, packaging and mechanical support. Therefore the plant consists of six major units: extrusion, surface treatment, die, storage, packaging and mechanical.

a) Extrusion unit: Aluminium billets are the starting stock for the extrusion process. They are introduced into furnaces, heated up to 450°C, cut to the required length and inserted in a press, where extrusion takes place. Pressure is exerted on the billet, which is crushed against a die. The newly formed extrusion is supported on a conveyor as it leaves the press. Depending on the alloy, the extrusion is cooled after emerging from the die, so as to obtain sufficient metallurgical properties. The profile is then stretched, being placed in a traction device to obtain the exact geometrical shape. It is then conveyed to a saw, where it is cut in order to obtain the required commercial length, and transported with cranes to the ageing treatment. This is performed in furnaces where the aluminium profiles are heated at 185°C for 6–7 hours. Figure 2 presents the block diagram of the plant.

Figure 2. Plant block diagram (billet storage, furnace 450°C, billet cutting, extruding with dies, stretching, profile cutting, ageing furnace, surface treatment with anodizing and painting, packaging, tools, profile storage).

b) Surface treatment consists of the cleaning, anodizing and coating sections. In the cleaning section

profiles are immersed in a bath of NaOH, where they are cleaned from oils and treated for painting. In anodizing, the metal is covered by an electrolytic process with a layer of protective oxide, which adds anti-corrosion properties. Aluminium profiles are transported to the painting unit, where they are hanged either horizontally or vertically and painted. Painting with powder in either horizontal or vertical units provides a good aesthetic result for the final product.

c) Die section: In this section the dies, which are required in the extrusion press unit for the production of profiles, are treated either mechanically or chemically, in order to remove traces of metal which have remained on them. Clean and well-maintained dies influence the quality of the final product.

d) Packaging unit: After surface treatment the profiles are moved into other areas, where they are packed and prepared for transportation to customers. The plant is equipped with an automatic packaging unit, and profiles are palletized so as to be protected from surface damage and twisting.

e) Storage areas for aluminium billets and profiles: Aluminium billets, which are the starting material of the plant, are stored in an area close to the extrusion process and transported by forklifts. Packed aluminium profiles are transported either by forklifts or by cranes to the storage area.

f) The machinery section consists of various operations supporting the production of profiles.

4 COMPANY POSITIONS

This plant has seventy-seven workers distributed over the six units described in the previous section. There are twenty-six different types of jobs, such as operator of press extruder, forklift operator, painter etc., which are described in this section, along with the associated activities and hazards.

4.1 Extrusion

There are four job positions in this unit: extruder press operator, extruder worker, stretching operator and cutting operator.

a) Press extruder operator: He is responsible for the cutting of the billets to the required length, their loading and unloading on the press extruder, the operation of the press and completing the required documents. Therefore his job is decomposed into four activities (cutting of the billets, loading/unloading billets on the press, press operation and completing documents), as presented in Figure 3 and Table 1. While cutting billets, which occurs every day for two hours, he is exposed to the following hazards: fall on the same level (for 2 hours), contact with falling object from crane (for 0.2 hours), contact with hanging or swinging objects (for 0.2 hours), contact with moving parts of a machine (for 0.5 hours), trapped between objects (for 0.6 hours) and contact with hot surface (for 2 hours). Figure 3 presents the decomposition of this job into activities and their associated hazards, while Table 1 also presents the frequency and duration of the activities and the exposure to hazards. Similar tables are provided for all jobs described in this section, as presented by Doudakmani (2007). There are 6 press extruder operators in this plant working on an eight-hour shift basis.

b) Extruder worker: His activities are to heat the dies, install them in the press extruder, and transport them either by crane or by trolley to the die section. He is exposed to the same hazards as the extruder press operator, but with different durations. There are 6 extruder workers working on an eight-hour shift basis.

Figure 3. Decomposition of the company into job positions, activities and hazards (company → job positions such as forklift operator, press extruder operator, painter; press extruder operator → loading/unloading, cutting of billets, press operation, documents → hazards: fall on the same level, contact with falling object, contact with hanging objects, contact with moving parts, trapped between objects, contact with hot surface).
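The decomposition of a job into activities and hazards feeds directly into the model of section 2.2. As a minimal sketch (illustrative Python, not part of the original paper; the per-hazard probabilities below are hypothetical, not WORM values), equations (1) and (2) can be computed as:

```python
# Sketch of equations (1) and (2). A consequence at any hazard interrupts
# the activity, so reaching hazard n requires no-consequence outcomes at
# the preceding n-1 hazards. All probabilities below are hypothetical.

def activity_probs(hazards):
    """hazards: list of (p2, p3, p4) = (recoverable, permanent, death)
    probabilities per hazard. Returns (P1, P2, P3, P4) of eq. (1)."""
    P2 = P3 = P4 = 0.0
    survive = 1.0                      # running product of p1 over prior hazards
    for p2, p3, p4 in hazards:
        p1 = 1.0 - p2 - p3 - p4        # no consequence at this hazard
        P2 += p2 * survive
        P3 += p3 * survive
        P4 += p4 * survive
        survive *= p1
    return survive, P2, P3, P4         # survive is now P1, the product of all p1

def annual_probs(P, fm):
    """Annual probabilities for an activity repeated fm times a year, eq. (2)."""
    P1, P2, P3, P4 = P
    aP1 = P1 ** fm
    stop = 1.0 - (1.0 - P3 - P4) ** fm  # a permanent injury or death occurs
    aP3 = P3 * stop / (P3 + P4)
    aP4 = P4 * stop / (P3 + P4)
    aP2 = 1.0 - aP1 - aP3 - aP4
    return aP1, aP2, aP3, aP4

# Example: an activity with three hazards, performed 250 times a year.
P = activity_probs([(1e-4, 1e-5, 1e-6), (5e-5, 5e-6, 5e-7), (2e-4, 2e-5, 2e-6)])
aP = annual_probs(P, 250)
```

The four probabilities sum to one at both levels, mirroring the complementarity built into equations (1) and (2).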

Table 1. Activities and associated hazards of press extruder operator. Entries are exposure to hazard (hrs).

Activity (frequency)                      Trapped    Moving     Fall on   Falling    Hanging/   Hot/cold
                                          between    machine    same      object     swinging   surface or
                                          objects    parts      level     (crane)    objects    open flame
Cut of the billets (2 hrs/day)            0.6        0.5        2         0.2        0.2        2
Loading/unloading billets (2 hrs/day)     0.3        0.3        2         0.2        0.2        2
Completing documents (1 hr/day)           –          –          0.1       –          –          –
Press operation (3 hrs/day)               –          1          –         –          –          –
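The entries of Table 1 are exposure durations, while section 2.1 provides risk per hour of exposure. One way to combine them — an assumption of this sketch, since the paper does not spell out WORM's exact formula — is to compound a per-hour consequence probability over the hours of exposure (the rate used below is hypothetical):

```python
# Assumed combination rule (not stated explicitly in the paper): the
# probability of a consequence over t hours of exposure to a hazard with
# per-hour probability p is 1 - (1 - p)**t. The rate below is hypothetical.

def exposure_prob(p_per_hour, hours):
    """Consequence probability over `hours` of exposure to one hazard."""
    return 1.0 - (1.0 - p_per_hour) ** hours

# "Cut of the billets": 2 h of exposure to the hot-surface hazard, assuming
# a hypothetical 1e-7 per-hour probability of a recoverable injury.
p_cut_hot = exposure_prob(1e-7, 2)
```

For small per-hour probabilities this is close to p × t, so the 2-hour entries in Table 1 roughly double the hourly risk.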

c) Stretching operator: He checks the profiles arriving at the stretching press, and is responsible for the press operation and the transportation of the profiles to the cutting machines. He is exposed to the same hazards as the extruder press operator, but with different durations. There are 6 stretching operators working on an eight-hour shift basis.

d) Cutting operator: He operates the saw and cuts the profiles to the required length. He manually moves the cut profiles and arranges them on special cases, which are transported by crane to the ageing furnace. He is exposed to the following hazards: fall on the same level, contact with falling object from crane, contact with hanging or swinging objects, contact with moving parts of a machine, trapped between objects, and contact with falling object during manual operation. There are 6 cutting operators working on an eight-hour shift basis.

4.2 Surface treatment

There are seven different job positions in this unit: worker at the entrance (or exit) of the anodizing unit, crane operator, worker at the anodizing unit, workers at the horizontal or vertical painting units, painter, cleaner and forklift operator.

a) Worker at the anodizing unit: His activity is to transport profiles by trolleys to this unit. He is exposed to the following hazards: fall on the same level, hit by falling object, trapped between objects, contact with hot surface and move into object.

b) Worker at the entrance (or exit) of the anodizing unit: The worker at the entrance carries profiles from trolleys and ties them on vertical poles, where they will be anodized. At the exit of this unit, when anodizing has been performed, profiles are stored on pallets. He is exposed to the following hazards: hit by falling object, trapped between objects and move into object. There are 4 workers with the same job description in this unit.

c) Crane operator: He operates the crane that transports profiles to the anodizing baths. He is exposed to the following hazards: fall on the same level, hit by falling object, trapped between objects and contact with chemicals.

d) Forklift operator: His job is to transport profiles from the anodizing to the painting section. He is exposed to the following hazards: hit by falling object, hit by rolling object and struck by moving vehicle.

e) Workers at the horizontal (or vertical) painting unit: The worker's activity is to hang alloys on the painting unit. He is exposed to the following hazards: fall on the same level, fall from height, falling object, flying object, trapped between objects, move into object and contact with hot surface. There are 5 workers with the same job description in this unit.

f) Painter: He either paints the profiles himself, or prepares the painting powder and operates the automatic machine. He is exposed to the following hazards: fall on the same level, falling object from crane, trapped between objects, move into object and contact with chemicals.

g) Cleaner: His main activity is to clean the painting unit. He is exposed to the following hazards: fall on the same level, hit by falling object, trapped between objects, move into object and contact with chemicals.

4.3 Die section

There are five job positions in this section, where dies are treated either chemically or mechanically. These are the following: operators installing, sandblasting, chemically cleaning and hardening dies, and also operators using machine tools.

a) Operator installing dies on the press extruder: His main activity is to install the dies on the press extruder, but he also transports them by crane or by trolley. He is exposed to the following hazards: fall on the same level, hit by falling object from crane, contact with hanging or swinging objects, contact with falling object manually transported and trapped between objects.

b) Operator sandblasting dies: Apart from his main activity, which is die sandblasting, he is involved in material handling. He is exposed to the following hazards: fall on the same level, contact with falling object, contact with hanging or swinging objects, trapped between objects, contact with moving parts of a machine and contact with flying objects.

c) Operator chemically cleaning dies: Apart from his main activity, which is cleaning dies, he is involved in material handling. He is exposed to the following hazards: fall on the same level, contact with falling object, contact with hanging or swinging objects, trapped between objects and contact with chemicals.

d) Operator hardening dies: Apart from his main activity, which is hardening dies, he is involved in material handling. He is exposed to the same hazards as the operator who cleans dies.

e) Worker using machine tools: His activities are to use tools and to handle materials. He is exposed to the same hazards as the operator sandblasting the dies.

4.4 Packaging

There are three job positions for packaging profiles, which are the operator of the automatic packaging machine, a worker manually packaging and a helper.

a) Operator of the packaging machine: His activities are transporting profiles by trolleys, feeding the packaging machine and operating it. He is exposed to the following hazards: hit by falling objects, contact with moving parts, trapped between objects, move into an object, contact with handheld tool and struck by moving vehicle. There are 2 operators with the same job description in this unit.

b) A worker manually packages alloys and transports them by trolleys. He is exposed to the following hazards: hit by falling objects, trapped between objects, move into an object, contact with handheld tool and struck by moving vehicle. There are 2 workers with the same job description in this unit.

c) A helper cuts with a saw and packages manually. He is exposed to the same hazards as the operator of the packaging machine. There are 2 workers with the same job description in this unit.

4.5 Storage areas

There are three job positions in the storage area: forklift operators transporting alloys and profiles, and a worker.

a) Forklift operator transporting alloys: His activities are unloading alloys from trucks, arranging them in the storage area, transporting them to the process area and emptying bins with aluminium scrap. The hazards to which the operator is exposed are hit by falling object, hit by rolling object, fall from height and struck by moving vehicle. There are 2 forklift operators with the same job description in this unit.

b) Forklift operator transporting profiles: His activities are transporting profiles, loading and unloading trucks, piling profiles and transporting bins of scrap. The hazards to which the operator is exposed are hit by falling object, fall from height and struck by moving vehicle. There are 6 forklift operators with the same job description in this unit.

c) Worker of the storage area: His activities are crane operation, arrangement, transportation and labeling of profiles and also checking documents. The hazards to which the operator is exposed are hit by falling object from crane, hit by hanging or swinging object, fall from height, fall on the same level, move on object, trapped between objects and struck by moving vehicle. There are 10 workers with the same job description in this unit.

4.6 Machinery section

This section consists of a press operator, an operator of machine tools, an insulation fitter and a carpenter.

a) Press operator: His activities are press operation and material handling. The hazards he is exposed to are hit by falling object, move on object, trapped between objects and contact with moving parts of a machine.

b) Operator of machine tools: His activities are operating machine tools and material handling. The hazards he is exposed to are hit by falling object, hit by flying object, move on object, trapped between objects and contact with moving parts of a machine.

c) Insulation fitter: His activities are fitting insulation and material handling. The hazards to which he is exposed are the same as for the press operator, but with different durations of exposure.

d) Carpenter: His activities are operation with saws and tools for cutting and also packaging. The hazards to which he is exposed are hit by falling object, trapped between objects, contact with moving parts of a machine and contact with handheld tools.

5 RESULTS AND CONCLUSIONS

Occupational risk has been calculated for all job positions of the plant and is presented in Figures 4 and 5 and Table 2. Figure 4 presents the annual risk of death and Figure 5 the annual permanent and recoverable injury risk for all job positions in this plant. The operator at the entrance of the painting unit has the highest probability of death (3.25 × 10−5/yr), followed

by the worker in the storage area (2.18 × 10−5/yr) and the worker performing sandblasting of the dies (1.91 × 10−5/yr). The operator at the entrance of the painting unit also has the highest probability of recoverable injury (2.33 × 10−4/yr), followed by the worker in the storage area (1.92 × 10−4/yr) and the worker manually handling profiles at the painting unit (1.85 × 10−4/yr). The helper at the packaging unit has the highest probability of permanent injury (2.22 × 10−4/yr), followed by the operator at the entrance of the painting unit (2.03 × 10−4/yr) and the worker performing sandblasting of the dies (1.85 × 10−4/yr). The operators at the entrance of the painting unit and the worker in the storage area have the most dangerous jobs, owing to the high probability of death and recoverable injury, while the helper at the packaging unit is also regarded as having a dangerous job, owing to the high probability of permanent injury.

Figure 4. Probability of fatality of worker in the plant (/year).

Figure 5. Probability of recoverable and permanent injury for each worker (/year).

The high fatality risk of the workers at the entrance of the painting unit and in the storage area, and of the worker performing sandblasting of the dies, can be further analyzed in order to obtain the most serious hazards these workers are exposed to. Figure 6 presents these results, and it appears that all three workers are exposed to a high hazard of contact with falling objects from cranes.

Risk has also been calculated for the six units of the plant. This quantification is performed as described in section 2.4 for each unit. The storage area has the highest expected number of deaths (2.8 × 10−4/year) and recoverable injuries (2.76 × 10−3/year), followed by the surface treatment (2.54 × 10−4/year and 1.97 × 10−3/year, respectively). The extrusion area has the highest expected number of permanent injuries (2.4 × 10−3/year), followed by the surface treatment (1.76 × 10−3/year). It should be noted that the extrusion unit and the storage and surface treatment areas have most of the workers in the plant, namely 24, 18 and 14 respectively.

The overall annual company risk is equal to 8.44 × 10−4/year for fatality risk, 8.66 × 10−3/year for permanent injury and 8.18 × 10−3/year for recoverable injury.

Accident data for aluminium plants in Northern Greece were available, as presented by Doudakmani (2007). In total, 96 accidents occurred during the period 2001–2005: 2 permanent injuries and 94 recoverable injuries. 29 accidents occurred in the extrusion unit, 25 in the surface treatment, 23 in the storage areas, 11 in packaging, 5 in machinery and 3 in the die section. Figure 8 presents the percentage of recoverable accidents in each unit together with the contribution of the unit recoverable risk to the overall plant recoverable risk. In the storage area 24% of the total recoverable accidents have occurred, while this area has 27% of the total recoverable risk. The number of accidents does not give the same ranking of hazards as risk: according to risk ranking, first comes the storage area, then the surface treatment and then the extrusion unit, while if the number of accidents were considered for risk ranking, the extrusion unit would come first, followed by the storage area and the surface treatment unit.

Table 2. Occupational risk of the aluminium plant (/year).

Position in company              Workers   Fatality   Permanent injury   Recoverable injury
Extruder operator                6         5.44E-06   8.38E-05           5.30E-05
Extruder worker                  6         8.75E-06   1.11E-04           7.18E-05
Cutting operator                 6         5.28E-06   1.04E-04           6.15E-05
Stretching operator              6         6.53E-06   1.01E-04           9.33E-05
Forklift op. - billets           2         5.94E-06   4.28E-05           7.92E-05
Forklift op. - profiles          6         8.40E-06   5.59E-05           1.14E-04
Worker in storage                10        2.18E-05   1.22E-04           1.92E-04
Die installation                 1         6.50E-06   6.33E-05           5.40E-05
Die sandblasting                 1         1.91E-05   1.85E-04           1.17E-04
Die chemical cleaning            1         1.23E-05   1.01E-04           1.08E-04
Die hardening                    1         1.23E-05   1.01E-04           1.08E-04
Op. die machine tools            1         9.00E-06   1.43E-04           7.53E-05
Operator entrance painting unit  5         3.25E-05   2.03E-04           2.33E-04
Painter                          1         5.99E-06   4.98E-05           1.20E-04
Cleaner                          1         3.02E-06   5.86E-05           8.70E-05
Forklift op. - painting          1         3.60E-06   2.46E-05           3.23E-05
Manual painting                  1         3.39E-06   6.78E-05           1.86E-04
Crane op. - anodizing            1         1.37E-05   1.06E-04           7.01E-05
Worker - entrance anodizing      4         1.54E-05   1.10E-04           7.74E-05
Operator of press for tools      1         1.08E-05   1.34E-04           6.28E-05
Operator of machine tools        1         2.77E-06   1.10E-04           6.47E-05
Insulation fitter                1         4.90E-06   1.22E-04           6.99E-05
Helper packaging                 1         7.41E-06   2.22E-04           1.08E-04
Operator of packaging machine    4         7.21E-06   1.36E-04           9.11E-05
Worker packaging                 4         7.10E-06   9.95E-05           8.75E-05
Carpenter                        4         2.71E-06   1.74E-04           7.21E-05

Extrusion unit                             1.56E-04   2.40E-03           1.68E-03
Storage area                               2.80E-04   1.64E-03           2.76E-03
Die section                                5.92E-05   5.93E-04           4.62E-04
Surface treatment                          2.54E-04   1.76E-03           1.97E-03
Packaging unit                             2.93E-05   1.06E-03           4.86E-04
Machinery                                  6.47E-05   1.16E-03           8.22E-04
Overall risk                               8.44E-04   8.63E-03           8.18E-03
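The per-position probabilities and per-unit expected numbers in Table 2 are related through equations (3) and (4): annual activity probabilities combine into job-level probabilities, which are then weighted by the number of workers in each position. A sketch (illustrative Python with made-up inputs, not the values of Table 2):

```python
# Sketch of equations (3) and (4). A permanent injury or death during one
# activity stops the year's exposure; a recoverable injury does not.
# Input numbers are illustrative only.

def job_risk(activities):
    """activities: list of annual (aP1, aP2, aP3, aP4) tuples per activity.
    Returns (R1, R2, R3, R4) of eq. (3) for one job."""
    R1, R3, R4 = 1.0, 0.0, 0.0
    carry = 1.0                        # product of (aP1 + aP2) over prior activities
    for aP1, aP2, aP3, aP4 in activities:
        R3 += aP3 * carry
        R4 += aP4 * carry
        carry *= aP1 + aP2             # worker continues after a no/recoverable outcome
        R1 *= aP1
    R2 = 1.0 - R1 - R3 - R4
    return R1, R2, R3, R4

def company_risk(jobs):
    """jobs: list of ((R1, R2, R3, R4), workers) pairs. Returns the expected
    numbers of workers with recoverable, permanent and fatal outcomes, eq. (4)."""
    R2o = sum(R[1] * T for R, T in jobs)
    R3o = sum(R[2] * T for R, T in jobs)
    R4o = sum(R[3] * T for R, T in jobs)
    return R2o, R3o, R4o
```

With Table 2, `company_risk` would be evaluated over all twenty-six positions together with their worker counts.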

Figure 6. Risk of each hazard (/year) for the worker in storage, the operator at the entrance of the painting unit and the operator at die sandblasting. Hazards compared: fall from height, struck by moving vehicle, contact with falling object from crane, contact with hanging or swinging objects, trapped between, and contact with flying object.

Figure 7. Annual risk (death, permanent injury, non-permanent injury) in each unit.

Figure 8. Risk contribution and percentage of accidents in each unit.

An occupational risk assessment has been performed for an aluminium plant, and the risks of fatality, permanent injury and recoverable injury have been calculated for all job positions, plant units and the whole company. Risk prioritization can be achieved by quantifying occupational risk in the plant, and therefore the most dangerous jobs and units can be identified.

REFERENCES

Ale B.J.M., Baksteen H., Bellamy L.J., Bloemhof A., Goossens L., Hale A.R., Mud M.L., Oh J.I.H., Papazoglou I.A., Post J. & Whiston J.Y. 2008. Quantifying occupational risk: The development of an occupational risk model. Safety Science 46(2): 176–185.
Doudakmani O. 2007. Risk assessment of an aluminium plant. Master Thesis dissertation, Greek Open University.
Papazoglou I.A., Bellamy L.J., Leidelmeijer K.C.M., Damen M., Bloemhoff A., Kuiper J., Ale B.J.M. & Oh J.I.H. Quantification of occupational risk from accidents. Submitted to PSAM 9.
RIVM Report 620801001/2007. The quantification of occupational risk: The development of a risk assessment model and software. WORM Metamorphosis Consortium, 2008.


Risk regulation bureaucracies in EU accession states: Drinking water safety in Estonia

K. Kangur
King’s College London, Department of Geography, London, UK
Estonian University for Life Sciences, Tartu, Estonia

ABSTRACT: The objective of the research presented in this paper is to improve understanding of risk regulation
regimes in new European Union member states. In order to do so, the research focuses on drinking water safety
regulation in Estonia. This paper tests the importance of rules, cultures, capacities and design of regulatory
bureaucracies in determining the processes and outcomes of a risk regulation regime. The effects of the fragmented nature of the regime, deep-rooted dominating pressures, institutional capacities, and regulatory actors present in Estonian drinking water regulation are discussed in this paper.

1 RISK REGULATION IN THE EU ACCESSION COUNTRIES

This article looks at the drivers of a risk regulation regime functioning in an EU accession country, Estonia. The analysis specifically focuses on the bureaucracy as the determinant of regulatory regime effectiveness. The risk regulation regime approach is used in order to disaggregate the components and drivers inside the regulatory bureaucracy.

Risk regulation poses a particularly interesting domain for analysing regulation, as it tends to be constrained by limited resources, competing priorities, cognitive uncertainties, bounded rationalities, conflicting interests, ungovernable actors, and unintended consequences (Rothstein et al., 2006). Risk regulation in new member states offers particularly compelling material for analysis, as the application of the European Union Directives is more than just adapting the ''black-letter'' of the law.

The member states themselves, as well as the European Commission, set high hopes on the implementation of new regulation in accession countries. However, problems may appear with enforcing the European Directives, as new regulations have to fit with the deep-rooted dominating pressures, institutional capacities, and regulatory actors present in the state. The case of the drinking water safety regime ought to exemplify some of the problems that the regulatory system of an EU accession state, Estonia, is experiencing.

1.1 Case of drinking water regulation

In order to understand the functioning of the different components of regulation, we should first have a look at the institutions in place working for drinking water safety. Drinking water regulation cuts across governance regimes related to the Ministry of Social Affairs and the Ministry of Environmental Affairs in Estonia.

In the process of EU accession, the European Drinking Water Directive 98/83/EC (Directive 1998) (DWD) was adopted as an Estonian regulation. In order to protect public health, an Order for drinking water regulation (Ministry of Social Affairs 2001) sets risk-based standards for 42 parameters. These include maximum allowable concentrations for chemical and microbiological components and organoleptic characteristics of the water. The Health Protection Inspectorate under the Ministry of Social Affairs supervises quality and safety requirements of drinking water. The drinking water regime is closely linked to the water protection programmes of the Ministry of Environment. The allocation of the EU Structural support finances for drinking water and sewage system upgrading is decided in the Ministry of Environment Water Department.

The municipalities bear responsibility for providing drinking water supply and treatment services unless they have given the responsibilities to a private company. After privatisation of the state water supply system in 1995–1997, most of the municipal water

utilities were transformed into public limited companies. Public water supply companies serving the larger settlements belong to the Association of Water Works. Some smaller municipalities have joined in a municipal syndicate in order to be able to cope with the expensive development of the water supply systems.

This article tries to find explanations for the current functioning of drinking water safety regulation. The risk regulation regime approach will be applied in order to address the mechanics and dynamics, as well as the outcomes, of the regulation.

2 RISK REGULATION REGIME APPROACH

The risk regulation regime (RRR) framework looks at the governance of risk as a holistic system, to analyse why and at which stages new regulation might fail (Hood et al., 2004). The RRR perspective encompasses the complexity of institutional geography, rules, practices and animating ideas related to the regulation of a particular hazard.

The RRR approach ambitiously brings out the systematic interaction of the regime components of information-gathering, standard-setting and behaviour modification. The nature of the risk, the lay and expert understandings of the risk and its salience, but also interest groups' reactions to the division of the costs and benefits of risk regulation, may influence its functioning as forces from the context of regulation (Fig. 1).

Risk regulation as a whole is under-explored in the new member states of the European Union, let alone any comparative perspectives within the Baltic States region. The next section draws on some of the theoretical knowledge that is available on the Europeanisation processes and the functioning of risk bureaucracies in new EU accession states.

2.1 Functionality of the risk bureaucracy

The functioning of the core of government is crucial in order to be able to organise measures for mitigating environmental health risks. The dynamics of adopting EU regulation into the Estonian national legal system will be analysed. Looking at the information-gathering component of regulation should show the systems in place for identifying where the weaknesses might lie in putting the regulations to work. The transfer of rules into institutions and the practices of rule-enforcement need to be observed as a crucial step of risk regulation. The influence of regulatory design, bureaucratic cultures, and the availability of knowledge, finances and administrative capacities will be assessed as possible determinants of the functioning of the risk bureaucracy (Fig. 2).

2.1.1 The nature of rules

Safety rules apply in order to protect the public from environmental health hazards. The rules for regulation have to balance regulatory intervention by the state (the degree of regulatory bureaucracy) against market forces, but also set the threshold of risk tolerance. The feasibility analysis behind the rules determines their workability. The basis of many EU environmental health directives stems from the time before the 1990s. At that time, attention was focused on policy formulation, with little regard to the issues of integration and implementation outcomes (Weale et al., 2000). Little consideration of the national context when adopting the rules into state regulation may hinder the proper implementation of safety requirements.

Figure 1. External pressures of risk bureaucracy.
Figure 2. Internal drivers of risk bureaucracy.

2.1.2 Bureaucratic cultures

Regulatory cultures, operating conventions, the attitudes of those involved in regulation, and formal and informal processes influence the functioning of risk bureaucracies. The negotiations associated with EU accession demonstrated that, in the accession states, a general political passivity did not encourage major discussions about the suitability of EU regulations in the accession states (Kramer 2004; Pavlinek and Pickles 2004). In contrast, the national political elites were proactive in shifting their loyalties, expectations and political activities toward a new pan-European centre (Raik 2004).

It is argued that, whereas long-established, more populous EU member states are unlikely to rely so heavily on compliance for their legitimacy, standing and reputation, EU newcomers would naturally make greater efforts to implement faithfully any environmental directives and thus prove their credentials as cooperative, reliable and committed member states (Perkins and Neumayer 2007). By publicising and exchanging information about different practices and reports on the progress of enforcing the EU laws, it is hoped that reputation, mutual learning and competition mechanisms are being set in motion among member states (Heritier 2001).

One of the big changes in the procedural norms of regulation is the post-soviet countries' shift from command-and-control regulation to more discursive, non-hierarchical modes of guidance (Swyngedouw 2002; Skjaerseth and Wettestad 2006). An analysis of Central and Eastern European (CEE) environmental policies after 1989, however, reveals strong influences of state-socialist legacies in determining the content and practices of regulatory reform (Pavlinek and Pickles 2004).

2.1.3 Regulatory architecture

The regulatory structure entails the ways in which institutional arrangements are adopted for comprehensive and robust regulation. Rather than the nation-state setting its own regulatory agenda as before, in more recent times regulatory decision-making has shifted either upwards to the European Commission and Council (or beyond) or alternatively downwards to the regional or local level (Swyngedouw 2002; Löfstedt and Anderson 2003). The extensive complexity, rivalries and overlaps of competencies thus introduced may inhibit proper scrutiny of regulatory implementation (Hood 1986), and enforcing regulations can become rigid and non-reflexive (Baldwin and Cave 1999).

2.1.4 Regulatory capacity

Despite massive EU financial support and the long duration of the transitional periods, meeting the EU requirements is hindered by the inherent inefficiencies in both the local and the regional administrations of the accession countries (Weale et al., 2000; Homeyer 2004; Kramer 2004). The bureaucratic overload and the rush to adopt EU policies have been described as among the main reasons for insufficient policy analysis and the poor quality of legislation in Estonia (Raik 2004).

Due to the disproportionate development of different scientific agendas in the past (Massa and Tynkkynen 2001), environmental health impact assessments are often lacking or incomplete in Eastern Europe. Inadequate scientific capacities at the national level encourage regulators in smaller countries to copy the research and policy innovations of larger countries (Hoberg 1999).

The sections above demonstrate the importance of regulatory design, embedded cultures and capacities in determining the functioning of regulation in EU accession countries. The next section will look at how the importance of these determinants was assessed in the case of Estonia.

3 ESTONIAN EXPERIENCES WITH ADOPTING THE EU DRINKING WATER SAFETY REGULATION

For information gathering on drinking water safety regulation in Estonia, the study involved an interaction between documentary study and a snowballing programme of interviews, to locate relevant documentary material and to elicit attitudes and practice not contained in documents. Much of the information required was found in documents detailing the doctrine and practice of regulators, legal and statutory materials from the Estonian re-independence period, EU directives and the scientific literature. The open-textured analysis of these materials was followed by identifying and filling in the gaps through a semi-structured interview programme with 22 Estonian key actors from the key institutions of the regulatory components (ministerial level, regional and local administrations, and inspectorates) and representatives of the dominant actors in the context of regulation (scientific experts, water suppliers and organised interest groups). The fieldwork, conducted from June 2007 to March 2008, revealed complex insights into the motivations, objectives and expectations of the different governance actors with regard to drinking water safety and its new regulations.

3.1 Bureaucracies and the standard-setting component of drinking water regulation

The Estonian Ministry of Social Affairs Order on drinking water (2001) applies to all water supplies providing drinking water to communities larger than 50 people. In the process of adopting the DWD into Estonian regulation, national provisions and even stricter
safety requirements could have been added to make the regulation more protective of the Estonian public. A derogation period until 2010 was set for smaller operators (serving up to 5,000 people) to meet the drinking water quality requirements, to allow for the renovation of water purification and supply systems.

A group consisting of scientists, representatives of the larger water companies and Ministry of Social Affairs officials was formed as an expert group in order to draft the drinking water quality laws. Neither consumer protection groups nor representatives of the municipalities were present at the regulation drafting round-table. The scientists advocated the inclusion of more restrictions on hazardous substances (e.g., barium) and the setting of minimum values for water hardness. Water suppliers argued for more risk-based monitoring schemes. After the group negotiations, the Ministry of Social Affairs made the decision simply to adopt the DWD, with no modifications. Three main factors can be considered influential in why no Estonian national modifications were considered.

Firstly, the DWD adoption negotiations in Estonia were rushed because of the fast pace of EU accession. There was a situation in 2000-2004 where EU accession countries such as Estonia, Latvia, Lithuania, Slovakia, Slovenia, Malta and Cyprus were competing with each other for faster integration into the EU regulatory system. Thus, there was little time for consideration of any alternative policies to the established EU requirements. As a result, the need for the inclusion of additional safety parameters, or the applicability of the EU directive's requirements, was not analysed.

Secondly, as the Estonian bureaucratic system has a long tradition of elite-centred decision-making, its bureaucrats have had little experience in carrying out consultations with other relevant parties in the decision-making process. Skills for interacting with outside expertise, and knowledge about how to make use of the information gathered from the parties that were present in the national regulation design, were missing. This did not allow local expertise regarding drinking water issues to be properly addressed. A motive for not acknowledging national drinking-water-related studies may stem from the bureaucrats' educational background, as the interviewed scientists claimed. The general neglect of education on environmental health issues has contributed to bureaucrats' low awareness of drinking water issues in Estonia. Thus, the Ministry officials might not have been capable of appreciating the scientific contributions to drinking water regulation design.

The third aspect that may have driven the policy-makers to neglect any national modifications was the poor extent of communication with other levels of the regulatory regime. The Ministry of Social Affairs may have overlooked issues such as 23% of the mostly rural population not being under surveillance, simply because there might not have been enough pressure from the lower levels of the regulatory regime (municipalities, local inspectors) to explain the need for changing the rules. An understanding of the problems on the ground could easily become obfuscated within the hierarchical regulatory structure. This was especially the case during the accession period, when the regulatory cadres were changing rapidly.

3.2 Bureaucracies and information gathering on the change of states in drinking water safety

Having adopted the new regulations, there was a need to obtain information about the change of states in drinking water safety. Regular monitoring of drinking water quality by the operator companies should follow the Health Inspectorate's prescriptions. How frequently water suppliers are obliged to take full tests (covering 42 quality requirements) and minimal tests (18 quality parameters) depends on the size of the company's clientele. All economic water suppliers (providing water for from 100 to over 100,000 people) are expected to test four times a year under the minimum monitoring programme. Companies providing water to fewer than 10,000 people only have to take one full test per year. Companies serving 10,000 to 100,000 people take full tests three times a year, and operators with over 100,000 clients as often as over 10 times a year. Every five years, the European Commission checks the monitoring results, especially those of the larger water companies (serving more than 5,000 people).

In reality, neither the operators nor the Health Inspectorates test water as often as expected, as the representative of the Inspectorate commented in interview. Tests conducted by smaller companies are sporadic, and there are very few records on the drinking water quality provided through private wells. This means that, in addition to the uncontrolled drinking water sources from private wells (23% of the population), the water delivered by smaller operators (providing for 13% of the population) is also not under systematic surveillance for any potential health risks.

There are regional differences in water quality. At the time of adoption, the representatives of the water companies and the scientists pressed for more site-specific, risk-based monitoring requirements. Concentrating more frequent testing on those hazardous substances found in the area, and carrying out less testing of the other substances, could have alleviated the monitoring pressure on the water companies. The rationale for a site-specific monitoring programme was, however, not considered in designing the monitoring guidelines.

There are some reasons that might explain why the Inspectorates check smaller companies less frequently.
Firstly, there is insufficient financial and personnel capacity for controlling the large number of water suppliers. The representative of the Health Inspectorate recently interviewed stressed that the inspectors are under a heavy workload and that the finances to conduct full analyses are scarce.

Secondly, the Inspectorate may not be completely aware of, or may simply not give full consideration to, the risks that non-surveillance may pose to the people obtaining uncontrolled drinking water from smaller operators. The European Commission demands information only about the larger water companies from the Estonian Health Inspectorate. Therefore, inspectors have no incentives for pushing for wider control.

3.3 The functioning of behaviour-modification according to the drinking water regulations

The records on behaviour modification according to the drinking water regulations show that larger companies are now using modern purification techniques and are for the most part following the set quality requirements (Sadikova 2005). The scientific studies and the Inspectorate's data available on the water supplies in rural areas, however, show that many risk parameters exceed the defined health limits. For example, northern Estonian drinking water sources are affected by a radioactive bedrock zone (Lust et al., 2005), the north-eastern part of Estonia has excessive sulphates, and high fluoride levels are problematic in the western areas of Estonia (Karro et al., 2006). Shallow individual wells are generally subject to nitrogen compounds from fertilizers and to microbiological contamination (Sadikova 2005). These contaminants have been associated with gastrointestinal, circulatory and/or nervous system diseases. There are still relatively poor water infrastructure and purification systems in parts of Estonia. This, together with the poor information about drinking water quality, puts the health of up to 36% of the public (customers of smaller companies and users of private wells) at risk.

The enforcement of the drinking water regulations may be achieved through investments or strict regulation. Due to the structure of the bureaucratic enforcement system, the Ministry of Environment has the power and control over the allocation of finances with respect to water. Thus, environmental concerns have prevailed over those concerned with public health. The Ministry of Environment, with minimal consultation with the Ministry of Social Affairs, designed the holistic water management plans. The priority given to updating larger communities' drinking water supplies and sewerage systems has been cost-beneficial from the broader environmental and public health points of view, as the larger communities are safeguarded. However, the safety of the inhabitants of large rural areas and the clients of smaller water suppliers has been compromised due to the neglect of the health priorities that should be promoted by the Ministry of Social Affairs.

Another set of enforcement problems is related to the way regulation works. Investing regulatory attention in the larger companies may be explained through the Estonian bureaucratic culture, which prioritizes good performance with respect to the European Commission. As the records that interest the European Commission most (those on larger company compliance) show conformity with EU requirements, the Inspectorates are seen by Brussels to be efficient. Thus, credibility among the public seems to be a less important driver for the Inspectorates, especially as they are used to top-down decision-making traditions.

One could presume that, if those smallest suppliers and private wells were not monitored, there would be a system for informing the individual well users to take protective measures. Yet there have not been any information campaigns, nor has there been any demand for more information regarding the drinking water from the users. The view that private wells are something the owners have to manage themselves prevails. Support schemes and information dissemination for addressing the drinking water quality in individual wells have not been institutionalized through the Ministry orders either.

4 CONCLUSIONS

Aligning national safety standards with the European Union rules requires bureaucracies to make careful decisions about the organisation of regulatory responsibilities and about the approaches for attaining their objectives, as well as choices about the practical allocation of ever-scarce capacities to employ these strategies.

This paper focused on the bureaucratic determinants of the efficiency of drinking water safety regulation in Estonia. The standard-setting, monitoring and enforcement activities associated with drinking water regulation in Estonia may be described as a process of striving for EU allegiance. The search for power, patronage and reputation is the main compliance-driving force for inspectors at the national level, but may also determine state bureaucracies' behaviour at the European level. Deeply rooted bureaucratic cultures may function as gatekeepers for the take-up or neglect of more innovative non-hierarchical modes of enforcement. National scientific incapacities have carried over into bureaucrats' poor awareness of drinking water safety issues, leading to insufficient local policy analysis and the simple application of preset rules. The available financial and administrative capacities have led to a reinterpretation of the set standards and some neglect of smaller operators' quality control. Allocating scarce resources
and controls to the larger companies has benefited the viability of larger communities, but smaller and rural communities appear to have been ignored.

The complexity of the regulatory structure, spanning the EU expert committees, national levels of government and their sub-departments, may create an illusion of regulatory control, yet the real drinking water safety issues may remain unattended.

ACKNOWLEDGMENTS

This article presents some of the findings of the author's PhD research at King's College London, which is financed by the Estonian Academic Mobility Foundation and by Ministry of Science and Education grant SF 0170006s08.

REFERENCES

Baldwin, R. & Cave, M. 1999. Understanding Regulation: Theory, Strategy and Practice. Oxford: Oxford University Press.
Directive. 1998. Directive 98/83/EC of the European Parliament and of the Council of 3 November 1998 on the quality of water intended for human consumption. Official Journal of the European Communities: OJ L 330.
Heritier, A. 2001. New Models of Governance in Europe: Policy-Making without Legislating. Vienna: Renner Institute.
Hoberg, G. 1999. Sleeping with an elephant: the American influence on Canadian environmental regulation. In B. Hutter (ed.), A Reader in Environmental Law: 337–363. Oxford: Oxford University Press.
Homeyer, V.I. 2004. Differential effects of enlargement on EU environmental governance. Environmental Politics 13(1): 52–76.
Hood, C. 1986. Administrative Analysis: An Introduction to Rules, Enforcement, and Organizations. Sussex: Wheatsheaf Books.
Hood, C., Rothstein, H. & Baldwin, R. 2004. The Government of Risk: Understanding Risk Regulation Regimes. Oxford: Oxford University Press.
Karro, E., Indermitte, E., Saava, A., Haamer, K. & Marandi, A. 2006. Fluoride occurrence in publicly supplied drinking water in Estonia. Environmental Geology 50(3): 389–396.
Kramer, J.M. 2004. EU enlargement and the environment: six challenges. Environmental Politics 13(1): 290–311.
Lust, M., Pesur, E., Lepasson, M., Rajamäe, R. & Realo, E. 2005. Assessment of Health Risk caused by Radioactivity in Drinking Water. Tallinn: Radiation Protection Centre.
Löfstedt, R.E. & Anderson, E.L. 2003. European risk policy issues. Risk Analysis 23(2): 379.
Massa, I. & Tynkkynen, V.P. 2001. The Struggle for Russian Environmental Policy. Helsinki: Kikimora Publications.
Ministry of Social Affairs. 2001. Requirements for Drinking Water Quality and Control, and Analysis Methods. RTL 2001, 100, 1369. Tallinn: Riigi Teataja.
Pavlinek, P. & Pickles, J. 2004. Environmental pasts & environmental futures in post-socialist Europe. Environmental Politics 13(1): 237–265.
Perkins, R. & Neumayer, E. 2007. Implementing multilateral environmental agreements: an analysis of EU Directives. Global Environmental Politics 7(3): 13–41.
Raik, K. 2004. EU accession of Central and Eastern European countries: democracy and integration as conflicting logics. East European Politics & Societies 18(4): 567–594.
Rothstein, H., Huber, M. & Gaskell, G. 2006. A theory of risk colonization: the spiralling regulatory logics of societal and institutional risk. Economy and Society 35: 91–112.
Sadikova, O. 2005. Overview of Estonian Drinking Water Safety. Tallinn: Health Inspectorate.
Skjaerseth, J.B. & Wettestad, J. 2006. EU Enlargement and Environmental Policy: The Bright Side. FNI Report 14/2006. Lysaker: The Fridtjof Nansen Institute.
Swyngedouw, E. 2002. Governance, water, and globalisation: a political-ecological perspective. In Meaningful Interdisciplinarity: Challenges and Opportunities for Water Research. Oxford: Oxford University Press.
Organization learning

Can organisational learning improve safety and resilience during changes?

S.O. Johnsen
NTNU, IO Centre, Norway

S. Håbrekke
SINTEF, Norway

ABSTRACT: We have explored accident data from British Rail in the period from 1946 through 2005. Our hypothesis has been that safety is improved through learning from experience. Based on a quantitative analysis, this hypothesis is tested against the data using a simple regression model. We discuss the model and its limitations, benefits and possible improvements. We have also explored our findings in the light of qualitative theory from the field of organisational learning, and have suggested key issues to be explored to improve safety and resilience during changes, such as safety cases, standardisation of training, unambiguous communication and the sharing of incidents and mitigating actions.

1 INTRODUCTION

1.1 Learning from experience

The influence of experience on productivity has been observed based on data from the Second World War. A systematic improvement of productivity, and a reduction of the time needed to build airplanes and Liberty ships, was observed as experience was gained, see Yelle (1979). Doubling of cumulative production, i.e. cumulative experience, typically improved productivity by a fixed percentage, and thus reduced the unit costs by a fixed percentage. This was called the law of experience, expressed as:

"The unit cost of value added to a standard product declines by a constant percentage (typically between 20 and 30 percent) each time cumulative output doubles."

Productivity increases due to experience at a rate proportional to its value, and hence there is an exponential relationship between productivity and experience.

We suggest a similar relationship between safety and experience. As experience is built, our hypothesis is that accidents and incidents are reduced due to learning among the workforce and management. We propose that "accidents due to experience in operations" behave in the same way as "productivity due to experience in operations"; our hypothesis is: "when cumulative production is doubled, or in general when cumulative experience is doubled, accidents are reduced by a fixed percentage." Based on this hypothesis, there should be an exponential relationship between accidents and experience.

Experience can be expressed differently, depending on the type of industry to be explored. In an industrial production environment it could be the number of units produced. In transportation it could be the transport distance in miles/km, or experience could simply be taken to have a linear relationship to time.

According to the hypothesis, when plotting the number of accidents as a function of experience, we expect an exponential decrease in the number of accidents. Using time (t) as the basis for experience, the level of accidents should be approximately given by the exponential expression a · e^(-b·t). The constant a indicates the level of accidents at the starting point, and the constant b is an indication of the speed of learning from experience.

1.2 Proposed case: British Rail from 1946 to 2005

British Rail (BR) was the main line railway operator in Great Britain from 1946 to 1994. We have gathered data on almost all fatal railway accidents from 1946 to 2005, and we have analysed the data with respect to the hypothesis. We base our work on data used by Evans (2007). In the period from 1946 there have been major technological and organisational changes. One major change was the deregulation of British Rail in 1994, leading to a fragmentation of BR into more than 100 separate organisations.

The reason for using data from British Rail is the number of accidents during the period, making it possible to discuss and validate a quantitative hypothesis. The large changes introduced by deregulation in
1994 and their consequences on safety have also been explored.

We would expect an increase in accidents after 1994 due to the large-scale changes, but increased scrutiny of the effects and an increased focus on accidents could moderate the negative effect of deregulation.

2 MODEL DESCRIPTION

2.1 Individual and organisational learning

We use a definition by Weick (1991) to define learning, i.e.: "…to become able to respond to task-demand or an environmental pressure in a different way as a result of earlier response to the same task (practice) or as a result of other intervening relevant experience".

Based on this definition of learning, the actors that learn must sense what is going on, assess the response based on earlier responses or experience, and respond with a different behaviour.

Related to organisational learning, we focus on the result of organisational learning, in that we observe what the actors in the organisation are actually doing and the results of their actions related to the level of accidents. Due to learning, we assume that the new and safer practice is carried out regularly, that it is present in the behaviour of several individuals engaged in similar tasks, that it is included in procedures, and that new members of the organisation are instructed in the new practice. This description is based on the definition of organisational learning as described by Schøn (1983).

2.2 Development of accidents based on experience

Based on the preceding discussion, our proposed null-hypothesis is: in the long run, accidents follow an exponentially decreasing regression line based on experience, where experience is expressed by time.

In our quantitative analysis we have elaborated the hypotheses, see section 2.6.

In our proposed quantitative model, the level of historical accidents at time t (years after 1946), A(t), follows a regression line of the form

A(t) = a · e^(-b·t)    (1)

Here a and b are the parameters of the model. Note that the model considers t as a continuous variable, while in practice we only consider t for the distinct integer years. This is a weakness of the model.

Examining accident data from year to year may be challenging due to the stochastic nature of accidents and especially the variation in the number of fatalities from year to year.

2.3 Short term perturbations

A more complex model could take into account the short-term effects of the dynamics of learning and the focus on experience. Learning could be influenced by the increased alertness after an accident and by the complacency setting in when no accidents are happening. The increased alertness should lead to fewer accidents, while the increased complacency could lead to more accidents. This dynamic behaviour could be modelled as a sine curve superimposed on the exponential model. We have not established such a model yet, but we anticipate that the accident data would show such a relationship if plotted on a logarithmic scale.

2.4 Benefits and limitations

The simplicity of the model in (1) is the major argument for examining the railway data for learning by experience.

But such a simple model has its limitations, especially when analysing the results. Accidents are spontaneous and discrete, and would best be modelled by a Poisson distribution. Also, predictions of the number of future accidents must be made with care. However, the regression line approximation illustrates the accident trend over a rather long time period. Our purpose is to present the shape of a trend line rather than specific values (such as the number of accidents in a given year). Another benefit of the model in (1) is the various possibilities it offers for exploring learning, which can be quantified in different ways and not only by the number of accidents alone. This is further discussed in section 3.1.

Equation (1) can be transformed, on a logarithmic scale, into a linear model instead of an exponential one:

ln A(t) = ln a - b · t    (2)

This representation makes it possible to fit a linear regression line.

2.5 Model improvements

Further exploration of a model from an accident analysis or statistical analysis point of view can be done by more detailed regression analysis, lifetime analysis or time series modelling.

Evans (2007) uses Generalised Linear Models (GLMs) to analyse the train data. Here the numbers of accidents during one year are assumed to be Poisson distributed. In such a model the failure rate as a function of time can be presented, i.e. a model with only one parameter. Also, different types of GLMs can be compared in order to achieve a model
which gives the best explanation of the variation in the data.

The accident data can also be treated as lifetime data, where a lifetime is the time until an accident appears. Here NHPP (Non-Homogeneous Poisson Process) models or Cox proportional hazards regression are suitable alternatives.

Experience related to transportation could be defined not only as time, but as accumulated travel length as well. Experience in a production environment could be defined as time or as accumulated produced equipment. Both GLMs and Cox regression can take into account as many explanatory variables as desired, and tests of the significance of the variables in the model can be executed. Thus accumulated travel length (l) could be an explanatory variable in addition to time, e.g. for the exponential regression:

A(t) = a · e^(−(b1·t + b2·l))

Another alternative for analysing the train data is time series modelling. In the data analysis we will see that the observations seem to oscillate around the regression line. Thus e.g. an ARIMA model could be relevant to estimate and explore.

2.6 Hypothesis testing

The quantitative null-hypothesis proposed in section 2.2, that the accidents follow an exponentially decreasing trend, is that b is significantly different from 0 in the regression line described by (1).

To decide whether the values follow an exponential trend, i.e. whether our hypothesis is not rejected, we evaluate different measures. If the residuals show independence and seem to be sampled from a normal distribution, we also check whether the P-value, the confidence interval and standard deviation of b, the T-statistic and the R² value give reasons to reject the hypothesis. If one of the following criteria (all output from the analysis) is fulfilled, we will reject the null-hypothesis:

1. The T-statistic for b is in absolute value below 2.0, i.e. b is not a significant parameter (considered as 0) and our model reduces to A(t) = a.
2. The standard deviation of b is large compared to the parameter value; the values scatter.
3. The P-value (the probability of the accident level being totally independent of the particular year) is greater than 0.05.
4. The residuals are dependent, i.e. they are not independent and normally distributed and show a certain trend; the model does not fit the data well.
5. R² (which indicates how well the regression line fits the data) is less than 0.80¹.

If our analysis shows none of the above, we cannot reject the hypothesis that the accident level follows an exponentially decreasing trend, and the hypothesis is accepted.

¹ The value of 0.80 should not be considered a normative value. It is just a value used to explore the goodness of fit, in addition to what can be seen from the plotted estimated regression line together with the observed data. In this case we are satisfied with 80 percent or more explanation of the data variation in the model.

3 DATA ANALYSIS

3.1 Presentation of data

This analysis is based on data of all railway accidents, where the major part is movement and non-movement accidents. Collisions, derailments, overruns and collisions between trains and road vehicles are also classes of accidents included in the study. Train fires and other train accidents are not included.

The data represent both fatal accidents and the total number of fatalities in each accident. Every fatal accident registered has at least one fatality. In addition, the number of million train km per year is part of the data, and accidents per million train km is a natural expression for the accident level.

In our model, the sum of accidents and fatalities per million train km is used. As fatal accidents are the numbers of events in which the fatalities occurred, this may seem rather odd. However, experience and learning are most probably achieved both through a high frequency of accidents and through severe consequences of the accidents from which organisations learn. We have chosen an equal weighting of accidents and fatalities per million train km as the quantitative measure of accidents, so that both frequency and consequences are taken into account. It is most likely that learning increases both with the number of accidents and with the number of fatalities in an accident. If we were only considering the number of accidents, years with few accidents but many fatalities would give a wrong impression of the learning level. Analysing fatalities only does not take into account the learning due to the number of accidents with few fatalities.

Alternatively, a different weighting of the numbers of accidents and fatalities, respectively, may give a more appropriate picture of the accident levels. However, an exploration of different weightings shows that this does not have any significant influence on the analysis with respect to our hypothesis.
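The fitting and rejection criteria described in section 2.6 can be sketched in a few lines of code. The following is a minimal illustration on synthetic data (not the actual British Rail data): it fits the log-linear model (2) by ordinary least squares and computes the slope t-statistic and R² used in criteria 1 and 5.

```python
import math

def fit_loglinear(t, A):
    """Fit ln A(t) = ln a - b*t by ordinary least squares (eq. (2)) and
    return the estimates plus quantities used as rejection criteria:
    the t-statistic and standard error of the slope, and R^2."""
    n = len(t)
    z = [math.log(v) for v in A]               # move to the log scale
    t_bar, z_bar = sum(t) / n, sum(z) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (zi - z_bar) for ti, zi in zip(t, z))
    slope = sxy / sxx                          # estimates -b
    intercept = z_bar - slope * t_bar          # estimates ln a
    resid = [zi - (intercept + slope * ti) for ti, zi in zip(t, z)]
    sse = sum(r * r for r in resid)
    sst = sum((zi - z_bar) ** 2 for zi in z)
    r2 = 1.0 - sse / sst
    se = math.sqrt(sse / (n - 2) / sxx)        # standard error of the slope
    return {"a": math.exp(intercept), "b": -slope,
            "t_stat": slope / se, "r2": r2}

# Synthetic accident levels: an exponential trend with a small oscillation
# around it (the kind of short-range fluctuation seen in the real data).
years = list(range(60))                        # t = 0 corresponds to 1946
levels = [0.3 * math.exp(-0.05 * ti + 0.1 * math.sin(ti)) for ti in years]

est = fit_loglinear(years, levels)
# Criteria 1 and 5 of section 2.6: reject if |t| < 2.0 or R^2 < 0.80.
reject = abs(est["t_stat"]) < 2.0 or est["r2"] < 0.80
```

On this synthetic series the fit recovers b close to the generating value 0.05, with a large |t| and R² near 1, so the null-hypothesis would not be rejected by criteria 1 and 5; a full check would also examine the residual independence and normality, as criterion 4 requires.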

It is worth mentioning that the data reported were in 1991 changed from calendar year to fiscal year (1 April to 31 March), and in 2005 changed back again. This means that there may be some data missing or overwritten during the two transitions for the data used in this analysis. However, the accidents/fatalities and the train-km data have been collected over the same intervals, such that the transitions have very restricted influence on our analysis.

3.2 Data analysis

A least squares regression is fitted to the sum of the number of accidents and the number of fatalities per million train km in the years 1946 through 2005. The best fit is characterised by the least value of the sum of squared residuals, a residual being the difference between an observed value and the value given by the model. The regression estimates the parameters a and b. To evaluate the goodness of fit of the model to the observed data, we examine the R² value, the residuals, the P-value, the T-statistic and the confidence interval of the parameters, in order to verify the hypothesis stated.

We see that the observed values seem to nearly follow an exponential trend.

Results of the ANOVA test:

1. The T-statistic for b is in absolute value a lot greater than 2.0. The 90% confidence interval of b is (−0.0455, −0.0398).
2. The standard deviation of b is only 4% of the parameter value.
3. The P-values are far less than 0.05.
4. From the plot, the residuals seem to be independent and their values indicate a normal distribution. The independence becomes less reliable after the beginning of the 80's. In a period of about ten years all values are above the estimated line, followed by a period in the 90's where the values are all below the estimated line. The last ten years with observed data again seem to give independent residuals.
5. R² equals 0.91.

Based on the ANOVA results, the null-hypothesis cannot be rejected and is thus accepted. The estimated regression line from (1) is

A(t) = 0.27 · e^(−0.043t),

and the corresponding estimated linear model in (2) is

ln A(t) = −1.3 − 0.043t.

3.3 Short range fluctuations

We observe that the actual accident data seem to oscillate as some sort of sine curve around the estimated curve, see Figure 2. This could be due to natural variation, or a result of increased awareness and increased safety after an accident, followed by complacency and less safety in a period with no incidents or accidents.

3.4 Discussion of parameters

The estimated parameters a and b are of interest for comparing different periods or different industries. The value of a tells where we are at the start of the learning process, and the value of b says something about how fast we learn. The greater b, the faster the learning, but one should note that this may depend on a: if the data considered in the analyses are from a period where the industry is new, we expect a greater b than if the data considered are from a period where the industry is well established.
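The reported fit can be back-transformed and used directly. The sketch below is our own illustration (the helper name is not from the paper): it recovers a from the reported linear model, predicts the accident level for a given year, and interprets b as a learning rate via the halving time of the fitted curve.

```python
import math

# Reported estimates: ln A(t) = -1.3 - 0.043*t, i.e. a = exp(-1.3) and
# b = 0.043 in A(t) = a * exp(-b*t), with t in years after 1946.
ln_a, b = -1.3, 0.043
a = math.exp(ln_a)                     # ~0.27, the reported regression line

def predicted_level(year):
    """Model value of accidents-plus-fatalities per million train km."""
    return a * math.exp(-b * (year - 1946))

# Interpretation of b: under the fitted model the accident level halves
# roughly every ln(2)/b years.
halving_time = math.log(2) / b         # about 16 years
```

For example, the fitted level for 2005 (t = 59) is roughly a tenth of the 1946 level, consistent with the long-range decline shown in Figure 1.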

Figure 1. Long range fit of accident data from 1946 to 2005.

Figure 2. Short range fluctuations between actual accident data and estimated model (2).
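Comparing learning rates across periods or industries, as discussed in section 3.4, amounts to comparing the fitted b values. A small sketch on synthetic, noise-free data (the parameter values are invented for illustration only):

```python
import math

def estimate_b(t, A):
    """OLS slope of ln A on t, negated: the learning-rate parameter b
    in A(t) = a*exp(-b*t). A larger b means faster learning."""
    n = len(t)
    z = [math.log(v) for v in A]
    t_bar, z_bar = sum(t) / n, sum(z) / n
    sxy = sum((ti - t_bar) * (zi - z_bar) for ti, zi in zip(t, z))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    return -sxy / sxx

# A "young" industry period with fast learning versus a mature period
# with slow learning (noise-free, so the fit recovers b exactly).
t = list(range(30))
young = [0.50 * math.exp(-0.08 * ti) for ti in t]
mature = [0.05 * math.exp(-0.02 * ti) for ti in t]

b_young = estimate_b(t, young)
b_mature = estimate_b(t, mature)
```

The same comparison, applied to the British Rail data split into the 1946–1975 and 1976–2005 sub-periods, is what underlies the discussion of learning rates in section 3.4.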

This can be illustrated using the British Rail data, e.g. by dividing the period in two, i.e. one period from 1946 to 1975 and the other from 1976 to 2005. Analysing these two separately gives the following distinctions:

– The first period fits the data better than the last.
– The first period has a greater b value, and hence indicates stronger learning than the last.

We have also established a model based on cumulative distances travelled as a basis for experience, and have identified a good fit there as well.

The following two remarks must be considered when examining data with respect to accidents and learning:

1. The period (time and length) of analysis: What period in the history of the industry is considered? It is more likely that learning and development happen initially. Also, the length of the period plays an important role, due to the increased validity of a large sample size. For a huge dataset, outliers (observed values rather distinct from estimated values) will for instance have less influence on the regression line (both when establishing and when evaluating it). A small dataset without any particular trend will often result in a regression line rather similar to the horizontal/constant mean value line. Hence, it is important to consider a data sample of an appropriate size.
2. The number of accidents: If few accidents are observed, the model does not fit very well. This could be interpreted as the stochastic character of accidents.

In the analysis, we have considered the total number of both accidents and fatalities for all types of accidents. However, dealing with accidents and fatalities separately does not impact the acceptance of the hypothesis test. As expected, the accidents fit the model better than the fatalities, as fatalities naturally vary more. For the different types of accidents, it is only the class of movement and non-movement accidents which fits the model. This accident type also makes the greatest contribution to all accidents. The other two classes, train collisions and collisions between trains and road vehicles, do not fit the model, probably due to few accidents (the model does not handle 0-values) and large variation in the data.

4 DISCUSSIONS OF SOME RELEVANT ISSUES

4.1 Use of technology to improve safety

Technology could be an important factor in safety, but often new technology initially mimics old technology. An example is the implementation of electric motors replacing steam engines. At first the large steam engines were replaced by large electric motors in factories, using long chain drives to distribute the power in the factory. Later, small electric motors were distributed directly in the production process where they were needed. This removed the danger from the long chain drives. Thus the efficient and safe use of new technology was based on exploration and learning in a period that could take years. This is discussed further in Utterback (1996).

Improvements in working procedures and safety thus usually take place after some years of experience with new technology.

4.2 Consequences of our model of learning from accidents

We have proposed a hypothesis, suggesting a model where the level of accidents depends on experience; more precisely, the levels of accidents per year fit an exponential trend. We have based our model on the experiences from the railways in Great Britain. Our simple model of accidents at time t is A(t) = a · e^(−bt).

The hypothesis indicates that experience and the level of learning are key elements in reducing accidents.

If we use the iceberg theory from Heinrich (1959), we assume that accidents, near accidents/incidents and slips are attributed to the same factors. The distribution between these categories was suggested by Heinrich to be:

– 1 major accident
– 29 minor accidents
– 300 incidents, no-injury accidents
– 3000(?) slips, unsafe conditions or practices.

The results of more recent studies contradict this hypothesis; one issue is that severity depends on the energy involved, and the distribution of incidents differs between different types of workplace. However, our point is to focus on learning from minor accidents or incidents in order to avoid major accidents. Both accidents and deviations are of interest in relation to learning. If we manage to focus on and document minor accidents, incidents and slips in order to establish organisational learning, and succeed with this learning, it should imply that both major and minor accidents decrease.

A "Heinrich pyramid" with few major accidents but documentation, exploration and learning from many slips could be an indication of a learning and resilient organisation.

However, if we do not focus on, document and explore minor accidents, incidents and slips, it could lead to more serious accidents due to less learning from experience. Thus in an environment of less learning

we should find a "Heinrich pyramid" with more major accidents but few slips.

At the 2007 HRO conference in Deauville, May 28–30, examples of this were mentioned from aviation in the USA. Airlines that had explored minor accidents, incidents and slips had no serious accidents. Airlines that had documented few minor accidents, incidents and slips had several serious accidents. This could indicate poor learning. Since this is a comparison between companies in the same industry with the same energy level, it could indicate a different order of learning.

4.3 Level of analysis—Industry comparisons

We can use the suggested model and perform an analysis on different organisational levels and of different industries.

By different organisational levels we mean either: (1) within an organisation, looking at accident data from different parts of the organisation and analysing which part has the better organisational learning and thus safer practices; (2) comparing the same type of organisations within a country, identifying which organisations have the better organisational learning system; or (3) comparing the same type of organisations between countries, trying to identify which countries have the better organisational learning systems.

Organisations could be compared to discuss which kind of organisation has the better accident learning system. As an example, within aviation one could identify the airlines having the better accident learning system.

Different industries (e.g. aviation and shipping) could be compared to discuss which kind of industry has the better accident learning system.

In a system with better organisational learning, the consequence should be that the whole system learns faster and accidents decrease faster, i.e. a greater value of b. Improved organisational learning in this context should mean more information sharing among the involved actors in the whole system and a better understanding of the causes of accidents.

In a system with initially safer technology, the initial value of a should be lower, meaning that the level of accidents is initially lower.

In the aviation industry there is international sharing of incidents and accidents—a system that seems to have improved organisational learning, due to more incidents and more learning experiences across countries and different organisations. Thus it should be checked whether the aviation industry is learning faster and accidents are decreasing faster, i.e. with a greater value of b in the same period than in other industries.

Note that the values of a and b from the regression results analysing different industries are only comparable when considering the same type of industry (e.g. transport), as the response measures must be equal (e.g. accidents per million transport kilometres).

4.4 How to increase safety further?

If improved safety is due to experience and learning from accidents, as suggested by our model, safety may be improved by focusing more on exploring organisational learning and organisational inquiry, as discussed by Schøn (1983). This means that we should explore what has gone wrong based on actual data, in an open and testable manner. We should try to explore double-loop learning, doing reflection in a learning arena among relevant actors from both inside and outside the organisation.

Some consequences of that could be an improved focus on:

– Common training to ensure clear communication and a common understanding of what has happened.
– Exploring known incidents more thoroughly and ensuring that the employees are aware of what can go wrong and why. This must involve key actors in the organisation to ensure that organisational learning takes place. A systematic analysis of several incidents together could help identify common errors and thus increase learning.
– Improving and focusing on scenario training, by discussing and reflecting on what can go wrong in the future, especially with a cross-functional team involving actors from outside the organisational boundary.

4.5 What should be done in an environment with no accidents?

In an environment with almost no accidents, it becomes more important to analyse minor accidents, incidents and slips in an open manner. Sharing of incidents must be performed in order to improve learning. The use of narratives or story-telling could be used to create understanding and support of a learning environment.

Scenario training should also be explored to increase and sustain organisational learning.

4.6 Effect of the deregulation of BR in 1994

When we look at accident data from 1994, we do not find serious deviations from our model and hypothesis. There were no large-scale increases in accidents, thus indicating that mitigation of the changes was taking place, such as learning and sharing of experience. One argument to support this is the increased focus on safety in the railway system due to the deregulation and the increased public scrutiny of accidents and incidents.

The increased focus, information and discussion may have sustained the low level of accidents after 1994.

Other issues that have been mentioned, by Evans, are that the greatest number of fatal accidents are movement and non-movement accidents. Two particular things that will have contributed to the fall in these are: (1) a major reduction in the number of trains without centrally-locked doors (they have now gone completely). The doors of trains without central locking could be opened while the train was in motion, and there were fatalities due to passengers falling from trains, usually because they opened a door. Such events are now virtually non-existent. (2) Fewer track workers and better management of track maintenance work, leading to fewer fatal accidents to staff working on the track.

4.7 Further exploration of the model

In this paper we discuss neither the use of the model in prediction nor the uncertainty/confidence bounds. Data from a restricted period, where the latest years are excluded, are fitted to a model which can then be used to predict the data (most likely in the future). The predicted values can then be compared with the actual historical values to see whether or not they are inside the confidence bounds, and this verifies the predictive properties of the model.

4.8 Objectivity, reliability and validity

The proposed hypothesis can be tested on different industries, and there are clear objective criteria to be used to check whether the hypothesis is correct, as described in 4.1. We therefore propose that the result is objective.

We propose that the result is reliable, due to the quantitative approach we have used.

Validity is difficult to ascertain; the hypothesis should be explored in different industries and on different data to be assured of validity. In Duffey (2002) there are several examples validating these issues.

5 PRACTICES FROM THE RAILWAY INDUSTRY

We have discussed some organisational learning issues with the railway industry, related to changes both locally and changes involving international railway traffic. Key issues identified during interviews and workshops have been related to new practices in communication, procedures, responsibility, incident reporting, training and common risk perceptions.

The changes have been related to the changes in British Rail, but also to railway traffic from Britain to France.

We have structured the responses based on how learning takes place, e.g. issues related to sensing what is going on based on communication and common training, who has the responsibility to act, and how to assess the response based on earlier responses or experience. The main practices we have identified are in accordance with our hypothesis related to learning, i.e. a strong focus on practices supporting and enabling organisational learning, that is, a focus on:

– Sense what is going on—based on unambiguous communication and common training;
– Unambiguous responsibility;
– Respond based on earlier experience—unambiguous procedures;
– Support learning based on experiences and incident reporting across organisations;
– Organisational learning—common procedures and perceptions.

This is explored in the following.

1. Sense what is going on—unambiguous communication: The use of protocols or formalised communication templates is essential when communicating across interfaces. Pre-determined protocols and forms reduce difficulties in understanding and learning, and should be mandatory.

2. Sense what is going on based on common training: Standardised training for operators, focusing on communication and the handling of deviations. It is especially important to establish common models or perceptions of how incidents happen, as described by Schøn (1983). It is also important to share an understanding of "culture", e.g. perceptions, knowledge and behaviour, between the different companies. Good experience has been obtained with the use of scenario training and simulators in addition to normal training. In a simulator, scenarios including deviations from normal operations can be tested, and the other side of the interface can be included.

3. Unambiguous responsibility: Ambiguous or unclear responsibility (e.g. "grey areas" of responsibility) should not be tolerated. It is essential to have perfect clarity in task definitions and responsibilities across interfaces, especially if learning and exploration are to take place.

4. Respond based on earlier experience—unambiguous procedures: It is essential that different parties harmonise their procedures so that operators adopt the same behaviour both during normal operations and during exceptions. Based on our hypothesis, learning is based on incidents and accidents. To improve learning across boundaries it is important to decide on one set of rules across boundaries, and to ensure both that the basic rules are the same and that the common understanding of the rules is the same. Translation of a rule from English to French and then back to

English could be one way of exploring possible differences in understanding and of increasing learning in cross-border cooperation between Great Britain and France. Rules and procedures that are actually used should be kept as "living documents", meaning that the documents should be updated by the working professionals themselves.

5. Support learning based on experiences and incident reporting across organisations: It should be a clear obligation to report any condition that could imply a risk for other companies. All parties must share their databases regarding events that could improve or degrade safety, and also share the resulting recommendations. This would ensure a possibility for common learning and an increased level of safety for all operators. Both interfacing organisations will benefit from the ability to admit that they are different, without inferring value or preference. One partner's solution is not necessarily the only right solution; one should share experiences (from accidents, fatalities and good practices) to provide an opportunity to learn from each other.

6. Organisational learning—common procedures and perceptions: Harmonisation of procedures should be done by project teams across organisational boundaries. Experience shows that groups with representatives from each of the companies (or countries) involved in operations should be established and meet face to face, to create confidence, common understanding and a good learning environment, and to establish harmonised procedures. As part of organisational learning it is important to establish a proactive and common risk perception and understanding: it would be helpful for two different organisations or actors to agree on a common model (a "common mental model") for identifying and managing risks and the resources to control risks. Some of the most difficult issues to resolve are due to differences in the conceptualisation of risks and risk management.

REFERENCES

Duffey, R. & Saull, J. 2002. Know the Risk: Learning from Errors and Accidents: Safety and Risk in Today's Technology. Butterworth-Heinemann. ISBN-13: 978-0750675963.
Evans, A.W. 2007. Rail safety and rail privatization in Britain. Accident Analysis & Prevention 39: 510–523.
Heinrich, H.W. 1959. Industrial Accident Prevention—A Scientific Approach. McGraw-Hill, New York.
Schøn, D.A. 1983. Organisational learning. In Morgan, G. (ed.), Beyond Method. Sage, Beverly Hills: 114–129.
Utterback, J.M. 1996. Mastering the Dynamics of Innovation. HBS, Boston.
Weick, K.E. 1991. The non-traditional quality of organizational learning. Organization Science.
Yelle, L.E. 1979. The learning curve: Historical review and comprehensive survey. Decision Sciences 10: 302–328.


Consequence analysis as organizational development

Berit Moltu
SINTEF Technology and Society, Trondheim, Norway

Arne Jarl Ringstad
StatoilHydro ASA, Trondheim, Norway

Geir Guttormsen
SINTEF Technology and Society, Trondheim, Norway

ABSTRACT: In this article we argue that consequence analysis is about organisational change and should therefore methodologically be treated as part of it. Traditional methods of consequence analysis are not sufficient. HSE (Health, Safety and Environment) is also about Organisational Development (OD). We consider this argument to be important both in information and data gathering, in decision making and participation, and in the safe and secure implementation of suggested changes.
The article is based on R&D projects done in the Norwegian oil company StatoilHydro ASA under the heading of Integrated Operations (IO). The strategy was to choose several pilot projects in one asset to be analysed for consequences as far as HSE was concerned. The idea was further to spread the successful pilots to other assets after a successful Consequence Analysis (CA).
Our approach to understanding organisations is inspired by Science and Technology Studies (STS) and sees organisations as complex seamless networks of human and nonhuman actants (Actor Network Theory (ANT), Latour 1986). We understand organisations as the ongoing process created by the interests of different actants such as ICT, rooms, work processes, new ways to work, and ways of being organised and managed. This, in addition to an understanding of communities of practice (Levy & Venge 1989), is the starting point for discussing CA as part of OD. Another method used is based on the risk analysis tool HAZID (Hazard Identification), which is used in the Norwegian offshore industry as a planning tool to identify hazardous factors and to evaluate risk related to future operations. HAZID was used as a basis for collecting qualitative data in our concept of consequence analysis. Different methods were used to identify positive and negative consequences related to the implementation of IO in two cases: the steering of smart wells from onshore, and a new operation model on an offshore installation.
We observed that the methods had qualities beyond mere evaluation of consequences. During the interviews on smart wells, different groups of actants started to mobilise in response to the change process from pilot to broad implementation; new routines and improvements of the pilot were suggested by the production engineers, even though they had been operating along these lines for years. Now that the pilot might go to broad implementation, different interests initiated a change of the pilot from the process engineers.
During the interviews and the search conferences in the case of the new operational model, we observed that the discussions generated a new common understanding among the informants about the pilot and the whole change process. The method helped to clarify what the changes would mean in day-to-day operation, how they were going to work and what the potential consequences could be. It also generated a new understanding of why changes were proposed.
All these questions are important issues in change management, and elements that can be discussed in relation to organisational learning. Consequence analysis can be a useful change management and organisational learning tool, if the traditional design and use of such analyses can be changed.

1 INTRODUCTION

The oil and gas industry is undergoing a fundamental change in important business processes. The transition is made possible by new and powerful information technology. Traditional work processes and organisational structures are challenged by more efficient and integrated approaches to exploration and production. The new approaches reduce the impact of traditional obstacles—whether they are geographical,

organisational or professional—to efficient use of an organisation's expert knowledge in decision making (Kaminski, D. 2004; Lauche, Sawaryn & Thorogood, 2006; Ringstad & Andersen, 2008).

Descriptions of the new approaches exist elsewhere (e.g. Upstream technology 2007), and will not be repeated here. The approaches can be subsumed under the heading Integrated Operations (IO). Numerous definitions of IO exist in the industry. In StatoilHydro (2007) IO is defined as: "New work processes which use real time data to improve the collaboration between disciplines, organisations, companies and locations to achieve safer, better and faster decisions."

It is generally assumed that improved decision making processes in turn will lead to increased production, less downtime, fewer irregularities, a reduced number of HSE-related incidents, and in general a more efficient and streamlined operation.

The fundamental changes in work execution as a result of IO are illustrated in Figure 1 and are briefly described below:

– The old assembly line work mode is seriously challenged by IO. More tasks can be performed in a parallel fashion, thereby reducing total time consumption. From a decision making perspective, parallel work execution means a more iterative and relational process.
– Multidisciplinary teamwork becomes more critical as the availability of real time data increases, and work is performed in a parallel fashion more or less independently of physical location.
– Real time data at different locations make it possible for personnel at these locations to cooperate based on a shared and up-to-date description of the operational situation.
– Videoconferencing and ready access to data and software tools reduce the need for specialists to be on location. This increases the availability of expert knowledge for operational units, and reduces the time it takes to muster the experts.

Figure 1. Changes in work execution as a result of IO: from serial to parallel work execution; from single-discipline to multidiscipline teams; from dependence on physical location to independence of physical location; and from decisions based on experience to decisions based on real time data.

The diverse and fundamental changes associated with IO require a careful and deliberate implementation strategy and adequate tools and methods. A method that facilitates analysis and prediction across a broad range of consequence categories is deemed particularly useful.

However, many traditional consequence analysis methodologies are concerned with one consequence category (e.g. safety or cost) and/or are based on one particular approach to data collection and analysis. Although it would be possible to utilise a number of different consequence analyses prior to any IO implementation, it was decided to develop a new methodology for consequence analysis particularly suited for the purpose.

The new method should:

– be suited for analysis of a broad range of consequence categories;
– be flexible, i.e. allow the analyst to use different types of data and data collection methods, and be applicable across analysis objects (e.g. a refinery and an offshore installation);
– involve personnel affected by IO in the analysis, to ensure participation in the change process.

The IO program of StatoilHydro has chosen a strategy of going from pilot to broad implementation in its efforts to achieve the visions of IO. A practice in one of the assets that is exemplary with respect to the IO characteristics is chosen as a pilot. This practice is first evaluated in order to be defined as a pilot. Then a CA is carried out, with conclusions and recommendations for broader implementation or not. The decision is to be taken by the process owners.

This paper comprises two case studies exemplifying the new method in use, and a general discussion of the pros and cons of the new method based on several analyses performed in StatoilHydro in 2007. This discussion gives special emphasis to how CA and OD might be seen together, as two sides of the same process.

2 METHODS IN CONSEQUENCE ANALYSIS

In the following we present the theoretical approach, or attitude, underlying the method, i.e. Actor Network Theory; the consequence categories used as a basis for the CA; the structure of the method; and the practical data collection techniques used. The method was developed and used in two pilot cases in the IO program of StatoilHydro: "Steering of smart wells from onshore" at Snorre B, and "New operation model" at Huldra Veslefrikk.

2.1 Actor network theory—identifying actants and controversies

In the IO case or IO pilot of "Steering of smart wells from onshore" from the field Snorre B in StatoilHydro

814
ASA, SINTEF used a new approach to the CA method, named Actor Network Theory (Latour, 1986) and based on Science and Technology Studies (STS), since this pilot is very much about the development of a new, complex technology where, as we will see, there may be many different technological solutions to the issue of smart wells.

This pilot was also about the complex interplay between technology and organization, "a seamless web" (Callon, 1986) of how to use and operate this technology, i.e. a network of different actants, human and non-human, and how they chain together in different kinds of "heterogeneous engineering" (Callon, 1986). Studying the local community of practice (Lave and Wenger, 1991), its interactions, negotiations and struggles in depth where this technology is in use gives an important input to the understanding of the pros and cons of the pilot, and the potential broader HSE consequences of such a pilot. The case showed that the technology was not yet frozen when the CA started. On the contrary, the work on the CA made it develop further.

2.2 CA method—visualization of consequence categories

A basis was also to identify both positive and negative consequences related to the categories "organization and management", "personnel and competence", "operations and regularity", "HSE", "economy" and "company reputation", which is a broader set of categories than is normal in CA.

Figure 2 illustrates a linearity, or cause-and-effect chain, between the main factors used as an analytical tool. In studying the communities of practice, or "real life", in these cases, we see that this is of course a messier matter (Law, 2004). One of the main activities of researchers is to tidy up the mess, and linearity between categories might be one way to tidy up. This linearity was the basis for the method's further procedure. But first, as a starting point for identifying the different aspects of potential consequences within these categories, it is important to identify the most important groups of actants (Bijker & Pinch, 1987) participating in the pilot studied. Then the most obvious controversies must be identified. A quick visualisation of these is often useful as a draft to be changed as the analysis goes on. The usefulness of identifying the controversies at an early stage is also to be able to investigate early whether there is a connection between a controversy and the risk level of the potential consequences.

The method then follows a well-known, phase-divided, linear, stepwise procedure familiar from many analyses, evaluations and change programmes:

1. Identification of groups of actants and main controversies.
2. Qualitative data collection: interviews and search conference.
3. Use of a "consequence matrix" to help sort raw data based on the factor categories "organization and management", "personnel and competence", "operations and regularity", "HSE", "economy" and "company reputation".
4. Analysis of data, using "ANT analysis", "cluster analysis" and chains of argumentation.
5. Evaluation of risk related to the identified negative consequences vs. positive consequences.
6. Conclusions and suggestions.

2.3 Data collection—interviews and search conferences

SINTEF further developed the CA method in order to evaluate a new integrated operation model (IO-model) to be implemented in StatoilHydro ASA's Huldra Veslefrikk organisation. The method aimed to identify potential positive and negative consequences related to the implementation of the new operation model according to the consequence categories mentioned above (fig. 2).

In both cases qualitative data were collected through document studies, thirty individual interviews (Smart wells, Snorre B), seven individual interviews (New operation model, Huldra Veslefrikk) and one search conference (Huldra Veslefrikk) (e.g. Emery & Purser, 1996) with relevant personnel from the Huldra Veslefrikk organization. A search conference was not carried out in the smart well case due to the initially high controversy about this pilot.

The interviews can be performed either with individual informants or in groups. The choice depends on how important the controversies are, and on how complex the understanding of the operational practice and the communities of practice that follow from it is. A combination might also be a good solution.

Figure 2. Visualization of the consequence categories as a basis for the analysis.
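For illustration only, the "consequence matrix" of step 3 can be pictured as a small data structure that files each informant statement under one of the six factor categories, attaching the underlying presumption to positive statements and a suggested compensating action to challenges. The sketch below is ours, not part of the method as published; the category names come from the paper, while all class names and example statements are hypothetical.

```python
from dataclasses import dataclass, field

# The six factor categories used to sort raw statements (named in the paper).
CATEGORIES = [
    "organization and management", "personnel and competence",
    "operations and regularity", "HSE", "economy", "company reputation",
]

@dataclass
class Statement:
    text: str
    positive: bool
    # Presumption underlying a positive statement, or a suggested
    # compensating action for a challenge (negative statement).
    note: str = ""

@dataclass
class ConsequenceMatrix:
    cells: dict = field(default_factory=lambda: {c: [] for c in CATEGORIES})

    def add(self, category: str, statement: Statement) -> None:
        if category not in self.cells:
            raise ValueError(f"unknown category: {category}")
        self.cells[category].append(statement)

    def challenges(self, category: str) -> list:
        # Negative statements in a category, i.e. candidates for compensating actions.
        return [s for s in self.cells[category] if not s.positive]

# Hypothetical example statements:
matrix = ConsequenceMatrix()
matrix.add("HSE", Statement("Fewer offshore trips for specialists",
                            positive=True, note="video links stay reliable"))
matrix.add("personnel and competence",
           Statement("Less hands-on experience offshore", positive=False,
                     note="rotate onshore staff through offshore shifts"))
print(len(matrix.challenges("personnel and competence")))  # → 1
```

Such a structure mirrors how the matrix is filled in collectively on a large screen during group interviews and search conferences, one statement at a time.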
IO project reports and descriptions of the newly proposed IO models were the basis for the document studies in both cases. As a basis for the interviews and the search conferences we used the proposed change measures needed to implement the new model as an interview guide. In addition we initially asked for the history of the initiation of the pilot, to get access to the main actants and controversies and to get in touch with the most important arguments.

Group interviews have the basic aim of gathering information, and might be less time-consuming than individual interviews, but might give access to more supervised information. The search conference as such is more a technique for creating an arena for a common dialogue on particular issues. A combination of these techniques has often been seen to be fruitful.

In the IO change processes we have seen conflicting interests between management representatives and trade unions. The search conference can be a useful tool to overcome these "change process barriers". The search conference can create openness among the participants (show that things are what they appear to be); create an understanding of a shared field (the people present can see they are in the same world/situation); create psychological similarity among the representatives; and generate mutual trust between the parties. All these elements are found to be important for achieving effective communication within and between groups (Asch, 1952), and in this case for bringing the planned change process forward in a constructive direction.

2.4 Use of consequence matrix

In order to sort hypothetical negative and positive consequences after implementation of the suggested pilot, we used a matrix to help us sort the informants' statements within the categories "organization and management", "personnel and competence", "operations and regularity", "HSE", "economy" and "company reputation". For the positive consequences we tried to describe the presumptions underlying these statements, and for the challenges found we tried to suggest compensating actions.

New ICT (Information and Communication Technology), with the use of large screens, gives new possibilities for search conferences. In group interviews and in the search conferences these matrixes can be created collectively and shown on a large screen, which can foster enthusiasm, participation and ownership of the results. In a case like this there will be many arguments, and the matrixes give a convenient way to "tidy the messy arguments" and easily provide an overview. The concurrent production of this matrix in a search conference might in addition be time-saving. A further step might be to use the search conference and the consequence matrix to also start the analysis. This is the point where employees often feel that participation ends, which creates a situation of resistance at the time of implementation.

2.5 Analysing data

To analyse the data, one of the methodological starting points was to find the controversies and paradoxes about the smart well technology, and to identify the different groups of actants involved in the controversies. By identifying the different controversies one also identifies the interests that are connected to them, and the constellations of interests in which the different actants are chained. Interests are to be seen as the "driving forces" for change. Interests are what make things happen, in both a positive and a negative way; interests are also what make things not happen. If one wants to understand the OD aspects of a CA, one has to understand the main interests. And if one wants to do organizational change, one has to be able to play with the main interests, to play the game, to chain in with the different interests in different enrolments and translations (Latour, 1986), to make a strong enough chain to be able to do change management; if not, it is all in vain.

Part of the analysis was also to describe the presumptions underlying the positive consequences found, and to suggest compensating actions for the challenges found. The main stakeholder in the analysis is in these cases SINTEF. Consequence analysis is something in between an evaluation and scenario thinking, and trained methodological and analytical skills are of course required. But a higher degree of participation in the analysis, and testing out the analysis, might be a fruitful idea, and with the search conference as a tool, a possibility that is not so far away. But the final responsibility for the analysis should rest with the action researchers.

In addition to identifying the different aspects of potential consequences of the pilot mentioned above, positive as well as negative, the CA has to rank the different arguments by importance, e.g. by risk level or sometimes by interests (as seen in figure 4). One way might be to find which argumentations and chains of argumentation are used, by visualizing the arguments through "cluster analysis". We often end up with only a few central arguments as the basis for the conclusion.

The "cluster analysis" aimed to find clusters in statements regarding proposed negative consequences related to one or several IO-model measures. As a result it was easier to see how several IO-model measures could cause interaction effects (e.g. severe negative consequences) within the different categories shown in figure 2.
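The last steps, clustering statements and then evaluating the risk of each cluster of negative consequences, can be sketched as a qualitative scoring exercise. The cluster names and the 1 (low) to 3 (high) scales below are invented for illustration and do not come from the analyses; only the idea of ranking clusters by a probability-times-consequence score reflects the text.

```python
# Illustrative clusters of negative-consequence statements; the names and
# qualitative (probability, consequence) scores are invented, not from the paper.
clusters = {
    "unclear onshore/offshore responsibility": (3, 2),
    "loss of offshore hands-on competence": (2, 3),
    "ICT downtime stops well operations": (1, 3),
}

def risk(probability: int, consequence: int) -> int:
    # HAZID-style qualitative risk: probability multiplied by consequence.
    return probability * consequence

# Rank clusters from highest to lowest risk; the top few become the
# central arguments on which the conclusion is built.
ranked = sorted(clusters, key=lambda name: risk(*clusters[name]), reverse=True)
for name in ranked:
    print(name, risk(*clusters[name]))
```

In the actual method the scoring was done as a qualitative group judgement rather than a computation; the sketch only shows the ordering logic.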
The clusters of negative consequences were then risk evaluated, based on the risk analysis tool HAZID (Hazard Identification). The HAZID tool defines risk as probability multiplied by consequence. A risk level for each consequence cluster was established by a qualitative evaluation of how probable it was for each cluster to occur and how critical it would be, e.g. how large the interests concerned by this consequence are.

All steps in this design, except for the risk evaluation of clusters, were carried out in close cooperation with our informants and the StatoilHydro IO-development team.

The extended focus on employee involvement through interviews and search conferences must be characterized as a relatively new approach within CA designs. The focus on analysing consequences in a broader perspective than just HSE must also be characterized as new, as must the ANT approach to CA. The effects of this kind of approach will be discussed further in this paper.

The method as such consists of well-known elements, but the newness of the method lies in the combination of these well-known elements: ANT, interviews, matrix, search conference, cluster analysis, etc.

3 THE USE OF CONSEQUENCE ANALYSIS (CA) DATA IN ORGANIZATIONAL DEVELOPMENT (OD) IN STATOILHYDRO'S IO PROGRAM

In the following we describe two different cases based on the methodology described in chapter 2.

3.1 Case 1: "Implementation of a new integrated operation model in StatoilHydro ASA's Huldra Veslefrikk organization"

The StatoilHydro installations Huldra and Veslefrikk were classified as "tail production installations", meaning they were in the last phase of production, initiated to prolong the economic lifetime of a field. This situation can mean rising production costs and potentially lower profitability. In order to maintain profitability, the Huldra Veslefrikk organization had to increase the efficiency of its operations and cut administrative costs. Implementation of IO was then seen as a solution, and the organization became a part of StatoilHydro's pilot program for implementation of IO models.

An organizational development (OD) process was started to find a suitable IO model for the organization. As a basis for the model design, the following design criteria for the new IO model were emphasized:

1. Administrative work tasks offshore to be transferred to the onshore organization
2. To make the onshore support organization more operative (e.g. manning the onshore support team with offshore personnel in rotation)
3. To make the offshore organization more operative (e.g. more time spent on operations, less on administrative tasks)
4. To enhance handovers between shifts offshore by improved planning onshore

The OD process was carried out using a method called "CORD-MTO" (Coordinated Offshore operation and maintenance Research and Development—Man-Technology-Organization) as a basis. The process leading up to a proposed IO model turned out to be filled with conflicts between the project management group and labour union representatives. This was mostly due to change management issues and what the labour unions characterized as an unfair OD process. We, as an independent party in the process, also observed a kind of uncertainty among employees about what the new organization would look like, and what consequences the structural changes would have for each individual employee. We have observed this kind of "change anxiety" in many change processes, but in this case we observed that the process of carrying out a CA had, we believe, an unexpected effect upon this "change anxiety".

We observed that the CA method used had qualities beyond mere evaluation of consequences. During the search conferences we observed that the discussions generated a new common understanding among the informants (employees) about the whole change process and the newly proposed operation model. The method helped to clarify what the changes would mean in day-to-day operation, how they were going to work and what the potential consequences could be. It also generated a new understanding of why the changes were proposed. All these issues are important in change management, and they are elements that can be discussed in relation to organisational learning. CA can therefore be seen as a useful change management and organizational learning tool if the traditional design and use of such analyses can be changed.

3.2 Case 2: The pilot "Steering of smart wells from onshore" in StatoilHydro ASA

A meeting in the W&D (well and drilling) network on 19.11.04 decided that the planning of new fields should prepare for the implementation of downhole equipment, i.e. DIACS valves, e.g. smart wells. For existing fields the same policy was decided for the planning of new wells. Deviations from this should be supported by economic calculations. This pilot is about a potential implementation of smart wells
as a part of an IO strategy from pilot to broad implementation, in existing wells in the former StatoilHydro. The pilot is named "Steering of smart wells from onshore". The pilot takes place at the field Snorre B, which came into StatoilHydro from the former Hydro. Originally, Snorre B with its technological inventions came from the former small oil company Saga.

Through the performance of a consequence analysis, SINTEF was to help StatoilHydro take good decisions on whether the pilot should be recommended for a broader implementation or not.

StatoilHydro had about 10–15% of the smart wells worldwide. In December 2006 we found 48 completions of smart wells with altogether 147 DIACS valves. There is an increasing trend in smart well completion in the company, and about 100 smart wells with 320 valves over 25 fields were anticipated in 2010.

One of the conclusions of our CA was that there are no safety consequences, because the DIACS valves are not part of the barrier system. The main consequences are potential economic gains from production optimalisation, and potential changes for the production engineers if they change from today's reactive operation to a more proactive operation with the use of real-time data. More personnel resources for the production controllers are needed, and it might become more of a 24-hour operation in one way or another.

We mapped the present practice in 6 different assets (Heidrun, Veslefrikk, Gullfaks hovudfelt, Snorre A, Gullfaks satelitter and Visund) to see the gap between the pilot and the present practice in these assets. Two main ways of operating smart wells were identified, e.g. a manual way of operating as we see it in Gullfaks hovudfelt, Snorre A, Veslefrikk and Heidrun. In this solution the production engineers and the supplier have to travel offshore to operate the DIACS valves physically. There is a common understanding that this is not a good solution. The main controversies are between the pilot of Snorre B and the solution they have in Norne, Visund and Gullfaks.

Figure 3. Overview of the main alternatives for smart wells by existing assets in StatoilHydro.

Figure 4. Main controversies in the analysis of the smart well pilot.

In the analysis of all the potential consequences we soon realised that we needed to make a distinction between operation (who is pushing the button) and steering (who is planning and initializing the process), due to an unclear link between terms and practice, and thereby avoid misunderstandings. As a premise for the further analysis it is given as a presumption that onshore is always steering anyhow; the competence for that is onshore in Drilling & Well and in Petroleum Technology. The controversy is whether the SCR (Central Control Room) or the production engineers/production controllers onshore should operate the DIACS valves, or whether it should be done by a supplier onshore, as in the pilot at Snorre B.

The largest controversy in this pilot is who shall operate the DIACS valves, who shall push the button: the SCR offshore, or onshore. Connected to that controversy is the question of choosing and developing technology, which solution to choose. Different technology might support different operational solutions, e.g. who can push the buttons for operating the smart wells.

And when it goes from a single autonomous case to a pilot with possible consequences for other assets, the chain of interests grows larger. The two main technological alternatives are two digital variants, one with integration in the OS (operation station) system and operation from the SCR. Many of the assets want to go in this direction. The alternative of the pilot is to have a separate operation system, which is used now for the pilot. Operation today is done by the supplier Well Dynamics, who have to travel from their office to StatoilHydro to operate the valves, which influences the reaction time from decision to operation, if that is important. One of the most important arguments against an integration in the OS is the potential possibility of external hacking, which one avoids
with a separate control system, as in the pilot. But security is said to be well handled at the statoil@plant. It also involves larger development costs to integrate than the separate solution of the pilot. The pilot has a script that makes operation from onshore preferred. An integration in the OS opens for a symmetry between onshore and offshore operations, and might thereby conserve the status quo as regards today's situation on who should operate the valves, whereas the pilot might push a change.

As the CA started, a discussion arose within the pilot about whether the pilot had initially been evaluated well enough to become a pilot or not. As the interviews proceeded, the production engineers started to create suggestions for what could be changed in the pilot, as they realized that this might become the reality for many colleagues in other assets and that their practice might become the standardized proactive one, even though they had not done anything to improve it, or come up with the same suggestions, in the two to three years between the pilot's evaluation and now.

4 FROM PILOT TO BROAD IMPLEMENTATION AS A CHANGE STRATEGY

One of the main strategies for achieving the aims of IO in StatoilHydro has been to define different locally existing practices which contain good "IO characteristics" as pilots to be considered for broad implementation in the other assets, after first an evaluation and then a broader consequence analysis. The pilot of smart wells at Snorre B was a local initiative and a concept that was decided when the field was developed, as we can see from the choice of platform concept. Here we see a "top down" strategy, i.e. the IO initiative, meeting a local "bottom up" initiative developed at Snorre B. There is a huge variety in practices among the assets due to local autonomy, different histories and different field characteristics. When making such connections between local and more global change strategies, it is important to inform the pilot well about its chosen status, so that everybody knows, to avoid killing local commitment. This is also important to avoid local people feeling that they are hostages to larger organizational changes in other assets: practices that work well for them might anyhow create large resistance in other fields, where they are simply not invented and thereby do not fit in, and might take some trouble to change locally, even though they have been a successful and smooth practice elsewhere. If the pilot is not sufficiently locally anchored, questions will be posed as to whether it has been evaluated well enough locally, and thereby any argumentation supporting a broad implementation might effectively be stopped by potential opponents as a political argument against the planned changes, and not an argument based in professional discussions.

5 CONCLUSIONS

In this paper we argue that consequence analysis is to be seen as a part of a planned organizational change process. In fact, the organizational change process starts when the CA starts. Thereby CA and OD should not be seen as separate parts.

Objective analyses of consequences do not exist. The moment one starts interviewing about the potential consequences of an action, different groups of actants start to chain and to mobilize their common interests, as we see in the smart well case, and the change process starts.

The CA might better be seen as a part of a planned organisational change programme, trying to achieve a good dialogue and a collaborative atmosphere among the parties. As we see in the Huldra Veslefrikk case, it is not easy to achieve a good change process if the analysis carried out in advance (the CORD analysis) has not followed a well-participated process; it is very hard to achieve that later.

The best advice is to use the energy for change that is to be found in the mobilizing and chaining of interests. One has to enroll important actants and chains of important interests; if not, the OD program will be in vain.

To succeed, one has to understand the concrete operational challenges in the pilot, and the seamless web of technology and organization, and thus these need to be described and understood. The CA has one large advantage in dealing with this that OD programmes rarely have. CA might contribute to making OD more successful.

REFERENCES

Asch, S. (1952). Social psychology. Englewood Cliffs, NJ: Prentice-Hall.
Bijker, W.E., Hughes, T.P. & Pinch, T. (1987). The Social Construction of Technological Systems. Cambridge, MA: The MIT Press.
Callon, M. (1986). Some elements of a sociology of translation: domestication of the scallops and the fishermen. In John Law (ed.), pp. 196–229.
Emery, M. & Purser, R.E. (1996). The search conference—a method for planning organizational change and community action. San Francisco: Jossey-Bass Publishers.
Latour, B. (1986). Science in Action. Cambridge, MA: Harvard University Press.
Lauche, K., Sawaryn, S.J. & Thorogood, J.L. (2006). Capability development with remote drilling operations. Paper presented at the SPE Intelligent Energy Conference and Exhibition, Amsterdam, The Netherlands.
Lave, J. & Wenger, E. (1991). Situated Learning: Legitimate peripheral participation. Cambridge: Cambridge University Press.
Law, J. (2004). After Method: Mess in Social Science Research. London: Routledge.
Kaminski, D. (2004). Remote real time operations centre for geologically optimised productivity. Paper presented at the AAPG International Conference, Cancun, Mexico.
Moltu, B. (2003). "BPR in Norwegian! The management concept of Business Process Reengineering (BPR) as a cultural praxis". PhD Thesis (Nr 2004:81), NTNU, Trondheim, Norway.
Ringstad, A.J. & Andersen, K. (2007). Integrated operations and the need for a balanced development of people, technology and organization. Paper presented at the International Petroleum Technology Conference, Dubai, UAE.
StatoilHydro (2007). Integrated operations in StatoilHydro. Monthly Newsletter, May 2007.
Upstream technology, Feb. 2007, pp. 36–37. Interview with Adolfo Henriquez, Manager—Corporate Initiative Integrated Operations, StatoilHydro.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Integrated operations and leadership—How virtual cooperation influences leadership practice

K. Skarholt & P. Næsje
SINTEF Technology and Society, Trondheim, Norway

V. Hepsø & A.S. Bye
StatoilHydro, Trondheim, Norway
ABSTRACT: This paper aims to discuss how the use of advanced information and communication technology
impacts leadership practice. The paper is based on a research study conducted at the Kristin asset on the
Norwegian continental shelf. The technology we explore is Integrated Operations (IO), and how organizations
can benefit from using this kind of technology. We discuss the results of our study, focusing on virtual cooperation
among leadership teams located onshore and offshore in the Kristin organization. To date, some research on how
to succeed in virtual teams exists, but few studies explore leadership in virtual teams. The strength of this study
is the in-depth insight of how and why IO shapes the work practice of leaders and operators/technicians. So far,
few empirical research studies shed light on how IO functions and is experienced by the people involved. The
research has mostly focused on the theoretical models of IO.

1 INTRODUCTION

In this paper we discuss how virtual cooperation through the use of Integrated Operations influences leadership practice. The paper is based on a research study conducted at the Kristin asset on the Norwegian continental shelf.

1.1 Integrated operations

Today, several oil companies on the Norwegian continental shelf have implemented IO as a strategic tool to achieve safe, reliable, and efficient operations. There are a variety of concepts describing IO, also called e-Operations and Smart Operations. IO allows for a tighter integration of offshore and onshore personnel, operator companies, and service companies, by working with real-time data from the offshore installations.

The Norwegian Ministry of Petroleum and Energy (in white paper no. 38) defines IO as: "Use of information technology to change work processes to achieve improved decisions, remote control of processes and equipment, and to relocate functions and personnel to a remote installation or an onshore facility". Thus, IO is both a technological and an organizational issue, focusing on the use of new and advanced technology as well as new work practices. The IO technology implementation is not considered to be a major obstacle in StatoilHydro. The most challenging issue is to develop new work practices and change management (Henriquez et al., 2007).

How technology is able to coordinate and communicate tasks within virtual teams is of great importance. The IO technology consists of high-quality video conferencing, shared work spaces and data sharing facilities. These arenas include so-called collaboration rooms (operation rooms) for rapid responses and decision-making. The design includes video walls to share information and involve people in discussions, allowing eye contact both onshore and offshore. IO technology is characterized by vividness and interactivity. According to Steuer (1992), vividness is the ability of a telecommunications medium to produce a rich environment for the senses, which means having a range of sensory input (voice, video, eye contact, etc.) as well as depth of information bandwidth. Interactivity denotes the degree to which users can influence the form or content of their telecommunications medium. In our study we found that the use of collaboration rooms creates the sense of being present in a place different from one's physical location, i.e., the sense of "being there".

1.2 Integrated operations at Kristin

To explore how IO shapes new leadership practices, we discuss findings and knowledge acquired from a research and development project undertaken in 2007. The purpose of this project has been to describe IO in
the Kristin organization, focusing on work practices concerning operation and maintenance at the platform with onshore operational support. In addition, the purpose has been to work with organizational development, focusing on which organizational capabilities Kristin ought to develop.

The Kristin asset, operated by StatoilHydro, is a condensate and gas field on the Norwegian continental shelf. The platform is located 240 km offshore from Trondheim, Norway, and production started in November 2005. The Kristin organization wanted to develop an IO mindset in order to operate the platform with a minimum of people on board for safety reasons, to maximize production and operational efficiency, and to keep the platform in optimal technical condition. Compared to large offshore installations having an operation crew of 200 employees, there are only 31 employees working on the Kristin platform during any shift period. This lean organization influences the communication on board, and the communication between offshore and onshore personnel.

The Kristin organization has two management teams, one onshore and one offshore, each located in a collaboration room. There are continuous video links onshore and offshore, so both management teams can see each other at all times during the day.

1.3 Virtual leadership teams

How to succeed in virtual teams has been quite well described in the literature, but there are few studies exploring leadership in virtual teams. As with traditional team work, research on virtual teams has demonstrated the importance of effective communication and coordination within virtual teams (Lipnack and Stamps, 1997). Virtual teams are often characterized by high levels of autonomy rather than direct control, which will affect leadership practice.

Lipnack & Stamps (1997) define a virtual team as: ‘‘A group of people who interact through interdependent tasks guided by a common purpose that works across space, time and organizational boundaries with links strengthened by webs of communication technologies’’. IO is about how members of a geographically distributed organization (offshore and onshore) participate, communicate and coordinate their work through information technology.

In this paper, we focus on two types of virtual management concerning the cooperation between the onshore and offshore organization at Kristin. First, this paper explores how the use of IO technology by virtual management teams influences the cooperation and communication between the offshore and onshore management teams. Second, we explore the virtual cooperation between the onshore technical engineers and the offshore operators.

Olsen & Olsen (2000) describe which elements are crucial for success in virtual team work, such as the sharing of knowledge, coupling in work, the need for collaboration to solve tasks, and the need for technology that effectively supports communication and decision-making. We explore how these elements affect cooperation and outcomes in the organization we study.

2 METHOD

The empirical data for this study comprises observations and interviews. We have interviewed managers both onshore and offshore, operators offshore within all functions and disciplines represented at the platform (electro, mechanical, automation, instrument, process, among others), and technical lead engineers onshore within most of the disciplines. The collected material comprises semi-structured interviews with a total of 69 informants, as well as extensive participating observations both onshore and offshore. Analyses of interviews were conducted based on the principles of grounded methodology (Strauss and Corbin, 1998) with qualitative coding techniques. Examples of participating observations are being present at formal and informal meetings in the collaboration rooms both onshore and offshore, as well as following the work of the operators when they were out in the process plant doing maintenance and operation tasks.

The research approach has been co-generative learning (Elden & Levin, 1991). The basic idea is that practitioners and researchers create new practices together, in parallel with developing a deeper understanding of the research question in focus. Examples of this close interaction between practitioners and researchers in our study are as follows:

• During the project period, the researchers and key practitioners met on a regular basis (every month) for working sessions. This was central to the development of the analysis of the IO work practice. At work sessions, difficult issues could be discussed, misunderstandings were sorted out, and findings that needed interpretation were discussed.
• By holding informal meetings, being embedded during data collection, etc., the project contributed to co-learning and reflection on work practices between researchers and practitioners. A set of shared notions and concepts was developed, and thus also a higher awareness of critical organizational aspects.

In addition, the researchers presented and discussed the project results closely with all people involved in the project—offshore operators, onshore and offshore management, and onshore technical lead discipline
engineers. This had the effect of the researchers gaining deep insight into people’s work practices.

In terms of methodology, project execution followed these steps:

1. Provisional mapping of functions, arenas and relations. Prioritization of the most important functions/arenas/relations. (Tools: Workshop with stakeholders).
2. Collection of data. Information on arenas and relations used as a collection guide. Evolving focus with new information. (Tools: Observations, conversations, interviews).
3. Analysis of data for key findings and observations. Conducted simultaneously with data collection.
4. Identification of important themes in the material, using stakeholders to sound out their importance and secure ownership. (Tools: Work meetings).
5. Suggesting short-term and long-term actions. Prioritizing actions together with management and stakeholders. Presenting findings and actions for management, stakeholders, and employees. (Tools: Facilitating workshops).

3 DISCUSSION: INTEGRATED OPERATIONS AND LEADERSHIP PRACTICE

Certain characteristics of an organization’s leadership practice will have an impact on the outcomes of virtual cooperation. This will influence the quality of relations between the offshore and onshore organization. Below we begin by describing the characteristics of the leadership practice in the organization of study. Next, we discuss the virtual cooperation and work practice between the offshore and onshore organization: i) the virtual management team, ii) the virtual support from the onshore technical engineers.

3.1 Characteristics of leadership practice

We have explored the leadership at Kristin from a relational perspective, and find that leadership is characterized by a transformational and situational leadership style.

Leadership philosophy and style will impact how the offshore operators conduct their work, particularly the way in which they are expected to lead themselves and take responsibility for the operations both individually and collectively as a team. According to Wadel (2005), organizational restructuring into flat organizations and autonomous work teams means that co-workers to a larger extent have to lead and support each other. This change in roles and practice among workers also changes the role of leadership. To explore this, we have to understand leadership from a relational perspective as well as from a situational perspective.

According to Bass (1990), transformational leadership means that a leader communicates a vision, which is a reflection of how he or she defines an organization’s goals and the values which will support it. Transformational leaders know their employees and inspire and motivate them to view the organization’s vision as their own (Bass and Avolio, 1994). Such leadership occurs when one or more persons engage with others in such a way that leaders and followers lift each other to higher levels of motivation. At Kristin, the concept of integrated operations—what it really means for this organization—involved defining a vision and values concerning how the work ought to be performed: on board, and between the offshore and onshore personnel. Kristin has a lean and competent organization, where the operators/technicians in the Operation & Maintenance (O&M) team offshore possess expertise not necessarily found among their superiors. The motivation at Kristin has been empowerment, which has affected the autonomous work of the operators and the delegating leadership style.

Another leadership characteristic found at Kristin is situational leadership, which means that leaders allow for flexible solutions and actions adapted to the special conditions and situations in the organization. The lean offshore organization at Kristin, with few persons within each discipline, necessitates flexible problem-solving, which includes cooperation across disciplines to support each other’s work. Situational leadership is the opposite of trying to generalize or standardize work practices and routines. Situational leadership theories in organization studies presume that different leadership styles are better in different situations, and that leaders must be flexible enough to adapt their style to the situation in which they find themselves. A good situational leader is one who can quickly adapt his or her leadership style as the situation changes. Hersey and Blanchard (1977) developed situational leadership theory. They categorize leadership style according to the amount of control exerted and support given in terms of task and relationship behaviours: persuasive, instructive, participating, and delegating behaviour. Instructive behaviour means giving precise instructions and controlling execution. Persuasive behaviour involves defining tasks, but seeking ideas and suggestions from the workers. A participating leadership style is when the leader facilitates and takes part in decisions. A delegating behaviour means that leaders delegate the responsibility for decision-making and execution.

The level of competence among workers will influence whether a supportive or controlling leadership behaviour is adopted. To lead personnel with a low degree of competence, a manager will define tasks and supervise the employees closely. On the other hand, leading highly skilled workers involves delegating tasks and responsibility, and the control lies with
the employees. High levels of expertise do not require as much control from the manager. At the Kristin platform, the leadership style is adaptive. Depending on the situation and discipline, it is primarily characterized by a participating, persuasive, or delegating management style. This is because of the highly skilled personnel working at the platform. The O&M crew is directed primarily by the Operation supervisor, who is their line manager, but they are also directed by the technical lead engineers onshore. This is further discussed in Sections 3.2 and 3.3.

3.2 Virtual cooperation between the offshore and onshore management teams

We have examined which kinds of work practices, benefits, and outcomes the Kristin leadership teams, both offshore and onshore, have achieved by the use of integrated operations. First, we present and discuss how the management teams actually work in the collaboration rooms. Then we discuss the benefits and how they are achieved.

At Kristin, the management is organized as follows: There are two management teams, one onshore and one offshore, each located in a collaboration room. The collaboration is supported by the use of video conferencing and data sharing facilities, where both management teams can see each other at all times during the workday. Also, process data is online and available at both locations and can be shared.

The offshore management team at Kristin comprises four managers: a Platform Manager, an Operation Supervisor (O&M), an Operations Engineer, and a Hotel & Administration Supervisor (H&A). The management team offshore manages maintenance and operations in close collaboration with the management onshore. The onshore management comprises a Platform Manager, an Operation Supervisor, an Operation Engineer, and a Technical Support Supervisor. They share the collaboration room with some technical engineers, who support operations and modifications offshore. Both the offshore and onshore management use the collaboration room on a permanent basis, as their office, and not as a meeting room like several other assets on the Norwegian continental shelf do.

The onshore management team is responsible for giving day-to-day operational support to the offshore organization, and for the planning of maintenance programs and tasks on a long-term basis. This takes place through formal daily meetings and through informal and ad-hoc dialogue during the day. Each morning the offshore and onshore management teams have shared virtual meetings to share information and discuss the last 24 hours of operation and the next 24 hours to come. Here, representatives from different technical units onshore attend to be informed about the situation on the platform, and to give advice if needed.

So, what are the benefits of this close, but still virtual, cooperation? First of all, StatoilHydro has estimated huge savings in operation costs over the first year from integrated operations. As one of the platform managers at Kristin put it: ‘‘Half of the saving was due to the way we work. The other half was due to having a quality process plant’’. Thus, the reliability and uptime at Kristin have been very profitable.

The successful use of collaboration rooms has affected the economic outcomes. One important assumption is that everyone using the collaboration rooms both offshore and onshore knows each other well. They meet in person at informal and formal meetings onshore quite often, which strengthens the quality of the virtual work. This informal personal contact, and the fact that people know each other, makes distance leadership feel closer (Maznevski & Chudoba, 2000). This is an important criterion for success in virtual cooperation.

Next, we find that peripheral awareness has developed at Kristin, which means that you develop a deep understanding of what is going on at the platform. The condition of peripheral awareness improves the organization’s capability to achieve rapid responses, which in turn allows for more effective problem-solving and decision-making processes. One example is the low number of backlog activities concerning operations, maintenance, and HSE work. Compared to large installations on the Norwegian continental shelf, which have a high number of backlog activities, at Kristin they have managed to handle these issues effectively as a team.

The contextual differences (different work atmospheres, weather conditions, etc.) offshore and onshore become less important with the use of collaboration rooms. In the onshore collaboration room there are several video walls showing pictures/video of the platform, the technical equipment, and the people working there.

This daily and close communication creates a situation of shared situational awareness between the onshore and offshore managers. Rousseau et al. (2004: 14–15), Artman (2000), and Patrick and James (2004) argue that there is an increasing interest in studying team cognition, based on the fact that teamwork, or working towards a shared goal, requires information sharing and coordination. Shared situational awareness represents the overlap between team members, or the degree to which team members possess the same situational awareness or shared mental models. Shared mental models are ‘‘ . . . knowledge structures held by members of a team that enable them to form accurate explanations and expectations for the task, and
in turn, to coordinate their actions and adapt their behaviour to demands of the task and other team members’’ (Cannon-Bowers et al. 1993: 228, in French et al. 2004).

Figure 1. Different levels of integration.

Figure 1 above illustrates different levels of integration in the virtual communication. The highest level of interaction is the social level.

The challenge is to enable human and technical elements to work together as integrated units. Communication through the use of technology means more than the transfer of knowledge and information. Interoperability must be present in each of the four domains: physical, information, cognitive, and social (Alberts & Hayes, 2005). Videoconferencing requires interoperability on many levels, from the physical (technological) level to the social (organizational) level. At Kristin, the integration between the onshore and offshore organizations has reached the social level. This means that the organization has gained organizational improvements, such as situational awareness or shared understanding and efficient decision-making processes.

Shared understanding/shared situational awareness has a significant impact on the ability of teams to coordinate their work and perform well. Shared understanding affects performance in several ways, such as predicting the behaviors of team members, increasing satisfaction and motivation, and taking actions that benefit the team and the outcomes (Hinds & Weisband, 2003). In the absence of shared understanding, frustrations, conflicts and distrust can develop. In virtual teams, shared understanding is more difficult to generate. At Kristin, the IO mindset and technology have improved the ability to obtain shared understanding.

Below are some statements which illustrate the benefits of integrated operations:

‘‘The collaboration room enables access to important information, where we get to know about each other’s tasks and an overall impression of the work onshore and offshore. Thus, we perform the work as a management team, and not only as individuals.’’ (Manager)

‘‘One important aspect with integrated operations at Kristin is the informal communication that happens 16 hours every day. In the operation room I receive a lot of useful information from the other managers who are sharing this room with me.’’ (Manager)

The platform management at Kristin express that they aim at behaving as one management team, meaning that they want to co-ordinate problem-solving and decision-making between shifts. Once a week, even in their time off, the platform managers arrange a telephone meeting to discuss and share opinions concerning operation plans and their execution. In this way they develop hands-on knowledge regarding what’s going on at the Kristin platform, where tasks are being followed up on and rapidly solved.

‘‘We [the management] are quite co-ordinated across the shifts. We organize a telephone meeting every Thursday: 1) a meeting among the platform managers, and 2) a meeting among platform managers and the O&M supervisor. This is very special, I think; I have never experienced this leadership practice at other offshore installations’’. (Manager)

Performing as an integrated management team has influenced the sharing of common values and philosophy concerning how to organize and run the platform. This has been beneficial in terms of operational efficiency.

‘‘I find that values, norms and philosophy at Kristin are common and shared between the platform managers’’. (Manager)

‘‘The way we work must not be dependent on who’s at work. It matters how we work as a team’’. (Manager)

Their goals are consistent leadership behaviors and co-ordinated management solutions across shifts. Nevertheless, this can be challenging to achieve across different shift periods. For example, there are situations where the management team forgets to inform the next shift about all the decisions taken, but these are not critical decisions.

In summary, these are the capabilities or benefits developed through management use of collaboration rooms both offshore and onshore:

• Efficient problem-solving and decision-making processes
• Common ground: shared situational awareness
• Shared leadership practices
• Shared values

In the above we have focused on the benefits of virtual cooperation among leadership teams onshore and offshore. We have also examined whether the collaboration rooms can represent a barrier concerning leadership practice at the platform. The operators working in the O&M team offshore are located in an open-plan office area next to the collaboration room. We asked the operators whether the co-location of managers reduces or increases a manager’s availability.

We found that the managers themselves wish and try to be available for the operators at any time during the working day, as this statement illustrates:

‘‘My concern is that the management team should be available at all times during the day, even though we are located in the collaboration room. I understand if some of the operators find that it makes us less available, but my wish is to be available as a manager.’’ (Manager)

Most of the operators are of the opinion that the offshore management team is available, and feel free to contact their managers whenever they need to. The reasons for contacting management during a workday mostly involve the need for signatures and approvals of work orders (WO) and working permits (AT). The operators find that management responds quickly.

‘‘Personally, I have no problems with the managers sitting in the collaboration room. The managers have told us that if the door is locked they are busy. If not, we are welcome.’’ (Operator)

Nevertheless, some operators find that the co-location of managers may be a barrier for calling upon the manager’s attention. In some situations, they are unsure of whether or not they are disturbing them. This is when the managers are in contact with the management team onshore, or if one of the managers is having a telephone conversation. The managers say that they try to invite people in when they notice them standing outside the room waiting.

Another challenge concerning virtual management is that some of the managers spend more and more time in virtual space (the collaboration room). This influences how much time the managers offshore spend out in the process plant, where the workers spend most of their time. This then influences the amount of time spent together with the operators. Similarly, the leaders onshore spend much time in the collaboration room, and become less present and available to the workers they manage. One of the technical engineers put it this way: ‘‘For some of us, the collaboration room becomes like a drug’’. What he means is that you become dependent on being present and available in the collaboration room. If you are not present, you may miss important issues and discussions of what is going on during the day.

3.3 Virtual cooperation between the technical management onshore and the operators offshore

At Kristin, the operation and maintenance tasks performed by offshore operators are based on remote support from technical engineers onshore. Their function is not the management of people, but the management of technical tasks within operation and maintenance. For example, there is one domain engineer within the electro discipline who is planning and supporting the work of the electricians on the platform. This engineer is a domain expert and system responsible. A similar situation exists for the other disciplines on board (mechanics, automation, among others). He/she remotely assists the operations performed on the platform on a daily and long-term basis, such as the planning and prioritizing of operation and maintenance tasks. The crew on the platform is very much dependent on the skills and knowledge of these system responsible engineers, and on their availability in the daily decision-making and task-solving processes.

The work practice and virtual cooperation between technical engineers onshore and operators offshore is characterized by telephone meetings, telephone conversations, e-mails, and face-to-face cooperation on the platform. Meetings rarely take place in the collaboration rooms. For example, the electricians and mechanics on the platform have weekly telephone meetings with their technical engineers onshore. In addition, the technical engineers go offshore to Kristin 2–3 times a year, on average. This results in personnel onshore and offshore knowing each other well, and they develop a shared situational awareness of the operation and maintenance conditions.

We find many shared characteristics between the different disciplines, such as the close cooperation between operators within different disciplines and technical engineers. The relation is characterized by mutual trust, and they refer to each other as good colleagues:

‘‘I have a close cooperation with the operators on Kristin. They are highly skilled, work independently, and know the platform very well. I’m in daily dialogue with them, and we have weekly telephone meetings. Together we discuss technical challenges and problems.’’ (Technical engineer)

‘‘We are very satisfied with the technical support from our discipline expert. We appreciate it when he
comes offshore. It creates a common understanding of the technical problems that we are facing.’’ (Operator)

The operators trust the technical engineers’ ability, experience, and knowledge to support their work offshore. The engineers have demonstrated the competency, quality of work, and behavior necessary to accomplish production at the platform. According to Jarvenpaa and Leidner (1999), teams whose members trust one another tend to perform better. If trust exists in the relationships, much of the work involved in monitoring and controlling others becomes less important (McEvily et al., 2003, pp. 92–93), and this reduces the transaction costs associated with operations. ‘‘ . . . trust is the willingness to accept vulnerability based on positive expectations about another’s intention or behaviors . . . trust represents a positive assumption about the motives and intentions of another party, it allows people to economize on information processing and safeguarding behaviors’’.

It can be difficult to manage and support people you do not see. The daily and close relation between the operators and the technical engineers encourages positive trust relations. Their relation is characterized by a social level of interaction (cf. Figure 1). They know each other quite well (they have met each other in person), and the technical engineers are highly committed to the work they are supposed to manage and support on the platform. These good relations lead to efficient problem-solving and high-quality performance of operations and maintenance on board. Trusting relations between management and workers lead to increased responsibility and power to autonomous workgroups (Skarholt & Finnestrand, 2007). McEvily et al. (2003) argue that the tie sustaining trust becomes stronger because there are additional dimensions and relational contents. In addition to exchanging information and advice, friendships are also developed. Thus, the trust element is the glue, or the foundation, for a flexible structure of communication and enrolment realized through virtual and boundary work (Hepsø, 2008).

We find that the technical management onshore is characterized by a situational leadership style. Examples of situational leadership are as follows:

• Electro: Instructive on strategy, hands-on problem-solving using a participating style
• Programmed maintenance: Instructive on strategy, participating in execution
• Well control: Instructive in steering, participating in POG activities (production optimizing goals)

The level of complexity concerning the execution of tasks will influence leadership style: whether the style is delegating, participating, persuasive, or instructive. The reason behind the different management styles is that some tasks offshore require more instruction than others, such as well control, which is managed and decided onshore by a production engineer. On the other hand, tasks within electro are managed by close cooperation with onshore personnel and are characterized by a participating behavior.

The overall impression is that the technical engineers onshore provide qualified support, but there are some challenges. One challenge is that there is a delay in bringing support to the platform, because of the vast number of jobs these engineers have to deal with, both on Kristin and on other platforms they are supporting. In addition, they are involved in many discussions with contractors and suppliers about how to solve technical issues. Nevertheless, problems that are not critical can wait, while some problems need observation and evaluation across shifts. The operators express an understanding of the discipline expert’s situation, and, similarly, the discipline experts express a wish to be able to respond more rapidly, so that the operators are able to accomplish their tasks efficiently. This is an example of how they mutually trust each other in their efforts to obtain reliable results regarding operation and maintenance at Kristin.

Another challenge is that the technical support from the onshore organization is very dependent on who is performing it, because it is based on one person’s expertise. If the technical engineer does not manage to provide the necessary support to the platform, this role or function does not work. So, the system is fragile and is dependent on extroverted, available, co-operative, and highly skilled engineers.

We have asked the technical engineers onshore how they experience the availability and involvement of the management team onshore, located in operation rooms. We find that the co-location of managers in some situations impedes involvement from the technical engineers. The close cooperation between offshore and onshore management in some situations leads to quick decisions where the engineers are not included in the decision-making loop. Thus, the collaboration room can be a barrier for involving the experts. This is similar to what the operators offshore experienced. Nevertheless, we find that in critical situations, when their expertise is necessary, the engineers actively take part in discussions and solutions together with the management team.

4 CONCLUSIONS

This paper explores how integrated operations have an impact on leadership practice, and how virtual collaboration creates integration among management teams and personnel onshore and offshore. At Kristin, the concept of integrated operations and the use of collaboration rooms have created shared
situational awareness, which is crucial for obtaining efficient problem-solving and decision-making concerning safety, production and maintenance.

In our study, we find that IO enhances the experience of integration and common understanding between the onshore and offshore organizations, where the virtual contact through the use of collaboration rooms is experienced as ''being in the same room''. This results in better and faster decisions, because both the onshore and the offshore managements have in-depth knowledge about the situations/problems.

The challenging aspect of the use of collaboration rooms is that they can impede the managers' hands-on relationships with people outside the room, such as the relations with the operators/technicians offshore and the technical engineers onshore. Both groups have expressed a wish for more involvement from their management onshore and offshore in problem-solving tasks.

Our focus has been on how organizations can benefit from the use of new and advanced technology. The challenge is not the technology itself, but the organizational aspects, such as developing hands-on leadership practices, clear roles and tasks, common goals, trust, and knowledge and skills. These elements are essential for developing an efficient organization with motivated and skilled employees and managers.


Outsourcing maintenance in services providers

J.F. Gómez, C. Parra & V. González


Industrial Management PhD Program at the School of Engineering, University of Seville, Spain

A. Crespo & P. Moreu de León


Industrial Management School of Engineering, University of Seville, Spain

ABSTRACT: This paper presents a framework for the management of maintenance outsourcing in a service provider company. It proposes key aspects for decision-making in a well-established and controlled organization. Cost is not the most important aspect to consider in outsourcing: the decision has to be a global and strategic one within the company, and not only the directors but also the technical maintenance personnel must take part. We aim to offer a basic guide to establishing an outsourcing service, with guidelines and a possible evolution, based on a practical view of knowledge management gathered over ten years of professional experience focused on networks. A case study demonstrates a methodology for decision-making and shows how to optimize the organization without losing the different levels of knowledge. For this, we employ quantitative and qualitative criteria to obtain wide consensus and acceptance.

1 INTRODUCTION

Outsourcing in maintenance is a practice that is being used increasingly (Elfing & Duening 1994), especially with services providers. Although the decision to outsource is not a simple one, it is a strategic decision (Click & Duening 2005) for an organization and, as such, it should align itself with the business so as to impact positively on the objectives of the organization.

There are different strategic reasons for which people decide to execute processes of outsourcing. For example, many managers carrying out a process of outsourcing believe that it is a transfer of the responsibility for managing a part of the business to the supplier. Other motives are primarily economic: issues which endanger the control of outsourcing.

Faced with this, it is advisable to follow a process guided by decision-making steps, to ensure that the outcome of outsourcing in maintenance is properly reached. In this document we attempt to provide a framework for guiding implementations of outsourcing in service provider companies.

For this, we have structured the document in five parts. In the first two sections, points 2 and 3, we begin with a basic review of outsourcing and of maintenance in services providers. Afterwards, in point 4, we develop the reference model that is proposed, and we finish with a case study and conclusions.

2 OUTSOURCING

Outsourcing is defined as the total or partial delegation of business functions to another company, along with part of the administrative and operational control. It therefore establishes between two companies, a supplier and a customer, a contractual relationship governed by service agreements.

Mainly, with a process of outsourcing we are looking for specialization in activities that are not key for the organization (Elfing & Duening 1994), such as systems, accounting, buildings, human resources, call centres, engineering, logistics, etc., to which it can transfer the resources which formerly bore those functions.

The decision to outsource is a strategic decision (Earl 1996), aimed at improving the objectives of the organization:

• Improving quality
• Improving security
• Reducing cost
• Optimizing resources

Therefore, the organization should focus its efforts on improving those functions that are a source of competitive advantage and more profitable to the core business.

Outsourcing has several advantages and disadvantages (Alexander & Young 1996, Halvey & Melby

2005, Jharkharia & Shankarb 2005, Tho 2005). Among the advantages we can list:

• Reduction of costs, at the same quality, by employing a more specialized supplier
• Restructuring of costs, changing fixed costs into variable costs in terms of the services provided
• Stimulation of local employment through contracts with local firms
• Obtaining a rapid budget by selling assets
• Improvement of quality, through higher specialization
• Access to outside expert knowledge
• Standardization and access to scale economies
• Freeing of resources for other purposes
• Improved company focus
• Improved management of functions that are difficult to handle
• Optimization of routine tasks
• Sharing of risk with the supplier company through flexibility of demand
• Legal guarantee for the services
• Relationships developed between financial aspects and levels of service
• A starting point for changes in the organization
• Speed through reengineering

We also have to consider the potential risks and disadvantages which affect any outsourcing plan:

• Unfulfilled or questionable expectations of the scenario developed to justify the process of outsourcing.
• Changes in quality through breach of the agreements on services, either because of the knowledge or capabilities of the supplier company, or because of errors in their definition.
• Loss of knowledge or skills through transfer to the supplier, where they are more difficult to retain and improve; this happens frequently.
• Loss of control over the externalized functions, a source of learning for the internal staff.
• Dependence on the supplier, which could cause adverse consequences for the client (extraordinary investments).
• Loss of security through staff transferred to the supplier, by hoax and illegal transmission of knowledge and information to the competition.
• Public and internal opinion about outsourcing jobs to another company.
• Loss of motivation of the staff involved in the service, because it can create a feeling of alienation within the client company and result in the staff feeling their jobs are valueless.

Although the decision about which activities are to be outsourced is often described as the beginning of the process, the process should really begin much earlier, defining the mechanisms to start from a stable situation where the organization is controlled, and so avoid a difficult management of change.

Maintenance outsourcing can be an advantage, as in other businesses, with the aim of devoting most of the internal efforts to the core processes and seeking the specialization of external agents. However, it should be guided by three types of criteria: strategic, technical and economic.

Organizations often outsource those activities whose work patterns fluctuate in burden and performance, and maintenance, especially within distribution networks, meets this requirement.

Below, we describe the nature of maintenance in companies providing distribution services and consider it for the decision of outsourcing.

3 MAINTENANCE IN SERVICES PROVIDERS

Maintenance is characterized as a highly complex field inside business and involves various disciplines: management, human resources, company economy, security, and knowledge of the whole production chain. Another consideration is that maintenance activities are at all times under pressure to reduce costs, more than valuing the benefits or the damages that maintenance avoids (Carter 2001, Mitchell et al. 2002) for the company. A sign of this importance is the weight of O&M activities in GDP: 9.4% in Spain (AEM 2005), while other international studies put it between 15% and 40% depending on the sector (Mulcahy 1999, Mobley 2002).

The concurrence of these disciplines implies that it can be difficult to determine the appropriate decision every time.

On the other hand, we can define service provider companies as those that provide clients with certain services that are supported and distributed by a network infrastructure, such as gas, water, electricity or telecommunications companies. This infrastructure is often organized and composed of elements arranged in hierarchical structures and replicated by areas of distribution (Fig. 1). These companies fulfill the following characteristics:

1. Elements geographically dispersed and in non-optimal environmental conditions
2. High number of interconnected elements
3. High number and classes of customers
4. Hierarchical structure in networks with levels of aggregation of customer service
5. A dynamic network that undergoes configurational and operational changes
6. High needs of human resources and spares

In these types of enterprises, maintenance is a key department (Earl 1994), by its contribution to look

Figure 1. Infrastructure of a service provider company (source → primary connections → secondary connections → tertiary connections → customer link).

Figure 2. Objectives of the management model.

after the interests or satisfy the needs of clients and the benefits of enterprises (Zhu et al. 2002).

There are five different types of maintenance to consider for these types of enterprises according to most of the standards (Crespo 2007, Benoit 2006, Levitt 2003, Wireman 1991):

1. Corrective maintenance.
2. Preventive maintenance.
3. Predictive maintenance.

And the most recent types of maintenance:

4. Proactive maintenance: a set of activities designed to detect and correct an incidence before it occurs, avoiding its effects within the network and the services (Tanenbaum 1991).
5. Perfective maintenance: in the spirit of continuous improvement (IEEE 1219 1993, UNE 66174 2003), a set of projects to improve the performance of the network using the knowledge of maintenance (Kent 1990), also called ''Design-out Maintenance (DOM)'' (Gelders & Pintelon 1988).

4 OUTSOURCING AND MANAGEMENT IN SERVICES PROVIDERS

There are many standard models of processes, best practices, and Information Technology facilities (Hammer 1990, Peters 1982); unfortunately, ''no single model works in all conditions''. Some of the references we will seek to take advantage of are:

• EFQM (2006), TPM (Nakajima 1992) and ISO 9001 (2001): management by process and quality
• ITIL (ITSMF 2007): e-business
• eTOM (2007): network management
• CMMI (2007): system management to evaluate the maturity of companies

The initial situation of the majority of organizations in distribution services is characterized by:

• Management of a large set of elements from a large number of suppliers
• High dedication to the resolution of incidents
• Operation that is manual rather than automated, with dedication to repetitive tasks
• Reactive management of occurrences
• Network documentation in paper format
• Absence of a unique and updated inventory

The situation tends to become more complex: systems are increasingly difficult to manage due to increased volume and large geographic dispersion. Therefore, this situation is not the most appropriate starting point for a process of outsourcing.

Our recommendation for outsourcing is to establish a structured management model in maintenance, as a redesign (Davenport 1993, Hammer 1993) based on activities and with the objective of ensuring service quality (Klein 1994), so that it facilitates decision-making and finds points of improvement more quickly and easily. In sum, a model ''oriented to customers, processes and services delivery'' (Fig. 2). The model we develop is intended to be a support for outsourcing in maintenance, searching for:

• A balance between the fulfillment of internal and external requirements: strategic, operational and tactical (Kaplan and Norton 1996)
• The transformation of resources and services into customer satisfaction, fulfilling specifications and restrictions on management and cost (Nakajima 1992)

The implementation of the reference model is then developed in the following six phases (Fig. 3):

4.1 Mission, objectives and responsibilities

The definition in the European standard UNE 13306 (2001) about maintenance management is quite comprehensive, although we have to identify the mission of maintenance to complete that definition according

to the characteristics of the services provider companies: ''to guarantee the service'', that is, to ensure the proper functioning of the services supplied. Starting from this mission, we define responsibilities to achieve the objectives of the department.

Figure 3. Phases of the management model (1º mission and objectives; 2º department strategy; 3º processes and activities; 4º control system; 5º supplier selection; 6º change management).

4.2 Strategic maintenance outsourcing plan

This phase establishes a strategy to achieve the goal and to maintain a solid and operational network that ensures the services, according to the requirements set by the organization. The strategy is based on three interrelated viewpoints: improvement of service quality, cost reduction and resource optimization.

4.3 Processes and activities definition

This phase sets all the functions necessary to carry out the strategies and structures them in processes and activities. Activities are categorized by their strategic value, by their employment of cost and resources, and by their contribution to quality, taking into account the type of knowledge on which they are based and the control possibilities. At this point, it will be determined whether the implementation of these activities should be carried out with internal or external resources (Fixler & Siegel 1999, Grossman & Helpman 2002). It is important not to forget that outsourcing costs further include those of preparation, implementation, maintenance and completion.

4.4 Outsourcing knowledge and documentation in the maintenance control system

This is where the maintenance control system should be established and where the way to assess outsourcing and maintenance is defined, in search of efficiency through a balance between quality and cost. The implementation of a complete management system can save between 10% and 30% of the annual budget of maintenance (Crain 2003, PMRC 2004), the main improvements being in cost and task control, vital in the control of outsourcing.

In addition, based on Campbell and Jardine (2001) and the standards, we can consider that the minimum support systems for a Computerized Maintenance Management System (CMMS), also called a Maintenance Management Information System (MMIS) (Pintelon & Gelders 1992), are the following six (Fig. 3):

4.4.1 Inventory system
Correct management of the configuration within a network ensures rigorous bulletins and organization (ITSMF 2007), maintaining a history of the evolution of the elements in a planned manner. It thus reduces the risk of loss of control, ageing and variations in service quality.

4.4.2 Monitoring system
This is a key point for proactivity in maintenance. It provides information in real time about the status of the network and services (Lee 1995), with the objective of ensuring maximum network availability (Yan 2003), with the highest quality, in a rapid and effective response to incidents, preventing potential problems before they start (IMSCENTER 2007).

4.4.3 Activities and resources management system
Its mission is managing, planning and documenting the categorization of activities associated with human resources and infrastructure elements. It is therefore characterized by encompassing knowledge management of historical data as a source for managing problems (symptom-cause-solution) and learning. This module is the integration between technology and social variables: tasks, resources, personnel and organization (Earl 1994). Activities have to be automated by work-flow systems to provide increased levels of availability, reliability, flexibility and speed of services from the technical and economic points of view.

4.4.4 Integration module with the rest of the company systems
The objective of this module is to allow impact analysis in maintenance from the point of view of business operations and clients (Lee 2004). Interconnection must be conducted and controlled, at least with the following enterprise systems:

• Economics system
• Human resources management system
• Logistics

• CRM, customer relationship management
• Documentary management system
• Knowledge management system

4.4.5 Balanced scorecard in maintenance and with other systems
The balanced scorecard is a pillar of the evaluation and control of compliance with the department objectives. Its aims are the alignment of the department with the company strategy, relating all activities, processes, systems and resources to the operational and strategic objectives (UNE 66174 2003, UNE 66175 2003, Kaplan & Norton 1996).

To this end, it collects a coherent set of indicators: financial, about the business, customer relationship and continuous improvement.

4.4.6 Expert support system for taking decisions
This system gives support for taking decisions with maximum information, to facilitate the achievement of the objectives (Davis 1988). The recommendation is for it to be formed as a module that integrates:

1. A Decision Support System (DSS) (Turban 1988, Bui & Jarke 1984), through scientific models based on all the information from the systems.
2. An Expert System (ES) (Shu-Hsien 2005), to emulate human reasoning, like an expert, through artificial intelligence.

This module applies both information management and statistical models (Marple 1987) and simulations to submit patterns and solutions that facilitate decision making in maintenance (Iserman 1984, Jardine & Banjevic 2006).

4.5 Supplier selection

Once the reach of the outsourcing is defined, from a stable situation, we proceed to supplier selection and the planning of the outsourcing implementation. There are many considerations to take into account during this negotiation to avoid the risks listed above, but the main point is that it should be a collaborative, win-win negotiation, with the supplier as a strategic partner. It is advisable to guide suppliers to offer services based on the levels of their knowledge, and thus avoid the approach of only reducing cost.

The aspects most favoured in selecting a supplier are:

• Experience in the sector
• Flexibility on demand for services
• Confidence
• Technical and economic solvency
• Will of collaboration oriented to services as strategic support
• Transparency of suitable pricing

4.6 Management of changes

Planning a correct transition is important; it is a learning phase oriented to the supplier fulfilling the agreed service levels. On the other hand, to ensure business continuity in outsourcing, a transitional phase should also be considered and a possible reversion, distinguishing whether it occurs in the transitional phase, at any time, or at the end of the contract.

Working with an outsourcing model of these characteristics implies important changes for everyone, especially the responsible teams, which have to take a much more participatory role in management.

5 A CASE STUDY IN A TELECOMMUNICATIONS COMPANY

As an example, and to simplify, we will only focus on the outsourcing decision in a telecommunications provider: to evaluate the importance of each activity by its contribution towards the maintenance goals and to decide which activities could be outsourced.

From a strategic point of view (Kaplan & Norton 1996, Campbell & Jardine 2001, EFQM 2006), it must abide by the basic maintenance objectives, which we summarize in the following six categories:

1. Management
2. Economical
3. Production or business
4. Quality or related to customers
5. Security, environmental and ethics
6. Evolution and improvement

On the other hand, from a tactical point of view, the processes of the department should also be taken into account: corrective, preventive, predictive, proactive and perfective.

From an operational point of view, the maintenance activities have to be considered. To simplify the study, only the most important activities are considered:

1. To manage incidents, all kinds of incidences
2. Monitoring of alarms and of the status of the network and services
3. On demand activities, to support other internal departments in the field
4. Preventive activities
5. Predictive activities, analysis to avoid or minimize future impacts
6. Perfective activities, improvement plans or tasks to optimize infrastructure and services
7. Logistics, stores management and spares
8. Budget and human resources, to control the budget, resources, tools, staff, vehicles, etc.
9. Security, to control security, health and safety risks
10. Documentation and compilation of processes, procedures, reports, etc.

Table 1. Relative importance between variables (Saaty scale).

1 = Same; 3 = Weak; 5 = Strong; 7 = Proven; 9 = Absolute
1/3 = Slightly less; 1/5 = Less; 1/7 = Much less; 1/9 = Absolutely less
Intermediate values (1/2, 1/4, 1/6, 1/8, 2, 4, 6, 8) if necessary.

Table 2. Index of comparisons randomly generated.

n         2    3     4    5     6     7     8     9
ICrandom  0    0.58  0.9  1.12  1.24  1.32  1.41  1.45

For decision-making, we rely on the properties of the AHP method (Saaty 1977, 1980, 1990) for decisions in group (Dyer & Forman 1992) by selected maintenance experts from several hierarchical levels. The Analytic Hierarchy Process (AHP) is a methodology to synthesize a solution (a matrix) to a complex problem through a breakdown into hierarchically ordered parts, quantifying and comparing variables in pairs with a normalized and reciprocal scale of relative importance (Tab. 1).

In the use of this method, subjective values can be used, which implies a degree of uncertainty or lack of reliability. To measure reliability, the coefficient ''RC'' is used: the ratio between the consistency index IC of a pairwise comparison matrix and the value of the same index for a randomly generated pairwise comparison matrix (Tab. 2). The reliability is sufficient if RC is smaller than or equal to 0.10; otherwise, the matrix must be reviewed to improve its consistency.

RC = IC / ICrandom ≤ 0.1    (1)

IC = (λmax − n) / (n − 1)    (2)

So the problem is hierarchically structured with criteria and alternatives, in three levels:

1. Goal
2. Maintenance objectives as criteria
3. Activities as alternatives

For valuing the objectives, an expert group poll is used with qualitative criteria depending on their strategic importance. Each technician of a group of six compares them employing Table 1 and afterwards the resulting matrix (Fig. 5) is built by weighing the average of the individual values (Fig. 4); e.g. the 0.21 in the second cell of the first row of Figure 5 is calculated by dividing the geometric mean (3.086) of the six individual values (3, 4, 3, 2, 4, 3) by the sum of the second column of Figure 4 (14.5), and the rest of the matrix is obtained in the same manner.

Successive matrices comparing the activities according to each strategic objective were developed (with RC indices all less than 0.06, thus valid), with the exception of cost, where we employ the activity budget rate (a quantitative criterion). These matrices are multiplied by their respective eigenvector (W) to obtain this contribution in one column (Tab. 3).

In short, the weights for the activities in Table 4 are obtained by multiplying each cell of Table 3 by the respective cell in the same column of the W vector of Figure 5. The activities are then ranked depending on their importance in relation with the objectives of maintenance:

1. Budget and Human Resources 17.70%
2. Documentation 13.96%
3. Predictive 13.58%
4. Perfective 12.15%
5. Monitoring 10.87%
6. Preventive 10.31%

The less valued activities are:

7. Security 7.00%
8. Manage Incident 6.84%
9. Logistics 5.11%

wij          Quality  Cost   Production  Management  Security  Improvement
Quality      1        3.09   0.50        2.40        0.46      1.12
Cost         0.32     1      0.30        1.51        0.22      0.55
Production   1.99     3.37   1           3.36        0.93      1.70
Management   0.42     0.66   0.30        1           0.17      0.46
Security     2.19     4.57   1.07        5.73        1         1.26
Improvement  0.89     1.82   0.59        2.18        0.79      1
Column sum   6.8      14.5   3.8         16.2        3.6       6.1

Figure 4. Matrix completed with the average of six individual comparisons.

Table 3. Matrix of activity rates for each objective.

                      Quality  Cost   Production  Management  Security  Improvement
Manage incident       0.117    0.058  0.076       0.057       0.056     0.040
Monitoring            0.094    0.104  0.107       0.068       0.109     0.145
On demand activities  0.026    0.022  0.021       0.028       0.0266    0.027
Preventive            0.074    0.085  0.099       0.074       0.122     0.125
Predictive            0.166    0.109  0.121       0.121       0.115     0.185
Perfective            0.126    0.112  0.122       0.148       0.093     0.162
Logistics             0.039    0.069  0.072       0.031       0.039     0.050
Budget and Human R.   0.147    0.310  0.27        0.285       0.110     0.073
Security              0.064    0.048  0.037       0.040       0.115     0.069
Documentation         0.148    0.083  0.074       0.149       0.214     0.123
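The consistency check of Equations (1) and (2) and the construction of the weight vector W of Figure 5 can be sketched in code. The following is an illustrative sketch, not the authors' implementation: it takes the pairwise comparison matrix transcribed from Figure 4, approximates W by column normalization and row averaging (Saaty's approximation), estimates λmax from the ratios (A·W)i / Wi, and evaluates Equations (1) and (2) with ICrandom = 1.24 for n = 6 (Table 2).

```python
# Illustrative AHP consistency check for the criteria matrix of Figure 4
# (criteria order: quality, cost, production, management, security, improvement).
A = [
    [1.00, 3.09, 0.50, 2.40, 0.46, 1.12],
    [0.32, 1.00, 0.30, 1.51, 0.22, 0.55],
    [1.99, 3.37, 1.00, 3.36, 0.93, 1.70],
    [0.42, 0.66, 0.30, 1.00, 0.17, 0.46],
    [2.19, 4.57, 1.07, 5.73, 1.00, 1.26],
    [0.89, 1.82, 0.59, 2.18, 0.79, 1.00],
]
n = len(A)

# Example cell from the text: the Figure 4 entry (quality, cost) is the
# geometric mean of the six individual judgments 3, 4, 3, 2, 4, 3.
gm = (3 * 4 * 3 * 2 * 4 * 3) ** (1 / 6)   # close to the 3.086 quoted in the text

# W is approximated by normalizing each column and averaging across each row,
# which reproduces the rows and the W vector of Figure 5 up to rounding.
col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
W = [sum(A[i][j] / col_sums[j] for j in range(n)) / n for i in range(n)]

# The corresponding Figure 5 cell: geometric mean / column sum (the 0.21 example).
cell = gm / col_sums[1]

# lambda_max is estimated as the mean ratio (A W)_i / W_i.
Aw = [sum(A[i][j] * W[j] for j in range(n)) for i in range(n)]
lambda_max = sum(Aw[i] / W[i] for i in range(n)) / n

IC = (lambda_max - n) / (n - 1)   # Eq. (2)
IC_random = 1.24                  # Table 2, n = 6
RC = IC / IC_random               # Eq. (1)

print([round(w, 3) for w in W])   # close to Figure 5's W = (0.159, 0.073, 0.257, 0.062, 0.294, 0.156)
print(RC <= 0.10)                 # consistency acceptable
```

Run as-is, this yields an RC of roughly 0.016, consistent with the 0.01579 reported for Figure 5 and well under the 0.10 acceptance threshold of Equation (1).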

The least valued activity of all is on demand activities, with 2.49%.

This situation directs the processes of externalization towards these last four routine and repetitive activities, which are of no crucial importance to the core business. The expert group feels motivated by this decision, having participated in it, and it is suggested to advance further in outsourcing, after a stable period, by externalizing:

• Monitoring, at least the first attention level
• Preventive maintenance, with planning guided internally by predictive and perfective maintenance

Table 4. Matrix of activity rates according to their importance in relation with the strategic objectives.

                      Quality  Cost   Production  Management  Security  Improvement
Manage incident       0.019    0.004  0.019       0.004       0.016     0.006
Monitoring            0.015    0.008  0.027       0.004       0.032     0.023
On demand activities  0.004    0.002  0.005       0.002       0.008     0.004
Preventive            0.012    0.006  0.025       0.005       0.036     0.019
Predictive            0.026    0.008  0.031       0.008       0.034     0.029
Perfective            0.020    0.008  0.031       0.009       0.027     0.025
Logistics             0.006    0.005  0.019       0.002       0.012     0.008
Budget and Human R.   0.023    0.023  0.070       0.018       0.032     0.011
Security              0.010    0.004  0.009       0.002       0.034     0.011
Documentation         0.023    0.006  0.019       0.009       0.063     0.019

wij          Quality  Cost   Production  Management  Security  Improvement
Quality      0.15     0.21   0.13        0.15        0.13      0.18
Cost         0.05     0.07   0.08        0.09        0.06      0.09
Production   0.29     0.23   0.27        0.21        0.26      0.28
Management   0.06     0.05   0.08        0.06        0.05      0.08
Security     0.32     0.31   0.28        0.35        0.28      0.21
Improvement  0.13     0.13   0.16        0.13        0.22      0.16
W            0.159    0.073  0.257       0.062       0.294     0.156

Figure 5. Weights in strategic criteria (RC = 0.01579, acceptable).

6 CONCLUSION

This reference model has been implemented in at least two companies distributing telecommunications services and, as such, it can be developed in greater depth and customized for certain scenarios. In addition, it is possible to increase control and knowledge in maintenance thanks to information systems facilities, ''e-maintenance'' (Yua et al. 2003, IMSCENTER 2007):

• They facilitate the management of agreements on service levels and the delivery of service reports; IT thus contributes to effectiveness and efficiency.
• They orient the model towards services rather than elements of infrastructure, searching for continuous improvement in services and processes to reduce costs and times, and to improve value and quality.

The aim of this model is to increase decision reliability with experience, and in accordance with the department strategy:

1. Improved organization and structure
2. Using a rational and logical analysis, it seeks a solution for a complex problem with various alternatives in conflict and under conditions of uncertainty [DIXON66]
3. Aligned with the company strategy, it considers processes, objectives and activities
4. Employment of qualitative criteria, to rationalize intangible quality and value judgments from experts and to extract specialist knowledge
5. It promotes positive attitudes towards improving maintenance
6. Consensus in groups with different interests
7. Categorization of alternatives
8. Interactive improvement
9. Reporting of processes for future developments
10. Ease of use and flexibility with the information available

This method reduces the time spent in decision making, increases the quality and security of the final decision, and produces motivation and satisfaction with the goals and the team work.

In conclusion, for maintenance outsourcing in a service provider, we suggest composing the levels of externalisation progressively in time, increasing internal knowledge and control of the activities before contracting them out. That is, to make a partial outsourcing:

• with a flexible contract
• guaranteeing business productivity through service level agreements
• devoting staff to managing the contractual relationship and monitoring the services
• guided primarily by strategic criteria

But the analysis should be carried out with caution because, when outsourcing goes beyond the normal level, there is a point of irreversibility of the decision at which it would be impossible to react: the time and resources needed to prevent the consequences would make acting upon them unacceptable.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5
Revising rules and reviving knowledge in the Norwegian railway system

H.C. Blakstad & R. Rosness
SINTEF Technology and Society, Trondheim, Norway

J. Hovden
Department of Industrial Economics and Technology Management, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
ABSTRACT: This paper presents and discusses four safety rule modification processes in the Norwegian railway system. It focuses upon the impact of the processes upon railway knowledge, and in particular the ambitions to change from predominantly experience based prescriptive rules towards risk based outcome oriented rules, i.e. a deductive top-down approach to rule development.
The cases met this challenge with an inductive bottom-up approach to rule development, a strategy given the name "reverse invention". Discussions about the new approach and the processes of reverse invention stimulated inquiries into railway knowledge that revived this knowledge. It remained uncertain whether the inquiries resulted in actually new knowledge. The new approach also stimulated a reduction of the relational and contextual elements of the railway knowledge. According to theory, these elements are important for the ability to decode theoretical knowledge and to judge its relevance for future use.
1 INTRODUCTION

Rules constitute an important record of the organization's learning about its operational dangers (Reason, 1997; Reason et al. 1998). Hale (1990) argues that the greatest value of safety rules lies in the process of actually finding out and writing down the rules. The organization can retain this value by treating the rules as a living repository of lessons learned in the life of the system. Accordingly, safety rules are not only a result of a learning process; they can be seen as a part of that process.

There is limited scientific knowledge about performing safety rule modifications (Hale et al., 2003). This also implies that there is limited scientific knowledge of the close relationship between safety rules and knowledge of activities and related risks in a regulated area, and of the possible impact of rule modification upon this knowledge.

The purpose of this paper is to explore how safety rule modification can influence organizational knowledge about operational dangers. It presents results from a case study of four safety rule modification processes in the Norwegian railway system. Special attention is given to consequences of the ambition to change from predominantly experience based prescriptive rules towards risk based outcome oriented rules.

1.1 Safety rules and knowledge

Safety rules have been directed at the control of known risks and can be seen as a result of the knowledge at the time of the rule making. In Baumard's terminology they represent explicit collective knowledge (Baumard, 1999). However, there is a general concern that it is neither possible nor wise to transfer all lessons learned into explicit collective knowledge and rules (Baumard, 1999; Hale, 1990; Rasmussen, 1997; Reason, 1997).

Furthermore, the knowledge transformed into rules can differ. Many traditional safety systems have used safety rules providing detailed prescriptions to the operative staff of what to do in response to predicted situations, or requirements to the states of the system (Hale, 1990). The knowledge base for the development of such prescriptive rules was usually extensive knowledge of the system's functioning, combined with knowledge derived from practical experience with accidents and dangerous events (Rasmussen, 1997). Hence, the prescriptive and experience based safety rules can represent an explicit collection of information about what to do under certain conditions, based on knowledge derived from lessons learned in the past (Baumard, 1999; Hale, 1990; Rasmussen, 1997).

There is now an increased use of rules for safety management and outcomes. When found necessary, these are supplemented with lower level rules to form
a dynamic hierarchy of safety rules, see for instance Hale et al. (1997) and Hovden (1998). Risk assessment is often required to decide upon the necessity of rules. Such an approach to the development of safety rules can be seen as a deductive and risk based top-down approach to rule development (Hovden, 1998).

The change from experience based prescriptive rules towards rules for safety management and outcomes based on risk assessments represents a change in the knowledge base for the rules. This implies a change in the attention to and selection of the knowledge that will be considered relevant and expressed through the rules, i.e. from knowledge about what to do under certain conditions towards knowledge about intended outcomes.

The new approach to rule modification can be seen as a change in the dominating type of rationality and knowledge. As the tradition has been that rules have been developed in accordance with rule developers' current understanding of the actual situation, the logic of appropriateness has played an important role (March, 1994). The knowledge base here is familiarity and experience. This knowledge is rather implicit, i.e. tacit, and the treatment of information is intuitive. Ellström (1996) labels the knowledge perspective of this tradition "intuitively-contextual".

The increased emphasis upon goals in outcome oriented rules requires rule makers to identify possible alternatives and choose between them in accordance with their contribution to the preset goals of the rules. This approach is linked to another type of rationality, which March (1994) calls rationalistic, choice based decision making. This form of decision making is preference based and supposed to apply the logic of consequence. March argues that the logic of consequence makes great demands on the abilities of individuals and institutions to anticipate the future and to form useful preferences.

The introduction of risk analyses represents an additional move in this direction. Perrow (1999) argues that such a development represents a strengthening of the tradition of absolute rationality. This is a form of rationality wherein calculations can be made about risks and benefits, clearly showing which activities we should prefer. Risk analyses thus serve as support for choice based decision-making as described by March (1994).

The scientific nature of the outcome oriented rules and risk analyses also resembles the highest level of technical rationality described by Schön (1991). Here the strategy is "first thinking and then acting". Ellström (1996) labels the knowledge perspective of this tradition "rationalistic". The dominating knowledge base of this knowledge tradition is theoretical and explicit, and the treatment of information is analytical. Lindblom (1959) argues that the rationalistic strategy implies that decisions have to start from new fundamentals each time. They only build upon the past when experiences are embodied in a theory. The decision-makers are always prepared to start from the ground up, i.e. from scratch.

Seen together, this theory indicates that changing an experience based, prescriptive safety rule tradition towards outcome oriented rules based on risk analyses will require a change in the attention and knowledge tradition of the rules.

1.2 Safety rules and knowledge in the Norwegian railway system

The Norwegian railway system has a tradition of prescriptive rules directed at the operative staff at the lower levels of the organizational hierarchies. The rules have been developed with growing knowledge of the system's technology, activities and interactions, and with experience of unwanted events and accidents (Gulowsen & Ryggvik, 2004; Ryggvik, 2004). Much of the knowledge has been derived from practice and consisted of collective tacit and explicit knowledge. This knowledge was shared through an internal educational system, practice oriented trainee programs and socialization. Here the rules served an important role for the structure of the education and as knowledge carriers.

In 1996, steps were taken to open the Norwegian railway system to new traffic operators. The Norwegian state owned railway company (NSB) was divided into the National Railway Administration, which was responsible for infrastructure management, and NSB BA, a state owned traffic operator. An independent regulatory body, the Norwegian Railway Inspectorate, was established.

The railway sector, and in particular the Norwegian Railway Inspectorate, has been influenced by the safety management traditions of the Norwegian oil industry (Ryggvik, 2004). This tradition has emphasized internal control principles with extensive use of risk analyses and outcome oriented rules. The development has resulted in initiatives, especially from the Norwegian Railway Inspectorate, to change the tradition of experience based, prescriptive rules towards outcome oriented rules based on results from risk analyses.

The intentions of a development towards a deductive and risk based top-down approach to safety-rule modifications were evident in two different projects. One project was established for the modification of the "traffic safety rules". In general, these rules were detailed prescriptive action rules that coordinated the activities of the operative staff involved in traffic operations. The management of this project encouraged the rule developers to "think new" and develop outcome-oriented rules formulated as goals, and to base these upon risk analyses. From the beginning, the
rule-imposers were the Norwegian Railway Administration. Later this responsibility was transferred to the Norwegian Railway Inspectorate.

The other project had as its purpose to improve the management of infrastructure maintenance. One element in this project was to modify the "maintenance rules". These rules were organized in different sets for each subsystem of the infrastructure. They were mainly detailed prescriptive action or state rules directed at the operative staff, and they served both safety and other purposes. The different subsystems had varying characteristics regarding the time sequencing of activities, communication and coordination. The project organized subprojects for the modification of each rule set.

Also in this project the management encouraged the rule developers to "think new". This meant to increase the use of triggering requirements and to base the rules on risk analyses. The triggering requirements should define conditions in the infrastructures that should trigger off maintenance activities, i.e. define outcomes for maintenance activities. The rule-imposers were the Norwegian Railway Administration.

On this background, the Norwegian railway system represented an opportunity to study the implementation of the ongoing changes in rule traditions and their impact upon knowledge about the operations of the system and associated dangers.

In this paper the term "railway knowledge" refers to the individual and collective understanding of the functions and interactions of the railway system. This includes knowledge of the system itself, its activities and their interactions, the inherent risks and preventive means.

2 RESEARCH QUESTION AND APPROACH

This study looks at how the intended changes of regulatory mode and knowledge base were handled in practical life. The research question is: How did ambitions to change the safety rule tradition of the Norwegian railway system (from predominantly experience based prescriptive rules towards risk based outcome oriented rules) influence railway knowledge? The question is based on the hypothesis that the described change in the safety rule tradition will change railway knowledge. The study will also explore possible explanations for the identified changes in railway knowledge and discuss practical implications. This calls for an explorative and qualitative approach, and a case study design was chosen (Miles & Huberman, 1994; Yin, 1994).

Four cases of safety rule modifications in the Norwegian railway system were chosen for the study. The project for the modification of the traffic safety rules, hereafter called the "Traffic-rule case", was chosen to represent one case in the study. This case was followed until the work was transferred to the Norwegian Railway Inspectorate.

Among the subprojects of the Maintenance project, three cases were chosen for the study. These were the projects modifying the rules for the signal, power supply and superstructure infrastructure. These cases were followed until the rules were approved.

The information for the study was collected by interviews of 41 people who had been involved in the modification processes, studies of selected project documents and participation in 4 different meetings. The analyses were performed as an iterative process inspired by Grounded theory (Strauss & Corbin, 1998). The analytic tools and results influenced further data collection, and further data collection developed the analytic tools. For further presentation of the method, see Blakstad (2006).

3 FINDINGS: REVIVAL AND CONSERVATION OF RAILWAY KNOWLEDGE

3.1 A strategy of reverse invention

In all cases the participants of the modification processes tried to "think new" and to use the intended deductive top-down approach to rule development, i.e. to start with the development of higher order outcome oriented rules based on knowledge from risk analyses. The chosen methods for the risk analyses used experienced top events as the outset for the analyses. The core attention was directed at the operative level of the railway system.

However, the cases critically judged outcome oriented rule solutions and risk analytic results through inquiries into experience based railway knowledge. This reflected a precautionary concern for safety (Blakstad, 2006). For example, one of the persons involved in the Traffic-rule case explained how one of the railway professionals of the work group always expressed his worries for safety. He did this even when he was not able to express why he was worried. These expressed worries led to inquiries and discussions that revealed the foundations for the worries. In this way experience based railway knowledge, even when it was tacit, served as a reference for safe solutions.

The cases soon abandoned the deductive and risk based top-down strategy to the rule development. Instead, all cases used a bottom-up approach where existing low level prescriptive rules and the associated knowledge were used as the outset for the development of outcome-oriented rules. This strategy is given the name "reverse invention" in this study.

When the cases changed into processes of reverse invention, the work built upon the railway knowledge of the existing rules, i.e. knowledge directly expressed in
the rules themselves and knowledge about their history and intended function. It was necessary to inquire into the intentions and rationale behind the existing prescriptive rules, knowledge that was sometimes difficult to retrieve. In this way existing knowledge associated with the pre-existing prescriptive rules was brought forth.

Also, none of the cases found that outcome oriented rules gave sufficient control of known risks. The Traffic-rule case stayed with the prescriptive rule solution and intended to use and develop its outcome oriented formulations for educational purposes. However, the plan for such a textbook was abandoned for economic reasons. The Maintenance-rule cases developed outcome oriented triggering requirements. These were supplemented with explanatory texts and subordinated prescriptive rules. With safety as an important argument, the cases included more prescriptive rules than intended.

The main explanations the cases gave for the development towards reverse invention and the chosen rule solutions were that the existing lower order rules and associated knowledge were highly trusted and necessary for safe performance. Therefore, it was experienced as a waste to start from scratch without taking advantage of the knowledge associated with these rules. Reference was also made to the important function the rules had in the educational system for railway personnel; a change would require a new way of teaching. Outcome oriented formulations were also welcomed as a means to illustrate the purpose of prescriptive rules in educational settings.

3.2 Risk analyses validated railway knowledge

In spite of the strategy of reverse invention, with the use of existing prescriptive rules and railway knowledge as the core fundament for the rule development, all cases incorporated risk analyses in their modification work. However, the cases demonstrated four different solutions for the application of risk analyses that gave them different functions in the work (Blakstad, 2006). In spite of this there were some common features in the processes.

One common feature was that the risk analyses initiated inquiries. As one of the interviewees of the Maintenance-rule cases related:

"But I do believe that we have had much to learn from the RCM. [Reliability Centered Maintenance; authors' comment.] There were systematically asked questions about consequences of failures etc. And it is clear that we have this in the back of our mind for anything we do, but we have not always followed the thinking to the end. And sometimes one experiences some surprises."

All cases compared the risk analyses with railway knowledge. The expressed purpose of the comparison was to control that known risk was included in the analyses. Accordingly, the cases revealed a higher trust in existing experience-based railway knowledge than in the results of the risk analyses, and railway knowledge served as a reference for the good quality of the risk analyses.

Usually the risk analyses and the railway knowledge provided the same conclusions. When this happened, the confidence in both railway knowledge and risk analyses increased among the participants of the work, and they experienced it as a validation of both knowledge sources. The interviewees also gave some examples where the conclusions from risk analyses and railway knowledge came into conflict. In such instances the reason for the different results was questioned and inquiries were initiated. The interviewees expressed less trust in risk analytic results than in the experience based railway knowledge. Therefore, the most common strategy for the inquiries was to review the risk analyses. The major concern was whether the analyses had excluded important railway knowledge. When weaknesses in the analyses were revealed, the analyses were adjusted to conform to the railway knowledge. Through this strategy the risk analyses were brought into accordance with the railway knowledge and agreement was established. When consensus was reached, the participants experienced this as if the risk analyses and the railway knowledge validated each other. Accordingly, also conflicting initial conclusions resulted in increased confidence in both.

3.3 Validation of railway knowledge through feedback

The work of the cases included different feedback loops (Blakstad, 2006). First, the cases made inquiries into the efficiency of the existing rules. Second, the cases took steps to achieve feedback from practitioners upon the evolving rule solutions, the methods to include risk analyses in the modification work, and the risk analysis results.

The feedback was given in different ways. In the Traffic-rule case the members used each other and their network of practitioners holding different functions for feedback. This was mostly done through informal inquiries and discussions when the work raised questions. The work group also involved its resource groups, which held different positions in the organizational hierarchies of the system. In addition, new statistical material and reports of unwanted events and accidents that occurred during the process were used.

The Maintenance-rule cases also took steps to get feedback on their work. Again, the networks of practitioners that the core participants possessed became
involved to achieve feedback. However, differences in the organization of the Maintenance-rule cases and their tasks created different conditions for using those people directly involved in maintenance performance for feedback. The cases participated in a joint hearing process on the modified rules and the risk analyses. Also these cases looked into new statistics and accident reports. At the end, the Maintenance-rule project organized a joint two-step hearing process.

3.4 Systematizing and storing knowledge

The descriptions above reveal that the existing prescriptive rules served as an important knowledge base for the rule development.

The descriptions also illustrate that the processes of the cases can be seen as a revival of railway knowledge. Railway knowledge was spread around the organization. It was sometimes difficult to retrieve because some of it had a more or less individual and tacit form. Therefore the inquiries and the following work with the achieved knowledge implied an articulation of this knowledge, and more people had access to it. It remained uncertain whether the inquiries contributed with actually new knowledge.

The knowledge retrieved from the inquiries was combined and sorted out, discussed, systematized and to some extent documented. This implied a direction of attention where some knowledge came more into focus than other knowledge and was therefore included in the work. The processes were governed by a combination of the incitements to the rule solutions, the frameworks that the risk analytic methods provided, and the risk perception of the participants. For instance, the final report of the Traffic-rule case comments that the risk analyses did not contribute with any particular unknown conditions. However, they had a systematizing function, contributed with an overview and drew attention to conditions known from before. An interviewee of the Maintenance-rule cases made some of the same reflections:

"At least the systematizing of it [the risk analyses, authors' comment] forces one to evaluate and document what one does." And he continues: "... before it was very much based on individuals—the experience one had within the areas."

The interviewees were asked who the core contributors to the work were. Their answers revealed that those performing the rule development and the risk analyses were the main contributors. Their networks contributed as important supplements. However, there were differences between the cases. The organizing of the cases and the tasks differed and influenced how much the knowledge of these actors became articulated, made collective and combined. There were also differences between the cases regarding the available time for reflection.

The Traffic-rule case had the best conditions of the cases for interaction and access to knowledge resources. The work group was located together and had dedicated time for the work. It worked as an interactive team where also the different tasks interacted in iterative processes. Furthermore, this case had a formalized organization that included many actors and required communication, written reports and agenda papers. Thus knowledge from different sources became articulated and combined, and to a great extent transferred into a written form.

Among the Maintenance-rule cases, only the work group of the Superstructure case was located together and had continuity in its work. This case also had a formalized Board of the branch that it included. The Signal case, which had the role of a pilot, had more economic resources available and dedicated time to organize meetings about the risk analyses. The Maintenance-rule cases were also less formalized than the Traffic-rule case. Therefore they did not communicate with others and did not produce written documentation to the same extent.

However, when it came to the rule solutions, the Traffic-rule case only expressed knowledge in prescriptive rules, while the Maintenance-rule cases expressed it in three ways: in triggering requirements, in their explanatory texts and in prescriptive rules.

4 DISCUSSION

The results above reveal that the cases used railway knowledge as the core knowledge base for the rule modification process. The rule modification process revived railway knowledge by making formerly tacit knowledge explicit. The processes increased the confidence in this knowledge.

4.1 The strong position of railway knowledge

The cases did not adopt the rationalistic, deductive top-down strategy that was intended for the modification work. The main explanation was that existing experience based railway knowledge, and in particular the knowledge associated with the existing prescriptive rules, was seen as too valuable for safety to be abandoned. Hence, the cases are in line with Lindblom's critique of the rationalistic strategy (Lindblom, 1959).

Furthermore, Reason (1997) argues that the stage reached in an organization's life history will influence the opportunities for feed forward and feedback control of activities. The Norwegian railway system was old enough to have the necessary experience to develop prescriptive rules in accordance with feed forward control principles.
An additional argument was that the prescriptive rules were perceived as important and valuable elements in the Norwegian railway system's organizational memory of safe performance, as discussed by Stein (1995).
Instead, the cases decided to apply another type of rationality in their developmental work than the rationalistic rationality of the deductive top-down approach. This was based on an inductive bottom-up strategy where the existing prescriptive rules were used as the starting point for rule development. This made it possible to build upon the accumulated knowledge associated with the prescriptive rules, and in particular knowledge about their former and existing context, their intended function and the experiences of their efficiency in fulfilling their intention. In March's terminology, the rationality that the rule developers applied in the modification processes became dominated by a rule- and identity-based type of decision-making (March, 1994). The existing rules and associated knowledge describing safe actions and states served as a fundament for judging what was considered to be appropriate outcomes of the regulated activities. In this way, knowledge associated with existing rules was brought forth into the new descriptions of wanted outcomes and the work did not have to start from scratch. This can be seen as a conservative and precautionary strategy to fulfil the requirement of outcome-oriented rules and a rule hierarchy in a prescriptive rule tradition.
Furthermore, the inquiries that the cases made into railway knowledge made them cautious about replacing existing prescriptive rules with outcome-oriented rules or changing them.
Accordingly, the processes got a conservative touch and prescriptive rules appeared to be more persistent than expected. The inductive bottom-up approach of reverse invention made the cases able to build upon existing knowledge of the prescriptive rules' context and function and rule-specific knowledge, i.e. knowledge that resembles Ellström's descriptions of intuitively-contextual knowledge (Ellström, 1996). One can say that the decision strategy of the cases resembled that of ''Mixed scanning'' presented by Etzioni (1967). Building upon his metaphor, predominantly intuitively-contextual railway knowledge represented a broad-angled camera. This was used to scan the situation, or in other words to get an overview of railway activities, related risks and experiences with existing means to prevent accidents, including safety rules. Then rule- and identity-based rationality, applying the same knowledge, was used to zero in on those areas that required more in-depth examination. The risk analyses contributed to this work. As Kørte et al. (2002) discuss, the operational environments provided updated process knowledge and experience data that served as input to the analytic process.
Accordingly, the inductive bottom-up strategy created a transition period when higher order rules were developed from lower level rules. The process led to inquiries that activated intuitively-contextual railway knowledge and made it more explicit. Furthermore, attempts to formulate intended outcomes based on the experiences behind the existing rules made the intentions of their prescriptions more explicit. Baumard (1999) argues that making knowledge more explicit might be an advantage when a system has to handle organizational transformations. The ongoing deregulation process of the Norwegian railway system can be seen as a transformation period. Accordingly, processes that stimulate articulation of knowledge might be an advantage for the safety of the system.
Altogether, this reveals that the predominantly intuitively-contextual railway knowledge of different sources became revived and integrated into the form of the higher order outcome-oriented rules and the forms of the chosen risk analytic method. Furthermore, the revived knowledge became selected in accordance with the rule developers' perception of risk and transformed into the more abstract and context-free forms of outcomes and the structure and schema of the chosen risk analytic methods. Such knowledge can be labelled rationalistic knowledge (Schön, 1991; Ellström, 1996). In this way the system developed the ability to take advantage of railway knowledge in both intuitively-contextual and rationalistic forms in its safety work. However, the inherent conservatism of the processes might make it difficult to foresee new and unexpected dangers, as Turner & Pidgeon (1997) discuss. The ongoing transition period of the system creates changes for which it might be difficult to foresee all consequences for safety.

4.2 Is the revived railway knowledge endangered?

The fact that some of the revived railway knowledge, and mainly the rationalistic elements, is transformed into written form and stored in rules and documents does not mean that the knowledge is stored in organizational memory. As knowledge is relational and context-specific, data and information transferred into written form are not the same as knowledge for the reader of the written documentation (Baumard, 1999; Nonaka & Takeuchi, 1995; Stein, 1995). Written documentation cannot communicate the rich mental maps that might be necessary to decode written texts and to understand the complex dynamics of reality. Also, written texts stored in databases require that their storage can be located and that access is given when it is necessary or useful to retrieve them (Stein, 1995). In addition, mental maps are also often important for foreseeing the consequences of different actions and choices that are necessary for accident prevention (Perrow, 1984/1999; Rasmussen, 1997;
Rasmussen & Svedung, 2000). The results of the study also revealed that such knowledge was experienced as being important for the understanding of the rules and their intended function, for judging their relevance and for rule-followers' motivation for compliance.
Accordingly, to keep the revived railway knowledge alive for the future, the written text has to be retrievable. Furthermore, it has to be supplemented with elements of social interaction and context. Therefore, it is essential that intuitively-contextual railway knowledge is stored in organizational memory.
The differences between the cases in their organizing regarding participation, communication and interaction imply that they differed regarding the degree of social interaction and relation to the rules' context. This created differences regarding the articulation of knowledge and how this knowledge was distributed among the involved actors. However, the cases did not provide examples of systematic storing of intuitively-contextual railway knowledge.
Furthermore, the increased emphasis upon risk analyses as the fundament for rule development and intentions of increased use of outcome-oriented rules might strengthen this development. In addition, rationalistic knowledge generally holds a higher status than intuitively-contextual knowledge (Perby, 1995; Schön, 1991). Therefore, the status of the most intuitively-contextual elements of railway knowledge might become reduced in the future. The status of knowledge might also influence the motivation to retrieve information (Stein, 1995).
The ongoing deregulation of the Norwegian railway system might weaken the conditions for developing railway knowledge that holds an extensive overview of the complex interactions of the system. The deregulation process may also weaken the system's traditional conditions for socialization and existing communities of practice that were considered particularly important for transfer of tacit knowledge (Baumard, 1999; Lave & Wenger, 1991; Nonaka & Takeuchi, 1995; Wenger, 1998). The deregulation process has also caused a workforce reduction. Baumard (1999) warns against the danger that the need to renew knowledge might lead the firm to remove the representatives of the old knowledge. By doing this, they remove the tacit knowledge of the firm.
The deregulation process also implies increased complexity, uncertainty and ambiguity. Etzioni (1967) argues that under such conditions it might be necessary to increase investments in thorough studies of the situation. Accordingly, rich intuitively-contextual railway knowledge will be necessary to provide a good picture. However, with reference to the discussions by Turner and Reason of disaster incubation and latent conditions, there is a danger that even such railway knowledge is not sufficient to check for dangers that cannot easily be discovered (Turner & Pidgeon, 1997; Reason, 1997). Therefore it might be useful to search for the inclusion of alternative approaches to the traditional railway knowledge. In the framework for safety rule development that is applied in the SAMRAIL research (European Commission, 2004a), this might imply increasing investments in the first step of this framework. This step requires that the rule developers define the processes to be regulated, related accident scenarios and means to control the activities.
Seen together, if the rich intuitively-contextual railway knowledge is not stored in organizational memory by other means than those revealed in the study, the benefit of revived knowledge might be lost in the future. Also, the ongoing changes of the Norwegian railway system require judgements of the relevance of the experience-based railway knowledge for the current context and organizational learning.
There are already ongoing discussions in European railways about establishing learning agencies to further the development of organizational knowledge and its storing in organizational memory (European Commission, 2004b). With reference to the differences in the organizing of the studied cases, the organization of rule modifications can either be given the status of a learning agency or be linked to such agencies. Also, the inquiries into railway knowledge revealed that there are existing communities of practice within the Norwegian railway system that can be stimulated, as Wenger (1998) has discussed. Furthermore, there might be a potential for establishing useful communities of practice within the system, such as revealed in the Dutch railways (European Commission, 2004b). However, the results and discussions reveal that it is important to further elaborate solutions for storing and evaluation of railway knowledge.

5 CONCLUSIONS AND PRACTICAL IMPLICATIONS

The study reveals that the cases met the challenge of the deductive and risk-based top-down approach to safety rule development with a strategy given the name ''reverse invention''. This strategy can be seen as an inductive bottom-up approach to rule development where existing prescriptive rules and railway knowledge served as the core fundament for the development of the outcome-oriented rules.
The introduction of the deductive and risk-based top-down approach and the revealed process of reverse invention raised questions that initiated inquiries into railway knowledge. These inquiries made tacit knowledge more explicit, and knowledge became gathered and systematized, i.e. railway knowledge became revived. However, the revived knowledge became reduced into lean rationalistic forms. It remained uncertain whether the potential of inquiries
for organizational learning resulted in actual new knowledge.
The results and discussions of the study have practical implications:

◦ Safety rules can serve an important function as knowledge carriers about the operations of a system and associated dangers. This function should be taken into consideration when modifying such rules.
◦ Traditional prescriptive safety rules and associated knowledge can serve as a knowledge base for a transformation of rules into outcome-oriented rules, i.e. from rules expressing knowledge about what to do under certain conditions towards knowledge expressing intended outcomes. However, this strategy should take into consideration ongoing changes with potential for new and unexpected dangers.
◦ Introduction of a deductive and risk-based approach in an experience-based, prescriptive rule tradition can stimulate inquiries into existing knowledge. The inquiries can contribute to a revival and validation of this knowledge. However, the approach might also exclude knowledge that does not fit into the frameworks of the chosen rule solutions and risk analytic methods.
◦ Accordingly, organizations should judge the need for measures to protect safety-relevant, endangered knowledge.

These practical implications are based on only a few cases in one particular context. Accordingly, they should be critically judged before being applied to other contexts. To extend their generalizability, studies of modification processes in other contexts are required.
The authors want to thank The Research Council of Norway, which financed the work.

REFERENCES

Baumard, P. 1999. Tacit Knowledge in Organizations. London: Sage Publications Ltd.
Blakstad, H.C. 2006. Revising Rules and Reviving Knowledge. Adapting hierarchical and risk-based approaches to safety rule modifications in the Norwegian railway system. Doctoral thesis for the degree of doctor ingeniør. Trondheim: Norwegian University of Science and Technology (NTNU).
Ellström, P.E. 1996. Report: Operatörkompetans—vad den er och hur den kan utvecklas. DUP-resultat. Stockholm: NUTEK. (In Swedish)
European Commission. 2003a. Safety culture in nuclear and process control. Fifth Framework Program SAMRAIL. Appendix 10: WP 2.1.9. August 5, 2003.
European Commission. 2003b. SAMNET Glossary. Fifth Framework Program SAMNET Thematic Network. April 09, 2003.
Etzioni, A. 1967. Mixed-Scanning: A ''Third'' Approach To Decision-Making. Public Administration Review, December 1967, 385–392.
Gullowsen & Ryggvik, 2004. Jernbanen i Norge 1854–2004. Nye tider og gamle spor. Bergen: Vigmostad og Bjørke AS. (In Norwegian)
Hale, A.R. 1990. Safety rules O.K.? Journal of Occupational Accidents 12, 3–20.
Hale, A.R., Heming, B.H.J., Carthey, J. & Kirwan, B. 1997. Modelling of safety management systems. Safety Science 26(1/2), 121–140.
Hale, A.R., Heijer, F. & Koornneef, F. 2003. Management of safety rules: The case of railways. Safety Science Monitor 7, Article III-2, 1–11.
Hovden, J. 1998. Models of Organizations versus Safety Management Approaches: A Discussion Based on Studies of the ''Internal control of SHE'' Reform in Norway. In: Hale, A.R. & Baram, M. (Eds.). Safety management: the challenge of change. Oxford: Pergamon.
Kørte, J., Aven, T. & Rosness, R. 2002. On the use of risk analyses in different decision settings. Paper presented at ESREL 2002, Lyon, March 19–21, 2002.
Lave, J. & Wenger, E. 1991. Situated Learning. Legitimate peripheral participation. Cambridge: Cambridge University Press.
Lindblom, C. 1959. The Science of ''Muddling Through''. Public Administration Review 19, 79–88.
March, J.G. 1994. A Primer on Decision Making. New York: The Free Press.
Miles, M.B. & Huberman, A.M. 1994. Qualitative Data Analysis. London: Sage Publications Ltd.
Nonaka, I. & Takeuchi, H. 1995. The Knowledge-Creating Company. New York: Oxford University Press.
Perby, M.L. 1995. Konsten att bemästra en process. Om att förvalta yrkeskunnande. Hedemora: Gidlunds Förlag. (In Swedish)
Perrow, C. 1999. Normal Accidents. Living with High-Risk Technologies. Princeton: Princeton University Press. (First issued in 1984)
Rasmussen, J. 1997. Risk management in a dynamic society: A modelling problem. Safety Science 27(2/3), 183–213.
Rasmussen, J. & Svedung, I. 2000. Proactive Risk Management in a Dynamic Society. Karlstad: Räddningsverket.
Reason, J. 1997. Managing the risks of organizational accidents. Aldershot: Ashgate Publishing Limited.
Reason, J., Parker, D. & Lawton, R. 1998. Organizational controls and safety: The varieties of rule-related behaviour. Journal of Occupational and Organizational Psychology 71, 189–304.
Ryggvik, H. 2004. Jernbanen, oljen, sikkerheten og historien. In: Lydersen, S. (Ed.). Fra flis i fingeren til ragnarok. Trondheim: Tapir Akademisk Forlag. (In Norwegian)
Schön, D. 1991. The Reflective Practitioner. Aldershot: Arena, Ashgate Publishing Limited. (First issued in 1983)
Stein, E.W. 1995. Organizational Memory: Review of Concepts and Recommendations for Management. International Journal of Information Management 15(2), 17–32.
Strauss, A. & Corbin, J. 1998. Basics of Qualitative Research. California: Sage Publications, Inc.
Turner, B.A. & Pidgeon, N.F. 1997. Man-made disasters. Oxford: Butterworth-Heinemann.
Wenger, E. 1998. Communities of Practice: Learning as a Social System. Systems Thinker 9(5), 1–5.
Yin, R.K. 1994. Case Study Research. California: Sage Publications, Inc.
Risk Management in systems: Learning to recognize and respond to weak signals

E. Guillaume
Safety Science Group, Technological University of Delft, The Netherlands
ABSTRACT: The prevention of major accidents is at the core of high-hazard industries' activities. Despite the increasing safety level, industrial sites are asking for innovative tools. The notion of weak signals could enable the industries to anticipate danger and improve their safety management. Our preliminary results show the great interest and relevance of weak signals, but also the difficulty of treating them concretely within safety management. We found that certain organizational features act as ''weak signals blockers'': bureaucratic management of safety, linear and bottom-up communication, and a reactive safety management. In order to favor the treatment of weak signals, we should act on these organizational factors. This is the main objective of this PhD research.

1 INTRODUCTION

Accident prevention is a central issue in high-risk industries such as nuclear power plants, aviation and petrochemical plants. These industries have set up safer equipment and technical safety barriers, but also organizational and human barriers, in order to manage risk and improve their capacity to prevent accidents. Some researchers argue that industries' capacities to learn from their own experiences will enable them to improve accident prevention. Reporting systems such as Learning from Experience (Retour d'Expérience in French) aim at learning from failures and negative experiences. They are implemented to collect, analyze and share data on the accidents which occurred on sites. Despite the high relevance of the Rex system, many researchers (Bourrier, 2002; Dien, 2006; Amalberti and Barriquault, 1999) have shown two main weaknesses: limits (Rex provides mainly technical and direct causes of the accidents) and biases (Rex is used more as an enormous database than as an opportunity to share the lessons learnt from the accidents). The goal of this research is to try to overcome these limits by exploring new research areas. The issue of weak signals is emerging in industrial companies like EDF (Electricité de France), and academic research might provide relevant ideas. Defined as accident precursors, weak signals would make it possible to identify unsafe situations and degradation of the system. In that respect, identifying and taking the weak signals into account would favor proactive approaches and better accident prevention.
This research is a partnership agreed with a petrochemical company and a steel company, both situated in France. To collect data, case studies will be carried out on both sites. First, failure scenarios will be carefully studied. By looking from the end point of the accident, we will try to identify what weak signals were picked up by operational crews/shifts and the actions taken to take them into account (technical and organizational responses). Then, normal functioning will provide data on the human and technical tools, methodology and procedures used to manage risk every day on both sites.
This document is composed of two sections. The first will expose the main definitions of the weak signals. The second will try to describe the first data collected at the Arcelor site.

2 DEFINITION

2.1 Strategic management

The notion of weak signals has already been studied in several disciplines: history, geology and medicine (the latter more frequently uses the notion of ''forerunners''). Among these works, strategic management provides interesting views that are described here.
Several studies have tried to define the nature and the role of the weak signals. In literature dealing with ''strategic surveillance'', weak signals are more ''qualitative than quantitative, uncertain, fragmented and ambiguous'' (Mevel, 2004, p. 20–21).
The interest of weak signals lies in the role they play in strategic management. Building on Ansoff and McDonnell (1975), Lesca and Castagnos (2004), and Lesca and Blanco (2002), researchers argue that, threatened by an uncertain and changing environment, companies have to remain competitive and to capture early warning signals like weak signals. The more a company develops surveillance, the more it will be able to detect these signals and anticipate changes, ruptures or unexpected events. In that respect, weak signals are defined as ''anticipation information'' (Caron-Fasan, 2001).
We would like to discuss the adjective ''weak'', which is the conventional term always used. ''Weak'' implies that ''strong'' signals exist and emerge inside and outside organizations. Ansoff and McDonnell2 (1975) propose the following definition: weak signals are ''imprecise early indications about impending impactful events ( . . . ). Such signals mature over time and become strong signals.'' (p. 20–21). The most appropriate adjective would actually be ''early'' signals. In fact, as we will see later, a weak signal is a signal which emerges long before an accident occurs. Despite its uncertainty and its fuzziness, the challenge is to detect such signals early in order to implement a strategic response. Over time, weak signals become strong signals, and the strategy will then be too late to respond correctly to the threat. Therefore, we assume that the nature of the weak signals is less important than their ''time evolution'' aspect. We will try to develop this idea later on.
The following section is a partial review of the main studies dealing with the weak signals in Safety Management.

2.2 The main definitions in the safety management field

This document presents the main contributions and tries to set a (temporary) definition of the weak signals. The following section is composed of three parts: the nature of weak signals, the way they emerge and the way they are detected and treated.
First of all, D. Vaughan (2003) defines a weak signal as a subjective, intuitive argument, and mainly ambiguous information. It is supported by informal information, which means that the threat to safety is not really clear for the members of the organization. According to M. Llory (1996), a weak signal is a forerunner, and a repetitive precursor warning of ''serious'' danger.
Then, in the context of Safety Management, two theoretical ''schools'' disagree. The first one (Turner, Vaughan) explains the occurrence of accidents by the existence of precursors. The second one (Perrow) considers accidents as a normal consequence of complex and ''coupled'' systems. This PhD research fits in the first stream, which brings us to underline the following statement: weak signals emerge before the accident, meaning that they could be captured and could avoid the accident. Many authors have attempted to describe the 'appearance' or the emergence of the weak signals. B. Turner and N. Pidgeon (1997) proposed the notion of ''incubation period'' (taken from the medical field), a period in which ''a chain of discrepant events develop and accumulate unnoticed'' (B. Turner, 1997, p. 381). During this period, many alarming signs are emerging but are not detected by the members of the organization. Some other authors use notions similar to the idea of ''incubation''. Roux-Dufort (2000) writes that a crisis is the product of a long gestation, during which organizational dysfunctions accumulate. Dien and Perlot (2006) state that an accident is not a fortuitous event, but would be the last stage of a process of damage to safety.
Despite the interest of these signals (the possibility to prevent an accident thanks to those accident precursors), many studies have pointed out the difficulty of treating them. Turner and Pidgeon (1997), Vaughan (2003) and Llory (1996) underline the real difficulty of treating weak signals in time, and therefore of preventing a disaster. In the chapter entitled ''the Bhopal precursors'', Llory (1996) stresses the role of whistleblowers, workers of the Bhopal plant, who complained about very bad safety conditions (e.g. damaged defense barriers). These workers tried to warn the management and the authorities of Madhya Pradesh, of which Bhopal is the state capital. But these whistleblowers were not listened to; their messages were not taken into account by the management. D. Vaughan (2003) underlined the role of the Thiokol engineer who tried to warn NASA. She quotes the engineer: ''we shouldn't ship anymore rocket until we got it fixed (O-ring)'' (2003, p. 254). Although engineers sent ''memos'' to warn the Challenger launch deciders, these messages were ignored. At that time, NASA believed more in formal and quantitative procedures than in messages based on intuition, qualitative and informal (which was the nature of the memos). She writes: ''the context of these memos made them weak signals to insiders at that time'' (2003, p. 255). These initiatives are often described as failures. The detectors, or whistleblowers, did not send the message to the right person or suffered from ''communication pathology'' (M. Llory uses a concept described by C. Dejours), which means that the communication between workers and deciders, and in that respect the amplification of the warning messages, is blocked.

2 They devoted an entire chapter to the weak signals, entitled 'Using weak signals', in which they describe how to respond to the issue of weak signals.
We assume that the weak signals emerge before the accident, during the incubation period. They are detected (by people we call whistleblowers) but are not treated and taken into account. The PhD research assumes that, in the pathway ''detection-treatment'', some factors do not allow the treatment of these signals. In fact, the organization would not accept or recognize the relevance of these signals and the legitimacy of the members who decided to support and amplify them. My PhD research will focus on the identification of these factors to understand the difficulty in treating them.
The weak signals, in the field of Safety Management, are obviously accident precursors (if we admit that such signals exist before an accident). Detected in time, they could prevent and even stop an accident. The studies quoted previously exposed the interesting role of the whistleblowers. Despite their actions, their messages were, in the cases quoted, ignored. Thus, the authors acknowledge how difficult the treatment of such signals within organizations is. The next section will focus on the treatment of the weak signals.

3 TREATING THE WEAK SIGNALS

3.1 Communication channels

The issue of weak signals reveals new issues related to the interpretation of the information emerging from outside and inside the organization. Turner and Pidgeon (1997) propose to go back to the general properties of information by paying attention to the manner in which information is dealt with in communication theory. To the authors, if the information transmitted falls into the available sets of categories, then the information will be received. If the message falls outside these recognized categories, the information will be regarded as 'error'. We assume that the weak signals, regarding their nature (ambiguous, uncertain) and their discontinuous way of emerging, are ignored partly because they are incompatible with a closed communication system. No category is available to interpret these signals. In other words, three options can be taken into account. First, the weak signals are ignored because people have no tools to interpret them. The signals are merely seen as irrelevant information. Then, the signals can be intentionally ignored because they increase uncertainty and unsafe work conditions. Turner and Pidgeon (1997) write ''we may thus regard this kind of event, which was not assigned a place in the relevant system of classes'' (p. ). This is highly relevant for understanding the issue of weak signals. Finally, these signals were ignored because of the difficulty of picking the relevant ones up.
The following section deals with the factors which may block or amplify the treatment of weak signals, the main assumption of this PhD research.

3.2 Cognitive and organizational frames: Weak signals blockers and amplifiers

Based on several accident analyses, Turner (Weick, 1998) pointed out several features to explain why the signals emerging during the incubation period were mostly ignored. The rigidities in perception and beliefs in organizational settings are the first features. This means that the possibility to detect disasters can be inhibited by cultural and organizational factors. Culture can lead in this case to a collective blindness to important issues. Then, in the same book, B. Turner mentions ''organizational exclusivity'', which means that an organization can disregard non-members who try to amplify warning information. Finally, he points out the capacity to minimize the emergent dangers. The organizational features seem important to understand why such events go unnoticed. A number of French studies point out the communication system, underlining its cognitive aspect. Bourrier and Laroche (2000) describe a similar phenomenon. They state that organizations have some difficulties in treating information correctly because of cognitive categories set 'a priori'. These filters can lead the organization to blindness. Finally, Vaughan observed that culture could be a factor in the ignorance of weak signals. Based on the analysis of the Challenger accident (2003), Vaughan writes that NASA tended to accept anomalies as the normal functioning, leading to a process of ''deviance normalization''.
After describing the phenomenon and trying to identify the main reasons for weak signals 'blocking' in an organizational context, some authors attempt to explore solutions in order to amplify them.
The main improvements proposed by the researchers deal with an organization's capacity to be surprised. Ansoff and McDonnell (1975) emphasize environmental surveillance, awareness, and internal flexibility. Roux-Dufort (2000) asserts that weak signals do not fit in any preconceived coding process. To capture such precursor signals, he writes that the organizations should accept information or events which do not fit in the knowledge and technology already implanted. Bourrier and Laroche (2000) agree that these categories 'a priori' have to be reviewed to treat the information correctly and to be able to capture the 'unexpected events' (Turner and Pidgeon, 1997). Finally, Östberg (2006) argues that human intelligence would be a way to take the weak signals into account. He writes that intelligence ''refers to have a bearing on the dealing with surprise'' (p. 19). He writes: ''obviously, on the one hand

849
human qualifications for intelligent performance are decisive for the outcome of efforts to cope with surprises and, on the other hand, the preparedness for surprising events by intelligence activities enhances the proper reaction in the prevailing circumstances.’’ (p. 19).
The previous section is an attempt to define weak signals from the academic literature. Although based on empirical works, the quoted studies reveal generic definitions, of limited relevance and use for studying weak signals in practice. But what are, concretely, the weak signals? What do they refer to? What do they say about the system studied (technical and social), its risks and accidents? These points will be discussed in the following section.

4 PRELIMINARY RESULTS: WEAK SIGNALS AND TREATMENT, EXAMPLES IN A STEEL PLANT AND A REFINERY

4.1 Methodology

The research project planned to collect data on weak signals by carrying out case studies on failure scenarios, success stories and normal functioning.
This research aims at studying weak signals of major accidents. The INES scale defines major accidents as ‘‘events with major effects outside the site, implying consequences on environment and health’’. Fortunately, such events are very rare. The scenarios investigated rather concern ‘‘near-misses’’, that is to say events that could have been more serious in other circumstances.
We carried out five case studies on ‘‘failure’’ scenarios. These accidents occurred in three operational departments: the coking plant, the steel making plant and the energy dispatching department. The objective of these case studies was to look back from the end point of the accident and to identify: were weak signals picked up by operational crews/shifts on the shop floor? What actions were taken to take them into account (technical and organizational responses)? Why were no actions taken? What lessons were learnt from these weak signals afterwards?
Then, some observations on normal functioning were carried out. They provided data on the tools (learning processes such as Learning from Experience, risk analysis, audits), the safety barriers (technical and organizational), but also the management implemented on both sites to manage risk. Finally, we came up against the difficulty of investigating success stories. In fact, success stories are not, by that very fact, based on critical events. They are not recorded in reporting systems and, with regard to accident prevention, they do not seem to deserve any further analysis. This idea must be explored further in the petrochemical plant.
These case studies provided interesting findings on these four items. The following section will try to describe these first results.

4.2 From a generic definition to a more specific definition

Weak signals could be theoretically defined as accident precursors alerting of an imminent event. The case studies appeared as an important step in the PhD research. As a matter of fact, we came up with a more specific and practical definition of weak signals. They are defined in a specific context (an organization, a site), related to a specific activity (steel making) and to specific risks.
The main findings concern three items: the difficulty of picking up the weak signals in time, signals interpretation, and the channels of communication used by people trying to transmit them. Lessons learnt from the weak signals are presented in section 5.

4.3 The difficulty to pick up the weak signals

The scenarios investigated refer to standard operations, but the events described appeared as real surprises. Consequently, the signals emerging before the accident (during the incubation period, as Turner showed) were detected but not interpreted as ‘‘accident precursors’’. The difficulty of weak signals lies in the capacity of people to combine several pieces of information dispatched in the organization and to give sense to them. As Turner wrote, the information was already in the organization but not interpreted. Then, the scenarios investigated showed that the accident was the result of a long process of safety degradation. During this period, many signals had been detected and recorded in reporting systems, but rarely connected.
Example 1: Explosion of a calcium carbide bunker
The steelmaking plant uses calcium carbide, which is meant to take sulphur away from pig iron. Acetylene, produced by contact between calcium carbide and water, is a well-known risk, and the site protects the bunker against water infiltration. However, on 16 December 2005, the bunker exploded. The presence of water clearly explained the accident. People involved in the investigation discovered (thanks to an external expert) that the roof of the bunker was corroded. In fact, the roof had been painted for maintenance reasons, except in an unreachable part of it: birds were living there, attacking and scaring the maintenance operators. Dirt favored the corrosion, degraded the roof little by little and enabled water to soak into the bunker.
Signals appeared obvious and relevant to people after the event. Indeed, the analysis revealed a long process of safety degradation which led to the bunker explosion: maintenance problems (corrosion), a maladjusted roof (liable to rust) and the possibility that water
seeps through the bunker’s roof. Though these pieces of information were detected, known and recorded, the combination between them was not taken into account. The objective is now to understand what factors blocked the possibility of giving sense to these signals. The PhD research is still in the data collection stage, and these factors have not been identified yet.

4.4 Channels of signals transmission

The case studies showed clearly that weak signals were detected but not taken into account. We believe that the problem lies in the translation of the detected signals and in their transmission to the right person or target. Some academic studies have tried to identify ways to bring the relevant information to the organization (and particularly to people in a position to make decisions on that basis).
We identified three ways to transmit the signals detected in the organization:

• Whistleblowers
• Safety and health committees
• Operational visits

This paper will stress the first, ‘‘whistleblowers’’, because pieces of data on the others are still missing. Whistleblowers are considered as a channel to transmit information related to safety [16]. In the failure scenarios, whistleblowers obviously failed.
Example 2: Explosion of an oxygen pipe
The fluids expert of the energy dispatching department explained in his interview that in 2005 he advised the department manager to train operators about the dangers of oxygen. His alert was ignored and, a few months later, in February 2005, an oxygen pipe exploded.
The issue of whistleblowers is interesting. On the basis of the interviews carried out, the whistleblowers would be experts. They are educated (PhD, training courses, master’s degrees) and have knowledge of techniques and safety. The resources they used to detect dangerous situations are based on their role (they work as experts and give advice to management), their knowledge and their experience. But obviously, their message was ignored. At this stage of the research, we can only suppose that the signals they tried to transmit were based more on intuition and qualitative data than on formal and quantitative data. Although these signals were already considered strong by the whistleblowers, they remained weak for the receivers because of an organizational culture based on tangible proof. Consequently, their message was not taken into account.

5 LESSONS LEARNT FROM WEAK SIGNALS

As we mentioned previously, weak signals are obvious after the accident. Once identified, people learnt lessons from the weak signals, particularly in the new design of the damaged installations. For instance, the Design Department, in charge of the bunker rebuilding, took the expert recommendations into account when implementing new safety barriers and better detection systems (acetylene and temperature). However, we must admit that weak signals are still ignored before they lead to an accident. The explanation seems to lie in organizational factors playing the role of blockers. This hypothesis will be tested in the last period of case studies, which we will carry out from April 2008.

CONCLUSION

As a conclusion, weak signals seem to have practical relevance on such sites, and they are considered difficult to pick up. This difficulty lies not in detection but in the capacity to give sense to several pieces of information and in the possibility of transmitting them. We found that weak signals were indeed detected, and we identified channels which enable the information to be brought to the relevant people. However, certain factors block this process and impede the opportunities to learn. The next stage of the research will be precisely to analyze the data and reveal what factors (organizational, cultural, individual) block the opportunities to prevent the accidents investigated.

ACKNOWLEDGEMENTS

I warmly thank the French Foundation for a Safety Culture (FonCSI), which provides financial support and fieldwork.

REFERENCES

Amalberti R. and Barriquault C., 1999, ‘‘Fondements et limites du Retour d’Expérience’’, in Annales des Ponts et Chaussées: Retours d’Expérience, n. 91, pp. 67–75.
Ansoff I. and Mc Donnell E., 1990, Implanting strategic management, Second edition, Prentice Hall International, United Kingdom.
Bourrier M. and Laroche H., 2000, ‘‘Risques de défaillance: les approches organisationnelles’’, in Risques, erreurs et défaillances, Actes de la première séance du Séminaire ‘Le risque de défaillance et son contrôle par les individus et les organisations dans les activités à hauts risques’, publications MSH Alpes, pp. 15–51.
Bourrier M., 2002, ‘‘Bridging research and practice: the challenge of ‘normal operations’’’, in Journal of Contingencies and Crisis Management, vol. 10, n. 4, pp. 173–180.
Caron-Fasan M.L., 2001, ‘‘Une méthode de gestion de l’attention des signaux faibles’’, in Systèmes d’Information et Management, vol. 6, n. 4.
Chateauraynaud F., 1999, Les sombres précurseurs. Une sociologie pragmatique de l’alerte et du risque, Editions EHESS, Paris.
Dien Y. and Perlot S., 2006, ‘‘Cassandre au pays des risques modernes’’, 29ième Congrès National de Médecine et Santé au Travail, Lyon.
Lesca H. and Blanco S., 2002, ‘‘Contribution à la capacité des entreprises par la sensibilisation aux signaux faibles’’, 6eme Congrès International Francophone sur la PME, HEC Montréal.
Lesca H. and Castagnos J.-C., 2004, ‘‘Capter les signaux faibles: comment amorcer le processus?’’, Economica e Gestao, Brésil, vol. 4, n. 7, pp. 15–34.
Llory M., 1996, Accidents industriels, le coût du silence, L’Harmattan, Paris.
Mevel O., 2004, ‘‘Du rôle des signaux faibles sur la reconfiguration des processus de la chaîne de valeur de l’organisation: l’exemple d’une centrale d’achats de la grande distribution française’’, Thèse de doctorat en sciences de gestion, Ecole doctorale Lettres, Langues, Société et Gestion et Ecole Nationale Supérieure des Télécommunications de Bretagne.
Ostberg G., 2006, ‘‘An unassorted collection of remarks on aspects, perspectives and dimensions of weak signals’’, University of Lund, Sweden.
Roux-Dufort C., 2000, ‘‘Aspects socio et culturels des signaux faibles dans les organisations’’, Association ECRIN, 18 mai, Paris.
Turner B.A. and Pidgeon N.F., 1997, Man-Made disasters, Second edition, Wycheham Publications, London.
Vaughan D., 1996, The Challenger launch decision: risky technology, culture and deviance at NASA, University of Chicago Press, United States.
Weick K.E., 1998, ‘‘Foresights and failure: an appreciation of Barry Turner’’, in Journal of Contingencies and Crisis Management, vol. 6, n. 2, pp. 72–75.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Author index

Aamo, O.M. 3311 Asada, Y. 33 Bertsche, B. 875, 2233
Aase, K. 1385 Astapenko, D. 2021
Abramowicz-Gerigk, T. Astarita, G. 27 Beyer, S. 1539
3343 Aubry, J.F. 2549 Bhattacharya, D. 1915
Achermann, D. 2061 Auder, B. 2107 Bianco, C. 2727
Adel-Aissanou, K. 611 Augutis, J. 1867, 2569, Bigham, J. 3109
Adjabi, S. 611 2575, 3101 Birkeland, G. 365
Affeltranger, B. 3093 Ault, G.W. 2601 Bischof, N. 2789
Agnello, P. 137 Aven, T. 323, 365, 1207, Bladh, K. 227
Ait-Kadi, D. 641 1335, 2081 Blakstad, H.C. 839
Aït-Kadi, D. 1001 Avram, D. 477 Bloemhoff, A. 777
Akiba, T. 1839 Azi, M. 611 Blokus-Roszkowska, A.
Albert, I. 2609 2269
Albrechtsen, E. 407, 2649 Badía, F.G. 575 Bočkarjova, M. 1585,
Alcock, R.E. 415, 993 Baggen, J.H. 1519 2781, 2817
Ale, B.J.M. 2223, 2715 Baicu, F. 1027 Božek, F. 2613
Algora, C. 1949 Baksteen, H. 767, 777 Bolado Lavín, R. 2899
Alkali, B.M. 515 Balakrishnan, N. 1915 Bonanni, G. 2501
All, R. 391 Bank, M. 351, 2675 Bonvicini, S. 1199
Allaix, D.L. 1621 Baraldi, P. 2101 Bordes, L. 593
Almeida Jr., J.R. 3177 Barbarini, P. 1049 Borell, J. 83, 3061
Althaus, D. 2239 Barker, C. 619 Borgia, O. 211
Altiok, T. 3257 Barnert, T. 1463 Borsky, S. 973
Alzbutas, R. 1819 Barone, S. 2251 Bosworth, D.A. 2353
Amaral Netto, J.D. 2587 Barontini, F. 2345 Bouissou, C. 3191
Amari, S.V. 1763 Barros, A. 2003, 3125 Bouissou, M. 1779
Amendola, L. 2409, 2415 Bartlett, L.M. 2021 Boussouf, L. 2135
Amyotte, P.R. 1147 Basco, A. 3085 Bóveda, D. 1533
Ancione, G. 3143 Basnyat, S. 45 Bründl, M. 2773, 2789
Andersen, T.K. 259 Bayart, M. 3245 Braarud, P.Ø. 267
Andrade Herrera, I. 733 Beard, A.N. 2765 Braasch, A. 2239, 2245
Andrews, J.D. 1739, 1873 Becerril, M. 1415 Bragatto, P.A. 137
Aneziris, O.N. 767, 777, Bedford, T. 515, 987 Brandowski, A. 3331
787 Belhadaoui, H. 1829, Brandt, U.S. 3031
Angelo Bragatto, P. 2701 2549 Briš, R. 489
Ansaldi, S. 137 Bellamy, L.J. 767, 777, Brik, Z. 1937
Antão, P. 3265 2223 Brissaud, F. 2003
Antoni, M. 3231 Benard, V. 3245 Brunet, S. 3007
Antonioni, G. 2397, 2749 Benjelloun, F. 2369, Buchheit, G. 2549
Antonsen, S. 1377 3067 Buderath, M. 2175
Arbaretier, E. 1937 Béerenguer, C. 3125 Bünzli, E. 2641
Arezes, P.M. 761 Bérenguer, C. 469, 531, Burciu, Z. 3337
Arizmendi, M. 205 593, 2003 Burgazzi, L. 1787, 2899
Arnaiz, A. 2175, 3223 Berg, H.P. 1439 Burgherr, P. 129
Arnaldos, J. 1073, 2421 Bernhaupt, R. 45 Busby, J.S. 415, 993, 1251,
Artacho, M.A. 2409, 2415 Berntsen, P.I.B. 3311 1325
Artiba, A. 641 Berrade, M.D. 575 Bye, R.J. 1377

Cabarbaye, A. 2185, 2217 Costescu, M. 99 Dutfoy, A. 2093
Cabrera, E. 2447 Coulibaly, A. 1001 Dutta, B.B. 3323
Cadini, F. 477 Courage, W.M.G. 2807 Dutuit, Y. 1173
Calixto, E. 957, 1273 Cozzani, V. 1147, 1199, Dvořák, J. 2613
Calle, E.O.F. 2807 2345, 2397, 2749, 3153 Dwight, R.W. 423
Camargo Jr, J.B. 2207 Craveirinha, J. 2627
Campedel, M. 2749 Crespo Márquez, A. 669, Ebrahimipour, V. 1125,
Campos, J. 1217 929 2379
Cañamón, I. 163 Crespo, A. 687, 829 Egidi, D. 2397
Carbone, V.I. 1621 Cugnasca, P.S. 1503 Eide, K.A. 1747, 2029
Carfagna, E. 3217 Eisinger, S. 365, 2937
Carlé, B. 89 D’Auria, F. 2899 El-Koujok, M. 191
Carlos, S. 2827, 2837 Damaso, V.C. 497 Engen, O.A. 1423
Carr, M.J. 523 Damen, M. 767, 777 Erdos, G. 291
Carrión, A. 2447 Dandache, A. 2549 Eriksson, K. 83, 3061
Carvalho, M. 587 da Silva, S.A. 243 Esch, S. 1705
Casal, J. 1073, 1119 David, J.-F. 981 Escriche, I. 2275, 2289
Castanier, B. 469, 3171 David, P. 2259 Escrig, A. 2743
Castillo, C. 2473 de Almeida, A.T. 627, 1165 Esperón, J. 3, 121
Castillo, E. 2473, 2689 De Ambroggi, M. 1431 Espié, E. 2609
Castro, I.T. 463 De Carlo, F. 211 Espluga, J. 1301, 1371,
Cauffriez, L. 3245 de M. Brito, A.J. 1165 2867
Cavalcante, C.A.V. 423, De Minicis, M. 1495 Eusgeld, I. 2541
627, 1165 De Souza, D.I. 919 Eustáquio Beraldo, J. 1273
Chang, D. 703 De Valk, H. 2609 Expósito, A. 3, 121
Chang, K.P. 703 de Wit, M.S. 1585 Eymard, R. 155
Charpentier, D. 2003 Debón, A. 2447
Chatelet, E. 1731, 3093 Debray, B. 3191 Faber, M.H. 1567
Chen, J.R. 2757 Dehghanbaghi, M. 2379 Faertes, D. 2587
Chen, K.Y. 39 Dehombreux, P. 2117 Fallon, C. 1609, 3007
Chen, X. 863, 1663 Deleuze, G. 1309, 3093 Fan, K.S. 2757
Chiu, C.-H. 1651 Deloux, E. 469 Faragona, B. 3217
Cho, S. 2851 Delvenne, P. 3007 Farré, J. 1301
Choi, Y. 2913 Delvosalle, C. 2369, 3067 Faško, P. 1671
Chojnacki, E. 697, 905 Denis, J.-B. 2609 Faure, J. 2185, 2217
Chou, Y.-P. 2405 Depool, T. 2409, 2415 Fechner, B. 147
Christley, R.M. 2317 Dersin, P. 2117, 3163 Fernández, A. 1533
Christou, M.D. 2389 Despujols, A. 531 Fernández, I. 3, 121
Chung, P.W.H. 1739 Destercke, S. 697, 905 Fernández, J. 205, 1395
Ciancamerla, E. 2501 Deust, C. 3191 Fernández-Villodre, G.
Clarhaut, J. 3199 Di Baldassarre, G. 2749 1755
Clavareau, J. 455 Di Gravio, G. 1495 Fernandez Bacarizo, H. 559
Clemente, G. 505 Di Maio, F. 2873 Ferreira, R.J.P. 1165
Clemente, R. 2501 Dien, Y. 63 Ferreiro, S. 2175
Clímaco, J. 2627 Dijoux, Y. 1901 Feuillard, V. 2135
Clough, H.E. 2317 Diou, C. 2549 Fiévez, C. 2369, 3067
Cocquempot, V. 3199 Dohnal, G. 1847 Figueiredo, F.A. 627
Cojazzi, G.G.M. 3135 Doménech, E. 2275, Finkelstein, M.S. 1909
Colli, A. 341, 2715 2289 Flage, R. 1335, 2081
Collins, A. 1251 Dondi, C. 2397 Flammini, F. 105
Conejo, A.J. 2689 Dong, X.L. 2845 Fleurquin, G. 2117
Conrard, B. 3199 Doudakmani, O. 787 Fodor, F. 1309
Contini, S. 1009, 3135 Downes, C.G. 1739, 1873 Forseth, U. 3039, 3047
Cooke, R.M. 2223 Driessen, P.P.J. 369 Fouladirad, M. 567, 593,
Cooper, J. 2223 Duckett, D.G. 1325 2003
Cordella, M. 2345 Duffey, R.B. 941, 1351 Frackowiak, W. 3331
Cornil, N. 2369, 3067 Dunjó, J. 2421 Franzoni, G. 1049

Fraser, S.J. 2353 Grall, A. 531, 567, 3125 Hryniewicz, O. 581
Frenkel, I. 483, 551 Grande, Ø. 2937 Hsieh, C.C. 1267
Frimannslund, L. 2963 Grande, O. 1431, 3265 Huang, W.-T. 1651
Froihofer, L. 1539 Grazia Gnoni, M. 2701 Hurrell, A.C. 749
Frutuoso e Melo, P.F. 497 Grenier, E. 2609 Huseby, A.B. 1747, 2029,
Fuchs, P. 635 Groth, K.M. 113 2199
Fugas, C. 243 Gruber, M. 2675 Hwang, M. 2861
Furuta, K. 33 Guaglio, G. 2541
Fuster, V. 1395, 1401 Gucma, L. 3285 Iacomini, A. 2501
Guedes Soares, C. 881, Ibáñez, M.J. 2743
Gagliardi, R.V. 27 3265 Ibáñez-Llano, C. 2051
Gaglione, A. 105 Gugliermetti, F. 1341 Idasiak, V. 2259
Galarza, N. 727 Guida, M. 2251 Idée, E. 1901
Galassi, G. 2899 Guidi, G. 1341 Innal, F. 1173
Galdámez, P. 1539 Guillaume, E. 847 Iooss, B. 2107, 2135,
Gallay, A. 2609 Guo, B. 649 2899
Gámiz, M.L. 2013 Gurley, K. 2453 Ipiña, J.L. 727
Gómez Fernández, J. 929 Gutteling, J.M. 1317, 1585 Isaksen, S.L. 1747, 1891,
Gómez Fernández, J.F. 669 Guttormsen, G. 813 2029, 2937
Gómez, J.F. 687, 829 Izquierdo, J.M. 121, 163
Gómez-Mares, M. 1119 Håbrekke, S. 805
Gäng, J. 2233 Hagen, J.M. 407, 2649 Jallouli, M. 2549
Gåsemyr, J. 1747, 2029 Hamid, S. 2453 Jamieson, R. 1447
Gamiz, M.L. 2447 Hänle, A. 1547, 1555 Jammes, L. 305
Gamo, L. 3, 121 Hansson, L. 733 Janilionis, V. 1819
Ganapini, S. 1199 Hardeman, F. 89 Jarl Ringstad, A. 813
García Ortiz, J.C. 1539 Hardman, G. 987 Jeong, J. 2619
García, B. 757 Häring, I. 1547, 1555 Jiang, Y. 863, 1663
García-Díaz, J.C. 201 Harrami, O. 391, 399 Jo, K.T. 913
Garcia, P.A.A. 497 Harvey, J. 291, 299, 1447 Jóźwiak, I.J. 1929
García-Bertrand, R. 2689 Hauge, S. 2921 Jodejko, A. 1065
Gayen, J.-T. 1283 Haugen, K.-E. 1489 Joffe, H. 1293
Gerbec, M. 1473, 2157 Haugland, D. 2963 Johansson, J. 2491
Gerigk, M. 3303 Hauschild, J. 2245 Johnsen, S.O. 805
Geurts, P.A.T.M. 2781 Hausken, K. 1157 Jongejan, R.B. 1259
Gil, A. 205 Haver, K. 2929 Jönsson, H. 2491
Gil, J. 3, 121 Hayat, S. 3199 Jordá, L. 727
Gillon, P. 3007 Helland, A. 361 Jore, S.H. 3077
Giménez, M. 2899 Hepsø, V. 813, 1407 Joris, G. 1609
Giner-Bosch, V. 2735 Hernández-Simón, L.M. 11 Jóźwiak, I.J. 1455
Ginestar, D. 175 Herrera, I.A. 19 Jóźwiak, K. 1455
Giorgio, M. 2251 Herrero, R. 121 Jun, L. 1943
Girard, Ph. 331 Heslop, S. 299 Jung, K. 1629, 1635
Giraud, J.-B. 2987 Hildebrandt, M. 267 Jung, W. 221
Glor, M. 1217 Holicky, M. 1629 Jung, W.S. 2913
Goeschka, K.M. 1539 Holmberg, J.-E. 227 Juocevičius, Virg. 1641
Gomes, T. 2627 Hong, Y. 1943 Juocevičius, Virm. 1641
González Díaz, V. 929 Hoon Han, S. 2619 Juocevičius, V. 1677
González, J.R. 1949 Hoppe, G. 2037
González, P. 3, 121 Horlick-Jones, T. 1301, Kalusche, W. 2431
González, V. 669, 687, 829 1371, 1601, 2867 Kamenický, J. 891
Gonzalo, J. 1301 Hortal, J. 3, 121, 379 Kangur, K. 797
Gordon, P. 423 Hossein Mohammadian M., Kanno, T. 33
Goti, A. 2707 S. 1001 Kar, A.R. 3323
Gouriveau, R. 191 Hou, H.-Y. 2405 Karlsen, J.E. 1595
Goyal, S. 949 Hovden, J. 839 Kastenholz, H. 361
Grachorloo, N. 2151 Høyland, S. 1385 Kayrbekova, D. 2955

Kazeminia, A. 2245 Lèbre La Rovere, E. 957, Massaiu, S. 267
Kellner, J. 2613 1273 Mateusz, Z. 3237
Kermisch, C. 1357 Lebrun, R. 2093 Matuzas, V. 2569
Khan, F.I. 1147 Lecoze, J.C. 3191 Matuzienė, V. 2575
Khatab, A. 641 Lei, H.T. 649 Matuziene, V. 3101
Khvatskin, L. 483 Leira, B.J. 3311 Mavko, B. 1771
Kim, K.Y. 2913 Leitão, A.F. 675 Mazri, C. 3191
Kim, M.C. 2909 Lejette, F. 3231 Mazzocca, N. 105
Kim, S. 2851 Lemes, M.J.R. 2207 McClure, P. 2295
Kiranoudis, C. 281 Leopold, T. 875 McGillivray, B.H. 993
Kleyner, A.V. 1961 Lerena, P. 1217 McMillan, D. 2601
Kloos, M. 2125 Lettera, G. 2701 Mearns, K. 1415
Kobelsky, S. 1141 Levitin, G. 1157, 1723 Medina, H. 1073
Kohda, T. 1035 Li, Pan 79 Medonos, S. 1239
Koivisto, R. 2511 Li, Z.Z. 2845 Medromi, H. 2549
Kollmann, E. 2641 Limbourg, P. 1705 Mehers, J.P. 2317
Kolowrocki, K. 1969, 1985 Limnios, N. 2167 Mehicic Eberhardt, S. 2431
Konak, A. 2657 Lin, P.H. 2223 Meier-Hirmer, C. 3183,
Kongsvik, T. 733 Lindøe, P.H. 1595 3231
Konstandinidou, M. 281, Lindhe, A. 1041 Meléndez, E. 121, 2051
767, 777 Lins, I.D. 541 Meliá, J.L. 243, 1415
Kontic, B. 2157 Lirussi, M. 2727 Membrë, J.-M. 2295
Korczak, E. 1795 Lisi, R. 1019, 3143 Mendes, J.M. 1577
Kortner, H. 1489 Lisnianski, A. 483, 551 Mendizábal, R. 379, 2827,
Kosmowski, K.T. 249, 1463 Lizakowski, P. 3319 2837, 2891
Kosugi, M. 2305, 2311 LLovera, P. 1401 Meneghetti, A. 2727
Koucky, M. 1807, 1813 Loizzo, M. 2987 Menoni, S. 3023
Koutras, V.P. 1525 Lonchampt, J. 531 Mercier, S. 155, 603
Kovacs, S.G. 99 Lopes, I.S. 675 Merz, H.M. 2773
Kowalczyk, G. 449 López Droguett, E. 541 Meyer, P. 275
Kratz, F. 2259 Lorenzo, G. 2899 Meyna, A. 2239
Krikštolaitis, R. 2575, 3101 Lukoševiciene, O. 1685 Mikulová, K. 1671
Kröger, W. 2541 Lundteigen, M.A. 2921 Milazzo, M.F. 1019, 3143
Krummenacher, B. 2773 Miles, R. 1251
Kubota, H. 2305, 2311 MacGillivray, B.H. 415 Mínguez, R. 2473, 2689
Kudzys, A. 1677, 1685 Maftei, E. 2333 Minichino, M. 2501
Kuiper, J. 767, 777 Magott, J. 1055 Missler-Behr, M. 2431
Kujawski, K. 1929 Mai Van, C. 2797 Mlynczak, M. 57
Kulot, E. 1049 Makin, A.-M. 739 Mock, R. 2641
Kulturel-Konak, S. 2657 Malassé, O. 1829, 2549 Moeller, S. 2431
Kurowicka, D. 2223 Malich, G. 1081 Molag, M. 3153
Kuttschreuter, M. 1317 Mancini, G. 1621 Moltu, B. 813
Kvernberg Andersen, T. Manuel, H.J. 1113 Monfort, E. 2743
3039, 3047 Marais, K.B. 659 Monteiro, F. 2549
Marcos, J. 1533 Montoro-Cazorla, D. 1955
Labeau, P.E. 455 Maris, U. 2325 Montoya, M.I. 1089
Labeau, P.-E. 559, 1357 Markatos, N. 281 Moonis, M. 2353
Laclemence, P. 3093 Markeset, T. 2945, 2955 Morales, O. 2223
Laheij, G.M.H. 1191 Marková, J. 1635 Moreno, J. 3
Lamvik, G.M. 2981 Marquès, M. 2899 Moreu de León, P. 669,
Landucci, G. 3153 Marrel, A. 2135 687, 829, 929
Langbecker, U. 3275 Martín, J. 869 Morra, P. 2345
Langeron, Y. 3125 Martinez-Alzamora, N. 441 Mosleh, A. 113
Larisch, M. 1547 Martorell, S. 175, 441, 505, Motoyoshi, T. 2685
Laulheret, R. 2185, 2217 1881, 2275, 2289, 2827, Muñoz, M. 1119
Le Bot, P. 275 2837, 2873, 2971 Muñoz-Escoí, F.D. 1539
Le Guen, Y. 2987 Maschio, G. 1019, 3143 Mud, M. 767, 777

Mulley, C. 299 Park, J.J. 1481 Quigley, J. 987
Mullor, R. 441 Park, S.D. 913 Quijano, A. 1395, 1401
Muslewski, L. 2037 Park, S.J. 913
Mutel, B. 1001 Parra, C. 687, 829 Rabbe, M. 2199
Parra Márquez, C. 669, 929 Rachel, F.M. 1503
Næsje, P. 259, 821, 1407 Pashazadeh, S. 2151 Raffetti, A. 3217
Næsje, P.C. 2981 Pečiulytė, S. 2575, 3101 Rajabalinejad, M. 717
Nøkland, T.E. 1207, 2929 Pearce, K. 1447 Rakowsky, U.K. 2045,
Naked Haddad, A. 919 Pecho, J. 1671 3055
Napolitano, N. 1495 Pedersen, L.M. 2581 Raman, R. 1239
Natvig, B. 1747, 2029 Pedroni, N. 709 Raschky, P.A. 965, 973
Navajas, J. 1301, 1371, Peiretti, A. 1119 Rasulo, A. 2519
2867 Pelayo, F. 379, 2827, 2837 Rauzy, A. 1173, 1937, 2051
Navarro, E. 757 Pelaz, A. 727 Real, A. 175
Navarro, J. 1915 Penalva, M.L. 205 Reer, B. 233
Navarro-Esbrí, J. 175 Pereira, G.A.B. 675 Reinders, J. 3153
Navrátil, J. 2613 Pérez, C.J. 869 Remenyte-Prescott, R. 1739
Nebot, Y. 2827, 2873 Pérez-Ocón, R. 1755, Renaux, D. 3245
Nedelec, B. 3191 1955 Renda, G. 3135
Neto, H.V. 761 Peschke, J. 2125 Revilla, O. 3223
Newby, M. 619 Pesme, H. 275 Rey-Stolle, I. 1949
Nguyen, H. 3331 Pey, A. 1217 Rezaie, K. 1125, 2379
Nicholls, J. 291 Pierlot, S. 63 Rhee, T.J. 703
Nicol, A.-M. 749 Pierro, F. 2899 Rheinberger, C.M. 1365
Nieto, F. 2051 Pinelli, J.-P. 2453 Riedstra, D. 1191
Niezgoda, T. 449 Pita, G.L. 2453 Rietveld, P. 2817
Nivolianitou, Z. 281 Pittiglio, P. 137 Rimas, J. 1819
Njå, O. 3077 Piwowar, J. 3093 Rivera, S.S. 1129, 1135
Nogueira Díaz, E. 899 Planas, E. 1089 Robin, V. 331
Norberg, T. 1041 Platis, A.N. 1525 Rocco S., C.M. 1803
Nordgård, D.E. 2561 Pock, M. 1829 Rocha Fonseca, D. 919
Nowakowski, T. 1055, Podofillini, L. 233 Rodríguez, G. 3, 121
1065, 1455, 1929 Podsiadlo, A. 3289, 3331 Rodríguez, V. 2707
Nunes, E. 587 Polič, M. 3015 Rodríguez Cano, D. 899
Núñez Mc Leod, J.E. 1129, Ponchet, A. 567 Røed, W. 2929
1135 Pop, P. 2333 Roelen, A.L.C. 2223
Núñez, N. 1949 Popenţiu Vlǎdicescu, F. Rohrmann, R. 1567
Nuti, C. 2519 2333 Román, Y. 2013
Popoviciu, N. 1027 Romang, H. 2789
Oh, J. 767, 777 Post, J.G. 767, 777 Rosén, L. 1041
Oliveira, A. 3177 Postgård, U. 391, 399 Rosness, R. 839
Oliveira, L.F.S. 1919, 2587 Pouligny, Ph. 3183 Roussignol, M. 155
Olmos-Peña, S. 11 Poupard, O. 2987 Rowbotham, A.L. 2353
Oltedal, H.A 1423 Poupart, E. 45 Rubio, B. 727
Oltra, C. 1301, 1371, 2867 Prades, A. 1301, 1371, Rubio, G. 757
Or, I. 3257 2867 Rubio, J. 727
Osrael, J. 1539 Pragliola, C. 105 Rücker, W. 1567
Özbaş, B. 3257 Praks, P. 559 Rudolf Müller, J. 2665
Prescott, D.R. 1873 Ruiz-Castro, J.E. 1755
Palacios, A. 1119 Prosen, R. 2883 Runhaar, H.A.C. 369
Palanque, P. 45 Proske, D. 2441
Pandey, M.D. 431 Pulcini, G. 2251 Sætre, F. 2635
Pantanali, C. 2727 Puuronen, S. 1995 Sabatini, M. 1199
Papazoglou, I.A. 767, 777, Pyy, P. 227 Sadovský, Z. 1671
787 Sagasti, D. 727
Paridaens, J. 89 Quayzin, X. 1937 Saleh, J.H. 659
Park, J. 221, 2909 Queral, C. 3, 121 Salzano, E. 3085

Samaniego, F.J. 1915 Soriano, M.L. 1395 Tveiten, C.K. 2997
Samrout, M. 1731 Soszynska, J. 1985 Tymoteusz, B. 3237
San Matías, S. 2735 Soto, E. 1533
Sánchez, M. 121 Sousa, S.D. 761 Ušpuras, E. 1867
Sánchez, A. 441, 505, 2707 Spadoni, G. 2397 Ulmeanu, A.P. 2167
Sand, K. 2561 Sperandio, S. 331 Ulusçu, O.S. 3257
Sansavini, G. 1861 Spitsa, R. 1141 Unagami, T. 2685
Santamaría, C. 757 Spouge, J. 2223 Uusitalo, T. 2511
Sant’Ana, M.C. 497 Stamenković, B.B. 3163,
Santos-Reyes, J.R. 11, 2765 3209 Vázquez López, M. 899
Sarshar, S. 183 Steen, R. 323 Vázquez, M. 1949
Saull, J.W. 1351 Steinka, I. 2269 Vaccaro, F. 3217
Savić, R. 1513 Stelmach, A.W. 2191 Vaidogas, E.R. 1641
Saw, J.L. 2353 Sterkenburg, R.P. 2363 Valis, D. 1807, 1813
Scarf, P.A. 423 Stevens, I. 3117 van den Berg, A. 1113
Scarlatti, A. 2501 Stian Østrem, J. 1335 van der Boom, R.P. 2223
Schäbe, H. 1283 Stoop, J.A. 1519 van der Most, H. 1585
Schiefloe, P.M. 2997 Strömgren, M. 391, 399 van der Sluijs, J.P. 369
Schmitz, W. 2511 Su, J.L. 2757 van der Veen, A. 2781
Schnieder, E. 2665 Subramanian, C.S. 2453 van der Weide, J.A.M. 431
Schröder, R.W. 1097 Sunde, L. 1489 van Erp, N. 717
Schweckendiek, T. 2807 Susperregui, L. 3223 van Gelder, P.H.A.J.M. 717,
Schwindt, M. 965 Suter, G. 1217 2797
Segovia, M.C. 881, 1955 Sykora, M. 1629 van Mierlo, M.C.L.M.
Serbanescu, D. 341, 2593, Szpytko, J. 1231 2807
2715 van Noortwijk, J.M. 431
Serradell, V. 2827, 2837 Takai, J. 2685 van Vliet, A.A.C. 1191
Servranckx, L. 2369, 3067 Tambour, F. 2369, 3067 Vanem, E. 3275
Shehata, S. 2285 Tao, J. 1663 van’t Sant, J.P. 1113
Shingyochi, K. 1715, 1839 Tarelko, W. 3289 Vanzi, I. 2519
Shokravi, S. 1125 Tavares, A.T. 1577 Vaquero, C. 727
Shu, C.M. 71, 1267 Tchórzewska-Cieślak, B. Vaurio, J.K. 1103
Shu, C.-M. 39, 2405 2463 Čepin, M. 1771, 2883
Siebold, U. 1547 Telhada, J. 587 Veiga, F. 205
Siegrist, M. 361 Terruggia, R. 2501 Verga, S. 315
Signoret, J.-P. 1173 Thöns, S. 1567 Verhoef, E.T. 2817
Silva, S.A. 1415 Thevik, H. 1335 Verleye, G. 3117
Simões, C. 2627 Thompson, H.A. 1881, Vetere Arellano, A.L. 341,
Simos, G. 281 2971 2593
Singh, M. 2945 Thorpe, N. 299 Vikland, K.M. 1377
Sipa, J. 57 Thorstad, H.H. 2581 Vílchez, J.A. 2421
Skarholt, K. 259, 821, Tian, Z. 1723 Viles, E. 205
1407, 2981 Todinov, M.T. 1655, 2143 Villamizar, M. 505
Skjerve, A.B. 941 Torres-Echeverria, A.C. Villanueva, J.F. 2827,
Skjong, R. 3275 1881, 2971 2837
Skorupski, J. 2191 Torvatn, H. 259, 2981, Vinnem, J.E. 1181
Skrobanek, P. 1055 3039, 3047 Vintr, Z. 1813
Sliwinski, M. 1463 Trainor, M.T. 2353 Vivalda, C. 305
Smalko, Z. 1231, 3337 Trijssenaar-Buhre, I.J.M. Vleugel, J.M. 1519
Smith, N.W. 1293 2363 Voirin, M. 63
Smolarek, L. 3295 Tronci, M. 1495 Vojtek, M. 1671
Sniedovich, M. 2071 Trucco, P. 1431, 3265 Volkanovski, A. 1771
Solano, H. 2447 Tseng, J.M. 71 Volovoi, V. 1961
Solberg, G. 733 Tsujimura, Y. 1839 Vrancken, J.L.M. 1519
Soliwoda, J. 3295 Tucci, M. 211 Vrijling, J.K. 1259, 2797
Son, K.S. 1481 Tugnoli, A. 1147, 2345 Vrouwenvelder, A.C.W.M.
Soria, A. 1401 Turcanu, C. 89 2807

Wagner, S. 2541 Woltjer, R. 19 Zanelli, S. 1199
Walls, L. 987 Woropay, M. 2037 Zanocco, P. 2899
Walter, M. 1705, 1829 Wu, C.C. 2757 Zendri, E. 2501
Wang, C. 113 Wu, S.H. 71, 1267 Zerhouni, N. 191
Wang, J. 3109 Wu, S.-H. 39, 2405 Železnik, N. 3015
Wang, W. 523 Zhang, C. 863, 1663
Wang, Y. 863 Xu, S. 2845 Zhang, T. 649
Wemmenhove, E. 2295 Xuewei Ji, A. 79 Zhao, X. 593
Weng, W.P. 1267 Zhu, D. 113
Wenguo Weng, B. 79 Yamaguchi, T. 1839 Zieja, M. 449
Werbinska, S. 1055, 1851
Yang, J.-E. 2861
Zilber, N. 3231
Zille, V. 531
Wiencke, H.S. 2929 Yannart, B. 2369, 3067
Wiersma, T. 1223 Yeung, T.G. 3171 Zio, E. 477, 703, 709,
Wiesner, R. 351 Yoon, C. 2861 1861, 2081, 2101, 2873
Wijnant-Timmerman, S.I. Yu, L.Q. 2845 Zubeldia, U. 3223
1223, 2363 Yufang, Z. 1943 Zuo, M.J. 1723
Wilday, A.J. 2353 Yukhymets, P. 1141 Żurek, J. 449
Wilson, S.P. 949 Žutautaite-Šeputiene, I.
Winder, C. 739, 1081 Zaitseva, E. 1995 1867
Winther, R. 183, 2635 Zajicek, J. 635 Zwetkoff, C. 1609

VOLUME 2

Table of contents

Preface XXIV
Organization XXXI
Acknowledgment XXXV
Introduction XXXVII

VOLUME 1

Thematic areas
Accident and incident investigation
A code for the simulation of human failure events in nuclear power plants: SIMPROC 3
J. Gil, J. Esperón, L. Gamo, I. Fernández, P. González, J. Moreno, A. Expósito,
C. Queral, G. Rodríguez & J. Hortal
A preliminary analysis of the ‘Tlahuac’ incident by applying the MORT technique 11
J.R. Santos-Reyes, S. Olmos-Peña & L.M. Hernández-Simón
Comparing a multi-linear (STEP) and systemic (FRAM) method for accident analysis 19
I.A. Herrera & R. Woltjer
Development of a database for reporting and analysis of near misses in the Italian
chemical industry 27
R.V. Gagliardi & G. Astarita
Development of incident report analysis system based on m-SHEL ontology 33
Y. Asada, T. Kanno & K. Furuta
Forklifts overturn incidents and prevention in Taiwan 39
K.Y. Chen, S.-H. Wu & C.-M. Shu
Formal modelling of incidents and accidents as a means for enriching training material
for satellite control operations 45
S. Basnyat, P. Palanque, R. Bernhaupt & E. Poupart
Hazard factors analysis in regional traffic records 57
M. Mlynczak & J. Sipa
Organizational analysis of availability: What are the lessons for a high risk industrial company? 63
M. Voirin, S. Pierlot & Y. Dien
Thermal explosion analysis of methyl ethyl ketone peroxide by non-isothermal
and isothermal calorimetry application 71
S.H. Wu, J.M. Tseng & C.M. Shu

V
Crisis and emergency management
A mathematical model for risk analysis of disaster chains 79
A. Xuewei Ji, B. Wenguo Weng & Pan Li
Effective learning from emergency responses 83
K. Eriksson & J. Borell
On the constructive role of multi-criteria analysis in complex decision-making:
An application in radiological emergency management 89
C. Turcanu, B. Carlé, J. Paridaens & F. Hardeman

Decision support systems and software tools for safety and reliability
Complex, expert based multi-role assessment system for small and medium enterprises 99
S.G. Kovacs & M. Costescu
DETECT: A novel framework for the detection of attacks to critical infrastructures 105
F. Flammini, A. Gaglione, N. Mazzocca & C. Pragliola
Methodology and software platform for multi-layer causal modeling 113
K.M. Groth, C. Wang, D. Zhu & A. Mosleh
SCAIS (Simulation Code System for Integrated Safety Assessment): Current
status and applications 121
J.M. Izquierdo, J. Hortal, M. Sánchez, E. Meléndez, R. Herrero, J. Gil, L. Gamo,
I. Fernández, J. Esperón, P. González, C. Queral, A. Expósito & G. Rodríguez
Using GIS and multivariate analyses to visualize risk levels and spatial patterns
of severe accidents in the energy sector 129
P. Burgherr
Weak signals of potential accidents at ‘‘Seveso’’ establishments 137
P.A. Bragatto, P. Agnello, S. Ansaldi & P. Pittiglio

Dynamic reliability
A dynamic fault classification scheme 147
B. Fechner
Importance factors in dynamic reliability 155
R. Eymard, S. Mercier & M. Roussignol
TSD, a SCAIS suitable variant of the SDTPD 163
J.M. Izquierdo & I. Cañamón

Fault identification and diagnostics


Application of a vapour compression chiller lumped model for fault detection 175
J. Navarro-Esbrí, A. Real, D. Ginestar & S. Martorell
Automatic source code analysis of failure modes causing error propagation 183
S. Sarshar & R. Winther
Development of a prognostic tool to perform reliability analysis 191
M. El-Koujok, R. Gouriveau & N. Zerhouni
Fault detection and diagnosis in monitoring a hot dip galvanizing line using
multivariate statistical process control 201
J.C. García-Díaz
Fault identification, diagnosis and compensation of flatness errors in hard turning
of tool steels 205
F. Veiga, J. Fernández, E. Viles, M. Arizmendi, A. Gil & M.L. Penalva

VI
From diagnosis to prognosis: A maintenance experience for an electric locomotive 211
O. Borgia, F. De Carlo & M. Tucci

Human factors
A study on the validity of R-TACOM measure by comparing operator response
time data 221
J. Park & W. Jung
An evaluation of the Enhanced Bayesian THERP method using simulator data 227
K. Bladh, J.-E. Holmberg & P. Pyy
Comparing CESA-Q human reliability analysis with evidence from simulator:
A first attempt 233
L. Podofillini & B. Reer
Exploratory and confirmatory analysis of the relationship between social norms
and safety behavior 243
C. Fugas, S.A. da Silva & J.L. Melià
Functional safety and layer of protection analysis with regard to human factors 249
K.T. Kosmowski
How employees’ use of information technology systems shape reliable operations
of large scale technological systems 259
T.K. Andersen, P. Næsje, H. Torvatn & K. Skarholt
Incorporating simulator evidence into HRA: Insights from the data analysis of the
international HRA empirical study 267
S. Massaiu, P.Ø. Braarud & M. Hildebrandt
Insights from the ‘‘HRA international empirical study’’: How to link data
and HRA with MERMOS 275
H. Pesme, P. Le Bot & P. Meyer
Operators’ response time estimation for a critical task using the fuzzy logic theory 281
M. Konstandinidou, Z. Nivolianitou, G. Simos, C. Kiranoudis & N. Markatos
The concept of organizational supportiveness 291
J. Nicholls, J. Harvey & G. Erdos
The influence of personal variables on changes in driver behaviour 299
S. Heslop, J. Harvey, N. Thorpe & C. Mulley
The key role of expert judgment in CO2 underground storage projects 305
C. Vivalda & L. Jammes

Integrated risk management and risk-informed decision-making


All-hazards risk framework—An architecture model 315
S. Verga
Comparisons and discussion of different integrated risk approaches 323
R. Steen & T. Aven
Management of risk caused by domino effect resulting from design
system dysfunctions 331
S. Sperandio, V. Robin & Ph. Girard
On some aspects related to the use of integrated risk analyses for the decision
making process, including its use in the non-nuclear applications 341
D. Serbanescu, A.L. Vetere Arellano & A. Colli
On the usage of weather derivatives in Austria—An empirical study 351
M. Bank & R. Wiesner

VII
Precaution in practice? The case of nanomaterial industry 361
H. Kastenholz, A. Helland & M. Siegrist
Risk based maintenance prioritisation 365
G. Birkeland, S. Eisinger & T. Aven
Shifts in environmental health risk governance: An analytical framework 369
H.A.C. Runhaar, J.P. van der Sluijs & P.P.J. Driessen
What does ‘‘safety margin’’ really mean? 379
J. Hortal, R. Mendizábal & F. Pelayo

Legislative dimensions of risk management


Accidents, risk analysis and safety management—Different perspective at a
Swedish safety authority 391
O. Harrami, M. Strömgren, U. Postgård & R. All
Evaluation of risk and safety issues at the Swedish Rescue Services Agency 399
O. Harrami, U. Postgård & M. Strömgren
Regulation of information security and the impact on top management commitment—
A comparative study of the electric power supply sector and the finance sector 407
J.M. Hagen & E. Albrechtsen
The unintended consequences of risk regulation 415
B.H. MacGillivray, R.E. Alcock & J.S. Busby

Maintenance modelling and optimisation


A hybrid age-based maintenance policy for heterogeneous items 423
P.A. Scarf, C.A.V. Cavalcante, R.W. Dwight & P. Gordon
A stochastic process model for computing the cost of a condition-based maintenance plan 431
J.A.M. van der Weide, M.D. Pandey & J.M. van Noortwijk
A study about influence of uncertain distribution inputs in maintenance optimization 441
R. Mullor, S. Martorell, A. Sánchez & N. Martinez-Alzamora
Aging processes as a primary aspect of predicting reliability and life of aeronautical hardware 449
J. Żurek, M. Zieja, G. Kowalczyk & T. Niezgoda
An alternative imperfect preventive maintenance model 455
J. Clavareau & P.E. Labeau
An imperfect preventive maintenance model with dependent failure modes 463
I.T. Castro
Condition-based maintenance approaches for deteriorating system influenced
by environmental conditions 469
E. Deloux, B. Castanier & C. Bérenguer
Condition-based maintenance by particle filtering 477
F. Cadini, E. Zio & D. Avram
Corrective maintenance for aging air conditioning systems 483
I. Frenkel, L. Khvatskin & A. Lisnianski
Exact reliability quantification of highly reliable systems with maintenance 489
R. Briš
Genetic algorithm optimization of preventive maintenance scheduling for repairable
systems modeled by generalized renewal process 497
P.A.A. Garcia, M.C. Sant’Ana, V.C. Damaso & P.F. Frutuoso e Melo

VIII
Maintenance modelling integrating human and material resources 505
S. Martorell, M. Villamizar, A. Sánchez & G. Clemente
Modelling competing risks and opportunistic maintenance with expert judgement 515
T. Bedford & B.M. Alkali
Modelling different types of failure and residual life estimation for condition-based maintenance 523
M.J. Carr & W. Wang
Multi-component systems modeling for quantifying complex maintenance strategies 531
V. Zille, C. Bérenguer, A. Grall, A. Despujols & J. Lonchampt
Multiobjective optimization of redundancy allocation in systems with imperfect repairs via
ant colony and discrete event simulation 541
I.D. Lins & E. López Droguett
Non-homogeneous Markov reward model for aging multi-state system under corrective
maintenance 551
A. Lisnianski & I. Frenkel
On the modeling of ageing using Weibull models: Case studies 559
P. Praks, H. Fernandez Bacarizo & P.-E. Labeau
On-line condition-based maintenance for systems with several modes of degradation 567
A. Ponchet, M. Fouladirad & A. Grall
Opportunity-based age replacement for a system under two types of failures 575
F.G. Badía & M.D. Berrade
Optimal inspection intervals for maintainable equipment 581
O. Hryniewicz
Optimal periodic inspection of series systems with revealed and unrevealed failures 587
M. Carvalho, E. Nunes & J. Telhada
Optimal periodic inspection/replacement policy for deteriorating systems with explanatory
variables 593
X. Zhao, M. Fouladirad, C. Bérenguer & L. Bordes
Optimal replacement policy for components with general failure rates submitted to obsolescence 603
S. Mercier
Optimization of the maintenance function at a company 611
S. Adjabi, K. Adel-Aissanou & M. Azi
Planning and scheduling maintenance resources in a complex system 619
M. Newby & C. Barker
Preventive maintenance planning using prior expert knowledge and multicriteria method
PROMETHEE III 627
F.A. Figueiredo, C.A.V. Cavalcante & A.T. de Almeida
Profitability assessment of outsourcing maintenance from the producer (big rotary machine study) 635
P. Fuchs & J. Zajicek
Simulated annealing method for the selective maintenance optimization of multi-mission
series-parallel systems 641
A. Khatab, D. Ait-Kadi & A. Artiba
Study on the availability of a k-out-of-N System given limited spares under (m, NG)
maintenance policy 649
T. Zhang, H.T. Lei & B. Guo
System value trajectories, maintenance, and its present value 659
K.B. Marais & J.H. Saleh

IX
The maintenance management framework: A practical view to maintenance management 669
A. Crespo Márquez, P. Moreu de León, J.F. Gómez Fernández, C. Parra Márquez & V. González
Workplace occupation and equipment availability and utilization, in the context of maintenance
float systems 675
I.S. Lopes, A.F. Leitão & G.A.B. Pereira

Monte Carlo methods in system safety and reliability


Availability and reliability assessment of industrial complex systems: A practical view
applied on a bioethanol plant simulation 687
V. González, C. Parra, J.F. Gómez, A. Crespo & P. Moreu de León
Handling dependencies between variables with imprecise probabilistic models 697
S. Destercke & E. Chojnacki
Monte Carlo simulation for investigating the influence of maintenance strategies on the production
availability of offshore installations 703
K.P. Chang, D. Chang, T.J. Rhee & E. Zio
Reliability analysis of discrete multi-state systems by means of subset simulation 709
E. Zio & N. Pedroni
The application of Bayesian interpolation in Monte Carlo simulations 717
M. Rajabalinejad, P.H.A.J.M. van Gelder & N. van Erp

Occupational safety
Application of virtual reality technologies to improve occupational & industrial safety
in industrial processes 727
J. Rubio, B. Rubio, C. Vaquero, N. Galarza, A. Pelaz, J.L. Ipiña, D. Sagasti & L. Jordá
Applying the resilience concept in practice: A case study from the oil and gas industry 733
L. Hansson, I. Andrade Herrera, T. Kongsvik & G. Solberg
Development of an assessment tool to facilitate OHS management based upon the safe
place, safe person, safe systems framework 739
A.-M. Makin & C. Winder
Exploring knowledge translation in occupational health using the mental models approach:
A case study of machine shops 749
A.-M. Nicol & A.C. Hurrell
Mathematical modelling of risk factors concerning work-related traffic accidents 757
C. Santamaría, G. Rubio, B. García & E. Navarro
New performance indicators for the health and safety domain: A benchmarking use perspective 761
H.V. Neto, P.M. Arezes & S.D. Sousa
Occupational risk management for fall from height 767
O.N. Aneziris, M. Konstandinidou, I.A. Papazoglou, M. Mud, M. Damen, J. Kuiper, H. Baksteen,
L.J. Bellamy, J.G. Post & J. Oh
Occupational risk management for vapour/gas explosions 777
I.A. Papazoglou, O.N. Aneziris, M. Konstandinidou, M. Mud, M. Damen, J. Kuiper, A. Bloemhoff,
H. Baksteen, L.J. Bellamy, J.G. Post & J. Oh
Occupational risk of an aluminium industry 787
O.N. Aneziris, I.A. Papazoglou & O. Doudakmani
Risk regulation bureaucracies in EU accession states: Drinking water safety in Estonia 797
K. Kangur

X
Organization learning
Can organisational learning improve safety and resilience during changes? 805
S.O. Johnsen & S. Håbrekke
Consequence analysis as organizational development 813
B. Moltu, A. Jarl Ringstad & G. Guttormsen
Integrated operations and leadership—How virtual cooperation influences leadership practice 821
K. Skarholt, P. Næsje, V. Hepsø & A.S. Bye
Outsourcing maintenance in services providers 829
J.F. Gómez, C. Parra, V. González, A. Crespo & P. Moreu de León
Revising rules and reviving knowledge in the Norwegian railway system 839
H.C. Blakstad, R. Rosness & J. Hovden

Risk Management in systems: Learning to recognize and respond to weak signals 847
E. Guillaume
Author index 853

VOLUME 2

Reliability and safety data collection and analysis


A new step-stress Accelerated Life Testing approach: Step-Down-Stress 863
C. Zhang, Y. Wang, X. Chen & Y. Jiang
Application of a generalized lognormal distribution to engineering data fitting 869
J. Martín & C.J. Pérez
Collection and analysis of reliability data over the whole product lifetime of vehicles 875
T. Leopold & B. Bertsche
Comparison of phase-type distributions with mixed and additive Weibull models 881
M.C. Segovia & C. Guedes Soares
Evaluation methodology of industry equipment functional reliability 891
J. Kamenický
Evaluation of device reliability based on accelerated tests 899
E. Nogueira Díaz, M. Vázquez López & D. Rodríguez Cano
Evaluation, analysis and synthesis of multiple source information: An application to nuclear
computer codes 905
S. Destercke & E. Chojnacki
Improving reliability using new processes and methods 913
S.J. Park, S.D. Park & K.T. Jo
Life test applied to Brazilian friction-resistant low alloy-high strength steel rails 919
D.I. De Souza, A. Naked Haddad & D. Rocha Fonseca
Non-homogeneous Poisson Process (NHPP), stochastic model applied to evaluate the economic
impact of the failure in the Life Cycle Cost Analysis (LCCA) 929
C. Parra Márquez, A. Crespo Márquez, P. Moreu de León, J. Gómez Fernández & V. González Díaz
Risk trends, indicators and learning rates: A new case study of North sea oil and gas 941
R.B. Duffey & A.B. Skjerve
Robust estimation for an imperfect test and repair model using Gaussian mixtures 949
S.P. Wilson & S. Goyal

XI
Risk and evidence based policy making
Environmental reliability as a requirement for defining environmental impact limits
in critical areas 957
E. Calixto & E. Lèbre La Rovere
Hazardous aid? The crowding-out effect of international charity 965
P.A. Raschky & M. Schwindt
Individual risk-taking and external effects—An empirical examination 973
S. Borsky & P.A. Raschky
Licensing a Biofuel plant transforming animal fats 981
J.-F. David
Modelling incident escalation in explosives storage 987
G. Hardman, T. Bedford, J. Quigley & L. Walls
The measurement and management of Deca-BDE—Why the continued certainty of uncertainty? 993
R.E. Alcock, B.H. McGillivray & J.S. Busby

Risk and hazard analysis


A contribution to accelerated testing implementation 1001
S. Hossein Mohammadian M., D. Aït-Kadi, A. Coulibaly & B. Mutel
A decomposition method to analyze complex fault trees 1009
S. Contini
A quantitative methodology for risk assessment of explosive atmospheres according to the
ATEX directive 1019
R. Lisi, M.F. Milazzo & G. Maschio
A risk theory based on hyperbolic parallel curves and risk assessment in time 1027
N. Popoviciu & F. Baicu
Accident occurrence evaluation of phased-mission systems composed of components
with multiple failure modes 1035
T. Kohda
Added value in fault tree analyses 1041
T. Norberg, L. Rosén & A. Lindhe
Alarm prioritization at plant design stage—A simplified approach 1049
P. Barbarini, G. Franzoni & E. Kulot
Analysis of possibilities of timing dependencies modeling—Example of logistic support system 1055
J. Magott, T. Nowakowski, P. Skrobanek & S. Werbinska
Applications of supply process reliability model 1065
A. Jodejko & T. Nowakowski
Applying optimization criteria to risk analysis 1073
H. Medina, J. Arnaldos & J. Casal
Chemical risk assessment for inspection teams during CTBT on-site inspections of sites
potentially contaminated with industrial chemicals 1081
G. Malich & C. Winder
Comparison of different methodologies to estimate the evacuation radius in the case
of a toxic release 1089
M.I. Montoya & E. Planas
Conceptualizing and managing risk networks. New insights for risk management 1097
R.W. Schröder

XII
Developments in fault tree techniques and importance measures 1103
J.K. Vaurio
Dutch registration of risk situations 1113
J.P. van’t Sant, H.J. Manuel & A. van den Berg
Experimental study of jet fires 1119
M. Gómez-Mares, A. Palacios, A. Peiretti, M. Muñoz & J. Casal
Failure mode and effect analysis algorithm for tunneling projects 1125
K. Rezaie, V. Ebrahimipour & S. Shokravi
Fuzzy FMEA: A study case on a discontinuous distillation plant 1129
S.S. Rivera & J.E. Núñez Mc Leod
Risk analysis in extreme environmental conditions for Aconcagua Mountain station 1135
J.E. Núñez Mc Leod & S.S. Rivera
Geographic information system for evaluation of technical condition and residual life of pipelines 1141
P. Yukhymets, R. Spitsa & S. Kobelsky
Inherent safety indices for the design of layout plans 1147
A. Tugnoli, V. Cozzani, F.I. Khan & P.R. Amyotte
Minmax defense strategy for multi-state systems 1157
G. Levitin & K. Hausken
Multicriteria risk assessment for risk ranking of natural gas pipelines 1165
A.J. de M. Brito, C.A.V. Cavalcante, R.J.P. Ferreira & A.T. de Almeida
New insight into PFDavg and PFH 1173
F. Innal, Y. Dutuit, A. Rauzy & J.-P. Signoret
On causes and dependencies of errors in human and organizational barriers against major
accidents 1181
J.E. Vinnem
Quantitative risk analysis method for warehouses with packaged hazardous materials 1191
D. Riedstra, G.M.H. Laheij & A.A.C. van Vliet
Ranking the attractiveness of industrial plants to external acts of interference 1199
M. Sabatini, S. Zanelli, S. Ganapini, S. Bonvicini & V. Cozzani
Review and discussion of uncertainty taxonomies used in risk analysis 1207
T.E. Nøkland & T. Aven
Risk analysis in the frame of the ATEX Directive and the preparation of an Explosion Protection
Document 1217
A. Pey, G. Suter, M. Glor, P. Lerena & J. Campos
Risk reduction by use of a buffer zone 1223
S.I. Wijnant-Timmerman & T. Wiersma
Safety in engineering practice 1231
Z. Smalko & J. Szpytko
Why ISO 13702 and NFPA 15 standards may lead to unsafe design 1239
S. Medonos & R. Raman

Risk control in complex environments


Is there an optimal type for high reliability organization? A study of the UK offshore industry 1251
J.S. Busby, A. Collins & R. Miles
The optimization of system safety: Rationality, insurance, and optimal protection 1259
R.B. Jongejan & J.K. Vrijling

XIII
Thermal characteristic analysis of Y type zeolite by differential scanning calorimetry 1267
S.H. Wu, W.P. Weng, C.C. Hsieh & C.M. Shu
Using network methodology to define emergency response team location: The Brazilian
refinery case study 1273
E. Calixto, E. Lèbre La Rovere & J. Eustáquio Beraldo

Risk perception and communication


(Mis-)conceptions of safety principles 1283
J.-T. Gayen & H. Schäbe
Climate change in the British press: The role of the visual 1293
N.W. Smith & H. Joffe
Do the people exposed to a technological risk always want more information about it?
Some observations on cases of rejection 1301
J. Espluga, J. Farré, J. Gonzalo, T. Horlick-Jones, A. Prades, C. Oltra & J. Navajas
Media coverage, imaginary of risks and technological organizations 1309
F. Fodor & G. Deleuze
Media disaster coverage over time: Methodological issues and results 1317
M. Kuttschreuter & J.M. Gutteling
Risk amplification and zoonosis 1325
D.G. Duckett & J.S. Busby
Risk communication and addressing uncertainties in risk assessments—Presentation of a framework 1335
J. Stian Østrem, H. Thevik, R. Flage & T. Aven
Risk communication for industrial plants and radioactive waste repositories 1341
F. Gugliermetti & G. Guidi
Risk management measurement methodology: Practical procedures and approaches for risk
assessment and prediction 1351
R.B. Duffey & J.W. Saull
Risk perception and cultural theory: Criticism and methodological orientation 1357
C. Kermisch & P.-E. Labeau
Standing in the shoes of hazard managers: An experiment on avalanche risk perception 1365
C.M. Rheinberger
The social perception of nuclear fusion: Investigating lay understanding and reasoning about
the technology 1371
A. Prades, C. Oltra, J. Navajas, T. Horlick-Jones & J. Espluga

Safety culture
‘‘Us’’ and ‘‘Them’’: The impact of group identity on safety critical behaviour 1377
R.J. Bye, S. Antonsen & K.M. Vikland
Does change challenge safety? Complexity in the civil aviation transport system 1385
S. Høyland & K. Aase
Electromagnetic fields in the industrial environment 1395
J. Fernández, A. Quijano, M.L. Soriano & V. Fuster
Electrostatic charges in industrial environments 1401
P. LLovera, A. Quijano, A. Soria & V. Fuster
Empowering operations and maintenance: Safe operations with the ‘‘one directed team’’
organizational model at the Kristin asset 1407
P. Næsje, K. Skarholt, V. Hepsø & A.S. Bye

XIV
Leadership and safety climate in the construction industry 1415
J.L. Meliá, M. Becerril, S.A. Silva & K. Mearns
Local management and its impact on safety culture and safety within Norwegian shipping 1423
H.A. Oltedal & O.A. Engen
Quantitative analysis of the anatomy and effectiveness of occupational safety culture 1431
P. Trucco, M. De Ambroggi & O. Grande
Safety management and safety culture assessment in Germany 1439
H.P. Berg
The potential for error in communications between engineering designers 1447
J. Harvey, R. Jamieson & K. Pearce

Safety management systems


Designing the safety policy of IT system on the example of a chosen company 1455
I.J. Jóźwiak, T. Nowakowski & K. Jóźwiak
Determining and verifying the safety integrity level of the control and protection systems
under uncertainty 1463
T. Barnert, K.T. Kosmowski & M. Sliwinski
Drawing up and running a Security Plan in an SME type company—An easy task? 1473
M. Gerbec
Efficient safety management for subcontractor at construction sites 1481
K.S. Son & J.J. Park
Production assurance and reliability management—A new international standard 1489
H. Kortner, K.-E. Haugen & L. Sunde
Risk management model for industrial plants maintenance 1495
N. Napolitano, M. De Minicis, G. Di Gravio & M. Tronci
Some safety aspects on multi-agent and CBTC implementation for subway control systems 1503
F.M. Rachel & P.S. Cugnasca

Software reliability
Assessment of software reliability and the efficiency of corrective actions during the software
development process 1513
R. Savić
ERTMS, deals on wheels? An inquiry into a major railway project 1519
J.A. Stoop, J.H. Baggen, J.M. Vleugel & J.L.M. Vrancken
Guaranteed resource availability in a website 1525
V.P. Koutras & A.N. Platis
Reliability oriented electronic design automation tool 1533
J. Marcos, D. Bóveda, A. Fernández & E. Soto
Reliable software for partitionable networked environments—An experience report 1539
S. Beyer, J.C. García Ortiz, F.D. Muñoz-Escoí, P. Galdámez, L. Froihofer,
K.M. Goeschka & J. Osrael
SysML aided functional safety assessment 1547
M. Larisch, A. Hänle, U. Siebold & I. Häring
UML safety requirement specification and verification 1555
A. Hänle & I. Häring

XV
Stakeholder and public involvement in risk governance
Assessment and monitoring of reliability and robustness of offshore wind energy converters 1567
S. Thöns, M.H. Faber, W. Rücker & R. Rohrmann
Building resilience to natural hazards. Practices and policies on governance and mitigation
in the central region of Portugal 1577
J.M. Mendes & A.T. Tavares
Governance of flood risks in The Netherlands: Interdisciplinary research into the role and
meaning of risk perception 1585
M.S. de Wit, H. van der Most, J.M. Gutteling & M. Bočkarjova
Public intervention for better governance—Does it matter? A study of the ‘‘Leros Strength’’ case 1595
P.H. Lindøe & J.E. Karlsen
Reasoning about safety management policy in everyday terms 1601
T. Horlick-Jones
Using stakeholders’ expertise in EMF and soil contamination to improve the management
of public policies dealing with modern risk: When uncertainty is on the agenda 1609
C. Fallon, G. Joris & C. Zwetkoff

Structural reliability and design codes


Adaptive discretization of 1D homogeneous random fields 1621
D.L. Allaix, V.I. Carbone & G. Mancini
Comparison of methods for estimation of concrete strength 1629
M. Holicky, K. Jung & M. Sykora
Design of structures for accidental design situations 1635
J. Marková & K. Jung
Developing fragility function for a timber structure subjected to fire 1641
E.R. Vaidogas, Virm. Juocevičius & Virg. Juocevičius
Estimations in the random fatigue-limit model 1651
C.-H. Chiu & W.-T. Huang
Limitations of the Weibull distribution related to predicting the probability of failure
initiated by flaws 1655
M.T. Todinov
Simulation techniques of non-gaussian random loadings in structural reliability analysis 1663
Y. Jiang, C. Zhang, X. Chen & J. Tao
Special features of the collection and analysis of snow loads 1671
Z. Sadovský, P. Faško, K. Mikulová, J. Pecho & M. Vojtek
Structural safety under extreme construction loads 1677
V. Juocevičius & A. Kudzys
The modeling of time-dependent reliability of deteriorating structures 1685
A. Kudzys & O. Lukoševiciene
Author index 1695

VOLUME 3

System reliability analysis


A copula-based approach for dependability analyses of fault-tolerant systems with
interdependent basic events 1705
M. Walter, S. Esch & P. Limbourg

XVI
A depth first search algorithm for optimal arrangements in a circular
consecutive-k-out-of-n:F system 1715
K. Shingyochi & H. Yamamoto
A joint reliability-redundancy optimization approach for multi-state series-parallel systems 1723
Z. Tian, G. Levitin & M.J. Zuo
A new approach to assess the reliability of a multi-state system with dependent components 1731
M. Samrout & E. Chatelet
A reliability analysis and decision making process for autonomous systems 1739
R. Remenyte-Prescott, J.D. Andrews, P.W.H. Chung & C.G. Downes
Advanced discrete event simulation methods with application to importance measure
estimation 1747
A.B. Huseby, K.A. Eide, S.L. Isaksen, B. Natvig & J. Gåsemyr
Algorithmic and computational analysis of a multi-component complex system 1755
J.E. Ruiz-Castro, R. Pérez-Ocón & G. Fernández-Villodre
An efficient reliability computation of generalized multi-state k-out-of-n systems 1763
S.V. Amari
Application of the fault tree analysis for assessment of the power system reliability 1771
A. Volkanovski, M. Čepin & B. Mavko
BDMP (Boolean logic driven Markov processes) as an alternative to event trees 1779
M. Bouissou
Bivariate distribution based passive system performance assessment 1787
L. Burgazzi
Calculating steady state reliability indices of multi-state systems using dual number algebra 1795
E. Korczak
Concordance analysis of importance measure 1803
C.M. Rocco S.
Contribution to availability assessment of systems with one shot items 1807
D. Valis & M. Koucky
Contribution to modeling of complex weapon systems reliability 1813
D. Valis, Z. Vintr & M. Koucky
Delayed system reliability and uncertainty analysis 1819
R. Alzbutas, V. Janilionis & J. Rimas
Efficient generation and representation of failure lists out of an information flux model
for modeling safety critical systems 1829
M. Pock, H. Belhadaoui, O. Malassé & M. Walter
Evaluating algorithms for the system state distribution of multi-state k-out-of-n:F system 1839
T. Akiba, H. Yamamoto, T. Yamaguchi, K. Shingyochi & Y. Tsujimura
First-passage time analysis for Markovian deteriorating model 1847
G. Dohnal
Model of logistic support system with time dependency 1851
S. Werbinska
Modeling failure cascades in network systems due to distributed random disturbances 1861
E. Zio & G. Sansavini
Modeling of the changes of graphite bore in RBMK-1500 type nuclear reactor 1867
I. Žutautaite-Šeputiene, J. Augutis & E. Ušpuras

XVII
Modelling multi-platform phased mission system reliability 1873
D.R. Prescott, J.D. Andrews & C.G. Downes
Modelling test strategies effects on the probability of failure on demand for safety
instrumented systems 1881
A.C. Torres-Echeverria, S. Martorell & H.A. Thompson
New insight into measures of component importance in production systems 1891
S.L. Isaksen
New virtual age models for bathtub shaped failure intensities 1901
Y. Dijoux & E. Idée
On some approaches to defining virtual age of non-repairable objects 1909
M.S. Finkelstein
On the application and extension of system signatures in engineering reliability 1915
J. Navarro, F.J. Samaniego, N. Balakrishnan & D. Bhattacharya
PFD of higher-order configurations of SIS with partial stroke testing capability 1919
L.F.S. Oliveira
Power quality as accompanying factor in reliability research of electric engines 1929
I.J. Jóźwiak, K. Kujawski & T. Nowakowski
RAMS and performance analysis 1937
X. Quayzin, E. Arbaretier, Z. Brik & A. Rauzy
Reliability evaluation of complex system based on equivalent fault tree 1943
Z. Yufang, Y. Hong & L. Jun
Reliability evaluation of III-V Concentrator solar cells 1949
N. Núñez, J.R. González, M. Vázquez, C. Algora & I. Rey-Stolle
Reliability of a degrading system under inspections 1955
D. Montoro-Cazorla, R. Pérez-Ocón & M.C. Segovia
Reliability prediction using Petri nets for on-demand safety systems with fault detection 1961
A.V. Kleyner & V. Volovoi
Reliability, availability and cost analysis of large multi-state systems with ageing components 1969
K. Kolowrocki
Reliability, availability and risk evaluation of technical systems in variable operation conditions 1985
K. Kolowrocki & J. Soszynska
Representation and estimation of multi-state system reliability by decision diagrams 1995
E. Zaitseva & S. Puuronen
Safety instrumented system reliability evaluation with influencing factors 2003
F. Brissaud, D. Charpentier, M. Fouladirad, A. Barros & C. Bérenguer
Smooth estimation of the availability function of a repairable system 2013
M.L. Gámiz & Y. Román
System design optimisation involving phased missions 2021
D. Astapenko & L.M. Bartlett
The Natvig measures of component importance in repairable systems applied to an offshore
oil and gas production system 2029
B. Natvig, K.A. Eide, J. Gåsemyr, A.B. Huseby & S.L. Isaksen
The operation quality assessment as an initial part of reliability improvement and low cost
automation of the system 2037
L. Muslewski, M. Woropay & G. Hoppe

Three-state modelling of dependent component failures with domino effects 2045
U.K. Rakowsky
Variable ordering techniques for the application of Binary Decision Diagrams on PSA
linked Fault Tree models 2051
C. Ibáñez-Llano, A. Rauzy, E. Meléndez & F. Nieto
Weaknesses of classic availability calculations for interlinked production systems
and their overcoming 2061
D. Achermann

Uncertainty and sensitivity analysis


A critique of Info-Gap’s robustness model 2071
M. Sniedovich
Alternative representations of uncertainty in system reliability and risk analysis—Review
and discussion 2081
R. Flage, T. Aven & E. Zio
Dependence modelling with copula in probabilistic studies, a practical approach
based on numerical experiments 2093
A. Dutfoy & R. Lebrun
Event tree uncertainty analysis by Monte Carlo and possibility theory 2101
P. Baraldi & E. Zio
Global sensitivity analysis based on entropy 2107
B. Auder & B. Iooss
Impact of uncertainty affecting reliability models on warranty contracts 2117
G. Fleurquin, P. Dehombreux & P. Dersin
Influence of epistemic uncertainties on the probabilistic assessment of an emergency operating
procedure in a nuclear power plant 2125
M. Kloos & J. Peschke
Numerical study of algorithms for metamodel construction and validation 2135
B. Iooss, L. Boussouf, A. Marrel & V. Feuillard
On the variance upper bound theorem and its applications 2143
M.T. Todinov
Reliability assessment under uncertainty using Dempster-Shafer and vague set theories 2151
S. Pashazadeh & N. Grachorloo
Types and sources of uncertainties in environmental accidental risk assessment: A case study
for a chemical factory in the Alpine region of Slovenia 2157
M. Gerbec & B. Kontic
Uncertainty estimation for monotone and binary systems 2167
A.P. Ulmeanu & N. Limnios

Industrial and service sectors


Aeronautics and aerospace
Condition based operational risk assessment for improved aircraft operability 2175
A. Arnaiz, M. Buderath & S. Ferreiro
Is optimized design of satellites possible? 2185
J. Faure, R. Laulheret & A. Cabarbaye

Model of air traffic in terminal area for ATFM safety analysis 2191
J. Skorupski & A.W. Stelmach
Predicting airport runway conditions based on weather data 2199
A.B. Huseby & M. Rabbe
Safety considerations in complex airborne systems 2207
M.J.R. Lemes & J.B. Camargo Jr
The Preliminary Risk Analysis approach: Merging space and aeronautics methods 2217
J. Faure, R. Laulheret & A. Cabarbaye
Using a Causal model for Air Transport Safety (CATS) for the evaluation of alternatives 2223
B.J.M. Ale, L.J. Bellamy, R.P. van der Boom, J. Cooper, R.M. Cooke, D. Kurowicka, P.H. Lin,
O. Morales, A.L.C. Roelen & J. Spouge

Automotive engineering
An approach to describe interactions in and between mechatronic systems 2233
J. Gäng & B. Bertsche
Influence of the mileage distribution on reliability prognosis models 2239
A. Braasch, D. Althaus & A. Meyna
Reliability prediction for automotive components using Real-Parameter Genetic Algorithm 2245
J. Hauschild, A. Kazeminia & A. Braasch
Stochastic modeling and prediction of catalytic converters degradation 2251
S. Barone, M. Giorgio, M. Guida & G. Pulcini
Towards a better interaction between design and dependability analysis: FMEA derived from
UML/SysML models 2259
P. David, V. Idasiak & F. Kratz

Biotechnology and food industry


Application of tertiary mathematical models for evaluating the presence of staphylococcal
enterotoxin in lactic acid cheese 2269
I. Steinka & A. Blokus-Roszkowska
Assessment of the risk to company revenue due to deviations in honey quality 2275
E. Doménech, I. Escriche & S. Martorell
Attitudes of Japanese and Hawaiian toward labeling genetically modified fruits 2285
S. Shehata
Ensuring honey quality by means of effective pasteurization 2289
E. Doménech, I. Escriche & S. Martorell
Exposure assessment model to combine thermal inactivation (log reduction) and thermal injury
(heat-treated spore lag time) effects on non-proteolytic Clostridium botulinum 2295
J.-M. Membré, E. Wemmenhove & P. McClure
Public information requirements on health risk of mercury in fish (1): Perception
and knowledge of the public about food safety and the risk of mercury 2305
M. Kosugi & H. Kubota
Public information requirements on health risks of mercury in fish (2): A comparison of mental
models of experts and public in Japan 2311
H. Kubota & M. Kosugi
Review of diffusion models for the social amplification of risk of food-borne zoonoses 2317
J.P. Mehers, H.E. Clough & R.M. Christley

Risk perception and communication of food safety and food technologies in Flanders,
The Netherlands, and the United Kingdom 2325
U. Maris
Synthesis of reliable digital microfluidic biochips using Monte Carlo simulation 2333
E. Maftei, P. Pop & F. Popenţiu Vlădicescu

Chemical process industry


Accidental scenarios in the loss of control of chemical processes: Screening the impact
profile of secondary substances 2345
M. Cordella, A. Tugnoli, P. Morra, V. Cozzani & F. Barontini
Adapting the EU Seveso II Directive for GHS: Initial UK study on acute toxicity to people 2353
M.T. Trainor, A.J. Wilday, M. Moonis, A.L. Rowbotham, S.J. Fraser, J.L. Saw & D.A. Bosworth
An advanced model for spreading and evaporation of accidentally released hazardous
liquids on land 2363
I.J.M. Trijssenaar-Buhre, R.P. Sterkenburg & S.I. Wijnant-Timmerman
Influence of safety systems on land use planning around Seveso sites; example of measures
chosen for a fertiliser company located close to a village 2369
C. Fiévez, C. Delvosalle, N. Cornil, L. Servranckx, F. Tambour, B. Yannart & F. Benjelloun
Performance evaluation of manufacturing systems based on dependability management
indicators—Case study: Chemical industry 2379
K. Rezaie, M. Dehghanbaghi & V. Ebrahimipour
Protection of chemical industrial installations from intentional adversary acts: Comparison
of the new security challenges with the existing safety practices in Europe 2389
M.D. Christou
Quantitative assessment of domino effect in an extended industrial area 2397
G. Antonioni, G. Spadoni, V. Cozzani, C. Dondi & D. Egidi
Reaction hazard of cumene hydroperoxide with sodium hydroxide by isothermal calorimetry 2405
Y.-P. Chou, S.-H. Wu, C.-M. Shu & H.-Y. Hou
Reliability study of shutdown process through the analysis of decision making in chemical plants.
Case study: South America, Spain and Portugal 2409
L. Amendola, M.A. Artacho & T. Depool
Study of the application of risk management practices in shutdown chemical process 2415
L. Amendola, M.A. Artacho & T. Depool
Thirty years after the first HAZOP guideline publication. Considerations 2421
J. Dunjó, J.A. Vílchez & J. Arnaldos

Civil engineering
Decision tools for risk management support in construction industry 2431
S. Mehicic Eberhardt, S. Moeller, M. Missler-Behr & W. Kalusche
Definition of safety and the existence of ‘‘optimal safety’’ 2441
D. Proske
Failure risk analysis in Water Supply Networks 2447
A. Carrión, A. Debón, E. Cabrera, M.L. Gamiz & H. Solano
Hurricane vulnerability of multi-story residential buildings in Florida 2453
G.L. Pita, J.-P. Pinelli, C.S. Subramanian, K. Gurley & S. Hamid
Risk management system in water-pipe network functioning 2463
B. Tchórzewska-Cieślak

Use of extreme value theory in engineering design 2473
E. Castillo, C. Castillo & R. Mínguez

Critical infrastructures
A model for vulnerability analysis of interdependent infrastructure networks 2491
J. Johansson & H. Jönsson
Exploiting stochastic indicators of interdependent infrastructures: The service availability of
interconnected networks 2501
G. Bonanni, E. Ciancamerla, M. Minichino, R. Clemente, A. Iacomini, A. Scarlatti,
E. Zendri & R. Terruggia
Proactive risk assessment of critical infrastructures 2511
T. Uusitalo, R. Koivisto & W. Schmitz
Seismic assessment of utility systems: Application to water, electric power and transportation
networks 2519
C. Nuti, A. Rasulo & I. Vanzi
Author index 2531

VOLUME 4

Electrical and electronic engineering


Balancing safety and availability for an electronic protection system 2541
S. Wagner, I. Eusgeld, W. Kröger & G. Guaglio
Evaluation of important reliability parameters using VHDL-RTL modelling and information
flow approach 2549
M. Jallouli, C. Diou, F. Monteiro, A. Dandache, H. Belhadaoui, O. Malassé, G. Buchheit,
J.F. Aubry & H. Medromi

Energy production and distribution


Application of Bayesian networks for risk assessment in electricity distribution system
maintenance management 2561
D.E. Nordgård & K. Sand
Incorporation of ageing effects into reliability model for power transmission network 2569
V. Matuzas & J. Augutis
Mathematical simulation of energy supply disturbances 2575
J. Augutis, R. Krikštolaitis, V. Matuzienė & S. Pečiulytė
Risk analysis of the electric power transmission grid 2581
L.M. Pedersen & H.H. Thorstad
Security of gas supply to a gas plant from cave storage using discrete-event simulation 2587
J.D. Amaral Netto, L.F.S. Oliveira & D. Faertes
SES RISK a new framework to support decisions on energy supply 2593
D. Serbanescu & A.L. Vetere Arellano
Specification of reliability benchmarks for offshore wind farms 2601
D. McMillan & G.W. Ault

Health and medicine


Bayesian statistical meta-analysis of epidemiological data for QRA 2609
I. Albert, E. Espié, A. Gallay, H. De Valk, E. Grenier & J.-B. Denis

Cyanotoxins and health risk assessment 2613
J. Kellner, F. Božek, J. Navrátil & J. Dvořák
The estimation of health effect risks based on different sampling intervals of meteorological data 2619
J. Jeong & S. Hoon Han

Information technology and telecommunications


A bi-objective model for routing and wavelength assignment in resilient WDM networks 2627
T. Gomes, J. Craveirinha, C. Simões & J. Clímaco
Formal reasoning regarding error propagation in multi-process software architectures 2635
F. Sætre & R. Winther
Implementation of risk and reliability analysis techniques in ICT 2641
R. Mock, E. Kollmann & E. Bünzli
Information security measures influencing user performance 2649
E. Albrechtsen & J.M. Hagen
Reliable network server assignment using an ant colony approach 2657
S. Kulturel-Konak & A. Konak
Risk and safety as system-theoretic concepts—A formal view on system-theory
by means of Petri nets 2665
J. Rudolf Müller & E. Schnieder

Insurance and finance


Behaviouristic approaches to insurance decisions in the context of natural hazards 2675
M. Bank & M. Gruber
Gaming tool as a method of natural disaster risk education: Educating the relationship
between risk and insurance 2685
T. Unagami, T. Motoyoshi & J. Takai
Reliability-based risk-metric computation for energy trading 2689
R. Mínguez, A.J. Conejo, R. García-Bertrand & E. Castillo

Manufacturing
A decision model for preventing knock-on risk inside industrial plant 2701
M. Grazia Gnoni, G. Lettera & P. Angelo Bragatto
Condition based maintenance optimization under cost and profit criteria for manufacturing
equipment 2707
A. Sánchez, A. Goti & V. Rodríguez
PRA-type study adapted to the multi-crystalline silicon photovoltaic cells manufacture
process 2715
A. Colli, D. Serbanescu & B.J.M. Ale

Mechanical engineering
Developing a new methodology for OHS assessment in small and medium enterprises 2727
C. Pantanali, A. Meneghetti, C. Bianco & M. Lirussi
Optimal Pre-control as a tool to monitor the reliability of a manufacturing system 2735
S. San Matías & V. Giner-Bosch
The respirable crystalline silica in the ceramic industries—Sampling, exposure
and toxicology 2743
E. Monfort, M.J. Ibáñez & A. Escrig

Natural hazards
A framework for the assessment of the industrial risk caused by floods 2749
M. Campedel, G. Antonioni, V. Cozzani & G. Di Baldassarre
A simple method of risk potential analysis for post-earthquake fires 2757
J.L. Su, C.C. Wu, K.S. Fan & J.R. Chen
Applying the SDMS model to manage natural disasters in Mexico 2765
J.R. Santos-Reyes & A.N. Beard
Decision making tools for natural hazard risk management—Examples from Switzerland 2773
M. Bründl, B. Krummenacher & H.M. Merz
How to motivate people to assume responsibility and act upon their own protection from flood
risk in The Netherlands if they think they are perfectly safe? 2781
M. Bočkarjova, A. van der Veen & P.A.T.M. Geurts
Integral risk management of natural hazards—A system analysis of operational application
to rapid mass movements 2789
N. Bischof, H. Romang & M. Bründl
Risk based approach for a long-term solution of coastal flood defences—A Vietnam case 2797
C. Mai Van, P.H.A.J.M. van Gelder & J.K. Vrijling
River system behaviour effects on flood risk 2807
T. Schweckendiek, A.C.W.M. Vrouwenvelder, M.C.L.M. van Mierlo, E.O.F. Calle & W.M.G. Courage
Valuation of flood risk in The Netherlands: Some preliminary results 2817
M. Bočkarjova, P. Rietveld & E.T. Verhoef

Nuclear engineering
An approach to integrate thermal-hydraulic and probabilistic analyses in addressing
safety margins estimation accounting for uncertainties 2827
S. Martorell, Y. Nebot, J.F. Villanueva, S. Carlos, V. Serradell, F. Pelayo & R. Mendizábal
Availability of alternative sources for heat removal in case of failure of the RHRS during
midloop conditions addressed in LPSA 2837
J.F. Villanueva, S. Carlos, S. Martorell, V. Serradell, F. Pelayo & R. Mendizábal
Complexity measures of emergency operating procedures: A comparison study with data
from a simulated computerized procedure experiment 2845
L.Q. Yu, Z.Z. Li, X.L. Dong & S. Xu
Distinction impossible!: Comparing risks between Radioactive Wastes Facilities and Nuclear
Power Stations 2851
S. Kim & S. Cho
Heat-up calculation to screen out the room cooling failure function from a PSA model 2861
M. Hwang, C. Yoon & J.-E. Yang
Investigating the material limits on social construction: Practical reasoning about nuclear
fusion and other technologies 2867
T. Horlick-Jones, A. Prades, C. Oltra, J. Navajas & J. Espluga
Neural networks and order statistics for quantifying nuclear power plants safety margins 2873
E. Zio, F. Di Maio, S. Martorell & Y. Nebot
Probabilistic safety assessment for other modes than power operation 2883
M. Čepin & R. Prosen
Probabilistic safety margins: Definition and calculation 2891
R. Mendizábal

Reliability assessment of the thermal hydraulic phenomena related to a CAREM-like
passive RHR System 2899
G. Lorenzo, P. Zanocco, M. Giménez, M. Marquès, B. Iooss, R. Bolado Lavín, F. Pierro,
G. Galassi, F. D’Auria & L. Burgazzi
Some insights from the observation of nuclear power plant operators’ management of simulated
abnormal situations 2909
M.C. Kim & J. Park
Vital area identification using fire PRA and RI-ISI results in UCN 4 nuclear power plant 2913
K.Y. Kim, Y. Choi & W.S. Jung

Offshore oil and gas


A new approach for follow-up of safety instrumented systems in the oil and gas industry 2921
S. Hauge & M.A. Lundteigen
Consequence based methodology to determine acceptable leakage rate through closed safety
critical valves 2929
W. Røed, K. Haver, H.S. Wiencke & T.E. Nøkland
FAMUS: Applying a new tool for integrating flow assurance and RAM analysis 2937
Ø. Grande, S. Eisinger & S.L. Isaksen
Fuzzy reliability analysis of corroded oil and gas pipes 2945
M. Singh & T. Markeset
Life cycle cost analysis in design of oil and gas production facilities to be used in harsh,
remote and sensitive environments 2955
D. Kayrbekova & T. Markeset
Line pack management for improved regularity in pipeline gas transportation networks 2963
L. Frimannslund & D. Haugland
Optimization of proof test policies for safety instrumented systems using multi-objective
genetic algorithms 2971
A.C. Torres-Echeverria, S. Martorell & H.A. Thompson
Paperwork, management, and safety: Towards a bureaucratization of working life
and a lack of hands-on supervision 2981
G.M. Lamvik, P.C. Næsje, K. Skarholt & H. Torvatn
Preliminary probabilistic study for risk management associated to casing long-term integrity
in the context of CO2 geological sequestration—Recommendations for cement plug geometry 2987
Y. Le Guen, O. Poupard, J.-B. Giraud & M. Loizzo
Risk images in integrated operations 2997
C.K. Tveiten & P.M. Schiefloe

Policy decisions
Dealing with nanotechnology: Do the boundaries matter? 3007
S. Brunet, P. Delvenne, C. Fallon & P. Gillon
Factors influencing the public acceptability of the LILW repository 3015
N. Železnik, M. Polič & D. Kos
Risk futures in Europe: Perspectives for future research and governance. Insights from a EU
funded project 3023
S. Menoni
Risk management strategies under climatic uncertainties 3031
U.S. Brandt

Safety representative and managers: Partners in health and safety? 3039
T. Kvernberg Andersen, H. Torvatn & U. Forseth
Stop in the name of safety—The right of the safety representative to halt dangerous work 3047
U. Forseth, H. Torvatn & T. Kvernberg Andersen
The VDI guideline on requirements for the qualification of reliability engineers—Curriculum
and certification process 3055
U.K. Rakowsky

Public planning
Analysing analyses—An approach to combining several risk and vulnerability analyses 3061
J. Borell & K. Eriksson
Land use planning methodology used in Walloon region (Belgium) for tank farms of gasoline
and diesel oil 3067
F. Tambour, N. Cornil, C. Delvosalle, C. Fiévez, L. Servranckx, B. Yannart & F. Benjelloun

Security and protection


‘‘Protection from half-criminal windows breakers to mass murderers with nuclear weapons’’:
Changes in the Norwegian authorities’ discourses on the terrorism threat 3077
S.H. Jore & O. Njå
A preliminary analysis of volcanic Na-Tech risks in the Vesuvius area 3085
E. Salzano & A. Basco
Are safety and security in industrial systems antagonistic or complementary issues? 3093
G. Deleuze, E. Chatelet, P. Laclemence, J. Piwowar & B. Affeltranger
Assessment of energy supply security indicators for Lithuania 3101
J. Augutis, R. Krikštolaitis, V. Matuzienė & S. Pečiulytė
Enforcing application security—Fixing vulnerabilities with aspect oriented programming 3109
J. Wang & J. Bigham
Governmental risk communication: Communication guidelines in the context of terrorism
as a new risk 3117
I. Stevens & G. Verleye
On combination of Safety Integrity Levels (SILs) according to IEC61508 merging rules 3125
Y. Langeron, A. Barros, A. Grall & C. Bérenguer
On the methods to model and analyze attack scenarios with Fault Trees 3135
G. Renda, S. Contini & G.G.M. Cojazzi
Risk management for terrorist actions using geoevents 3143
G. Maschio, M.F. Milazzo, G. Ancione & R. Lisi

Surface transportation (road and train)


A modelling approach to assess the effectiveness of BLEVE prevention measures on LPG tanks 3153
G. Landucci, M. Molag, J. Reinders & V. Cozzani
Availability assessment of ALSTOM’s safety-relevant trainborne odometry sub-system 3163
B.B. Stamenković & P. Dersin
Dynamic maintenance policies for civil infrastructure to minimize cost and manage safety risk 3171
T.G. Yeung & B. Castanier
FAI: Model of business intelligence for projects in metrorailway system 3177
A. Oliveira & J.R. Almeida Jr.

Impact of preventive grinding on maintenance costs and determination of an optimal grinding cycle 3183
C. Meier-Hirmer & Ph. Pouligny
Logistics of dangerous goods: A GLOBAL risk assessment approach 3191
C. Mazri, C. Deust, B. Nedelec, C. Bouissou, J.C. Lecoze & B. Debray
Optimal design of control systems using a dependability criteria and temporal sequences
evaluation—Application to a railroad transportation system 3199
J. Clarhaut, S. Hayat, B. Conrard & V. Cocquempot
RAM assurance programme carried out by the Swiss Federal Railways SA-NBS project 3209
B.B. Stamenković
RAMS specification for an urban transit Maglev system 3217
A. Raffetti, B. Faragona, E. Carfagna & F. Vaccaro
Safety analysis methodology application into two industrial cases: A new mechatronical system
and during the life cycle of a CAF’s high speed train 3223
O. Revilla, A. Arnaiz, L. Susperregui & U. Zubeldia
The ageing of signalling equipment and the impact on maintenance strategies 3231
M. Antoni, N. Zilber, F. Lejette & C. Meier-Hirmer
The development of semi-Markov transportation model 3237
Z. Mateusz & B. Tymoteusz
Valuation of operational architecture dependability using Safe-SADT formalism: Application
to a railway braking system 3245
D. Renaux, L. Cauffriez, M. Bayart & V. Benard

Waterborne transportation
A simulation based risk analysis study of maritime traffic in the Strait of Istanbul 3257
B. Özbaş, I. Or, T. Altiok & O.S. Ulusçu
Analysis of maritime accident data with BBN models 3265
P. Antão, C. Guedes Soares, O. Grande & P. Trucco
Collision risk analyses of waterborne transportation 3275
E. Vanem, R. Skjong & U. Langbecker
Complex model of navigational accident probability assessment based on real time
simulation and manoeuvring cycle concept 3285
L. Gucma
Design of the ship power plant with regard to the operator safety 3289
A. Podsiadlo & W. Tarelko
Human fatigue model at maritime transport 3295
L. Smolarek & J. Soliwoda
Modeling of hazards, consequences and risk for safety assessment of ships in damaged
conditions in operation 3303
M. Gerigk
Numerical and experimental study of a reliability measure for dynamic control of floating vessels 3311
B.J. Leira, P.I.B. Berntsen & O.M. Aamo
Reliability of overtaking maneuvers between ships in restricted area 3319
P. Lizakowski
Risk analysis of ports and harbors—Application of reliability engineering techniques 3323
B.B. Dutta & A.R. Kar

Subjective propulsion risk of a seagoing ship estimation 3331
A. Brandowski, W. Frackowiak, H. Nguyen & A. Podsiadlo
The analysis of SAR action effectiveness parameters with respect to drifting search area model 3337
Z. Smalko & Z. Burciu
The risk analysis of harbour operations 3343
T. Abramowicz-Gerigk
Author index 3351

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Preface

This Conference stems from a European initiative merging the ESRA (European Safety and Reliability
Association) and SRA-Europe (Society for Risk Analysis—Europe) annual conferences into the major safety,
reliability and risk analysis conference in Europe during 2008. This is the second joint ESREL (European Safety
and Reliability) and SRA-Europe Conference after the 2000 event held in Edinburgh, Scotland.
ESREL is an annual conference series promoted by the European Safety and Reliability Association. The
conference dates back to 1989, but was not referred to as an ESREL conference before 1992. The Conference
has become well established in the international community, attracting a good mix of academic and industry
participants who present and discuss subjects of interest and application across various industries in the fields of
Safety and Reliability.
The Society for Risk Analysis—Europe (SRA-E) was founded in 1987, as a section of SRA international
founded in 1981, to develop a special focus on risk related issues in Europe. SRA-E aims to bring together
individuals and organisations with an academic interest in risk assessment, risk management and risk
communication in Europe, and emphasises the European dimension in promoting interdisciplinary approaches
to risk analysis in science. The annual conferences take place in various countries in Europe in order to enhance the
access to SRA-E for both members and other interested parties. Recent conferences have been held in Stockholm,
Paris, Rotterdam, Lisbon, Berlin, Como, Ljubljana and The Hague.
These conferences come to Spain for the first time, and the venue is Valencia, situated on the east coast
on the Mediterranean Sea, a meeting point of many cultures. The host of the conference is the
Universidad Politécnica de Valencia.
This year the theme of the Conference is "Safety, Reliability and Risk Analysis. Theory, Methods and
Applications". The Conference covers a number of topics within safety, reliability and risk, and provides a
forum for presentation and discussion of scientific papers covering theory, methods and applications to a wide
range of sectors and problem areas. Special focus has been placed on strengthening the bonds between the safety,
reliability and risk analysis communities, with the aim of learning from the past to build the future.
The conferences have grown over time, and this year the programme of the joint Conference includes 416
papers by authors from all over the world. About 890 abstracts were originally submitted; after review of the
full papers by the Technical Programme Committee, 416 were selected for inclusion in these Proceedings.
The efforts of the authors and the reviewers guarantee the quality of the work. The initiative and planning
carried out by the Technical Area Coordinators have resulted in a number of interesting sessions covering
a broad spectrum of topics.
Sebastián Martorell
C. Guedes Soares
Julie Barnett
Editors


Organization

Conference Chairman
Dr. Sebastián Martorell Alsina Universidad Politécnica de Valencia, Spain

Conference Co-Chairman
Dr. Blás Galván González University of Las Palmas de Gran Canaria, Spain

Conference Technical Chairs


Prof. Carlos Guedes Soares Technical University of Lisbon—IST, Portugal
Dr. Julie Barnett University of Surrey, United Kingdom

Board of Institution Representatives


Prof. Gumersindo Verdú Vice-Rector for International Actions—
Universidad Politécnica de Valencia, Spain
Dr. Ioannis Papazoglou ESRA Chairman
Dr. Roberto Bubbico SRA-Europe Chairman

Technical Area Coordinators


Aven, Terje—Norway Leira, Bert—Norway
Bedford, Tim—United Kingdom Levitin, Gregory—Israel
Berenguer, Christophe—France Merad, Myriam—France
Bubbico, Roberto—Italy Palanque, Philippe—France
Cepin, Marco—Slovenia Papazoglou, Ioannis—Greece
Christou, Michalis—Italy Preyssl, Christian—The Netherlands
Colombo, Simone—Italy Rackwitz, Ruediger—Germany
Dien, Yves—France Rosqvist, Tony—Finland
Doménech, Eva—Spain Salvi, Olivier—Germany
Eisinger, Siegfried—Norway Skjong, Rolf—Norway
Enander, Ann—Sweden Spadoni, Gigliola—Italy
Felici, Massimo—United Kingdom Tarantola, Stefano—Italy
Finkelstein, Maxim—South Africa Thalmann, Andrea—Germany
Goossens, Louis—The Netherlands Thunem, Atoosa P-J—Norway
Hessami, Ali—United Kingdom Van Gelder, Pieter—The Netherlands
Johnson, Chris—United Kingdom Vrouwenvelder, Ton—The Netherlands
Kirchsteiger, Christian—Luxembourg Kröger, Wolfgang—Switzerland

Technical Programme Committee


Ale B, The Netherlands Badia G, Spain
Alemano A, Luxembourg Barros A, France
Amari S, United States Bartlett L, United Kingdom
Andersen H, Denmark Basnyat S, France
Aneziris O, Greece Birkeland G, Norway
Antao P, Portugal Bladh K, Sweden
Arnaiz A, Spain Boehm G, Norway

Bris R, Czech Republic Le Bot P, France
Bründl M, Switzerland Limbourg P, Germany
Burgherr P, Switzerland Lisnianski A, Israel
Bye R, Norway Lucas D, United Kingdom
Carlos S, Spain Luxhoj J, United States
Castanier B, France Ma T, United Kingdom
Castillo E, Spain Makin A, Australia
Cojazzi G, Italy Massaiu S, Norway
Contini S, Italy Mercier S, France
Cozzani V, Italy Navarre D, France
Cha J, Korea Navarro J, Spain
Chozos N, United Kingdom Nelson W, United States
De Wit S, The Netherlands Newby M, United Kingdom
Droguett E, Brazil Nikulin M, France
Drottz-Sjoberg B, Norway Nivolianitou Z, Greece
Dutuit Y, France Pérez-Ocón R, Spain
Escriche I, Spain Pesme H, France
Faber M, Switzerland Piero B, Italy
Fouladirad M, France Pierson J, France
Garbatov Y, Portugal Podofillini L, Italy
Ginestar D, Spain Proske D, Austria
Grall A, France Re A, Italy
Gucma L, Poland Revie M, United Kingdom
Hardman G, United Kingdom Rocco C, Venezuela
Harvey J, United Kingdom Rouhiainen V, Finland
Hokstad P, Norway Roussignol M, France
Holicky M, Czech Republic Sadovsky Z, Slovakia
Holloway M, United States Salzano E, Italy
Iooss B, France Sanchez A, Spain
Iung B, France Sanchez-Arcilla A, Spain
Jonkman B, The Netherlands Scarf P, United Kingdom
Kafka P, Germany Siegrist M, Switzerland
Kahle W, Germany Sørensen J, Denmark
Kleyner A, United States Storer T, United Kingdom
Kolowrocki K, Poland Sudret B, France
Konak A, United States Teixeira A, Portugal
Korczak E, Poland Tian Z, Canada
Kortner H, Norway Tint P, Estonia
Kosmowski K, Poland Trbojevic V, United Kingdom
Kozine I, Denmark Valis D, Czech Republic
Kulturel-Konak S, United States Vaurio J, Finland
Kurowicka D, The Netherlands Yeh W, Taiwan
Labeau P, Belgium Zaitseva E, Slovakia
Zio E, Italy

Webpage Administration
Alexandre Janeiro Instituto Superior Técnico, Portugal

Local Organizing Committee


Sofía Carlos Alberola Universidad Politécnica de Valencia
Eva Ma Doménech Antich Universidad Politécnica de Valencia
Antonio José Fernandez Iberinco, Chairman Reliability Committee AEC
Blás Galván González Universidad de Las Palmas de Gran Canaria
Aitor Goti Elordi Universidad de Mondragón
Sebastián Martorell Alsina Universidad Politécnica de Valencia
Rubén Mullor Ibañez Universidad de Alicante

Rafael Pérez Ocón Universidad de Granada
Ana Isabel Sánchez Galdón Universidad Politécnica de Valencia
Vicente Serradell García Universidad Politécnica de Valencia
Gabriel Winter Althaus Universidad de Las Palmas de Gran Canaria

Conference Secretariat and Technical Support at Universidad Politécnica de Valencia


Gemma Cabrelles López
Teresa Casquero García
Luisa Cerezuela Bravo
Fanny Collado López
María Lucía Ferreres Alba
Angeles Garzón Salas
María De Rus Fuentes Manzanero
Beatriz Gómez Martínez
José Luis Pitarch Catalá
Ester Srougi Ramón
Isabel Martón Lluch
Alfredo Moreno Manteca
Maryory Villamizar Leon
José Felipe Villanueva López

Sponsored by
Ajuntament de Valencia
Asociación Española para la Calidad (Comité de Fiabilidad)
CEANI
Generalitat Valenciana
Iberdrola
Ministerio de Educación y Ciencia
PMM Institute for Learning
Tekniker
Universidad de Las Palmas de Gran Canaria
Universidad Politécnica de Valencia


Acknowledgements

The conference is organized jointly by Universidad Politécnica de Valencia, ESRA (European Safety and
Reliability Association) and SRA-Europe (Society for Risk Analysis—Europe), under the high patronage of
the Ministerio de Educación y Ciencia, Generalitat Valenciana and Ajuntament de Valencia.
Thanks also to the support of our sponsors Iberdrola, PMM Institute for Learning, Tekniker, Asociación
Española para la Calidad (Comité de Fiabilidad), CEANI and Universidad de Las Palmas de Gran Canaria. The
support of all is greatly appreciated.
The work and effort of the peers involved in the Technical Program Committee in helping the authors to
improve their papers are greatly appreciated. Special thanks go to the Technical Area Coordinators and organisers
of the Special Sessions of the Conference, for their initiative and planning which have resulted in a number of
interesting sessions. Thanks to authors as well as reviewers for their contributions in the review process. The
review process has been conducted electronically through the Conference web page. The support to the web
page was provided by the Instituto Superior Técnico.
We would especially like to acknowledge the local organising committee and the conference secretariat and technical
support at the Universidad Politécnica de Valencia for their careful planning of the practical arrangements.
These conference proceedings have been partially financed by the Ministerio de Educación y Ciencia
de España (DPI2007-29009-E), the Generalitat Valenciana (AORG/2007/091 and AORG/2008/135) and the
Universidad Politécnica de Valencia (PAID-03-07-2499).


Introduction

The Conference covers a number of topics within safety, reliability and risk, and provides a forum for presentation
and discussion of scientific papers covering theory, methods and applications to a wide range of sectors and
problem areas.

Thematic Areas
• Accident and Incident Investigation
• Crisis and Emergency Management
• Decision Support Systems and Software Tools for Safety and Reliability
• Dynamic Reliability
• Fault Identification and Diagnostics
• Human Factors
• Integrated Risk Management and Risk-Informed Decision-making
• Legislative dimensions of risk management
• Maintenance Modelling and Optimisation
• Monte Carlo Methods in System Safety and Reliability
• Occupational Safety
• Organizational Learning
• Reliability and Safety Data Collection and Analysis
• Risk and Evidence Based Policy Making
• Risk and Hazard Analysis
• Risk Control in Complex Environments
• Risk Perception and Communication
• Safety Culture
• Safety Management Systems
• Software Reliability
• Stakeholder and public involvement in risk governance
• Structural Reliability and Design Codes
• System Reliability Analysis
• Uncertainty and Sensitivity Analysis

Industrial and Service Sectors


• Aeronautics and Aerospace
• Automotive Engineering
• Biotechnology and Food Industry
• Chemical Process Industry
• Civil Engineering
• Critical Infrastructures
• Electrical and Electronic Engineering
• Energy Production and Distribution
• Health and Medicine
• Information Technology and Telecommunications
• Insurance and Finance
• Manufacturing
• Mechanical Engineering
• Natural Hazards

• Nuclear Engineering
• Offshore Oil and Gas
• Policy Decisions
• Public Planning
• Security and Protection
• Surface Transportation (road and train)
• Waterborne Transportation

Reliability and safety data collection and analysis

A new step-stress Accelerated Life Testing approach: Step-Down-Stress

Chunhua Zhang, Yashun Wang, Xun Chen & Yu Jiang


College of Mechatronics and Automation, National University of Defense Technology, Changsha, China

ABSTRACT: Step-stress ALT is a widely used method in life validation of products with high reliability. This
paper presents a new step-stress ALT approach with an opposite exerting sequence of stress levels in contrast to
traditional step-stress, so-called Step-Down-Stress (SDS). The testing efficiency of SDS ALT is compared with
step-stress ALT by Monte-Carlo simulation and contrastive experiment. This paper also presents a statistical
analysis procedure for SDS ALT under Weibull distribution. A practical ALT on bulb is given in the end to
illustrate the approach. SDS ALT may improve the testing efficiency of traditional step-stress ALT remarkably
when applied to life validation of products with high reliability. It consumes less time for the same number of
failures, and yields more failures in the same testing time, than traditional step-stress ALT with an identical
testing plan. The corresponding statistical analysis procedure establishes a uniform analysis framework and can
be applied to different acceleration equations easily.

ACRONYMS

s-    implies statistical(ly)
ALT   accelerated life testing
AST   accelerated stress testing
SDS   step-down-stress
CEM   cumulative exposure model
CDF   cumulative distribution function
IPL   inverse power law

NOTATION

k       number of stress levels
Si      stress levels, i = 0, . . ., k
X       X = f(S)
Fi(t)   CDF of failures under Si
m       shape parameter of Weibull distribution
η       scale parameter of Weibull distribution
ti      time to failure under Si
ri      censored number under Si
n       number of specimens
Kij     acceleration factor between Si and Sj
xi      equivalent time to failure under Si converted from other stress levels
ln(x)   natural logarithm of x

1 INTRODUCTION

Accelerated life testing (ALT) is a commonly used AST which is used to get information quickly on the life
distribution of a product. Specimens are tested under severe conditions in ALT and fail sooner than under use
conditions. According to the exerting way of stress, ALT commonly falls into three categories: constant-stress
ALT, step-stress ALT and progressive stress ALT, as shown in Figure 1. Even so, ALT sometimes takes a lengthy
period of time to reach the censored failure number, or yields no failures before the censored time, especially
in the life validation of products with high reliability.

Figure 1. Methods of ALT: a. Constant-stress ALT; b. Step-stress ALT; c. Progressive stress ALT; d. Step-down-stress ALT.

Step-stress ALT is applied more widely in engineering, because it is more efficient than constant-stress ALT,
and easier to perform than progressive stress ALT. In step-stress ALT, specimens are first

subjected to a specified constant stress for a speci-
fied length of time; after that, they are subjected to a
higher stress level for another specified time; the stress
on specimens is thus increased step by step[1∼8] .
This paper presents a new step-stress ALT approach
with an opposite exerting sequence of stress levels in
contrast to traditional step-stress, so-called step-down-
stress (SDS), from the assumption that the change in
exerting sequence of stress levels will improve test-
ing efficiency. The validity of SDS ALT is discussed
through comparison with traditional step-stress ALT by
Monte-Carlo simulation and contrastive experiment,
and it concludes that SDS ALT takes less time for same
failure number and gets more failures in same time.
A new s-analysis procedure is constructed for SDS
ALT, which is applicable to different acceleration
equations and may be programmed easily.
The rest of this paper is organized as follows:
section 2 describes SDS ALT including basic assump-
tions, the definition, s-analysis model, and Monte-
Carlo Simulation; section 3 presents an s-analysis
procedure for SDS ALT; section 4 gives a practical
example; section 5 concludes the paper.

Figure 2. CEM for traditional step-stress ALT.


2 STEP-DOWN-STRESS ALT MODEL

2.1 Basic assumptions


a. Time to failure at each stress level follows the Weibull distribution

   F(t) = 1 − exp[−(t/η)^m], t > 0   (1)

   The wear-out failures occur at an increasing failure rate with m > 1.

b. ηi and Si meet the acceleration equation

   ln(η) = Σ_{j=0}^{n} aj Xj   (2)

   which is the Arrhenius model when X = 1/S and the IPL model when X = ln(S).

c. Failure mechanism is identical at different stress levels.
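For illustration, the acceleration equation (2) is linear in the transformed stress X, so its coefficients can be estimated by an ordinary least-squares fit of ln η against X. A minimal sketch: the η values below are the bulb-test estimates reported in Table 3 of this paper, while the use-level stress S0 = 230 V and the fitting approach are our hypothetical choices, not the authors' procedure.

```python
import numpy as np

# Scale parameters eta_i estimated at four voltage stresses (Table 3 of this paper).
S = np.array([250.0, 270.0, 287.0, 300.0])
eta = np.array([285.160, 97.248, 83.294, 14.690])

# Acceleration equation (2) with the IPL transform X = ln(S):
# ln(eta) = a0 + a1 * ln(S).  (The Arrhenius model would use X = 1/S instead.)
a1, a0 = np.polyfit(np.log(S), np.log(eta), 1)

# Extrapolated scale parameter at a hypothetical use-level stress S0 = 230 V.
eta_use = np.exp(a0 + a1 * np.log(230.0))
print(a0, a1, eta_use)
```

The fitted coefficients come out with the same signs and order of magnitude as equation (15), and a lower stress yields a larger predicted scale parameter, as expected.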

2.2 Definition
SDS ALT is shown in Figure 1d. In SDS ALT, specimens are initially subjected to the highest stress level Sk
until rk failures occur; the stress is then stepped down to Sk−1 until rk−1 failures occur, then down again to
Sk−2 until rk−2 failures occur, and so on. The test is terminated at the lowest stress S1 once r1 failures occur.

From the description above, SDS ALT is symmetrical to traditional step-stress ALT, with the opposite varying
direction of stress levels. The following text will show that this change in the exerting sequence of stress
levels can improve the test efficiency remarkably.

Figure 3. CEM for SDS ALT.

Table 1. Monte-Carlo simulations for step-stress and SDS ALT.

No.  Model (m, η*)                           SDS plan (n, r*)    Step-stress plan (n, r*)  e
1    Model 1: 2.4, [188 115 81 63]           50, [10 10 10 10]   50, [10 10 10 10]         1.318
2                                            50, [25 5 5 5]      50, [5 5 5 25]            1.338
3                                            60, [10 10 10 10]   60, [10 10 10 10]         1.399
4    Model 2: 4.5, [58678 16748 5223 1759]   50, [10 10 10 10]   50, [10 10 10 10]         3.916
5                                            40, [10 10 10 10]   40, [10 10 10 10]         2.276
6                                            40, [20 5 5 5]      40, [5 5 5 20]            4.699
7    Model 3: 3.8, [311 108 47 26]           40, [10 10 10 10]   40, [10 10 10 10]         1.696
8                                            40, [20 5 5 5]      40, [5 5 5 20]            2.902

*η = [η1, η2, η3, η4]; r = [r4, r3, r2, r1] for SDS ALT, r = [r1, r2, r3, r4] for step-stress ALT.
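The efficiency index of Table 1 can be reproduced in outline. Under assumption (a) with a common shape parameter m, the cumulative exposure model reduces to an additive normalized age z = Σ Δt/η shared by all specimens, so each simulated test only needs the order statistics of n standard Weibull variates. The following sketch uses the Model 1 parameters; it is a simplified reading of the CEM, not the authors' simulation code.

```python
import numpy as np

rng = np.random.default_rng(1)

def total_test_time(etas, rs, z_sorted):
    """Total time until the censored failure counts rs are reached, visiting
    stress levels with Weibull scales `etas` in order. Under CEM with a common
    shape m, all units share one normalized age z; unit i fails when z reaches
    its own standard Weibull(m) quantile."""
    t, z_prev, fails = 0.0, 0.0, 0
    for eta, r in zip(etas, rs):
        fails += r
        z_next = z_sorted[fails - 1]   # age at the r-th failure of this level
        t += (z_next - z_prev) * eta   # real time spent at this level
        z_prev = z_next
    return t

# Model 1 from Table 1: m = 2.4, eta = [188, 115, 81, 63] for S1..S4, n = 50.
m, eta, n = 2.4, [188.0, 115.0, 81.0, 63.0], 50
ratios = []
for _ in range(500):
    z = np.sort(rng.weibull(m, n))                             # latent ages
    t_up = total_test_time(eta, [10, 10, 10, 10], z)           # S1 -> S4
    t_down = total_test_time(eta[::-1], [10, 10, 10, 10], z)   # S4 -> S1
    ratios.append(t_up / t_down)

e = float(np.mean(ratios))   # efficiency index, cf. Table 1, No. 1
print(e)
```

With these parameters the simulated index lands near the 1.318 reported for simulation No. 1.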

2.3 s-Analysis model

According to CEM[2], the remaining life of units under a step-stress pattern depends only on the current
cumulative fraction failed and the current stress, regardless of how the fraction accumulated.

As shown in Figure 2, if the stress levels in step-stress ALT are S1, S2, . . . , Sk, where the duration at Si is
ti − ti−1 and the corresponding CDF is Fi(t), the population CDF F0(t) of specimens in step-stress ALT equals
F1(t) at first and then steps to F2(t) at t1. F0(t) steps up successively in that way until Sk.

As shown in Figure 3, if CEM is applied to SDS ALT, the population CDF F0(t) of specimens in SDS ALT
equals Fk(t) at first and then steps to Fk−1(t) at tk. F0(t) steps successively in that way until S1.

2.4 Monte-Carlo simulation

In order to discuss the efficiency of SDS ALT further, a contrastive analysis is performed through the
Monte-Carlo simulations shown in Table 1.

Let the use stress be S0, the k accelerated stresses be S1, S2, . . . , Sk, and the scale parameters corresponding
to Si be η0, η1, . . . , ηk according to acceleration equation (2). In the Monte-Carlo simulation, the sampling
size of units is n, and the k censored failure numbers corresponding to Si are r1, r2, . . . , rk.

The Monte-Carlo simulations (see Table 1) are performed on three commonly-used acceleration equations,
which come from practical products. The result of each simulation is expressed as the efficiency index e, the
average ratio of the total testing time in traditional step-stress ALT to that in SDS ALT. If e > 1, the
efficiency of SDS ALT is higher than that of traditional step-stress ALT.

From Table 1, the following rules can be drawn:

a. The efficiency of SDS ALT in these cases is commonly higher than that of traditional step-stress ALT. The
   highest value, e = 4.699 in simulation No. 6, means that the total testing time for SDS ALT is only 21.28%
   of that of traditional step-stress ALT.
b. The longer the life of the specimen, the greater the advantage of SDS ALT. For example, e of Model 2 is
   commonly higher than for the other two models, which means SDS ALT is better suited to long-life validation.

3 STATISTICAL ANALYSIS

3.1 Description of the problem

Let the stress levels in SDS ALT be Sk, Sk−1, . . . , S1 (Sk > Sk−1 > · · · > S1), let the specimen size be n,
and let the censored failure number for Si be ri. The stress steps down to Si−1 when ri failures have occurred
at Si. The times to failure in such an SDS ALT can be described as follows:

Sk:   tk,1, tk,2, . . . , tk,rk
Sk−1: tk−1,1, tk−1,2, . . . , tk−1,rk−1      (3)
. . . . . .
S1:   t1,1, t1,2, . . . , t1,r1

where ti,j denotes the j-th time to failure under Si, timed from the beginning of Si.

3.2 Estimation of distribution parameters

Failures in SDS ALT are the cumulative effect of several accelerated stresses (except at Sk), so a key problem
in s-analysis for SDS ALT is how to convert testing time between stress levels to obtain population life
information. Data in s-analysis for traditional step-stress ALT are converted mainly through acceleration
equations[1∼4], and the solution accordingly becomes very complex and sometimes diverges.

If the CDF of a specimen is Fi(ti) at ti under Si, then according to CEM a tj can be found at Sj to satisfy

Fi(ti) = Fj(tj)   (4)

which means that the cumulative degradation of life at Si equals that at Sj in the respective given times. So
the acceleration factor can be defined as Kij = tj/ti. For the Weibull distribution,

mi = mj
Kij = ηj/ηi   (5)

See APPENDIX A for the proof of (5). Because Ki,i+1 = ti+1/ti = ηi+1/ηi, the time to failure ti+1 at Si+1 can
be converted into equivalent data under Si for the Weibull distribution:

xi = ηi(ti+1/ηi+1)   (6)

If all the equivalent data at Si+1 are

xi+1,1, xi+1,2, . . . , xi+1,ni+1   (7)

where ni+1 = rk + rk−1 + · · · + ri+1, the former ni+1 equivalent times to failure at Si are

xi,j = ηi(xi+1,j/ηi+1)   (j = 1, 2, . . . , ni+1)   (8)

The succeeding ri equivalent times to failure at Si can be obtained by adding the cumulative testing time to
ti,j at Si:

xi,ni+1+j = xi,ni+1 + ti,j   (j = 1, 2, . . . , ri)   (9)

So ni (ni = rk + rk−1 + · · · + ri) equivalent times to failure at Si are obtained, which form the quasi sample
xi(ηi) under the Weibull distribution

x1(ηi) < x2(ηi) < · · · < xni(ηi)   (10)

which contains the unknown variable ηi. Let

uj(mi, ηi) = Σ_{k=1}^{j} xk^mi(ηi) + (n − j)xj^mi(ηi)   (j = 1, 2, . . . , ni)   (11)

then the inverse moment estimates of the Weibull parameters under Si follow from

Σ_{j=1}^{ni−1} ln[uni(m̂i, η̂i)/uj(m̂i, η̂i)] = ni − 1   (12)

η̂i = [(Σ_{j=1}^{ni} xj^m̂i(η̂i) + (n − ni)xni^m̂i(η̂i))/ni]^{1/m̂i}   (13)

So mi and ηi can be obtained by solving equations (12) and (13). See APPENDIX B for a numerical algorithm.

3.3 Estimation of acceleration equation

By the procedure above, the s-analysis for SDS ALT is transformed into the s-analysis for an equivalent
constant-stress ALT, which can be described as follows:

Sk:   tk,1, tk,2, . . . , tk,rk
Sk−1: xk−1,1, xk−1,2, . . . , xk−1,rk+rk−1      (14)
. . . . . .
S1:   x1,1, x1,2, . . . , x1,rk+rk−1+···+r1

To estimate the parameters of the acceleration equation, an estimation method for constant-stress ALT under
the Weibull distribution can be applied[9].

4 EXPERIMENT RESULT

To illustrate the efficiency of the proposed SDS ALT approach, an SDS ALT and a traditional step-stress ALT
on bulbs were performed with identical testing plans but inverse sequences of stress levels, as shown in
Table 2. The accelerated stress is the voltage, given by S1 = 250 V, S2 = 270 V, S3 = 287 V, and S4 = 300 V;
n = 40 and (r1, r2, r3, r4) = (5, 5, 5, 20). The results show that the total testing time of SDS ALT is only
30.7% of that of traditional step-stress ALT.

Table 2. Contrastive experiment of SDS ALT and traditional step-stress ALT.

Approach   Si (V)            (n, r)           Testing time at each stress level (h)   Total testing time (h)
Step up    250 270 287 300   40, [5 5 5 20]   192.118  10.183  3.690  7.528           213.519
Step down  300 287 270 250   40, [20 5 5 5]   13.342   7.092   9.325  35.743          65.502

Data from this SDS ALT are s-analyzed with the proposed procedure and the distribution parameters are

shown in Table 3. The relation between η and S satisfies the IPL model

ln η = 81.43918 − 13.71378 · ln S   (15)

The parameters of acceleration equation (15) are close to those from constant-stress ALT, which demonstrates
the validity of SDS ALT and its s-analysis procedure.

Table 3. Distribution parameters under Si.

Si (V)   300      287      270      250
m        3.827    3.836    3.827    3.821
η (h)    14.690   83.294   97.248   285.160

5 CONCLUSION

An SDS ALT and its s-analysis procedure are presented in this paper for life validation by AST. The validity
of this new step-stress approach is discussed through Monte-Carlo simulation and a contrastive experiment. It
shows that SDS ALT may improve testing efficiency remarkably and accordingly decrease the cost of testing. The
s-analysis procedure for SDS ALT under the Weibull distribution is constructed, and its validity is also
demonstrated through the ALT on bulbs.

Future efforts in the research of SDS ALT may include further discussion on efficiency, improvement of the
s-analysis procedure, and optimal design of test plans for SDS ALT.

APPENDIX

A Proof of (5)

The CDF of the Weibull distribution is

F(t) = 1 − exp[−(t/η)^m], t > 0   (16)

where m and η (m, η > 0) are the shape parameter and scale parameter respectively. From equation (4),

1 − exp[−(ti/ηi)^mi] = 1 − exp[−(tj/ηj)^mj]   (17)

Because exp(·) is strictly monotone and tj = Kij·ti,

(ti/ηi)^mi = (Kij·ti/ηj)^mj   (18)

which is an identity for arbitrary ti. So

mi = mj,  Kij = ηj/ηi.

B Numerical solution algorithm of (12) & (13)

To simplify the procedure, (12) & (13) are transformed into

1 + (ni − 1) ln uni(mi, ηi) − Σ_{i=1}^{ni−1} ln ui(mi, ηi) = ni   (19)

ηi = [uni(mi, ηi)/ni]^{1/mi}   (20)

Equation (20) can be written as

ηi(l + 1) = [uni(mi(l), ηi(l))/ni]^{1/mi(l)}   (21)

by which the convergent solution of ηi corresponding to mi(l) can be computed iteratively.

The solution of (19) can be obtained with a tentative method. Let

φni(mi, ηi) = 1 + (ni − 1) ln uni(mi, ηi) − Σ_{i=1}^{ni−1} ln ui(mi, ηi)   (22)

by which φni(mi(l), ηi(l + 1)) can be computed for a given mi(l) and ηi(l + 1).

Figure A1. Numerical solution algorithm of (12) & (13).
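The estimation machinery of Section 3.2 and this appendix can be sketched compactly. The helper below solves (19)-(20) by a bisection in place of the tentative Δm stepping, exploiting that the left-hand side of (19) increases with m; it is demonstrated at the highest stress level, where the quasi sample needs no conversion, and convert_down implements the rescaling of equations (6)/(8). The simulated data (true m = 2, η = 100) and all names are our assumptions, not the authors' code.

```python
import numpy as np

def u(j, x, n, m):
    # u_j(m) of equation (11): x is the ascending quasi sample of size n_i,
    # n is the total number of specimens on test.
    return np.sum(x[:j] ** m) + (n - j) * x[j - 1] ** m

def g(m, x, n):
    # phi_{n_i}(m) - n_i, rewritten from equations (12)/(19); increasing in m.
    ni = len(x)
    return sum(np.log(u(ni, x, n, m) / u(j, x, n, m)) for j in range(1, ni)) - (ni - 1)

def inverse_moment_estimate(x, n, lo=0.2, hi=20.0):
    # Bisection on (19) in place of the tentative delta-m stepping, then (20).
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid, x, n) < 0.0 else (lo, mid)
    m = 0.5 * (lo + hi)
    eta = (u(len(x), x, n, m) / len(x)) ** (1.0 / m)
    return m, eta

def convert_down(x_high, eta_high, eta_low):
    # Equations (6)/(8): equivalent failure times at the lower stress level.
    return eta_low * (np.asarray(x_high) / eta_high)

# Demo at the highest stress level S_k, where the sample needs no conversion:
# n = 30 specimens, test censored after the first n_i = 20 failures.
rng = np.random.default_rng(7)
x = np.sort(100.0 * rng.weibull(2.0, 30))[:20]
m_hat, eta_hat = inverse_moment_estimate(x, n=30)
print(m_hat, eta_hat)
```

Failures from Sk would then be carried to Sk−1 with convert_down and appended to the native Sk−1 failures per equation (9), and the estimation repeated level by level.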

If φni(mi(l), ηi(l + 1)) = ni,

mi(l + 1) = mi(l)   (23)

If φni(mi(l), ηi(l + 1)) > ni, then mi(l + 1) < mi(l); try again with

mi(l + 1) = mi(l) − Δm   (24)

If φni(mi(l), ηi(l + 1)) < ni, then mi(l + 1) > mi(l); try again with

mi(l + 1) = mi(l) + Δm   (25)

Figure A1 shows the algorithm above.

REFERENCES

[1] Nelson, W. 1990. Accelerated Testing: Statistical Models, Test Plans, and Data Analyses. New York: John
    Wiley & Sons: 18–22.
[2] Nelson, W. 1980. Accelerated Life Testing—Step-stress Models and Data Analysis. IEEE Trans. on
    Reliability R-29: 103–108.
[3] Tyoskin, O. & Krivolapov, S. 1996. Nonparametric Model for Step-Stress Accelerated Life Testing. IEEE
    Trans. on Reliability 45(2): 346–350.
[4] Tang, L. & Sun, Y. 1996. Analysis of Step-Stress Accelerated-Life-Test Data: A New Approach. IEEE Trans.
    on Reliability 45(1): 69–74.
[5] Miller, R. & Nelson, W. 1983. Optimum Simple Step-stress Plans for Accelerated Life Testing. IEEE Trans.
    on Reliability 32(1): 59–65.
[6] Bai, D., Kim, M. & Lee, S. 1989. Optimum Simple Step-stress Accelerated Life Tests with Censoring. IEEE
    Trans. on Reliability 38(5): 528–532.
[7] Khamis, I. & Higgins, J. 1996. Optimum 3-Step Step-Stress Tests. IEEE Trans. on Reliability 45(2):
    341–345.
[8] Yeo, K. & Tang, L. 1999. Planning Step-Stress Life-Test with a Target Acceleration-Factor. IEEE Trans. on
    Reliability 48(1): 61–67.
[9] Zhang, C. & Chen, X. 2002. Analysis for Constant-stress Accelerated Life Testing Data under Weibull Life
    Distribution. Journal of National University of Defense Technology (Chinese) 24(2): 81–84.


Application of a generalized lognormal distribution to engineering


data fitting

J. Martín & C.J. Pérez


Departamento de Matemáticas, Universidad de Extremadura, Cáceres, Spain

ABSTRACT: The lognormal distribution is commonly used to model certain types of data that arise in several
fields of engineering as, for example, different types of lifetime data or coefficients of wear and friction.
However, a generalized form of the lognormal distribution can be used to provide better fits for many types of
experimental or observational data. In this paper, a Bayesian analysis of a generalized form of the lognormal
distribution is developed. Bayesian inference offers the possibility of taking expert opinions into account. This
makes this approach appealing in practical problems concerning many fields of knowledge, including reliability
of technical systems. The full Bayesian analysis includes a Gibbs sampling algorithm to obtain the samples from
the posterior distribution of the parameters of interest. Empirical proofs over a wide range of engineering data
sets have shown that the generalized lognormal distribution can outperform the lognormal one in this Bayesian
context.

Keywords: Bayesian analysis, Generalized normal distribution, Engineering data, Lognormal distribution,
Markov chain Monte Carlo methods.

1 INTRODUCTION

The lognormal distribution is commonly used to model certain types of data that arise in several fields of
engineering as, for example, different types of lifetime data (see, e.g., Meeker and Escobar (1998)) or
coefficients of wear and friction (see, e.g., Steele (2008)). Particular properties of the lognormal random
variable (such as the non-negativeness and the skewness) and of the lognormal hazard function (which increases
initially and then decreases) make the lognormal distribution a suitable fit for some engineering data sets.
However, a generalized lognormal distribution can be used to provide better fits for many types of experimental
or observational data.

If a random variable X has a lognormal distribution, the random variable Y = log X is normally distributed.
This allows the better-known analysis techniques for the normal distribution to be applied to lognormal data
through transformation. The previous relationship suggests a possible generalization of the lognormal
distribution by using a similar transformation for the generalized normal distribution. The generalized normal
distribution (GN for short) is a generalization of the normal distribution that also has the t and Laplace
distributions as particular cases (see Nadarajah (2005)). In fact, it is a re-parametrization of the
exponential power (EP) distribution. The first formulation of the EP distribution could be attributed to
Subbotin (1923). Since then, several different parameterizations can be found in the literature (see, for
example, Box and Tiao (1973), Gómez et al. (1998), and Mineo and Ruggieri (2005)). This family provides
distributions with both heavier and lighter tails compared to the normal ones. The GN distributions allow the
modeling of kurtosis, providing, in general, a more flexible fit to experimental data than the normal
distributions. The main reason why the GN or EP distributions have not been used as often as desirable has
been purely computational, i.e., because most standard statistical software did not contain procedures using
GN distributions. Currently, the GN distribution is considered a feasible alternative to the normal one as a
general distribution for random errors.

In this paper, a generalized lognormal distribution (logGN for short) is analyzed from a Bayesian viewpoint.
If a random variable X has a logGN distribution, the random variable Y = log X is distributed as a GN. The
logGN distribution has the lognormal one as a particular case. Bayesian inference offers the possibility of
taking expert opinions into account. This makes the approach appealing in practical problems concerning many
fields of knowledge, including reliability of technical systems. The Bayesian approach is also interesting
when no prior information is obtained;

in this case a noninformative prior distribution is used. The full Bayesian analysis includes a Gibbs sampling
algorithm to obtain the samples from the posterior distribution of the parameters of interest. Then, the
predictive distribution can be easily obtained. Empirical proofs over a wide range of engineering data sets
have shown that the generalized lognormal distribution can outperform the lognormal one in this Bayesian
context.

The outline of this work is as follows. In Section 2, the logGN distribution is described. Section 3 presents
the Bayesian analysis with both noninformative and informative prior distributions. An example with friction
data illustrates the application of the proposed approach in Section 4. Finally, Section 5 presents the main
conclusions.

2 THE GENERALIZED LOGNORMAL MODEL

If a random variable X has a logGN distribution, the random variable Y = log X is distributed as a GN.
Therefore, the probability density function of a logGN distribution with parameters μ, σ, and s is given by:

f(x) = [s/(2 x σ Γ(1/s))] exp[−|(log x − μ)/σ|^s],

with x > 0, −∞ < μ < +∞, σ > 0 and s ≥ 1. Note that Γ denotes the gamma function.

This distribution has the lognormal distribution as a particular case by taking s = 2 and changing σ to √2 σ.
The log-Laplace distribution is recovered when s = 1. Figure 1 shows the probability density functions for
some values of s with μ = 0 and σ = 1.

Figure 1. Probability density functions for logGN distributions with μ = 0 and σ = 1, and several values of s
(s = 1.0, 1.5, 2.0, 3.0).

The capacity of a distribution to provide an accurate fit to data depends on its shape. The shape can be
defined by the third and fourth moments, which represent the asymmetry and flatness coefficients of a given
distribution. The logGN distributions allow the modeling of kurtosis, providing, in general, a more flexible
fit to experimental data than the lognormal distribution.

Random variates from the logGN distribution can be generated from random variates of the GN distribution via
exponentiation. Since the GN distribution is a reparameterization of the EP distribution, the techniques for
random generation of these distributions can be used for the GN distribution (see, for example, Devroye
(1986), Johnson (1987) and Barabesi (1993)). Walker and Gutiérrez-Peña (1999) suggested a mixture
representation for the EP distribution that is adapted here to be valid for the logGN distribution. The
following result will be used in the next section to determine the full conditional distributions necessary to
apply the Gibbs sampling method. The proof is immediate.

Proposition 1. Let X and U be two random variables such that
f(x|u) = [1/(2xσu^(1/s))] I[exp(μ − σu^(1/s)) < x < exp(μ + σu^(1/s))] and
f(u) = Gamma(shape = 1 + 1/s, scale = 1); then X ∼ logGN(μ, σ, s).

This result can also be used to generate random variates from a logGN(μ, σ, s). Generating from U is standard,
and generating from X|U is straightforward through the inverse transformation method. The algorithm to
generate random values is then given by the following steps:

1. Generate W ∼ Gamma(1 + 1/s, 1)
2. Generate V ∼ Uniform(−1, 1)
3. Set X = exp{σ W^(1/s) V + μ}

The next section presents a Bayesian analysis for the logGN distribution.

3 BAYESIAN ANALYSIS

Bayesian analyses with both noninformative and informative prior distributions are addressed in this section.

3.1 Noninformative case

Following the suggestions in Box and Tiao (1973) and Portela and Gómez-Villegas (2004) for the EP
distribution, independence between parameters is considered.

Jeffreys' noninformative prior distributions are considered here. Jeffreys' choice for the noninformative
density is π(θ) ∝ √I(θ), where I(θ) is the Fisher

information for θ (see, e.g., Box and Tiao (1973) or Gelman et al. (2004)). This prior distribution is
noninformative in the sense that it maximizes the entropy. In order to obtain the expressions for the
noninformative prior distributions of the parameters, the calculation of the Fisher information matrix is
required. The Fisher information matrix for the logGN distribution is given by:

I(μ, σ, s) =
⎛ s(s−1)Γ(1−1/s)/(σ²Γ(1/s))   0                   0                                  ⎞
⎜ 0                           s/σ²                −A/(σ s^(1−1/s))                   ⎟
⎝ 0                           −A/(σ s^(1−1/s))    [(1+1/s)ψ′(1+1/s) + A² − 1]/s³     ⎠

where ψ is the digamma function and A = log(s) + ψ(1 + 1/s).

Noninformative prior distributions are derived for the parameters, i.e.:

π(μ) ∝ 1,
π(σ) ∝ 1/σ,
π(s) ∝ √{[(1 + 1/s)ψ′(1 + 1/s) + A² − 1]/s³},

with −∞ < μ < +∞, σ > 0 and s ≥ 1.

Since the expression for π(s) is very involved, a simple and similar distribution is used, i.e., π(s) ∝ 1/s;
see Figure 2.

Figure 2. Comparison of the functions I(s) and 2/(3s).

A Markov Chain Monte Carlo (MCMC) method is applied to generate samples from the posterior distribution.
Specifically, a Gibbs sampling algorithm is derived. The mixture representation given in Proposition 1 is used
here to obtain the likelihood. Then, the likelihood of a sample x = (x1, x2, . . . , xn), given the vector of
mixing parameters u = (u1, u2, . . . , un), is:

L(μ, σ, s, u|x) = [1/(2σ)^n] Π_{i=1}^{n} [1/(xi ui^(1/s))] I[e^(μ−σui^(1/s)) < xi < e^(μ+σui^(1/s))].

Therefore, the posterior distribution is given by:

f(μ, σ, s, u|x) ∝ [s^(n−1)/(σ^(n+1) Γ^n(1/s))] Π_{i=1}^{n} (e^(−ui)/xi) × I[e^(μ−σui^(1/s)) < xi < e^(μ+σui^(1/s))].

The full conditional distributions are derived:

f(μ|σ, s, u, x) ∝ 1,  max_i{log(xi) − σui^(1/s)} < μ < min_i{log(xi) + σui^(1/s)}   (1)

f(σ|μ, s, u, x) ∝ 1/σ^(n+1),  σ > max_i{|μ − log(xi)|/ui^(1/s)}   (2)

f(s|μ, σ, u, x) ∝ s^(n−1)/Γ^n(1/s),  max_{i∈S−}{1, ai} < s < min_{i∈S+} ai   (3)

f(ui|μ, σ, s, x) ∝ e^(−ui),  ui > (|log(xi) − μ|/σ)^s,  i = 1, 2, . . . , n,   (4)

where S− = {i : log(|μ − log(xi)|/σ) < 0}, S+ = {i : log(|μ − log(xi)|/σ) > 0} and

ai = log(ui)/log(|μ − log(xi)|/σ),  i = 1, 2, . . . , n.

Random variates from these densities can be generated by using standard methods. Note that the densities
given in (1), (2) and (4) are uniform, Pareto and exponential, respectively, and they are generated by using
the inverse transformation method. The density

given in (3) is non-standard, but it can also be easily generated by using the rejection method (see, e.g.,
Devroye (1986)).

Iterative generation from the above conditional distributions produces a posterior sample of (μ, σ, s).

3.2 Informative case

In many situations, the data analyst is interested in including relevant initial information in the inference
process. The choice of the prior distribution must be carefully determined to allow the inclusion of this
information. Since the prior distribution choice depends on the problem in hand, there are multiple references
related to this topic in the literature (see, e.g., DeGroot (1970), Berger (1985), Ibrahim et al. (2001), and
Akman and Huwang (2001)). Kadane and Wolfson (1998) present an interesting review on the elicitation of expert
opinion. O'Hagan (1998) considers the elicitation of engineers' prior beliefs. Gutiérrez-Pulido et al. (2005)
present a comprehensive methodology to specify prior distributions for commonly used models in reliability.

The following prior distributions have been proposed because they can accommodate many possible shapes for the
kind of parameters involved in the logGN distribution. Besides, they allow efficient posterior calculations
and recover the noninformative distribution for each parameter. The proposed prior distributions are given by:

π(μ) ∝ p(μ),  −∞ < μ < +∞,   (5)

π(σ) ∝ σ^(−(a0+1)) e^(−b0/σ),  σ > 0,   (6)

π(s) ∝ s^(−(c0+1)) e^(−d0/s),  s ≥ 1,   (7)

where (5) is any distribution in its support, (6) is an inverse-gamma distribution and (7) is a truncated
inverse-gamma distribution. Note that the noninformative prior distributions are recovered when p(μ) is
constant and a0 = b0 = c0 = d0 = 0. This fact is especially interesting because it allows the use of a
noninformative prior distribution for one or two parameters and informative prior distributions for the
remaining parameters.

Analogously to the noninformative case, the posterior distribution is derived:

f(μ, σ, s, u|x) ∝ [p(μ) s^(n−c0−1) e^(−d0/s)/(σ^(a0+n+1) e^(b0/σ) Γ^n(1/s))] Π_{i=1}^{n} (e^(−ui)/xi)
× I[e^(μ−σui^(1/s)) < xi < e^(μ+σui^(1/s))].

The full conditional distributions are:

f(μ|σ, s, u, x) ∝ p(μ),  max_i{log(xi) − σui^(1/s)} < μ < min_i{log(xi) + σui^(1/s)}   (8)

f(σ|μ, s, u, x) ∝ σ^(−(a0+n+1)) e^(−b0/σ),  σ > max_i{|μ − log(xi)|/ui^(1/s)}   (9)

f(s|μ, σ, u, x) ∝ s^(n−c0−1) e^(−d0/s)/Γ^n(1/s),  max_{i∈S−}{1, ai} < s < min_{i∈S+} ai   (10)

f(ui|μ, σ, s, x) ∝ e^(−ui),  ui > (|log(xi) − μ|/σ)^s,  i = 1, 2, . . . , n,   (11)

where S−, S+ and ai, i = 1, 2, . . . , n, are defined as in the previous subsection.

In this case, generating from these truncated densities is also easy to perform. Generating from (8) depends
on the chosen density. Note that the density given in (9) is a left-truncated inverse-gamma. Random variates
from this distribution are obtained by taking the reciprocal of variates from a truncated gamma distribution.
The density for the conditional distribution given in (10) is again non-standard. Similarly to the
noninformative case, a rejection method is implemented. Finally, note that (11) is the same as (4).

4 APPLICATION TO FRICTION DATA

In basic tribology, dimensional issues and the central limit theorem are used to argue that the distributions
of the coefficients of friction and wear are typically lognormal. Moreover, empirical evidence from many data
sources supports this argument (see, e.g., Wallbridge and Dowson (1987) and Steele (2008)). Steele (2008)
recommends that engineers, without evidence to suggest differently, allocate a lognormal distribution to the
coefficients of friction and wear.

The use of the proposed approach is illustrated with a data set (presented in Nica (1969)) containing the
coefficients of friction of clean steel in a vacuum. The size of this data set is 23. These data are used as
historical data to extract knowledge on the predictive distribution of the friction coefficient. The most
e−ui 1/s 1/s tive distribution of the friction coefficient. The most
× I [eμ−σ ui < xi < eμ+σ ui ]. usual case in engineering studies is to have some prior
xi
i=1 information about the process on which they are trying
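The truncated full conditionals lend themselves directly to Gibbs updates. The following sketch (an illustration only, not the authors' implementation; the data and hyperparameter values are hypothetical) draws σ from the left-truncated inverse-gamma conditional (9) by taking the reciprocal of a right-truncated gamma variate, and each latent ui from the shifted unit exponential in (11), using only the Python standard library.

```python
import math
import random

def sample_sigma(mu, s, u, x, a0, b0, rng=random):
    """Gibbs update for sigma from conditional (9): a left-truncated
    inverse-gamma with shape a0 + n and rate b0, restricted to
    sigma > L = max_i |mu - log(x_i)| / u_i**(1/s).
    Implemented as the reciprocal of a right-truncated gamma variate,
    obtained here by simple rejection."""
    n = len(x)
    L = max(abs(mu - math.log(xi)) / ui ** (1.0 / s) for xi, ui in zip(x, u))
    while True:
        g = rng.gammavariate(a0 + n, 1.0 / b0)  # Gamma(shape, scale = 1/rate)
        if g <= 1.0 / L:                        # accept only sigma = 1/g > L
            return 1.0 / g

def sample_u(mu, sigma, s, x, rng=random):
    """Gibbs update for the latent u_i from conditional (11): a unit-rate
    exponential shifted to the lower bound (|log(x_i) - mu| / sigma)**s."""
    return [(abs(math.log(xi) - mu) / sigma) ** s + rng.expovariate(1.0)
            for xi in x]
```

Within a full sampler these updates would alternate with draws of μ from its truncated p(μ) and of s by rejection from (10).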

The most usual case in engineering studies is to have some prior information about the process on which one is trying to make inferences. This corresponds to the informative case. Then, the historical information and the prior information provided by the engineer are embedded in the posterior distribution.

The following step is to choose the hyperparameter values for the prior distributions of the parameters of interest μ, σ, and s. There are many ways to elicit these values. One possibility is to specify the values according to previous direct knowledge of the parameters (see, e.g., Berger (1985) and Akman and Huwang (2001)). Another one consists in using partial information elicited by the expert; in this case, there are many criteria to obtain the hyperparameter values, for example, maximum entropy and maximum posterior risk (see Savchuk and Martz (1994)). A third possibility, considered here, is to use expert information on the expected data and not on the parameters. This is easier for engineers who are not familiar with the parameters but have an approximate knowledge of the process. Finally, it is remarkable that noninformative prior distributions can be used for any of the parameters and informative prior distributions for the remaining ones.

In this application, the hyperparameters are obtained by using a method similar to the one presented by Gutiérrez-Pulido et al. (2005). In this case the expert is asked to provide occurrence intervals for some usual quantities such as the mode, median and third quartile. The expert considered that these quantities should be in the following intervals: [LMo, UMo] = [0.935, 0.955], [LMe, UMe] = [0.95, 0.96], and [LQ3, UQ3] = [0.97, 0.985]. By using the informative prior distributions presented in subsection 3.2, with μ ∼ N(μ0, σ0), and following the development in Gutiérrez-Pulido et al. (2005), the hyperparameters obtained are μ0 = −0.0460, σ0 = 0.0017, a0 = 80.8737, b0 = 4.0744, c0 = 60.2888, and d0 = 220.4130.

After it has been verified that chain convergence has been achieved, a sample of size 10,000 for the parameters of the posterior distribution is generated. The 95% Highest Density Regions (HDR) for μ, σ, and s are (−0.0490, −0.0422), (0.0595, 0.0855), and (2.2468, 3.3496), respectively (see Figure 3). Note that the HDR for s does not contain the value s = 2, which recovers lognormality.

Figure 3. 95% HDR for μ, σ, and s.

In order to make a performance comparison, a similar procedure is implemented to obtain the hyperparameters in the lognormal case. Then, a posterior sample is generated by using the lognormal distribution instead of the logGN distribution. Here the 95% HDR for μ and σ are (−0.0494, −0.0427) and (0.0923, 0.1416). The posterior predictive distributions are presented in Figure 4.

Figure 4. Posterior predictive distributions for friction data. Solid line: logGN; dashed line: lognormal.

The comparison between both performances is based on the generated posterior predictive distributions. The criterion used to compare them is the logarithmic score, used as a utility function in a statistical decision framework. This was proposed by Bernardo (1979) and used, for example, by Walker and Gutiérrez-Peña (1999) in a similar context. The expected utilities for the lognormal and logGN models can be estimated as

Ū0 = (1/n) ∑_{i=1}^{n} log(p0(xi)),

Ū1 = (1/n) ∑_{i=1}^{n} log(p1(xi)),

where p0 and p1 are the posterior predictive distributions of the lognormal and logGN models, respectively.
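Estimating these expected utilities amounts to averaging log predictive densities over the data. A minimal sketch follows; the Gaussian densities are hypothetical stand-ins for the two posterior predictive distributions, used only to show the mechanics of the comparison.

```python
import math
from statistics import NormalDist

def expected_log_utility(data, predictive_pdf):
    """Bernardo's logarithmic score: U = (1/n) * sum_i log p(x_i)."""
    return sum(math.log(predictive_pdf(xi)) for xi in data) / len(data)

# Hypothetical stand-ins for the two posterior predictive densities;
# in the paper p0 and p1 are the lognormal and logGN predictives.
data = [0.93, 0.95, 0.96, 0.94]
p0 = NormalDist(0.945, 0.020).pdf  # broader candidate density
p1 = NormalDist(0.945, 0.012).pdf  # tighter candidate density
U0 = expected_log_utility(data, p0)
U1 = expected_log_utility(data, p1)
preferred = "p1" if U1 > U0 else "p0"
```

The model with the larger average log score is preferred, which is exactly the comparison reported for Ū0 and Ū1 below.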
The estimated values are Ū0 = 1.4118 and Ū1 = 1.4373, so the logGN model performs better than the lognormal one. The same happens for the noninformative model (Ū0 = 1.5533 and Ū1 = 1.5881).

5 CONCLUSION

The generalized form of the lognormal distribution, presented and analyzed from a Bayesian viewpoint, offers the possibility of taking expert opinions into account. The proposed approach represents a viable alternative for analyzing data that are supposed to follow a lognormal distribution and provides flexible fits to many types of experimental or observational data. The technical development is based on a mixture representation that allows inferences to be performed via Gibbs sampling. It is remarkable that the logGN family provides very flexible distributions that can empirically fit many types of experimental or observational data obtained from engineering studies.

ACKNOWLEDGEMENTS

This research has been partially supported by Ministerio de Educación y Ciencia, Spain (Project TSI2007-66706-C04-02).

REFERENCES

Akman, O. and L. Huwang (2001). Bayes computation for reliability estimation. IEEE Transactions on Reliability 46(1), 52–55.
Barabesi, L. (1993). Optimized ratio-of-uniform method for generating exponential power variates. Statistica Applicata 5(2), 149–155.
Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer.
Bernardo, J.M. (1979). Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society B 41, 113–147.
Box, G. and G. Tiao (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading.
DeGroot, M.H. (1970). Optimal Statistical Decisions. McGraw-Hill, New York.
Devroye, L. (1986). Non-Uniform Random Variate Generation. Springer-Verlag.
Gelman, A., J.B. Carlin, H.S. Stern, and D.B. Rubin (2004). Bayesian Data Analysis. Chapman & Hall/CRC.
Gómez, E., M.A. Gómez-Villegas, and J.M. Marín (1998). A multivariate generalization of the power exponential family of distributions. Communications in Statistics—Theory and Methods 27(3), 589–600.
Gutiérrez-Pulido, H., V. Aguirre-Torres, and J.A. Christen (2005). A practical method for obtaining prior distributions in reliability. IEEE Transactions on Reliability 54(2), 262–269.
Ibrahim, J.G., M.H. Chen, and D. Sinha (2001). Bayesian Survival Analysis. Springer-Verlag.
Johnson, M.E. (1987). Multivariate Statistical Simulation. John Wiley and Sons.
Kadane, J.B. and L.J. Wolfson (1998). Experiences in elicitation. The Statistician 47, 3–19.
Meeker, W.Q. and L.A. Escobar (1998). Statistical Methods for Reliability Data. John Wiley and Sons.
Mineo, A.M. and M. Ruggieri (2005). A software tool for the exponential power distribution: The normalp package. Journal of Statistical Software 12(4), 1–24.
Nadarajah, S. (2005). A generalized normal distribution. Journal of Applied Statistics 32(7), 685–694.
Nica, A. (1969). Theory and Practice of Lubrication Systems. Scientific Publications.
O'Hagan, A. (1998). Eliciting expert beliefs in substantial practical applications. The Statistician 47, 21–35.
Portela, J. and M.A. Gómez-Villegas (2004). Implementation of a robust Bayesian method. Journal of Statistical Computation and Simulation 74(4), 235–248.
Savchuk, V.P. and H.F. Martz (1994). Bayes reliability estimation using multiple sources of prior information: Binomial sampling. IEEE Transactions on Reliability 43(1), 138–144.
Steele, C. (2008). The use of the lognormal distribution for the coefficients of friction and wear. Reliability Engineering and System Safety, to appear.
Subbotin, M. (1923). On the law of frequency of errors. Matematicheskii Sbornik 31, 296–301.
Walker, S.G. and E. Gutiérrez-Peña (1999). Robustifying Bayesian procedures. In J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith (Eds.), Bayesian Statistics 6, pp. 685–710. Oxford University Press.
Wallbridge, N.C. and D. Dowson (1987). Distribution of wear rate data and a statistical approach to sliding wear theory. Wear 119, 295–312.
Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Collection and analysis of reliability data over the whole product


lifetime of vehicles

T. Leopold
TTI GmbH an der Universität Stuttgart, Germany

B. Bertsche
Institute of Machine Components, Universität Stuttgart, Germany

ABSTRACT: One of the biggest challenges in quantitative reliability analysis is a complete and significant data basis that describes the complete product lifetime. Today, such a data basis, with regard to the demanded and needed information, is not available. Especially the circumstances that lead to a failure during field usage, e.g. operational and ambient information before and at failure occurrence, are of highest importance. In the development phase of products, much more detailed data is collected and documented than during general customer field usage. Today, one of the most important data bases describing the real field behavior is warranty and goodwill data. For an optimal correlation between failures that occur during testing and during field usage, these data are not sufficient. In order to improve this situation, an approach was developed with which the collection of reliability-relevant data during customer usage is enabled over the whole product lifetime of vehicles. The basis of this reliability-oriented data collection is the consideration of CAN signals already available in modern vehicles.

1 INTRODUCTION

In general, field data of systems (e.g., vehicles) and their components, which are collected or generated over the complete lifetime of the product, is a very important data basis for companies. This data is an essential precondition for analyzing the reliability and availability of the systems, with which reflections or forecasts of the real behavior in field use can be made.

An outstanding characteristic of quantitative reliability data is the mapping of failures and the ambient and operational circumstances that lead to a failure. The possibility to compare the operational and ambient conditions of intact and failed vehicles is another important characteristic.

The conclusion from these two characteristics is that such a data base has to be built up systematically in forward-looking companies and thus represents an extension of the presently common way of collecting failure data. Today's warranty and goodwill data do not offer such detailed failure information over the whole product lifetime of vehicles, not to mention the missing information about intact products during field usage. To generate dependable failure data from testing, a permanent feedback of field data is necessary for a continuous improvement of testing conditions and their assumptions, e.g., the correlation between test results and field behavior. Thus, an observation of the field behavior cannot be substituted by extensive testing. In fact, testing and observation of the field behavior of products complement one another.

2 DATA COLLECTION IN THE VEHICLE

The rising number of ECUs (Electronic Control Units) in modern vehicles requires intensive communication between the ECUs among themselves and with sensors and actuators. The result is an enormous number of messages which are communicated by the communication networks in the vehicles to ensure all functions. Some of these messages contain information about the ambient and operational conditions of the vehicle and its systems as well as information about occurred failures. Unfortunately, this important and very useful information has not been used consistently for collecting field data over the whole product lifetime and for subsequent reliability analyses until now.

In the following, the idea of using messages of the CAN-bus for collecting field data of vehicles is illustrated. Due to the fact that the CAN-bus corresponds to the actual state of the art for realizing communication
networks in vehicles, a short description of the fundamentals of the CAN technology is given first.

2.1 Configuration of a CAN-bus

The Controller Area Network (CAN) belongs to the Local Area Networks (LAN). The CAN-bus is a serial bus system whose network nodes can receive all messages that are present on the CAN-bus. Such a system is called a multi-master system. The principal configuration of a CAN-bus is illustrated in Figure 1.

Figure 1. Principal configuration of a CAN-bus (several CAN nodes connected to the CAN-bus line with line terminations).

Each node has the right to access the CAN-bus if required to broadcast a message to the other nodes. The broadcasting of the messages is not time triggered but event controlled. To manage the order of the messages, different priorities of importance are allocated to the messages. These broadcasted messages are available to all other nodes of the CAN-bus. The nodes decide whether they need the message or not.

The nodes illustrated in Figure 1 can be control units, gateways to other bus systems or an interface to diagnostics. The interface to diagnostics has the purpose of connecting an external diagnostics device, which in turn acts as a control unit. The line termination of the CAN-bus is a resistor to avoid reflections.

Besides the technology of the CAN-bus there are numerous other possibilities to realize the communication network in vehicles, e.g. FlexRay, LIN (Local Interconnect Network) or MOST (Media Oriented Systems Transport). These technologies have already found or will find their way into the communication networks of vehicles, especially in sub-zones of the vehicle network, compare Felbinger (2005).

Consequently, the collection of vehicle bus signals on the basis of CAN technology is just one possibility to collect such kind of data. Nevertheless, if the CAN technology is replaced sometime in the future, the basic principle of the data collection and the subsequent reliability considerations will still be applicable.

2.2 Development of the data collection device

First of all, the data elements describing the operational and ambient conditions that have an impact on the reliability of the product have to be selected out of the numerous data elements of the CAN-bus. For a procedure to perform the underlying assessment of the data as well as the possibilities to integrate the hardware of the collection device into the vehicle, see Leopold (2007). The three described possibilities of integration and their main characteristics are illustrated in Figure 2.

Figure 2. Characteristics of the possibilities of implementing the data collecting device (enhancement of an existing control unit, connection via diagnostics interface, additional control unit; rated by effort of implementation and by number and resolution of the data elements).

The low implementation effort of the enhanced control unit and the sufficient number and resolution of the data elements are the reasons for choosing this solution for further considerations.

2.3 Data processing in the vehicle

Besides the reduction of the huge amount of data by selecting the most important messages of the CAN-bus, a further minimization of the data volume is necessary in order to minimize the required storage space in the vehicle and to reduce the amount of data along the whole flow of information up to the company.

Therefore, possibilities for data classing of the CAN messages, taking place already in the vehicle (see Bertsche (2008), DIN 45667), have to be defined, in addition to the definition of the frequency of data recording.

A further reduction of the needed storage capacity in the vehicle appears to be possible by using compression algorithms from information technology. An adaptation to the requirements of the implementation into an ECU of a vehicle has to be performed. However, the advantage of a further reduction of the required storage capacity has to be weighed against the disadvantage of rising computing power. Therefore, the potential of using compression algorithms is not investigated in more detail.

In addition to the selection of the required messages of the CAN-bus and the classing of the data, the generated classed data has to be stored and transferred to the company.
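Data classing as referenced above (DIN 45667) amounts to counting signal samples into predefined classes per time scale. A minimal sketch, with hypothetical class bounds; the exact counting method used in the project is not specified here:

```python
from bisect import bisect_right

def class_signal(samples, upper_bounds):
    """Count signal samples per class, in the spirit of the counting
    methods of DIN 45667. upper_bounds = [9, 18, 27] defines the classes
    (-inf, 9), [9, 18), [18, 27) and [27, +inf)."""
    counts = [0] * (len(upper_bounds) + 1)
    for v in samples:
        counts[bisect_right(upper_bounds, v)] += 1
    return counts

# e.g. vehicle-speed samples [km/h] classed into hypothetical bounds
speed_counts = class_signal([5.0, 9.0, 20.0, 100.0], [9, 18, 27])
```

Storing only such class counts instead of the raw sampled signal is what makes the drastic data reduction described below possible.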
The structure of the software that executes the data processing in the vehicle, with its main tasks, is shown in Figure 3.

Figure 3. Software structure for data processing in the vehicle: monitor CAN-bus; translate identifier to identify necessary messages; decide on recording (right event, timeframe); classing (counter, time, mileage, …); data management; file management for data transfer.

Taking the potentials of data reduction into account, a data volume of only a few hundred kilobytes per month is possible. That means an enormous reduction of the data volume, compared to many hundreds of megabytes per day when recording the whole CAN communication. As mentioned, a further reduction can be achieved by using data compression algorithms.

2.3.1 Data management
The data collection has to regard different time scales. That offers the possibility to differentiate between average values over the whole product lifetime as well as single months, days or trips. The higher the time scale, the more general the conclusions that can be derived from the data. Lower time scales are especially interesting for analyzing the circumstances shortly before or during failure occurrence. For example, failures as a result of misuse can be distinguished from failures caused by fatigue. In addition to this more or less general differentiation of failure causes, the classed data of a few data elements can be used for more detailed analyses.

Therefore, the classing has to be executed for different time scales. To enable different files with different durations of classing, a data management system is needed that works simultaneously with the data classing. The resulting files contain classed data with different time scales.

Another task of the data management is to name the files according to a determined syntax. The file names include the serial number of the ECU, a shortcut for the time scale, the kind of data and the date. This consistent naming enables the determination of a prioritized data transfer to the company, which is done by the file management.

2.3.2 File management
Taking into account that the data transfer to the company is realized by radio networks, see Leopold (2007), and that a data transfer is therefore not always possible for broadcasting all existing files, a useful sequence of the data transfer has to be managed. The priority rises with increasing age of the files and with increasing time scale of the contained classed data. The reason for the rising priority with increasing age is that the history of the ambient and operational conditions is then available in the company without interruption. To get an overview of the field behavior, the files containing long time scales are of higher priority than the files containing low time scales. The information with low time scales is used for detailed analyses in a second step, after analyzing the data with long time scales.

The history of the broadcasted files and their status of successful transfer are stored in a log file, see Figure 4. The exemplary files of Figure 4 describe the classed data of ABS activity (anti-lock braking system) of the vehicle with the VIN (Vehicle Identification Number) WDB9340… . In detail, the entry 1 2 1 means 1 ABS activity in class 1, 2 ABS activities in class 2 and 1 ABS activity in class 3. This kind of saving the classed data in data fields requires only very little storage space. The log file includes examples of files containing trip (t1), day (d) and month (m) data. The status 110 of the file containing the classed data of one month describes the condition of completeness of the file, the execution of the transfer and whether the file has been deleted on the data classing device or not.

By using the serial number of the ECU within the file name and transferring the VIN as a part of the non-recurring data, an easy allocation of files to vehicles is possible.

2.4 Transfer of the data to the company

There are different technologies available which appear to be suitable for broadcasting the data from the vehicle to the company.

One possible realization is the usage of wireless connections within the data collecting devices, e.g., Bluetooth or WLAN (Wireless Local Area Network). In the first step, the transfer of the data from the vehicles to the outstations is realized with these short-range technologies.
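The prioritization of the file management described above (older files first and, for the same age, longer time scales first) can be sketched as a sort key over the naming scheme serial_timescale_date. The rank values for the time-scale shortcuts are an assumption for illustration:

```python
def transfer_priority(filename):
    """Sort key reflecting the prioritization of the file management:
    oldest files first and, for equal dates, longer time scales first.
    Assumes the naming scheme serial_timescale_date from the text,
    e.g. '719270_m_20071031'; the rank values are an assumption."""
    scale_rank = {"m": 0, "d": 1, "t1": 2}  # month before day before trip
    serial, scale, date = filename.split("_")
    return (date, scale_rank[scale])

files = ["719270_m_20071031", "719270_t1_20071017", "719270_d_20071025"]
ordered = sorted(files, key=transfer_priority)  # oldest file transferred first
```

Whenever a radio connection becomes available, the files would be broadcast in this order until the connection is lost.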
In the second step, the data of the outstations have to be transferred to the company, e.g., via radio networks like GSM (Global System for Mobile Communications), GPRS (General Packet Radio Service) or UMTS (Universal Mobile Telecommunications System).

Figure 4. Priority of file transfer within file management (log file with entries such as 719270_t1_20071017, 719270_d_20071025 and 719270_m_20071031, the classed ABS data per trip, day and month, and the non-recurring data: kind of vehicle, Vehicle Identification No. WDB9340…; the priority of file transfer rises with age).

Another possibility is the direct transfer of the data from the vehicle to the company via radio networks. This kind of data transfer is already used successfully to assist the fleet management of truckage companies.

Depending on the number of vehicles which are equipped with a data collection device, the amount of data and the available storage capacity in the vehicle, one of the two possible realizations for broadcasting the data has to be selected.

3 ECONOMIC CONSIDERATION

The total costs of the data collection system include costs for the development of the required hardware and software as well as additional unit costs and recurring costs for the data transfer from the vehicle to the company.

Figure 5. Cumulative costs of data collection (costs per year over the number of vehicles, for direct data transfer via GSM/GPRS and indirect data transfer with outstations).

The fixed costs for the development of the two concepts investigated in detail, the direct data transfer via radio networks and the indirect data transfer via outstations, are nearly the same. In both cases, an existing ECU has to be enhanced with sufficient storage capacity, interfaces and the software which manages the data processing in the vehicle. The unit costs for the broadcasting modules differ only slightly. The initial investment of the concept with the indirect data transfer is higher because of the costs for the outstations.

The main economic differences between the two concepts are the recurring costs for broadcasting the data. With a rising number of vehicles which directly transfer the data from the vehicle to the company, the broadcasting costs are also increasing. In contrast to the broadcasting costs of the direct data transfer, the costs for the indirect data transfer do not rise with a higher number of equipped vehicles. The reason is that cost-causing data transfers via radio network arise only between the outstations and the company. The number of outstations is nearly constant, because the single outstations just have to broadcast more data volume.

The principal trend of the total costs for both concepts can be seen in Figure 5. It is to be mentioned that the graphical visualization of the costs is a first estimation. The very important conclusion is that a field data collection with many vehicles causes fewer costs when the data transfer is realized with the indirect concept.
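The cost trend of Figure 5 can be reproduced with a toy model. All cost figures below are hypothetical; only the structure — per-vehicle radio costs for the direct concept versus per-outstation radio costs for the indirect concept — follows the text:

```python
def direct_costs(n_vehicles, development, unit, per_vehicle_radio):
    """Direct concept: every vehicle broadcasts via GSM/GPRS, so the
    recurring radio costs grow with the fleet size."""
    return development + n_vehicles * (unit + per_vehicle_radio)

def indirect_costs(n_vehicles, development, unit,
                   n_outstations, per_outstation_radio):
    """Indirect concept: radio costs arise only between the roughly
    constant number of outstations and the company; the higher initial
    investment for the outstations is contained in 'development'."""
    return development + n_vehicles * unit + n_outstations * per_outstation_radio

# hypothetical figures, chosen only to reproduce the trend of Figure 5
small_fleet = (direct_costs(10, 100_000, 50, 100),
               indirect_costs(10, 130_000, 50, 20, 500))
large_fleet = (direct_costs(5_000, 100_000, 50, 100),
               indirect_costs(5_000, 130_000, 50, 20, 500))
```

With these figures the direct concept is cheaper for a small fleet and the indirect concept for a large one, matching the crossover of the two curves in Figure 5.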
4 DATA PROTECTION

A comprehensive examination of the data collection based on the messages of the CAN-bus demands an assessment of legal aspects. Therefore, the question has to be answered whether issues of data protection law are affected or not.

The federal data protection act of Germany (BDSG, Bundesdatenschutzgesetz) rules the matter of personal data. According to § 3 BDSG, personal data are particulars of individual or factual circumstances of a defined or definable person.

The classed data of the field data collection does not contain the name of the driver; therefore, the driver is obviously not defined. In addition, the selection and the classing of the data in the vehicle reduce the original data content. After the data transfer from the vehicle to the company, no data about the driver are available beyond the classed data of the observed vehicle. The classed data cannot be referred to one single driver of a certain truckage company. Thus, the data of the vehicle CAN-bus are modified by the data selection and data classing in such a way that the driver is neither defined nor definable.

The result is that the federal data protection act of Germany does not restrict the data collection from the CAN-bus, because of the anonymized data and consequently the missing relation of the classed data to individuals.

5 EXAMPLES OF CLASSED DATA

The collection of operational and ambient conditions in connection with occurred failures of failed vehicles offers many possibilities to analyze the real field behavior. Comparing the classed load collectives with occurred failures, in combination with damage accumulation, offers the great possibility to verify and optimize testing conditions and assumptions. Other analyses are the comparison of the operational and ambient conditions of failed and non-failed products, e.g. to derive the most significant factors and combinations of factors which cause a product to fail.

It is very important to do all these analyses for products which are comparable. An example is the very different usage of commercial vehicles, e.g., for long-haul transport, distribution work or construction-site work. It is very difficult to differentiate the usage of these vehicles on the basis of common field data. The new approach on the basis of the CAN-bus enables an easy differentiation of the different kinds of usage. The first step for the differentiation is to define idealized curves of the vehicle speed for the different usages of the vehicles, see Figures 6, 7 and 8.

Figure 6. Idealized curve of vehicle speed of long-haul vehicles (fraction of total operating time [%] over vehicle speed classes [km/h]).

Figure 7. Idealized curve of vehicle speed of distribution vehicles.

Figure 8. Idealized curve of vehicle speed of construction-site vehicles.

To derive the kind of usage of a vehicle, the idealized curves of the vehicle speed have to be compared with the real curve of the classed vehicle speed. Figure 9 shows the real curve of a vehicle. The comparison of the vehicle speed of Figure 9 with Figures 6, 7 and 8 shows that the curve of Figure 9 belongs to a distribution vehicle, except for the high fraction of the standstill period and interurban operation. The typical character of a distribution vehicle is the high fraction of middle-sized vehicle speeds.

Another example of the analysis of the CAN-bus messages is a comparison between the differences of high and low time scales of the classed data. A very impressive comparison is the investigation of the braking activities of a commercial vehicle. The collection of the classed data was done for one month (high time scale) and for one day (low time scale).
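The comparison of a real classed speed histogram with the idealized curves of Figures 6–8 can be sketched as a nearest-curve search. The histograms below are hypothetical stand-ins for the classed fractions of operating time per speed class, and the sum of absolute differences is one possible distance measure (the paper does not prescribe one):

```python
def closest_usage(real_hist, idealized):
    """Return the usage whose idealized speed histogram is closest to the
    real classed histogram, using the sum of absolute differences over
    the speed classes as the distance."""
    def dist(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))
    return min(idealized, key=lambda usage: dist(real_hist, idealized[usage]))

idealized = {  # hypothetical fractions of operating time per speed class
    "long-haul":         [5, 5, 5, 5, 10, 70],
    "distribution":      [15, 10, 20, 25, 20, 10],
    "construction-site": [25, 20, 25, 15, 10, 5],
}
usage = closest_usage([17, 12, 18, 24, 19, 10], idealized)
```

A real classed curve dominated by middle-sized speeds is thereby assigned to the distribution-vehicle profile, mirroring the comparison of Figure 9 with Figures 6–8.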
30
The shown approach of using the existing informa-
Fraction of total operating time [%]

25
tion, that are already available in modern vehicles, is
an appropriate solution to a collection of field data for
20
reliability analyses. Data about failed and non-failed
15
parts are available as well as information about opera-
tional and ambient conditions over the whole lifetime
of vehicles.

Figure 9. Real curve of vehicle speed of a distribution vehicle.

Figure 10. Difference of high and low time scales for brake pedal position.

one day (low time scale). The brake pedal position describes the braking activity. The differences of the results of the classed data are shown in Figure 10.
The differences are clearly recognizable, partly up to a factor of 16, which can be seen as evidence for considering unequal time scales. The differences can become considerably higher if the differences in the loading of the vehicles increase between different days.

6 SUMMARY

Field data of products over the whole product lifetime is an essential precondition for quantitative reliability analyses. It enables analyses and forecasts of the real behavior of the products in field usage.
The most important steps of the data processing in the vehicle lead to a significant reduction of the huge amount of data and enable a reasonable data transfer between the vehicle and the company. Different time scales have to be considered, as is realized by the data management and shown by an example. The order in which the data files are transferred to the company is controlled by the file management according to a defined logic of prioritization.
A short economic consideration shows the principal trend of the total costs for two concepts. The concept of indirect data transfer via outstations is more economic than direct data transfer via radio networks.
Finally, the legal aspects of the German federal data protection act do not appear to be an obstacle, although only a short summary of the investigation is shown.
Therefore, a collection and analysis of reliability data over the whole product lifetime of vehicles is possible and has to be realized by forward-looking companies.

REFERENCES

Bertsche, B.: Reliability in Automotive and Mechanical Engineering. Berlin, Springer, 2008.
DIN 45667: Data Classing (in German). 1969.
Felbinger, L.; Schaal, H.: Trucks under Control (in German). Elektronik Automotive 05/2005.
ISO/DIS 14220-1: Road Vehicles – Unified Diagnostic Services (UDS) – Part 1: Specification and Requirements. Draft standard, 2005.
Leopold, T.; Pickard, K.; Bertsche, B.: Development of a Data Collection of Vehicle Bus Signals for Reliability Analyses. Proc. ESREL 2007, 25th–27th June 2007, Stavanger, Norway.
SAE J1939/71: Vehicle Application Layer. 2006.
VDA (ed.): Reliability Assurance in Automobile Industry and Suppliers (in German). Vol. 3 Part 2, VDA, Frankfurt, 2000.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Comparison of phase-type distributions with mixed and additive Weibull


models

M.C. Segovia
Facultad de Ciencias, University of Granada, Spain

C. Guedes Soares
CENTEC, Instituto Superior Técnico, Technical University of Lisbon, Portugal

ABSTRACT: The parametric estimation is performed for two models based on the Weibull distribution: the mixture of Weibull distributions and the additive Weibull model. The estimation is carried out by a method which uses the Kolmogorov-Smirnov distance as the objective. Phase-type distributions are introduced and a comparison is made between the fitted Weibull models and the phase-type fit to various sets of life data.

1 INTRODUCTION

Obtaining probabilistic models plays an important role in the study of lifetime data of any component or system. Given a failure time data set, a usual problem is to select the lifetime distribution and estimate its parameters.
Because of its interesting properties, the Weibull distribution is widely used for modelling lifetimes. Frequently, several failure mechanisms are present and as a result the lifetime distribution is composed of more than one model. In these cases the mixture and composite Weibull distributions are alternatives to the simple model.
The mixture of two Weibull distributions provides a rather flexible model to be fitted to data and is able to depict non-monotonous hazard rates.
Supposing the system structure is unknown and there is no further information about the given system to select the best reliability model, the application of the Weibull mixture distribution as a reliability model is always possible. Its application is also feasible in the case of complex structure systems where the reliability function is difficult to derive exactly.
On the other hand, finding distributions with a bathtub-shaped failure rate is a frequent problem in practical situations. The usual distributions do not have this type of failure rate. Thus, when a dataset has an empirical bathtub-shaped failure rate, it is usual to construct new distributions from the usual ones in order to fit the new distribution to the set of data. To construct a suitable model for the dataset, one of the operations commonly used for this fitting is the mixture of distributions (Weibull, truncated normal, and others); see Navarro & Hernández (2004).
In this paper two models are considered to represent complex life data resulting from more than one failure mechanism: the mixture of Weibull distributions, extending the paper of Ling and Pan (1998); and the additive model, whose failure rate function corresponds to the sum of several failure rate functions of Weibull form. Xie and Lai (1995) studied the additive model with two components, applied to data that have a bathtub-shaped failure rate.
A method whose purpose is to minimize the Kolmogorov-Smirnov distance is used to estimate the parameters of the previous models. This method can also be seen in Ling and Pan (1998).
These models lead to expressions that are difficult to treat. An alternative to these complex expressions is to consider phase-type distributions. This kind of distribution presents the following advantages: they are weakly dense, hence any distribution on [0, ∞) can be approximated by a phase-type distribution; they can be fitted to any dataset or parametric distribution; the obtained expressions admit algorithmic treatment; and their Laplace transform is rational.
The phase-type distributions were introduced by Neuts (1981). The versatility of phase-type distributions is shown in Pérez-Ocón and Segovia (2007), where a particular class formed by mixtures of phase-type distributions is presented in order to calculate special distributions with properties of interest in reliability.

Phase-type distributions can be an alternative way of describing failure rates. The purpose of this paper is to compare the fit of complex data by phase-type distributions with the more traditional approximations of mixtures of Weibulls or composite Weibulls.
To have control over the properties of the data used in the analysis, simulated data were adopted. Various data sets were generated with different levels of mixture of different basic Weibull distributions. The simulated data are fitted with the different methods, allowing conclusions about their relative usefulness.
The fit of a phase-type distribution to the set of failure time data is carried out with the EMpht software, based on the EM algorithm. This is an iterative method to find the maximum likelihood estimate; see Asmussen (1996).
In order to compare the proposed methods, the Kolmogorov-Smirnov test is used.
The paper is organized as follows: in section 2 the mixture of Weibull distributions and the additive model are introduced. In section 3 phase-type distributions are defined, and in section 4 the minimax algorithm for the parametric estimation is described. In section 5 several numerical applications are shown for the proposed models.

2 THE MIXED AND THE ADDITIVE WEIBULL MODEL

Because the Weibull distribution is used to study the ageing and the operational and burn-in time of a device, two models based on this distribution are studied. These models can represent the failure time data in a better way.
The two-parameter Weibull distribution function is given by

F(t) = 1 − exp[−(t/λ)^β],  t ≥ 0,  (1)

where λ is the scale parameter and β is the shape parameter. The failure rate function has the form

r(t) = (β/λ)(t/λ)^(β−1)  (2)

2.1 The mixture of Weibull distributions

In general, the mixture of Weibull distributions (1) can be defined as a weighted sum of these distributions:

F(t) = Σ_{i=1}^{n} pi [1 − exp(−(t/λi)^βi)],  (3)

where βi and λi are, respectively, the shape and scale parameters of the component distributions; pi represents the weight of every component in the mixture, and Σ_{i=1}^{n} pi = 1.
The particular cases where n = 2 and n = 3 are considered, because with a greater number of components the estimation of the parameters is much more difficult with the minimax algorithm.

2.2 The additive model

The additive model can be interpreted as the lifetime of a system which consists of several independent Weibull components arranged in series. If T is the lifetime of the system, then

T = Min(T1, T2, . . . , Tn)  (4)

where Ti denotes the lifetime of component i, which is Weibull with parameters λi, βi.
Xie and Lai (1995) presented the additive model that combines two Weibull distributions (1), one of them with decreasing failure rate and the other with increasing failure rate. The combined effect represents a bathtub-shaped failure rate.
The distribution function of this model is given by

F(t) = 1 − exp[−(t/λ1)^β1 − (t/λ2)^β2],  (5)

where βi and λi are the shape and scale parameters, respectively, of the Weibull distributions.
The additive model that combines three Weibull distributions can be represented by adding a third term to the exponential function in eqn (5).

3 PHASE-TYPE DISTRIBUTIONS

One of the main purposes of this paper is to compare the fit obtained for the mixed and additive Weibull models by the minimax algorithm with the phase-type distribution fit.
The phase-type distributions considered in this paper are defined in the continuous case.
The continuous distribution F(·) on [0, ∞) is a phase-type distribution (PH-distribution) with representation (α, T) if it is the distribution of the time until absorption in a Markov process on the states {1, . . . , m, m + 1} with generator

Q = [ T   T0 ]
    [ 0   0  ] ,  (6)

and initial probability vector (α, αm+1), where α is a row m-vector. The states {1, . . . , m} are all transient.
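The three distribution functions — eqns (1), (3) and (5) — are straightforward to evaluate numerically. The following Python sketch is purely illustrative (it is not code from the paper); the parameter tuples follow the notation above:

```python
import math

def weibull_cdf(t, beta, lam):
    """Two-parameter Weibull CDF, eqn (1)."""
    return 1.0 - math.exp(-((t / lam) ** beta)) if t > 0 else 0.0

def mixture_cdf(t, components):
    """Weibull mixture CDF, eqn (3); components = [(p_i, beta_i, lam_i)], sum p_i = 1."""
    return sum(p * weibull_cdf(t, b, l) for p, b, l in components)

def additive_cdf(t, components):
    """Additive Weibull CDF, eqn (5): a series system of independent Weibull
    components, components = [(beta_i, lam_i)]."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-sum((t / l) ** b for b, l in components))
```

Since the additive model is the distribution of the minimum in eqn (4), its CDF always dominates the CDF of each single component.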

The matrix T of order m is non-singular with negative diagonal entries and non-negative off-diagonal entries, and satisfies −Te = T0 ≥ 0. The distribution F(·) is given by

F(x) = 1 − α exp(Tx)e,  x ≥ 0  (7)

It will be denoted that F(·) follows a PH(α, T) distribution.

4 ESTIMATION METHOD

The selected method to estimate the parameters of these models is the minimax algorithm, based on the Kolmogorov-Smirnov distance.
The Kolmogorov-Smirnov test allows checking whether the distribution selected for the data set is the most appropriate, no matter what kind of parameter estimation is used. The Kolmogorov-Smirnov test compares the empirical distribution of the sample with the proposed distribution, determining the maximum absolute difference between the observed probability of failure and the expected probability of failure under the distribution; if the difference is less than a determined critical value, the proposed distribution is accepted as a reasonable distribution for the sample.
The minimax algorithm obtains the parameter estimates of the selected distribution by minimizing the maximum absolute difference between the observed probability of failure and the expected probability of failure under the selected distribution.
The steps to apply this procedure are:

1. Arrange the data in ascending order.
2. Calculate the observed probability of failure. The median rank is used to obtain this probability:

   F0(Ti) = (i − 0.3)/(n + 0.4),  i = 1, . . . , n,  (8)

   where Ti is the i-th failure time and n is the total number of samples.
3. Select a suitable probability distribution. Distributions that belong to the mixed Weibull model and the additive Weibull model are the selected distributions in this paper.
4. Determine the initial values of the parameters for the selected distribution; if the initial values are not appropriate, the method might cause problems in the algorithm convergence.
5. Obtain a set of optimized parameters for the selected distribution by minimizing the maximum absolute difference between the observed probability of failure and the expected probability of failure:

   Min Max_{i=1,2,...,n} |Fe(Ti) − F0(Ti)|,  (9)

   where Fe(Ti) is the expected probability of failure, which can be calculated from the selected distribution. To obtain the optimized parameters for the distribution by solving the previous equation, the function fminimax in Matlab is used.
6. Compare the maximum absolute difference between the observed probability of failure and the expected probability of failure with the allowable critical value.

5 NUMERICAL APPLICATIONS

In this section the parameters of the different models are estimated with the minimax algorithm; in order to perform this estimation, 100 values are obtained by the Monte Carlo method. These values correspond to the failure time data.
Furthermore, the versatility of PH-distributions is displayed, showing their fit to the empirical distribution of the sample.

5.1 Mixture of two Weibull distributions

For a mixture of two Weibull distributions with parameters β1 = 0.5, λ1 = 100, p1 = 0.25, β2 = 2, λ2 = 1000 and p2 = 0.75 (2), 100 values are generated by the Monte Carlo method.
If the adequate distribution for this sample is a mixture of two Weibull distributions, using the minimax algorithm, the estimation of the parameters is indicated in Table 1.
Figure 1 shows the fit of this distribution to the empirical distribution.
Figure 2 represents the failure rate function that corresponds to the original distribution and the failure rate function of the estimated distribution.
Table 2 shows the estimation of the parameters, with the minimax algorithm, for different examples of mixtures of two Weibull distributions, and the K-S test.

PH fit
Now the phase-type distribution, indicated in eqn (7), that provides the best fit to the data set is obtained.

Table 1. Estimated parameters.

β1 = 0.3072, λ1 = 113.1296, p1 = 0.211
β2 = 2.0816, λ2 = 903.0699, p2 = 0.789
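Steps 1–6 can be sketched in Python. As an illustration only, Matlab's fminimax is replaced by a deliberately coarse grid search over hypothetical candidate values, and a single two-parameter Weibull is fitted instead of a mixture, to keep the example short:

```python
import math, random

def median_ranks(n):
    """Observed failure probabilities, eqn (8): F0(Ti) = (i - 0.3)/(n + 0.4)."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]

def ks_distance(times, beta, lam):
    """Objective of eqn (9): max_i |Fe(Ti) - F0(Ti)| for a Weibull(beta, lam)."""
    f0 = median_ranks(len(times))
    return max(abs((1.0 - math.exp(-((t / lam) ** beta))) - p)
               for t, p in zip(sorted(times), f0))

# Simulate 100 failure times by inverse transform (Monte Carlo, as in section 5).
rng = random.Random(1)
data = [1000.0 * (-math.log(1.0 - rng.random())) ** (1.0 / 2.0) for _ in range(100)]

# Steps 4-5: a coarse grid search stands in for Matlab's fminimax here.
best_d, best_beta, best_lam = min(
    (ks_distance(data, b, l), b, l)
    for b in (1.0, 1.5, 2.0, 2.5, 3.0)
    for l in (600.0, 800.0, 1000.0, 1200.0, 1400.0))

# Step 6: accept the distribution if best_d is below the critical value
# (0.136 for n = 100 at the 95% level, as used in the paper's tables).
```

A real implementation would of course use a proper minimax optimizer; the grid search merely makes the objective of eqn (9) concrete.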

As is known, this kind of distribution can be fitted to any dataset.
To find the PH-representation of this distribution the EMpht software is used. This representation is given by

γ = (0  0  0  1),

L = [ −0.003184   0           0           0        ]
    [  0         −0.005175    0.003462    0.000947 ]
    [  0.003184   0          −0.003184    0        ]
    [  0          0.134512    0          −0.144241 ]

The fit provided by the PH-distribution can be seen in Figures 3 and 4.

Figure 1. Fit of a Weibull mixture to the sample.

Figure 2. Failure rate of the Weibull mixture.

Figure 3. PH fit to the sample.
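Eqn (7) can be evaluated directly once a representation is known. The sketch below uses a small hand-rolled matrix exponential (scaling-and-squaring with a truncated Taylor series) so that it needs no external libraries, and plugs in the fitted 4-phase representation (γ, L) quoted above:

```python
import math

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(A, terms=24):
    """Matrix exponential via scaling-and-squaring with a truncated Taylor
    series (adequate for the small subgenerators used here)."""
    n = len(A)
    norm = max(sum(abs(x) for x in row) for row in A)
    s = max(0, int(math.ceil(math.log2(norm))) + 1) if norm > 0 else 0
    B = [[A[i][j] / 2 ** s for j in range(n)] for i in range(n)]
    E = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    P = [row[:] for row in E]
    for k in range(1, terms):                  # accumulate sum_{k>=0} B^k / k!
        P = [[v / k for v in row] for row in mat_mul(P, B)]
        E = [[E[i][j] + P[i][j] for j in range(n)] for i in range(n)]
    for _ in range(s):                         # undo the scaling: E <- E^(2^s)
        E = mat_mul(E, E)
    return E

def ph_cdf(alpha, T, x):
    """Eqn (7): F(x) = 1 - alpha * exp(Tx) * e, with e a column of ones."""
    n = len(T)
    Ex = expm([[T[i][j] * x for j in range(n)] for i in range(n)])
    return 1.0 - sum(alpha[i] * sum(Ex[i]) for i in range(n))

# Fitted 4-phase representation quoted above for the first mixture example.
gamma = [0.0, 0.0, 0.0, 1.0]
L = [[-0.003184, 0.0,       0.0,       0.0],
     [ 0.0,     -0.005175,  0.003462,  0.000947],
     [ 0.003184, 0.0,      -0.003184,  0.0],
     [ 0.0,      0.134512,  0.0,      -0.144241]]
```

In practice one would use a library routine such as SciPy's `expm`; the point here is only that eqn (7) is directly computable.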

Table 2. Parametric estimation of the Weibull mixture.

Weibull mixture              Estimation                      K-S test

β1 = 0.5    β2 = 2           β1 = 0.475    β2 = 2.433        D0.95 = 0.136*
λ1 = 100    λ2 = 1000        λ1 = 79.25    λ2 = 937.05       0.136 > 0.0419+
p1 = 0.5                     p1 = 0.4388

β1 = 0.5    β2 = 2           β1 = 0.542    β2 = 1.416        D0.95 = 0.136*
λ1 = 100    λ2 = 1000        λ1 = 110.6    λ2 = 894.9        0.136 > 0.0441+
p1 = 0.75                    p1 = 0.7036

β1 = 0.5    β2 = 4           β1 = 0.822    β2 = 4.197        D0.95 = 0.136*
λ1 = 100    λ2 = 1000        λ1 = 97.65    λ2 = 1021         0.136 > 0.0393+
p1 = 0.5                     p1 = 0.4488

β1 = 0.5    β2 = 1           β1 = 0.62     β2 = 1.039        D0.95 = 0.136*
λ1 = 100    λ2 = 1000        λ1 = 225.8    λ2 = 894.9        0.136 > 0.0497+
p1 = 0.5                     p1 = 0.5452

β1 = 0.5    β2 = 2           β1 = 0.231    β2 = 1.264        D0.95 = 0.136*
λ1 = 500    λ2 = 1000        λ1 = 346.5    λ2 = 854.67       0.136 > 0.0535+
p1 = 0.5                     p1 = 0.1998

β1 = 0.5    β2 = 2           β1 = 0.492    β2 = 2.642        D0.95 = 0.136*
λ1 = 1000   λ2 = 1000        λ1 = 968.04   λ2 = 946.64       0.136 > 0.0465+
p1 = 0.5                     p1 = 0.655

* Critical value, + K-S experimental.

Figure 4. Failure rate (sample & PH).

Table 3 shows the PH-fit for the examples given in Table 2 and the K-S test.

5.2 Mixture of three Weibull distributions

One hundred values of a mixture of three Weibull distributions with parameters β1 = 0.5, λ1 = 100, p1 = 0.3, β2 = 1, λ2 = 500, p2 = 0.3, β3 = 2, λ3 = 1000, p3 = 0.4 are simulated.
The estimation of these parameters is carried out with the minimax algorithm. Table 4 shows the estimated values of the parameters.
Figure 5 displays the fit to the empirical distribution, which is good.
Figure 6 represents the failure rate function corresponding to the previous distributions.
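Section 5 states only that the samples are obtained by "the Monte Carlo method". A standard way to do this for a Weibull mixture — an assumption here, since the paper does not give details — is to pick a component with probability pi and then apply the inverse-transform formula T = λ(−ln U)^(1/β):

```python
import math, random

def sample_weibull_mixture(components, n, seed=0):
    """Draw n values from a Weibull mixture.

    components = [(p_i, beta_i, lam_i)] with weights summing to 1;
    a component is picked with probability p_i, then
    T = lam_i * (-ln U)**(1/beta_i) by inverse transform.
    """
    rng = random.Random(seed)
    weights = [c[0] for c in components]
    sample = []
    for _ in range(n):
        _, beta, lam = rng.choices(components, weights=weights)[0]
        sample.append(lam * (-math.log(1.0 - rng.random())) ** (1.0 / beta))
    return sample

# The three-component mixture simulated in section 5.2.
mix3 = [(0.3, 0.5, 100.0), (0.3, 1.0, 500.0), (0.4, 2.0, 1000.0)]
sample = sample_weibull_mixture(mix3, 5000, seed=42)
```

With the mixture fixed, the sample can then be fed to the minimax estimation or to the EMpht fitting described above.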

Table 3. Phase-type fit to the sample.

PH-representation                                          K-S test

α = (0  0.924658  0.075342  0)                             D0.95 = 0.136*
T = [ −0.003365   0           0           0.003365 ]
    [  0.036542  −0.057758    0.005832    0        ]       0.136 > 0.0745+
    [  0          0          −0.004333    0        ]
    [  0          0           0.003365   −0.003365 ]

α = (0.036284  0.963716)                                   D0.95 = 0.136*
T = [ −0.005016   0.004900 ]
    [  0.029753  −0.048887 ]                               0.136 > 0.0461+

α = (0  1  0  0)                                           D0.95 = 0.136*
T = [ −0.003733   0           0           0.003733 ]
    [  0.008083  −0.013203    0           0        ]       0.136 > 0.1039+
    [  0          0.000378   −0.003733    0        ]
    [  0          0           0.003733   −0.003733 ]

α = (0.121536  0.160842  0.717622)                         D0.95 = 0.136*
T = [ −0.085995   0.002930    0.082800 ]
    [  0.001061  −0.002107    0.000659 ]                   0.136 > 0.0525+
    [  0.515296   0.019454   −0.571630 ]

α = (0.025961  0.974039)                                   D0.95 = 0.136*
T = [ −0.025961   0.000187 ]
    [  0.000067  −0.001484 ]                               0.136 > 0.0830+

α = (0  0.208189  0  0.791811)                             D0.95 = 0.136*
T = [ −0.000210   0           0          −0.000210 ]
    [  0         −0.027253    0.000325    0        ]       0.136 > 0.0630+
    [  0.000331   0.000405   −0.002275    0        ]
    [  0          0.000002    0.002268   −0.002270 ]

* Critical value, + K-S experimental.
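The representations in Table 3 can be checked mechanically against the structural conditions stated in section 3 (negative diagonal, non-negative off-diagonal entries, exit vector T0 = −Te ≥ 0). The helper below is illustrative only; the numbers are the second representation from Table 3:

```python
def ph_exit_vector(T):
    """Exit (absorption) rates T0 = -T e implied by subgenerator T."""
    return [-sum(row) for row in T]

def is_valid_ph(alpha, T, tol=1e-12):
    """Check the structural conditions of section 3 for a representation (alpha, T)."""
    if any(a < -tol for a in alpha) or sum(alpha) > 1.0 + tol:
        return False
    for i, row in enumerate(T):
        if row[i] >= 0.0:                                    # diagonal must be negative
            return False
        if any(row[j] < -tol for j in range(len(row)) if j != i):
            return False                                     # off-diagonals non-negative
    return all(x >= -tol for x in ph_exit_vector(T))

# Second representation from Table 3.
alpha2 = [0.036284, 0.963716]
T2 = [[-0.005016,  0.004900],
      [ 0.029753, -0.048887]]
```

Such a check is a useful sanity test on any fitted output before it is used in further computations.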

Table 4. Estimated parameters.

β1 = 0.2682, λ1 = 10.5641, p1 = 0.2006
β2 = 1.8272, λ2 = 150.2917, p2 = 0.2444
β3 = 2.533, λ3 = 919.6259, p3 = 0.555

Figure 5. Fit of a Weibull mixture to the sample.

Figure 6. Failure rate of the Weibull mixture.

Figure 7. Distribution (sample & PH).

Figure 8. Failure rate (sample & PH).

The experimental value of the K-S test is 0.0418 < D0.95 = 0.136, so the obtained estimation is acceptable.

PH fit
The representation of the PH-distribution that provides the best fit to the data set is

α = (0  0.249664  0.001439  0.406906  0.341991),

T = [ −0.024926   0.005432    0.003831    0.009449    0.006022 ]
    [  0.996349  −9.29289     0.728256    1.778256    3.402062 ]
    [  0.007600   0.003714   −0.014657    0.001057    0.001943 ]
    [  0.002870   0.003963    0.007491   −0.109279    0.072219 ]
    [  0.011298   0.000688    0.001906    0.000456   −0.014351 ]

The phase-type fit to the original distribution can be seen in Figures 7 and 8.
The K-S test gives 0.1233 as the experimental value; because D0.95 = 0.136 > 0.1233, the fit is acceptable.

5.3 Additive Weibull model with two components

Once 100 values of an additive Weibull model with parameters β1 = 0.5, λ1 = 100, β2 = 2, λ2 = 1000 (3) have been generated, the minimax algorithm is applied to estimate these parameters. The results are shown in Table 5.
Figure 9 represents the fit of the estimated distribution to the empirical distribution.

Table 5. Estimated parameters.

β1 = 0.6486, λ1 = 103.8854
β2 = 5.6458, λ2 = 849.1117

Figure 9. Fit of an Additive Weibull to the sample.

Figure 10. Failure rate of the Weibull Additive model.

The fit seems acceptable, and the experimental value of the K-S test is lower than the critical value: D0.95 = 0.136 > 0.0554.

PH fit
The EMpht software provides the following PH-representation for the PH-distribution:

γ = (0.154338  0.027336  0.818326)

L = [ −0.040346   0.003227    0.028734 ]
    [  0.002439  −0.004663    0.002056 ]
    [  0.488996   0.037781   −0.621015 ]

Figure 11. PH fit to the sample.

Figure 12. Failure rate (Weibull & PH).

The fit is quite good, as can be seen in Figure 11. Furthermore, D0.95 = 0.136 > 0.052.

5.4 Additive Weibull model with three components

For the additive model with three components, where β1 = 0.5, λ1 = 100, β2 = 1, λ2 = 500, β3 = 2 and λ3 = 1000, one hundred data are simulated.
The parameters estimated are shown in Table 6.

Table 6. Estimated parameters.

β1 = 0.5555, λ1 = 100.1662
β2 = 2.042, λ2 = 628.2805
β3 = 2.0373, λ3 = 628.2824

Figures 13 and 14 show the fit provided by the previous estimation to the data set.

Figure 13. Fit of an Additive Weibull to the sample.

Figure 14. Failure rate of the Weibull Additive model.

Figure 15. PH fit to the sample.

Figure 16. Failure rate (sample & PH).

As in the previous examples, the experimental value for the K-S test is lower than the critical value, D0.95 = 0.136 > 0.0507, so it can be concluded that the estimated distribution is the distribution of the sample.

PH fit
Finally, the PH-distribution that best fits the sample has this PH-representation:

α = (0.071507  0.928493),

T = [ −0.019868   0.018978 ]
    [  0.067853  −0.110555 ]

Figure 15 represents the empirical distribution and the phase-type distribution, and Figure 16 shows the failure rate.
This is an acceptable fit, as the K-S test shows: D0.95 = 0.136 > 0.0459.

6 DISCUSSION OF RESULTS

As can be seen in the previous figures, the fits provided by the minimax algorithm and by the PH-distributions to the empirical distributions are quite similar.
In Figures 6 and 10 significant differences can be observed between the estimated failure rate and the original one. Minor variations in the estimation of the shape parameter can cause a very different shape of the failure rate function. However, the failure rate of the PH-distributions is more or less similar to the failure rate of the original distribution.
It seems that there is no relation between the number of phases of the PH-distribution that provided a better fit and the number of components of the distribution that is estimated.

7 CONCLUSIONS

The PH-distributions are introduced as a useful tool to fit any set of failure data, as shown in the examples given above. The advantage of this kind of distribution is that it is not necessary to select a parametric distribution to fit to the data set.
On the other hand, to estimate the parameters with the minimax algorithm, appropriate initial values of these parameters have to be determined, and this can take a long time, increasing as the number of parameters of the distribution increases.
Finally, in all of the examples presented in the paper it is verified, with the K-S test, that the fits obtained with both methods are all adequate, and the goodness of fit in the previous examples is more or less similar.

REFERENCES

[1] Asmussen, S., Nerman, O. & Olsson, M. 1996. Fitting phase-type distributions via the EM algorithm. Scand. J. Statist. 23: 419–441.
[2] Bucar, T., Nagode, M. & Fajdiga, M. 2004. Reliability approximation using finite Weibull mixture distributions. Reliab. Engng. Syst. Safety 84: 241–251.
[3] Jiang, R. & Murthy, D.N.P. 1997. Parametric study of multiplicative model involving two Weibull distributions. Reliab. Engng. Syst. Safety 55: 217–226.
[4] Jiang, R. & Murthy, D.N.P. 1998. Mixture of Weibull distributions—parametric characterization of failure rate function. Appl. Stochastic Models Data Anal. 14: 47–65.
[5] Jiang, R. & Murthy, D.N.P. 2001. Models involving two inverse Weibull distributions. Reliab. Engng. Syst. Safety 73: 73–81.
[6] Jiang, R. & Murthy, D.N.P. 2001. n-fold Weibull multiplicative model. Reliab. Engng. Syst. Safety 74: 211–219.
[7] Ling, J. & Pan, J. 1998. A new method for selection of population distribution and parameter estimation. Reliab. Engng. Syst. Safety 60: 247–255.
[8] Navarro, J. & Hernández, P.J. 2004. How to obtain bathtub-shaped failure models from normal mixtures. Probability in the Engineering and Informational Sciences 18: 511–531.
[9] Montoro-Cazorla, D., Pérez-Ocón, R. & Segovia, M.C. 2007. Shock and wear models under policy N using phase-type distributions. Applied Mathematical Modelling. Article in press.
[10] Murthy, D.N.P. & Jiang, R. 1997. Parametric study of sectional models involving two Weibull distributions. Reliab. Engng. Syst. Safety 56: 151–159.
[11] Neuts, M.F. 1981. Matrix Geometric Solutions in Stochastic Models. An Algorithmic Approach. Univ. Press, Baltimore.
[12] Pérez-Ocón, R. & Segovia, M.C. 2007. Modeling lifetimes using phase-type distributions. In Terje Aven & Jan Erik Vinnem (eds), Risk, Reliability and Societal Safety. Taylor & Francis, Stavanger, Norway 1: 463–469.
[13] Sun, Y.S., Xie, M., Goh, T.N. & Ong, H.L. 1993. Development and applications of a three-parameter Weibull distribution with load-dependent location and scale parameters. Reliab. Engng. Syst. Safety 40: 133–137.
[14] Xie, M. & Lai, C.D. 1995. Reliability analysis using an additive Weibull model with bathtub-shaped failure rate function. Reliab. Engng. Syst. Safety 52: 87–93.
[15] Xie, M., Tang, Y. & Goh, T.N. 2002. A modified Weibull extension with bathtub-shaped failure rate function. Reliab. Engng. Syst. Safety 76: 279–285.


Evaluation methodology of industry equipment functional reliability

J. Kamenický
Technical University of Liberec, Liberec, Czech Republic

ABSTRACT: Electric power is an essential condition of every modern economy. But it is not enough only to maintain existing power plants; it is also necessary to develop new ones. In this development, companies need to make machines with the highest possible reliability (the lowest failure rate and the lowest repair time, i.e. the lowest unavailability). It is very complicated to estimate the availability of a newly developed machine which has not yet operated. We can use an estimation of reliability parameters of older machines with a similar design. It is also possible to locate weak parts of these older machines and make some improvements.
To find such parts of a machine we have to analyze it, which is usually done by statistical methods. So the problem is where to get the input data for the analysis. The presented methodology tells us how to collect relevant data, how to differentiate and remove unwanted data and, of course, how to process the meaningful rest of the data. It shows the construction of a failure frequency histogram and the adoption of an exponential distribution for the mean time between failures of equipment, including a chi-square test of this hypothesis. In addition, it shows how to perform an ABC analysis of failure consequences. A very important part of this paper is Appendix 1, where an example of the methodology application is shown.

1 SUMMARY OF USED ACRONYMS

A      expected number of failures
CR     Czech Republic
MTBF   mean time between failures
MTBFL  lower confidence interval limit of the real value of mean time between failures
MTBFU  upper confidence interval limit of the real value of mean time between failures
MTTR   mean time to repair
MTTRL  lower confidence interval limit of the real value of mean time to repair
MTTRU  upper confidence interval limit of the real value of mean time to repair
U      asymptotical unavailability
λ      failure rate
λL     lower confidence interval limit of the real value of failure rate
λU     upper confidence interval limit of the real value of failure rate
T      total cumulated processing time
t      length of time interval of test
TTR    time to repair
r      total number of failures
ri     empirical failure frequency in the reflected interval
Tp     total cumulated failure time
χ²     chi-square distribution

2 MOTIVATION

Electric power is an essential condition of every modern economy's growth. But it is not enough only to maintain existing power plants; it is also necessary to develop and build new ones. In this work it is useful to learn from previous mistakes and weaknesses of operated machines and equipment. Databases are available in most cases, but the data are not processed because there is no immediate wealth effect; yet this data processing is economically profitable from a long-term point of view. The evaluation methodology of industry equipment functional reliability grew up as a supportive document for reliability parameter calculation and for finding weak places. All examples used in Appendix 1 are taken from an already finished analysis of large industrial pumps.

establishment of machine reliability. However, the most important part of data collection is still the number of failures, the failure causes and the location of their consequences. The power industry in the Czech Republic is split into several branches. The nuclear area stands as a separate section because of massive media pressure; this makes the nuclear area one step ahead of the rest of the power industry. Nuclear power plants have their own system for registration of failures and repairs of equipment, and the presented methodology takes this into account as well. The second researched area is named ''classic power plants'', meaning black and brown coal power plants. There is failure monitoring software for classic power stations; its (dis)advantages are mentioned in the example in Appendix 1. However, the recorded data are not processed on an adequate level in either the nuclear or the classic area.
A better failure rate evaluation needs more data, so the best evaluation uses all the data about one type of machine. That data is then sorted according to the surroundings in which the machine operates, what quality and maintenance policy is applied, etc. The analyst cannot do this alone; he needs the help of the machine operator. Getting informed is the key question of any analysis, not only reliability ones. The best case occurs when some kind of central maintenance database exists from which the relevant information can simply be retrieved. If there is no such database, data may have been collected in paper forms, and the analyst has to rewrite the information into electronic form. The worst, but still solvable, case is that the maintenance history is saved only in the operator's memory; the data must then be rewritten into tabular form. It is recommended to work in groups and make a record of the work, so the analysis can be repeated later. When no information is available, it is necessary to make tests or estimate parameters. For the purpose of this methodology we assume data are available.

3.2 Operational time estimation

The total cumulated operational time of equipment is one of the most basic items for reliability parameter calculation. We count it as the sum of all operational times of one type of equipment. Often the kickoff time of the equipment is not available; then we consider the date of the first failure as the starting time.
A great assistant for counting ''motor hours'' is MS Excel, whose ''year360'' function counts the difference between two dates. This function treats every month as 30 days long, which means the year has only 360 days. This is no problem given that we do not take scheduled outages into account.
The total cumulated processing time is just the sum of all operating times. The analyst has to distinguish whether the equipment was only in position or was really working – mostly in the case of backup systems.

3.3 Reliability parameters calculation

This chapter deals with the failure rate of an equipment type, not of each unit. It is required to erase data about scheduled overhauls and preventive maintenance tasks, keeping only records about failures and breakdowns. These data should be sorted chronologically so that a histogram of failure numbers in each year can be made. The number of failures in one year is then the sum of all failures which happened on all machines of one type.
Industrial machinery usually keeps operating under the same conditions, with the same equipment, etc. Because of that, its failure rate should seemingly also be the same over its operating life. This presumption corresponds to an exponential distribution of the time between failures, whose only parameter is λ, the reciprocal of the MTBF.
Let us do a chi-square test to confirm or reject the validity of using the exponential distribution. The exponential distribution has a constant failure rate, so we can presuppose the same number of failures in time intervals of the same length. Splitting the test period of total length T into m similar intervals, we expect a similar number of failures A in each:

A = w · d/T  [1]  (1)

where w is the length of the interval; it should be chosen so that there are at least 5 failures in each interval, and d is the number of failures in the tested period. Let us count the test statistic:

χ² = Σ_{i=1}^{m} (ri − A)²/A  [1]  (2)

To confirm the hypothesis of an exponential distribution of the time between failures, the counted value of the χ² statistic should be lower than its theoretical value χ²(v) for v = m − 1 degrees of freedom. We can do, for example, a one-sided test on a 10%¹ significance level.
The point estimate of the mean time between failures of all machineries can be counted by [2], point 5.2:

MTBF = T/r = (Σ_{i=1}^{n} ti)/r  [h]  (3)

¹ A 10% significance level of hypothesis testing means that with a probability of 10% we make an error of the 1st type—we do not confirm the exponential distribution although the data really are exponentially distributed.
ing – mostly in case of backup systems. really are exponentially distributed.
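The chi-square test of eqns (1)–(2) and the failure-rate estimate with its 90% limits, eqns (3)–(5b), can be sketched as follows. The chi-square quantiles come from the Wilson–Hilferty approximation rather than from the tables of the cited standard, so the limits are approximate, and the yearly counts are hypothetical:

```python
import math

def chi2_quantile(p, v):
    """Approximate p-quantile of the chi-square distribution with v degrees
    of freedom (Wilson-Hilferty); standard-normal quantiles hardcoded for
    the levels used in the methodology."""
    z = {0.05: -1.6449, 0.90: 1.2816, 0.95: 1.6449}[p]
    c = 2.0 / (9.0 * v)
    return v * (1.0 - c + z * math.sqrt(c)) ** 3

def chi2_statistic(counts):
    """Eqns (1)-(2): expected count A per interval, then sum of (ri - A)^2 / A."""
    A = sum(counts) / len(counts)
    return sum((r - A) ** 2 / A for r in counts)

def failure_rate_limits(r, T):
    """Eqns (3)-(5b): MTBF = T/r, lambda = 1/MTBF, and the 90% confidence
    limits on the failure rate."""
    mtbf = T / r
    lam_lo = chi2_quantile(0.05, 2 * r) / (2.0 * T)
    lam_hi = chi2_quantile(0.95, 2 * r + 2) / (2.0 * T)
    return mtbf, 1.0 / mtbf, lam_lo, lam_hi

# Yearly failure counts of one machine type (hypothetical numbers).
counts = [6, 9, 7, 10, 8]
stat = chi2_statistic(counts)
exp_ok = stat < chi2_quantile(0.90, len(counts) - 1)   # one-sided 10% test
mtbf, lam, lam_lo, lam_hi = failure_rate_limits(r=20, T=10000.0)
```

For 20 failures in 10 000 cumulated hours this gives MTBF = 500 h and a 90% band of roughly 0.0013–0.0029 h⁻¹ for λ.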

The failure rate for the exponential distribution is then calculated as:

λ = 1/MTBF  [h⁻¹]  (4)

Confidence intervals are given by standard [2], paragraph 5.1.2.1. The lower confidence limit is calculated by formula (5a), the upper one by (5b):

λ_L = χ²_0.05(2r) / (2T)  [h⁻¹]  (5a)

λ_U = χ²_0.95(2r + 2) / (2T)  [h⁻¹]  (5b)

where χ²_α(v) denotes the α-fractile of the distribution function of the χ² distribution with v degrees of freedom.

MTBF_L = 1/λ_U  [h]  (6a)

MTBF_U = 1/λ_L  [h]  (6b)

The lower and upper limits of the mean time between failures at the 90% confidence level indicate that with 90% probability the real MTBF lies inside the interval (MTBF_L, MTBF_U).

The next observed reliability indicator is the mean time to repair of the equipment function (MTTR). This parameter can be estimated as the sum of all times to repair divided by the number of failures, provided the data are available:

MTTR = T_p / r  [h]  (7)

The lower and upper confidence limits of the mean time to repair are calculated as:

MTTR_L = 2T_p / χ²_0.95(2r + 2)  [h]  (8a)

MTTR_U = 2T_p / χ²_0.05(2r)  [h]  (8b)

The lower and upper limits of the mean time to repair at the 90% confidence level give the range in which the times to repair lie with 90% probability.

Based on the knowledge of these reliability parameters we can calculate the asymptotic unavailability of the technological equipment:

U = MTTR / (MTTR + MTBF)  [1]  (9)

3.4 Pareto analysis of failure modes

Pareto analysis was developed as a quantitative analysis of the most common failure modes and their effects. It can separate substantial factors from minor ones and show where maintenance effort should focus in order to remove shortcomings in the equipment maintenance process. The so-called Pareto analysis presupposes that approximately 80% of consequences are caused by only 20% of causes. It stresses the fact that it is not necessary to deal with all causes: for a satisfactory effect it is enough to solve only a few of the most important ones. For the purpose of this study it is adequate to draw on the operator's experience and follow his advice; from this experience it is possible to predict which causes will be the most common. For the purpose of the methodology, most failures are caused by only three failure mode groups. These groups can be divided into several sub-modes, as shown in Table 1.

Table 1 does not list all failure modes which occurred on the equipment. The analyst has to diagnose all failure modes and write them down, e.g., also into the table or into a Pareto graph, see Fig. 1.

Table 1. List of most common failure modes.

Failure  Sum      Sub-    Sub-    Sub-    Sub-
mode     of subs  mode 1  mode 2  mode 3  mode 4
Mode 1   74       27      23      22      2
Mode 2   11       5       3       3
Mode 3   23       9       7       5       2

4 ANALYSIS FLOW-CHART

The following flow-chart summarizes the previous chapters of the methodology. It is a sufficient basis for a skilled analyst to carry out a study.

5 RESULTS OF THE STUDY

The analyst should summarize the records of the study at the end of the paper. The summary should contain the processing conditions, the operators' experience, and any modifications of the equipment (if they were done). The next part of the summary should emphasize the weak parts of the equipment, and last but not least the paper should contain proposals for changes to the machine design. Such a study can serve as a basis for the development of new equipment with similar parameters. A secondary outcome of the presented procedure is the estimation of reliability

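The 80/20 screening described in section 3.4 can be sketched in a few lines. The mode names and counts below are the dominant-mode totals that appear later in the appendix of this study (135 failures in total) and serve only as example input:

```python
# Pareto screening: rank failure modes by count and compute the share
# covered by the dominant ("vital few") modes.
counts = {"seal system": 74, "oil system": 23, "bearing": 11}
total_failures = 135

ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
dominant = sum(n for _, n in ranked)       # failures in the three groups
share = dominant / total_failures          # the "80%" of the 80/20 rule
```

With these inputs the three groups cover 108 of 135 failures, i.e., exactly 80%, which is the situation reported in the appendix.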
Figure 1. Graph of failure frequency (number of failures and cumulated number of failures per failure mode).

ACKNOWLEDGEMENT

This research was supported by the Ministry of Education, Youth and Sports of the Czech Republic, project No. 1M06059—Advanced Technologies and Systems for Power Engineering, and it was carried out at the Technical University of Liberec, Faculty of Mechatronics and Interdisciplinary Engineering Studies.

REFERENCES

ČSN IEC 50(191) (010102) International Electrotechnical Vocabulary, Chapter 191: Dependability and quality of service.
ČSN IEC 60605-4 (01 0644-4) Equipment reliability testing, Part 4: Statistical procedures for exponential distribution: Point estimates, confidence intervals, prediction intervals and tolerance intervals.

APPENDIX: METHODOLOGY APPLICATION

Experience with data collection


The presented methodology for analysing the operation of technological equipment was applied to large industrial pumps made by the Sigma Group and used in the Czech power industry. This appendix collects experience from the finished studies.

The first step of the analysis was collecting the available data. I thought there was a central database of pump failures and their repairs, so I contacted several companies which deal with power-plant pump maintenance in order to obtain this database. I obtained extensive tables; a reduced and translated example is shown in Table A1.

This example shows that it is possible to filter the data by pump location (column PWR_P). Column BS_C tells us the type of the pump, i.e., whether it is a feeding, condensation or cooling pump. Data in column DAY give the date of the failure, and the time of repair can be found in column TIME. Unfortunately, we cannot identify the concrete pump unit, so when there is more than one machine type in the power plant for, e.g., cooling (which is almost always the case), we cannot find out which type the data belong to. This is why I had to obtain the data in a somewhat more laborious way: I traveled from one power plant to another and collected the data in each of them.
Data in power plants are stored at the SHIM department, whose staff also know where to find the records relevant for a reliability analysis. An example of such data is shown in Table A2. The data are translated and some columns are hidden because of the paper width.

Figure 2. Analysis proceeding flow-chart.
parameters of a machine under development). Knowledge of reliability parameters is now a standard customer requirement.

There is no uniform format for saving equipment operating data in the Czech Republic. This is why the analyst has to transfer the obtained data into a standardized form.

Table A1. Chosen columns of central pumps failure rate database.

PUMP PWR_P HTC DAY CNSQ RSN BS_C AD_C SKR_C ST MWH GJ TIME

290 ECH B3 1.7.82 RS20 90 4320 4900 312 2 55 0 0


317 ECH B3 9.7.82 RE10 61 2222 4900 314 2 120 0 0,6
322 ECH B4 15.7.82 RH10 61 4320 4600 311 2 162 0 0,9
339 ECH B2 23.7.82 RS20 64 2222 4900 312 2 12 0 0
357 ECH B4 30.7.82 RE11 64 4320 4600 311 2 260 0 1,3
360 ECH B2 31.7.82 RE11 61 2222 4900 314 2 600 0 3
361 ECH B2 31.7.82 RX10 19 2222 4800 2 0 0 0

Table A2. List of maintenance tasks.

Description Pr. system WRK EQ type Equipment Date TTR Req. date

RELAY CHANGE EZH-112V-T 1VC01L051   SOVP SO CER   1VC01D001   17.04.00   200        17.12.99
PUMP BODY CLEANING-OIL              SOVP SO CER   1VC01D001   04.10.00   12         12.06.00
PUMP BEARING LEAKAGE                SOVP SO CER   1VC01D001   16.10.00   24         11.07.00
GDZ/485/00—RUN-UP MEASUREMENT       SOVP SO CER   1VC01D001   23.11.00   20         11.08.00
OIL DRAINAGE ARMOUR BLANKING        SOVP SO CER   1VC01D001   01.05.01   01.01.00   29.01.01
NEW SENSOR WEC 1VC01F001            SOVP SO CER   1VC01D001   22.05.01              24.01.00
NEW SEAL ON RADIAL BEARING          SOVP SO CER   1VC01D001   30.04.01              17.04.00

Table A3. Corrected, chronologically sorted data about failures.

ID number    Date       Repair description                    TTR
1VC01D001    17.04.00   Change of relay EZH-112VT 1VC01L051   200
1VC01D001    04.10.00   Pump body cleaning (oil)              12
1VC01D001    16.10.00   Repair of pump bearing leakage        24
1VC01D001    23.11.00   Motor run-up measurement              20
1VC01D001    19.03.01   Oil relay checking                    8
1VC01D001    30.04.01   New seal on pump radial bearing       108
1VC01D001    30.04.01   Pump sensors disconnecting            6

For the purpose of the analysis it is sufficient to obtain data on the unique machine identification (e.g., production number, position number), the failure description and the date of the failure. Information about the time to repair, which is not always available, adds the data necessary for the availability point estimate. Table A3 shows an example of corrected data, as described above.

The proper analysis can start once the corrected data are available. It is described in the following steps.

Reliability parameters estimation

The cumulated operating time is calculated and shown in Table A4. All machines were in an operational state all the time, so the total cumulated operating time is simply the sum of the estimated operating times of the individual machines.

Table A4. Operating times of each pump.

Unique ID     Start of operating   Est. operational time [h]
P_P1_B1_P1    3.8.1985             184 248
P_P1_B1_P2    18.12.1986           172 368
P_P1_B1_P3    15.2.1990            145 080
P_P1_B1_P4    25.4.1986            177 960
P_P1_B2_P1    18.9.1987            165 888
P_P1_B2_P2    6.4.1988             161 136
P_P1_B2_P3    25.9.1987            165 720
P_P1_B2_P4    17.8.1987            166 632
Total cumulated operational time   1 339 032
Total number of failures           135

The number of failures in the individual years could be listed in a table; however, a graphical form is more transparent, since failure rate trends become visible. An example of a failure histogram over the individual years is shown in the following figure.

Figure A3. Number of failures in separate years of operation (histogram, years 1985–2006).

Figure A4. Pareto graph of failure modes by percent occurrence (modes include: seal system, oil system, bearing, alignment, shaft, chain, gearbox, clutch, flange, design, sensors, noisiness, prevention, cooling spiral).

It is evident from the figure that the failure occurrence was relatively constant over the whole observed period; only in the year 2001 was there a massive growth in the number of failures. Let us therefore focus on this year when locating failure causes. In this case a modification of the seal system was attempted, which increased the number of maintenance tasks. There were fewer failures in the years after the modification, so we can say that the change was successful and the pump failure rate decreased.

Because there were 24 failures in the year 2001 (this fact is explained, so we are allowed to make the simplification), the hypothesis of an exponential distribution of the time to failure is tested for the years before 2001. The tested time is 16 years (1985–2000 included), during which 107 failures occurred; the length of the testing interval was set to 2 years. The expected number of failures in every 2-year interval, calculated by (1), is:

A = 2 · 107/16 ≈ 13

Table A5. Example of the three most common pump failure modes.

Seal system:  total 74:  set-up 27, change 23, mass top-up 22, drain plugged 1, seal cleaning 1
Bearing:      total 11:  resealing 5, air venting 2, oil change 2, cooling 1, revision 1
Oil system:   total 23:  resealing 9, top-up 5, cover leakage 5, withdrawal 2, O-ring montage 2

The test statistic value by (2) is:

χ² = 9.71

The calculated value χ² = 9.71 is below the theoretical value χ²_0.9(7) = 12.02, so the hypothesis of an exponentially distributed time to pump failure is confirmed at the 10% significance level for the years 1985–2000. In the year 2001 a larger number of maintenance tasks was carried out, which decreased the failure rate in the following years.

The point estimate of the mean time between failures is calculated by (3):

MTBF = 1 339 032/135  [h] ≈ 9900 h

The failure rate, calculated by (4), is then:

λ = 1 · 10⁻⁴ h⁻¹

The confidence limits of the failure rate are obtained by (5a) and (5b):

λ_L = 232.95/(2 · 1 339 032)  [h⁻¹] = 8.7 · 10⁻⁵ h⁻¹
λ_U = 311.5/(2 · 1 339 032)  [h⁻¹] = 1.2 · 10⁻⁴ h⁻¹

The confidence limits of the mean time between failures are obtained by (6a) and (6b):

MTBF_L = 2 · 1 339 032/311.5  [h] ≈ 8600 h
MTBF_U = 2 · 1 339 032/232.95  [h] ≈ 11500 h

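The χ² fractiles used above (232.95 and 311.5) and the resulting MTBF bounds can be reproduced in a few lines. In practice one would take the quantiles from a statistical library (e.g. scipy.stats.chi2.ppf); the stdlib sketch below instead uses the Wilson–Hilferty approximation of the χ² quantile, which is accurate enough here to match the quoted values:

```python
from statistics import NormalDist

def chi2_fractile(p, v):
    # Wilson-Hilferty approximation of the chi-squared p-fractile
    # with v degrees of freedom (very accurate for large v).
    z = NormalDist().inv_cdf(p)
    return v * (1.0 - 2.0/(9.0*v) + z * (2.0/(9.0*v))**0.5) ** 3

T, r = 1_339_032, 135                  # totals from Table A4
lo = chi2_fractile(0.05, 2*r)          # ~232.95
hi = chi2_fractile(0.95, 2*r + 2)      # ~311.5

lam_L, lam_U = lo/(2*T), hi/(2*T)      # formulas (5a), (5b)
mtbf_L, mtbf_U = 1/lam_U, 1/lam_L      # formulas (6a), (6b)
```

Running this reproduces the interval of roughly 8600 h to 11500 h quoted in the text.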
The mean time to repair is calculated by (7):

MTTR = 3270/135  [h] ≈ 24 h

The confidence limits of the mean time to repair are determined by (8a) and (8b):

MTTR_L = 2 · 3270/311.5  [h] ≈ 21 h
MTTR_U = 2 · 3270/232.95  [h] ≈ 28 h

Now we can finish the reliability part of the work by calculating the asymptotic unavailability by formula (9):

U = 24.2/(24.2 + 9919) = 2.4 · 10⁻³

Pareto analysis

Operators' experience shows that the three most common failures are seal leakage, oil system failure and bearing failure. That is why these failure modes were divided into further sub-modes, called root causes. This subdivision is shown in Table A5.

These three dominant failure modes covered exactly 80% of all failures, 108 out of 135 (this is an example from industry based on real data, not a textbook one). The number of failures per failure mode, sorted by percent occurrence, is charted in the Pareto graph (Figure A4).

The results of the Pareto analysis show that the most troubled part in the pumps' operational history was the seal system. However, the seals were modified in the year 2001 and their failure rate decreased. Because of that, the analyst recommends focusing on the second most common failure group, the oil system failures. It is also recommended to keep observing the modified seal system to verify that the modification really eliminated the seal problems.

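The closing availability calculation can be reproduced directly from the point estimates above (all values come from the study; the availability line is just the complement of the unavailability):

```python
# Point estimates from the study: Tp = 3270 h of repair time,
# r = 135 failures, T = 1 339 032 h of cumulated operation.
Tp, r, T = 3270, 135, 1_339_032

mttr = Tp / r            # formula (7), ~24.2 h
mtbf = T / r             # ~9919 h
U = mttr / (mttr + mtbf) # formula (9), asymptotic unavailability
A = 1 - U                # corresponding asymptotic availability
```

The result U ≈ 2.4·10⁻³ matches the value in the text, i.e., the pumps are available about 99.76% of the time.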

Evaluation of device reliability based on accelerated tests

E. Nogueira Díaz
Telefónica I+D, UPM, Spain

M. Vázquez López & D. Rodríguez Cano


EUITT-Universidad Politécnica Madrid, Spain

ABSTRACT: Reliability evaluation based on degradation is very useful in systems with scarce failures. In this paper a new degradation model based on the Weibull distribution is proposed. The model is applied to the degradation of Light Emitting Diodes (LEDs) under different accelerated tests. The results of these tests are in agreement with the proposed model, and the reliability function is evaluated.

1 INTRODUCTION

Reliability evaluation based on degradation models [1] is commonly applied to highly reliable products as a cost-effective and confident way of evaluating their reliability. In this paper a device degradation model is presented and subsequently applied to the quantitative analysis of LED reliability. With this model the different parameters related to module reliability, such as the reliability function, the failure rate function, the Mean Time To Failure (MTTF) or the warranty period, can be assessed based on LED degradation. In order to obtain reliability data in a suitable period of time, degradation is measured in a climatic chamber in accelerated tests.

The classical degradation model determines the number of failures at any time based on degradation data. This model assumes that the functionality parameter of a group of devices (light output in the case of LEDs) follows a normal distribution at each instant of time, whose parameters (mean and standard deviation) change as a function of time.

In this paper the limitations of the classical model are analysed from a theoretical and practical point of view. Calculations were performed in order to see the temporal limitation of the classic model, using a linear variation of the mean and standard deviation with time. The standard deviation trend limits are also analysed in order to avoid results that are not real from the degradation point of view, such as LEDs that improve with time or light output values lower than zero.

Finally, we propose a model using the Weibull distribution to solve the classical model limitations.

2 CLASSIC MODEL

In degradation models it is assumed that a component fails when one of its functional parameters (power, voltage, light output, etc.) degrades enough that it no longer allows the component to carry out its function successfully. A degradation failure is usually defined as a percentage of the nominal value below which the component is considered unable to perform its function. For example, in the LED case the failure is considered to occur when the light output falls below 70% of the nominal value [2].

The classical model assumes that:

• The functionality parameter is distributed following a normal distribution with average μ and standard deviation σ.
• The average and standard deviation are functions of time, μ(t) and σ(t).

For the mean, a linear variation is usually used by several authors [3–4]:

μ(t) = μ0 − A t  (1)

Where:
μ0 mean initial value
A constant that indicates the speed of degradation
t time

The linear trend presents a problem for t ≥ μ0/A because, in this period of time, the functionality parameter takes values lower than zero.

Other authors [5] propose an exponential trend:

μ(t) = μ0 e^(−t/C)  (2)

being:
μ0 mean initial value
C constant that represents the time at which the parameter has degraded to e⁻¹ (about 36.8%) of its initial value

For the time variation of the standard deviation, a linear variation is often used:

σ(t) = σ0 + B t  (3)

Where:
σ0 initial standard deviation
B constant that indicates the speed of degradation of the standard deviation
t time

In general this model assumes that the parameter distribution at any instant of time follows a normal distribution with average μ(t) and standard deviation σ(t):

f(p, t) = (1/(σ(t)√(2π))) e^(−½((p−μ(t))/σ(t))²)  (4)

Figure 1 shows the previous model assuming a linear variation of both the average and the standard deviation.

Figure 1. Mean and standard deviation evolution with time.

Because the standard deviation increases with time, the normal distribution flattens with time, and therefore at a certain instant of time the normal curve, or in a simpler way μ(t) − 3σ(t), will pass through the failure limit; at that moment degradation failures will appear.

Based on this model it is possible to evaluate the reliability as the probability that, at any instant of time, the functionality parameter is within the non-failure parameter limits:

R(t) = (1/(σ(t)√(2π))) ∫ from LL to LS of e^(−½((p−μ(t))/σ(t))²) dp  (5)

Where:
p parameter that is being analysed
μ mean
σ standard deviation
LL and LS lower and upper failure limits

There are some device manufacturers that provide degradation data, but they are scarce. In order to obtain degradation data in a suitable period of time it is necessary to use accelerated tests, as will be explained in this paper.

Reliability can be estimated from degradation data using equation (5). One time parameter that is easily evaluated with this model, following Figure 1, is the time at which 50% of the devices have failed, R(t50) = 0.5.

In the linear parameter trend case t50 will be:

t50 = (μ0 − pF)/A  (6)

pF failure limit parameter.

In the exponential parameter trend case t50 will be:

t50 = −C ln(pF/μ0)  (7)

The time at which the reliability is practically zero is also easily evaluated. This time can be calculated by means of the following equation:

μ(t) = pF − 3σ(t)  (8)

3 CLASSIC MODEL LIMITATIONS

Using the classic model, and depending on the degradation parameters, it is possible to obtain results without any physical sense. As an example, with the classic model it is possible that a percentage of the devices improves its performance (functionality parameter), as can be seen in Figure 2.

As can be seen in Figure 2, there is a percentage of devices (calculated as μ(t) + 3σ(t)) that improves its performance with time, which is not possible in a

Figure 2. Normal distribution power values (μ+σ, μ, μ−σ) with mean and standard deviation linear trends according to the classical model.

degradation model. In order to avoid this situation it is necessary that the degradation trend follows the next equation:

μ(t) + 3σ(t) ≤ μ0 + 3σ0  (9)

In the case of linear mean and standard deviation trends it is necessary that:

A ≥ 3B  (10)

In the next figure a case in which A ≈ 3B can be seen.

Figure 3. Normal distribution power values (μ+σ, μ, μ−σ) with mean and standard deviation linear trends according to the classical model (A ≈ 3B).

In the exponential average degradation case the analysis is very similar to the previous one, as can be seen in Figure 4.

Figure 4. Normal distribution power values (μ+σ, μ, μ−σ) with average exponential trend and standard deviation linear trend according to the classical model.

4 PROPOSED MODEL

In this paper we propose a model that is based on the assumption that the functionality parameter decays with time following a Weibull distribution function [6]. The Weibull function is very useful due to its versatility, because depending on the parameters it is possible to approximate very different degradation trends.

In the Weibull distribution the functionality parameter depends on time in the following way:

μ(t) = μ0 e^(−((t−t0)/η)^β)  (11)

being:
t0 location parameter, which defines the degradation starting point
η scale parameter, the time at which the functionality parameter has been reduced to e⁻¹ (0.368) of its value
β shape parameter (or slope)

Regarding the shape parameter: if β = 1 the functionality parameter varies with time following an exponential, which means that the degradation rate is constant over the whole period of time:

μ(t + Δt)/μ(t) = (μ0 e^(−(t+Δt)/η))/(μ0 e^(−t/η)) = e^(−Δt/η)  (12)

If β < 1 the degradation rate decreases with time; if β > 1 the degradation rate increases with time.

For the common case where t0 = 0 and η = 1, Figure 5 shows three curves for the three types of β described in the preceding paragraph.

The main advantages of the Weibull function are:

• It takes values between μ0 at t = 0 and 0 (t = infinity), in accordance with theoretical degradation models. Although it only takes the value zero for t equal to infinity, it is possible to model a practically zero value at finite times.

Figure 5. Weibull distribution for t0 = 0, η = 1 and different β values (0.2, 1 and 5).

• Different parameter degradations reported in the literature, such as linear degradation, exponential degradation or others, can be modeled with a Weibull distribution.
• The Weibull parameters give us information about the degradation rate (constant, decreasing with time or increasing with time).

5 EXPERIMENTAL TESTS

The degradation model has been applied to experimental results obtained with LEDs working in a pressure cooker chamber under different temperature and humidity conditions. During the tests the LEDs were working in the pressure cooker chamber according to the next figure.

Figure 6. Test circuit (pressure cooker).

The test procedure was:

1. 15 LEDs were introduced into the pressure cooker, applying a bias voltage according to the scheme shown in the figure. The polarization resistance and the power supply were outside the chamber.
2. Periodically, once each day, the test was interrupted and all the LEDs were characterized with an optical power meter.
3. The luminosity power was recorded and compared with the initial power luminosity.

In the 110 °C/85% case the test lasted 29 days, and on day 19 catastrophic failures began to appear. Catastrophic failures have not been taken into account in calculating means and standard deviations.

6 DEGRADATION MODEL VERIFICATION

In order to analyse the model validity we have evaluated two points:

1. The power luminosity of the 15 LEDs follows a normal distribution at any instant of time. The average and standard deviation have been calculated by means of the normal representation.
2. The power luminosity degrades following a Weibull function. The Weibull representation has been used for this purpose.

6.1 LEDs power luminosity distribution

Table 1 shows the power luminosity results of the 15 LEDs for days 10, 17 and 26 of the 110 °C/85% RH pressure cooker test.

Table 1. Power luminosity values for days 10, 17 and 26. Pressure cooker 110 °C/85% RH.

LED    Day 10   Day 17   Day 26
L1     0.552    0.322    0.183
L2     0.574    0.488    0.316
L3     0.613    0.549    0.476
L4     0.591    0.538    0.294
L5     0.616    0.560    0.534
L6     0.606    0.524    0.400
L7     0.634    0.384    0.265
L8     0.606    0.623    0.551
L9     0.610    0.567    0.481
L10    0.623    0.385    0.171
L11    0.604    0.489    0.000
L12    0.628    0.570    0.462
L13    0.629    0.600    0.521
L14    0.575    0.519    0.000
L15    0.565    0.535    0.410

As can be seen in Table 1, on day 26 two LEDs, 11 and 14, have failed catastrophically. These power luminosity data have been represented on normal probability paper, as can be seen in Figure 7, and from these normal representations the average and standard deviation at different instants of time have been evaluated. In the following figures the average and the standard deviation of the power luminosity are represented with respect to time.

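The per-day averages and standard deviations used in the normal representation can be recomputed from Table 1. A sketch (the day-26 catastrophic zeros of LEDs 11 and 14 are excluded from the statistics, as in the paper):

```python
from statistics import mean, stdev

# Power luminosity readings from Table 1 (110 C / 85% RH test),
# keyed by test day.
readings = {
    10: [0.552, 0.574, 0.613, 0.591, 0.616, 0.606, 0.634, 0.606,
         0.610, 0.623, 0.604, 0.628, 0.629, 0.575, 0.565],
    17: [0.322, 0.488, 0.549, 0.538, 0.560, 0.524, 0.384, 0.623,
         0.567, 0.385, 0.489, 0.570, 0.600, 0.519, 0.535],
    26: [0.183, 0.316, 0.476, 0.294, 0.534, 0.400, 0.265, 0.551,
         0.481, 0.171, 0.000, 0.462, 0.521, 0.000, 0.410],
}

stats = {}
for day, values in readings.items():
    # Catastrophic failures (reading 0.000) are dropped before
    # computing the normal-model parameters.
    alive = [v for v in values if v > 0.0]
    stats[day] = (mean(alive), stdev(alive))
```

The resulting means decrease with the day while the standard deviation grows, which is the behaviour discussed around Figures 8 and 9.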
Figure 7. Normal distribution representation at three different instants of time (day 10: blue, right; day 17: pink, middle; day 26: yellow, left). Pressure cooker test (110 °C/85% RH).

Figure 8. Average power luminosity with respect to time (110 °C/85% RH).

Figure 9. Standard deviation of the power luminosity with respect to time.

From Figure 8 the behaviour of the power luminosity versus time can be seen. During the first days the power luminosity increases with respect to the initial power luminosity, and after that the device starts to degrade. This evolution has been reported in the literature by other authors [7,8].

From the standard deviation evolution it is possible to distinguish three different periods. The first period lasts from the beginning till day eleven, and in it the standard deviation is almost constant. The second period goes from day eleven to day eighteen, and in it the standard deviation increases following a linear trend, according to the classical model. The last period starts on day 19, when catastrophic failures appear, and therefore it is not easy to find the standard deviation trend.

6.2 Power luminosity Weibull function

Based on Figure 7 we have evaluated the average power luminosity with respect to time. It can be seen that degradation does not start until day 4. In Figure 10 we have represented the relative power luminosity (relative to the power luminosity at the fourth day) with respect to time in a Weibull representation, concluding that it can be modelled with a Weibull function.

Figure 10. Power luminosity vs time in a Weibull plot (fit: y = 2.1642x − 7.2746, R² = 0.9088).

From the Weibull representation we have obtained the Weibull parameters for this specific test (110 °C/85% RH): following the proposed law, β = 2.1642 and η = 28.8276 days = 691.86 hours, and therefore the power luminosity evolves with time in the following way:

Pm(t) = 0.62 e^(−((t−96)/691.86)^2.16)  (13)

7 RELIABILITY EVALUATION

We have evaluated the reliability function and the MTTF assuming that the device failure limit is 70% of the nominal luminosity. Table 2 shows on which days the failures appear.

We have represented the failures in a Weibull plot, as can be seen in Figure 11. From the Weibull plot it is possible to evaluate the reliability function, obtaining a β value higher than one for all the tests, which indicates a degradation mechanism.
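The Weibull parameters follow directly from a straight-line fit on the Weibull plot: with y = ln(−ln P/P0) (or ln(−ln(1−F)) for failure times) and x = ln t, the slope is β and the intercept is −β ln η. A small sketch recovering them from both reported fits (Figure 10 above and the failure-time fit shown in Figure 11); the MTTF line is an illustrative extra computed from the standard Weibull mean, not a value given in the paper:

```python
import math

def weibull_from_fit(slope, intercept):
    """Recover (beta, eta) from a least-squares line y = slope*x + intercept
    fitted on a Weibull plot: slope = beta, intercept = -beta*ln(eta)."""
    beta = slope
    eta = math.exp(-intercept / slope)
    return beta, eta

# Degradation fit, Figure 10: y = 2.1642x - 7.2746 (t in days)
beta_d, eta_d = weibull_from_fit(2.1642, -7.2746)

# Failure-time fit, Figure 11: y = 4.6621x - 14.219 (t in days)
beta_f, eta_f = weibull_from_fit(4.6621, -14.219)

# Illustrative MTTF from the Weibull mean, eta * Gamma(1 + 1/beta)
mttf_days = eta_f * math.gamma(1 + 1/beta_f)
```

The first fit reproduces the quoted η ≈ 28.83 days (691.86 hours), and the second gives a characteristic life of about 21 days for the failure distribution.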

Table 2. Accumulated failures at different days (110 °C/85% RH).

Days   Accumulated failures
12     1
13     3
16     4
20     8
22     9
23     11
24     14
25     15

Figure 11. Weibull plot for the 110 °C/85% RH pressure cooker test (fit: y = 4.6621x − 14.219, R² = 0.9309).

8 RESULTS AND FUTURE WORK

Once the average and standard deviation of the power luminosity have been evaluated, it is easy to evaluate the reliability based on the proposed model. The first results show that:

• All the experiments done at different conditions (110 °C/85% RH, 130 °C/85% RH and 140 °C/85% RH) inside the pressure cooker chamber can be fitted by the proposed model.
• The first results related to reliability evaluation by the proposed model show that all of them have a similar degradation behaviour but at different rates, depending on the accelerated test.
• After several hours of operation (this time depends on the acceleration factor) the LEDs start to degrade and the failure rate increases with time.
• Tests at different temperature/humidity conditions are ongoing in order to analyse the influence on reliability of the different parameters: temperature, humidity and pressure.
• Reliability at normal working conditions will be evaluated when all the tests are finished.

9 CONCLUSIONS

The main conclusions of this paper are:

• We have proposed a degradation model based on the Weibull function that fits several degradation models for different devices reported in the literature.
• From this model it is possible to evaluate the reliability function of any device by analyzing degradation data.
• The first results of accelerated tests on AlGaInP LEDs show that:
  ◦ LED degradation follows a Weibull function with respect to time, in agreement with the proposed model.
  ◦ Reliability in all pressure cooker tests follows a Weibull function with a shape parameter higher than one, in agreement with the degradation mechanism.
• Accelerated tests at different conditions are ongoing in order to extrapolate reliability data to normal working conditions.

REFERENCES

[1] Coit D.W., Evans J.L., Vogt N.T., Thomson J.R. A method for correlating field life degradation with reliability prediction for electronic modules. Quality and Reliability Engineering International 2005; 21: 715–726.
[2] ASSIST: Alliance for Solid-State Illumination Systems and Technologies. LED life for general lighting. Assist Recommends 2005; 1(1): 1–13.
[3] Osterwald C.R., Benner J.P., Pruett J., Anderberg A., Rummeland S., Ottoson L. Degradation in weathered crystalline-silicon PV modules apparently caused by UV radiation. 3rd World Conference on Photovoltaic Energy Conversion.
[4] Jia-Sheng Huang. ''Reliability-Extrapolation Methodology of Semiconductor Laser Diodes.'' IEEE Transactions on Device and Materials Reliability, Vol. 6, No. 1 (2006).
[5] Xie J., Pecht M. ''Reliability Prediction Modeling of Semiconductor Light Emitting Device.'' IEEE Transactions on Device and Materials Reliability, Vol. 6, No. 3: 218–222.
[6] Weibull W. ''Fatigue Testing and Analysis of Results.'' Pergamon Press, New York.
[7] Kish F.A., Vanderwater D.A., DeFevere D.C., Steigerwald D.A., Hofler G.E., Park K.G., Steranka F.M. ''Highly reliable and efficient semiconductor wafer-bonded AlGaInP/GaP light-emitting diode.'' Electronics Letters Vol. 32, No. 2: 132–134 (1996).
[8] Grillot P.N., Krames M.R., Zhao H., Teoh S.H. ''Sixty Thousand Hour Light Output Reliability of AlGaInP Light Emitting Diodes.'' IEEE Transactions on Device and Materials Reliability, Vol. 6, No. 4: 564–574 (2006).


Evaluation, analysis and synthesis of multiple source information:


An application to nuclear computer codes

Sebastien Destercke & Eric Chojnacki


Institut de Radioprotection et de Sûreté Nucléaire, Cadarache, France

ABSTRACT: In this paper, we are interested in the problem of evaluating, analyzing and synthesizing information delivered by multiple sources about the same badly known variable. We focus on two approaches that can be used to solve the problem, a probabilistic and a possibilistic one. They are first described and then applied to the results of uncertainty studies performed in the framework of the OECD BEMUSE project. The usefulness and advantages of the proposed methods are discussed and emphasized in the light of the obtained results.

1 INTRODUCTION

In the field of nuclear safety, the values of many variables are tainted with uncertainty. This uncertainty can be due to a lack of knowledge or of experimental values, or simply because a variable cannot be directly observed and must be evaluated by some mathematical model. Two common problems encountered in such situations are the following:

1. The difficulty of building synthetic representations of our knowledge of a variable;
2. The need to compare, analyse and synthesize results coming from different mathematical models modeling a common physical phenomenon.

Both issues can be viewed as problems of information fusion in the presence of multiple sources. In the first case, the information can come from multiple experts, sensors, or from different experimental results; taking these multiple sources into account to model input uncertainty is therefore desirable. In the second case, the output of each single mathematical model or computer code can be considered as a single source of information, and the synthesis and analysis of the different outputs can then be treated as an information fusion problem.

Both probability and possibility theories offer formal frameworks to evaluate, analyze and synthesize multiple sources of information. In this paper, we recall the basics of each methodology derived from these theories and then apply them to the results of the BEMUSE (Best Estimate Methods—Uncertainty and Sensitivity Evaluation) OECD/CSNI program (OCDE 2007), in which the IRSN participated.

The rest of the paper is divided in two main sections. Section 2 details the ideas on which the methods are based and then gives some basics about the formal settings of the two approaches. Section 3 describes the BEMUSE program and the application of each methodology to its results. The benefits of using these approaches are then discussed in the light of the obtained results.

2 METHODS

Most of the formal approaches proposed to handle information provided by multiple sources consist of three main steps: modeling the information, evaluating the sources by criteria measuring the quality of the provided information, and synthesizing the information.

In this paper, we only recall the basic ideas of each methodology and focus on the results obtained with the BEMUSE program. More details are given in an extended paper (Destercke and Chojnacki 2007), while the probabilistic and possibilistic approaches are fully motivated in (Cooke 1991) and (Sandri, Dubois, and Kalfsbeek 1995), respectively.

2.1 Modeling information

In this paper, we consider that the information is provided in terms of percentiles, which are surely the commonest type of probabilistic information encountered in safety studies. Other kinds of information cover characteristics of the distribution (mean, median, ...) and comparative assessments (see (Cooke 1991) and (Walley 1991, ch. 4) for extensive reviews).

Classical approaches consist of singling out a probability distribution that corresponds to the information, usually by maximizing an information measure (e.g. the entropy). Nevertheless, many arguments

Figure 1. Examples of probabilistic modeling.

Figure 2. Examples of possibilistic modeling.

converge to the fact that single probabilities cannot adequately account for incompleteness, imprecision or unreliability in the information (see (Ferson and Ginzburg 1996) for a short discussion). Other uncertainty theories, such as possibility theory, allow such features of the information to be accounted for explicitly. Such theories are less precise than probability theory, but ensure that no extra assumptions are added to the available information.

The probability distribution that fits a set of percentiles qk% (the percentile qk% of the probability distribution P of a variable X being the deterministic value x s.t. P(X ≤ x) = k%) and maximizes entropy simply corresponds to a linear interpolation between percentiles. Figure 1 represents a cumulative distribution function (CDF) of the peak clad temperature (the maximal temperature value reached during an accidental transient phase) of a fuel rod in a nuclear reactor core, for which the available information is q0% = 500 K, q5% = 600 K, q50% = 800 K, q95% = 900 K, q100% = 1000 K. The corresponding probability density is pictured in dashed lines.

A possibility distribution (Dubois and Prade 1988) over the reals is formally defined as a mapping π : R → [0, 1]. For a given value α ∈ [0, 1], the (strict) α-cut of π is defined as the set πα = {x ∈ R | π(x) > α}. Given a possibility distribution π, the possibility Π and necessity N measures of an event A are respectively defined as:

Π(A) = max_{x∈A} π(x) and N(A) = 1 − Π(Ac)

with Ac the complement of A. We have, for any event A, N(A) ≤ Π(A), and the possibility and necessity measures are respectively interpreted as upper and lower confidence levels given to an event. They can be compared to classical probabilities, where the confidence in an event is given by a single (precise) measure. Actually, the possibility and necessity measures can be interpreted as lower and upper probability measures (Dubois and Prade 1992), thus defining a set Pπ of probability distributions such that

Pπ = {P | ∀A ⊆ R, N(A) ≤ P(A) ≤ Π(A)}

where the P are probability measures over R. This set of probabilities is also related to α-cuts in the following sense:

Pπ = {P | ∀α ∈ [0, 1], P(πα) ≥ 1 − α}.

This relation indicates that possibility distributions allow information given in terms of nested intervals associated with confidence levels to be modeled (the narrower the interval, the lower the confidence in it). They can thus model information given by a finite number of percentiles, as well as cases where we have partial information about characteristics of an unknown distribution (e.g. mean, percentiles, mode, . . . ; see (Baudrit and Dubois 2006)). Figure 2 represents a possibility distribution corresponding to the peak clad temperature of a fuel rod in a nuclear reactor core where the information consists of four intervals [750 K, 850 K], [650 K, 900 K], [600 K, 950 K] and [500 K, 1000 K], which have respective confidence levels of 10%, 50%, 90% and 100%.

2.2 Evaluating sources

Once information has been given by a source, it is desirable to evaluate the quality of this information and of the sources. In each approach, this quality is given by two numerical values computed on the basis of rationality requirements:

Informativeness (Inf): evaluates the precision of the information by comparing it to the model representing ignorance. The more informative a source is, the more useful its information, and the higher the associated score.

Calibration (Cal): evaluates the coherence between the provided information and some observed experimental values. The higher this coherence with observed values, the higher the calibration score.
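As a minimal, illustrative sketch (not code from the paper), the two models of Section 2.1 can be written down directly: the maximum-entropy CDF is a linear interpolation between the stated percentiles, and one standard way to encode nested confidence intervals as a possibility distribution is π(x) = min, over the intervals not containing x, of (1 − confidence). The numerical values below are those of the Figure 1 and Figure 2 examples.

```python
# Percentile information for the peak clad temperature (Figure 1 example).
percentiles = [(0.00, 500.0), (0.05, 600.0), (0.50, 800.0),
               (0.95, 900.0), (1.00, 1000.0)]

def cdf(x):
    """Maximum-entropy CDF: linear interpolation between the given percentiles."""
    pts = sorted(percentiles, key=lambda t: t[1])
    if x <= pts[0][1]:
        return 0.0
    if x >= pts[-1][1]:
        return 1.0
    for (p0, x0), (p1, x1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return p0 + (p1 - p0) * (x - x0) / (x1 - x0)

# Nested intervals with their confidence levels (Figure 2 example).
intervals = [((750.0, 850.0), 0.10), ((650.0, 900.0), 0.50),
             ((600.0, 950.0), 0.90), ((500.0, 1000.0), 1.00)]

def possibility(x):
    """pi(x) = min over intervals not containing x of (1 - confidence)."""
    pi = 1.0
    for (lo, hi), conf in intervals:
        if not (lo <= x <= hi):
            pi = min(pi, 1.0 - conf)
    return pi
```

With these definitions, a value just outside the 10% interval but inside the 50% one gets possibility 0.9, matching the plateaus of Figure 2.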

Variables on which calibration is computed are called seed variables (that is, variables for which sources have given information and for which experimental data are or will be available).

In the probabilistic approach, informativeness and calibration are computed by means of the Kullback-Leibler (KL) divergence, which can be interpreted as a distance between two probabilities. The informativeness is obtained by comparing the probability distribution pX derived from the source information to the uniform probability distribution uX defined on the whole variation domain of the variable. Calibration is obtained by comparing the probability pX to an empirical distribution rX built from the observations. If the distributions are discretized in B elements, then the KL divergences used to compute the informativeness and the calibration of a source respectively read:

I(p, u) = Σ_{i=1}^{B} pi log(pi/ui)

and

I(r, p) = Σ_{i=1}^{B} ri log(ri/pi)

and are then transformed to obtain, for all sources, non-negative scores summing up to one. In the probabilistic approach, calibration is based on a convergence argument and requires about 10 experiments to ensure a good stability. It is argued by Sandri et al. (Sandri, Dubois, and Kalfsbeek 1995) that the probabilistic approach tends to confuse variability and imprecision.

In the possibilistic approach, informativeness is evaluated by comparing the distribution built from the source information to the interval covering the whole variation domain of a variable. Calibration is simply the extent to which experimental values are judged plausible by the built distribution. In this case, no convergence argument is used. Let Xr denote the variation domain of a variable X, I_Xr the indicator function of Xr (i.e. with value one on Xr and zero elsewhere), and πX the possibility distribution built from the source information. Informativeness is given by:

I(πX) = [∫_{Xr} (I_Xr − πX) dx] / [∫_{Xr} I_Xr dx]

and, if x∗ denotes the observed value for X, the calibration score C(πX) is simply given by the value πX(x∗) (the upper confidence degree given to x∗). Once the calibration and informativeness scores for every source and for all variables are computed, these scores are normalized so that they are non-negative and sum up to one.

Figure 3. Probabilistic synthesis illustration.

Figure 4. Possibilistic synthesis illustration.

2.3 Synthesizing the information

Synthesizing the information consists of aggregating the multiple models built from the information given by the different sources in order to get a single model. This model can be used in subsequent treatments or analyzed to get information about the sources. Three main kinds of operators are usually used:

Conjunction: equivalent to set intersection. Supposes that all sources are reliable. Conjunction gives poorly reliable results in case of disagreement between sources, but allows such disagreement to be detected.

Disjunction: equivalent to set union. Supposes that at least one source is reliable. Disjunction gives reliable results that are often very imprecise (hence of limited usefulness).

Arithmetic mean: equivalent to a statistical counting of the sources. Supposes that the sources are independent, and gives a result that is between conjunction and disjunction. With this operator, sources can also be weighted by the scores obtained during the evaluation phase.

Disjunctive and conjunctive operators are not applicable to the probabilistic approach, and it is commonly recognized that the weighted arithmetic mean is the

907
best approach to aggregate probability distributions. We do not consider Bayesian methods here, because we do not assume we have prior information (see (Clemen and Winkler 1999) for a recent review of such methods). Let p1, . . . , pN be the probability distributions corresponding to the information delivered by N different sources, and λ1, . . . , λN the non-negative weights summing to one attached to these sources (possibly provided by the evaluation procedure briefly described in Section 2.2). The probability distribution p obtained by the weighted arithmetic mean is:

p = Σ_{i=1}^{N} λi pi

This is not the case for the possibilistic approach, for which conjunctive (π∩) and disjunctive (π∪) operators as well as the arithmetic mean (πmean) are all well defined, allowing for a greater flexibility in the synthesis and analysis. Let π1, . . . , πN be the possibility distributions corresponding to the information delivered by the N sources, with the same non-negative weights λ1, . . . , λN summing to one. Then, the classical conjunction, disjunction and arithmetic mean are given, for all x ∈ R, by:

π∩(x) = min_{i=1,...,N} πi(x)    (1)

π∪(x) = max_{i=1,...,N} πi(x)    (2)

πmean(x) = Σ_{i=1}^{N} λi πi(x)    (3)

Note that the above conjunctive and disjunctive operators belong to a broad family of mathematical operators respectively called t-norms and t-conorms (Klement, Mesiar and Pap 2000).

3 APPLICATION TO BEMUSE BENCHMARK

To show the usefulness and potential applications of the methodologies, we apply them to the results of the BEMUSE (Best Estimate Methods—Uncertainty and Sensitivity Evaluation) programme (OCDE 2007) performed by the NEA (Nuclear Energy Agency). Our study focuses on the results of the first step of the programme, in which nine organisations were brought together in order to compare their respective uncertainty analyses with experimental data coming from the experiment L2-5 performed on the loss-of-fluid test (LOFT) facility, in which an accidental transient was simulated.

We focus on four scalar variables for which each participant had to provide a lower bound (Low), a reference value (Ref) and an upper bound (Upp). These variables are the first (PCT1) and second (PCT2) peak clad temperatures (respectively corresponding to the peaks of the blowdown and of the reflood phases), the time of accumulator injection (Tinj) and the time of complete quenching (Tq). These four variables are amongst the most critical values that have to be surveyed in case of a nuclear accident (this is particularly true for the peak clad temperatures). The values resulting from the uncertainty studies achieved by each participant are summarized in Table 1.

For each participant and each variable, the chosen probabilistic model was to take the lower bound as q1%, the reference value as q50% (the median) and the upper bound as q99%. The possibilistic model was taken as π(Low) = π(Upp) = 0.02 (a 98% confidence interval) and π(Ref) = 1 (the most plausible value). Figure 5 illustrates both models built from the information of NRI2 concerning the second PCT.

Table 1. Results of the BEMUSE program.

PCT1 (K) PCT2 (K) Tinj (s) Tq (s)

Low Ref Up Low Ref Up Low Ref Up Low Ref Up


CEA 919 1107 1255 674 993 1176 14.8 16.2 16.8 30 69.7 98
GRS 969 1058 1107 955 1143 1171 14 15.6 17.6 62.9 80.5 103.3
IRSN 872 1069 1233 805 1014 1152 15.8 16.8 17.3 41.9 50 120
KAERI 759 1040 1217 598 1024 1197 12.7 13.5 16.6 60.9 73.2 100
KINS 626 1063 1097 608 1068 1108 13.1 13.8 13.8 47.7 66.9 100
NRI1 913 1058 1208 845 1012 1167 13.7 14.7 17.7 51.5 66.9 87.5
NRI2 903 1041 1165 628 970 1177 12.8 15.3 17.8 47.4 62.7 82.6
PSI 961 1026 1100 887 972 1014 15.2 15.6 16.2 55.1 78.5 88.4
UNIPI 992 1099 1197 708 944 1118 8.0 16.0 23.5 41.4 62.0 81.5
UPC 1103 1177 1249 989 1157 1222 12 13.5 16.5 56.5 63.5 66.5
Exp. val. 1062 1077 16.8 64.9
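On a common discretization grid, the synthesis operators of Equations (1)–(3) reduce to a pointwise minimum, maximum and weighted arithmetic mean. The sketch below uses two illustrative sources (not the Table 1 data); the maximal height of the conjunction is the agreement indicator used later when discussing Figure 6.

```python
def conjunction(dists):
    """Equation (1): pointwise minimum of the possibility distributions."""
    return [min(vals) for vals in zip(*dists)]

def disjunction(dists):
    """Equation (2): pointwise maximum of the possibility distributions."""
    return [max(vals) for vals in zip(*dists)]

def weighted_mean(dists, weights):
    """Equation (3): pointwise weighted arithmetic mean (weights sum to one)."""
    return [sum(w * v for w, v in zip(weights, vals)) for vals in zip(*dists)]

# Two illustrative sources described on the same temperature grid.
pi1 = [0.0, 0.5, 1.0, 0.5, 0.0]
pi2 = [0.0, 0.0, 0.5, 1.0, 0.5]

# The maximal height of the conjunction measures agreement:
# 1.0 means the sources fully agree somewhere; lower values mean conflict.
agreement = max(conjunction([pi1, pi2]))
```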

Figure 5. Probability and possibility distributions of NRI1 for the second PCT.

3.1 Evaluation

Table 2 shows the results of the evaluation steps performed on the results of the BEMUSE programme, with the models described above. From a methodological point of view, we can notice that the scores and the rankings between sources are globally in agreement, even if there are some differences coming from the differences between formalisms.

From a practical standpoint, interesting things can be said from the analysis of the results. First, our results are in accordance with informal observations made in previous reports (OCDE 2007): PSI and UNIPI have high informativeness scores, which reflects their narrow uncertainty bands, and very low calibration scores, due to the fact that, for each of them, two experimental values are outside the interval [Low, Upp]. This consistency between the conclusions drawn from our methods and the informal observations confirms that using formal methods to analyze information is meaningful.

Another noticeable result is that participants using the same code can have very different scores (both high and low, e.g. the global scores of RELAP5 users range from 0.025 to 0.59), which illustrates and confirms the well-known influence of the user on the result of a given computer code. Also note that, since the scores are built to be directly comparable, they can also be used as code validation tools (the better the global score, the better the information delivered by the code). We will see in the next section that using the results of the evaluation can improve the results of the synthesis.

3.2 Synthesis

Figure 6 shows some results of the synthesis for the PCT2. Since this variable is of critical importance in an accidental transient and is difficult to estimate, it is of particular interest in the current problem.

Figure 6.A shows the synthetic probabilities when we consider subgroups of participants using the same code. This figure indicates that, while CATHARE and RELAP5 users seem to underestimate the experimental value, ATHLET users tend to overestimate it. Figure 6.B shows the benefits of weighting sources or of selecting a subgroup of sources judged better by the evaluation step. Such a selection and weighting shifts the curves towards the experimental value (resulting in a better global calibration) and tightens their uncertainty bounds (resulting in a better global informativeness). We also see that the arithmetic mean tends to average the result, and that using probabilistic modeling does not allow us to see possible disagreements between sources. This can be problematic, since it is often desirable to detect and investigate the sources of such disagreements, particularly when synthesis tools are used to analyze the information.

Table 2. Scores resulting from evaluation (Inf.: informativeness ; Cal.: Calibration).

Prob. approach Poss. approach

Participant Used code Inf. Cal. Global Inf. Cal. Global


CEA CATHARE 0.77 0.16 0.12 0.71 0.55 0.40
GRS ATHLET 1.23 0.98 1.21 0.84 0.52 0.44
IRSN CATHARE 0.98 0.75 0.73 0.73 0.83 0.60
KAERI MARS 0.68 0.16 0.11 0.70 0.48 0.34
KINS RELAP5 1.29 0.16 0.21 0.72 0.67 0.49
NRI1 RELAP5 0.79 0.75 0.59 0.75 0.63 0.47
NRI2 ATHLET 0.79 0.13 0.10 0.78 0.72 0.56
PSI TRACE 1.6 0.004 0.008 0.88 0.25 0.22
UNIPI RELAP5 0.53 0.75 0.4 0.69 0.67 0.46
UPC RELAP5 1.44 0.02 0.025 0.87 0.28 0.24
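The Inf and Cal columns of Table 2 come from the measures of Section 2.2; the sketch below shows the core computations on a discretized grid (the distributions here are illustrative placeholders, not the BEMUSE data, and the final per-source normalization is omitted).

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discretized distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def prob_scores(p, u, r):
    """Probabilistic scores: informativeness I(p, u) against the uniform
    distribution u, calibration I(r, p) against the empirical distribution r."""
    return kl(p, u), kl(r, p)

def poss_informativeness(pi_values):
    """Possibilistic informativeness: normalized area between the ignorance
    model (pi = 1 on the whole variation domain) and the source distribution."""
    return sum(1.0 - v for v in pi_values) / len(pi_values)

def poss_calibration(pi, x_obs):
    """Possibilistic calibration: the possibility of the observed value."""
    return pi(x_obs)
```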

Figure 6 panels: 6.A Probabilities by used codes; 6.B Probabilities by scores; 6.C Possibilities conjunctions by used codes; 6.D Possibilities conjunctions by scores; 6.E Possibilities disjunction; 6.F Possibilities arithmetic weighted mean.
Figure 6. Results of synthesis for PCT2: probabilistic and possibilistic approaches (- - -: experimental value).

Figures 6.C and 6.D show the synthetic possibility distributions resulting from the application of a conjunctive operator (Equation (1)). In this case, the disagreement between the sources of a particular subgroup is directly visible, both graphically and quantitatively (disagreement is measured by the maximal height of a distribution: the lower the distribution, the higher the disagreement). We can thus see that the information given by ATHLET users is more conflicting than that given by CATHARE users (this could be explained by the higher number of input data parameters in the ATHLET code). Similarly, Figure 6.D shows that all sources strongly disagree when considered as a whole, but that the best sources globally agree with each other, and that taking only their information into account gives a more reliable synthesis.

Figures 6.E and 6.F respectively illustrate the synthetic possibility distributions resulting from the application of the disjunction (Equation (2)) and of the arithmetic weighted mean (Equation (3)) over all sources. Again, we can see in Figure 6.F that the arithmetic mean averages the result, thus smoothing the resulting curves. Figure 6.E well illustrates the potentially high imprecision resulting from the disjunction.
Although the resulting uncertainty model is reliable, its informative content appears of poor interest (e.g. the 50% confidence interval for the PCT2 temperature is [800, 1200] K, which is very broad).

4 CONCLUSIONS

We have applied methods to evaluate, synthesize and analyze information coming from multiple sources to the results of uncertainty studies on various computer codes. By using formal methods based on rationality requirements, the evaluations are made as objective as possible.

The proposed methods allow uncertainty (either aleatory or coming from imprecision in the data) to be taken explicitly into account in the evaluation process. They provide interesting tools to evaluate sources. In the particular case of computer codes, they give new instrumental tools for code validation procedures (Trucano, Swiler, Igusa, Oberkampf, and Pilch 2006), a problem particularly important for a nuclear safety institute such as the IRSN. The consistency between the conclusions drawn from our results and informal observations confirms that using formal methods to analyze information is meaningful and can be useful. Compared to such informal observations, the presented methods allow for a more subtle analysis, making it possible to quantify disagreement among sources, to detect biases, underestimated uncertainty, . . .

We have also illustrated the potential advantages offered by the use of possibility theory. In terms of information evaluation, the probabilistic and possibilistic approaches give comparable results (which is not surprising, since they are based on similar rationality requirements). However, the possibilistic approach has more flexibility to synthesize and analyze the information, offering a wider range of tools. The fact that both probabilities and possibilities can be seen as special cases of imprecise probabilities could be used to build a generalized approach, possibly by using some recent research results about measures of divergence for sets of probabilities (Abellan and Gomez 2006). Such a generalization remains the subject of further research. Also, since the results given by basic synthesizing operators can sometimes be found too rough, more complex tools allowing for a finer analysis are sometimes needed. This is why the IRSN is working on methods that are more complex but remain tractable and interpretable (Destercke, Dubois, and Chojnacki 2007).

REFERENCES

Abellan, J. and M. Gomez (2006). Measures of divergence on credal sets. Fuzzy Sets and Systems 157 (11).
Baudrit, C. and D. Dubois (2006). Practical representations of incomplete probabilistic knowledge. Computational Statistics and Data Analysis 51 (1), 86–108.
Clemen, R. and R. Winkler (1999). Combining probability distributions from experts in risk analysis. Risk Analysis 19 (2), 187–203.
Cooke, R. (1991). Experts in Uncertainty. Oxford, UK: Oxford University Press.
Destercke, S. and E. Chojnacki (2007). Methods for the evaluation and synthesis of multiple sources of information applied to nuclear computer codes. Accepted for publication in Nuclear Eng. and Design.
Destercke, S., D. Dubois, and E. Chojnacki (2007). Possibilistic information fusion using maximal coherent subsets. In Proc. IEEE Int. Conf. on Fuzzy Systems (FUZZ-IEEE).
Dubois, D. and H. Prade (1988). Possibility Theory: An Approach to Computerized Processing of Uncertainty. New York: Plenum Press.
Dubois, D. and H. Prade (1992). On the relevance of non-standard theories of uncertainty in modeling and pooling expert opinions. Reliability Engineering and System Safety 36, 95–107.
Ferson, S. and L.R. Ginzburg (1996). Different methods are needed to propagate ignorance and variability. Reliability Engineering and System Safety 54, 133–144.
Klement, E., R. Mesiar, and E. Pap (2000). Triangular Norms. Dordrecht: Kluwer Academic Publishers.
OCDE (2007, May). BEMUSE phase III report: Uncertainty and sensitivity analysis of the LOFT L2-5 test. Technical Report NEA/NCIS/R(2007)4, NEA.
Sandri, S., D. Dubois, and H. Kalfsbeek (1995, August). Elicitation, assessment and pooling of expert judgments using possibility theory. IEEE Trans. on Fuzzy Systems 3 (3), 313–335.
Trucano, T., L. Swiler, T. Igusa, W. Oberkampf, and M. Pilch (2006). Calibration, validation, and sensitivity analysis: What's what. Reliability Engineering and System Safety 91, 1331–1357.
Walley, P. (1991). Statistical Reasoning with Imprecise Probabilities. New York: Chapman and Hall.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Improving reliability using new processes and methods

S.J. Park & S.D. Park


CS Management Center, Samsung Electronics, Suwon, Korea

K.T. Jo
Digital Printing Division, Samsung Electronics, Suwon, Korea

ABSTRACT: To improve reliability of a newly designed product, we introduced new processes and methods.
The reliability definitions contain ‘‘probability’’, ‘‘intended function’’, ‘‘specified period’’ and ‘‘stated condi-
tions’’. Therefore, it is inevitable to research the operating condition, current state of probability and target of
probability, potential failure mode and mechanism of the product for the specified period.
We conducted a 4-step test program that is, Architecture and Failure Mode Analysis, Potential Failure Mech-
anism Analysis, Dominant Failure Extraction and Compliance Test. Based upon the Architecture and Failure
Mode Analysis, we selected the stress factors in an ALT and reproduced the field failure mode by an ALT.
Based on the results of these studies, test plans are designed to satisfy the reliability target. Stress analysis is also a useful tool to improve the reliability of a printed circuit assembly. HALT and HASS are respectively used
to improve reliability by finding root causes of latent defects at the stage of development and screening weak
products at the stage of pre-mass production in a short time. We conducted all these kinds of processes and
methods to improve the reliability of electronic devices developed for the first time.

1 INTRODUCTION

It is well known that the shape of the failure rate of a product in the field looks like Figure 1. The bathtub curve consists of three periods: an infant mortality period with a decreasing failure rate, followed by a normal life period (also known as ‘‘useful life’’) with a relatively constant and low failure rate, and concluding with a wear-out period that exhibits an increasing failure rate.

The most appropriate shape of the failure rate is illustrated as the target curve in Figure 1. Ryu called it the hockey stick line (Ryu, 2003). He mentioned that the failure rate of a bad product proceeds along the bathtub curve, while the failure rate of a good product follows the hockey stick line.

Figure 1. Bathtub curve and reliability actions.

The main causes of initial failures are quality defects such as workmanship errors or mal-manufacturing etc. These early failures are unacceptable from the viewpoint of customer satisfaction and result in a change of customers' loyalty to the company. Therefore, stress screening is usually used to avoid infant mortalities. Burn-in, ESS (Environmental Stress Screening) and HASS (Highly Accelerated Stress Screening) are the most representative methods that can eliminate such defects. Although appropriate specifications, adequate design tolerances and stress analysis for sufficient component derating can decrease initial failures, it is impossible to cover all possible interactions between components in operation. ALT (Accelerated Life Test) and HALT (Highly Accelerated Life Test) are used to reduce the level of the failure rate during the normal operating period. HALT, quite different from standard life testing, design verification testing and end-of-production testing, is becoming recognized as a powerful tool to improve product reliability, reducing warranty costs and increasing customer satisfaction. HALT has been widely used in industry to find out the weak points of a product in a short time (Hobbs, 2000). Although HALT is a powerful tool for finding the latent defects that can be eliminated or reduced prior to the life tests at the

development stage, however, it is difficult to estimate the lifetime of the product. Therefore, an ALT is conducted for the purpose of estimating the lifetime of the product as soon as possible and in an economical way. An ALT employs higher-than-usual levels of stress to quickly obtain reliability-related quantities (e.g., the qth quantile of the lifetime distribution). In the case of a printed circuit board assembly, parts stress analysis (derating) is a basic tool for assuring that stresses, either environmental or operational, are applied below rated values to enhance reliability by decreasing failure rates. It essentially prevents small changes in operating characteristics from creating large increases in failure rate (Reliability Toolkit, 2001).

Figure 2. Part stress vs. strength relationship.

As shown in Figure 2, the four basic stress derating approaches to consider are:

1. Decrease average stress (i.e., using a fan or heat sink).
2. Increase average strength (i.e., using a higher rated part if size and weight increases are not a problem).
3. Decrease stress variation (i.e., using a noise filter clamping higher stresses).
4. Decrease strength variation (i.e., holding tighter control over the process or by tests to eliminate defects).

The purpose of this stress analysis is to improve the reliability of the design by achieving the optimum balance between stress and strength. We also conducted transient analysis to protect against abnormal problems in the field. For example, the stress ratio of the main components is analyzed in case of mode changes, on/off testing of power, motor, heater, etc. More than 20 parts of the original design were changed to assure the reliability of the product through stress analysis.

2 RELIABILITY ASSURANCE PROCESS

Product reliability highly depends on the development process. Most companies have their own reliability assurance process. The definition of reliability involves ‘‘probability’’, ‘‘intended function’’, ‘‘specified period’’ and ‘‘stated conditions’’. Therefore, it is inevitable to study the operating conditions, the current and target levels of that probability, and the potential failure modes and mechanisms of the product for the specified period. In the past we used to do set-level reliability tests such as environmental, electrical and shipping tests. But nowadays we are trying to do unit-level life tests and parts stress analysis to estimate the lifetime of the designed product.

2.1 Define the operating environments

The first step of the reliability assurance process is to define the operating conditions exactly. They may include temperature, humidity, vibration level, voltage, frequency, altitude etc. Most industries have some experience of failures resulting from insufficient information on the use conditions. Operating hours or cycles per year are also important to set up the test plan. These environmental profiles are directly related to all kinds of reliability tests.

2.2 Extraction of potential failure modes and mechanisms

The second step is to find out the potential failure modes and mechanisms. FMEA (Failure Mode and Effect Analysis) is a traditional tool by which each potential failure mode in a system is analyzed to determine its results or effects on the system and to classify each potential failure mode regarding its severity. But FMEA has some limitations in finding out latent failure modes and mechanisms. Therefore we use the 4-step test program explained in detail in Section 3.

2.3 Allocation of the reliability goal to unit level

The third step is to determine the reliability target of the system and to allocate it to the unit level. This target is based on the failure rate of a previous similar product in the field or on the level of a competitor. Figure 3 shows an example of the failure rate allocation.
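The allocation of Figure 3 can be sketched as a simple equal apportionment (an illustrative assumption; in practice the split may be weighted by field data):

```python
def allocate_equal(system_rate, n_units):
    """Split the system's annual failure-rate target equally over n units."""
    return [system_rate / n_units] * n_units

# A 5%/year target for the set, split over 5 units -> 1%/year each.
unit_targets = allocate_equal(0.05, 5)
```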

Figure 3. Allocation of annual failure rate to the unit level (the set's annual failure rate n% is apportioned as 1% to each unit).

2.4 Determine the sample size and test time

We now describe how to make a test plan for assuring the target failure rate.

First, we explain the case of a constant failure rate. If the target failure rate is determined, the test plan to prove it can be set up. Tests can be censored (stopped) at either a pre-planned number of hours or a pre-planned number of failures. Censoring at a predetermined amount of time allows for scheduling when the test will be completed and is called Type I censoring. Censoring at a predetermined number of failures allows for planning the maximum number of units that will be required for testing and is referred to as Type II censoring (Nelson, 1990). In real situations, Type I censoring is usually adopted as the censoring scheme to meet the due date of a product development. For a one-sided upper confidence limit in a Type I censoring situation with a few failures, calculate:

λ ≤ [χ²(α; 2r + 2)/2] · (1/T)    (1)

where λ = failure rate; χ²(α; 2r + 2) = the 100(1 − α) percentile of the chi-square distribution with 2r + 2 degrees of freedom; r = the number of failures; T = the total time on test.

The failure rate should be less than or equal to the target. Therefore, the sample size under a 60% confidence level can be determined as follows (Ryu et al. 2003):

λ ≤ (r + 1) · 1/(n · h) ≤ λtarget

∴ n ≥ (r + 1) · 1/(λtarget · h)    (2)

where n = the sample size; h = the test hours.

For example, under the assumption that the target annual failure rate of the fuser unit is 1% and the worst-case operating time is 100 hrs per year, the sample size to be tested is calculated as follows.

Table 1. Determination of the sample size.

h = 500 hrs    r = 0    r = 1    r = 2
n              29       41       63

Accelerated tests (ATs) have been widely used in industry to reduce test time. If the AF (Acceleration Factor) is known in an AT, equation (2) changes as follows:

∴ n ≥ (r + 1) · 1/(λtarget · h · AF)    (3)

Therefore, we can reduce the sample size by AF times. In the case of a decreasing or increasing failure rate, the number of test samples would be calculated at about a 60% confidence level as follows (Ryu & Jang, 2005):

n ≥ (r + 1) · (1/x) · (LB/h)^β    (4)

where LB is the Bx life and x is the probability of failure until LB.

3 4-STEP TEST PROGRAMS

We use a 4-step test program, that is, architecture and failure mode analysis, potential failure mechanism

Figure 4. 4-step test program (Step 1: architecture and failure mode analysis; Step 2: potential failure mechanism analysis; Step 3: dominant failure extraction; Step 4: compliance test).

915
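The sample-size rules of Section 2.4 (equations 2–4) are straightforward to script. The sketch below is our own illustration, not the authors' tool; the function names and the example inputs (a 10⁻⁴/hr target, 500 test hours, AF = 4) are ours:

```python
import math

def n_constant_rate(r, lam_target, h, af=1.0):
    # Equations (2)-(3): n >= (r + 1) / (lam_target * h * AF),
    # for a constant failure rate and r allowed failures.
    return math.ceil((r + 1) / (lam_target * h * af))

def n_bx_life(r, x, lb, h, beta):
    # Equation (4): n >= (r + 1) * (1/x) * (LB/h)**beta,
    # for a decreasing or increasing failure rate (Weibull shape beta).
    return math.ceil((r + 1) * (1.0 / x) * (lb / h) ** beta)

# Target failure rate 1e-4 per hour, 500 test hours, no failures allowed:
print(n_constant_rate(0, 1e-4, 500))          # 20 samples
# The same test accelerated with AF = 4:
print(n_constant_rate(0, 1e-4, 500, af=4.0))  # 5 samples
```

As equation (3) states, a known acceleration factor divides the required sample size by AF.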
We use a 4-step test program, that is, architecture and failure mode analysis, potential failure mechanism analysis, dominant failure extraction and compliance test, to prevent field failures at the early stage of development. Based upon the architecture and failure mode analysis, we selected the stress factors in an ALT and reproduced the potential failure modes by ALTs. In this section, we will explain the 4-step test program step by step with the e-coil type heater roller of a printer developed in our company.

3.1 Architecture and failure mode

The first step is architecture and failure mode analysis. As shown in Table 2, the heater roller is composed of 7 parts. Figure 5 shows these items in more detail. To extract the potential failure modes in the field, we should analyze the special structure and its materials. Table 2 shows the key parts and their materials in the heater roller module.

Table 2. Key parts and their materials.

Key part         Material
1. Gear cap      Plastic
2. Ball bearing  Stainless steel
3. Outer pipe    Aluminum
4. Mica sheet    Mica
5. E-coil        Ni-Cr wire
6. Pi film       500 NH
7. Inner pipe    Aluminum

Figure 5. Components of an e-coil type heater roller: (a) heater roller; (b) e-coil inside of the heater roller.

3.2 Potential failure mechanism

Based upon the results of the structure and material analysis, we extracted the potential failure mechanisms. Table 3 shows the potential failure mechanism of each part.

Table 3. Extraction of potential failure mechanisms.

Key part         Potential failure mechanism
1. Gear cap      Melt/Crack
2. Ball bearing  Wear
3. Outer pipe    Deformation
4. Mica sheet    Tear off
5. E-coil        Breakdown
6. Pi film       Creep
7. Inner pipe    Deformation

3.3 Dominant failure mechanism extraction

In the real field, the failure time of the product is determined by the dominant failure mechanism stimulated by environmental stress factors such as temperature, humidity, voltage and so on. Table 4 shows the dominant failure mechanisms. As shown in Table 4, the dominant failure mechanism is breakdown of the e-coil wire.

3.4 Compliance test

Based on the relationship between the potential failure mechanisms and the environmental stress factors, we found the best compliance test for the module, as shown in Table 5. The main stress factors in an ALT are thermal cycle under power on/off.

Table 4. Dominant failure mechanisms.

Key part      Potential failure mechanism   Temp (0–50°C)   Humidity (0–85%RH)   Voltage (21–26V)   Point
Gear cap      Melt/Crack                    5               1                    –                  6
Ball bearing  Wear                          5               –                    –                  5
Outer pipe    Deformation                   1               –                    –                  1
Mica sheet    Tear off                      3               3                    –                  6
E-coil        Breakdown                     5               3                    1                  9
Pi film       Creep                         3               3                    –                  6
Inner pipe    Deformation                   1               –                    –                  1

Each cell gives the weight (5, 3 or 1) of the stress factor for that failure mechanism; Point is the row sum.
Table 5. Compliance test.

Potential failure mechanism   Point   Thermal cycle   Temp/humidity   Power on/off
Melt/Crack                    6       5               1               1
Wear                          5       5               –               –
Deformation                   1       1               1               –
Tear off                      6       3               –               1
Breakdown                     9       5               3               5
Creep                         6       1               3               –
Deformation                   1       1               1               –
Total                                 126             53              57
Ranking                               1               3               2

Each cell gives the weight (5, 3 or 1) of the test type for that failure mechanism; Total is the sum of Point × weight over all mechanisms.

Figure 6. Test profiles for an accelerated life test (1 cycle = 5 hours: segments of 30, 90, 60, 90 and 30 minutes over temperature levels of 32°C, 22°C and −5°C).
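The test-type ranking of Table 5 is a point-weighted sum and can be reproduced in a few lines. Note that the 5/3/1 cell weights below are partly our reconstruction: where the printed symbols are ambiguous we chose the assignment that reproduces the printed totals (126, 53, 57).

```python
# (mechanism, point, weights for thermal cycle / temp-humidity / power on-off);
# weight 0 stands for "-" (the test type does not excite that mechanism).
# The 5/3/1 assignment is partly reconstructed from the printed totals.
rows = [
    ("Melt/Crack",  6, (5, 1, 1)),
    ("Wear",        5, (5, 0, 0)),
    ("Deformation", 1, (1, 1, 0)),
    ("Tear off",    6, (3, 0, 1)),
    ("Breakdown",   9, (5, 3, 5)),
    ("Creep",       6, (1, 3, 0)),
    ("Deformation", 1, (1, 1, 0)),
]
tests = ("Thermal cycle", "Temp/humidity", "Power on/off")

# Total score of a test type = sum of (point * weight) over all mechanisms.
totals = [sum(point * w[j] for _, point, w in rows) for j in range(3)]
ranking = sorted(range(3), key=lambda j: -totals[j])
for j in ranking:
    print(tests[j], totals[j])
```

The thermal cycle test scores highest, which is why it is ranked first in Table 5.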

Table 6. Normal environmental conditions.

Environmental condition   Min   Mean   Max   Remark
Temperature (°C)          0     30     50    Operating
Humidity (%RH)            10    65     85

Figure 7. Test system.


4 ACCELERATED LIFE TEST

4.1 Test plans

Based upon the architecture and failure mode analysis, we selected the stress factors in an ALT and determined the test plans as follows. Table 6 shows the normal environmental conditions and Figure 6 shows the test profiles for ALTs of the module.

We conducted tests at the set and unit level using the above thermal cycle profiles. The purpose of the unit level accelerated test is to save the time and paper used in the set level test. Figure 7 shows the unit level test system. It is composed of the fuser unit of the printer, a continuous monitoring module, a rotating motor and a temperature/humidity chamber.

4.2 Test results

We observed whether any failures occurred, and obtained the results shown in Table 7 and Figure 8. The main failure mode was breakdown of the e-coil. The failure reproduced at the unit level was the same as the set level failure. We analyzed the above data using the Weibull distribution. Its pdf is given by

f(t) = (β/η) · (t/η)^(β−1) · exp[−(t/η)^β]    (5)

where t > 0, β > 0, η > 0, and its reliability function is given by

R(t) = exp[−(t/η)^β]    (6)

The maximum likelihood estimates (MLEs) of the Weibull distribution parameters for the lifetime data are presented in Table 8 and Figure 9.

Table 7. Test results of the accelerated tests.

Test condition   Failure times (hrs)
Set level        72, 612, 923
Unit level       24, 45, 143, 338

Figure 8. Failure mode – breakdown of e-coil.
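The Weibull fit of equation (5) can be sketched as a plain maximum likelihood estimation. The estimates in Table 8 come from the authors' analysis of censored data; the sketch below (our own illustration) is only a complete-sample two-parameter fit, applied to the unit-level failure times of Table 7, so it will not reproduce Table 8 exactly.

```python
import math

times = [24.0, 45.0, 143.0, 338.0]  # unit-level failure times from Table 7

def beta_equation(beta, t):
    # Profile-likelihood equation for the Weibull shape parameter
    # (complete sample): sum(t^b ln t)/sum(t^b) - 1/b - mean(ln t) = 0.
    s1 = sum(x ** beta * math.log(x) for x in t)
    s0 = sum(x ** beta for x in t)
    return s1 / s0 - 1.0 / beta - sum(math.log(x) for x in t) / len(t)

def weibull_mle(t, lo=0.05, hi=20.0, iters=200):
    # Bisection; beta_equation is increasing in beta, so the root is
    # bracketed once the endpoints have opposite signs.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if beta_equation(mid, t) < 0.0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    eta = (sum(x ** beta for x in t) / len(t)) ** (1.0 / beta)
    return beta, eta

beta, eta = weibull_mle(times)
print(beta, eta)
```

A production analysis would instead maximize the censored likelihood, since the tests were stopped before all units failed.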
Table 8. Weibull parameters.

Test condition   β      η
Set level        1.15   556.41
Unit level       1.15   145.53

Figure 9. Weibull probability plot.

4.3 Acceleration factor

Suppose that the qth quantile of a lifetime distribution is ts for the set level test and tu for the unit level test. Then the acceleration factor (AF) for those two conditions is

AF = ts/tu = 556.41/145.53 = 3.82    (7)

Therefore, we can save test time by a factor of 3.82 by using the unit level test compared with the set level test.

5 DESIGN CHANGE AND EVALUATION

5.1 Design improvement

The heater roller was redesigned to improve its reliability. We adopted a new structure and thickness of the inner pipe that can protect against breakdown of the e-coil wire by thermal cycling.

5.2 Reliability evaluation

We made accelerated test plans to assure the reliability of the newly designed heater roller based upon the calculated AF and equation (4). The annual printing time in the field is 104 hours. The sample size needed to assure B5 = 5 years, that is, a 5% probability of failure within 5 years, under a confidence level of 60% is calculated as shown in Table 9.

Table 9. Test plan for assurance test.

Test time (hrs)   200   300   400
Sample size       12    7     5

* The number of allowable failures is 0.

The newly designed assembly was subjected to an accelerated test to evaluate its reliability. We tested 7 fusers for 300 hrs and could not find any failure. Therefore, the cumulative failure probability of the newly designed assembly over 5 years should be less than 5%.

6 CONCLUSIONS

In this paper, we introduced a reliability assurance process and related tools such as the 4-step test program. In particular, parts stress analysis and ALTs at the unit level have had excellent effects in decreasing the initial service call rate in recent years. We explained the process using the e-coil type heater roller in a laser beam printer. The unit level accelerated test was developed to save the time and money of set level testing. In a unit level test, we reproduced the same failure of the heater roller as in a set level test. We also developed ALT plans to assure the reliability goal of the module. The heater roller was redesigned to reduce the failure rate and was evaluated by the ALT plan suggested in this paper.

The most important thing is to analyze and try to decrease the gap between the estimated Bx life and the one obtained from the real field data.

REFERENCES

ALTA reference manual. 1998. Reliasoft.
Kececioglu, D. & Sun, F.B. 1995. Environmental Stress Screening. Prentice Hall.
Nelson, W. 1990. Accelerated Testing. John Wiley & Sons.
Park, S.J. 2000. ALT++ reference manual. Samsung Electronics Co., Ltd.
Park, S.J., Park, S.D. & Kim, K.S. 2006. Evaluation of the Reliability of a Newly Designed Duct-Scroll by Accelerated Life Test. ESREL 2006.
Reliability toolkit. 2001. RAC Publication, CPE, Commercial Practices Edition.
Ryu, D.S., Park, S.J. & Jang, S.W. 2003. The Novel Concepts for Reliability Technology. 11th Asia-Pacific Conference on Non-Destructive Testing, Nov. 3–7.
Ryu, D.S. & Jang, S.W. 2005. The Novel Concepts for Reliability Technology. Microelectronics Reliability 45: 611–622.

Life test applied to Brazilian friction-resistant low alloy-high strength steel rails

Daniel I. De Souza
North Fluminense University & Fluminense Federal University, Campos & Niterói, RJ, Brazil

Assed Naked Haddad


Rio de Janeiro Federal University, Rio de Janeiro, RJ, Brazil

Daniele Rocha Fonseca


North Fluminense State University, Campos, RJ, Brazil

ABSTRACT: In this work we will apply a combined approach of a sequential life testing and an accelerated
life testing to friction-resistant low alloy-high strength steel rails used in Brazil. One possible way to translate test
results obtained under accelerated conditions to normal use conditions could be through the application of the
‘‘Maxwell Distribution Law.’’ To estimate the three parameters of the underlying Inverse Weibull sampling model
we will use a maximum likelihood approach for censored failure data. We will be assuming a linear acceleration
condition. To evaluate the accuracy (significance) of the parameter values obtained under normal conditions for
the underlying Inverse Weibull model we will apply to the expected normal failure times a sequential life testing
using a truncation mechanism developed by De Souza (2005). An example will illustrate the application of this
procedure.

1 INTRODUCTION

The sequential life testing approach is an attractive alternative to that of predetermined, fixed sample size hypothesis testing because of the fewer observations required for its use, especially when the underlying sampling distribution is the three-parameter Inverse Weibull model. It happens that even with the use of a sequential life testing mechanism, sometimes the number of items necessary to reach a decision about accepting or rejecting a null hypothesis is quite large (De Souza 2000, De Souza 2001). Then, for a three-parameter underlying Inverse Weibull distribution, a truncation mechanism for this life-testing situation was developed by De Souza (2005), and an application of this mechanism was presented by De Souza & Haddad (2007). But it happens that sometimes the amount of time available for testing could be considerably less than the expected lifetime of the component. To overcome such a problem, there is the accelerated life-testing alternative aimed at forcing components to fail by testing them at much higher-than-intended application conditions. These models are known as acceleration models. One possible way to translate test results obtained under accelerated conditions to normal use conditions could be through the application of the ‘‘Maxwell Distribution Law.’’

The Inverse Weibull model was developed by Erto (1982). It has been used in Bayesian reliability estimation to represent the information available about the shape parameter of an underlying Weibull sampling distribution, as in (Erto 1982, De Souza & Lamberson 1995). It has a location (or minimum life), a scale and a shape parameter. It has also been used in reliability estimation of electronic products, where it seems to have a better answer to the accuracy problem presented by the Weibull model, as shown by De Souza (2005). It happens that when the shape parameter of the Weibull model is greater than 7, the Weibull curve becomes highly pointed, resulting in some computational difficulty (accuracy) in calculating the component's characteristics of interest. The three-parameter Inverse Weibull distribution has a location, a scale and a shape parameter. Its density function is given by:

f(t) = (β/θ) · (θ/(t − ϕ))^(β+1) · exp[−(θ/(t − ϕ))^β];    t ≥ 0; θ, β, ϕ > 0.
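The three-parameter Inverse Weibull density and its distribution function F(t) = exp[−(θ/(t − ϕ))^β] are easy to evaluate numerically. A short sketch (our own illustration; the parameter values are the normal-use estimates reached later in Section 7):

```python
import math

def inv_weibull_pdf(t, theta, beta, phi):
    # f(t) = (beta/theta) * (theta/(t-phi))**(beta+1) * exp(-(theta/(t-phi))**beta)
    z = theta / (t - phi)
    return (beta / theta) * z ** (beta + 1) * math.exp(-(z ** beta))

def inv_weibull_cdf(t, theta, beta, phi):
    # F(t) = exp(-(theta/(t-phi))**beta), valid for t > phi.
    return math.exp(-((theta / (t - phi)) ** beta))

# Normal-use estimates from Section 7: beta = 8.4, theta = 2318.0 h, phi = 430.9 h.
print(inv_weibull_cdf(2852.0, 2318.0, 8.4, 430.9))  # close to 0.5
```

The density is the derivative of F(t), which is a quick numerical consistency check on the two formulas.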
2 THE ACCELERATING CONDITION

The ‘‘Maxwell Distribution Law,’’ which expresses the distribution of kinetic energies of molecules, is given by the following equation:

MTE = Mtot × e^(−E/KT)    (1)

MTE represents the number of molecules at a particular absolute Kelvin temperature T (Kelvin = 273.16 plus the temperature in Centigrade) that possess a kinetic energy greater than E among the total number of molecules present, Mtot; E is the energy of activation of the reaction and K represents the gas constant (1.986 calories per mole). Equation 1 expresses the probability of a molecule having energy in excess of E. The acceleration factor AF2/1 at two different stress temperatures, T2 and T1, will be given by the ratio of the number of molecules having energy E at these two different temperatures, that is:

AF2/1 = MTE(2)/MTE(1) = e^(−E/KT2)/e^(−E/KT1)

AF2/1 = exp[(E/K) · (1/T1 − 1/T2)]    (2)

Applying the natural logarithm to both sides of Equation 2 and after some algebraic manipulation, we will obtain:

ln(AF2/1) = ln[MTE(2)/MTE(1)] = (E/K) · (1/T1 − 1/T2)    (3)

From Equation 3 we can estimate the term E/K by testing at two different stress temperatures and computing the acceleration factor on the basis of the fitted distributions. Then:

E/K = ln(AF2/1) / (1/T1 − 1/T2)    (4)

The acceleration factor AF2/1 will be given by the relationship θ1/θ2, with θi representing a scale parameter or a percentile at a stress level corresponding to Ti. Once the term E/K is determined, the acceleration factor AF2/n to be applied at the normal stress temperature is obtained from Equation 2 by replacing the stress temperature T1 with the temperature at the normal condition of use, Tn. Then:

AF2/n = exp[(E/K) · (1/Tn − 1/T2)]    (5)

De Souza (2005) has shown that, under a linear acceleration assumption, if a three-parameter Inverse Weibull model represents the life distribution at one stress level, a three-parameter Inverse Weibull model also represents the life distribution at any other stress level. We will be assuming a linear acceleration condition. In general, the scale parameter and the minimum life can be estimated by using two different stress levels (temperature or cycles or miles, etc.), and their ratios will provide the desired values for the acceleration factors AFθ and AFϕ. Then:

AFθ = θn/θa    (6)

AFϕ = ϕn/ϕa    (7)

According to De Souza (2005), for the Inverse Weibull model the cumulative distribution function at the normal testing condition, Fn(tn − ϕn), for a certain testing time t = tn, will be given by:

Fn(t) = Fa(t/AF) = exp[−((θn/AF) / (t/AF − ϕn/AF))^βn]    (8)

Equation 8 tells us that, under a linear acceleration assumption, if the life distribution at one stress level is Inverse Weibull, the life distribution at any other stress level is also an Inverse Weibull model. The shape parameter remains the same, while the accelerated scale parameter and the accelerated minimum life are multiplied by the acceleration factor. The equal shape parameter is a necessary mathematical consequence of the other two assumptions: assuming a linear acceleration model and an Inverse Weibull sampling distribution. If different stress levels yield data with very different shape parameters, then either the Inverse Weibull sampling distribution is the wrong model for the data or we do not have a linear acceleration condition.

3 HYPOTHESIS TESTING SITUATIONS

The hypothesis testing situations will be given by:

1. For the scale parameter θ:

H0: θ ≥ θ0;  H1: θ < θ0

The probability of accepting H0 will be set at (1 − α) if θ = θ0. If θ = θ1 where θ1 < θ0, the probability of accepting H0 will be set at a low level γ.

2. For the shape parameter β:

H0: β ≥ β0;  H1: β < β0

The probability of accepting H0 will be set at (1 − α) if β = β0. If β = β1 where β1 < β0, the
probability of accepting H0 will also be set at a low level γ.

3. For the location parameter ϕ:

H0: ϕ ≥ ϕ0;  H1: ϕ < ϕ0

Again, the probability of accepting H0 will be set at (1 − α) if ϕ = ϕ0. Now, if ϕ = ϕ1 where ϕ1 < ϕ0, then the probability of accepting H0 will be once more set at a low level γ.

4 SEQUENTIAL TESTING

According to (Kapur & Lamberson 1977, De Souza 2004), the development of a sequential test uses the likelihood ratio given by the following relationship:

L1;n / L0;n

The sequential probability ratio (SPR) will be given by SPR = L1,1,1;n / L0,0,0;n, or yet, according to De Souza (2004), for the Inverse Weibull model, the sequential probability ratio (SPR) will be:

SPR = [(β1 θ1^β1) / (β0 θ0^β0)]^n × ∏_{i=1}^{n} (ti − ϕ0)^(β0+1) / (ti − ϕ1)^(β1+1) × exp{− Σ_{i=1}^{n} [θ1^β1/(ti − ϕ1)^β1 − θ0^β0/(ti − ϕ0)^β0]}

So, the continue region becomes A < SPR < B, where A = γ/(1 − α) and B = (1 − γ)/α. We will accept the null hypothesis H0 if SPR ≥ B, and we will reject H0 if SPR ≤ A. Now, if A < SPR < B, we will take one more observation. Then, by taking the natural logarithm of each term in the above inequality and rearranging, we get:

n ln[(β1 θ1^β1) / (β0 θ0^β0)] − ln[(1 − γ)/α] < X < n ln[(β1 θ1^β1) / (β0 θ0^β0)] + ln[(1 − α)/γ]    (9)

X = − Σ_{i=1}^{n} [θ1^β1/(ti − ϕ1)^β1 − θ0^β0/(ti − ϕ0)^β0] − (β0 + 1) Σ_{i=1}^{n} ln(ti − ϕ0) + (β1 + 1) Σ_{i=1}^{n} ln(ti − ϕ1)    (10)

5 EXPECTED SAMPLE SIZE OF A SEQUENTIAL LIFE TESTING

According to Mood & Graybill (1963), an approximate expression for the expected sample size E(n) of a sequential life testing will be given by:

E(n) = E(Wn*)/E(w)    (11)

Here, w is given by:

w = ln[f(t; θ1, β1, ϕ1) / f(t; θ0, β0, ϕ0)]    (12)

The variate Wn* takes on only values in which Wn* exceeds ln(A) or falls short of ln(B). When the true distribution is f(t; θ, β, ϕ), the probability that Wn* takes the value ln(A) is P(θ, β, ϕ), while the probability that it takes the value ln(B) is 1 − P(θ, β, ϕ). Then, according to Mood & Graybill (1963), the expression for the expected value of the variate Wn* will be given by:

E(Wn*) ≈ P(θ, β, ϕ) ln(A) + [1 − P(θ, β, ϕ)] ln(B)    (13)

Hence, with A = γ/(1 − α) and B = (1 − γ)/α, Equation 11 becomes:

E(n) ≈ {P(θ, β, ϕ) ln(A) + [1 − P(θ, β, ϕ)] ln(B)} / E(w)    (14)

Equation 14 enables one to compare sequential tests with fixed sample size tests. The proofs of the existence of Equations 11 to 14 can be found in Mood & Graybill (1963), pp. 391–392.

For a three-parameter Inverse Weibull sampling distribution, the expected value of Equation 12 will be given by:

E(w) = ln(C) + (β0 + 1) E[ln(ti − ϕ0)] − (β1 + 1) E[ln(ti − ϕ1)] − θ1^β1 E[1/(ti − ϕ1)^β1] + θ0^β0 E[1/(ti − ϕ0)^β0]    (15)

The solution for the components of Equation 15 can be found in De Souza (2004).
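The decision thresholds A = γ/(1 − α) and B = (1 − γ)/α and the continue region can be scripted directly. A sketch (our own illustration; the function names are ours) using the acceptance convention stated in Section 4; for the α = 0.05 and γ = 0.10 of the example in Section 7, this gives ln B = 2.8904 and ln A = −2.2513:

```python
import math

def log_bounds(alpha, gamma):
    # ln(A) and ln(B), with A = gamma/(1 - alpha) and B = (1 - gamma)/alpha.
    return math.log(gamma / (1.0 - alpha)), math.log((1.0 - gamma) / alpha)

def decide(log_spr, alpha, gamma):
    # Sequential decision on the log of the probability ratio: accept H0 at or
    # above ln(B), reject at or below ln(A), otherwise take one more observation
    # (the convention stated in Section 4).
    ln_a, ln_b = log_bounds(alpha, gamma)
    if log_spr >= ln_b:
        return "accept H0"
    if log_spr <= ln_a:
        return "reject H0"
    return "continue testing"

ln_a, ln_b = log_bounds(0.05, 0.10)
print(ln_a, ln_b)  # approximately -2.2513 and 2.8904
```

In practice the statistic X of equation (10) is recomputed after each observation and compared with the moving limits of equation (9).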
6 MAXIMUM LIKELIHOOD ESTIMATION FOR THE INVERSE WEIBULL MODEL FOR CENSORED TYPE II DATA (FAILURE CENSORED)

The maximum likelihood estimator for the shape, scale and minimum life parameters of an Inverse Weibull sampling distribution for censored Type II data (failure censored) will be given by:

L(β; θ; ϕ) = k! ∏_{i=1}^{r} f(ti) × [1 − F(tr)]^(n−r), or yet:

L(β; θ; ϕ) = k! ∏_{i=1}^{r} f(ti) × [R(tr)]^(n−r);  t > 0    (16)

f(ti) = (β/θ) · (θ/(ti − ϕ))^(β+1) · exp[−(θ/(ti − ϕ))^β]    (17)

R(tr) = exp[−(θ/(tr − ϕ))^β]    (18)

L(β; θ; ϕ) = k! β^r θ^(βr) ∏_{i=1}^{r} [1/(ti − ϕ)]^(β+1) × exp[− Σ_{i=1}^{r} (θ/(ti − ϕ))^β] × exp[−(n − r) · (θ/(tr − ϕ))^β]    (19)

The log likelihood function L = ln[L(β; θ; ϕ)] will be given by:

L = ln(k!) + r ln(β) + rβ ln(θ) − (β + 1) Σ_{i=1}^{r} ln(ti − ϕ) − Σ_{i=1}^{r} (θ/(ti − ϕ))^β − (n − r) · (θ/(tr − ϕ))^β

To find the values of θ, β and ϕ that maximize the log likelihood function, we take the θ, β and ϕ derivatives and set them equal to zero. Then, we will have:

dL/dθ = rβ/θ − βθ^(β−1) Σ_{i=1}^{r} (1/(ti − ϕ))^β − (n − r) βθ^(β−1) (1/(tr − ϕ))^β = 0    (20)

dL/dβ = r/β + r ln(θ) − Σ_{i=1}^{r} ln(ti − ϕ) − Σ_{i=1}^{r} (θ/(ti − ϕ))^β × ln(θ/(ti − ϕ)) − (n − r) (θ/(tr − ϕ))^β ln(θ/(tr − ϕ)) = 0    (21)

dL/dϕ = (β + 1) Σ_{i=1}^{r} 1/(ti − ϕ) − βθ^β × [Σ_{i=1}^{r} (1/(ti − ϕ))^(β+1) + (n − r) (1/(tr − ϕ))^(β+1)] = 0    (22)

From Equation 20 we obtain:

θ = [r / (Σ_{i=1}^{r} (1/(ti − ϕ))^β + (n − r) (1/(tr − ϕ))^β)]^(1/β)    (23)

Notice that, when β = 1, Equation 23 reduces to the maximum likelihood estimator for the inverse two-parameter exponential distribution. Using Equation 23 for θ in Equations 21 and 22 and applying some algebra, Equations 21 and 22 reduce to:

r/β − Σ_{i=1}^{r} ln(ti − ϕ) + r × [Σ_{i=1}^{r} (1/(ti − ϕ))^β ln(ti − ϕ) + (n − r) (1/(tr − ϕ))^β ln(tr − ϕ)] / [Σ_{i=1}^{r} (1/(ti − ϕ))^β + (n − r) (1/(tr − ϕ))^β] = 0    (24)

(β + 1) Σ_{i=1}^{r} 1/(ti − ϕ) − βr × [Σ_{i=1}^{r} (1/(ti − ϕ))^(β+1) + (n − r) (1/(tr − ϕ))^(β+1)] / [Σ_{i=1}^{r} (1/(ti − ϕ))^β + (n − r) (1/(tr − ϕ))^β] = 0    (25)
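For a fixed minimum life ϕ, equations 23 and 24 lend themselves to a simple numerical solution (the full procedure would then update ϕ through equation 25 or 26). A sketch (our own illustration) using the twelve 520 K failure times of Table 2 with n = 16 and ϕ held fixed, purely for demonstration, at the paper's estimate of 100.2 hours; because ϕ is not re-estimated here, the resulting β and θ need not match the paper's values:

```python
import math

def g_beta(beta, t, n, phi):
    # Equation 24 with theta eliminated through equation 23 (Type II censoring:
    # r observed failures in t, n items on test, censoring at t_r = max(t)).
    r = len(t)
    tr = max(t)
    s0 = sum((1.0 / (x - phi)) ** beta for x in t) + (n - r) * (1.0 / (tr - phi)) ** beta
    s1 = (sum((1.0 / (x - phi)) ** beta * math.log(x - phi) for x in t)
          + (n - r) * (1.0 / (tr - phi)) ** beta * math.log(tr - phi))
    return r / beta - sum(math.log(x - phi) for x in t) + r * s1 / s0

def solve_beta_theta(t, n, phi, lo=0.1, hi=50.0, iters=200):
    # Bisection on equation 24 (decreasing in beta), then theta from equation 23.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g_beta(mid, t, n, phi) > 0.0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    r, tr = len(t), max(t)
    s0 = sum((1.0 / (x - phi)) ** beta for x in t) + (n - r) * (1.0 / (tr - phi)) ** beta
    theta = (r / s0) ** (1.0 / beta)
    return beta, theta

# Twelve 520 K failure times of Table 2; 16 items on test, truncated at the
# twelfth failure. phi is held fixed here for illustration only.
times_520k = [652.5, 673.6, 683.1, 692.9, 705.1, 725.4,
              738.2, 769.2, 776.6, 784.9, 816.0, 981.9]
beta_hat, theta_hat = solve_beta_theta(times_520k, n=16, phi=100.2)
print(beta_hat, theta_hat)
```

Only the shape equation needs to be solved implicitly; the scale parameter then follows in closed form, which is the attraction of the substitution in equation 23.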
Equations 24 and 25 must be solved iteratively. The problem was reduced to the simultaneous solution of the two iterative Equations 24 and 25. The simultaneous solution of two iterative equations can be seen as relatively simple when compared to the arduous task of solving the three simultaneous iterative Equations 20, 21 and 22, as outlined by Harter (Harter et al. 1965). Even though this is the present case, one possible simplification in solving for estimates when all three parameters are unknown could be the following approach proposed by Bain (1978).

For example, let us suppose that β̂ and θ̂ represent the good linear unbiased estimators (GLUEs) of the shape parameter β and of the scale parameter θ for a fixed value of the minimum life ϕ. We could choose an initial value for ϕ to obtain the estimators β̂ and θ̂, and then apply these two values in Equation 25, that is, the maximum likelihood equation for the minimum life ϕ. An estimate ϕ̆ can then be obtained from Equation 25; then the GLUEs of β and of θ can be recalculated for the new estimate ϕ̆, and a second estimate for the minimum life ϕ obtained from Equation 25. Continuing this iteration would lead to approximate values of the maximum likelihood estimators. As we can notice, the advantage of using the GLUEs in this iteration is that only one equation must be solved implicitly. The existence of solutions to the above set of Equations 24 and 25 has been frequently addressed by researchers, as there can be more than one solution or none at all; see Zanakis & Kyparisis (1986).

The standard maximum likelihood method for estimating the parameters of the three-parameter Weibull model can have problems since the regularity conditions are not met; see (Murthy et al. 2004, Blischke 1974, Zanakis & Kyparisis 1986). To overcome this regularity problem, one of the approaches proposed by Cohen (Cohen et al. 1984) is to replace Equation 25 with the equation

n × θ × (g/3) × Σ_{i=1}^{k+1} Ui^(−1/β) (1 − e^(−Ui))^n × (1, 2 or 4) + n × ϕj × (g/3) × Σ_{i=1}^{k+1} (1 − e^(−Ui))^n × (1, 2 or 4) = t1    (26)

Here, t1 is the first order statistic in a sample of size n. In solving the maximum likelihood equations, we will use this approach proposed by Cohen (Cohen et al. 1984). Appendix 1 shows the derivation of Equation 26.

Some preliminary life testing is then performed in order to determine an estimated value for the first failure time t1 of the underlying Inverse Weibull sampling distribution. Then, using a simple search procedure, we can determine a ϕ0 value which should make the left side of Equation 26 equal to the first failure time t1. When the decisions about the quantities θ0, θ1, β0, β1, ϕ0, ϕ1, α, γ and P(θ, β, ϕ) are made, and after E(n) is calculated, the sequential test is totally defined.

7 EXAMPLE

We are trying to determine the values of the shape, scale and minimum life parameters of an underlying three-parameter Inverse Weibull model representing the life cycle of a new friction-resistant low alloy-high strength steel rail. Once a life curve for this steel rail is determined, we will be able to verify, using sequential life testing, whether newly produced units have the necessary required characteristics. It happens that the amount of time available for testing is considerably less than the expected lifetime of the component. So, we will have to rely on an accelerated life testing procedure to obtain the failure times used in the parameter estimation procedure. The steel rail has a normal operating temperature of 296 K (about 23 degrees Centigrade). Under stress testing at 480 K, 16 steel rail items were subjected to testing, with the testing being truncated at the moment of occurrence of the twelfth failure. Table 1 shows these failure time data (hours). Now, under stress testing at 520 K, 16 steel rail items were again subjected to testing, with the testing being truncated at the moment of occurrence of the twelfth failure. Table 2 shows these failure time data (hours).

Table 1. Failure times (hours) of steel rail items tested under accelerated temperature conditions (480 K).

765.1    843.6      850.4
862.2    877.3      891.0
909.4    930.9      952.4
973.2    1,014.7    1,123.6

Table 2. Failure times (hours) of steel rail items tested under accelerated temperature conditions (520 K).

652.5    673.6    683.1
692.9    705.1    725.4
738.2    769.2    776.6
784.9    816.0    981.9

Using the maximum likelihood estimator approach for the shape parameter β, for the scale parameter θ and for the minimum life ϕ of the Inverse Weibull model for censored Type II data (failure censored), we obtain the following values for these three parameters
under accelerated conditions of testing:

At 480 K: β1 = βn = β = 8.38; θ1 = 642.3 hours; ϕ1 = 117.9 hours.
At 520 K: β2 = βn = β = 8.41; θ2 = 548.0 hours; ϕ2 = 100.2 hours.

The shape parameter did not change, with β ≈ 8.4. The acceleration factor for the scale parameter AFθ2/1 will be given by:

AFθ2/1 = θ1/θ2 = 642.3/548.0

Using Equation 4, we can estimate the term E/K:

E/K = ln(AF2/1) / (1/T1 − 1/T2) = ln(642.3/548.0) / (1/480 − 1/520) = 990.8

Using now Equation 5, the acceleration factor for the scale parameter, to be applied at the normal stress temperature, AFθ2/n, will be:

AF2/n = exp[(E/K) · (1/Tn − 1/T2)]

AF2/n = exp[990.8 × (1/296 − 1/520)] = 4.23

Therefore, the scale parameter of the component at normal operating temperatures is estimated to be:

θn = AF2/n × θ2 = 4.23 × 548.0 = 2,318.0 hours

The acceleration factor for the minimum life parameter AFϕ2/1 will be given by:

AFϕ2/1 = ϕ1/ϕ2 = 117.9/100.2

Again applying Equation 4, we can estimate the term E/K. Then:

E/K = ln(AF2/1) / (1/T1 − 1/T2) = ln(117.9/100.2) / (1/480 − 1/520) = 1,015.1

Using once more Equation 5, the acceleration factor for the minimum life parameter, to be applied at the normal stress temperature, AFϕ2/n, will be:

AFϕ2/n = exp[1,015.1 × (1/296 − 1/520)] = 4.38

Then, as we expected, AFθ = 4.23 ≈ AFϕ = 4.38 ≈ AF = 4.3. Finally, the minimum life parameter of the component at normal operating temperatures is estimated to be:

ϕn = AFϕ2/n × ϕ2 = 4.3 × 100.2 = 430.9 hours

Then, the steel rail life when operating at normal use conditions could be represented by a three-parameter Inverse Weibull model having a shape parameter β of 8.4, a scale parameter θ of 2,318.0 hours and a minimum life ϕ of 430.9 hours. To evaluate the accuracy (significance) of the three parameter values obtained under normal conditions for the underlying Inverse Weibull model, we will apply to the expected normal failure times a sequential life testing using a truncation mechanism developed by De Souza (2004). These expected normal failure times will be acquired by multiplying the twelve failure times obtained under accelerated testing conditions at 520 K, given in Table 2, by the accelerating factor AF of 4.3. It was decided that the value of α was 0.05 and γ was 0.10. In this example, the following values for the alternative and null parameters were chosen: alternative scale parameter θ1 = 2,100 hours, alternative shape parameter β1 = 7.8 and alternative location parameter ϕ1 = 380 hours; null scale parameter θ0 = 2,320 hours, null shape parameter β0 = 8.4 and null minimum life parameter ϕ0 = 430 hours. Now, electing P(θ, β, ϕ) to be 0.01, we can calculate the expected sample size E(n) of the sequential life testing under analysis. Using now Equation 15, the expression for E(w), we will have:

E(w) = −5.501 + 9.4 × 7.802 − 8.8 × 7.749 − 0.033 + 1.0 = 0.611
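The arithmetic of this example (equations 4 and 5 for the acceleration factors, equations 13 and 14 for the expected sample size) can be checked with a short script (our own illustration, not part of the paper):

```python
import math

def e_over_k(af, t1, t2):
    # Equation 4: E/K from an acceleration factor fitted at temperatures T1, T2 (K).
    return math.log(af) / (1.0 / t1 - 1.0 / t2)

def accel_factor(ek, t_n, t2):
    # Equation 5: acceleration factor from stress temperature T2 down to T_n.
    return math.exp(ek * (1.0 / t_n - 1.0 / t2))

# Scale parameter: 642.3 h at 480 K versus 548.0 h at 520 K.
ek_theta = e_over_k(642.3 / 548.0, 480.0, 520.0)  # about 990.8
af_theta = accel_factor(ek_theta, 296.0, 520.0)   # about 4.23
theta_n = af_theta * 548.0                        # about 2318 h

# Minimum life: 117.9 h at 480 K versus 100.2 h at 520 K.
ek_phi = e_over_k(117.9 / 100.2, 480.0, 520.0)    # about 1015
af_phi = accel_factor(ek_phi, 296.0, 520.0)       # about 4.38

# Expected sample size of the truncated sequential test (equations 13 and 14).
alpha, gamma, p = 0.05, 0.10, 0.01
ln_a = math.log(gamma / (1.0 - alpha))
ln_b = math.log((1.0 - gamma) / alpha)
e_wn = p * ln_a + (1.0 - p) * ln_b                # about 2.8390
e_n = e_wn / 0.6115                               # E(w) from equation 15
print(theta_n, math.ceil(e_n))
```

The two independently fitted acceleration factors agree to within a few percent, which is what the linear acceleration assumption predicts.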
NUMBER OF ITEMS able to identify a specific rate that is assignable to a
specific temperature. If the mechanism of reaction at
0 higher or lower temperatures should differ, this, too,
0 1 2 3 4 5 6
-5
would alter the slope of the curve. Second, it is nec-
V essary that the energy activation be independent of
A -10 temperature, that is, constant over the range of tem-
L
U -15
ACCEPT Ho peratures of interest. It happens that, according to
E Chornet & Roy (1980), ‘‘the apparent energy of acti-
S
-20 vation is not always constant, particularly when there
-25 is more than one process going on.’’ Further comments
O on the limitations of the use of the Arrhenius equation
we will have:

P(θ, β) ln(A) + [1 − P(θ, β)] ln(B) = −0.01 × 2.2513 + 0.99 × 2.8904 = 2.8390

Then: E(n) = 2.8390 / 0.6115 = 4.6427 ≈ 5 items.

So, we could make a decision about accepting or rejecting the null hypothesis H0 after the analysis of observation number 5. Using Equations 9 and 10 and the twelve failure times obtained under accelerated conditions at 520 K given by Table 2, multiplied by the accelerating factor AF of 4.3, we calculate the sequential life-testing limits. Figure 1 below shows the sequential life-testing for the three-parameter Inverse Weibull model.

Figure 1. Sequential test graph for the three-parameter Inverse Weibull model.

Then, since we were able to make a decision about accepting or rejecting the null hypothesis H0 after the analysis of observation number 4, we do not have to analyze a number of observations corresponding to the truncation point (5 observations). As we can see in Figure 1, the null hypothesis H0 should be accepted, since the final observation (observation number 4) lies in the region related to the acceptance of H0.

8 CONCLUSIONS

There are two key limitations to the use of the Arrhenius equation: first, at all the temperatures used, linear specific rates of change must be obtained. This requires that the rate of reaction, regardless of whether or not it is measured or represented, must be constant over the period of time at which the aging process is evaluated. Now, if the expected rate of reaction should vary over the time of the test, then one would not be […] can be found in Feller (1994).

In this work we life-tested a new industrial product using an accelerated mechanism. We assumed a linear acceleration condition. To estimate the parameters of the three-parameter Inverse Weibull model we used a maximum likelihood approach for censored failure data, since the life-testing will be terminated at the moment the truncation point is reached. The shape parameter remained the same, while the accelerated scale parameter and the accelerated minimum life parameter were multiplied by the acceleration factor. The equal shape parameter is a necessary mathematical consequence of the other two assumptions, that is, assuming a linear acceleration model and a three-parameter Inverse Weibull sampling distribution. If different stress levels yield data with very different shape parameters, then either the three-parameter Inverse Weibull sampling distribution is the wrong model for the data or we do not have a linear acceleration condition. In order to translate test results obtained under accelerated conditions to normal use conditions, we applied some reasoning given by the ‘‘Maxwell Distribution Law.’’ To evaluate the accuracy (significance) of the three-parameter values estimated under normal conditions for the underlying Inverse Weibull model, we applied to the expected normal failure times a sequential life testing using a truncation mechanism developed by De Souza (2004). These expected normal failure times were acquired by multiplying the twelve failure times obtained under accelerated testing conditions at 520 K, given by Table 2, by the accelerating factor AF of 4.3. Since we were able to make a decision about accepting or rejecting the null hypothesis H0 after the analysis of observation number 4, we did not have to analyze a number of observations corresponding to the truncation point (5 observations). As we saw in Figure 1, the null hypothesis H0 should be accepted, since the final observation (observation number 4) lies in the region related to the acceptance of H0. Therefore, we accept the hypothesis that the life of the friction-resistant low alloy-high strength steel rails, when operating at normal use conditions, can be represented by a three-parameter Inverse Weibull model having a shape parameter β of 8.4, a scale parameter θ of 2,320 hours and a minimum life ϕ of 430 hours.
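The expected-sample-size arithmetic used above, E(n) = [P ln(A) + (1 − P) ln(B)] / 0.6115, can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and the constants (including the divisor 0.6115) are taken from the worked example in the text without re-derivation.

```python
import math

def expected_sample_size(p, ln_a, ln_b, per_obs_term):
    """Expected number of observations for the truncated sequential
    test: E(n) = [P*ln(A) + (1 - P)*ln(B)] / per_obs_term."""
    return (p * ln_a + (1.0 - p) * ln_b) / per_obs_term

# Values from the worked example: P*ln(A) = -0.01 x 2.2513 and
# (1 - P)*ln(B) = 0.99 x 2.8904, divided by 0.6115.
e_n = expected_sample_size(0.01, -2.2513, 2.8904, 0.6115)
truncation_point = math.ceil(e_n)
print(round(e_n, 4), truncation_point)  # 4.6427 5
```

Rounding E(n) up gives the truncation point of 5 items quoted in the text.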
REFERENCES

Bain, Lee J. 1978. Statistical Analysis of Reliability and Life-Testing Models, Theory and Method. Marcel Dekker, Inc., New York, NY, USA.
Blischke, W.R. 1974. On Non-Regular Estimation II. Estimation of the Location Parameter of the Gamma and Weibull Distributions, Communications in Statistics, 3, 1109–1129.
Chornet & Roy. 1980. Compensation of Temperature on Peroxide Initiated Crosslinking of Polypropylene, European Polymer Journal, 20, 81–84.
Cohen, A.C., Whitten, B.J. & Ding, Y. 1984. Modified Moment Estimation for the Three-Parameter Weibull Distribution, Journal of Quality Technology, 16, 159–167.
De Souza, Daniel I. 2000. Further Thoughts on a Sequential Life Testing Approach Using a Weibull Model. In Cottam, Harvey, Pape & Tait (eds.), Foresight and Precaution, ESREL 2000 Congress, 2: 1641–1647, Edinburgh, Scotland: Balkema.
De Souza, Daniel I. 2001. Sequential Life Testing with a Truncation Mechanism for an Underlying Weibull Model. In Zio, Demichela & Piccinini (eds.), Towards a Safer World, ESREL 2001 Conference, 16–20 September, 3: 1539–1546. Politecnico di Torino, Italy.
De Souza, Daniel I. 2004. Sequential Life-Testing with Truncation Mechanisms for Underlying Three-Parameter Weibull and Inverse Weibull Models. In Raj B.K. Rao, B.E. Jones & R.I. Grosvenor (eds.), COMADEM Conference, Cambridge, U.K., August 2004, 260–271, Comadem International, Birmingham, U.K.
De Souza, Daniel I. 2005. A Maximum Likelihood Approach Applied to an Accelerated Life Testing with an Underlying Three-Parameter Inverse Weibull Model. In Raj B.K. Rao & David U. Mba (eds.), COMADEM 2005 – Condition Monitoring and Diagnostic Engineering Management, University Press, v. 01, pp. 63–72, Cranfield, Bedfordshire, UK.
De Souza, Daniel I. & Addad, Assed N. 2007. Sequential Life-Testing with an Underlying Three-Parameter Inverse Weibull Model – A Maximum Likelihood Approach. In IIE Annual Conference and Exposition, v. 01, pp. 907–912, Nashville, TN: The Institute of Industrial Engineering, USA.
De Souza, Daniel I. & Lamberson, Leonard R. 1995. Bayesian Weibull Reliability Estimation, IIE Transactions, 27 (3), 311–320.
Erto, Pasquale. 1982. New Practical Bayes Estimators for the 2-Parameter Weibull Distribution, IEEE Transactions on Reliability, R-31 (2), 194–197.
Feller, Robert L. 1994. Accelerated Aging, Photochemical and Thermal Aspects. The Getty Conservation Institute; printer: Edwards Bros., Ann Arbor, Michigan.
Harter, H. et al. 1965. Maximum Likelihood Estimation of the Parameters of Gamma and Weibull Populations from Complete and from Censored Samples, Technometrics, 7, 639–643; erratum, 15 (1973), 431.
Kapur, K. & Lamberson, L.R. 1977. Reliability in Engineering Design, John Wiley & Sons, Inc., New York.
Mood, A.M. & Graybill, F.A. 1963. Introduction to the Theory of Statistics. Second Edition, McGraw-Hill, New York.
Murthy, D.N.P., Xie, M. & Jiang, R. 2004. Weibull Models. Wiley Series in Probability and Statistics, John Wiley & Sons, Inc., New Jersey.
Zanakis, S.H. & Kyparisis, J. 1986. A Review of Maximum Likelihood Estimation Methods for the Three-Parameter Weibull Distribution. Journal of Statistical Computation and Simulation, 25, 53–73.

APPENDIX 1. DETERMINING AN INITIAL ESTIMATE TO THE MINIMUM LIFE ϕ

The pdf of t1 will be given by:

f(t1) = n [1 − F(t1)]^(n−1) f(t1). Since F(t1) = 1 − R(t1), we will have

f(t1) = n [R(t1)]^(n−1) f(t1)

For the three-parameter Inverse Weibull sampling distribution, we will have:

f(t1) = n (β/θ) [θ/(t − ϕ)]^(β+1) exp{−[θ/(t − ϕ)]^β} (1 − exp{−[θ/(t − ϕ)]^β})^(n−1)

The expected value of t1 is given by:

E(t1) = ∫_ϕ^∞ t · n (β/θ) [θ/(t − ϕ)]^(β+1) exp{−[θ/(t − ϕ)]^β} (1 − exp{−[θ/(t − ϕ)]^β})^(n−1) dt

Letting U = [θ/(t − ϕ)]^β, we will have:

dU = −(β/θ) [θ/(t − ϕ)]^(β+1) dt;  t = θ/U^(1/β) + ϕ

When t → ∞, U → 0; when t → ϕ, U → ∞. Then:

E(t1) = n ∫_0^∞ (θ U^(−1/β) + ϕ) e^(−U) (1 − e^(−U))^(n−1) dU

E(t1) = n θ ∫_0^∞ U^(−1/β) e^(−U) (1 − e^(−U))^(n−1) dU + n ϕ ∫_0^∞ e^(−U) (1 − e^(−U))^(n−1) dU

The above integrals have to be solved by using a numerical integration procedure, such as Simpson's
1/3 rule. Remembering that Simpson's 1/3 rule is given by:

∫_a^b f(x) dx = (g/3) (f1 + 4f2 + 2f3 + · · · + 4fk + fk+1) − error

where g is the step size. Making the error = 0, and with i = 1, 2, . . ., k + 1, we will have:

n θ ∫_0^∞ U^(−1/β) e^(−U) (1 − e^(−U))^(n−1) dU = n × θ × (g/3) × [Σ_{i=1}^{k+1} U_i^(−1/β) e^(−U_i) (1 − e^(−U_i))^(n−1) × (1, 2 or 4)]   (A)

n ϕ ∫_0^∞ e^(−U) (1 − e^(−U))^(n−1) dU = n × ϕ_j × (g/3) × [Σ_{i=1}^{k+1} e^(−U_i) (1 − e^(−U_i))^(n−1) × (1, 2 or 4)]   (B)

Using Equations A and B, we will have:

E(t1) = n θ ∫_0^∞ U^(−1/β) e^(−U) (1 − e^(−U))^(n−1) dU + n ϕ ∫_0^∞ e^(−U) (1 − e^(−U))^(n−1) dU

Finally:

E(t1) = n × θ × (g/3) × [Σ_{i=1}^{k+1} U_i^(−1/β) e^(−U_i) (1 − e^(−U_i))^(n−1) × (1, 2 or 4)] + n × ϕ_j × (g/3) × [Σ_{i=1}^{k+1} e^(−U_i) (1 − e^(−U_i))^(n−1) × (1, 2 or 4)]   (27)
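The appendix's Simpson's 1/3 rule evaluation of E(t1) can be sketched numerically as below. This is an illustrative sketch, not the authors' code: the function name, the truncation of the infinite upper limit at u_max, and the trial parameter values (borrowed from the normal-use estimates quoted in the conclusions, with n = 12 items) are all assumptions for the example.

```python
import math

def e_t1(n, beta, theta, phi, u_max=50.0, k=2000):
    """Approximate E(t1) = n * Int_0^inf (theta*U**(-1/beta) + phi)
    * exp(-U) * (1 - exp(-U))**(n - 1) dU with Simpson's 1/3 rule.
    The infinite upper limit is truncated at u_max (exp(-50) is
    negligible); k must be even, and the step size is g = u_max / k."""
    g = u_max / k
    total = 0.0
    for i in range(k + 1):
        u = i * g
        if u == 0.0:
            f = 0.0  # integrand vanishes at U = 0 when n > 1/beta
        else:
            f = ((theta * u ** (-1.0 / beta) + phi)
                 * math.exp(-u) * (1.0 - math.exp(-u)) ** (n - 1))
        # Simpson weights 1, 4, 2, ..., 2, 4, 1
        w = 1 if i in (0, k) else (4 if i % 2 == 1 else 2)
        total += w * f
    return n * g / 3.0 * total

# Sanity check: with theta = 0 the result reduces to phi, since
# n * Int_0^inf exp(-U)(1 - exp(-U))**(n-1) dU = 1 exactly.
print(e_t1(n=12, beta=8.4, theta=0.0, phi=430.0))     # ~430.0
print(e_t1(n=12, beta=8.4, theta=2320.0, phi=430.0))  # first-failure estimate
```

The theta = 0 case gives a convenient built-in check of the quadrature, because the second integral in the appendix is known in closed form.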
Non-homogeneous Poisson Process (NHPP), stochastic model applied to evaluate the economic impact of the failure in the Life Cycle Cost Analysis (LCCA)

Carlos Parra Márquez, Adolfo Crespo Márquez, Pedro Moreu de León, Juan Gómez Fernández & Vicente González Díaz
Department of Industrial Management, School of Engineering, University of Seville, Spain

ABSTRACT: This paper aims to explore different aspects related to the failure costs (non-reliability costs) within the Life Cycle Cost Analysis (LCCA) of a production asset. Life cycle costing is a well-established method used to evaluate alternative asset options. This methodology takes into account all costs arising during the life cycle of the asset. These costs can be classified as the ‘capital expenditure’ (CAPEX) incurred when the asset is purchased and the ‘operating expenditure’ (OPEX) incurred throughout the asset's life. In this paper we explore different aspects related to the ‘‘failure costs’’ within the life cycle cost analysis, and we describe the most important aspects of the stochastic model called Non-homogeneous Poisson Process (NHPP). This model will be used to estimate the failure frequency and the impact that the diverse failures could cause on the total costs of a production asset. The paper also contains a case study where we applied the above mentioned concepts. Finally, the model presented provides maintenance managers with a decision tool that optimizes the life cycle cost analysis of an asset and will increase the efficiency of the decision-making process related to the control of failures.

Keywords: Asset; Failures; Life Cycle Cost Analysis (LCCA); Non-homogeneous Poisson Process (NHPP); Maintenance; Reliability; Repairable Systems
1 INTRODUCTION

With the purpose of optimizing costs and improving the profitability of productive processes, the so-called World Class organizations (Mackenzie, 1997) dedicate enormous efforts to visualize, analyze, implement and execute strategies for the solution of problems that involve decisions in high impact areas: security, environment, production goals, product quality, operation costs and maintenance. In recent years, specialists in the areas of value engineering and operations management have improved the quantification process of the costs, including the use of techniques that quantify the Reliability factor and the impact of failure events on the total costs of a production system along its life cycle (Woodhouse, 1993). These improvements have allowed diminishing the uncertainty in the decision-making process in areas of vital importance such as: design, development, maintenance, substitution and acquisition of production assets. It is important to make clear that, in this whole process, many decisions and actions exist, technical as well as non-technical, that should be adopted through the whole use period of an industrial asset. Product support and maintenance needs of systems are more or less decided during the design and manufacturing phase (Markeset and Kumar, 2001). Most of these actions, particularly those that correspond to the design phase of the production system, have a high impact on the total life cycle of the asset, being of particular interest those decisions related to the improvement process of the ‘‘Reliability’’ factor (quality of the design, used technology, technical complexity, frequency of failures, costs of preventive/corrective maintenance, maintainability levels and accessibility), since these aspects have a great influence on the total cost of the asset's life cycle, and they influence in great measure the possible expectations to extend the useful life of the production systems at reasonable costs (see, e.g. Blanchard, 2001; Blanchard and Fabrycky, 1998; Goffin, 2000; Markeset and Kumar, 2001; Smith and Knezevic, 1996 and Woodward, 1997).
2 ANTECEDENTS OF THE LCCA TECHNIQUES

In recent years, the investigation area related to Life cycle Costs Analysis has continued its development, at the academic level as much as at the industrial level. It is important to mention the existence of other methodologies that have emerged in the area of LCCA, such as: Life cycle Costs Analysis and Environmental Impact, Total Costs Analysis of Production Assets, among others (Durairaj and Ong, 2002). These methodologies have their particular characteristics, although, regarding the estimation process of the costs of failure events impact, they propose Reliability analyses usually based on constant failure rates. The antecedents of the LCCA are shown next (Kirt and Dellisola, 1996):

• 1930, one of the first known records of the LCCA techniques is found in the book named Principles of Engineering Economics by Eugene L. Grant.
• 1933, the first reference of Life cycle Analysis by the Government of the United States shows up, carried out by the federal department General Accounting Office (GAO), and is related to the purchase of a series of tractors.
• 1950, Lawrence D. Miles originated the concept of Value Engineering at General Electric, incorporating aspects related to the techniques of LCCA.
• 1960, Stone (1975) began to work in England, giving as a result, in the decade of the 70's, the publication of two of the biggest texts developed in Europe in relation to costs engineering.
• 1960, the Logistics Management Institute of the United States developed an investigation in the area of Obsolescence Engineering for the Ministry of Defense. The final result of this investigation was the publication of the first Life cycle Cost Manual in the year 1970.
• 1972, the Ministry of Defense of the United States promoted the development of a group of Manuals with the purpose of applying the LCCA Methodology in all the Logistics areas.
• 1974, the Department of Energy of the United States decided to develop its expansion and energy consumption plans supported by the analysis of Life cycle.
• 1975, the Federal Department of Supplies and Services of the United States developed a Logistics and Acquisition technique based on the LCCA.
• 1979, the Department of Energy introduced a proposal (44 FR 25366, April 30 1979) which intended that evaluations of LCCA were included in all the new constructions and major modifications in government facilities.
• 1980, the American Society for Testing and Materials (ASTM) developed a series of standards and databases oriented to ease the search of necessary information for the application of the LCCA.
• 1992, two investigators of the University of Virginia, Wolter Fabrycky and B.S. Blanchard, developed a model of LCCA, see details in (Fabrycky and Blanchard, 1993), in which they include a structured process to calculate the costs of Non Reliability starting from the estimate of constant values of failures per year (constant rate of failures).
• 1994, Woodward (1997), from the School of Business of the University of Staffordshire (England, Great Britain), developed an investigation line which included basic aspects of analysis of the Reliability factor and its impact on the Costs of Life cycle.
• 1998, David Willians and Robert Scott of the consulting firm RM-Reliability Group developed a model of LCCA based on the Weibull Distribution to estimate the frequency of failures and the impact of the Reliability Costs, see details in (Zohrul Kabil, 1987, Ebeling, 1997 and Willians and Scott, 2000).
• 1999, the Woodhouse Partnership consulting group participated in the European Project EUREKA, specifically inside the line of investigation denominated MACRO (Maintenance Cost/Risk Optimization Project), and they developed an LCCA commercial software denominated APT Lifespan, see details in (Roca, 1987, Barlow, Clarotti and Spizzichino, 1993, Woodhouse, 1991 and Riddell and Jennings, 2001).
• 2001, the Woodhouse Partnership consulting firm and the Venezuelan Oil Technological Institute (INTEVEP) put this model to the test, evaluating the Total Costs of Life cycle for 56 gas compression systems used for the extraction of heavy oil in the San Tomé District (Venezuela), see details in (Parra and Omaña, 2003).

3 BASIC ASPECTS OF THE LCCA

To evaluate the costs associated to the life cycle of a production system, there exists a collection of procedures grouped together under the denomination Techniques of Life cycle Costs Analysis. The early implementation of the costs analysis techniques allows evaluating in advance the potential design problems and quantifying the potential impact in the costs along the life cycle of the industrial assets (Durairaj and Ong, 2002). Next, some basic definitions of Life cycle Cost Analysis are presented:

– Kirt and Dellisola (1996) define the LCCA as a technique of economic calculation that allows to optimize the making of decisions associated to
the design processes, selection, development and substitution of the assets that conform a production system. It intends to evaluate in a quantitative way all the costs associated to the economic period of expected useful life, expressed in yearly equivalent monetary units (Dollars/year, Euros/year, Pesos/year).
– Woodhouse (1991) defines the LCCA as a systematic process of technical-economical evaluation, applied in the selection and replacement process of production systems, that allows considering in a simultaneous way economic and Reliability aspects, with the purpose of quantifying the real impact of all the costs along the life cycle of the assets ($/year), and in this way being able to select the asset that contributes the largest benefits to the productive system.

The great quantity of variables that must be managed when estimating the real costs of an asset along its useful life generates a scenario of high uncertainty (Durairaj and Ong, 2002). The combination of inflation, rise/decrease of the costs, reduction/increase of the purchasing power, budget limitations, increase of the competition and other similar characteristics has generated a restlessness and interest about the total cost of the assets. Often the total cost of the production system is not visible, in particular those costs associated with: operation, maintenance, installation tests, personnel's training, among others.

Additionally, the dynamics of the economic scenario generate problems related to the real determination of the asset's cost. Some of them are (Fabrycky, 1997):

• The factors of costs are usually applied incorrectly. The individual costs are inadequately identified and, many times, they are included in the wrong category: the variable costs are treated as fixed (and vice versa); the indirect costs are treated as direct, etc.
• The accounting procedures do not always allow a realistic and timely evaluation of the total cost. Besides, it is often difficult (if not impossible) to determine the costs according to a functional base.
• Many times the budgetary practices are inflexible with regard to the change of funds from one category to another, or from one year to another.

To avoid the uncertainty in the costs analysis, the studies of economic viability should approach all the aspects of the life cycle cost. The tendency to variability of the main economic factors, together with the additional problems already enunciated, has driven to erroneous estimates, causing designs and developments of production systems that are not suitable from the point of view of cost-benefit (Fabrycky, 1997). It can be anticipated that these conditions will worsen unless the design engineers assume a bigger grade of consideration of the costs. Inside the dynamic process of change, the acquisition costs associated to the new systems are not the only ones to increase; the operation and maintenance costs of the systems already in use also increase in a quick way. This is due mainly to a combination of factors such as (Fabrycky, 1997):

• Inaccuracies in the estimates, predictions and forecasts of the events of failures (Reliability), ignorance of the probability of occurrence of the different failure events inside the production systems in evaluation.
• Ignorance of the deterioration processes behavior.
• Lack of forecast in the maintenance processes and ignorance of the modern techniques of maintenance management.
• Engineering changes during the design and development.
• Changes in the construction of the system itself.
• Changes in the expected production patterns.
• Changes during the acquisition of system components.
• Setbacks and unexpected problems.

3.1 Characteristics of the costs in a production asset

The cost of a life cycle is determined by identifying the applicable functions in each one of its phases, calculating the cost of these functions and applying the appropriate costs during the whole extension of the life cycle. So that it is complete, the cost of the life cycle should include all the costs of design, fabrication and production (Ahmed, 1995). In the following paragraphs the characteristics of the costs in the different phases of an asset's life cycle are summarized (Levi and Sarnat, 1990):

• Investigation, design and development costs: initial planning, market analysis, product investigation, design and engineering requirements, etc.
• Production, acquisition and construction costs: industrial engineering and analysis of operations, production (manufacturing, assembly and tests), construction of facilities, process development, production operations, quality control and initial requirements of logistics support.
• Operation and support costs: operations inputs of the production system, planned maintenance, corrective maintenance (which depends on the Reliability Factor) and costs of logistical support during the system's life cycle.
• Removal and elimination costs: elimination of non repairable elements along the life cycle, retirement of the system and recycling material.
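The cost categories listed above are combined by the LCCA into a single yearly equivalent figure, as in the definitions of Section 3 (costs expressed in $/year). The sketch below is illustrative only: the function name, discount rate and cost figures are hypothetical, and a standard capital recovery factor is used to annualize the capital outlay.

```python
def equivalent_annual_cost(capex, opex_per_year, years, rate):
    """Equivalent annual cost (cost units/year): CAPEX is annualized
    with the capital recovery factor A/P = r(1+r)^n / ((1+r)^n - 1),
    then the yearly OPEX is added."""
    crf = rate * (1 + rate) ** years / ((1 + rate) ** years - 1)
    return capex * crf + opex_per_year

# Hypothetical asset: 500,000 purchase cost, 80,000/year operation
# and maintenance, 15-year useful life, 10% discount rate.
print(round(equivalent_annual_cost(500_000, 80_000, 15, 0.10), 2))
```

Competing asset options can then be ranked on this single $/year figure, which is the comparison the LCCA definitions above describe.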
From the financial point of view, the costs generated along the life cycle of the asset are classified in two types of costs:

• CAPEX: Capital costs (design, development, acquisition, installation, staff training, manuals, documentation, tools and facilities for maintenance, replacement parts for assurance, withdrawal).
• OPEX: Operational costs (manpower, operations, planned maintenance, storage, recruiting and corrective maintenance, penalizations for failure events/low Reliability).

Figure 1. Economic impact of the Reliability (diagram: the costs of Non Reliability divided into costs for penalization and costs for corrective maintenance).

3.2 Impact of the reliability in the LCCA

Woodhouse (1991) outlines that, to be able to design an efficient and competitive productive system in the modern industrial environment, it is necessary to evaluate and quantify in a detailed way the following two aspects:

• Costs: aspect that is related to all the costs associated to the expected total life cycle of the production system, including: design costs, production, logistics, development, construction, operation, preventive/corrective maintenance and withdrawal.
• Reliability: factor that allows predicting the form in which the production processes can lose their operational continuity due to events of accidental failures, and evaluating the impact in the costs that the failures cause in security, environment, operations and production.

The key aspect of the term Reliability is related to the operational continuity. In other words, we can affirm that a production system is ‘‘Reliable’’ when it is able to accomplish its function in a secure and efficient way along its life cycle. Now, when the production process begins to be affected by a great quantity of accidental failure events (low Reliability), this scenario causes high costs, associated mainly with the recovery of the function (direct costs) and with growing impact in the production process (penalization costs). See Figure 1.

The total costs of Non Reliability are described next (Barlow, Clarotti and Spizzichino, 1993, Ruff and Paasch, 1993 and Woodhouse, 1993):

– Costs for penalization:
• Downtime: opportunity losses/deferred production, production losses (unavailability), operational losses, impact in the quality, impact in security and environment.
– Costs for corrective maintenance:
• Manpower: direct costs related to the manpower (own or hired) in the event of a non planned action.
• Materials and replacement parts: direct costs related to the consumable parts and the replacements used in the event of an unplanned action.

The impact in the costs that an asset of low Reliability generates is associated directly with the behavior of the following index:

Λ(t) = expected number of failures in a time interval [0, t]

According to Woodhouse (1991), the increase of the costs is caused, in its great majority, by the lack of forecast of unexpected failure appearances, a scenario basically provoked by ignorance of, and lack of analysis in the design phase of, the aspects related to the Reliability. This situation brings as a result an increase in the operation costs (costs that were not considered at the beginning), affecting in this way the profitability of the production process.

It is important to mention that the results obtained from the LCCA reach their maximum effectiveness during the phases of initial development, visualization, and conceptual, basic and details engineering. Once the design has been completed, it is substantially difficult to modify the economic results. Also, the economic considerations related to the life cycle should be specifically outlined during the phases previously mentioned, if one wants to totally exploit the possibilities of an effective economic engineering. It is necessary to keep in mind that almost two thirds of the life cycle cost of an asset or system are already determined in the preliminary conceptual and design phase (70–85% of value creation and costs reduction opportunities), according to Dowlatshahi (1992).

4 STOCHASTIC MODELS CONSIDERED FOR THE ANALYSIS OF THE RELIABILITY

The following sections discuss the different stochastic models considered for the analysis of the frequency of failures in repairable units and systems.
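Before turning to the stochastic models, the cost impact of the non-reliability index Λ(t) introduced in Section 3.2 can be sketched as follows: the yearly cost of Non Reliability is the expected number of failures per year multiplied by the penalization and corrective maintenance costs per failure event. This is an illustrative sketch; the function name and all figures are hypothetical.

```python
def non_reliability_cost_per_year(failures_per_year,
                                  downtime_hours_per_failure,
                                  penalization_per_hour,
                                  manpower_per_failure,
                                  materials_per_failure):
    """Yearly cost of Non Reliability = expected failures/year x
    (penalization cost + corrective maintenance cost) per failure."""
    per_failure = (downtime_hours_per_failure * penalization_per_hour
                   + manpower_per_failure + materials_per_failure)
    return failures_per_year * per_failure

# Hypothetical example: 1.5 failures/year, 8 h of lost production per
# failure at 2,000/h, plus 1,200 of manpower and 3,000 of spare parts.
print(non_reliability_cost_per_year(1.5, 8, 2000, 1200, 3000))  # 30300.0
```

The stochastic models below supply the failures-per-year term; the constant-rate case corresponds to the simple process summarized in Section 5.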
Figure 2. Basic notation for a stochastic point process.

4.1 Ordinary Renewal Process (ORP)

This model assumes that, following a repair, the unit returns to an ‘‘as good as new’’ (AGAN) condition. In this process, the interarrival times, xi, between successive failures (see Figure 2) are considered independently and identically distributed random variables. It is a generalization of a Homogeneous Poisson Process (HPP). This model represents an ideal situation; it is only appropriate for replaceable items and hence has very limited applications in the analysis of repairable components and systems. Variations of the ORP can also be defined. The modified renewal process, where the first interarrival time differs from the others, and the superimposed renewal process (union of many independent ORPs) are examples of these possible variations (Ascher and Feingold, 1984).

4.2 Non-Homogeneous Poisson Process (NHPP)

This model is also called ‘‘minimal repair’’ and it assumes that the unit returns to an ‘‘as bad as old’’ (ABAO) condition after a repair. So that, after the restoration, the item is assumed to be operative but as old as it was before the failure. The NHPP differs from the HPP in that the rate of occurrence of failures varies with time rather than being constant (Ascher and Feingold, 1984). Unlike the previous model, in this process the interarrival times are neither independent nor identically distributed. The NHPP is a stochastic point process in which the probability of occurrence of n failures in any interval [t1, t2] has a Poisson distribution with:

mean = ∫_t1^t2 λ(t) dt   (1)

where λ(t) is the rate of occurrence of failures (ROCOF), defined as the inverse of the expected interarrival times, 1/E[xi] (Ascher and Feingold, 1984 and Crow, 1974).

4.3 Generalized renewal process (GRP)

A repairable system may end up in one of the five possible states after a repair:

a. As good as new
b. As bad as old
c. Better than old, but worse than new
d. Better than new
e. Worse than old

The two models described before, the ordinary renewal process and the NHPP, account for the first two states respectively. However, the last three repair states have received less attention, since they involve more complex mathematical models. Kijima and Sumita (1987) proposed a probabilistic model for all the after-repair states, called Generalized Renewal Process (GRP). According to this approach, the ordinary renewal process and the NHPP are considered specific cases of the generalized model. The GRP theory of repairable items introduces the concept of virtual age (An). This value represents the calculated age of the element immediately after the nth repair occurs. For An = y, the system has a time to the (n + 1)th failure, xn+1, which is distributed according to the following cumulative distribution function (cdf):

F(x | An = y) = [F(x + y) − F(y)] / [1 − F(y)]   (2)

where F(x) is the cdf of the time to the first failure (TTFF) distribution of a new component or system. The summation:

Sn = Σ_{i=1}^{n} xi   (3)

with S0 = 0, is called the real age of the element. The model assumes that the nth repair only compensates for the damage accumulated during the time between the (n − 1)th and the nth failure. With this assumption, the virtual age of the component or system after the nth repair is:

An = An−1 + q xn = q Sn   (4)

where q is the repair effectiveness (or rejuvenation) parameter and A0 = 0. According to this model, the result of assuming a value of q = 0 leads to an ordinary renewal process (as good as new), while the assumption of q = 1 corresponds to a non-homogeneous Poisson process (as bad as old). The values of q that fall in the interval 0 < q < 1 represent the after-repair states in which the condition of the element is better than old but worse than new, whereas the cases where q > 1 correspond to a condition worse than old. Similarly, cases with q < 0 would suggest a component
or system restored to a state better than new. There-
fore, physically speaking, q can be seen as an index
for representing the effectiveness and quality of repairs
(Yañez et al., 2002). Even though the q value of the
GRP model constitutes a realistic approach to simulate
the quality of maintenance, it is important to point out
that the model assumes an identical q for every repair
in the item life. A constant q may not be the case for
some equipment and maintenance process, but it is a
reasonable approach for most repairable components Figure 3. Conditional probability of occurrence of failure.
and systems.
The three models described above have advantages
and limitations. In general, the more realistic is the
model, the more complex are the mathematical expres- Law Model (Ascher and Feinfold, 1984 and Crow,
sion involved. The NHPP model has been proved to 1974 ):
provide good results even for realistic situations with
better-than-old but worse-than-new repairs (Yañez  β−1
β t
et al., 2002). Based on this, and given its conservative
nature and manageable mathematical expressions, the
NHPP was selected for this particular work. The specific
analytical modeling is discussed in the following section.

4.4 Non-homogeneous Poisson process analytical modeling

The NHPP is a stochastic point process in which the
probability of occurrence of n failures in any interval
[t1, t2] has a Poisson distribution with the mean:

λ = ∫_{t1}^{t2} λ(t) dt    (5)

where λ(t) is the rate of occurrence of failures (ROCOF).
Therefore, according to the Poisson process:

Pr[N(t2) − N(t1) = n] = [∫_{t1}^{t2} λ(t) dt]^n · exp(−∫_{t1}^{t2} λ(t) dt) / n!    (6)

where n = 0, 1, 2, . . . is the number of failures in the
time interval [t1, t2]. The total expected number of
failures is given by the cumulative intensity function:

Λ(t) = ∫_{0}^{t} λ(t) dt    (7)

One of the most common forms of ROCOF used in
reliability analysis of repairable systems is the Power
Law model:

λ(t) = (β/α)(t/α)^(β−1)    (8)

This form comes from the assumption that the inter-arrival
times between successive failures follow a conditional
Weibull probability density function, with parameters α
and β. The Weibull distribution is typically used in the
maintenance area due to its flexibility and applicability
to various failure processes; however, solutions for the
Gamma and Log-normal distributions are also possible.
This model implies that the arrival of the ith failure is
conditional on the cumulative operating time up to the
(i − 1)th failure. Figure 3 shows a schematic of this
conditionality (Yañez et al., 2002). This conditionality
also arises from the fact that the system retains the
as-bad-as-old condition after the (i − 1)th repair. Thus,
the repair process does not restore any added life to the
component or system.

In order to obtain the maximum likelihood (ML)
estimators of the parameters of the power law model,
consider the following definition of conditional
probability:

P(T ≤ t | T > t1) = [F(t) − F(t1)] / R(t1)
                  = [1 − R(t) − (1 − R(t1))] / R(t1)
                  = 1 − R(t)/R(t1)    (9)

where F(·) and R(·) are the probability of component
failure and the reliability at the respective times.
Assuming a Weibull distribution, Eq. (9) yields:

F(ti) = 1 − exp[(ti−1/α)^β − (ti/α)^β]    (10)
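Equations (5)–(10) can be exercised numerically. The following is a minimal sketch in Python; the function names and the parameter values (α = 2, β = 1.5) are ours for illustration, not taken from the paper:

```python
import math

def rocof(t, alpha, beta):
    """Power Law ROCOF, Eq. (8): lambda(t) = (beta/alpha) * (t/alpha)**(beta-1)."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)

def cumulative_intensity(t1, t2, alpha, beta):
    """Expected number of failures in [t1, t2], Eqs. (5)/(7):
    the integral of lambda(t) is (t2/alpha)**beta - (t1/alpha)**beta."""
    return (t2 / alpha) ** beta - (t1 / alpha) ** beta

def prob_n_failures(n, t1, t2, alpha, beta):
    """Poisson probability of exactly n failures in [t1, t2], Eq. (6)."""
    m = cumulative_intensity(t1, t2, alpha, beta)
    return m ** n * math.exp(-m) / math.factorial(n)

def conditional_weibull_cdf(t_i, t_prev, alpha, beta):
    """Conditional probability of failure by t_i given survival to t_prev, Eq. (10)."""
    return 1.0 - math.exp((t_prev / alpha) ** beta - (t_i / alpha) ** beta)

# With beta = 1 the process reduces to a homogeneous Poisson process:
# the expected number of failures in [0, 10] with alpha = 2 is 10/2.
print(cumulative_intensity(0.0, 10.0, alpha=2.0, beta=1.0))  # 5.0
# With beta > 1 the ROCOF increases with time (aging/deterioration).
print(rocof(1.0, 2.0, 1.5) < rocof(4.0, 2.0, 1.5))  # True
# With t_prev = 0, Eq. (10) reduces to the ordinary Weibull CDF.
print(conditional_weibull_cdf(3.0, 0.0, 2.0, 1.5))
```

Note that β > 1 models deterioration (increasing ROCOF), β = 1 a constant failure rate, and β < 1 improvement, which is what makes the power law form convenient for repairable-system trend analysis.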
Therefore, the conditional Weibull density function is:

f(ti) = (β/α)(ti/α)^(β−1) · exp[(ti−1/α)^β − (ti/α)^β]    (11)

For the case of the NHPP, different expressions for the
likelihood function may be obtained. We will use the
expression based on estimation at a time t after the
occurrence of the last failure and before the occurrence
of the next failure; see details on these expressions in
(Modarres et al., 1999).

4.4.1 Time terminated NHPP maximum likelihood estimators

In the case of time terminated repairable components,
the maximum likelihood function L can be expressed as:

L = f(t1) · [Π_{i=2}^{n} f(ti)] · R(t | tn)    (12)

where R(t | tn) is the probability of no failure between
tn and the termination time t. Therefore:

L = (β/α)(t1/α)^(β−1) exp[−(t1/α)^β]
    × (β/α)^(n−1) Π_{i=2}^{n} (ti/α)^(β−1)
    × Π_{i=2}^{n} exp[(ti−1/α)^β − (ti/α)^β]
    × exp[(tn/α)^β − (t/α)^β]    (13)

Again, the ML estimators for the parameters are
calculated. The results are (Ascher and Feingold, 1984
and Crow, 1974):

α̂ = tn / n^(1/β̂)    (14)

β̂ = n / Σ_{i=1}^{n} ln(tn/ti)    (15)

where ti is the time at which the ith failure occurs, tn
is the time at which the last failure occurred, and n is
the total number of failures. The total expected number
of failures in the time interval [tn, tn + s] is given by
the Weibull cumulative intensity function (Modarres
et al., 1999):

Λ(tn, tn+s) = (1/α^β) [(tn + ts)^β − (tn)^β]    (16)

where ts is the time elapsed after the last failure over
which the expected number of failures is to be evaluated,
and

tn = Σ_{i=1}^{n} ti    (17)

5 NHPP MODEL PROPOSED FOR THE EVALUATION OF THE COSTS PER FAILURE

Asiedu and Gu (1998) have published a state-of-the-art
review on LCCA. Most of the methodologies proposed in
recent years include basic analyses that allow quantifying
the economic impact generated by failures inside a
production system. For the quantification of the costs of
non-reliability in the LCCA, we recommend using the NHPP
model. This model evaluates the impact of the main
failures on the cost structure of a production system,
starting from a simple process which is summarized next:
first, the most important types of failures are determined;
then, a constant value of occurrence frequency per year is
assigned to each failure type (this value will not change
along the expected useful life); later on, the impact in
costs per year generated by the failures to production,
operations, environment and safety is estimated; and
finally, the total impact in costs of the failures over the
years of expected useful life is converted to present value
at a specific discount rate. The steps to estimate the
costs of failures according to the NHPP model are
detailed next:

1. Identify, for each alternative to evaluate, the main
   types of failures. In this way, for a given piece of
   equipment there will be f = 1, . . ., F types of failures.
2. Determine, for the n failures in total, the times to
   failure tf. This information will be gathered by the
   designer based on failure records, databases and/or
   the experience of maintenance and operations
   personnel.
3. Calculate the costs per failure Cf ($/failure).
   These costs include: costs of replacement parts,
   manpower, penalization for production loss and
   operational impact.
4. Define the expected frequency of failures per year,
   Λ(tn, tn+s). This frequency is assumed as a constant
   value per year for the expected cycle of useful life.
   Λ(tn, tn+s) is calculated from expression (16),
   using the times to failure tf registered by failure
   type (step 2). The parameters α and β are estimated
   from expressions (14) and (15).
   In expression (16), ts will be one year (1 year) or
   equivalent units (8760 hours, 365 days, 12 months,
   etc.). This time ts represents the value used to
   estimate the frequency of failures per year.
5. Calculate the total costs per failures per year TCPf,
   generated by the different stoppage events in
   production, operations, environment and safety,
   with the following expression:

   TCPf = Σ_{f=1}^{F} Λ(tn, tn+s) × Cf    (18)

   The obtained equivalent annual total cost represents
   the probable amount of money that will be needed
   every year to pay for the reliability problems caused
   by failure events during the years of expected
   useful life.
6. Calculate the total costs per failures in present value,
   PTCPf. Given a yearly value TCPf, this is the quantity
   of money in the present (today) that needs to be saved
   to be able to pay this annuity for the expected number
   of years of useful life (T), at a discount rate (i).
   The expression used to estimate PTCPf is shown next:

   PTCPf = TCPf × [(1 + i)^T − 1] / [i × (1 + i)^T]    (19)

Later on, the rest of the evaluated costs (investment,
planned maintenance, operations, etc.) are added to the
costs calculated for non-reliability; the total cost is
calculated in present value for the selected discount
rate and the expected years of useful life, and the
obtained result is compared with the total costs of the
other evaluated options.

6 CASE STUDY

The following failure data will be used for the three
models previously explained. This information was
gathered from the failure records of a gas compressor
of the Venezuelan National Oil Company. In this
equipment, 24 failure events have occurred in 10 years
of useful life. Next, the times to failures tf are
presented in months:

Table 1. Times to failures.

5 7 3 7 2 4 3 5 8 9 2 4 6 3 4 2 4 3 8 9 4 4 7 4

This model proposes to evaluate the impact of the
failures in the following way:

– Define the types of failures (f), where f = 1 . . . F
  for F types of failures:

  F = 1 types of failures

– Calculate the costs per failure Cf (these costs
  include: costs of replacement parts, manpower,
  penalization for production loss and operational
  impact):

  Cf = 5000 $/failure

– Define the expected frequency of failures per year,
  Λ(tn, tn+s), using expression (16):

  Λ(tn, tn+s) = (1/α^β) [(tn + ts)^β − (tn)^β]

  where:

  n = 24 failures
  tn = Σ_{i=1}^{n} ti = 5 + 7 + 3 + . . . + 4 + 7 + 4 = 117 months
  ts = 12 months
  tn + ts = 129 months

  The parameters α and β are calculated from
  expressions (14) and (15):

  α = 6.829314945
  β = 1.11865901

  The expected frequency of failures per year is:

  Λ(tn, tn+s) = 2.769896307 failures/year

  This frequency is assumed as a constant value per
  year for the expected cycle of useful life.

– Calculate the total costs per failures per year TCPf,
  using expression (18):

  TCPf = 2.769896307 failures/year × 5000 $/failure
       = 13849.48154 $/year
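The numbers in this worked example follow directly from expressions (14)–(16) and (18). A sketch in Python that reproduces them from the Table 1 inter-failure times (the variable names are ours; the data and formulas are the paper's):

```python
import math

# Times between failures in months (Table 1 of the case study).
tbf = [5, 7, 3, 7, 2, 4, 3, 5, 8, 9, 2, 4, 6, 3, 4, 2, 4, 3, 8, 9, 4, 4, 7, 4]

# Cumulative failure times t_i; t_n is the time of the last (24th) failure.
t = []
acc = 0
for x in tbf:
    acc += x
    t.append(acc)
n = len(t)   # 24 failures
tn = t[-1]   # 117 months

# Eq. (15): beta_hat = n / sum(ln(t_n / t_i))
beta = n / sum(math.log(tn / ti) for ti in t)
# Eq. (14): alpha_hat = t_n / n**(1/beta_hat)
alpha = tn / n ** (1.0 / beta)

# Eq. (16): expected number of failures in the next ts = 12 months.
ts = 12
lam_year = ((tn + ts) ** beta - tn ** beta) / alpha ** beta

# Eq. (18): total cost of failures per year, with Cf = 5000 $/failure.
Cf = 5000.0
TCPf = lam_year * Cf

# Eq. (19): present-value (annuity) factor for T = 10 years at i = 10%.
i, T = 0.10, 10
pv_factor = ((1 + i) ** T - 1) / (i * (1 + i) ** T)

print(round(beta, 4), round(alpha, 4))  # ~1.1187  ~6.8293
print(round(lam_year, 4))               # ~2.7699 failures/year
print(round(TCPf, 2))                   # ~13849.48 $/year
```

Running this recovers β ≈ 1.1187, α ≈ 6.8293, Λ ≈ 2.7699 failures/year and TCPf ≈ 13 849.48 $/year, matching the values quoted above.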
– Calculate the total cost per failure in present value,
  PTCPf, using expression (19), for a period T = 10
  years and a discount rate i = 10%:

  PTCPf = 73734.805 $

  a value that represents the quantity of money (today)
  that the organization needs in order to cover the
  annual expenses projected for failures in the next
  10 years, with a discount factor of 10%. For this
  example, the total expected number of failures in the
  time interval [tn, tn + s] is estimated by the NHPP
  stochastic model (Weibull cumulative intensity
  function), see Modarres et al., 1999.

6.1 Limitations of the model evaluated

The NHPP model has been proved to provide good
results even for realistic situations with better-than-old
but worse-than-new repairs (Hurtado et al., 2005).
Based on this, and given its conservative nature and
manageable mathematical expressions, the NHPP was
selected for this particular work. The model described
above has advantages and limitations. In general, the
more realistic the model, the more complex the
mathematical expressions involved. The main strengths
and weaknesses of this model are summarized next:

Strengths:

• It is a useful and quite simple model to represent
  equipment under aging (deterioration).
• It involves relatively simple mathematical expressions.
• It is a conservative approach and in most cases
  provides results very similar to those of more complex
  models like the GRP (Hurtado et al., 2005).

Weaknesses:

• It is not adequate to simulate repair actions that
  restore the unit to conditions better than new or
  worse than old.

7 FUTURE DIRECTIONS

The specific orientation of this work toward the analysis
of the Reliability factor and its impact on costs is due
to the fact that a great part of the increment of the
total costs during the expected cycle of useful life of
a production system is caused, in its majority, by the
lack of foresight in the face of the unexpected
appearance of failure events, a scenario basically
provoked by ignorance and by the absence of a technical
evaluation, in the design phase, of the aspects related
to Reliability. This situation results in an increment
of the total costs of operation (costs that were not
considered at the beginning), affecting in this way the
profitability of the production process.

In the process of analysis of the costs along the life
cycle of an asset, many decisions and actions exist that
should be taken, and of particular interest for this work
are those aspects related to the process of improvement
of Reliability (quality of the design, technology used,
technical complexity, frequency of failures, costs of
preventive/corrective maintenance, maintainability levels
and accessibility), since these have a great impact on
the total cost of the life cycle of the asset, and they
strongly influence the possible expectations of extending
the useful life of the assets at reasonable costs. For
these reasons, it is of supreme importance, inside the
process of estimating the life cycle of the assets, to
evaluate and analyze in detail the aspects related to
the failure rate. According to Ascher (1984), the
following points should be considered in failure rate
trend analyses:

• Failure of a component may be partial, and repair
  work done on a failed component may be imperfect.
  Therefore, the time periods between successive
  failures are not necessarily independent. This is a
  major source of trend in the failure rate.
• Imperfect repairs performed following failures do not
  renew the system, i.e., the component will not be as
  good as new; only then can the statistical inference
  methods using a Rate Of Occurrence Of Failures
  (ROCOF) assumption be used.
• Repairs made by adjusting, lubricating, or otherwise
  treating component parts that are wearing out provide
  only a small additional capability for further
  operation, and do not renew the component or system.
  These types of repair may result in a trend of an
  increasing ROCOF.
• A component may fail more frequently due to aging
  and wearing out.

It is important to mention that inside the LCCA
techniques there exists a potential area of optimization
related to the evaluation of the Reliability impact. In
the near future, new proposals for the evaluation of the
costs generated by aspects of low Reliability will use
advanced mathematical methods such as:

• Stochastic methods, see (Tejms, 1986, Karyagina
  et al., 1998, Yañez et al., 2002, Hurtado et al.,
  2005 and Vasiliy, 2007). Table 2 shows the stochastic
  processes used in reliability investigations of
  repairable systems, with their possibilities and
  limits (Modarres et al., 1999).
• Advanced maintenance optimization using genetic
  algorithms, see (Martorell et al., 2000 and Martorell
  et al., 2005).
• Monte Carlo simulation techniques, see (Barringer,
  1997, Barringer and Webber, 1996, and Kaminskiy
  and Krivtsov, 1998).
• Advanced Reliability distribution analysis, see
  (Elsayed, 1982, Barlow, Clarotti and Spizzichino,
  1993, Ireson, et al., 1996, Elsayed, 1996, Scarf,
  1997, Ebeling, 1997 and Dhillon, 1999).
• Markov simulation methods, see (Roca, 1987, Kijima
  and Sumita, 1987, Kijima, 1997 and Bloch-Mercier,
  2000).

These methods will have their particular characteristics,
and their main objective will be to diminish the
uncertainty inside the estimation process of the total
costs of an asset along the expected useful life cycle.
Finally, it is not feasible to develop a unique LCCA
model which suits all the requirements. However, it is
possible to develop more elaborate models to address
specific needs such as a Reliability cost-effective
asset development.

Table 2. Stochastic processes used in reliability analysis of repairable systems.

Stochastic process | Can be used for | Background / Difficulty
Renewal process | Spare parts provisioning in the case of arbitrary failure rates and negligible replacement or repair time (Poisson process) | Renewal theory / Medium
Alternating renewal process | One-item repairable (renewable) structure with arbitrary failure and repair rates | Renewal theory / Medium
Markov process (MP) | Systems of arbitrary structure whose elements have constant failure and repair rates during the stay time (sojourn time) in every state (not necessarily at a state change, e.g. because of load sharing) | Differential or integral equations / Low
Semi-Markov process (SMP) | Some systems whose elements have constant or Erlangian failure rates (Erlang distributed failure-free times) and arbitrary repair rates | Integral equations / Medium
Semi-Regenerative process | Systems with only one repair crew, arbitrary structure, and whose elements have constant failure rates and arbitrary repair rates | Integral equations / High
Nonregenerative process | Systems of arbitrary structure whose elements have arbitrary failure and repair rates | Partial diff. eq.; case-by-case sol. / High to very high

REFERENCES

Ahmed, N.U. 1995. ''A design and implementation model for life cycle cost management system'', Information and Management, 28, pp. 261–269.
Asiedu, Y. and Gu, P. 1998. ''Product lifecycle cost analysis: state of art review'', International Journal of Production Research, Vol. 36, No. 4, pp. 883–908.
Ascher, H. and Feingold, H. 1984. ''Repairable System Reliability: Modeling, Inference, Misconceptions and their Causes'', Marcel Dekker, New York.
Barlow, R.E., Clarotti, C.A. and Spizzichino, F. 1993. Reliability and Decision Making, Chapman & Hall, London.
Barringer, H. Paul and David P. Weber. 1996. ''Life Cycle Cost Tutorial'', Fifth International Conference on Process Plant Reliability, Gulf Publishing Company, Houston, TX.
Barringer, H. Paul and David P. Weber. 1997. ''Life Cycle Cost & Reliability for Process Equipment'', 8th Annual ENERGY WEEK Conference & Exhibition, George R. Brown Convention Center, Houston, Texas, organized by American Petroleum Institute.
Barroeta, C. 2005. Risk and economic estimation of inspection policy for periodically tested repairable components, Thesis for the Master of Science, University of Maryland, Faculty of Graduate School, College Park, Cod. Umi-umd-2712, 77 pp., August, Maryland.
Blanchard, B.S. 2001. ''Maintenance and support: a critical element in the system life cycle'', Proceedings of the International Conference of Maintenance Societies, paper 003, May, Melbourne.
Blanchard, B.S. and Fabrycky, W.J. 1998. Systems Engineering and Analysis, 3rd ed., Prentice-Hall, Upper Saddle River, NJ.
Bloch-Mercier, S. 2000. ''Stationary availability of a semi-Markov system with random maintenance'', Applied Stochastic Models in Business and Industry, 16, pp. 219–234.
Crow, L.H. 1974. ''Reliability analysis for complex repairable systems'', Reliability and biometry, Proschan F, Serfling RJ, eds., SIAM, Philadelphia, pp. 379–410.
Dhillon, B.S. 1989. Life Cycle Costing: Techniques, Models and Applications, Gordon and Breach Science Publishers, New York.
Dhillon, B.S. 1999. Engineering Maintainability: How to Design for Reliability and Easy Maintenance, Gulf, Houston, TX.
Dowlatshahi, S. 1992. ''Product design in a concurrent engineering environment: an optimization approach'', Journal of Production Research, Vol. 30 (8), pp. 1803–1818.
Durairaj, S. and Ong, S. 2002. ''Evaluation of Life Cycle Cost Analysis Methodologies'', Corporate Environmental Strategy, Vol. 9, No. 1, pp. 30–39.
DOD Guide LCC-1, DOD Guide LCC-2, DOD Guide LCC-3. 1998. ''Life Cycle Costing Procurement Guide, Life Cycle Costing Guide for System Acquisitions, Life Cycle Costing Guide for System Acquisitions'', Department of Defense, Washington, D.C.
Ebeling, C. 1997. Reliability and Maintainability Engineering, McGraw Hill Companies, USA.
Elsayed, E.A. 1982. ''Reliability Analysis of a container spreader'', Microelectronics and Reliability, Vol. 22, No. 4, pp. 723–734.
Elsayed, E.A. 1996. Reliability Engineering, Addison Wesley Longman INC, New York.
Fabrycky, W.J. 1997. Análisis del Coste de Ciclo de Vida de los Sistemas, ISDEFE, Ingeniería de Sistemas, Madrid, España.
Fabrycky, W.J. and Blanchard, B.S. 1993. Life Cycle Costing and Economic Analysis, Prentice Hall, Inc, Englewood Cliff, New Jersey.
Goffin, K. 2000. ''Design for supportability: essential component of new product development'', Research-Technology Management, Vol. 43, No. 2, pp. 40–47.
Hurtado, J.L., Joglar, F. and Modarres, M. 2005. ''Generalized Renewal Process: Models, Parameter Estimation and Applications to Maintenance Problems'', International Journal on Performability Engineering, Vol. 1, No. 1, paper 3, pp. 37–50.
Ireson, W. Grant, Clyde F. Coombs Jr., Richard Y. Moss. 1996. Handbook of Reliability Engineering and Management, 2nd edition, McGraw-Hill, New York.
Kaminskiy, M., Krivtsov, V. 1998. ''A Monte Carlo approach to repairable system reliability analysis'', Probabilistic safety assessment and management, Springer, pp. 1063–1068.
Karyagina, M., Wong, W., Vlacic, L. 1998. ''Life cycle cost modelling using marked point processes'', Reliability Engineering & System Safety, Vol. 59, pp. 291–298.
Kececioglu, D. 1991. ''Reliability and Life Testing Handbook'', Prentice Hall, Inc, Englewood Cliff, New Jersey.
Kijima, M., Sumita, N. 1987. ''A useful generalization of renewal theory: counting process governed by non-negative Markovian increments'', Journal Appl. Prob., Vol. 23, pp. 71–88.
Kijima, M. 1997. Markov Processes for Stochastic Modeling, Chapman & Hall, London.
Kirk, S. and Dellisola, A. 1996. Life Cycle Costing for Design Professionals, McGraw Hill, New York, pp. 6–57.
Levy, H. and Sarnat, M. 1990. Capital Investment and Financial Decisions, 4th Edition, Prentice Hall, New York.
''Life Cycle Costing Workbook: A guide for implementation of Life Cycle Costing in the Federal Supply Service''. 1989. U.S. General Services Administration, Washington.
''Life Cycle Analysis as an Aid to Decision Making''. 1985. Building Information Circular, Department of Energy, Office of Facilities Engineering and Property Management, Washington.
Mackenzie, J. 1997. ''Turn your company's strategy into reality'', Manufacturing Management, January, pp. 6–8.
Markeset, T. and Kumar, U. 2001. ''R&M and risk analysis tools in product design to reduce life-cycle cost and improve product attractiveness'', Proceedings of the Annual Reliability and Maintainability Symposium, 22–25 January, Philadelphia, pp. 116–122.
Markeset, T. and Kumar, U. 2003. ''Design and development of product support and maintenance concepts for industrial systems'', Journal of Quality in Maintenance Engineering, Vol. 9, No. 4, pp. 376–392.
Martorell, S., Carlos, S., Sanchez, A. and Serradell, V. 2000. ''Constrained optimization of test intervals using a steady-state genetic algorithm'', Reliability Engineering & System Safety 67, 215–232.
Martorell, S., Villanueva, J.F., Nebot, Y., Carlos, S., Sánchez, A., Pitarch, J.L. and Serradell, V. 2005. ''RAMS+C informed decision-making with application to multi-objective optimization of technical specifications and maintenance using genetic algorithms'', Reliability Engineering & System Safety 87, 65–75.
Modarres, M., Kaminskiy, M. and Krivtsov, V. 1999. Reliability engineering and risk analysis, Marcel Dekker Inc., New York.
Nachlas, J. 1995. Fiabilidad, ISDEFE, Ingeniería de Sistemas, Madrid, España.
Navas, J. 1997. Ingeniería de Mantenimiento, Universidad de los Andes, Mérida, Venezuela.
Parra, C. 2002. ''Evaluación de la Influencia de la Confiabilidad en el Ciclo de Vida de 16 Motocompresores del Distrito Maturín'', Informe Técnico INT-9680, PDVSA INTEVEP, Venezuela.
Parra, C. y Omaña, C. 2003. ''Análisis determinístico del Ciclo de Vida y evaluación del factor Confiabilidad en Motocompresores de gas para extracción de petróleo'', Congreso Internacional de Mantenimiento, Bogotá, Colombia.
Riddell, H., Jennings, A. 2001. ''Asset Investment & Life Cycle Costing'', The Woodhouse Partnership, Technical paper, London.
Roca, J.L. 1987. ''An approach in the life cycle costing of an integrated logistic support'', Microelectronics and Reliability, Vol. 27, No. 1, pp. 25–27.
Ruff, D.N. and Paasch, R.K. 1993. ''Consideration of failure diagnosis in conceptual design of mechanical systems'', Design Theory and Methodology, ASME, New York, pp. 175–187.
Scarf, P.A. 1997. ''On the application of mathematical models in maintenance'', European Journal of Operational Research, Vol. 99, No. 3, pp. 493–506.
Smith, C. and Knezevic, J. 1996. ''Achieving quality through supportability: part 1: concepts and principles'', Journal of Quality in Maintenance Engineering, Vol. 2, No. 2, pp. 21–29.
Tejms, H.C. 1986. Stochastic Modelling and Analysis, Wiley and Sons, New York, NY.
Vasiliy, V. 2007. ''Recent advances in theory and applications of stochastic point process model in reliability engineering'', Reliability Engineering & System Safety, Vol. 92, No. 5, pp. 549–551.
Willians, D., Scott, R. 2000. ''Reliability and Life Cycle Costs'', RM-Reliability Group, Technical Paper, Texas, TX, November.
Woodhouse, J. 1991. ''Turning engineers into businessmen'', 14th National Maintenance Conference, London.
Woodhouse, J. 1993. Managing Industrial Risk, Chapman Hill Inc, London.
Woodward, D.G. 1997. ''Life Cycle Costing—Theory, Information Acquisition and Application'', International Journal of Project Management, Vol. 15, No. 6, pp. 335–344.
Yañez, M., Joglar, F., Mohammad, M. 2002. ''Generalized renewal process for analysis of repairable systems with limited failure experience'', Reliability Engineering & System Safety, Vol. 77, pp. 167–180.

Risk trends, indicators and learning rates: A new case study of North Sea oil and gas

R.B. Duffey
Atomic Energy of Canada Limited, Chalk River, Ontario, Canada

A.B. Skjerve
Institute for Energy Technology, Norway

ABSTRACT: Industrial accidents, explosions and fires have a depressingly familiar habit of re-occurring, with
similar if not identical causes. There is a continual stream of major losses that commonly are ascribed to poor
operating and management practices. The safety risks associated with modern technological enterprises make it
pertinent to consciously monitor the risk level. A comprehensive approach in this respect is being taken by the
Petroleum Safety Authority Norway (PSA)’s program ‘‘Trends in Risk Levels Norwegian Continental Shelf.’’
We analyse the publicly available data provided by this program using the Duffey–Saull Method. The purpose of
the analysis is to discern the learning trends, and to determine the learning rates for construction, maintenance,
operation and administrative activities in the North Sea oil and gas industry. The outcome of this analysis allows
risk predictions, and enables workers, management and safety authorities to focus on the most meaningful trends
and high-risk activities.

1 INTRODUCTION

The offshore oil and gas industry is a huge and modern
technological enterprise in which vast quantities of oil
are pumped, shipped, and stored. These large
socio-technological facilities pose major hazards, with
potential for spills, fires, sinkings and explosions in
a hazardous and sensitive sea environment, with risk of
accident, injury and death to the operators. In his
review of major oil industry failures and engineering
practices, Moan (2004) assesses that: ''the main cause
of actual structural failures are due to human errors
and omissions . . . and cause 80–90% of the failures of
buildings, bridges and other engineering structures.''
This is also true of what is found in all other
industries and technological systems worldwide, and the
same types of mistakes and common errors appear
(Duffey & Saull 2002). Industrial accidents, explosions
and fires have a depressingly familiar habit of
re-occurring, with similar if not identical causes.
There is a continual stream of major losses that
commonly are ascribed to poor operating and management
practices, as in recent examples of sometimes billion
dollar damage and losses at offshore oil rigs (e.g. the
P-36 platform that sank following three explosions in
March 2001), oil storage facilities (e.g. the Buncefield
Oil Depot explosions and fire of December 2005), and
refineries (e.g. the catastrophic process accident at
the BP Texas City refinery of March 2005). This has
extreme significance in risk management, and great
importance in the development of safety management
systems, not to mention the large impact on insurance
costs, corporate liability and losses, and the threat
to worker safety. This is important not only for such
large disasters but also for everyday accidents, where
the issue is: how can a large facility loss be predicted
using the everyday events and operation of that facility?

The safety risks associated with modern technological
enterprises make it pertinent to consciously monitor the
risk level, to assess the extent to which safety
improving initiatives are required. A comprehensive
approach in this respect is being taken by the Norwegian
oil and gas regulator, the Petroleum Safety Authority
Norway (PSA), who are responsible for overseeing the
safety of many of the massive deep sea platforms
operating in the storm-swept North Sea. The PSA's newly
developed program, ''Trends in Risk Levels Norwegian
Continental Shelf'', defines key measures for the purpose
of tracking and evaluating relative safety improvements
in specific areas as well as an overall risk level, with
the objective ''to create a reliable decision making
platform for industry and authorities'' (PSA 2003). In a
wide-range approach to ''measure risk for an entire
industrial sector'', twenty-one (21) risk indicators
define situations of hazard and
accident (called DFUs), covering many known major and
minor outcomes. These indicators include data for the
differing activity segments of the oil and gas offshore
and onshore work (shipping, transport, maintenance . . .)
and include events and abnormal activities (leaks,
accidents and incidents . . .), plus the effectiveness
of ''barriers'' (systems, practices and procedures . . .).
The yearly trends of the quantitative data are analyzed
as to whether these show change (increase, decrease, or
not) of both the numbers and rates of indicator outcomes,
and whether there is any relation to more qualitative
measures based on attitudinal surveys.

Determining the safety level based on this type of
calculation can, however, be a difficult task. The
recent PSA report states: ''On the basis of the data
and indicators used in this project, no clear positive
or negative trends can be observed in risk level. Most
major accident indicators show an improvement in 2003
in relation to 2002. Serious injuries to personnel also
show a decrease in 2003. The position is now on a level
with the average for the previous 10 years. Cooperation
and trust between the parties are seen as good.''

Since the common factor and major cause in industrial
accidents everywhere is the human involvement, it is
postulated here that by understanding the prior outcomes,
human learning and error correction, we can predict the
probability of observing any outcome. The key questions
to answer when looking at trends are: Are we learning
from our past mistakes? What is the rate of learning
now? What is it predicted to be in the future?

Precisely to quantify such issues, Duffey & Saull (2002)
have derived measures and methods for the analysis of
learning rates as direct indicators of safety
improvement, using existing worldwide outcome data
covering some 200 years and over 60 examples. The
approach, called the Duffey-Saull Method (DSM), uses
the Learning Hypothesis to analyze and predict errors,
accidents, injuries and all other such risk outcomes as
a function of experience. It assumes that with continuous
exposure to a given operational setting humans will learn
to master task performance, and that the manifest effect
of learning will be lower accident/incident rates –
because humans are, as a starting point, assumed to be
the key contributing factor to accidents/incidents. The
present Case Study applies these techniques and
approaches to analyze the new and publicly available
North Sea outcome data. Using the experience-based DSM,
we try to discern the learning trends, and determine the
learning rates for construction, maintenance, operation
and drilling activities in the North Sea oil and gas
industry. In our Case Study, we provide a basis to
determine, prioritize, and compare the learning rates
and injury trends between different key work phases.
This analysis allows risk predictions, and provides
guidance for workers, management and safety authorities
to focus on the most meaningful trends and high-risk
activities.

2 RISK INDICATOR DATA ANALYSIS

The procedure we use is to first determine the risk
outcomes, rates and numbers, and their distribution with
experience. The basic prior data for Norway for
1996–2005 are reported by the PSA in both graphical and
tabular form (PSA 2007). Some equivalent data for the
UK for 1992–2002 are tabulated in Yang & Trbojevic 2007
(Table 6.16, p. 195). All the data are typically given
and analyzed by calendar year, such as the number of
injuries to workers, broken down by different
sub-categories of severity (e.g., major or total), and
work location and/or activity type (e.g., fixed or
mobile facility, drilling or maintenance).

To convert to a learning basis for analysis, we use the
relevant measure of experience as the accumulated
worker-hours, summing the year-by-year numbers reported.
A typical spreadsheet (xls) tabulation and analysis of
the rates is shown in Table 1 for a

Table 1. Typical data subset—Norway well drilling injuries 1996–2005.

Well drilling (hours)  Injuries, n  AccMh      N*        Entropy, H  Injury rate/Mh  Year

4670117                145          4.670117   0.088633  0.27047     31.04847        1996
4913477                141          9.583595   0.181884  0.266685    28.69658        1997
4967799                133          14.551394  0.276167  0.258794    26.77242        1998
4418068                117          18.969462  0.360016  0.241637    26.48216        1999
4696224                121          23.665686  0.449144  0.246107    25.76538        2000
5168486                110          28.834172  0.547236  0.233505    21.28283        2001
5506589                103          34.340761  0.651744  0.224957    18.70486        2002
5827360                90           40.168122  0.762339  0.207881    15.44438        2003
6248973                54           46.417095  0.880937  0.150437    8.64142         2004
6273504                59           52.690599  1         0.159497    9.404633        2005
subset in our observational interval. In this case, the data are for injuries in drilling at fixed facilities for Norway, and similar tables were made for all the various sets where numbers were available.

This Table is in general for the jth observation interval, with the sub-intervals within it. Such a tabulation is not by itself very informative, apart from illustrating the manipulations and steps in the necessary arithmetic for each experience increment:

1. adding up prior worker-hours to obtain the running total of the accumulated millions of hours of experience, ε (AccMh), for each ith sub-interval;
2. turning the injury numbers, ni, into risk rates per Mh by straightforward division;
3. calculating the non-dimensional experience, N*, by dividing each AccMh interval, εi, by the total accumulated experience, εT (εT = ΣAccMh ≈ 53 Mh); and
4. calculating the entropy (Hi = −pi ln pi) in each ith sub-interval from the probability pi = ni/Nj, where Nj is the total number of injuries (Nj = Σni = 1073).

To clarify the trends, typical results of such analysis of the raw data are then plotted in Figure 1. The figure also shows some UK data alongside the Norway data. By grouping the data together in this way, several key comparisons and observations are possible. In addition, to simplify the presentation, simple exponential fits are shown to the data, since we expect such a curve to crudely represent the improvement effects of learning (Duffey & Saull 2002).

Firstly, learning is evident in most of the data, but the absolute risk indicator rates are higher for some activities, and some task areas are clearly learning slower than others (the slope is half). We may predict that they will all reach some asymptotic but slow learning state, by about twice the present experience, if learning continues. The most hazardous (highest risk) activities are clearly maintenance and drilling, and these must be the areas of most safety importance and management attention.

Secondly, the lowest rates attained so far (in administration and production) are ∼5/Mh, or about 1 in 200,000 experience hours, in complete accord with the lowest risk found in any other industry (Duffey & Saull 2002). However, the highest rates are ten times more, or ∼1 in 20,000 experience hours, which is also comparable to other industries.

Thirdly, the UK has less experience, but a simple extrapolation forward of the rough fit to the major injury (MI) data, and backward extrapolation of the Norway maintenance data, shows similar event rate magnitudes and learning rates. The implication of learning of similar effectiveness in Norway and the UK suggests that further research is needed into the influencing factors and causes of this similarity. Thus, we may predict and expect the UK rates to fall further and track down towards the Norway risk rates if such international learning continues. Similar convergence trends are observed with differing experience in, say, commercial aircraft near-misses and marine shipping accidents.

3 ARE WE DOING ENOUGH TO ENSURE SAFETY?

A key question for managers of hazardous industries is: Are we doing enough to ensure safety? In this section we will take a closer look at learning in a workplace setting, and suggest that this question may also be answered based on an assessment of the effectiveness of the joint initiatives taken by an organization (or an entire industry) to ensure safety.

Individuals learn as they gain experience (Ebbinghaus 1885). Employees in petroleum companies will learn from participation in the formal education and training programs offered by their organization. The aim of these programs is to ensure that all employees possess the competence, i.e. the skills, knowledge and attitudes, required to efficiently perform their jobs to the specified standard (IAEA 2002; Skjerve & Torgersen 2007). As part of their engagement in the every-day work activities, the employees will moreover face a range of learning opportunities resulting from the myriad of different situations that arise from interactions between humans, technology and administrative systems. The employees will need both the competence acquired from the formal education/training sessions and the competence acquired based on the more informal experiences gained on-the-job, to be able to perform their tasks efficiently (Johnston & Hawke 2002). With increased experience, employees will obtain still more refined insights into the task performance process and their task performance

Figure 1. Typical data plot and simplified curve fits. [Plot: Offshore Risk Rates (Injuries Data: Norway 1996–2005 and UK 1992–2001); Rate/Mh (0–90) versus Work Experience, AccMh (0–80); series: Norway administration, production, drilling, construction and maintenance injuries, and UK Major Injuries; fits: Rate (UK) = 63e^(−0.017Mh); Rate (Norway) = 52e^(−0.014Mh); Rate = 41e^(−0.0271Mh), R² = 0.8883; Rate = 21e^(−0.0289Mh), R² = 0.8865.]

environment,¹ and gradually they will be able to perform the routine part of their tasks in a highly automated manner (Rasmussen 1986).

¹ The extent to which this process involves deduction based on inference rules or the development of mental models is still a matter of debate (cf., e.g. Johnson-Laird & Byrne 2000).

Observation, imitation, reflection, discussion, and repetition may all constitute important elements in employees' learning processes. Handling of situations where unexpected occurrences happen in relation to task performance provides an important basis for learning. Such unexpected occurrences may be caused by human errors (e.g. errors of the particular employee, or errors of colleagues – in some situations the errors may even be consciously introduced for the employees to learn something). Unexpected occurrences may also be caused by breakdowns in technology or administrative systems, or by any combination of the above factors. When unexpected occurrences arise, things will not progress according to plan, and this will spur the employees to develop a more comprehensive understanding of the task performance process and the work environment. This, in turn, will improve their ability to perform safely in future situations. Accidents constitute important, but highly unwarranted, learning opportunities. When accidents happen, they will tend to challenge the organization's model of the risks it faces and the effectiveness of its countermeasures (Woods 2006). For this reason, radical changes may be implemented in the organization following an accident investigation. This suggests that not only individuals but also the organization as such may learn from experience.

Organizational learning may be defined as "...the capacity or processes within an organization to maintain or improve the performance based on experience" (DiBella 2001, Duffey & Saull 2002). A key element in organizational learning is the transformation of experiences gained by employees to the organizational level. In this process, however, the organization needs to be aware that not all the experiences gained by employees will contribute to increase the likelihood of safe performance: employees are engaged in a continuous learning process. Misunderstandings of factors in the work environment, misunderstandings of the inter-relationships between these factors, inaccurate risk perception, etc. can all be expected to be (intermediate) elements or states in a learning process. In addition to the experiences of the employees, experiences obtained by other organizations or by other industries may also prove valuable to organizational learning. Organizational learning should be manifest in the structures and processes of the organization (Svenson 2006). Concretely, organizational learning may result in the introduction of new work practices, revisions of operational procedures, refinements of training programs, improvements in the safety management approach, etc. That is, in initiatives that jointly aim at ensuring safe and efficient production.

To facilitate learning processes at all levels in the organization it is important to ensure that a learning culture is engineered (Reason 1997). A learning culture can be defined as "...an environment in which opportunities for learning are openly valued and supported and are built, where possible, into all activities" (DEST 2005). It has been suggested that effective high-reliability organizations are characterised by their ability to learn as much as possible from the failures that occur (Weick & Sutcliffe 2001).

Finally, the importance of ensuring a sound safety culture is generally reckoned as a prerequisite for safe production in high-risk industries. The agenda among most actors in the Norwegian petroleum sector is to improve the safety culture both within and across the industry (Hoholm 2003). Safety culture can be defined as "...that assembly of characteristics and attitudes in organizations and individuals which establishes that, as an overriding priority, safety issues receive the attention warranted by their significance." (Adapted from IAEA 1991, Yang & Trbojevic 2007). A sound safety culture means that the structures and processes of the organization should work together to ensure safety. Thus, deviations caused by the activities in one part of the organization should be compensated by the activity in other parts of the organization, so that safety is always ensured (Weick & Sutcliffe 2001). A sound safety culture, moreover, implies that the attitudes and behaviours of employees should promote safety. In the context of the Norwegian petroleum industry, the impact of colleagues' and managers' attitudes to safety on the individual employee was demonstrated in two recent studies (Aase et al. 2005; Skjerve, in press).

One way to answer the question "Are we doing enough to ensure safety?" could be to calculate the effectiveness of the joint initiatives taken by an organization (or an entire industry) to ensure safety. In the next section, we introduce "H" as one such possible measure.

4 SAFETY CULTURE, RISK MANAGEMENT AND PREDICTION

The emergence of order from chaos is a precise analogy to that postulated for microscopic physical and chemical systems (Prigogine 1984), and is the expressed intent of safety management and risk indicators for macroscopic socio-technological systems. In particular, we determine the Information

Entropy risk measure, H, which Duffey & Saull (2007) suggest is the objective and quantitative measure of safety culture, management systems, organizational learning and risk perception. Thus, since we may regard H as a measure of "disorder", this is of course the converse of "order", and hence is an indication of the effectiveness of these safety management processes.

The statistical theory that determines the outcome risk distribution yields an explicit expression for the Information Entropy, H, using the probability of the outcomes (Pierce 1980, Jaynes 2003, Duffey & Saull 2004, Duffey & Saull 2008). The degree of order is a function of the depth of experience, based on the frequency of error state occupation, ni/Nj.

The classic result for the Information Entropy, H, is a measure of the uncertainty, or the "missing information", or the "degree of order", given by:

Hj = −pi ln pi    (1)

Substituting in the expression for the Information Entropy, H, in the companion paper (Duffey & Saull 2008), we obtain:

Hj = ½ p0 e^(−2aN*) (aN* + ½)    (2)

The relative value of the information entropy, H, at any experience depth is also an objective measure of the cultural aspect of modern technologies called "organizational learning", since it reflects the degree of learning and the extent of management "order" or effectiveness. The so-called organizational learning and safety culture attributes of an HTS reflect its management's ability to respond effectively to the demands for continuous safety improvement. The resulting structure and probability of observed outcomes are a direct reflection of the internal organizational and skill acquisition caused by the innumerable human learning and unlearning interactions occurring within.

These statistical fluctuations, due to human decision making and actions, also cause the uncertainty of precisely when an event will actually occur, an uncertainty that is determined and measured by the information entropy. The unobserved random fluctuations produced by unpredictable human behavior (the chaos) are reflected in the emergent order at the system level: namely, the observed predictable trend of learning and error reduction with increasing experience. Thus the influence of safety management can be quantified, and predictions made about the effectiveness of learning and management systems. By providing the information entropy, H-measure (which we can also refer to as the "Learning Entropy"), we can not only describe but also predict the impact of the learning behavior on the degree of order and on the risk trends with experience that are attained.

5 COMPARISON OF THEORY AND DATA

For the present Case Study, we can now compare this theory to the overall trends of a subset of the present risk indicator data, noting that we have evaluated the entropy already as part of the initial data analysis (see Table 1). To simplify, the data are normalized to the initial probability at the initial or lowest experience, where we take p0 = 1, by definition. Figure 2 shows the Norway (injury) and UK (major injury and >3 day injury) data compared to the theory (SEST) prediction, adopting a value of a = 1 for the shape or slope parameter in the entropy distribution.

Rather satisfyingly, the theory and data easily appear side-by-side on the same graph, lending some credence to this analysis. The other data shown for comparison purposes are the commercial aircraft near-misses (NMACs), because of the significant and traditional airline emphasis on safety (Duffey & Saull 2002). The NMACs line up rather better with the

Figure 2. Comparisons of theory and data. [Upper panel: Information Entropy, Offshore Injuries UK and Norway; Entropy, −p ln p (0–0.4) versus Non-dimensional Experience, N* (0–1); series: Norway entropy, UK entropy MI, UK entropy 3d, US NMACs entropy, and theoretical entropy, a = 1. Lower panel: Entropy, Norway Offshore Injury Data 1996–2005; series: Norway injury data, theory a = 1, and linear data fit E = 0.28 − 0.11N*, R² = 0.478.]
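Taking the printed theory curve of Equation (2) to read H(N*) = ½ p0 e^(−2aN*)(aN* + ½) — the placement of the squared term in the printed formula is ambiguous, so this reading is an assumption — the normalized comparison of the Norway entropies against the theory line can be sketched as follows. The (N*, H) pairs are read from Table 1.

```python
import math

def entropy_theory(n_star, a=1.0, p0=1.0):
    # Assumed reading of Equation (2): H = 0.5*p0*exp(-2a*N*)*(a*N* + 0.5),
    # with p0 = 1 at the lowest experience and a = 1 as adopted for Figure 2.
    return 0.5 * p0 * math.exp(-2.0 * a * n_star) * (a * n_star + 0.5)

# (N*, H) pairs for Norway well drilling, taken from Table 1
norway = [
    (0.088633, 0.27047), (0.181884, 0.266685), (0.276167, 0.258794),
    (0.360016, 0.241637), (0.449144, 0.246107), (0.547236, 0.233505),
    (0.651744, 0.224957), (0.762339, 0.207881), (0.880937, 0.150437),
    (1.0, 0.159497),
]

# The data entropies all lie above the theory line, matching the paper's
# first observation in the theory/data comparison.
above = all(h > entropy_theory(n) for n, h in norway)
```

With this reading, H(0) = 0.25 and the curve decays monotonically with experience; every tabulated Norway entropy then sits above it, consistent with the stated symptom of insufficient attainment of order.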

theoretical prediction, but there are clearly some key differences between the oil and gas data set trends. Despite the scatter, we note for this data subset that the:

1. entropy distribution with experience lies above the theory line;
2. slope trend is less than the theory, indicating insufficient attainment of order;
3. data lie above the best aircraft practices (aircraft near-misses); and
4. best (but still far from perfect) fit to all the injury data is a straight line, not an exponential as we should expect.

The approximate straight-line "fit" shown is H = 0.28 − 1.11N*, which actually corresponds to the first two terms of the series expansion of the rather slowly-decaying exponential. Therefore, the implied first-order approximate value is a ≈ 1.11 for the distribution exponent.

All these trends and comparisons suggest symptoms of potentially insufficient learning, giving inadequate reduction in risk compared both to the expected ideal and to other industries. This adverse trend was confirmed by plotting the rates against the Universal Learning Curve, and finding a similar value of k ∼ 1 for the learning rate constant.

6 CONCLUSIONS AND OBSERVATIONS

We are interested in predicting safety performance and accident occurrences utilizing quantitative analysis of prior data. These predictions should serve to inform the industry and to facilitate decision making with respect to when more emphasis on safety initiatives is required. As the predictions express learning rates, they will allow companies to readily compare their learning rate with other companies in the same domain, to establish whether they are on the right track. Likewise, the predictions allow for comparisons between entire industrial sectors, and they may in this way contribute to decisions of the national safety authorities when defining requirements for the various industrial sectors.

The two types of representation presented in this paper, in Figure 1 and Figure 2, invite different interpretations of what spurs the learning rate. The representation used in Figure 1 may invite the interpretation that further accidents/incidents are necessary for learning to take place, whereas Figure 2 suggests that we can intercompare learning and progress using the existing knowledge.

Still, even if handling of unexpected events is a key element in the learning process (as discussed above), this does not imply that accidents/incidents will have to take place for people to learn. Accidents/incidents will only occur when the organization is not sufficiently robust to prevent human performance from having adverse implications. The fact that the learning rate seems to decrease only when incidents and accidents occur is caused by the fact that accidents/incidents (rather than, e.g., successful outcomes) serve as input data for the model. In general, accidents/incidents can be expected to occur at lower frequencies as an organization or an entire industrial sector gains experience. This point of view is emphasised by the representation contained in Figure 2. It shows the learning rate based on the level of control an organisation or an industrial sector has over the production processes.

Based on our Case Study of the observed and published safety indicator data for some ten years of operation of Norway and UK North Sea oil and gas facilities, and the trends shown in a subset of the data, we observe that:

– Learning is occurring in the major risk indicators as experience is gained, and this trend is similar between Norway and the UK, indicating some commonality in approach and safety standards (after correction for differing experience);
– Certain activities, notably maintenance and drilling, apparently have much higher risk than others, both in numbers and rates, and suggest themselves as priority areas for management emphasis; and
– Evaluation of the Learning Entropy as a measure of the degree of order attained by safety management (suggested to represent organizational learning and safety culture) also indicates symptoms of potentially insufficient learning.

Extension of this Case Study to the complete set of risk indicators would be desirable; as also would revising the indicator choices to reflect priority of risk-related importance; and changing the conventional purely time-series manner of data reporting and analysis.

REFERENCES

Aase, K., Skjerve, A.B.M. & Rosness, R. 2005. Why Good Luck has a Reason: Mindful Practices in Offshore Oil and Gas Drilling. In: S. Gherardi & D. Nicolini (eds.), The Passion for Learning and Knowing. Proceedings of the 6th International Conference on Organizational Learning and Knowledge, vol. 1: 193–210. Trento: University of Trento e-books.
DEST, 2005. The website of the Department of Education, Science and Training of Australia. http://www.dest.gov.au/sectors/training_skills/policy_issues_reviews/key_issues/nts/glo/ftol.htm#Glossary_-_L (Accessed January 2008)
DiBella, A.K. 2001. Learning Practices: Assessment and Action for Organizational Improvement. Upper Saddle River, N.J: Prentice-Hall.

Duffey, R.B. & Saull, J.W. 2002. Know the Risk, First Edition, Boston, USA: Butterworth-Heinemann.
Duffey, R.B. & Saull, J.W. 2004. Reliability and Failures of Engineering Systems Due to Human Errors, Proc. The First Cappadocia Int. Mechanical Engineering Symposium (CMES'04), Cappadocia, Turkey.
Duffey, R.B. & Saull, J.W. 2007. Risk Perception in Society: Quantification and Management for Modern Technologies, Proc. Safety and Reliability Conference, Risk, Reliability & Societal Safety (ESREL 2007), Stavanger, Norway, 24–27 June.
Duffey, R.B. & Saull, J.W. 2008. Risk Management Measurement Methodology: Practical Procedures and Approaches for Risk Assessment and Prediction, Proc. ESREL 2008 and 17th SRA Europe Annual Conference, Valencia, Spain, 22–25 September.
Ebbinghaus, H. 1885. Memory: A Contribution to Experimental Psychology. (Translated from: "Über das Gedächtnis"). http://psy.ed.asu.edu/~classics/Ebbinghaus/index.htm (Accessed January 2008).
Hoholm, T. 2003. Safety Culture in the Norwegian Petroleum Industry: Towards an Understanding of Interorganisational Culture Development as Network Learning. Arbeidsnotat nr. 23/2003. Oslo: Center for Technology, Innovation and Culture, University of Oslo.
IAEA, 1991. Safety Culture, Safety Series no. 75-INSAG-4, Vienna: International Atomic Energy Agency.
IAEA, 2002. Recruitment, Qualification and Training of Personnel for Nuclear Power Plants, Safety Guide no. NS-G-2.8, Vienna: International Atomic Energy Agency.
Jaynes, E.T. 2003. Probability Theory: The Logic of Science, First Edition, Edited by G.L. Bretthorst, Cambridge, UK: Cambridge University Press.
Johnson-Laird, P. & Byrne, R. 2000. Mental Models Website. http://www.tcd.ie/Psychology/Ruth_Byrne/mental_models/ (Accessed January 2008).
Johnston, R. & Hawke, G. 2002. Case Studies of Organisations with Established Learning Cultures, The National Centre for Vocational Education Research (NCVER), Adelaide, Australia. http://www.ncver.edu.au/research/proj/nr9014.pdf (Accessed January 2008)
Moan, T. 2004. Safety of Offshore Structures, Second Keppel Offshore and Marine Lecture, CORE Report No. 2005-04, National University of Singapore.
Petroleum Safety Authority Norway (PSA) 2003. Trends in Risk Levels – Norwegian Continental Shelf, Summary Report, Phase 4–2003, Ptil-04-04, p. 11, Norway.
Petroleum Safety Authority Norway (PSA) 2007. Supervision and Facts, Annual Report 2006, Stavanger, Norway, 26 April, available at www.ptil.no.
Pierce, J.R. 1980. An Introduction to Information Theory, New York: Dover.
Prigogine, I. & Stengers, I. 1984. Order Out of Chaos: Man's New Dialogue with Nature, Toronto: Bantam Books.
Rasmussen, J. 1986. Information Processing and Human-Machine Interaction: An Approach to Cognitive Engineering, System Science and Engineering, vol. 12, New York: North-Holland.
Reason, J. 1997. Managing the Risks of Organizational Accidents. Aldershot, UK: Ashgate.
Skjerve, A.B. (in press). The Use of Mindful Safety Practices at Norwegian Petroleum Installations. To be published in Safety Science.
Skjerve, A.B. & Torgersen, G.E. 2007. An Organizational-Pedagogical Framework to Support Competence Assurance Activities. In: T. Aven & J.E. Vinnem (eds.), Risk, Reliability and Societal Safety: 1925–1932. London, UK: Taylor & Francis Group.
Svenson, O. 2006. A Frame of Reference for Studies of Safety Management. In: O. Svenson, I. Salo, P. Oedewald, T. Reiman & A.B. Skjerve (eds.), Nordic Perspectives on Safety Management in High Reliability Organizations: Theory and Applications: 1–7. Valdemarsvik, Sweden: Stockholm University.
Weick, K.E. & Sutcliffe, K.M. 2001. Managing the Unexpected: Assuring High Performance in an Age of Complexity. San Francisco, CA: Jossey-Bass.
Woods, D.D. 2006. Essential Characteristics of Resilience. In: E. Hollnagel, D.D. Woods & N. Leveson (eds.), Resilience Engineering: Concepts and Precepts: 21–34. Aldershot, UK: Ashgate.
Yang, J. & Trbojevic, V. 2007. Design for Safety of Marine and Offshore Systems, IMarEST Publications, ISBN: 1-902536-58-4.


Robust estimation for an imperfect test and repair model using Gaussian mixtures

Simon P. Wilson
Centre for Telecommunications Value-Chain Research, Trinity College Dublin, Dublin, Ireland

Suresh Goyal
Bell Labs Ireland, Dublin, Ireland

ABSTRACT: We describe a technique for estimating production test performance parameters from typical
data that are available from past testing. Gaussian mixture models are used for the data because it is often
multi-modal, and the inference is implemented via a Bayesian approach. An approximation to the posterior
distribution of the Gaussian mixture parameters is used to facilitate a quick computation time. The method is
illustrated with examples.

1 INTRODUCTION

Many manufacturing processes for electronic equipment involve a complex sequence of tests on components and the system. Statistical imperfect test and repair models can be used to derive the properties of the sequence of tests, such as incoming quality, rate of false positives and negatives, and the success rate of repair, but require the value of these properties to be specified. It is recognized that optimal testing strategies can be highly sensitive to their value (Dick, Trischler, Dislis, and Ambler 1994).

Fortunately, manufacturers often maintain extensive databases from production testing that should allow these properties to be estimated. In this paper we propose a technique to compute the properties of a test from the test measurement data. It is a robust technique that is designed to be applied automatically, with no intervention from the test engineer in most cases.

This learning process is not as straightforward as it might appear at first, for several reasons. First, the properties of the test that interest us are not what are recorded in the database. Rather, the test measurements themselves are what are stored. We address this by defining a model for the measurements, introduced in (Fisher et al. 2007a; Fisher et al. 2007b), that implicitly defines the test properties; we fit measurement data to the model following the Bayesian approach, which then gives us estimates of the test properties. The Bayesian approach has the advantage that it correctly propagates the uncertainty in the measurement model parameter estimates, as inferred directly from the data, through to the estimates of the test model parameters. Second, the measurement data display a wide variety of behaviour that is challenging to model, such as multimodality and extreme outliers. We address this by using a Gaussian mixture model for the measurement data. Third, the data are sometimes censored; for example, the database may only record that a measurement was within acceptable limits, rather than the actual value. In this case inference with a Gaussian mixture model is very difficult. For these cases we show that it is possible to fit a single-component Gaussian model. Finally, practical application of the method requires that the computations be done on-line, so computing time must not be longer than a few seconds. We address this by using a fast but rather crude approximation to implement the Bayesian inference.

The paper is organised as follows. Section 2 describes the model for what is observed in the test, the model for the test properties that interest us, and the relationship between them. It also includes a description of the data. Section 3 describes the statistical inference procedure that obtains estimates of the test model parameters from the data. Section 4 illustrates the method with some examples and Section 5 offers some concluding remarks.

2 MODEL

2.1 The measurement model

A unit is tested by measuring the value of one property of the unit. The true value of the property being measured is x. The value is measured with error and we denote the observed value as y. A Gaussian

mixture model is used for x in most circumstances,
since it is able to model a wide range of behaviour
that we have seen in practical examples, such as
extreme outliers, skewness and multi-modality. We let
θ = {pk , μk , σk2 | k = 1, . . . , K} denote the Gaussian
mixture component weights, means and variances.
Therefore:


K
1
px (x | θ ) = pk  Figure 1. Flow chart of the test and repair model.
k=1 2π σk2
 
1
× exp − 2 (x − μk ) ,
2
(1) measurement model for x and y, while in this work we
2σk
leave βBG to be defined directly.
−∞ < x < ∞. We assume in this case that the
measurement error is Gaussian: 2.3 Data
There are three sets of measurements available:
1
e−(y−x) /2s ,
2 2
py|x (y | x, s2 ) = √ 1. Data from a set of ‘‘one-off’’ tests where a single
2π s 2
unit is tested m times. Such tests are occasionally
− ∞ < y < ∞. (2) carried out by the engineers to learn about the
repeatability of the test results and are clearly an
Marginally, y is a Gaussian mixture with the same pk important source of data to learn about py (y | x, s2 ).
and μk as x but with variances σk2 + s2 . The model for We define z1 , . . . , zm to be the measured values
x and y in terms of θ and s2 is called the measurement from the one-off test, which we also refer to as
model. ‘‘one-off data’’ in some equations. We also define
x to be the unknown true value of the unit used in
the one-off test.
2.2 The test and repair model 2. Data from the ‘‘first-pass’’ test where n different
A unit is classified to be good if x is in the interval units are tested. Let y1 , . . . , yn denote the mea-
(L, U ). A unit passes the test if y is in the interval sured values from the first pass test, which are also
(L, U ). The parameters of real interest pertain to the denoted ‘‘first pass data’’.
performance of the test. They are: 3. Data from the ‘‘second-pass’’ test where n2 units,
that failed the first-pass test and were repaired, were
• GI = P(L ≤ x ≤ U | θ ), the proportion of good measured again. In this case we only observe the
units; number of units ns that pass this test.
• αGG = P(L ≤ y ≤ U | L ≤ x ≤ U , θ , s2 ), the
probability that a good unit passes the test; While the quantity being measured in the test may
• αGB = 1 − αGG , the probability that a good unit be continuous, sometimes the available one-off and
fails the test (a false negative); first pass data only show whether the test was passed
• αBB = P(y < L or y > U | x < L or x > U , θ , s2 ), or not e.g. the zj and yi are interval-censored to (L, U ).
the probability that a bad unit fails the test; It is difficult to fit the Gaussian mixture model to such
• αBG = 1 − αBG , the probability that a bad unit interval censored data because they contain very little
passes the test (a false positive). information about the number of components. How-
• βBG , the probability that a unit that is bad is repaired ever we will show that it is possible to fit a single
to good. This arises because a unit that fails the test component Gaussian. We have found that it is suf-
is sent to be repaired and is then retested. There ficiently parsimonious to allow identification of the
is imperfect repair so truly bad units may not be measurement model parameters and produce sensible
repaired to good. We do assume, however, that truly estimates of the test model parameters.
good units that have failed the test cannot be repaired
to be bad.
3 STATISTICAL INFERENCE
Figure 1 represents the test and repair process.
There are therefore 4 free parameters of the test and A Bayesian approach is adopted, so the goal is to com-
repair model: GI , αGG , αBB and βBG . It is important pute the distribution p(GI , αGG , αBG , βBG | data). The
to note that the first three are defined in terms of the likelihood is most easily written in terms of θ, s2 , βBG

950
and also x, the true value of the unit used in the one-off tests. The deterministic relationships in Section 2.2 between the measurement model parameters (θ, s2) and the test model parameters then allow us to compute the posterior distribution of GI , αGG , αBB and βBG from that of θ, s2 and βBG .

3.1 The likelihood

If the measurements themselves are recorded then the likelihood for z1 , . . . , zm and y1 , . . . , yn is:

p(one-off data, first pass data | x, θ, s2) = [Π_{j=1..m} py|x (zj | x, s2)] × [Π_{i=1..n} py (yi | θ, s2)].  (3)

For the examples that we consider here, we use Equation 2 for py|x (zj | x, s2), and the Gaussian mixture model of Equation 1 for x, which means that py (yi | θ, s2) is a Gaussian mixture probability as in Equation 1 but with variances σk2 + s2 instead of σk2 . If interval censored data are recorded then the likelihood is Bernoulli for both zj and yi , with success probabilities

Pz = P(L ≤ y ≤ U | x, s2)  (4)

and

Py = P(L ≤ y ≤ U | θ, s2)  (5)

respectively. These probabilities are easily evaluated under the Gaussian mixture model. Hence:

p(one-off data, first pass data | x, θ, s2) = Pz^nz (1 − Pz)^(m−nz) Py^ny (1 − Py)^(n−ny),  (6)

where ny and nz are the number of units passing in the first pass and one-off tests respectively.

The second-pass test data is always simply pass/fail. The probability of passing is Ps = P(pass 2nd test | fail 1st test). Applying the partition law and Bayes' law we can show that:

Ps = [αGG (1 − αGG) GI + αGG βBG αBB (1 − GI) + (1 − αBB)(1 − βBG) αBB (1 − GI)]
     / [(1 − αGG) GI + αBB (1 − GI)].  (7)

The likelihood for ns units passing from n2 is Bernoulli:

p(ns | n2 , αGG , αBB , βBG) = Ps^ns (1 − Ps)^(n2−ns).  (8)

3.2 Computing the posterior distribution

Currently we assume flat prior distributions for all model parameters, although we recognise that a lot of useful information could be incorporated into a prior that could improve the estimation, particularly in the case of censored data where the information in the data can be weak. We compute the posterior distribution of the test parameters p(GI , αGG , αBB , βBG | all data) as follows. We have found that the second pass data contain little information about GI , αGG and αBB so we ignore it and use the approximation:

p(GI , αGG , αBB , βBG | all data) ≈ p(GI , αGG , αBB | one-off data, first pass data) × p(βBG | second pass data, ĜI , α̂GG , α̂BB ),  (9)

where ĜI , α̂GG and α̂BB are posterior means that are computed from p(GI , αGG , αBB | one-off data, first pass data). Therefore our task is reduced to the computation of p(GI , αGG , αBB | one-off data, first pass data) and p(βBG | second pass data, ĜI , α̂GG , α̂BB ).

The term p(βBG | second pass data, ĜI , α̂GG , α̂BB ) is straightforward, being proportional to the likelihood of Equation 8 with fixed values of GI = ĜI , αGG = α̂GG and αBB = α̂BB . This is computed directly on a discrete grid of values of βBG , once the posterior means ĜI , α̂GG and α̂BB have been computed from p(GI , αGG , αBB | one-off data, first pass data).

The first term is more difficult. It is done by simulating values of θ and s2 from p(θ, s2 , x | one-off data, first pass data). For each sample, we compute (GI , αGG , αBB ) via the equations in Section 2.2. The sample averages of these are the posterior means ĜI , α̂GG and α̂BB . The samples are smoothed, via a kernel density estimate, to an approximation of p(GI , αGG , αBB | one-off data, first pass data).

The way θ and s2 are simulated from p(θ, s2 , x | one-off data, first pass data) is different according to whether the data are exact or censored. This is described in the subsections below.

3.2.1 Exact data case
By Bayes' law, p(θ, s2 , x | one-off data, first pass data) is proportional to Equation 3. For the Gaussian mixture model that we are proposing, recall that θ = {pk , μk , σk2 | k = 1, . . . , K}. We note that if we define κk2 = σk2 + s2 then we can reparameterise (θ, s2) as θ∗ = {pk , μk , κk2 | k = 1, . . . , K} and s2 , with the restriction κk2 > s2 . This allows us to factorise the posterior:

p(θ∗, s2 , x | one-off data, first pass data) ∝ p(s2 , x | one-off data) p(θ∗ | first pass data).  (10)

To compute p(θ∗ | first pass data), the standard Bayesian approach to fitting a mixture model is by Monte Carlo simulation and requires a time-consuming reversible jump MCMC (Richardson and Green 1997). This is too slow for the practical implementation of this method, where we expect the test engineer to interact with the inference algorithm as a test sequence is designed. We adopt a much faster although more crude alternative that assumes that each set of component means and variances is independent, and so the posterior distribution is evaluated for each separately. This independence assumption is used in the variational Bayes approximation to mixtures (Constantinopoulos and Likas 2007). The number of components in the mixture K is determined by an initial fit using the fast message length algorithm (Figueiredo and Jain 2002). This method also gives point estimates of mixture means, variances and weights, which we denote μ̂k , κ̂k2 and p̂k . A posterior distribution on the μk , κk2 and pk is fitted around these point estimates by first assigning each first pass observation to the mixture component with the smallest Mahalanobis distance (yi − μ̂k )/κ̂k . A posterior distribution for each component mean and variance pair (μk , κk2 ) is then computed separately using the observations assigned to it under a Gaussian likelihood; the result is the standard conjugate normal-inverse gamma posterior for each (μk , κk2 ) separately (Gelman et al. 2003). Components with no observation assigned are eliminated. The posterior distribution of the pk is Dirichlet, with the parameter given to pk equal to the number of observations assigned to it; this is again the conjugate posterior distribution for the pk . Thus we make the approximation:

p(θ∗ | first pass data) ≈ [Π_{k=1..K} p(μk , κk2 | first pass data assigned to k)] × p(p1 , . . . , pK ).  (11)

The other distribution p(s2 , x | one-off data) is another conjugate normal-inverse gamma, since the data are a normal random sample with mean x and variance s2 . Values of θ∗ and s2 , and hence θ and s2 , are easily simulated from these distributions, as it merely requires sampling from Gaussian, inverse gamma and Dirichlet distributions, with the only difficulty being that the samples must satisfy κk2 > s2 .

3.2.2 Censored data case
Here p(θ, s2 , x | one-off data, first pass data) is proportional to Equation 6, and we only attempt to fit a single component Gaussian model. Since in this case θ = (μ, σ2), there are only 4 unknown parameters (μ, σ2 , s2 and x) and this distribution can be evaluated on a discrete grid, from which values of (θ, s2) are simulated.

4 EXAMPLES

Two examples are shown. One is simulated data and the other is a real data example. In both cases the posterior distribution of the test properties was computed using the method of Section 3.2.

Figure 2. Posterior distributions of (clockwise from top left) GI , αGG , βBG and αBB from simulated data. True parameter values are shown by a vertical line.

Data were simulated from the Gaussian mixture model with 3 components. The means, variances and weights of the components were (3.0, 3.5, 20.0), (0.2², 0.4², 0.3²) and (0.2, 0.7, 0.1). The measurement variance is s2 = 0.005² and the true value of the unit used in the one-off tests is x = 3.05. Units are accepted if they are measured to be in the interval (2, 4). This leads to true test model parameter values of GI = 0.826, αGG = 0.9992 and αBB = 0.9966; βBG was defined to be 0.9. Finally, sample sizes were n = 5000, m = 500 and n2 = 10. Note that the data are mainly concentrated in the accept interval but that there is a group of observations very far from that interval, centered around 20.0. Also note that the second pass data size is small; since few units fail the test, this is also typical. The presence of a group of extreme outliers like this is common in data from real tests. Figure 2 shows the marginal posterior distributions of GI , αGG , αBB and βBG . We see that the method has recovered the true test parameter values quite well. The much greater posterior variance for

βBG reflects the far smaller sample size of the second pass test. Figure 3 shows the first pass data, with the fitted Gaussian mixture model, showing that the model has captured the multi-modality of the data easily.

Figure 3. Histogram of first pass data with fitted Gaussian mixture model and accept interval limits (vertical lines).

Figure 4. Posterior distributions of (clockwise from top left) GI , αGG , βBG and αBB for the real test data.

Figure 4 shows the analysis of data from a real test. The accept limits for this test are (−40, −24). Sample sizes are n = 878 (of which 807 passed), m = 22 (of which all passed, and the sample standard deviation was 0.07) and n2 = 3 (of which 2 passed). The analysis gives posterior means and (2.5%, 97.5%) probability intervals as: GI = 0.92 (0.89, 0.95); αGG = 0.9999 (0.9996, 1.000); αBB = 0.999 (0.996, 1.000); βBG = 0.60 (0.19, 0.93). To illustrate the effect of censoring, we take these data and assume that they are interval censored to (−40, −24). We then fit a single component Gaussian model to the censored data. The analysis now gives posterior means and (2.5%, 97.5%) probability intervals as: GI = 0.950 (0.871, 0.997); αGG = 0.939 (0.880, 0.993); αBB = 0.697 (0.380, 0.962); βBG = 0.55 (0.13, 0.99). We see that the posterior distributions have considerably higher variance, reflecting the loss of information from the censoring.

5 CONCLUSIONS

We have presented a Bayesian approach to estimating the test properties of an imperfect test and repair model from test measurement data. For the sort of data that we describe, many statistical estimation methods have problems because of the highly varied properties of the measurement data. Our method can cope with outliers and multi-modality.

Several issues have come out of the work so far. First, more work can be done to specify informative priors that could help analyses, especially in the censored data case. Second, we have employed quite a crude estimate of the posterior of the mixture model parameters, by fitting each component separately to the observations that are closest to it. In our experience, comparing the measurement data to the fitted Gaussian mixture, this appears to work well, particularly because in most real cases the mixture components are well separated in the data. The approximation, by imposing independence between mixture component parameters, is in the spirit of the variational Bayes approximation. Nevertheless, a better study to evaluate how well this approximation performs is necessary.

Another issue is that we have left βBG to be defined directly rather than through the measurement model, as is the case with the other test properties. This means that we have no way of using the test measurements to infer anything more than the probability that a repair succeeds or not. A model for the effect of repair on the measured value would allow a more detailed understanding of the performance of the repair process.

ACKNOWLEDGEMENTS

This research is work of the Centre for Telecommunications Value-Chain Research (http://www.ctvr.ie), supported by Science Foundation Ireland under grant number 03/CE3/I405. Bell Labs Ireland research was also partly funded by a grant from the Irish Development Agency.

REFERENCES

Constantinopoulos, C. and A. Likas (2007). Unsupervised learning of Gaussian mixtures based on variational component splitting. IEEE Transactions on Neural Networks 18, 745–755.
Dick, J.H., E. Trischler, C. Dislis, and A.P. Ambler (1994). Sensitivity analysis in economic based test strategy planning. Journal of Electronic Testing: Theory and Applications 5, 239–252.
Figueiredo, M. and A.K. Jain (2002). Unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 24, 381–396.
Fisher, E., S. Fortune, M. Gladstein, S. Goyal, W. Lyons, J. Mosher, and G. Wilfong (2007a). Economic modeling of global test strategy I: mathematical models. Bell Labs Technical Journal 12, 161–174.
Fisher, E., S. Fortune, M. Gladstein, S. Goyal, W. Lyons, J. Mosher, and G. Wilfong (2007b). Economic modeling of global test strategy II: software system and examples. Bell Labs Technical Journal 12, 175–186.
Gelman, A., J.B. Carlin, H.S. Stern, and D.B. Rubin (2003). Bayesian Data Analysis (Second ed.). London: Chapman and Hall.
Richardson, S. and P. Green (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society, Series B 59, 731–792.
Risk and evidence based policy making

Environmental reliability as a requirement for defining environmental impact limits in critical areas

E. Calixto & Emilio Lèbre La Rovere
UFRJ-COPPE, Rio de Janeiro, Brasil

ABSTRACT: The main objective of this study is to define reliability requirements in relation to environmental impacts in critical areas, in terms of environmental resource sensitivity. Nowadays many enterprises in Brazil are evaluated in this area in terms of many different environmental requirements, but the environmental impact of the enterprise or the group of enterprises as a whole is not assessed, and nor are their future modifications. When the number of enterprises in a specific area increases, the risk of accidents also rises. In other words, reliability gets worse over time. Unfortunately, most cases in Brazil do not take into account the entire enterprise risk impact in a specific area and the decrease of reliability over time.

The methodology in question takes into account all the critical events, taking place over time, which cause a serious environmental impact for each enterprise in the same area. By taking into account all relevant events, it is possible to produce the Environmental Block Diagram, which covers all related events and their probability of occurring over time. This means that a failure in any block represents an accident with a potential environmental impact.

The environmental reliability target is associated with the tolerable number of environmental impacts in a specific area, taking into account all events over a specific period of time. The tolerable number of accidents depends on social perception and environmental sensitivity.

For this analysis a Monte Carlo simulation has to be carried out over a period of time in order to define the Environmental Availability and Environmental Reliability related to the number of tolerable events. Moreover, in the case of any enterprise modifications or an increase in the number of enterprises, a new block will be inputted into the Environmental Block Diagram and the new results will be assessed.

1 INTRODUCTION

Nowadays many environmental requirements are imposed on enterprises in order to preserve environmental conditions and avoid serious environmental impacts. In Brazil there are specific laws which stipulate specific risk analyses, procedures and waste limits for enterprises depending on their characteristics and potential environmental impacts.

In the Oil and Gas industry, the law is even more strict due to previous catastrophic events, such as the accidents in Guanabara Bay, the P-36 platform incident, and so on.

Even regarding the worst possibilities of accidents and environmental impacts, no methodology takes into account groups of enterprises and analyzes their impact on the environment over time. Therefore, there is no strategy to evaluate a specific area regarding the overall potential environmental impact.

Tolerable environmental sensitivity is related to the ability of ecosystems to adapt or react to the number of potential environmental impacts over time in relation to all the enterprises in a specific area. Therefore, based on environmental sensitivity it is possible to define a limit in terms of the number and types of environmental accidents in order to set the reliability target for a specific area, taking all enterprises as a single system. The reliability target is environmental reliability, and the protection level will be stipulated both for each individual enterprise and for the group of enterprises in a specific area in order to preserve the environment over time.

To clarify the methodology, a case study will be carried out of a group of enterprises with an environmental impact in a specific area.

2 ANALYSIS METHODOLOGY

In order to perform this analysis a number of important steps have to be followed to achieve consistent results which can provide support for decision making about enterprise limits and environmental reliability targets for specific areas.

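The core of this methodology is treating the group of enterprises as a single series system: the area suffers an environmental impact if any one enterprise has an accident. A minimal sketch of that composition is given below. It is not from the paper; it assumes a constant accident rate (exponential survival model) for each enterprise, and the three rates are purely hypothetical.

```python
import math

def enterprise_reliability(t, rate):
    # Probability that one enterprise causes no environmental accident up to
    # time t, under a constant-accident-rate (exponential) assumption.
    # The paper instead fits the best PDF to historical accident data.
    return math.exp(-rate * t)

def area_environmental_reliability(t, rates):
    # Series Environmental Block Diagram: an accident at ANY enterprise is an
    # environmental impact for the area, so block reliabilities multiply.
    result = 1.0
    for rate in rates:
        result *= enterprise_reliability(t, rate)
    return result

# Hypothetical accident rates (events/year) for three enterprises in one area.
rates = [1.0e-3, 5.0e-4, 2.0e-3]
for years in (5, 10, 20):
    print(years, area_environmental_reliability(years, rates))
```

Under these assumptions the area reliability decays with the sum of the enterprise rates, so adding a new enterprise (a new block in series) can only lower the area's environmental reliability, which is the effect the methodology is designed to quantify.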
The first step is to discover the sensitivity of the ecosystem in terms of environmental impacts. For this, the characteristics of the ecosystem have to be analyzed and its limits estimated. This is very difficult in many cases and almost impossible in others due to the complex features of ecosystems. Therefore, it is advisable that ecosystems be compared with others to estimate limits regarding environmental impacts. In this case, it is important to be conservative as regards the tolerable limits of events in order to preserve the environment.

After the environmental impact limits have been defined, the enterprises and their potential environmental accident impacts have to be studied. In this case historical accident data has to be analyzed and the probability density function established to discover the accident frequency over time. In many cases accident frequency is considered constant over time, but this is not true in all situations. This concept significantly influences the analysis, because some accidents increase in frequency in a specific period of time, which in turn allows these accidents to be discussed and in some cases leads to the discovery of their causes and the consequent proposal of preventive action in order to avoid future accidents.

The final step is to group all the enterprises and simulate events over time. The Monte Carlo simulation will be used, with the group of enterprises being represented by the Block Diagram Methodology. This analysis requires that the group of enterprises be taken as a single system. Each enterprise will be represented by a specific block and all blocks will be in series. This means that in the case of accidents the system will impact the environment, and the reliability of the system environment will be influenced by the enterprises' reliability.

After the simulation the system reliability will be analyzed and it will be possible to discover whether or not the target has been achieved and whether the number of accidents is higher. In negative cases, it is also possible to find out how much improvement is necessary to achieve the reliability target. The methodology is summarized below in Figure 1.

Figure 1. Environmental reliability analysis methodology: 1 – Environment Sensitivity → 2 – Critical Events → 3 – Environment Diagram Block → 4 – Simulation → 5 – Critical analysis → 6 – Conclusion.

3 ENVIRONMENTAL SENSITIVITY

Environmental sensitivity in some specific areas can involve issues with social, economic and environmental impacts in the case of accidents. To facilitate the understanding of environmental sensitivity, ESI maps were drafted to serve as quick references for oil and chemical spill responders and coastal zone managers. They contain three kinds of information:

Shorelines are ranked based on their physical and biological character, then color-coded to indicate their sensitivity to oiling.

Sensitive biological resources, such as seabird colonies and marine mammal hauling grounds, are depicted by shaded polygons and symbol icons to convey their location and extent on the maps.

ESI maps also show sensitive human-use resources, such as water intakes, marinas, and swimming beaches.

In the USA at present, project scientists have created collections of ESI maps, called ESI atlases, for most coastal areas, including Alaska. To do this, vulnerable coastal locations have to be identified before a spill happens, so that protection priorities can be established and cleanup strategies identified. To meet this need, NOAA OR&R researchers, working with colleagues in state and federal governments, have produced Environmental Sensitivity Index (ESI) maps. An example section from an ESI map appears in Figure 2 below.

The Environmental Sensitivity Index (ESI) project team has developed a systematic method for creating ESI maps. Others are welcome to adopt this method when it proves useful to them. This section gives an introduction to the basic elements of ESI maps. ESI maps include three kinds of information, delineated on maps by color-coding, symbols, or other markings:

• Shoreline Rankings: Shorelines are ranked according to their sensitivity, the natural persistence of oil, and the expected ease of cleanup.
• Biological Resources: Oil-sensitive animals, as well as habitats that either (a) are used by oil-sensitive animals, or (b) are themselves sensitive to spilled oil (e.g., coral reefs).

• Human-Use Resources: Resources and places important to humans and sensitive to oiling, such as public beaches and parks, marine sanctuaries, water intakes, and archaeological sites.

Figure 2. Environmental sensitivity.

The Shoreline Rankings have been defined on the basis of factors that influence sensitivity to oiling, including substrate grain size, permeability, trafficability, and mobility; the slope of the intertidal zone; the relative degree of exposure of the physical setting; ease of cleanup; and biological productivity and sensitivity. A ranking of 1 represents the shorelines least susceptible to damage by oiling, and 10 represents the locations most likely to be damaged.

Habitat is the single most important influence on the impacts of oil in marine ecosystems (API 1985; NAS 1985). Intertidal habitats are exposed to much higher concentrations of oil than subtidal habitats (Ballou et al. 1987). Benthic (sea bottom) habitats are generally more affected by contact with oil than pelagic (open water) habitats. The two key intertidal habitat variables are exposure to wave action and substrate (Baker 1991). Table 1 below lists sensitivity rankings of marine habitats (excluding tropical) to oil, augmented from the American Petroleum Institute (API 1985).

Table 1. Sensitivity rankings of marine habitats.

Sensitivity ranking   Habitat type
High                  Saltmarsh
                      Sheltered Rocky Intertidal
                      Special Use (endangered species/marine protected areas)
Medium–High           Seagrass Meadow (low intertidal to shallow subtidal)
Medium                Open Water Enclosed Bays and Harbours
Low–Medium            Exposed Sand/Gravel/Cobble Intertidal
Low                   Exposed Rocky Intertidal
                      Kelp Forest Subtidal
                      Open Water, Non-enclosed Nearshore and Offshore
                      Soft Bottom to Rocky Subtidal

Sheltered habitats with fine-grained sediments are highly sensitive, whereas exposed rocky shores have a relatively low sensitivity to oil pollution. In sheltered, fine-grained habitats oil tends to linger, whereas on exposed rocky shores oil is subject to rapid removal by wave action. Moreover, rocky intertidal species are adapted to counteracting the stressful effects of desiccation and these adaptations can help them against oil. The importance of habitat is also reflected in the Vulnerability Index and habitat recovery generalizations (excluding tropical habitats) listed in Table 2.

In relation to Biological Resources, certain animal and plant species are especially vulnerable to the effects of oil spills. Under the ESI method, these species have been classified into seven categories, each further divided into sub-categories of species similar in their sensitivity to spilled oil. Many species that are vulnerable to oil are wide-ranging, and may be present over large areas at any time. These species can be especially vulnerable at particular times and places. ESI maps show where the most sensitive species, life stages, and locations exist, but do not necessarily show the entire area where members of a sensitive species occur.

In Human-Use Resources, the exact locations of some archaeological and cultural resources cannot be disclosed because of the risk of vandalism. Either these locations are shown within a polygon enclosing a larger area, or a map symbol is placed near to, but not at, the exact location. People using the ESI method to map human-use resources are encouraged to denote not only surface water intakes, but also groundwater recharge zones and well fields.

Although this methodology defines the most critical areas in the case of accidents, it is not enough to know the exact environmental limits of accident impacts. Also necessary is specialist knowledge of environmental behavior in the case of accidents, or to carry out a simulation of environmental effects. The most usual means is to look for a similar environment area that was affected by similar accidents and to evaluate its effects and environment behavior in order

to define limits for accident impacts. When doing this it is necessary to be conservative in defining limits in order to preserve environments, because in most cases environment limits are not easy to define.

Table 2. Vulnerability index and habitat recovery generalizations.

Vulnerability index   Shoreline type                 Comments
10                    Marine Wetlands                Very productive aquatic ecosystems; oil can persist for decades
9                     Sheltered Tidal Flat /         Areas of low wave energy and high biological
                      Boulder Barricade Beach        productivity; oil may persist for decades
8                     Sheltered Rocky Coast          Areas of reduced wave action; oil may persist for over a decade
7                     Gravel Beach                   Same as index 6; if asphalt pavement forms at high spring
                                                     tide level it will persist for decades
6                     Mixed Sand/Gravel Beach        Oil may undergo rapid penetration/burial; under moderate to
                                                     low-energy conditions oil may persist for decades
5                     Exposed Compacted Tidal Flat   Most oil not likely to adhere to or penetrate the compacted sediments
4                     Coarse Sand Beach              Oil may sink and/or be buried rapidly; under moderate to
                                                     high-energy conditions oil likely removed naturally within months
3                     Fine Sand Beach                Oil does not usually penetrate far into the sediment; oil may
                                                     persist several months
2                     Eroding Wavecut Platform       Wave-swept; most oil removed by natural processes within weeks
1                     Exposed Rocky Headland         Wave reflection keeps most oil offshore

Note: 10 = most vulnerable, 1 = least vulnerable; the index is a qualitative rank order.

4 ENVIRONMENTAL RELIABILITY

The concept of reliability is well known in industry and has many different applications. It means the probability that a system, subsystem or equipment will work properly for a specific period of time. The reliability function requires historical data and uses methods like least squares to establish the PDF (probability density function) that best fits the historical data. The reliability function is as follows:

R(t) = 1 − ∫0^t f(t)dt

Depending on the PDF, the form of the reliability function can differ.

The reliability concept can be used in environmental analysis in order to establish the probability of an environmental impact not occurring in a specific period of time. It is possible to stipulate environmental reliability targets to limit the quantity of environmental impacts and increase the level of safety protection in one or more enterprises. Figure 3 represents the environmental reliability of oil spills in Japan, in relation to the worst events. Most of the events have occurred in the last 30 years, due to the increase in oil transport. The best PDF that represents the events is the Gumbel, with a correlation of 0.9484.

Figure 3. Environmental reliability of oil spills in Japan (Reliability vs Time plot, Gumbel-2P fit to 16 events).

The remarkable aspect is that the frequency of this event is not constant, as is usually assumed in most risk analyses, but changes over time. The frequency index is:

λ(t) = f(t) / R(t)

Figure 4 below shows the frequency index, which is almost constant until 10 years and then starts to increase.

Figure 4. Event frequency (Failure Rate vs Time plot).

Environmental reliability can be used to prioritize critical areas or enterprises in terms of environmental impact risk, which can provide useful support for decision making related to the allocation of emergency resources. It should be noted that priorities will change over time depending on the PDF of the critical events analyzed. Figure 5 below shows the environmental reliability curves for different oil spill areas in the UK. In relation to environmental reliability it can be seen that oil spills in the UKCS, represented by the red curve, are the most reliable event over time. This means that it is not a critical area requiring the investment of emergency resource allocations during the first thirty years.

Figure 5. Environmental reliability of oil spills in the UK.

On the other hand, it is also necessary to analyze what the historical data represents and whether the behavior will remain the same over time. In the UK oil spill area, the first events are related to accidents with ships, while the final ones involve platform accidents and other types of occurrence. In this particular case, it is most advisable to consider the last ten years.

5 ENVIRONMENTAL AVAILABILITY

Environmental availability is the part of total time when there is no environmental impact in an area. The concept takes into account the number of environmental impacts and their duration, as shown in the equation below (D(t) = availability).

D(t) = Σ_{i=1..n} ti / Σ_{i=1..n} Ti

Where D(t) is environmental availability, t is the duration of the event, T is the lifetime duration and i represents the time when the undesired event occurred.

It is possible in this way to represent the actions of emergency teams to alleviate environmental impacts and to assess their performance in relation to various aspects, such as the time taken for the emergency team to reach the accident area and the duration of emergency actions to control the accident and eliminate the environmental impact.

To represent a group of events or enterprises, an environmental block diagram is used covering the whole system or group of events that cause the environmental impact. The main idea is to take into account a single group of events that can occur in a specific area or a group of enterprises which have a potential environmental impact.

In Brazil the total environmental impact of a group of enterprises in critical environmental areas is not taken into account. Nevertheless, the ecosystem can only support a limited number of environmental impacts and this number has to be taken into account when a group of enterprises is being evaluated.

6 CASE STUDY

To find out the maximum number of environmental impacts and their intensity, ecosystems and their characteristics have to be analyzed. The result is related to environmental reliability and can be estimated using the Monte Carlo Simulation Methodology. Figure 6 represents the Santos Basin, which is considered to be a sensitive environmental area. Therefore, no catastrophic events in this area are tolerable, and to find out if enterprises are reliable the Environmental Reliability Methodology has to be followed.

There are many different types of enterprises and oil spill sources, but in relation to drilling activities the worst events are surface, submarine and underground blowouts. The probabilities of these events, considered constant over time, are shown in Table 3 below.

When carrying out the Monte Carlo Simulation, if it is intended not to permit any kind of catastrophic event, the maximum level of drilling in the area has to keep the number of catastrophic events below one.

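The direct simulation described for the case study can be sketched as follows. This is an illustrative reconstruction rather than the authors' implementation: it treats the Table 3 blowout probabilities as per-drill, per-year event probabilities (an assumption), places the drills in series as in the Environmental Block Diagram, and estimates the expected number of catastrophic events and the environmental reliability (probability of zero events) over the 20-year lifetime. The trial count and seed are arbitrary.

```python
import random

# Blowout probabilities from Table 3; treating them as per-drill, per-year
# event probabilities is an assumption made for this sketch.
BLOWOUT_PROBS = {
    "underground": 1.50e-4,
    "submarine":   1.80e-4,
    "surface":     1.62e-6,
}

def simulate_area(n_drills, lifetime_years, trials=5000, seed=42):
    # Direct Monte Carlo over the area's lifetime: the drills form a series
    # system, so a blowout at any drill counts as an environmental impact.
    rng = random.Random(seed)
    p_any = sum(BLOWOUT_PROBS.values())
    total_events = 0
    no_event_runs = 0
    for _ in range(trials):
        events = 0
        for _ in range(n_drills * lifetime_years):
            if rng.random() < p_any:
                events += 1
        total_events += events
        no_event_runs += (events == 0)
    expected_events = total_events / trials
    environmental_reliability = no_event_runs / trials
    return expected_events, environmental_reliability

exp_events, env_rel = simulate_area(n_drills=10, lifetime_years=20)
print(exp_events, env_rel)
```

Under these assumptions, ten drilling activities give an expected number of catastrophic events well below one over 20 years, which is broadly in the spirit of the case-study conclusion that a maximum of ten drilling activities is tolerable in the area.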
Figure 6. Area of Santos Basin.

Table 3. Blowout events.

Blowout                Probability

Underground Blowout    1.50E-04
Submarine Blowout      1.80E-04
Surface Blowout        1.62E-06
Blowout (total)        3.33E-04

The main objective of the methodology is to estimate the period of time in which the analyzed event happens, based on its PDF characteristics. In this direct simulation it is possible to analyze the life of the system as a whole, to find out which event occurred and when it happened, and to get an idea of its direct and indirect impacts and of the environmental reliability and availability for a specific period of time. For instance, a group of events involving drilling activity in the Campos Basin is represented in an environmental block diagram in which each block takes into account blowout events over the lifetime of the drilling. Each drilling activity has three catastrophic events, represented in Figure 7 below. The events are in series in the block diagram because the occurrence of any one of them represents an environmental impact on the system as a whole.

Submarine - Surface - Underground

Figure 7. Drill block diagram.

The next step is to represent the group of enterprises located in the same area and to estimate environmental availability and reliability limits. In the same way that the events are in series within an individual enterprise, groups of enterprises are also in series. Figure 8 below represents the drilling group of the Campos Basin. The same block diagram is also used for the analysis.

Drill 1 - Drill 2 - Drill 3 - Drill 4 - Drill 5 - Drill 6

Figure 8. Drill block diagram.

The direct simulation will take into account the three catastrophic events for all drilling activities over time. As a result it is possible to show the number of drilling activities in this area over a lifetime of 20 years, as illustrated in Table 4 below.

The first column represents the number of drilling activities and the second the environmental availability. Environmental reliability is given in the fourth column, while the number of expected catastrophic events (CE) is stated in column five. Finally, column six gives the emergency team capacity (ETC). The maximum number of drilling activities in this area is ten, because up to that level the number of expected catastrophes is lower than one and there is one hundred percent emergency team availability. The emergency team capacity represents, over the total time during which catastrophic events take place, the availability of the emergency team to respond. Unavailability to respond to events results in worse damage to the environment.

It is important to state that drilling is only one part of the oil production chain. Therefore, it is advisable to take into account all platform and ship transport in the area and the related catastrophic events.

In this case the probabilities were considered to be constant over time due to the lack of historical data. This is an important and very influential assumption, because it implies that the density of events is uniform over time. However, this is actually not the case: blowout events have a high probability of occurring early in the drilling life. Therefore, a lognormal PDF would represent this kind of event better and indicates that it is advisable to be more careful at the beginning. Figure 9 shows the blowout frequency index.

The other important aspect is to know which events are the most critical to the system as a whole. In the case in question it is the underground blowout, as shown in Figure 10 below. To be able to identify the most critical events with the greatest impact on environmental availability, it is essential to assess the environmental availability of each event. In fact, the event with the lowest environmental availability will impact most on the system as a whole.

Table 4. Monte Carlo simulation results.

Monte Carlo Simulation (20 years)

Drills   Env Avail   Env Unavail   Env Reliab   CE     ETC      EIT

 1       99.99%      0.01%         96.80%       0.03   100%       23
 5       99.92%      0.08%         82.00%       0.20   100%      141
10       99.84%      0.16%         64.40%       0.40   100%      283
20       99.69%      0.31%         46.00%       0.74   99.47%    523
40       99.42%      0.58%         25.20%       1.41   99.44%   1017
80       98.81%      1.19%          6.40%       2.86   98.77%   2082
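A direct simulation of the kind summarized in Table 4 can be sketched as follows. This is an illustrative reconstruction, not the study's actual code: it assumes the Table 3 blowout probabilities act as constant per-drill, per-year rates (the paper does not state the time unit), draws exponential inter-event times, and treats all drills and their three blowout modes as one series system in which any event counts as an environmental impact.

```python
import random

# Assumption: Table 3 blowout probabilities treated as constant
# per-drill, per-year occurrence rates (time unit not stated in text).
RATES = {
    "underground": 1.50e-4,
    "submarine": 1.80e-4,
    "surface": 1.62e-6,
}

def simulate(n_drills, lifetime=20.0, n_runs=20000, seed=42):
    """Direct Monte Carlo simulation of n_drills in series.

    Returns (environmental reliability, expected catastrophic events):
    reliability is the fraction of runs with no blowout at all during
    the lifetime; the second value is the mean number of blowouts.
    """
    rng = random.Random(seed)
    # Blocks in series: constant rates simply add across events and drills.
    total_rate = sum(RATES.values()) * n_drills
    survivals = 0
    events = 0
    for _ in range(n_runs):
        t, n = 0.0, 0
        while True:
            # Inter-arrival time of the next blowout anywhere in the
            # system is exponential with the summed rate.
            t += rng.expovariate(total_rate)
            if t > lifetime:
                break
            n += 1
        events += n
        if n == 0:
            survivals += 1
    return survivals / n_runs, events / n_runs

for drills in (1, 10, 80):
    rel, ce = simulate(drills)
    print(f"{drills:3d} drills: reliability {rel:.2%}, expected events {ce:.3f}")
```

Because the per-year rates are an assumption, the absolute numbers will not match Table 4; the qualitative pattern (reliability falling and expected catastrophic events rising with the number of drills, with the drilling limit set where expected events stay below one) is the point.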

This happens because in the block diagram, when a group of blocks is in series, the system availability will be lower than the lowest block availability; the event with the lowest environmental availability therefore drives the environmental availability of the whole system.

Figure 9. Drill blowout frequency (failure rate vs. time, ReliaSoft BlockSim 7 plot).

Figure 10. Critical system events (availability by event; the underground blowout is the most critical).

In some cases certain events will happen many times but will not have an impact over time, unlike others which happen much less frequently.


7 CONCLUSION

Environmental reliability is a powerful tool to support decision making related to environmental protection: defining limits for enterprises with reliability requirements, defining the number of enterprises, and establishing the most vulnerable areas for the location of emergency teams.

Unlike the usual methodology, it is possible to consider a group of enterprises and their critical events in a simulation over a specific period of time. The difficulty lies in obtaining historical data about events and in defining environmental limits for specific areas.

In the case of emergency teams, it is assumed that they will be in the correct position and that all procedures and actions will happen correctly, avoiding any delay. In real life this does not happen; therefore the specific model has to be evaluated taking into account the performance of emergency teams.

The remarkable point about historical data is understanding why accidents happen and whether the data fit well enough to be used in the current simulation case.

In this case study only drilling activities which affected a specific area were taken into account, but in addition all enterprises and the lifetimes of platforms and ships also have to be considered.

The next step in the case study is to consider all enterprise data which have an influence on environmental sensitivity in the area in question. Because of the environmental effects of other enterprises, drilling limits will probably be reduced in order to keep the number of catastrophic accidents lower than one during the lifetime in question.

REFERENCES

A.M. Cassula, "Evaluation of Distribution System Reliability Considering Generation and Transmission Impacts", Master's Dissertation, UNIFEI, Nov. 1998.
API (American Petroleum Institute). 1985. Oil spill response: Options for minimizing ecological impacts. American Petroleum Institute Publication No. 4398. Washington, DC: American Petroleum Institute.
Ballou, T.G., R.E. Dodge, S.C. Hess, A.H. Knap and T.D. Sleeter. 1987. Effects of a dispersed and undispersed crude oil on mangroves, seagrasses and corals. American Petroleum Institute Publication No. 4460. Washington, DC: American Petroleum Institute.
Barber, W.E., L.L. McDonald, W.P. Erickson and M. Vallario. 1995. Effect of the Exxon Valdez oil spill on intertidal fish: A field study. Transactions of the American Fisheries Society 124: 461–476.
Calixto, Eduardo & Schmitt, William. "Análise RAM do projeto Cenpes II". ESREL 2006, Estoril.
Calixto, Eduardo. "The enhancement availability methodology: a refinery case study". ESREL 2006, Estoril.
Calixto, Eduardo. "Sensitivity analysis in critical equipments: the distillation plant study case in the Brazilian oil and gas industry". ESREL 2007, Stavanger.
Calixto, Eduardo. "Integrated preliminary hazard analysis methodology regarding environment, safety and social issues: The platform risk analysis study". ESREL 2007, Stavanger.
Calixto, Eduardo. "The safety integrity level as hazop risk consistence: the Brazilian risk analysis case study". ESREL 2007, Stavanger.
Calixto, Eduardo. "The non-linear optimization methodology model: the refinery plant availability optimization case study". ESREL 2007, Stavanger.
Calixto, Eduardo. "Dynamic equipments life cycle analysis". 5th International Reliability Symposium SIC 2007, Brazil.
IEEE Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems, IEEE Std. 493-1997.
Kececioglu, Dimitri and Sun, Feng-Bin. Environmental Stress Screening: Its Quantification, Optimization and Management. Prentice Hall PTR, New Jersey, 1995.
Lafraia, João R. Barusso. Manual de Confiabilidade, Mantenabilidade e Disponibilidade. Qualimark, Rio de Janeiro, Petrobras, 2001.
Monteiro, Aline Guimarães. Metodologia de avaliação de custos ambientais provocados por vazamento de óleo: O estudo de caso do complexo REDUC-DTSE. Rio de Janeiro, 22/12/03, COPPE/UFRJ.
Moraes, Giovanni de Araujo. Elementos do Sistema de Gestão de segurança, meio ambiente e saúde ocupacional. Gerenciamento Verde Consultoria, Rio de Janeiro, 2004.
R. Billinton and R.N. Allan. "Reliability Evaluation of Engineering Systems: Concepts and Techniques", 1st Edition, Plenum Press, New York, 1983.
ReliaSoft Corporation. Weibull++ 6.0 Software Package, Tucson, AZ, www.Weibull.com.
Rolan, R.G. and R. Gallagher. 1991. Recovery of intertidal biotic communities at Sullam Voe following the Esso Bernica oil spill of 1978. Proceedings of the 1991 Oil Spill Conference, San Diego. American Petroleum Institute Publication No. 4529: 461–465. Washington, DC: American Petroleum Institute.
W.F. Schmitt. "Distribution System Reliability: Chronological and Analytical Methodologies", Master's.

Safety, Reliability and Risk Analysis: Theory, Methods and Applications – Martorell et al. (eds)
© 2009 Taylor & Francis Group, London, ISBN 978-0-415-48513-5

Hazardous aid? The crowding-out effect of international charity

P.A. Raschky & M. Schwindt


University of Innsbruck, Austria, and alpS, Centre for Natural Hazard Management Ltd., Innsbruck, Austria

ABSTRACT: Research suggests that public support for natural hazard mitigation activities is distorted due to
choice anomalies. For that reason, preparedness measures are often implemented to an insufficient extent. From
an ex-post perspective the lack of mitigation might result in a necessity for a risk-transfer. On the other hand,
based on the conclusions from the Samaritan’s Dilemma, the anticipation of relief in case of a disaster event
might induce individuals to diminish ex-ante protection activities.
In order to analyze the existence of this phenomenon in an international context, this paper discusses the
impact of expected foreign aid in case of a natural disaster on the level of disaster mitigation activities. The
results suggest that foreign aid in previous disaster years creates the expectation of future ex-post charity and thus crowds out risk-management activities. The paper concludes with propositions on raising awareness of natural hazards, aiming to counter the crowding-out of prevention.
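The crowding-out mechanism can be made concrete with a stylized numeric sketch. This is our own illustration, not the authors' model, and all numbers are hypothetical: a household chooses mitigation spending m to minimize m plus the expected disaster loss it still bears, and anticipated relief aid covers part of that loss.

```python
import math

def optimal_mitigation(p0, L, k, aid):
    """Mitigation m* minimizing cost(m) = m + p(m) * max(L - aid, 0),
    where p(m) = p0 * exp(-k * m) is the post-mitigation loss probability.
    """
    borne = max(L - aid, 0.0)
    # First-order condition: 1 = k * p0 * exp(-k * m) * borne,
    # so m* = ln(k * p0 * borne) / k, floored at zero.
    if k * p0 * borne <= 1.0:
        return 0.0
    return math.log(k * p0 * borne) / k

# Hypothetical numbers: 5% base loss probability, loss of 100 units,
# and anticipated relief aid covering 60 units of the loss.
no_aid = optimal_mitigation(p0=0.05, L=100.0, k=0.5, aid=0.0)
with_aid = optimal_mitigation(p0=0.05, L=100.0, k=0.5, aid=60.0)
print(no_aid, with_aid)  # anticipated aid lowers the optimal mitigation
```

With these numbers the anticipated transfer drives the chosen mitigation all the way to zero, which is the Samaritan's Dilemma logic the abstract refers to.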

1 INTRODUCTION

The recurrent pictures of disaster victims usually stem from developing or emerging countries and leave the impression that these countries are more frequently hit by natural disasters. However, according to Kahn (2005), developing countries do not experience significantly more natural disaster events; nonetheless, countries with lower levels of development suffer, ceteris paribus, more victims from natural catastrophes than developed countries. Of all large-scale natural disasters, epidemics, NA-TECH accidents, famines and technological accidents between 1994 and 2004, 79% happened in countries receiving OECD aid (see Table 1). In fact, research suggests that, apart from low GDP per capita, weak quality of institutions and high income inequality are good indicators for countries which suffer more casualties after a disaster (Anbarci et al. 2005, Kahn 2005).

In order to limit both the financial and the physical consequences of natural hazards, the implementation of ex-ante mitigating strategies, i.e. risk management, is important. Since these activities require detailed knowledge about the extent of the risk and the technical possibilities to mitigate such a risk, they are hard for individuals to put into action. But oftentimes the failure of sufficient preparedness starts earlier in the decision process and is based on a lack of risk awareness (Kunreuther & Pauly 2006).

Foreign aid usually flows to a great extent after catastrophes occur, aiming to moderate the consequences. Although it is paid ex-post, it can be used for catastrophe management, i.e. for medical support of the affected people as well as for the reconstruction of infrastructure. It is meant to limit financial as well as physical losses. On the other hand, based on the conclusions from the Samaritan's Dilemma, the anticipation of foreign aid in case of a disaster might induce people to diminish ex-ante protection activities (Buchanan 1975, Coate 1995), especially in the case of natural disasters, where the probability of occurrence is relatively low and the danger is underestimated. The question arising is whether the provision of foreign aid can induce adverse effects in the form of a reduction of ex-ante mitigation activities.

In order to build up sustainable development strategies and to support less-developed countries, an analysis of a country's dependency on foreign aid and of its vulnerability to large-scale disasters is essential. Both a theoretical and an empirical analysis of this relationship are still missing.

The remainder of the paper is structured as follows: The next section presents the theoretical background of our analysis. The framework incorporates the ideas of two fields in the literature on the economics of natural hazards. The first strand consists of the ideas of choice heuristics in decisions on low-probability-high-loss events, based on the theories by Kahneman & Tversky (1974). The second area touched upon is the phenomenon of charity hazard, an adaptation of the Samaritan's dilemma (Buchanan 1975, Coate 1995) to insurance decisions for disaster events. In Section 3 the connection between charity hazard and foreign aid is illustrated. In Section 4 preliminary descriptive statistics and a summary of the results of the study conducted by Raschky & Schwindt (2008) on

Table 1. Large-scale disasters (1994–2004) and development aid.

No.  Country  Year  Type  No. Killed  OECD Total Aid*  OECD Emerg. Aid*

1 Korea Dem P Rep 1995 Famine 610,000 4.72 0.83


2 Indonesia 2004 Tsunami 165,708 1,886.34 14.49
3 Sri Lanka 2004 Tsunami 35,399 540.29 83.37
4 Venezuela 1999 Flood 30,000 56.58 4.02
5 Iran Islam Rep 2003 Earthquake 26,796 111.73 36.9
6 Italy 2003 Heat wave 20,089
7 India 2001 Earthquake 20,005 1,186.97 79.29
8 France 2003 Heat wave 19,490
9 Turkey 1999 Earthquake 17,127 822.19 264.65
10 India 2004 Tsunami 16,389 2,427.62 8.48
11 Spain 2003 Heat wave 15,090
12 Honduras 1998 Hurricane 14,600 222.41 35.37
13 India 1999 Cyclone 9,843 1,165.99 17.4
14 Germany 2003 Heat wave 9,355
15 Thailand 2004 Tsunami 8,345 581.95 9.4
16 Japan 1995 Earthquake 5,297
17 Afghanistan 1998 Earthquake 4,700 102.1 59.24
18 Nigeria 1996 Meningitis 4,346 51.94 5.08
19 Burkina Faso 1996 Meningitis 4,071 218.5 1.54
20 Viet Nam 1997 Typhoon 3,682 1,256.58 5.87
21 China P Rep 1998 Flood 3,656 2,358.79 14.02
22 Nicaragua 1998 Hurricane 3,332 427.31 34.79
23 Niger 1995 Meningitis 3,022 131.25 2.14
24 India 1998 Cyclone 2,871 877.98 2.64
25 China P Rep 1996 Flood 2,775 3,423.18 7.62
26 Haiti 2004 Hurricane 2,754 291.57 56.94
27 Portugal 2003 Heat wave 2,696
28 Haiti 2004 Flood 2,665 291.57 56.94
29 India 1998 Heat wave 2,541 877.98 2.64
30 Afghanistan 2002 Unknown 2,500 1,291.94 654.39
31 Afghanistan 1998 Earthquake 2,323 102.1 59.24
32 Somalia 1997 Flood 2,311 47.85 34.27
33 Burkina Faso 1997 Meningitis 2,274 198.82 3.96
34 Algeria 2003 Earthquake 2,266 317.2 16.39
35 Taiwan (China) 1999 Earthquake 2,264
36 Papua New Guinea 1998 Tsunami 2,182 182.18 18.31
37 Tanzania Uni Rep 1997 Diarrhoeal 2,025 640.23 11.54
38 India 1994 Flood 2,001 1,984.16 1.7
39 Zaire/Congo Dem Rep 2002 Respiratory 2,000
40 Russia 1995 Earthquake 1,989

Data Source: EM-DAT, CRED Brussels (2008).
* in Mio. USD (2004 PPP).

the relationship of foreign aid and earthquake fatalities is presented. Finally, Section 5 concludes.


2 THEORETICAL BACKGROUND

2.1 Choice anomalies

Whenever natural hazard events cause losses of human lives or destroy capital, the question of obligation has to be answered. From an ex-post perspective, insufficient early warning systems and mitigation activities are usually at the centre of criticism. But are preparedness activities, which cause costs, in fact desired by society from an ex-ante point of view? In order to answer this question it is first necessary to understand how people perceive risk.

Addressing this problem, Slovic et al. (1984) ask: "How should a single accident that takes N lives be weighted relative to N accidents, each of which takes a single life?" In other words: do people weight the severity of an event, measured by the death toll, higher than the probability of events occurring in their decision process? The authors show that, in contrast to the research suggestions of the time, it is not the severity, but rather

the frequency of occurrence, which dominates the risk perception of individuals.

Based on this result, a society should undertake more risk management activities for natural hazards which are more probable, and less for the more unlikely ones. Kahneman & Tversky (1974) point out the complexity of the problem by arguing that individuals' decisions are subject to choice anomalies. This theory proposes that standard expected utility theory does not sufficiently describe and predict individual behaviour under uncertainty (Frey & Eichenberger 1989). When it comes to natural hazards, individuals do not base their decisions on calculated probabilities, but rather use inferential rules known as heuristics (Kahneman & Tversky 1974). This suggestion has been applied to the market for natural hazard insurance by Kunreuther (2000), who defined the situation as the "natural disaster syndrome", a term that "links the lack of interest by those at risk in protecting themselves against hazards and the resulting significant financial burden on society, property owners, the insurance industry and municipal, state and federal governments when severe disasters do occur." He points out that five heuristics are responsible for anomalies on the natural disaster insurance market. One main reason is connected to information biases. Individuals misperceive the risk of natural disasters because of extensive media coverage ("availability bias"), or they tend to overestimate the risk of being harmed by a natural hazard that has recently occurred. A second very typical heuristic in the area of natural hazard insurance is the common attitude: "It won't happen to me!"1. Consider, for example, a mountain farmer who has been living his whole life in an area with high avalanche risk (red zone)2, where almost every year avalanches strike next to his farm. Nevertheless, he has no incentive either to move away or to insure his farm against potential losses. The third heuristic refers to the role of emotions connected to catastrophic events. Individuals may purchase insur-

of avalanche victims occurs, changes the individual's attitude towards avalanche risks tremendously. The fifth heuristic is concerned with the ambiguity about the probability that a natural disaster might occur. These vague probabilities lead to inefficiencies on the private insurance market. A publication by Kunreuther & Pauly (2004) included this idea in a formal model of decision-making costs under imperfect information and showed that individuals still refuse to purchase natural hazard insurance even if the premiums are attractive. The authors show that the demand-side inefficiency is a problem of a) transaction costs in order to obtain information and b) ambiguity about probability estimations by different insurance companies. The search for the optimal insurance imposes costs which are high enough to discourage the individual from engaging in any further mitigation activity4. Additionally, the insurance premiums are likely to be much higher because of vagueness about the probabilities.

Returning to the initial point of interest, individuals' preferences concerning risk mitigation, the above-mentioned results imply that this decision is more complex than expected and dependent on a multitude of features. The probability of occurrence as actually perceived by individuals is likely to deviate from the true value, which implies a suboptimal level of mitigation activities.

The results presented above suggest that insufficient mitigation is a consequence of decision anomalies. In order to get a clearer picture of people's preferences for risk mitigation, Flynn et al. (1999) interview 400 inhabitants of Portland, Oregon about their assessment of earthquake risk and their willingness to support the implementation of more rigorous building codes aiming to reduce seismic risk. The results suggest that, contrary to expectations, people are well informed about earthquakes and aware of seismic risk. Nevertheless, public support is primarily aimed at public facilities. The willingness to support priva
