Series Editor
Professor Hoang Pham
Department of Industrial Engineering
Rutgers
The State University of New Jersey
96 Frelinghuysen Road
Piscataway, NJ 08854-8018
USA
Complex System
Maintenance Handbook
ISBN 978-1-84800-010-0
e-ISBN 978-1-84800-011-7
DOI 10.1007/978-1-84800-011-7
Springer Series in Reliability Engineering series ISSN 1614-7839
British Library Cataloguing in Publication Data
A Complex system maintenance handbook. - (Springer series in
reliability engineering)
1. Maintenance 2. Reliability (Engineering) 3. Maintenance
- Management
I. Murthy, D. N. P. II. Kobbacy, Khairy A. H.
620'.0046
ISBN-13: 9781848000100
Library of Congress Control Number: 2008923781
© 2008 Springer-Verlag London Limited
Watchdog Agent is a trademark of the Intelligent Maintenance Systems (IMS) Center, University of
Cincinnati, PO Box 210072, Cincinnati, OH 45221, USA. www.imscenter.net
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted
under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or
transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case
of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing
Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.
The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a
specific statement, that such names are exempt from the relevant laws and regulations and therefore free for
general use.
The publisher makes no representation, express or implied, with regard to the accuracy of the information
contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that
may be made.
Cover design: deblik, Berlin, Germany
Printed on acid-free paper
9 8 7 6 5 4 3 2 1
springer.com
To our wives
Iman and Jayashree
for their patience, understanding and support
Preface
and process large amounts of relevant data and in the tools and techniques needed to build models to determine the optimal maintenance strategies.
The aim of this book is to integrate this vast literature with different chapters
focusing on different aspects of maintenance and written by active researchers
and/or experienced practitioners with international reputations. Each chapter reviews the literature dealing with a particular aspect of maintenance (for example, methodology, approaches, technology, management, modelling analysis and optimisation), reports on the developments and trends in a particular industry sector, or deals with a case study. It is hoped that the book will help narrow the gap between theory and practice and trigger new research in maintenance.
The book is written for a wide audience. This includes practitioners from industry (maintenance engineers and managers) and researchers investigating various
aspects of maintenance. Also, it is suitable for use as a textbook for postgraduate
programs in maintenance, industrial engineering and applied mathematics.
We would like to thank the authors of the chapters for their collaboration and
prompt responses to our enquiries which enabled completion of this handbook on
time. We also wish to acknowledge the support of the University of Salford and the
award of CAMPUS Fellowship in 2006 to one of us (PM). We gratefully acknowledge the help and encouragement of the editors of Springer, Anthony Doyle and
Simon Rees. Also, our thanks to Sorina Moosdorf and the staff involved with the
production of the book.
Contents
Part A An Overview
Chapter 1: An Overview
K. Kobbacy and D. Murthy ...................................................................................... 3
Part B Evolution of Concepts and Approaches
Chapter 2: Maintenance: An Evolutionary Perspective
L. Pintelon and A. Parodi-Herz.............................................................................. 21
Chapter 3: New Technologies for Maintenance
Jay Lee and Haixia Wang....................................................................................... 49
Chapter 4: Reliability Centred Maintenance
Marvin Rausand and Jørn Vatn .............................................................................. 79
Part C Methods and Techniques
Chapter 5: Condition-based Maintenance Modelling
Wenbin Wang........................................................................................................ 111
Chapter 6: Maintenance Based on Limited Data
David F. Percy ..................................................................................................... 133
Chapter 7: Reliability Prediction and Accelerated Testing
E. A. Elsayed ........................................................................................................ 155
Part A
An Overview
1
An Overview
K.A.H. Kobbacy and D.N.P. Murthy
1.1 Introduction
The efficient functioning of modern society depends on the smooth operation of
many complex systems comprised of several pieces of equipment that provide a
variety of products and services. These include transport systems (trains, buses,
ferries, ships and aeroplanes), communication systems (television, telephone and
computer networks), utilities (water, gas and electricity networks), manufacturing
plants (to produce industrial products and consumer durables), processing plants
(to extract and process minerals and oil), hospitals (to provide services) and banks
(for financial transactions) to name a few. All equipment is unreliable in the sense
that it degrades with age and/or usage and fails when it is no longer capable of
delivering the products and services. When a complex system fails, the consequences can be dramatic. It can result in serious economic losses, affect humans
and do serious damage to the environment as, for example, the crash of an aircraft
in flight, the failure of a sewage processing plant or the collapse of a bridge.
Through proper corrective maintenance, one can restore a failed system to an
operational state by actions such as repair or replacement of the components that
failed and in turn caused the failure of the system. The occurrence of failures can
be controlled through maintenance actions, including preventive maintenance,
inspection, condition monitoring and design-out maintenance. With good design
and effective preventive maintenance actions, the likelihood of failures and their
consequences can be reduced but failures can never be totally eliminated.
The approach to maintenance has changed significantly over the last hundred years. A century ago, the focus was primarily on corrective maintenance, delegated to the maintenance section of the business to restore failed
systems to an operational state. Maintenance was carried out by trained technicians
and was viewed as an operational issue and did not play a role in the design and
operation of the system. The importance of preventive maintenance was fully
appreciated during the Second World War. Preventive maintenance involves
additional costs and is worthwhile only if the benefits exceed the costs. Deciding the optimum level of maintenance requires building appropriate models and using sophisticated optimisation techniques. Also, around this time, maintenance issues
started getting addressed at the design stage and this led to the concept of maintainability. Reliability and maintainability (R&M) became major issues in the
design and operation of systems.
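The cost-benefit trade-off behind preventive maintenance can be made concrete with the classical age-replacement model: replace a unit preventively at age T (cost c_p) or correctively on failure (cost c_f > c_p), and choose T to minimise the long-run cost per unit time. The sketch below is illustrative only; the Weibull lifetime parameters and the cost figures are assumptions, not data from this handbook.

```python
import math

# Age-replacement policy: lifetime ~ Weibull(shape=beta, scale=eta),
# survivor function R(t) = exp(-(t/eta)**beta).

def cost_rate(T, beta=2.5, eta=1000.0, c_p=1.0, c_f=5.0, steps=1000):
    """Long-run cost per unit time of an age-replacement policy with interval T."""
    R = lambda t: math.exp(-((t / eta) ** beta))
    # Expected cycle length: integral of R(t) from 0 to T (trapezoidal rule)
    h = T / steps
    mean_cycle = h * (0.5 * (R(0) + R(T)) + sum(R(i * h) for i in range(1, steps)))
    # Expected cycle cost: preventive if the unit survives to T, corrective otherwise
    mean_cost = c_p * R(T) + c_f * (1.0 - R(T))
    return mean_cost / mean_cycle

# Grid search for the cost-minimising replacement age
candidates = [50 * k for k in range(1, 60)]
T_opt = min(candidates, key=cost_rate)
```

With an increasing failure rate (beta > 1) and c_f well above c_p, the cost rate has an interior minimum, here well below the Weibull scale parameter; with beta ≤ 1 preventive replacement would never pay.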
Degradation and failure depend on the stresses on the various components of the
system. These depend on the operating conditions that are dictated by commercial
considerations. As a result, maintenance moved from a purely technical issue to a
strategic management issue with options such as outsourcing of maintenance, leasing
equipment as opposed to buying, etc. Also, advances in technologies (new materials,
new sensors for monitoring, data collection and analysis) added new dimensions
(science, technology) to maintenance. These advances will continue at an ever-increasing pace in the twenty-first century.
This handbook tries to address the various issues associated with the maintenance of complex systems. The aim is to give a snapshot of the current status and
highlight future trends. Each chapter deals with a particular aspect of maintenance
(for example, methodology, approaches, technology, management, modelling
analysis and optimisation) and reports on developments and trends in a particular
industry sector or deals with a case study. In this chapter we give an overview of
the handbook. The outline of the chapter is as follows. Section 1.2 deals with the
framework that is needed to study the maintenance of complex systems and we
discuss some of the salient issues. Section 1.3 presents the structure of the book
and gives a brief outline of the different chapters in the handbook. We conclude
with a discussion of the target audience for the handbook.
1.2.1 Stakeholders
For an asset there can be several stakeholders as indicated in Figure 1.1.
The number of parties involved would depend on the asset under consideration.
For example, in case of a rail network (used to provide a service to transport people
and goods) the customers can include the rail operators (operating the rolling
stock) and the public. The owner can be a business entity, a financial institution or
a government agency. The operator is the agency that operates the track and is
responsible for the flow of traffic. The service provider refers to the agency
carrying out the maintenance (preventive and corrective). It can be the operator (in
which case maintenance is done in-house) or some external agent (if maintenance
is outsourced) or both (when only some of the maintenance activities are outsourced). The regulator is the independent agency that deals with safety and risk issues. It defines the minimum standards for safety and can impose fines on the owner, operator and possibly the service provider should the safety levels be
compromised. Government plays a critical role in providing the subsidy and
assuming certain risks. In this case all the parties involved are affected by the
maintenance carried out on the asset. If the line is shut down frequently and/or for long durations, this can affect customer satisfaction and patronage, the returns to the operators and owners, and the costs to the government.
1.2.2 Different Perspectives
We focus our attention on the case where the asset is owned by the owner and
maintenance is outsourced. In this case, we have two parties (i) owner (of the
asset) and (ii) service agent (providing the maintenance). Figure 1.2 is a very
simplified system characterisation of the maintenance process where the maintenance activities are defined through a maintenance service contract. The problem
is to determine the terms of the service contract.
Each of the elements of Figure 1.2 involves several variables. For example, the
maintenance service contract involves the following: (i) duration of contract, (ii)
price of contract, (iii) maintenance performance requirements, (iv) incentives and
penalties, (v) dispute resolution, etc. The maintenance performance requirements
can include measures such as availability, mean time between failures and so on.
The characterisation of the owner's decision-making process can involve costs,
asset state at the end of the contract, risks (service agent not providing the level and
quality of service) and so on. The interests and goals of the owner are different
from that of the service agent.
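Performance measures of the kind written into such contracts reduce to simple arithmetic. As an illustration only (the MTBF and MTTR figures below are assumed, not drawn from any actual contract), steady-state availability is the fraction of time the asset is up:

```python
# Steady-state availability A = MTBF / (MTBF + MTTR).
# Both figures below are illustrative assumptions.

mtbf_hours = 900.0   # mean time between failures
mttr_hours = 12.0    # mean time to repair

availability = mtbf_hours / (mtbf_hours + mttr_hours)
# about 0.987, i.e. roughly 98.7 % uptime
```

A contract might then pay an incentive when measured availability exceeds an agreed target, and a penalty when it falls short.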
The study of maintenance is complicated by unknown and uncontrollable factors. These include the rate of degradation (which depends on several factors such as material properties, operating environment, etc.) and commercial factors (for example, high demand for power in the case of a power plant due to very hot weather).
1.2.3 Key Issues and the Need for Multi-disciplinary Approach
The key issues in the maintenance of an asset are shown in Figure 1.3. The asset
acquisition is influenced by business considerations and its inherent reliability is
determined by the decisions made during design. The field reliability and degradation are affected by operations (usage intensity, operating environment, operating load, etc.). Through the use of technologies, one can assess the state of the asset. The analysis of the data and models allows for optimizing the maintenance decisions (either for a given operating condition or by jointly optimizing the maintenance and operations). Once the maintenance actions have been formulated, they need to be implemented.
[Figure: three levels of maintenance decision-making. Strategic level: maintenance strategy. Tactical level: maintenance planning and scheduling. Operational level: maintenance work execution, data collection and data analysis (root cause and other factors).]
The strategic level deals with maintenance strategy. This needs to be formulated so that it is consistent and coherent with other (production, marketing,
finance, etc.) business strategies. The tactical level deals with the planning and
scheduling of maintenance. The operational level deals with the execution of the
maintenance tasks and collection of relevant data.
[Table: structure of the handbook, Parts A to F comprising Chapters 1 to 25; chapter titles are as listed in the Contents.]
that has been developed by the authors, is used to illustrate the new approach.
Several examples from railway applications are provided.
Chapter 5: Condition-based Maintenance Modelling
This chapter presents a model for supporting condition based maintenance decision
making. The chapter discusses various issues related to the subject, such as the
definition of the state of an asset, direct or indirect monitoring, relationship between observed measurements and the state of the asset, and current modelling
developments. In particular, the chapter focuses on a modelling technique used
recently in predicting the residual life via stochastic filtering. This is a key element
in modelling the decision making aspect of condition based maintenance. A few
key condition monitoring techniques are also introduced and discussed. Methods of
estimating model parameters are outlined and a numerical example based on real
data is presented.
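The filtering idea can be sketched in a few lines. The snippet below is not the chapter's model; it is a minimal linear-Gaussian (Kalman) illustration under stated assumptions: a hidden wear state grows by roughly one unit per inspection, each inspection returns a noisy reading, and the filter recovers the underlying state from the monitoring signal. All figures (drift, variances, the synthetic readings) are invented for illustration.

```python
def kalman_track(readings, drift=1.0, q=0.25, r=4.0):
    """Return filtered wear estimates, one per monitoring reading."""
    x_hat, p = 0.0, 1.0              # initial state estimate and its variance
    estimates = []
    for y in readings:
        # Predict: wear drifts upward between inspections
        x_pred, p_pred = x_hat + drift, p + q
        # Update: blend the prediction with the new noisy reading
        gain = p_pred / (p_pred + r)
        x_hat = x_pred + gain * (y - x_pred)
        p = (1.0 - gain) * p_pred
        estimates.append(x_hat)
    return estimates

# Synthetic signal: true wear is k+1 at inspection k, corrupted by alternating +/-2
readings = [float(k + 1) + (2.0 if k % 2 == 0 else -2.0) for k in range(30)]
filtered = kalman_track(readings)
```

The filtered trajectory tracks the underlying wear while damping the measurement noise; a residual-life estimate would then follow from projecting the filtered state forward to a failure threshold.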
Chapter 6: Maintenance Based on Limited Data
Reliability applications often suffer from paucity of data for making informed
maintenance decisions. This is particularly noticeable for high reliability systems
and when new production lines or new warranty schemes are planned. Such issues
are of great importance when selecting and fitting mathematical models to improve
the accuracy and utility of these decisions. This chapter investigates why reliability
data are so limited and proposes statistical methods for dealing with these
difficulties. It considers graphical and numerical summaries, appropriate methods
for model development and validation, and the powerful approach of subjective
Bayesian analysis for including expert knowledge about the application area.
Chapter 7: Reliability Prediction and Accelerated Testing
This chapter presents an overview of accelerated life testing (ALT) methods and
their use in reliability prediction at normal operating conditions. It describes the
most commonly used models and introduces new ones which are distribution
free. Design of optimum test plans in order to improve the accuracy of reliability
prediction is also presented and discussed. The chapter provides, for the first time,
the link between accelerated life testing and maintenance actions. It develops
procedures for using the ALT results for estimating the optimum preventive
maintenance schedule and the optimum degradation threshold level for degrading
systems. The procedures are demonstrated using two numerical examples.
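As a concrete example of extrapolating from stress to use conditions, one of the most commonly used ALT models is the Arrhenius life-temperature relationship, which gives an acceleration factor between a test temperature and the operating temperature. The activation energy and temperatures below are illustrative assumptions, not values from the chapter.

```python
import math

# Arrhenius model: life(T) proportional to exp(Ea / (k * T)), T in kelvin.
# The acceleration factor scales test lifetimes back to use conditions.

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def acceleration_factor(t_use_c, t_stress_c, ea_ev=0.7):
    """Acceleration factor between a stress and a use temperature (Celsius)."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

af = acceleration_factor(40.0, 125.0)   # test at 125 C, use at 40 C
use_life_hours = 500.0 * af             # 500 h at stress maps to af * 500 h in use
```

With these assumed values the factor is in the low hundreds, which is why a few weeks of testing at elevated temperature can stand in for years of field operation.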
Chapter 8: Preventive Maintenance Models for Complex Systems
Preventive maintenance (PM) of repairable systems can be very beneficial in
reducing repair and replacement costs, and in improving system availability.
Strategies for scheduling PM are often based on intuition and experience, though
considerable improvements in performance can be achieved by fitting mathematical models to observed data. For simple repairable systems comprising few components or many identical components, compound renewal processes are appropriate.
This chapter reviews basic and advanced models for complex repairable systems
and demonstrates their use for determining optimal PM intervals. Computational
difficulties are addressed and practical illustrations are presented, based on subsystems of oil platforms and
Chapter 9: Artificial Intelligence in Maintenance
AI techniques have been used successfully in the past two decades to model and
optimise maintenance problems. This chapter reviews the application of Artificial
Intelligence (AI) in maintenance management and introduces the concept of developing an intelligent maintenance optimisation system. The chapter starts with an introduction to maintenance management, planning and scheduling, and a brief definition of AI and some of its techniques that have applications in maintenance management. A review of the literature is then presented covering the applications of AI in maintenance. We have focused on five AI techniques, namely Knowledge
Based Systems, Case Based Reasoning, Genetic Algorithms, Neural Networks and
Fuzzy Logic. This review also covers hybrid systems where two or more AI
techniques are used in an application. A discussion of the development of the
prototype hybrid intelligent maintenance optimisation system (HIMOS), developed to evaluate and enhance the PM routines of complex engineering systems, then follows. The chapter ends with a discussion of future
research and concluding remarks.
Chapter 10: Maintenance of Repairable Systems
A repairable system is traditionally defined as a system which, after failing to perform one or more of its functions satisfactorily, can be restored to fully satisfactory
performance by any method other than replacement of the entire system. An
extended definition used in this chapter includes the possibility of additional maintenance actions which aim at servicing the system for better performance, referred to
as preventive maintenance (PM). The common models for the failure process of a
repairable system are renewal processes (RP) and non-homogeneous Poisson processes (NHPP). The chapter considers several generalizations and extensions of the basic models, for example the trend renewal process (TRP), which includes NHPP and RP as special cases and has the property of allowing a trend in processes of non-Poisson type.
be an unobserved heterogeneity between the systems which, if overlooked, may lead
to wrong decisions. This phenomenon is considered in the framework of the TRP
process. We then consider the extension of the basic models obtained by introducing
the possibility of PM using a competing risks approach. Finally, models for periodically inspected systems are studied, using a combination of time-continuous and
time-discrete Markov chains.
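The NHPP mentioned above is straightforward to simulate, which is often how candidate maintenance policies for repairable systems are compared numerically. The sketch below is an illustration under assumed parameters, not code from the chapter: it draws failure times of a minimally repaired, deteriorating system with a power-law (Crow/AMSAA) intensity, using Lewis-Shedler thinning.

```python
import random

# NHPP with power-law intensity w(t) = (b/a) * (t/a)**(b-1); b > 1 means the
# system deteriorates. Thinning needs only an upper bound on the intensity.

def simulate_nhpp(a=100.0, b=2.0, horizon=500.0, rng=None):
    """Return NHPP event times on (0, horizon] via Lewis-Shedler thinning."""
    rng = rng or random.Random(42)
    w = lambda t: (b / a) * (t / a) ** (b - 1)
    w_max = w(horizon)                  # intensity is increasing, so its max is at the end
    t, events = 0.0, []
    while True:
        t += rng.expovariate(w_max)     # candidate from a homogeneous PP at rate w_max
        if t > horizon:
            return events
        if rng.random() < w(t) / w_max:  # accept with probability w(t)/w_max
            events.append(t)

failures = simulate_nhpp()
# Expected number of failures over the horizon: W(500) = (500/100)**2 = 25
```

Replaying such simulated histories under different PM rules gives Monte Carlo estimates of cost and availability for each policy.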
Chapter 11: Optimal Maintenance of Multi-component Systems: A Review
This chapter gives an overview of the literature on multi-component maintenance
optimization focusing on work appearing since the 1991 survey by Cho and Parlar.
A classification scheme primarily based on the dependence between components
(stochastic, structural or economic) is introduced. Next, the papers are also classified on the basis of the planning aspect (short-term vs. long-term), the grouping of
maintenance activities (either grouping preventive or corrective maintenance, or
opportunistic grouping) and the optimization approach used (heuristic, policy
function of the real-time SCADA data input. Validation of this algorithm with
unseen condition monitoring data showed misclassification rates of machine faults
as low as 14.3%.
Part B
2
Maintenance: An Evolutionary Perspective
Liliane Pintelon and Alejandro Parodi-Herz
2.1 Introduction
Over the last decades industrial maintenance has evolved from a non-issue into a strategic concern. Perhaps few other management disciplines have undergone so many changes over the last half-century. During this period, the role of maintenance within the organization has been drastically transformed. At first maintenance was nothing more than an inevitable part of production; now it is an essential strategic element in accomplishing business objectives. Without a doubt, the maintenance function is better perceived and valued in organizations. One could say that maintenance management is no longer viewed as an underdog function; it is now considered an internal or external partner for success.
In view of fierce competition, many organizations seek to survive by producing more, with fewer resources, in shorter periods of time. To meet these pressing needs, physical assets take a central role. However, installations have
become highly automated and technologically very complex and, consequently,
maintenance management had to become more complex having to cope with
higher technical and business expectations. Now the maintenance manager is
confronted with very complicated and diverse technical installations operating in
an extremely demanding business context.
This chapter, while considering the fundamental elements of maintenance and
its environment, describes the evolution path of maintenance management and the
driving forces of such changes. In Section 2.2 the maintenance context is described
and its dynamic elements are briefly discussed. Section 2.3 explains how maintenance practices have evolved over time and distinguishes different epochs. Further, this section devotes special attention to describing a common lexicon for maintenance actions and policies, and then focuses on the evolution of maintenance concepts. Section 2.4 underlines how the role of the maintenance manager has been
reshaped as a consequence of the changes of the maintenance function. Finally, the
chapter concludes with Section 2.5 identifying the new challenges for maintenance.
[Figure: the maintenance management context. The manager balances technology, operations and logistics support within an environment shaped by society, the market, competition, technological evolution, information technology, e-business, outsourcing and total asset life cycle optimization.]
To cope with and to coordinate the complex and changing characteristics that
constitute maintenance in the first place, a management layer is imperative.
Management is about what to decide and how to decide. In the maintenance
arena, a manager juggles with technology, operations and logistics elements that
mainly need to harmonize with production. Technology refers to the physical
assets which maintenance has to support with adequate equipment and tools.
Operations indicate the combination of service maintenance interventions with
core production activities. Finally, the logistics element supports the maintenance
activities in planning, coordinating and ultimately delivering, resources like spare
parts, personnel, tools and so forth. In one way or another, all these elements are
always present, but their intensity and interrelationships will vary from one
situation to another. For example, elevator maintenance in a hospital and plant maintenance in the chemical process industries each call for a different maintenance recipe tailored to the specific needs. Clearly, the choice of the structural elements
of maintenance is not independent from the environment. Besides, other factors
like the business context, society, legislation, technological evolution, outsourcing
market, will be important. Furthermore, relatively new trends, such as the e-business context, will influence current and future maintenance management enormously. A whole new era for maintenance is expected as communication barriers
are bridged and coordination opportunities of maintenance service become more
intense.
2.2.1 Changes in the Playing Field of Maintenance
One should expect that neither maintenance management nor its environment is stationary. The constant changes in the field of maintenance are acknowledged to
have enabled new and innovative developments in the field of maintenance
science.
The technological evolution in production equipment, an ongoing evolution
that started in the twentieth century, has been tremendous. At the start of the
twentieth century, installations were barely or not mechanized, had simple design,
worked in stand-alone configurations and often had a considerable overcapacity.
Not surprisingly, nowadays installations are highly automated and technologically
very complex. Often these installations are integrated with production lines that are
right-sized in capacity.
Installations not only became more complex, they also became more critical in
terms of reliability and availability. Redundancy is only considered for very critical
components. For example, a pump in a chemical process installation can be considered very critical in terms of safety hazards. Furthermore, equipment built-in
characteristics such as modular design and standardization are considered in order
to reduce downtime during corrective or preventive maintenance. However, these principles are commonly applied predominantly in some newer, very expensive installations, such as flexible manufacturing systems (FMS). Fortunately, a move towards higher levels of standardization and modularization is beginning to be witnessed at all levels of the installations. As life cycle optimization concepts are
commendable, it becomes mandatory that at the early design stages supportability
and maintainability requirements are well thought-out.
Parallel to the technological evolution, the ever-increasing customer focus
causes even higher pressure, especially on critical installations. As customer service in terms of time, quality and choice becomes central to production decisions, more flexibility is required to cope with these varying needs. This calls for well-maintained and reliable installations capable of fulfilling shorter and more reliable lead-time estimates. Physical assets are ever more important for business
success.
Maintenance does not escape from the (r)evolution in information communication technology (ICT), which has tremendously changed business practices. However, we comment further on this topic in Section 2.3, by illustrating the impact on
the role of the maintenance manager as such.
Furthermore, new production and management principles such as Just-in-time
(JIT) philosophy, Lean principles, total quality management (TQM) and so forth,
have emerged. These production trends intend, by all means, to reduce waste and
remove non-value added transactions. It is not surprising that work-in-process
(WIP) inventories are one of the key issues for improvement. Clearly, WIP inventories incur high costs as a consequence of the capital immobilization, expensive
floor space, etc. As processes happen to be streamlined, WIP inventories are no
longer a buffer for problems; accordingly, asset availability and reliability are ever
more imperative. Although these principles were initially conceived for production and manufacturing environments, they are currently also applied and translated into service contexts.
Above all, the business environment has also changed. Competition has
become fierce and worldwide due to the globalization. The latter not only implies
that competitors are located all over the world, but also that decisions to move
production or service activities from a non-efficient site (e.g. due to high operations and maintenance costs) to another site are quickly taken, even if the other
location belongs to another continent. Obviously, with the advent of globalization
and intense competitive pressures, organizations are looking for every possible
source of competitive advantage. This implies that the nature of business environment has become more complex and dynamic requiring different competitive
strategies. Many companies are critically evaluating their value chain and often
decide to drastically reorganize it. This results in focusing on the core business.
Consequently outsourcing of some non-core business activities and the creation of
new partnerships and alliances are being considered by many organizations.
Not surprisingly, maintenance as a support function is no exception when it comes to outsourcing. Yet, it may not be so simple. Outsourcing maintenance of technical
systems can become a sensitive issue if it is not handled with diligence. Technical
systems are unique and situation specific. For example, outsourcing maintenance
of utilities or elevators can be relatively straightforward, but when it comes to
production floor equipment it can be a strategic issue that has to be handled with
extreme care. These circumstances suggest that outsourcing needs to be considered at the operational, tactical and strategic levels; see Figure 2.2.
The simplest, and also the most common, form of outsourcing is operational
outsourcing. At this level, a specific task is outsourced and the relationship
between supplier and customer is strictly limited to a sell-buy situation. The impact
on the internal organization of the customer is also limited. As outsourcing moves
up in the organizational pyramid the relationship between supplier and customer
changes and tactical outsourcing may be required. At this level of outsourcing the
customer shares management responsibility with the supplier and a simple kind of
partnership is established. The impact on the internal organization is also greater.
Finally, moving towards the organization's top and for more critical maintenance services, a new form of outsourcing is created, the so-called strategic outsourcing. This type of outsourcing is also labelled transformational outsourcing.
[Figure 2.2: levels of outsourcing. Operational: generic services to carry out and specialised services to organise, in a plain supplier-customer relation. Tactical: partnership covering service packages (e.g. MRO, utilities, facilities) and projects (e.g. renovation, shutdown) to manage. Strategic: transformational, full-service outsourcing to think with (e.g. outsourcing of all maintenance, BOT, ...).]
Societal expectations concerning technology are also creating boundary conditions for maintenance management. The attention paid to sustainability (3P: people,
profit, planet) is a clear sign of this. Legislation is getting more and more stringent.
This is especially important here because of its impact on occupational safety and
environmental standards.
Note that most of the above-mentioned trends for industrial installations can be
easily translated to the service sector. Think, for example, of automated warehouses in distribution centres, hospital equipment or building utilities.
[Figure: the changing perception of maintenance by decade, from a technical matter (1950s-1960s) to a profit contributor (1970s-1980s) to a cooperative partnership (1990s-2000s).]
The fact that maintenance has become more critical implies that a thorough
insight into the impact of maintenance interventions, or the omission of these, is
indispensable. In essence, good maintenance stands for the right allocation of resources (personnel, spares and tools) to guarantee, by deciding on a suitable combination of maintenance actions, higher reliability and availability of the installations.
Furthermore, good maintenance foresees and avoids the consequences of the
failures, which are far more important than the failures as such. Bad or no maintenance can appear to render some savings in the short run, but sooner or later it
will be more costly due to additional unexpected failures, longer repair times,
accelerated wear, etc. Moreover, bad or no maintenance may well have a significant impact on customer service as delivery promises may become difficult to
fulfil. Hence, a well-conceived maintenance program is mandatory to attain business, environmental and safety requirements.
Whatever the particular circumstances, if one intends to compile or judge any maintenance programme, some elementary maintenance terms need to be unambiguous and handled with consistency. Yet, both in practice and in the literature a lot of confusion exists. For example, what for some is a maintenance policy others refer to as a maintenance action; what some consider preventive maintenance others will refer to as predetermined or scheduled maintenance. Furthermore, some argue that certain concepts can almost be considered strategies or philosophies, and so on. Certainly there is a lot of confusion, which perhaps is one of the defining characteristics of such a dynamic and young management science. Discussions about the precise meaning of some maintenance terms can almost be taken as philosophical arguments. However, the adoption of a rather simplistic, but truly germane, classification is essential. Not intending to disregard preceding terminologies, nor to impose or dictate a norm, we draw attention, in particular, to three of those confusing terms: maintenance action, maintenance policy and maintenance concept. In the remainder of this chapter the following terminology is adopted.
Maintenance Action. Basic maintenance intervention, elementary task carried out by a technician (what to do?)
Maintenance Policy. Rule or set of rules describing the triggering mechanism for the different maintenance actions (how is it triggered?)
Maintenance Concept. Set of maintenance policies and actions of various types and the general decision structure in which these are planned and supported (which logic and maintenance recipe is used?)
2.3.1 Maintenance Actions
Basically, as depicted in Figure 2.4, maintenance actions or interventions can be of
two types. They are either corrective maintenance (CM) or precautionary maintenance (PM) actions.
2.3.1.1 Corrective Maintenance Actions (CM)
CM actions are repair or restore actions following a breakdown or loss of function. These actions are reactive in nature; the attitude is essentially "wait until it breaks, then fix it!". Corrective actions are difficult to predict, as equipment failure behaviour is stochastic and breakdowns are unforeseen. Maintenance actions such as the replacement of a failed light bulb, the repair of a ruptured pipeline and the repair of a stalled motor are some examples of corrective actions.
2.3.1.2 Precautionary Maintenance Actions (PM)
PM actions can be preventive, predictive, proactive or passive in nature. These types of actions are moderately more complex than the former; to describe each of them fully, a book could be written on its own. Nonetheless, the fundamental ideas aim at diminishing the failure probability of the physical asset and/or anticipating, or avoiding if possible, the consequences if a failure occurs. Some PM actions (preventive and predictive) are somewhat easier to plan, because they can rely on fixed time schedules or on predictions of stochastic behaviour. However, other types of PM actions become ongoing tasks, originating from the attitude concerning maintenance; somehow they become part of the tacit knowledge of the organization. Some precise examples of precautionary actions are lubrication, bi-monthly bearing replacements, inspection rounds, vibration monitoring, oil analysis, design adjustments, etc. All these tasks are considered to be precautionary maintenance actions; however, the underlying principles may differ.
[Figure 2.4: maintenance actions, policies and concepts. Actions: corrective (reactive) and precautionary (preventive, predictive, proactive and passive). Policies: FBM (reactive), T/UBM, CBM, OBM, DOM. Concepts: ad hoc reactive, Q&D, LCC, TPM, RCM, BCM, optimizing an existing concept, and customized concepts such as CIBOCOF.]
It was gradually realized that preventive actions could avoid some of the breakdowns and would lead to cost
savings in the long run. The main concern was how to determine, based on
historical data, the adequate period to perform preventive maintenance. Certainly,
not enough was known about failure patterns, which, among other reasons, have
led to a whole separate branch of engineering and statistics: reliability engineering.
In the late 1970s and early 1980s, equipment in general became more complex. As a result, the superposition of the failure patterns of individual components started to alter the failure characteristics of formerly simpler equipment. Hence, if there is no dominant age-related failure mode, preventive maintenance actions are of limited use in improving the reliability of complex items. At this point, the effectiveness of applying preventive maintenance actions started to be questioned and was considered more carefully. A common concern about over-maintaining grew rapidly. Moreover, as the entrenched belief in the benefits of preventive maintenance was put at risk, new precautionary (predictive) maintenance techniques emerged. This meant a gradual, though not complete, switch to predictive (inspection and condition-based) maintenance actions. Naturally, predictive maintenance was, and still is, limited to those applications where it is both technically feasible and economically interesting. Supportive to this trend was the fact that condition-monitoring equipment became more accessible and cheaper. Prior to that time, these techniques were reserved for high-risk applications such as airplanes or nuclear power plants.
In the late 1980s and early 1990s a new milestone in maintenance history occurred with the emergence of concurrent engineering or life cycle engineering. Here maintenance requirements were already under consideration at earlier product stages such as design or commissioning. As a result, instead of having to deal with built-in characteristics, maintenance became active in setting design requirements for installations and became partly involved in equipment selection and development. All this led to a different type of precautionary (proactive) maintenance, the underlying principle of which is to act proactively at earlier product stages in order to avoid later consequences. Furthermore, as the maintenance function became better appreciated within the organization, more attention was paid to additional proactive maintenance actions. For example, as operators are in direct and regular contact with the installations, they can intuitively identify right or wrong working conditions of the equipment. Conditions such as noise, smell, rattle, vibration, etc., which at a given point are not really measured, represent tacit knowledge of the organization to foresee, prevent or avoid failures and their consequences in a proactive manner. These actions are typically not performed by maintenance people themselves, but are certainly part of the structural evolution of maintenance as a formal or informal partner within the organization.
The last type of precautionary (passive) maintenance actions is driven by the opportunity created when other maintenance actions are planned. These maintenance actions are precautionary since they occur prior to a failure, but passive as they wait to be scheduled depending on other, probably more critical, actions. Passive actions are in principle of low priority for the maintenance staff as, at a given moment in time, they may not really pose a threat of functional or safety failures. However, these actions can save significant maintenance resources as they may reduce the number of maintenance interventions, especially when the set-up cost of maintenance is high. For example, when maintenance actions are planned or need to be carried out on offshore oil platforms or on windmills in remote locations, getting to the equipment can be costly. Therefore, selecting the best combination of maintenance actions at that point in time is mandatory. This may invoke replacing components with significant residual life that in different circumstances would not be replaced.
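This opportunistic trade-off can be sketched as a simple cost comparison. The function below is a hypothetical illustration (none of the names or numbers come from the chapter): once a costly visit is already planned, a component joins it if the set-up cost of a later dedicated visit outweighs the residual-life value thrown away.

```python
# Hypothetical sketch of a passive (opportunity-driven) replacement rule:
# once a costly visit (e.g. to an offshore platform) is already planned,
# a component is replaced early if the setup cost saved by piggybacking
# outweighs the value of the residual life sacrificed.

def replace_opportunistically(residual_life_yrs, part_cost, setup_cost,
                              expected_life_yrs):
    """Return True if the component should join the planned visit."""
    # Value sacrificed: the unused fraction of the part's expected life.
    sacrificed_value = part_cost * residual_life_yrs / expected_life_yrs
    # Benefit: a dedicated future visit, and its setup cost, is avoided.
    return setup_cost > sacrificed_value

# A bearing with 1 year of an expected 5-year life left, costing 400,
# is worth replacing now if a dedicated visit would cost 2000 to set up.
print(replace_opportunistically(1.0, 400.0, 2000.0, 5.0))  # True
```

The same comparison run for a nearly new part, or for a cheap-to-reach installation, returns False, which matches the observation above that such replacements only pay off where the set-up cost of maintenance is high.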
2.3.2 Maintenance Policies
As new maintenance techniques become available and the economic implications of maintenance actions are better understood, a direct impact on the maintenance policies is to be expected. Several types of maintenance policies can be considered to trigger, in one way or another, either precautionary or corrective maintenance interventions. As described in Table 2.1, those policies are mainly failure-based maintenance (FBM), time-based/use-based maintenance (TBM/UBM), condition-based maintenance (CBM), opportunity-based maintenance (OBM), design-out maintenance (DOM), and e-maintenance.
Table 2.1. Generic maintenance policies
FBM. Maintenance is carried out only after a breakdown or evident loss of function; i.e. only CM actions are triggered.

TBM/UBM. PM is carried out after a specified amount of time (e.g. 1 month, 1000 working hours, etc.); CM is applied when necessary. UBM assumes that the failure behaviour is predictable and of the IFR type. PM is assumed to be cheaper than CM.

CBM. PM is carried out each time the value of a given system parameter (condition) exceeds a predetermined value. PM is assumed to be cheaper than CM. CBM is gaining popularity as the underlying techniques (e.g. vibration analysis, oil spectrometry, ...) become more widely available and at better prices. The traditional plant inspection rounds with a checklist are in fact a primitive type of CBM.

OBM. For some components one often waits to maintain them until the opportunity arises when repairing some other, more critical, components. Whether or not OBM is suited for a given component depends on the expectation of its residual life, which in turn depends on utilization.

DOM. The focus of DOM is to improve the design in order to make maintenance easier (or even eliminate it). Ergonomic and technical (reliability) aspects are important here.
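The CBM trigger rule described in Table 2.1 can be sketched in a few lines; the vibration threshold and the readings below are illustrative assumptions only, not values from the chapter.

```python
# Minimal sketch of a CBM trigger: a PM action is raised as soon as a
# monitored condition parameter (here, an overall vibration level)
# exceeds a predetermined limit. Threshold and readings are illustrative.

VIBRATION_LIMIT_MM_S = 7.1  # hypothetical alarm level, mm/s RMS

def cbm_check(readings, limit=VIBRATION_LIMIT_MM_S):
    """Return the index of the first reading that triggers PM, else None."""
    for i, value in enumerate(readings):
        if value > limit:
            return i  # trigger a PM intervention at this measurement round
    return None

weekly_vibration = [2.8, 3.1, 4.0, 5.2, 7.4, 9.0]
print(cbm_check(weekly_vibration))  # PM triggered at round 4
```

A plant inspection round with a checklist is the same logic with yes/no observations in place of the numeric readings.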
For the more common maintenance policies many models have been developed to support tuning and optimization of the policy settings. It is not our intention to explain the fundamental differences between these models, but rather to provide an overview of the types of policies available and why these have been developed. Much has to do with the discussion in the previous section regarding the evolution of maintenance actions. Therefore, it is clear that policy setting and the understanding of its efficiency and effectiveness continue to be fine-tuned, as in any other management science. We encourage the reader particularly interested in the underlying principles and types of models to review McCall (1965), Geraerds (1972), Valdez-Flores and Feldman (1989), Cho and Parlar (1991), Pintelon and Gelders (1992), Dekker (1996), Dekker and Scarf (1998) and Wang (2002) for a full overview of the state-of-the-art literature.
The whole evolution of maintenance was based not solely on technical but rather on techno-economic considerations. FBM is still applied provided the cost of PM is equal to or higher than the cost of CM. FBM is also typically handy in the case of random failure behaviour with a constant failure rate, as TBM or UBM are then not able to reduce the failure probability. In some such cases, if there exists a measurable condition that can signal the probability of a failure, CBM can still be feasible. Finally, an FBM policy is also applied for installations where frequent PM is impracticable and expensive, as can be the case for the maintenance of glass ovens. Either TBM or UBM is applied if the CM cost is higher than the PM cost, or if it is necessary because of criticality due to bottleneck installations or safety hazards. Also in the case of increasing failure behaviour, for example wear-out phenomena, TBM and UBM policies are appropriate.
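This techno-economic argument, namely that preventive replacement pays off only when CM is dearer than PM and the failure rate increases, can be illustrated with a small renewal simulation. All parameter values below (Weibull shape and scale, cost ratio, horizon) are assumptions made for the sake of the example.

```python
import random

# Illustrative comparison of FBM (run to failure) with TBM (replace at
# age T) for a wearing-out Weibull lifetime (shape > 1). All parameter
# values are assumptions, not data from the chapter.

def avg_cost_rate(T, shape=3.0, scale=10.0, c_pm=1.0, c_cm=5.0,
                  horizon=200_000.0, seed=7):
    """Long-run maintenance cost per unit time under age replacement at T;
    T = infinity reduces to pure FBM."""
    rng = random.Random(seed)
    t = cost = 0.0
    while t < horizon:
        life = rng.weibullvariate(scale, shape)   # next time to failure
        if life < T:
            t += life; cost += c_cm               # breakdown -> corrective action
        else:
            t += T; cost += c_pm                  # item survives -> planned PM
    return cost / t

fbm = avg_cost_rate(float("inf"))
tbm = avg_cost_rate(5.0)
print(f"FBM: {fbm:.3f} per time unit, TBM(T=5): {tbm:.3f} per time unit")
```

With a shape parameter above 1 (wear-out) and CM five times dearer than PM, the age-replacement policy beats run-to-failure; setting the shape to 1 (constant failure rate) removes the advantage, as argued above.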
Typically, CBM was mainly applied in those situations where the investment in condition-monitoring equipment was justified by high risks, as in aviation or nuclear power generation. Currently, CBM is beginning to be generally accepted for maintaining all types of installations, and it is increasingly becoming common practice in the process industries. In some cases, however, technical feasibility is still a hurdle to overcome. Another reason CBM catches the attention of practitioners is the potential savings in spare parts replacements thanks to accurate and timely demand forecasts. In turn, this may enable better spare parts management through coordinated logistics support.
Finding and applying a suitable CBM technique is not always easy. For example, the analysis of the output of some measurement equipment, such as advanced vibration-monitoring equipment, requires a lot of experience and is often work for experts. There are also simpler techniques, such as infrared measurement and oil analysis, suitable in other contexts. At the other extreme, predictive techniques can be rather simple, as is the case with checklists. Although a fairly low-level activity, these checklists, together with human senses (visual inspections, detection of strange noises in rotating equipment, etc.), can detect a lot of potential problems and initiate PM actions before the situation deteriorates into a breakdown.
At present FBM, TBM, UBM and CBM take the physical assets they are intended to maintain as a given. In contrast, there are more proactive maintenance actions and policies which, instead of considering the systems as a given, look at the possible changes or safety measures needed to avoid maintenance in the first place. This proactive policy is referred to as DOM. This policy implies that maintenance is proactively involved at earlier stages of the product life cycle to solve potential maintenance-related problems. Ideally, DOM policies intend to avoid maintenance completely throughout the operating life of installations, though this may not be realistic. This leads one to consider a diverse set of maintenance requirements at the design stage.
Table 2.2. Overview of maintenance concepts

Ad hoc (1st generation). Strengths: simple. Weaknesses: ad hoc decisions.
Q&D (1st–2nd generation). Strengths: consistent, allows for priorities. Weaknesses: rough questions and answers.
LCC (2nd generation). Strengths: sound basic philosophy. Weaknesses: resource and data intensive.
TPM (2nd generation). Strengths: considers human/technical aspects, fits in the kaizen approach, extensive tool box. Weaknesses: time-consuming implementation.
RCM (2nd generation). Strengths: powerful approach, step-by-step procedure. Weaknesses: resource intensive.
RCM-based (2nd–3rd generation): approaches focused on remediating some of the perceived RCM shortcomings. Strengths: improved performance through, e.g., the use of sound statistical analysis. Weaknesses: sometimes an oversimplification.
Customized (3rd generation). Strengths: exploiting the company's strengths and considering the specific business context. Weaknesses: ensuring consistency and quality in the developed concept.
All these concepts, like many others, enjoy several advantages and suffer from specific shortcomings. Correspondingly, new maintenance concepts are developed, old ones are updated and methodologies to design customized maintenance concepts are created. These concepts enjoy a lot of interest in their original form and also give rise to many derived concepts, for example streamlined RCM derived from RCM. One may consider that customized maintenance concepts constitute the third generation of this evolution. They have emerged essentially because it is very difficult to claim a one-size-fits-all concept in the complex and still constantly changing world of maintenance. They are inspired by the former concepts while trying to avoid previously experienced drawbacks. One way or another, customized maintenance concepts mainly consist of cherry-picking useful techniques and ideas applied in other maintenance concepts. This important, but relatively new, type of concept is expected to grow in importance both in practice and in academia. Concepts that belong to this generation are, for example, value driven maintenance (VDM) and CIBOCOF, which was developed at the Centre of
Industrial Management (CIB), K.U. Leuven, Belgium. Additionally, in-house maintenance concepts, mostly developed in organizations with fairly high maintenance maturity, also belong to this category. One example is a petrochemical company that developed a customised concept basically following the RCM logic. However, by extending the RCM analysis steps and introducing risk-based inspections (RBI), a more focused and better-conceived maintenance plan could be developed. Moreover, the company borrowed some elements from TPM and incorporated these in its maintenance concept; for example, multi-skilled training programmes were implemented and special tool kits were designed for a number of maintenance jobs using TPM principles.
Even before the third generation of maintenance concepts started, the need for such concepts was perceived. In the literature, an intermediate step bridging the second and third generations is recognized, in which maintenance concepts such as business-centred maintenance (BCM) and risk-based centred maintenance (RBCM) were developed. These concepts are largely RCM-related and still widely applied in many organizations. However, a slow but steady movement towards more customized maintenance concepts is expected in the near future, as the maintenance function matures.
Next, a straightforward description on the most important concepts is presented
and important references are provided for the interested reader.
2.3.3.1 Quick & Dirty Decision Charts (Q&D)
A Q&D decision chart is a decision diagram with questions on several aspects, including failure patterns, repair behaviour of the equipment, business context, maintenance capabilities, cost structure, etc. Answering the questions for a given installation, the user proceeds through the branches of the diagram. The process stops with the recommendation of the most appropriate policy for the specific installation. The Q&D approach allows for a relatively quick determination of the most advantageous maintenance policy and ensures consistent decision making for all installations. Although some Q&D decision charts are available in the literature (e.g. Pintelon et al. 2000), most companies adopting this approach prefer to draw up their own charts, which incorporate their experience and knowledge in the decision process. This can be done in several ways: for instance, by defining specific questions, adding or deleting maintenance policies, establishing a preferred sequence in which the different policies should be considered, etc. This approach, however, has the drawback of being rough ("dirty"). The questions are usually put in a basic yes/no format, limiting the answering possibilities. Moreover, answering the questions is usually done on a subjective basis; for example, the question whether a given action or policy is feasible is answered based on experience rather than on a sound feasibility study.
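As an illustration, a miniature Q&D chart can be coded as a cascade of yes/no questions. The questions, their order and the recommended policies below are invented for the example; a real chart would be company-specific, as noted above.

```python
# Sketch of a Quick & Dirty decision chart as nested yes/no questions.
# Questions, ordering and recommendations are purely illustrative;
# real charts are drawn up per company (cf. Pintelon et al. 2000).

def qd_chart(safety_critical, condition_measurable, wear_out, cm_cheap):
    """Walk the chart and return the recommended maintenance policy."""
    if safety_critical:
        # Failures must be pre-empted: monitor if we can, else fixed intervals.
        return "CBM" if condition_measurable else "TBM/UBM"
    if cm_cheap:                      # failure consequences are benign
        return "FBM"
    if condition_measurable:
        return "CBM"
    return "TBM/UBM" if wear_out else "FBM"

# A non-critical pump with random failures and cheap repairs:
print(qd_chart(safety_critical=False, condition_measurable=False,
               wear_out=False, cm_cheap=True))   # FBM
```

The yes/no format also makes the chart's main weakness visible: each branch point compresses what should be a feasibility study into a single subjective boolean.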
2.3.3.2 Life Cycle Costing (LCC) Approaches
LCC originated in the late 1960s and is now enjoying a revival. The basic principle of LCC is sometimes summarised as "it is unwise to pay too much, but it is foolish to spend too little". This refers to the two main underlying ideas of LCC. The first concerns the cost iceberg structure presented by Blanchard (1992).
[Figure: the six big losses and the resulting time structure. Total time minus planning losses (planning delays, planned maintenance) gives loading time; minus downtime losses (failures, set-up and adjustment) gives operating time; minus speed losses (stoppages, reduced speed) gives net operating time; minus quality losses (process defects, reduced yields) gives valuable operating time.]
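The six big losses combine into the familiar overall equipment effectiveness (OEE) measure, availability times performance times quality. The sketch below is a minimal illustration with hypothetical times, not figures from the chapter.

```python
# The loss waterfall condensed into the usual OEE figure:
# availability x performance x quality. All times are hypothetical.

def oee(loading, downtime_losses, speed_losses, quality_losses):
    operating = loading - downtime_losses          # after failures/set-ups
    net_operating = operating - speed_losses       # after stoppages/slowdowns
    valuable = net_operating - quality_losses      # after defects/reduced yield
    availability = operating / loading
    performance = net_operating / operating
    quality = valuable / net_operating
    return availability * performance * quality    # equals valuable / loading

# 480 min loading time; 60 min breakdowns and set-ups; 45 min speed
# losses; 15 min of output lost to defects:
print(f"OEE = {oee(480, 60, 45, 15):.1%}")  # 75.0%
```

Note that the product of the three ratios telescopes to valuable operating time divided by loading time, which is exactly what the waterfall expresses.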
Well known are the books by Nowlan and Heap (1978), Anderson and Neri (1990) and Moubray (1997), which contributed to the adoption of RCM by industry.
Note that today many versions of RCM are around, streamlined RCM being one of the more popular ones. However, the Society of Automotive Engineers (SAE) holds the RCM definition that is generally accepted. SAE puts forward the following basic questions to be answered by any RCM implementation; if any of them is omitted, the method is incorrectly being referred to as RCM. To answer these seven questions a clear step-by-step procedure exists, and decision charts and forms are available:
1. What are the functions and associated performance standards of the asset in its present operating context?
2. How can it fail to fulfil its functions? (functional failures)
3. What causes each failure? (failure modes)
4. What happens when each failure occurs? (failure effects)
5. In what way does each failure matter? (failure consequences)
6. What should be done to predict or prevent each failure? (proactive tasks and task intervals)
7. What should be done if a suitable proactive task cannot be found? (default actions)
RCM is undeniably a valuable maintenance concept. It takes into account system functionality, and not just the equipment itself. The focus is on reliability; safety and environmental integrity are considered to be more important than cost. Applying RCM helps to increase the asset's lifetime and to establish more efficient and effective maintenance. Its structured approach fits in the knowledge management philosophy: reduced human error, more and better historical data and analysis, exploitation of expert knowledge and so forth.
RCM is popular and many RCM implementations have been started during the last decade. Although RCM offers many benefits, there are also drawbacks. From the conceptual point of view there are some weak points: for instance, the original RCM does not offer a task-packaging feature and thus does not automatically deliver a workable maintenance plan, and the standard decision charts and forms offered are helpful but far from perfect. A serious remark, mainly from the academic side, concerns the scientific basis of RCM: the FMEA analysis, which is the heart of the RCM analysis, is often done on a rather ad hoc basis. Often the available statistical data are insufficient or inaccurate, there is a lack of insight into the equipment degradation process (failure mechanisms), and the physical environment (e.g. corrosive or dusty surroundings) is ignored. The balance between valuable experience and equally valuable, objective statistical evidence is often absent. Many companies call in the (expensive) help of consultants to implement RCM; some of these consultants, however, are not capable of offering the help wanted, and this, in combination with the lack of in-house experience with RCM, discredits the methodology. RCM is in fact an ongoing process, which often causes reluctance to engage in an RCM project. RCM is undoubtedly a very resource-consuming process, which also makes it difficult to apply RCM to all equipment.
To achieve this objective the traditional RCM should be enhanced. Coetzee proposes a new RCM concept blending techniques from different RCM-related authors. He also puts forward some innovations, like the funnelling approach, to ensure that RCM efforts are concentrated on the most important failure modes in the organization.
Finally, there is a vast range of so-called streamlined RCM concepts, which claim to be derivations of RCM. It is mainly consultants who promote streamlined RCM as the solution to the resource-consuming character of RCM. Although streamlining sounds attractive, it should be carefully applied in order to keep the RCM benefits. Different streamlining approaches exist; however, very few are acceptable as formal RCM methodologies. Based on Pintelon and Van Puyvelde (2006), Table 2.3 provides a picture of popular streamlined RCM approaches.
Table 2.3. Classification of streamlined RCM concepts
[The table characterises five streamlined RCM approaches — the retro-active, generic, skipping, criticality and troublemaker approaches — listing for each its characteristics, pitfalls and an example.]
Value driven maintenance (VDM) builds on best maintenance practices and concepts such as TPM, RCM and RBI. It shows
where the added-value of maintenance lies and how an organisation can be best
structured to realise this value. One of the main contributions of VDM is that it
offers a common language to management and maintenance to discuss maintenance
matters. VDM identifies four value drivers in maintenance and provides concepts to
manage by those drivers. For all four value drivers, maintenance can help to increase a company's economic value. VDM makes a link between value drivers and
core competences. For each of the core competences, some managerial concepts are
provided.
Most recently, Waeyenbergh (2005) presented CIBOCOF as a framework to develop customised maintenance concepts. CIBOCOF starts out from the idea
that although all maintenance concepts available from the literature contain
interesting ideas, none of them is suitable for implementation without further
customization. Companies have their own priorities in implementing a maintenance
concept and are likely to go for cherry picking from existing concepts. CIBOCOF
offers a framework to do this in an integrated and structured way. Figure 2.6
illustrates the steps that this concept structurally goes through. A particularly
interesting step is step 5, maintenance policy optimization, where a decision chart is
offered to determine which mathematical decision model can be used to optimize
the chosen policy (step 4). This decision chart guides the user through the vast
literature on the topic.
[Figure 2.6: the CIBOCOF framework — M1 start-up, M2 technical analysis, M3 policy decision making, M4 implementation & evaluation and M5 continuous improvement, resulting in the maintenance plan.]
Nowadays, the decisions expected from the maintenance manager are complex and can sometimes have far-reaching consequences. He/she is (partly) responsible for operational, tactical and strategic aspects of the company's maintenance management. This involves the final responsibility for operational decisions, like the planning of maintenance jobs, and tactical decisions concerning the long-term maintenance policy to be adopted. More recently, maintenance managers are also consulted in strategic decisions, e.g. purchases of new installations, design choices, personnel policy, etc.
The career path of today's maintenance manager starts out from a rather technical content, but evolves over time into more financial and strategic responsibilities. This career path can be horizontal or vertical. It is also important that the maintenance manager is a good communicator and people manager, as maintenance remains a labor-intensive function. The maintenance manager needs to be able to attract and retain highly skilled technicians. Ongoing training for technicians is needed to keep track of the rapidly evolving technology. Motivation of maintenance technicians often requires special attention: job autonomy in maintenance is greater than in production, instructions may be vague, immediate assessment of the quality of work is mostly not possible, complaints are heard more often than compliments, etc. Aspects like safety and ergonomics are an indispensable element of current maintenance management. Besides people, materials are another important resource for maintenance work. Maintenance material logistics mainly concerns spare parts management and finding the optimum trade-off between high spare parts availability and the corresponding stock investments.
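That trade-off can be sketched with a simple base-stock model: choose the smallest stock level whose fill rate during the resupply lead time meets a target, then report the capital tied up. The demand rate, lead time, service target and unit cost below are all assumptions for the sake of illustration.

```python
import math

# Sketch of the spare-parts trade-off: pick the smallest stock level S
# whose Poisson fill rate over the resupply lead time meets a target,
# then report the tied-up capital. All parameter values are assumed.

def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson(mu) demand."""
    return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))

def base_stock(demand_rate, lead_time, target_fill, unit_cost):
    mu = demand_rate * lead_time          # mean demand during resupply
    s = 0
    while poisson_cdf(s, mu) < target_fill:
        s += 1
    return s, s * unit_cost               # stock level and stock investment

# Hypothetical part: 4 demands/year, half-year lead time, 95% target:
level, investment = base_stock(demand_rate=4.0, lead_time=0.5,
                               target_fill=0.95, unit_cost=1200.0)
print(level, investment)
```

Raising the service target from 95% to 99% in this toy example adds another unit of stock; the model makes the cost of the last few points of availability explicit, which is exactly the trade-off the maintenance manager has to defend.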
The above-described evolution in maintenance management creates a sharp need for decision support techniques of various natures: statistical analysis tools for predicting the failure behaviour of equipment, decision schemes for determining the right maintenance concept, mathematical models to optimize the maintenance policy parameters (e.g. PM frequency), decision criteria concerning e-maintenance, decision aids for outsourcing decisions, etc. Table 2.4 illustrates the use of some decision support techniques for maintenance management. These techniques are available and have proven their usefulness for maintenance, but they are not yet widely adopted.
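For instance, the PM frequency of a TBM policy is classically tuned with the age-replacement cost-rate model: minimize g(T) = (c_pm*R(T) + c_cm*F(T)) / (integral of R(t) over [0, T]), where R is the survival function and F = 1 - R. The sketch below evaluates this numerically for an assumed Weibull lifetime; every parameter value is illustrative.

```python
import math

# Classical age-replacement model: choose the PM interval T minimizing
# the expected cost per unit time, for a Weibull lifetime with wear-out
# (shape > 1). All parameter values are assumptions for the example.

def cost_rate(T, shape=3.0, scale=10.0, c_pm=1.0, c_cm=5.0, steps=1000):
    """Expected maintenance cost per unit time of age replacement at T."""
    R = lambda t: math.exp(-((t / scale) ** shape))      # survival function
    dt = T / steps
    expected_cycle = sum(R(i * dt) * dt for i in range(steps))  # ~ integral of R
    return (c_pm * R(T) + c_cm * (1.0 - R(T))) / expected_cycle

# Grid search for the cost-minimizing PM interval T*.
best_T = min((t / 10.0 for t in range(1, 301)), key=cost_rate)
print(f"T* ~ {best_T:.1f}, cost rate {cost_rate(best_T):.3f}")
```

The same one-line grid search makes the sensitivity analysis trivial: rerunning it with a higher CM/PM cost ratio shortens the optimal interval, which is the techno-economic logic discussed earlier in this chapter.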
In the 1960s most maintenance publications were very mathematically oriented and mainly focussed on reliability. The publications of the 1970s and early 1980s were more focused on maintenance policy optimization, such as the determination of the optimum preventive maintenance interval, the planning of group replacements and inspection modelling. This was a step forward, although these models were still often focussed on mathematical tractability rather than on realistic assumptions and hypotheses. This caused an unfortunate gap between academics and practitioners. The former had the impression that industry and the service sector were not ready for their work, while the latter felt frustrated because the models were too theoretical. Fortunately, this is changing. Academics pay more attention to the real-life background of their subject and practitioners discover the usefulness of the academic work. Moreover, academic work is getting broader and offers a more diverse range of models and concepts, such as maintenance strategy design models, e-maintenance concepts and service parts supply policies, besides the more traditional maintenance optimization models. With the introduction of maintenance software, the data required for these models can be collected more easily. There still is a big gap between practitioners and academics, but it is slowly closing.
Table 2.4. OR/OM techniques and their application in maintenance

[The table lists OR/OM techniques — statistics, reliability theory, Markov theory, renewal theory, mathematical programming, decision theory, queueing theory, simulation, inventory control, scheduling and rostering, project planning and MCDM — together with their maintenance applications.]
MRO = maintenance, repair and operating supplies; FMI = fast moving items; NMI = normal moving items; SMI = slow moving items; VSMI = very slow moving items; MCDM = multi-criteria decision making; OR/OM = operations research/operations management.
The help from information technology (IT) is of special interest when discussing decision support for maintenance managers. Computerized maintenance
management systems (CMMS), also called computer aided maintenance management (CAMM), maintenance management information systems (MMIS) or even
enterprise asset management systems (EAM), nowadays offer substantial support
for the maintenance manager. These systems too have evolved over time (Table
2.5). IT of course also supports the e-maintenance applications and offers splendid
opportunities for knowledge management implementations. At the beginning of the
knowledge management hype, knowledge management was mainly aimed at fields
like R&D, innovation management, etc. Later on the potential benefits of
knowledge management were also recognized for most business functions. For
maintenance management, a knowledge management programme helps to capture
the implicit knowledge and expertise of maintenance workers and secure this
information in information systems, thus making it accessible to other technicians. The benefits of this in terms of consistency in the problem-solving approach and knowledge retention are obvious. Other knowledge management applications include, for example, expert systems assisting in the diagnosis of complex equipment
[Table 2.5 sketches the evolution of CMMS over three generations — the 1st generation (1970s), the 2nd generation (1980s–1990s) and the 3rd generation (1990s onwards) — describing their characteristics and their integration with business IT systems.]
Maintenance has thus grown into a full-fledged business function and an area of intensive academic research. Efforts are aimed at advancing towards world-class maintenance and providing methodologies to do so. Pintelon et al. (2006) describe several maintenance maturity levels required to achieve world-class maintenance; these are illustrated in Figure 2.7.
Maintenance concept optimization has professionalized. Corrective and precautionary actions are combined in different policies, from reactive to preventive and from predictive to proactive policies. A sound insight into the pros and cons of each of these policies is available in practice, and research supports the selection and optimization of these policies. These policies are no longer ad hoc, loose elements within maintenance management; they are embedded in maintenance concepts focussing on reliability and productivity. These concepts ensure consistent decision making for all equipment and at the same time allow for individualized installation maintenance concepts. Decision tools are available to support this process.
Top management nowadays, at least in most companies, recognizes the importance of maintenance as an element of the business strategy. Expectations for
maintenance are no longer formulated as 'keep things running', but are based upon
the overall business strategy, which may emphasize flexibility, quality or
low cost. The maintenance organization, with its structural and infrastructural
elements, is built accordingly.
The previous paragraph may give the impression that all problems of
maintenance management are already solved; this, however, is not the case. New
opportunities exist, for example in outsourcing and e-maintenance.
Moreover, there is a threatening gap between the top management level, where the
overall maintenance strategy is determined, and the tactical level, on which the
maintenance concepts are designed, detailed and implemented (Figure 2.8). The
gap lies in the alignment between the tactical and subsequent operational phases
on the one hand and the strategic phase on the other. While both
aspects are well studied, the link between the two is often not well established.
This leads to disappointment among top managers as well as frustration among
maintenance managers. Research shows a similar gap. There is some, though
still not enough, research on the link between maintenance and business
strategy. The main focus of maintenance management research is still on
tactical and operational planning. Links between the former and the latter strands of
research are still very rare. Closing this gap by linking maintenance and
business throughout all decision levels is one of the major challenges for the
future; every step taken brings us closer to real world-class maintenance.
MTBF: Mean-time-between-failures
MTTR: Mean-time-to-repair
NMI: Normal moving items
OBM: Opportunity-based maintenance
OEE: Overall equipment effectiveness
OM: Operations management
OR: Operations research
PM: Precautionary maintenance
Q&D: Quick & dirty decision charts
R&D: Research & development
RBI: Risk-based inspections
RCBM: Risk-based centred maintenance
RCM: Reliability-centred maintenance
ROI: Return on investment
SAE: Society of Automotive Engineers
SMED: Single minute exchange of dies
SMI: Slow moving items
TBM: Time-based maintenance
TCO: Total cost of ownership
TPM: Total productive maintenance
TQM: Total quality management
UBM: Use-based maintenance
VDM: Value-driven maintenance
VSMI: Very slow moving items
WIP: Work in progress
2.7 References
Anderson, R.T., Neri, L., (1990), Reliability Centred Maintenance: Management and
Engineering Methods, Elsevier Applied Sciences, London
Blanchard, B.S., (1992), Logistics Engineering and Management, Prentice Hall, Englewood
Cliffs, New Jersey
Cho, I.D., Parlar, M., (1991), A survey on maintenance models for multi-unit systems.
European Journal of Operational Research, 51:1–23
Coetzee, J.L., (2002), An Optimized Instrument for Designing a Maintenance Plan: A Sequel
to RCM. PhD thesis, University of Pretoria, South Africa
Dekker, R., (1996), Applications of maintenance optimization models: A review and
analysis. Reliability Engineering and System Safety, 52(3):229–240
Dekker, R., Scarf, P.A., (1998), On the impact of optimisation models in maintenance
decision making: the state of the art. Reliability Engineering and System Safety,
60:111–119
Geraerds, W.M.J., (1972), Towards a Theory of Maintenance, The English University Press,
London
Gits, C.W., (1984), On the Maintenance Concept for a Technical System: A Framework for
Design, PhD thesis, TU Eindhoven, The Netherlands
Haarman, M., Delahay, G., (2004), Value Driven Maintenance: New Faith in Maintenance, Mainnovation, Dordrecht, The Netherlands
Jones, R.B., (1995), Risk-Based Maintenance, Gulf Professional Publishing (Elsevier),
Oxford
Kelly, A., (1997), Maintenance Organizations & Systems: Business-Centred Maintenance,
Butterworth-Heinemann, Oxford
McCall, J.J., (1965), Maintenance policies for stochastically failing equipment: A survey.
Management Science, 11(5):493–524
Moubray, J., (1997), Reliability-Centred Maintenance, Second Edition, Butterworth-Heinemann, Oxford
Nowlan, F.S., Heap, H.F., (1978), Reliability Centered Maintenance, United Airlines
Publications, San Francisco
Parkes, D., in Jardine, A.K.S., (1970), Operational Research in Maintenance, University of
Manchester Press, Manchester
Pintelon, L., Gelders, L., Van Puyvelde, F., (2000), Maintenance Management, Acco, Leuven/
Amersfoort
Pintelon, L., Gelders, L., (1992), Maintenance management decision making. European
Journal of Operational Research, 58:301–317
Pintelon, L., Pinjala, K., Vereecke, A., (2006), Evaluating the Effectiveness of Maintenance
Strategies. Journal of Quality in Maintenance Engineering (JQME), 12(1):214–229
Pintelon, L., Van Puyvelde, F., (2006), Maintenance Decision Making, Acco, Leuven,
Belgium
Takahashi, Y., Osada, T., (1990), TPM: Total Productive Maintenance. Asian
Productivity Organization, Tokyo
Valdez-Flores, C., Feldman, R.M., (1989), A survey of preventive maintenance models for
stochastically deteriorating single-unit systems. Naval Research Logistics, 36:419–446
Waeyenbergh, G., (2005), CIBOCOF: A Framework for Industrial Maintenance Concept
Development, PhD thesis, Centre for Industrial Management, K.U.Leuven, Leuven,
Belgium
Waeyenbergh, G., Pintelon, L., (2002), A framework for maintenance concept development.
International Journal of Production Economics, 77:299–313
Wang, H., (2002), A survey of maintenance policies of deteriorating systems. European
Journal of Operational Research, 139:469–489
3
New Technologies for Maintenance
Jay Lee and Haixia Wang
3.1 Introduction
For years, maintenance has been treated as a dirty, boring and ad hoc job. It is seen as
critical for maintaining productivity but has yet to be recognized as a key component
of revenue generation. The question most often asked is 'Why do we need to maintain things regularly?' The answer is 'To keep things as reliable as possible.' However, the question that should be asked is 'How much change or degradation has
occurred since the last round of maintenance?' The answer to this question is 'I
don't know.' Today, most machine field services depend on sensor-driven management systems that provide alerts, alarms and indicators. The moment the alarm
sounds, it is already too late to prevent the failure. Therefore, most machine maintenance today is either purely reactive (fixing or replacing equipment after it fails) or
blindly proactive (assuming a certain level of performance degradation, with no input
from the machinery itself, and servicing equipment on a routine schedule whether
service is actually needed or not). Both scenarios are extremely wasteful.
Rather than reactive, fail-and-fix maintenance, world-class companies are
moving towards predict-and-prevent maintenance. A maintenance
scheme referred to as condition based maintenance (CBM) was developed to
consider the current degradation and its evolution. CBM methods and practices
have been continuously improved over the last decades; however, CBM is conducted
at the equipment level (one piece of equipment at a time), and the developed prognostics approaches are application or equipment specific.
Holistic approaches, real-time prognostic devices, and rapid implementation
environments are potential future research topics in product and system health
assessment and prognostics. With the level of integrated network systems development in today's global business environment, machines and factories are networked, and information and decisions are synchronized in order to maximize a
company's asset investments. This generates a critical need for a real-time remote
machinery prognostics and health management (R2M-PHM) system. The unmet
needs in maintenance can be categorized as follows:
No way to fix it: the maintenance technique is not available for a special
application, or the maintenance technique is at too early a stage of development.
It isn't worth fixing: some machines were designed to be used only once;
compared to the maintenance cost, it may be more cost-effective just to
discard them.
Neither of the scenarios above is within the scope of the discussion here.
[Figure: evolution of maintenance strategies plotted against machine performance and uptime, from no maintenance and reactive maintenance (fire fighting), through preventive maintenance (scheduled maintenance) and predictive maintenance, to proactive maintenance (failure root cause analysis) and self-maintenance or maintenance-free machines]
from the historical databases of equipment behavior over time. These two indices
provide a rough estimate of the time between two adjacent breakdowns and the
mean time needed to restore a system when such breakdowns happen. Although
equipment degradation processes vary from case to case, and the causes of failure
can be different as well, the information contained in MTBF and MTTR can still
be informative. Other indices can also be extracted and used, including the mean
lifetime, mean time to first failure, and mean operational life, as discussed by Pham
et al. (1997). With the introduction of minimal repair and imperfect maintenance,
various extensions and modifications to the age-dependent PM policy have been
proposed (Bruns 2002; Chen et al. 2003). Another preventive maintenance policy
that received much attention is the periodic PM policy, in which degraded
machines are repaired or replaced at fixed time intervals independent of the
equipment failures. Various modifications and enhancements to this maintenance
policy have also been proposed recently (Cavory et al. 2001).
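The MTBF and MTTR indices mentioned above can be estimated with a minimal sketch; the event-log format and the numbers below are invented for illustration and are not taken from the handbook.

```python
def mtbf_mttr(events):
    """Estimate MTBF and MTTR from a time-sorted list of
    (failure_time, restoration_time) pairs, in hours.

    MTBF: mean operating time from one restoration to the next failure.
    MTTR: mean time from a failure to the following restoration.
    """
    uptimes = [events[i + 1][0] - events[i][1] for i in range(len(events) - 1)]
    repairs = [restore - fail for fail, restore in events]
    return sum(uptimes) / len(uptimes), sum(repairs) / len(repairs)

# Hypothetical log: three breakdowns of one machine
log = [(100.0, 104.0), (204.0, 206.0), (306.0, 310.0)]
mtbf, mttr = mtbf_mttr(log)   # mtbf = 100.0 h, mttr ≈ 3.33 h
```

Richer indices such as mean time to first failure would need the installation time as well, which this simple log format does not carry.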
The preventive maintenance schemes are time-based without considering the
current health state of the product, and thus are inefficient and less valuable for a
customer whose individual asset is of the most concern. For the case of helicopter
gearboxes, it was found that almost half of the units were removed for overhaul
even though they were in a satisfactory operating condition. Therefore techniques
for more economical and reliable maintenance are needed.
3.2.1.4 Predictive Maintenance
Predictive maintenance (PdM) is a right-on-time maintenance strategy. It is based on
the failure limit policy in which maintenance is performed only when the failure rate,
or other reliability indices, of a unit reaches a predetermined level. This maintenance
strategy has been implemented as condition based maintenance (CBM) in most
production systems, where certain performance indices are periodically (Barbera et
al. 1996; Chen and Trivedi 2002) or continuously monitored (Marseguerra et al.
2002). Whenever an index value crosses some predefined threshold, maintenance
actions are performed to restore the machine to its original state, or to a state where
the changed value is at a satisfactory level in comparison to the threshold.
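The threshold-crossing trigger described above can be sketched in a few lines; the monitored index values and the alarm threshold are illustrative assumptions.

```python
def first_threshold_crossing(index_values, threshold):
    """Return the index of the first sample at which the monitored
    performance index reaches the maintenance threshold, or None
    if the threshold is never crossed."""
    for i, value in enumerate(index_values):
        if value >= threshold:
            return i
    return None

# Hypothetical vibration RMS trend, sampled once per shift; alarm at 0.8
trend = [0.21, 0.25, 0.31, 0.42, 0.58, 0.79, 0.86, 0.91]
trigger = first_threshold_crossing(trend, 0.8)   # sample 6
```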
Predictive maintenance can be best described as a process that requires both
technology and human skills, while using a combination of all available diagnostic
and performance data, maintenance history, operator logs and design data to make
timely decisions about maintenance requirements of major/critical equipment. It is
this integration of various data, information and processes that leads to the success
of a PdM program. It analyzes the trend of measured physical parameters against
known engineering limits for the purpose of detecting, analyzing and correcting a
problem before a failure occurs. A maintenance plan is devised based on the
prediction results derived from condition based monitoring. This method can cost
more up front than PM because of the additional monitoring hardware and software
investment, cost of manning, tooling, and education that is required to establish a
PdM program. However, it provides a basis for failure diagnostics and maintenance
operations, and offers increased equipment reliability and a sufficient advance in
information to improve planning, thereby reducing unexpected downtime and
operating costs.
3.2.1.6 Self-maintenance
Self-maintenance is a new design and system methodology. Self-maintenance
machines are expected to be able to monitor, diagnose, and repair themselves in
order to increase their uptime.
One system approach to enabling self-maintenance is based on the concept of
functional maintenance (Umeda et al. 1995). Functional maintenance aims to
recover the required function of a degrading machine by trading off functions,
whereas traditional repair (physical maintenance) aims to recover the initial
physical state by replacing faulty components, cleaning, etc. The way to fulfil the
self-maintenance function is by adding intelligence to the machine, making it
clever enough for functional maintenance, so that the machine can monitor and
diagnose itself, and it can still maintain its functionality for a while if any kind of
failure or degradation occurs. In other words, self-maintainability would be
appended to an existing machine as an additional embedded reasoning system. The
required capabilities of a self-maintenance machine (SMM) are defined as follows
(Labib 2006):
prognostics were built into the network functionality. Vachtsevanos and Wang
(2001) gave an overview of different CBM algorithms and suggested a method to
compare their performance for a specific application.
Prognostic information, obtained through intelligence embedded into the
manufacturing process or equipment, can also be used to improve manufacturing
and maintenance operations in order to increase process reliability and improve
product quality. For instance, the ability to increase reliability of manufacturing
facilities using the awareness of the deterioration levels of manufacturing equipment
has been demonstrated through an example of improving robot reliability (Yamada
and Takata 2002). Moreover, a life cycle unit (LCU) (Seliger et al. 2002) was
proposed to collect usage information about key product components, enabling one
to assess product reusability and facilitating the reuse of products that have
significant remaining useful life.
In spite of the progress in CBM, many fundamental issues still remain. For
example:
1. Most research is conducted at the single equipment level, and no infrastructure exists for employing a real-time remote machinery diagnosis and
prognosis system for maintenance.
2. Most of the developed prognostics approaches are application or equipment
specific. A generic and scalable prognostic methodology or toolbox doesn't
exist.
3. Currently, methods are focused on solving the failure prediction problem.
The need for tools for system performance assessment and degradation
prediction has not been well addressed.
4. The maintenance world of tomorrow is an information world for feature-based monitoring. Features used for prognostics need to be further developed.
5. Many developed prediction algorithms have been demonstrated in a laboratory environment, but are still without industry validation.
To address the aforementioned unmet needs, the Watchdog Agent-based intelligent
maintenance system (IMS) has been presented by the IMS Center, with a vision to
develop a systematic approach in advanced prognostics to enable products and
systems to achieve near-zero breakdown reliability and performance.
form. More often, no infrastructure exists for delivering the data over a network, or
for managing and analyzing the data, even if the devices were networked.
The Watchdog Agent-based real-time remote machinery prognostics and health
management (R2M-PHM) system has recently been developed by the IMS Center.
It focuses on developing innovative prognostics algorithms and tools, as well as
remote and embedded predictive maintenance technologies to predict and prevent
machine failures, as illustrated in Figure 3.2.
Figure 3.2. Key focus and elements of the Intelligent Maintenance Systems
The rest of the section is organized as follows. Section 3.3.1 deals with the
platform of the Watchdog Agent-based real-time remote machinery prognostics and
health management (R2M-PHM) system. Section 3.3.2 presents a generic and
scalable prognostic methodology or toolbox, i.e., the Watchdog Agent toolbox;
and Section 3.3.3 illustrates the effectiveness and potential of this new development
using several real industry case studies.
3.3.1 Watchdog Agent-based R2M-PHM Platform
A generic and scalable prognostics framework was presented by Su et al. (1999) to
integrate with embedded diagnostics to provide total health management
capability. A reconfigurable and scalable Watchdog Agent-based R2M-PHM
platform is being developed by the IMS Center, which expands the well known
open system architecture for condition-based maintenance (OSA-CBM) standard
(Thurston and Lebold 2001) by including real-time remote machinery diagnosis
and prognosis systems and embedded Watchdog Agent technology. As illustrated
in Figure 3.3, the Watchdog Agent (hardware and software) is embedded onto
machines to convert multi-sensory data to machine health information. The
extracted information is managed and transferred through wireless internet or a
satellite communication network, and service is automatically triggered.
Figure 3.3. Illustration of IMS real-time remote machinery diagnosis and prognosis system
[In Figure 3.3, sensor signals (vibration, temperature, pressure, current, voltage, on/off status) feed the embedded Watchdog Agent toolbox software on an embedded computer, which is linked to a remote computer hosting a database, decision support tools, a web server and client software]
Figure 3.5.
of memory since all of the tools are embedded into the hardware. It has 16 high
speed analog input channels to deal with highly dynamic signals. It also has
various peripherals that can acquire non-analog sensor signals, such as RS-232/485/422, parallel and USB ports. The prototype uses a compact flash card for
storage, so it can be placed on top of machine tools and is suitable for withstanding
vibrations in a working environment. Once a certain set of tools/algorithms is
determined for a certain industry application, commercially available hardware,
such as Advantech and National Instruments (NI) as illustrated in Figure 3.6b and
c, respectively, will be further evaluated for customized Watchdog Agent applications.
The Watchdog Agent toolbox enables one to quantitatively assess and predict the
performance degradation levels of key product components, and to determine the
root causes of failure (Casoetto et al. 2003; Djurdjanovic et al. 2000; Lee 1995,
1996), thus making it possible to realize physically closed-loop product life cycle
monitoring and management. The Watchdog Agent consists of embedded
computational prognostic algorithms and a software toolbox for predicting degradation of devices and systems. Degradation assessment is conducted after the
critical properties of a process or machine are identified and measured by sensors. It
is expected that the degradation process will alter the sensor readings that are being
fed into the Watchdog Agent, and thus enable it to assess and quantify the
degradation by quantitatively describing the corresponding change in sensor
signatures. In addition, a model of the process or piece of equipment that is being
considered, or available application specific knowledge can be used to aid the
degradation process description, provided that such a model and/or such knowledge
exist. The prognostic function is realized through trending and statistical modeling
of the observed process performance signatures and/or model parameters.
In order to facilitate the use of Watchdog Agent in a wide variety of applications
(with various requirements and limitations regarding the character of signals,
available processing power, memory and storage capabilities, limited space, power
consumption, the user's preference, etc.), the performance assessment module of the
Watchdog Agent has been realized in the form of a modular, open architecture
toolbox. The toolbox consists of different prognostics tools, including neural
network-based, time-series based, wavelet-based and hybrid joint time-frequency
methods, etc., for predicting the degradation or performance loss of devices, processes,
and systems. The open architecture of the toolbox allows one easily to add new
solutions to the performance assessment modules as well as to easily interchange
different tools, depending on the application needs. To enable rapid deployment, a
quality function deployment (QFD) based selection method has been developed to
provide general guidance for tool selection; this is especially critical for
those industry users who have little knowledge of these algorithms. The current
tools employed in the signal processing and feature extraction, performance assessment, diagnostics and prognostics modules of Watchdog Agent functionality are
summarized in Figure 3.10.
Each of these modules is realized in several different ways to facilitate the use
of the Watchdog Agent in a wide variety of products and applications.
signals, which place a strong emphasis on the need for development and utilization
of non-stationary signal analysis techniques, such as wavelets or joint time-frequency analysis. The feature extraction module extracts features most relevant
to describing a products performance. Those features are extracted from the time
domain into which the sensory processing module transforms sensory signals,
using expert knowledge about the application, or automatic feature selection
methods such as roots of the autoregressive time-series model, or time-frequency
moments and singular value decomposition.
Currently the following signal processing and feature extraction tools are used
in the Watchdog Agent toolbox:
The Fourier transformation method has been widely used in de-noising and
feature extraction. The noise component in the signal can be distinguished after
the signal is transformed, and feature components can be identified after the
removal of noise. However, Fourier transformation is applicable to stationary
signals only, since frequency-band energies for such applications are
characterized by time-invariant frequency content.
The autoregressive modeling method calculates frequency peak locations
and intensities using autoregressive oscillation modes of sensor readings,
which bear significant information about the process (usually, mechanical
systems are well described by their modes of oscillation).
The wavelet/wavelet packet decomposition method enables the rapid
calculation of non-stationary signal energy distribution at the expense of
losing some of the desirable mathematical properties.
The time-frequency analysis method provides both temporal and spectral
information with good resolution, and is applicable to highly non-stationary
signals (e.g. impacts or transient behaviors). However, it is not applicable if
a large amount of data has to be considered and calculation speed is a
concern.
The application-specific feature extraction method is applicable in cases
when one can directly extract performance-relevant features out of the
time-series of sensor readings.
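As a hedged illustration of the first tool in the list, Fourier-based de-noising can be sketched as follows: transform the signal, keep only the strongest spectral components, and transform back. The test signal, sampling rate and keep_fraction parameter are assumptions for illustration, not values from the toolbox.

```python
import numpy as np

def fft_denoise(signal, keep_fraction=0.02):
    """Fourier de-noising: keep only the strongest spectral
    components and zero the rest before inverse transforming."""
    spectrum = np.fft.rfft(signal)
    magnitudes = np.abs(spectrum)
    threshold = np.quantile(magnitudes, 1.0 - keep_fraction)
    spectrum[magnitudes < threshold] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

# Hypothetical example: a 50 Hz tone (a stationary signal) buried in noise
rng = np.random.default_rng(0)
t = np.arange(0, 1, 1 / 1000)          # 1 s at 1 kHz
clean = np.sin(2 * np.pi * 50 * t)
noisy = clean + 0.5 * rng.standard_normal(t.size)
denoised = fft_denoise(noisy)
```

As the list notes, this only works because the tone's frequency content is time-invariant; for transient impacts a wavelet or time-frequency method is needed instead.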
expert knowledge exists, simple but rapid performance assessment based on the
feature-level fused multi-sensor information can be made using the relative number
of activated cells in the neural network, or by using the logistic regression
approach. For products with open-control architecture, the match between the
current and nominal control inputs and the performance criteria can also be utilized
to assess the product's performance. For more sophisticated applications with
intricate and complicated signals and performance signatures, statistical pattern
recognition methods, or the feature map based approach can be employed.
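A minimal sketch of the logistic regression approach mentioned above, mapping a fused feature vector to a confidence value (CV) between 0 and 1; the weights, bias and feature values are hypothetical, and in practice the weights would be fitted from training data.

```python
import math

def confidence_value(features, weights, bias):
    """Logistic-regression-style performance assessment: map a fused
    multi-sensor feature vector to a confidence value in [0, 1],
    where values near 1 indicate normal behaviour and values near 0
    indicate degraded behaviour."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted weights: higher vibration RMS and temperature
# deviation lower the confidence value.
weights, bias = [-4.0, -2.0], 3.0
healthy_cv = confidence_value([0.2, 0.3], weights, bias)    # ~0.83
degraded_cv = confidence_value([0.9, 1.2], weights, bias)   # ~0.05
```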
The following performance assessment tools are currently being used in the
Watchdog Agent toolbox:
failure modes occur, performance signatures related to each specific failure mode
can be collected and used to teach the Watchdog Agent to recognize and diagnose
those failure modes in the future. Thus, the Watchdog Agent is envisioned as an
intelligent device that utilizes its experience and human supervisory inputs over
time to build its own expandable and adjustable world model.
Performance assessment, prediction and prognostics can be enhanced through
feature-level or decision-level sensor fusion, as defined by Hall and Llinas (2000)
(Chapter 2). Feature-level sensor fusion is accomplished through concatenation of
features extracted from different sensors, and the joint consideration of the concatenated feature vector in the performance assessment and prediction modules.
Decision-level sensor fusion is based on separately assessing and predicting process performance from individual sensor readings and then merging these individual sensor inferences into a multi-sensor assessment and prediction through
some averaging technique.
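Decision-level fusion by averaging, as described above, can be sketched in a few lines; the per-sensor confidence values and weights are illustrative.

```python
def fuse_decisions(sensor_cvs, weights=None):
    """Decision-level sensor fusion: merge per-sensor confidence
    values into one multi-sensor assessment by weighted averaging."""
    if weights is None:
        weights = [1.0] * len(sensor_cvs)
    total = sum(weights)
    return sum(w * cv for w, cv in zip(weights, sensor_cvs)) / total

# Hypothetical per-sensor assessments (vibration, temperature, current)
fused = fuse_decisions([0.9, 0.7, 0.8])                      # plain average
weighted = fuse_decisions([0.9, 0.7, 0.8], [2.0, 1.0, 1.0])  # trust sensor 1 more
```

Feature-level fusion, by contrast, would concatenate the raw feature vectors before a single assessment, rather than averaging separate per-sensor inferences as here.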
In summary, the following performance forecasting tools are currently used in
the Watchdog Agent:
New tools will be continuously developed and added to the modular, open
architecture Watchdog Agent toolbox based on the development procedure as
shown in Figure 3.12.
[Figure 3.12: Tool development procedure: tool selection, prototyping and testing, program development, and evaluation, with yes/no decision points looping back until the tool is accepted for deployment]
Figure 3.15. The bearing test rig sponsored by Rexnord Technical Service
Figure 3.16 presents the vibration waveform collected from bearing 4 at the last
stage of the bearing test. The signal exhibits strong impulse periodicity because of
the impacts generated by a mature outer race defect. However, when examining the
historical data and observing the vibration signal three days before the bearing
failed, there is no sign of periodic impulses as shown in Figure 3.17a. The periodic
impulse feature is completely masked by the noise.
An adaptive wavelet filter is designed to de-noise the raw signal and enhance
degradation detection. The filter is obtained in two steps. First the
optimal wavelet shape factor is found by the minimal entropy method. Then an
optimal scale is identified by maximizing the signal periodicity. By applying the
designed wavelet filter to the noisy raw signal, the de-noised signal can be obtained
as shown in Figure 3.17b. The periodic impulse feature can then be clearly discovered, which serves as strong evidence of bearing outer race degradation. The
wavelet filter-based de-noising method successfully enhanced the signal feature
and provided potent evidence for prognostic decision-making.
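The scale-selection step (maximizing signal periodicity) can be illustrated with a small numerical sketch. The simulated bearing signal, sampling rate, candidate centre frequencies and Morlet-like kernel below are all invented for illustration; the actual filter in the study also optimizes the wavelet shape factor by the minimal entropy method, which is not reproduced here.

```python
import numpy as np

fs = 2000                      # Hz, assumed sampling rate
n = 4000
period = int(0.05 * fs)        # assumed fault period: one impact per 0.05 s
rng = np.random.default_rng(1)

# Simulated faulty-bearing signal: each impact rings a 200 Hz resonance
# with a fast exponential decay, buried in broadband noise.
sig = np.zeros(n)
for k in range(0, n, period):
    m = np.arange(n - k)
    sig[k:] += 3.0 * np.exp(-m / 10.0) * np.sin(2 * np.pi * 200 * m / fs)
sig += 0.3 * rng.standard_normal(n)

def morlet_kernel(freq, fs, cycles=8):
    """Complex Morlet-like band-pass kernel centred at `freq`."""
    dur = cycles / freq
    tt = np.arange(-dur / 2, dur / 2, 1 / fs)
    return np.exp(2j * np.pi * freq * tt) * np.exp(-tt**2 / (2 * (dur / 6) ** 2))

def periodicity(envelope, lag):
    """Normalized autocorrelation of the envelope at the fault period."""
    e = envelope - envelope.mean()
    return np.dot(e[:-lag], e[lag:]) / np.dot(e, e)

scores = {}
for freq in (80, 200, 500):    # candidate centre frequencies (scales)
    env = np.abs(np.convolve(sig, morlet_kernel(freq, fs), mode="same"))
    scores[freq] = periodicity(env, period)
best = max(scores, key=scores.get)   # expected: the 200 Hz resonance band
```

The band centred on the resonance yields the most periodic envelope, which is the criterion the study uses to fix the optimal scale.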
3.3.3.3 Example 3: Bearing Risk of Failure and Remaining Useful Life Prediction
An important issue in prognostic technology is the estimation of the risk of failure,
and of the remaining useful life of a component, given the component's age and its
past and current operating condition. In numerous cases, failures were attributed to
many correlated degradation processes, which could be reflected by multiple
degradation features extracted from sensor signals. These features carry the major
information regarding the health of the component under monitoring; however, the
failure boundary is hard to define using these features. In reality, the same feature
vector could be attributed to totally different combinations of the underlying
degradation processes and their severity levels. There is only a probabilistic
relationship between component failure and a given level of the degradation
features. A typical example can be found in bearing operation: two bearings
of the same type could fail at different levels of RMS and kurtosis of the vibration
signal. To capture the probabilistic relationship between the multiple degradation
features and the component failure as well as to predict the risk of failure and the
remaining useful life, IMS has developed a Proportional Hazards (PH) approach
(Liao et al. 2005) based on the PH model proposed by Cox (1972). The PH model
involving multiple degradation features is given as
λ(t; Z) = λ0(t) exp(β′Z)
(3.1)
where λ(t; Z) is the hazard rate of the component given the current age t and the
degradation feature vector Z; λ0(t) is called the baseline hazard rate function;
and β is the model parameter vector. This formulation relates the working age and
the multiple degradation features to the hazard rate of the component. To estimate the
parameters, the maximum likelihood approach could be utilized using offline data,
including the degradation features over time of many components and their failure
times. Afterwards, the established model can be used for predicting the risk of
failure for the component by plugging in the working age and the degradation
features extracted from the on-line sensor signals. In addition, the remaining useful
life L(t_current), given the current working age and the history of degradation features,
can be estimated as

L(t_current) = ∫_{t_current}^∞ exp( − ∫_{t_current}^{τ} λ(v; z(v)) dv ) dτ
(3.2)
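A numerical sketch of Eqs. (3.1) and (3.2): the constant baseline hazard, parameter values and feature vector below are illustrative assumptions (in practice the baseline hazard and parameter vector are estimated from offline run-to-failure data, and the features are updated online). With a constant hazard rate the exact remaining life is 1/λ, which the numerical integration should reproduce.

```python
import math

def hazard(t, z, beta, baseline):
    """Eq. (3.1): lambda(t; Z) = lambda0(t) * exp(beta' Z)."""
    return baseline(t) * math.exp(sum(b * zi for b, zi in zip(beta, z)))

def remaining_useful_life(t_now, z, beta, baseline, horizon=500.0, dt=0.01):
    """Eq. (3.2) by simple step-wise numerical integration, with the
    degradation features held at their latest observed value."""
    total, cum = 0.0, 0.0
    for i in range(int(horizon / dt)):
        cum += hazard(t_now + i * dt, z, beta, baseline) * dt   # inner integral
        total += math.exp(-cum) * dt                            # outer integral
    return total

# Assumed constant baseline hazard 0.1/day and one feature with
# exp(beta' Z) = 2, so lambda = 0.2 and the exact RUL is 5 days.
rul = remaining_useful_life(
    t_now=26.0, z=[1.0], beta=[math.log(2.0)], baseline=lambda t: 0.1,
)
```

Holding the features constant is the simplest choice; a full implementation would also model the future evolution of z(v) inside the inner integral.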
Table 3.1. Estimates of expected remaining useful life, Test 1, Bearing 3 (unit: day)

Time    Estimated RUL    Actual RUL    Error
26      3.5549           6.5278        2.9729
29      3.3965           3.5278        0.1313
31      1.5295           1.5278        0.0017
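The error column in Table 3.1 is the absolute difference between the estimated and the actual remaining life at each inspection time, which can be checked directly:

```python
# Rows of Table 3.1: (time, estimated RUL, actual RUL, error), in days
rows = [
    (26, 3.5549, 6.5278, 2.9729),
    (29, 3.3965, 3.5278, 0.1313),
    (31, 1.5295, 1.5278, 0.0017),
]
for _, estimated, actual, error in rows:
    # each tabulated error equals |estimated - actual|
    assert abs(abs(estimated - actual) - error) < 1e-4
```

Note also that the actual values decrease by exactly the elapsed days between inspections, as expected for a run-to-failure test.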
has been developed, which serves as a baseline system for researchers and
companies to develop next-generation e-maintenance systems. It enables machine
makers and users to predict machine health degradation conditions, diagnose fault
sources, and suggest maintenance decisions before a fault actually occurs. The
Watchdog Agent-based R2M-PHM platform expands the OSA-CBM architecture
topology by including real-time remote machinery diagnosis and prognosis
systems and embedded Watchdog Agent technology. The Watchdog Agent is an
embedded algorithm toolbox which converts multi-sensory data to machine health
information. Innovative sensory processing and autonomous feature extraction
methods are developed to facilitate the plug-and-play approach, in which the
Watchdog Agent can be set up and run without any need for expert knowledge or
intervention.
Future work will focus on the further development of the Watchdog Agent-based
IMS platform. Smart software and netware will be further developed for proactive
maintenance capabilities such as performance degradation measurement, fault
recovery, self-maintenance and remote diagnostics. For the embedded Watchdog
Agent application, we need to harvest the developed technologies and tools and to
accelerate their deployment in real-world applications through close collaboration
between industrial and academic researchers. Specifically, future work will include
the following aspects: (i) evaluate the existing Watchdog Agent tools and identify
the application needs from the smart machine testbed; (ii) develop a configurable
prognostics tools platform for rotary machinery elements such as bearings, motors,
and gears, etc., so that several of the most frequently used prognostics tools can be pre-tested and deposited into a ready-to-use tool library; (iii) develop a user interface
system for tool selection, which allows users to apply the right tools effectively to
the right applications and achieve 'first tool correct' accuracy; (iv) validate the
reconfiguration of these tools to a variety of similar applications (to be defined by
the company participants); and (v) explore research in a peer-to-peer (P2P)
paradigm in which Watchdog Agents embedded on identical products operating
under similar conditions could exchange information and thus assist each other in
machine health diagnosis and prognosis.
To predict, prioritize, and plan precision maintenance actions to achieve an
'every action correct' objective, the IMS Center is creating advanced maintenance
simulation software for maintenance schedule planning and service logistics cost
optimization for transparent decision making. At the same time, the Center is
exploring the integration of decision support tools and optimization techniques for
proactive maintenance; this integration will facilitate the functionalities of the
Watchdog Agent-based R2M-PHM, in which an intelligent maintenance system
can operate as a near-zero-downtime, self-sustainable and self-aware artificially
intelligent system that learns from its own operation and experience.
Embedding is crucial for creating an enabling technology that can facilitate
proactive maintenance and life cycle assessment for mobile systems, transportation
devices and other products for which cost-effective realization of predictive performance assessment capabilities cannot be implemented on general purpose personal
computers. The main research challenge will be to accomplish sophisticated performance evaluation and prediction capabilities under the severe power consumption,
processing power and data storage limitations imposed by embedding. The Center
will develop a wireless sensor network made of self-powered wireless motes for
machine health monitoring and embedded prognostics. These networked smart motes
can be easily installed in products and machines with ad hoc communications. In
addition, the Center is investigating the feasibility of harvesting energy by using
vibration in an environment equipped with wireless motes for remote monitoring of
equipment and machinery. In conjunction with that investigation, the Center is
looking at ways of developing communication protocols that require less energy for
communication. Power converter circuitry has been designed by using vibration
signals in order to convert vibration energy into useful electric energy. These technologies are very critical for monitoring equipment or systems in a complex environment where the availability of power is the major constraint.
In the area of collaborative product life cycle design and management, the
Watchdog Agent can serve as an infotronics agent to store product usage and
end-of-life (EOL) service data and to send feedback to designers and life cycle
management systems. Currently, an international intelligent manufacturing systems
consortium on product-embedded information systems for service and EOL has
been proposed. The goal is to integrate Watchdog Agent capabilities into products
and systems for closed-loop design and life cycle management, as illustrated in
Figure 3.19.
The Center will continue advancing its research to develop technologies and tools
for closed-loop life cycle design for product reliability and serviceability, as well as
to explore research in new frontier areas such as embedded and networked agents for
self-maintenance, self-healing, and self-recovery of products and systems. These
new frontier efforts will lead to a fundamental understanding of reconfigurability and
allow the closed-loop design of autonomously reconfigurable engineered systems
that integrate the physical, information, and knowledge domains. These autonomously
reconfigurable engineered systems will be able to sense, perform self-prognosis, self-
[Figure 3.19 (schematic): Closed-loop life cycle design for near-zero downtime. A product or system in use is monitored through health monitoring with sensors and embedded intelligence (degradation Watchdog Agent, self-maintenance); tether-free communications with redundancy, active and passive (Bluetooth, Internet TCP/IP), link it to maintenance and service business synchronization (CBM, asset optimization); the results feed back into closed-loop life cycle design for reliability and serviceability (product redesign, smart design, enhanced six-sigma design).]
4
Reliability Centred Maintenance
Marvin Rausand and Jørn Vatn
4.1 Introduction
Reliability centred maintenance (RCM) is a method for maintenance planning that
was developed within the aircraft industry and later adapted to several other
industries and military branches. A large number of standards and guidelines have
been issued in which the RCM methodology is tailored to different application areas,
e.g., IEC 60300-3-11, MIL-STD-217, NAVAIR 00-25-403 (NAVAIR 2005), SAE
JA 1012 (SAE 2002), USACERL TR 99/41 (USACERL 1999), ABS (2003, 2004),
NASA (2000) and DEF-STD 02-45 (DEF 2000). On a generic level, IEC 60300-3-11
(IEC 1999) defines RCM as a systematic approach for identifying effective and
efficient preventive maintenance tasks for items in accordance with a specific set of
procedures and for establishing intervals between maintenance tasks. A major
advantage of the RCM analysis process is its structured and traceable approach to
determining the optimal type of preventive maintenance (PM). This is achieved
through a detailed analysis of failure modes and failure causes. Although the main
objective of RCM is to determine the preventive maintenance program, the results
of the analysis may also be used in relation to corrective maintenance strategies,
spare part optimization, and logistic considerations. RCM also plays an important
role in overall system safety management.
An RCM analysis process, when properly conducted, should answer the
following seven questions:
1. What are the system functions and the associated performance standards?
2. How can the system fail to fulfil these functions?
3. What can cause a functional failure?
4. What happens when a failure occurs?
5. What might the consequence be when the failure occurs?
6. What can be done to detect and prevent the failure?
7. What should be done when a suitable preventive task cannot be found?
1. Study preparation
2. System selection and definition
3. Functional failure analysis (FFA)
4. Critical item selection
5. Data collection and analysis
6. Failure modes, effects, and criticality analysis (FMECA)
7. Selection of maintenance actions
8. Determination of maintenance intervals
9. Preventive maintenance comparison analysis
10. Treatment of non-critical items
11. Implementation
12. In-service data collection and updating
The rest of the chapter is structured as follows: In Section 4.2 we describe and
discuss the 12 steps of the RCM process. The concepts of generic and local RCM
analysis are introduced in Section 4.3. These concepts have been used in a novel
RCM approach to improve and speed up the analyses in a railway application.
Models and methods for optimization of maintenance intervals are discussed in
Section 4.4. Some main features of a new computer tool, OptiRCM, are briefly
introduced. Concluding remarks are given in Section 4.5. The RCM analysis
approach that is described in this chapter is mainly in accordance with accepted
standards, but also contains some novel issues, especially related to steps 6 and 8
and the approach chosen in OptiRCM. The RCM approach is illustrated with
examples from railway applications. Simple examples from the offshore oil and
gas industry are also mentioned.
All systems may in principle benefit from an RCM analysis. With limited
resources we must, however, set priorities, at least when introducing RCM in a
new plant. We should start with the systems we assume will benefit most from the
analysis. The following criteria may be used to prioritize systems for an RCM
analysis:
Superstructure
Substructure
Signalling
Telecommunications
Power supply (overhead line with supporting systems)
Low voltage systems
In this chapter, the following terms are used for the levels of the assembly
hierarchy:
Plant: A logical grouping of systems that function together to provide an output
or product by processing and manipulating various input raw materials and
feedstock. An offshore gas production platform may, e.g., be considered as a plant.
For railway applications a plant might be a maintenance area, where the main
function of that plant is to ensure satisfactory infrastructure functionality in that
area. Moubray (1997) refers to the plant as a cost centre. In railway applications a
plant corresponds to a train set (rolling stock), or a line (infrastructure).
System: A logical grouping of subsystems that will perform a series of key
functions, which often can be summarized as one main function, that is required of
a plant (e.g., feed water, steam supply, and water injection). The compression
system on an offshore gas production platform may, e.g., be considered as a
system. Note that the compression system may consist of several compressors with
a high degree of redundancy. Redundant units performing the same main function
should be included in the same system. It is usually easy to identify the systems in
a plant, since they are used as logical building blocks in the design process.
The system level is usually recommended as the starting point for the RCM
process. This is further discussed and justified, e.g., by Smith (1993) and in
MIL-STD 2173 (MIL-STD 1986). This means that on an offshore oil/gas platform the
starting point of the analysis should be the compression system, the water injection
system, or the fire water system, and not the whole platform. In railway applications
the systems were defined above as the next highest level in the plant hierarchy.
The systems may be further broken down into subsystems, and sub-subsystems,
and so on. For the purpose of the RCM analysis process the lowest level of the
hierarchy should be what we will call an RCM analysis item.
RCM analysis item: A grouping or collection of components, which together
form some identifiable package that will perform at least one significant function
as a stand-alone item (e.g., pumps, valves, and electric motors). For brevity, an
RCM analysis item will in the following be called an analysis item. By this
definition, a shutdown valve, e.g., is classified as an analysis item, while the valve
actuator is not. The actuator is supporting equipment to the shutdown valve, and
only has a function as a part of the valve. The importance of distinguishing the
analysis items from their supporting equipment is clearly seen in the FMECA in
Step 6. If an analysis item is found to have no significant failure modes, then none
of the failure modes or causes of the supporting equipment are important, and
therefore do not need to be addressed. Similarly, if an analysis item has only one
significant failure mode, then the supporting equipment only needs to be analyzed
to determine if there are failure causes that can affect that particular failure mode
(Paglia et al. 1991). Therefore, only the failure modes and effects of the analysis
items need to be analyzed in the FMECA in Step 6. An analysis item is usually
repairable, meaning that it can be repaired without replacing the whole item. In the
offshore reliability database OREDA (2002) the analysis item is called an
equipment unit. The various analysis items of a system may be at different levels
of assembly. On an offshore platform, for example, a huge pump may be defined
as an analysis item in the same way as a small gas detector. If we have redundant
items, e.g., two parallel pumps, each of them should be classified as an analysis item.
When in Step 6 we identify causes of analysis item failures, we often find it
suitable to attribute these failure causes to failures of items at an even lower level of
indenture. The lowest level is usually referred to as components.
Component: The lowest level at which equipment can be disassembled without
damage or destruction to the items involved. Smith (2005) refers to this lowest level
as least replaceable assembly, while OREDA (2002) uses the term maintainable
item.
It is very important that the analysis items are selected and defined in a clear
and unambiguous way in this initial phase of the RCM analysis process, since the
following analysis will be based on these analysis items. If the OREDA database is
to be used in later phases of the RCM process, it is recommended as far as possible
to define the analysis items in compliance with the equipment units in OREDA.
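The assembly hierarchy described above can be represented as a simple tree. The following sketch uses hypothetical names and only the hierarchy levels defined in the text (plant, system, subsystem, analysis item, component); the helper function mirrors the idea that the later FMECA rows are defined at the analysis item level, with supporting equipment such as the valve actuator appearing only beneath it:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    level: str  # "plant", "system", "subsystem", "analysis_item", or "component"
    children: list = field(default_factory=list)

# Hypothetical offshore example following the chapter's definitions
platform = Node("Gas production platform", "plant", [
    Node("Compression system", "system", [
        Node("Compressor train A", "subsystem", [
            Node("Shutdown valve", "analysis_item", [
                Node("Valve actuator", "component"),  # supporting equipment
            ]),
        ]),
    ]),
])

def analysis_items(node):
    """Collect the names of all RCM analysis items in the hierarchy."""
    found = [node.name] if node.level == "analysis_item" else []
    for child in node.children:
        found.extend(analysis_items(child))
    return found

print(analysis_items(platform))  # ['Shutdown valve']
```

Walking the tree in this way also makes it easy to check that every analysis item has been defined in compliance with, e.g., the OREDA equipment units before the FMECA starts.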
1. Essential functions: These are the functions required to fulfil the intended
purpose of the item. The essential functions are simply the reasons for
installing the item. Often an essential function is reflected in the name of the
item. An essential function of a pump is, e.g., to pump a fluid.
2. Auxiliary functions: These are the functions that are required to support the
essential functions. The auxiliary functions are usually less obvious than the
essential functions, but may in many cases be as important as the essential
functions. Failure of an auxiliary function may in many cases be more
critical than a failure of an essential function. An auxiliary function of a
pump is, e.g., to contain fluid.
3. Protective functions: The functions intended to protect people, equipment,
and the environment from damage and injury. The protective functions may
be classified according to what they protect, as: (i) safety functions, (ii)
environment functions, and (iii) hygiene functions. An example of a
protective function is the protection provided by a rupture disk on a pressure
vessel.
4. Information functions: These functions comprise condition monitoring,
various gauges and alarms, and so on.
5. Interface functions: These functions apply to the interfaces between the item
in question and other items. The interfaces may be active or passive. A passive
interface is, e.g., present when an item is a support or a base for another item.
6. Superfluous functions: According to Moubray (1997), "Items or components
are sometimes encountered which are completely superfluous. This usually
happens when equipment has been modified frequently over a period of years,
or when new equipment has been over-specified." Superfluous functions are
sometimes present when the item has been designed for an operational context
that is different from the actual operational context. In some cases failures of a
superfluous function may cause failure of other functions.
For analysis purposes the various functions of an item may also be classified as:
The term functional failure is mainly used in the RCM literature and has the
same meaning as the more common term failure mode. In RCM we talk about
functional failures at the equipment level, and use the term failure mode for the
parts of the equipment. The failure modes will therefore be causes of a functional
failure. It is important to realize that a functional failure (and a failure mode) is a
manifestation of the failure as seen from the outside, i.e., a deviation from
performance standards.
Functional failures and failure modes may be classified in three main groups
related to the function of the item:
Total loss of function: In this case the function is not achieved at all, or the
quality of the function is far beyond what is considered acceptable.
Partial loss of function: This group may be very wide, and may range from
the nuisance category almost to the total loss of function.
Erroneous function: This means that the item performs an action that was
not intended, often the opposite of the intended function.
[FFA worksheet (figure): header fields for date, performed by, and page, with columns for function, function requirements, functional failure, frequency, and criticality (S, ...).]
The performance requirements for the functions, such as target values and
acceptable deviations, are listed in column 3. For each function (in column 2) all the
relevant functional failures are listed in column 4. In column 5 the
frequency/probability of each functional failure is listed. A criticality ranking of
each functional failure in that particular operational mode is given in column 6. The
reason for including the criticality ranking is to be able to limit the extent of the
further analysis by disregarding insignificant functional failures. For complex
systems such a screening is often very important in order not to waste time and
money.
The criticality ranking depends on both the frequency/probability of the
occurrence of the functional failure and the severity of the failure. The severity must
be judged at plant level.
The severity ranking should be given in the four consequence classes: (S) safety
of personnel, (E) environmental impact, (A) production availability, and (C)
economic losses. For each of these consequence classes the severity should be
ranked as, for example, (H) high, (M) medium, or (L) low. How the borderlines
between these classes should be defined will depend on the specific application.
If at least one of the four entries is (M) medium or (H) high, the severity of the
functional failure should be classified as significant, and the functional failure
should be subject to further analysis.
The frequency of the functional failure may also be classified in the same three
classes. (H) high may, e.g., be defined as more than once per 5 years, and (L) low as
less than once per 50 years. As above, the specific borderlines will depend on the
application.
The frequency classes may be used to prioritize among the significant system
failure modes.
If all the four severity entries of a system failure mode are (L) low, and the
frequency is also (L) low, the criticality is classified as insignificant, and the
functional failure is disregarded in the further analysis. If, however, the frequency is
(M) medium or (H) high the functional failure should be included in the further
analysis even if all the severity ranks are (L) low, but with a lower priority than the
significant functional failures.
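The screening rule above can be summarized in a small function. This is an illustrative sketch only; the H/M/L borderlines and the exact labels returned are assumptions, since the chapter leaves them application-specific:

```python
def classify_criticality(severities, frequency):
    """Screen a functional failure for further analysis.

    severities: dict mapping the consequence classes "S", "E", "A", "C"
                to a severity rank "H", "M", or "L".
    frequency:  frequency class "H", "M", or "L".

    Rule from the text: any severity M or H -> significant; all severities L
    but frequency M or H -> analyzed with lower priority; all L -> dropped.
    """
    if any(rank in ("M", "H") for rank in severities.values()):
        return "significant"
    if frequency in ("M", "H"):
        return "lower-priority"
    return "insignificant"

# A failure with low severity in every class but medium frequency is kept,
# with lower priority than the significant functional failures
ranks = {"S": "L", "E": "L", "A": "L", "C": "L"}
print(classify_criticality(ranks, "M"))  # lower-priority
```

For a complex system this screening can be applied mechanically to every row of the FFA worksheet before the more expensive FMECA of Step 6.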
The FFA may be rather time-consuming because, for all functional failures, we
have to list all the maintenance significant items (MSIs) (see Step 4). The MSI lists
will hence have to be repeated several times. To reduce the workload we often
conduct a simpler FFA in which, for each main function, we list all functional
failures in one column and all the related MSIs in another column. This is illustrated
in Figure 4.3 for a railway application.
The function name reflects the functions to be carried out at a relatively high
level in the system. In principle, we should explicitly formulate the function(s) to
be carried out. Instead we often specify the equipment class performing the
function. For example, "departure light signal" is specified rather than the more
correct formulation "ensure correct departure light signal". We observe that the last
functional failure in Figure 4.3 is not a failure mode for the correct functional
description ("Ensure correct departure light signal"), but is related to another function
of the departure light signal. Thus, if we use an equipment class description
rather than an explicit functional statement, the list of failure modes should cover
all (implicit) functions of the equipment class.
At the functional failure level, it is also convenient to specify whether the
failure mode is evident or hidden; see Figure 4.3 where we have introduced an
EF/HF column.
For each function we also list the relevant items that are required to perform the
function. These items will form rows in the FMECA worksheets; see Step 5.
4.2.4 Step 4: Critical Item Selection
The objective of this step is to identify the analysis items that are potentially
critical with respect to the functional failures identified in Step 3(iii). These
analysis items are denoted functional significant items (FSI). For simple systems
the FSIs may be identified without any formal analysis. In many cases it is obvious
which analysis items influence the functional failures. For complex
systems with an ample degree of redundancy or with buffers, we may need a
formal approach to identify the FSIs.
If failure rates and other necessary input data are available for the various
analysis items, it is usually a straightforward task to calculate the relative importance
of the various analysis items based on a fault tree model or a reliability block
diagram. A number of importance measures are discussed by Rausand and Høyland
(2004).
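As a hedged illustration of such an importance calculation, the sketch below applies the standard Birnbaum measure to a small reliability block diagram (the chapter itself does not prescribe a particular measure, and the component names and reliabilities are invented): two redundant pumps in parallel, in series with a single shutdown valve.

```python
def system_reliability(p_pump1, p_pump2, p_valve):
    # RBD: (pump1 in parallel with pump2) in series with the valve
    parallel = 1 - (1 - p_pump1) * (1 - p_pump2)
    return parallel * p_valve

def birnbaum(component, p):
    # Birnbaum importance: I_B(i) = h(p | p_i = 1) - h(p | p_i = 0)
    hi = dict(p, **{component: 1.0})
    lo = dict(p, **{component: 0.0})
    return (system_reliability(hi["pump1"], hi["pump2"], hi["valve"])
            - system_reliability(lo["pump1"], lo["pump2"], lo["valve"]))

p = {"pump1": 0.95, "pump2": 0.95, "valve": 0.99}
# The non-redundant valve dominates the ranking, as expected
print(round(birnbaum("valve", p), 4))   # 0.9975
print(round(birnbaum("pump1", p), 4))   # 0.0495
```

Ranking the analysis items by such a measure is one way to separate the functional significant items from the rest when the system is too complex for inspection by eye.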
In addition to the FSIs, we should also identify items with high failure rate,
high repair costs, low maintainability, long lead-time for spare parts, or items
requiring external maintenance personnel. These analysis items are denoted
maintenance cost significant items (MCSI).
The functional significant items and the maintenance cost significant items are
together denoted maintenance significant items (MSI).
In an RCM project for the Norwegian Railway Administration the use of
generic RCM analyses (see Section 4.3) made it possible to analyze all identified
MSIs. In this case this step could be omitted.
4.2.5 Step 5: Data Collection and Analysis
The purpose of this step is to establish a basis for both the qualitative analysis
(relevant failure modes and failure causes), and the quantitative analysis (reliability
parameters such as MTTF, PF-intervals, and so on). The data necessary for the
RCM analysis may be categorized into the following three groups:
[Figure 4.3 (worksheet extract): for the function "Departure light signal" (five lamp signals, with three main signals and two pre-signals; another listed function is "Home signal"), the functional failures are classified in an EF/HF column (all HF in this example), and the related MSIs are listed: signal mast, brands, background shade, earth conductor, lamp, lens, transformer, etc.]
During the initial phase of the RCM analysis process it often becomes evident that
the format and quality of the operational data are not sufficient to estimate the
relevant reliability parameters. Some of the main problems encountered are:
The failure data are at too high a level in the assembly hierarchy, i.e., data are
not reported at the RCM analysis item level (MSI).
Failure modes and failure causes are not reported, or the recorded
information does not correspond to the definitions and code lists used in the
FMECA of Step 6.
TOP Events
Experience has shown that we can significantly reduce the workload of the
FMECA by introducing so-called TOP events as a basis for the analysis. The idea is
that for each failure mode in the FMECA, a TOP event is specified as the
consequence of the failure mode. A number of failure modes will typically lead to
the same TOP event. A consequence analysis is then carried out for each TOP event
to identify the end consequences of that particular TOP event, covering all consequence classes (e.g., safety, availability/punctuality, environmental aspects). For
many plants, risk analyses (or safety cases) have been carried out as part of the
design process. These may sometimes be used as a basis for the consequence
analysis.
Figure 4.4 shows a conceptual model of this approach for a railway application,
where the part to the left of the TOP event is treated in the FMECA, and the
part to the right is treated as generic, i.e., analyzed only once for each TOP event.
[Figure 4.4 (event tree sketch): a failure cause (e.g., burn-out bulb) gives an initiating event that may develop, through barriers, into the TOP event "Train collision", with end consequences C1-C6 determined by consequence-reducing barriers.]
In the rectangle (dashed line) on the left-hand side of Figure 4.4, an initiating
event and a barrier are illustrated. To analyze this rectangle we need reliability
parameters, such as the MTTF, aging parameter, and PF interval, that are included
in the FMECA worksheet (e.g., see Rausand and Høyland 2004). Three situations
are considered:
Other barriers in Figure 4.4 can prevent the component failure from
developing into a critical TOP event. Track circuit detection may be a barrier
against rail breakage, because the track circuit can detect a broken rail. Typical
examples of TOP events in railway applications are:
Train derailment
Collision train-train
Collision train-object
Fire
Persons injured or killed in or at the track
Persons injured or killed at level crossings
Passengers injured or killed at platforms
The end consequences related to human injuries/fatalities may be classified in
six classes:
Minor injury
Medical treatment
Permanent injury
1 fatality
2-10 fatalities
>10 fatalities
Note that the consequence reducing barriers and the end consequences are not
analyzed explicitly during the FMECA, but treated as generic for each TOP event.
In the railway situation this means only six analyses of the safety consequences
related to human injuries/fatalities.
In the following, a list of fields (columns) for the FMECA worksheets is
proposed. The structure of the FMECA is hierarchical, but the information is
usually presented in a tabular worksheet. The starting point in the FMECA is the
functional failures from the FFA in Step 3. Each maintainable item is analyzed
with respect to any impact on the various functional failures. In the following we
describe the various columns:
Failure mode (equipment class level). The first column in the FMECA
worksheet is the failure mode at the equipment class level identified in the
FFA in Step 3.
Maintenance significant item (MSI). The relevant MSI were identified in
the FFA.
MSI function. For each MSI, the functions of the MSI related to the current
equipment class failure mode are identified.
Failure mode (MSI level). For the MSI functions we also identify the failure
modes at the MSI level.
Detection method. The detection method column describes how the MSI
failure mode may be detected, e.g., by visual inspection, condition monitoring, or by the central train control system (for railway applications).
Hidden or evident. Specify whether the MSI function is hidden or evident.
Demand rate for hidden function, fD. For MSI functions that are hidden, the
rate of demand of this function should be specified.
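To see why the demand rate matters for a hidden function, a common first-order model (an assumption of this sketch, not a formula given in the chapter) combines the failure rate, the test interval, and the demand rate fD into an approximate rate of dangerous events:

```python
def mean_pfd(failure_rate, test_interval):
    """First-order mean probability of failure on demand for a hidden
    function with constant failure rate, proof-tested every `test_interval`
    hours (valid when failure_rate * test_interval << 1)."""
    return failure_rate * test_interval / 2

def dangerous_event_rate(failure_rate, test_interval, demand_rate):
    """Approximate rate (per hour) of demands arriving while the hidden
    function is in an undetected failed state."""
    return demand_rate * mean_pfd(failure_rate, test_interval)

# Hypothetical numbers: failures at 1e-5 per hour, yearly testing (8760 h),
# one demand per month (roughly every 730 hours)
print(dangerous_event_rate(1e-5, 8760, 1 / 730))  # about 6e-5 per hour
```

The same quantities feed directly into the criticality indexes discussed below, which is why the worksheet records fD alongside the MTTF.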
Failure cause. For each failure mode there are one or more failure
causes. A failure mode will typically be caused by one or more component
failures at a lower level. Note that supporting equipment to the component
is considered for the first time in this step. In this context a failure cause
may therefore be a failure mode of supporting equipment.
Failure mechanism. For each failure cause, there is one or several failure
mechanisms. Examples of failure mechanisms are fatigue, corrosion, and
wear. To simplify the analysis, the columns for failure cause and failure
mechanism are often merged into one column.
Mean time to failure (MTTF). The MTTF when no maintenance is performed should be specified. The MTTF is specified for one component if it
is a point object, and for a standardized distance if it is a line object
such as rails, sleepers, and so on.
TOP event safety. The TOP event in this context is the accidental event that
might be the result of the failure mode. The TOP event is chosen from a
predefined list established in the generic analysis.
Barrier against TOP event safety. This field is used to list barriers that are
designed to prevent a failure mode from resulting in the safety TOP event.
For example, brands on the signalling pole would help the locomotive
driver to recognize the signal in case of a dark lamp.
PTE-S. This field is used to assess the probability that the other barriers
against the TOP event all fail; see Figure 4.4. PTE-S should account for all the
barriers listed under Barrier against TOP event safety.
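If the remaining barriers can be assumed to fail independently (a simplifying assumption made only for this illustration; the chapter does not prescribe how PTE-S is computed), PTE-S is the product of the individual barrier failure probabilities:

```python
from math import prod

def p_top_event(barrier_failure_probs):
    """Probability that all remaining barriers against the TOP event fail,
    assuming the barriers fail independently of each other."""
    return prod(barrier_failure_probs)

# Hypothetical numbers: the driver misses the dark signal 1 time in 10,
# and an automatic train stop fails on demand 1 time in 100
print(p_top_event([0.1, 0.01]))  # about 0.001
```

With dependent barriers, e.g., a common cause affecting several of them, this product would typically understate PTE-S, so the assessed value should then be adjusted accordingly.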
TOP event availability/punctuality. Also for this dimension a predefined list
of TOP events may be established in the generic analysis.
Barrier against TOP event availability/punctuality. This field is used to list
barriers that are designed to prevent a failure mode from resulting in an
availability/punctuality TOP event. Since the fail safe principle is fundamental
in railway operation, there are usually no barriers against the punctuality TOP
event when a component fails. An example of a barrier is a two out of three
voting system on some critical components within the system.
PTE-P. This field is used to assess the probability that the other barriers
against an availability/punctuality TOP event all fail. PTE-P should account
for all the barriers listed under Barrier against TOP event availability/
punctuality. Due to the fail safe principle, PTE-P will often be equal to one.
Other consequences. Other consequences may also be listed. Some of these
are non-quantitative like noise effects, passenger comfort, and aesthetics.
Material damage to rolling stock or components in the infrastructure may
also be listed. Material damage may be categorized in terms of monetary
value, but this is not pursued here.
Mean downtime (MDT). The MDT is the time from the occurrence of a failure
until the failure has been corrected and any traffic restrictions have been
removed.
Criticality indexes. Based on the information already entered, different
criticality indexes can be calculated. These indexes are used to screen out
non-significant MSIs.
Failure progression. For each failure cause the failure progression should
be described in terms of one of the following categories: (i) gradual
observable failure progression, (ii) non-observable failure progression
followed by a fast observable progression (PF model), (iii) non-observable
failure progression but with aging effects, and (iv) shock type failures.
Gradual failure information. If there is gradual failure progression,
information about what value of the measurable quantity represents a fault
state should be recorded, together with the expected time, and its standard
deviation, to reach this state.
PF-interval information. In the case of observable failure progression the PF
model is often applied (e.g., see Rausand and Høyland 2004, p. 394). The
PF concept assumes that a potential failure (P) can be observed some time
before the failure (F) occurs. This time interval is denoted the PF interval
(e.g., see Rausand and Høyland 2004). We need information on both the
expected value and the standard deviation of the PF interval.
Aging parameter. For non-observable failure progression aging effects
should be described. Relevant categories are strong, moderate or low aging
effects. The aging parameter can alternatively be described by a numeric
value, i.e., the shape parameter in the Weibull distribution.
Maintenance task. The maintenance task is determined by the RCM logic
discussed in Step 7.
Maintenance interval. Often we start by recording the existing maintenance
interval, but after the formalized process of interval optimization in Step
8 we enter the optimized interval.
A preventive maintenance task may serve to prevent a failure, detect the
onset of a failure, or reveal a hidden failure.
Example FMECA worksheet entries for a light signal:

MSI | Function | Failure mode | Failure cause | TOP event | Safety barriers | PTE-S | Punctuality barrier
Lamp | Give light | No light | Burnt-out filament | Collision train-train | Directional block, ATP, TCC, Black=red | 3 x 10 | Manual train operation
Lens | Protect lamp | Broken lens | Rock fall | Collision train-train | Directional block, ATP, TCC, Black=red | 2 x 10^-5 | None
Slip-through light | … | No light slipping through | Fouling | Collision train-train | Directional block, ATP, TCC, Black=red | 2 x 10^-4 | None
The failure mechanisms behind each of the dominant failure modes should be
entered into the RCM decision logic to decide which of the following basic
maintenance tasks is most applicable:
1. Continuous on-condition task (CCT)
2. Scheduled on-condition task (SCT)
3. Scheduled overhaul (SOH)
4. Scheduled replacement (SRP)
5. Scheduled function test (SFT)
6. Run to failure (RTF)
The man-hour cost of inspection is often larger than the cost of installing a
sensor.
Since the scheduled inspection is carried out at fixed points of time, one
might miss situations where the degradation is faster than anticipated.
A scheduled overhaul task is applicable only under the following circumstances:
1. There must be an identifiable age at which the item shows a rapid increase
in the item's failure rate function.
2. A large proportion of the units must survive to that age.
3. It must be possible to restore the original failure resistance of the item by
reworking it.
Scheduled replacement (SRP) is scheduled discard of an item (or one of its parts)
at or before some specified age limit. A scheduled replacement task is applicable
only under the following circumstances:
1.
2.
A scheduled function test is applicable only under the following circumstances:
1. The item must be subject to a functional failure that is not evident to the
operating crew during the performance of normal duties.
2. The item must be one for which no other type of task is applicable and
effective.
Run to failure (RTF) is a deliberate decision to run to failure because the other
tasks are not possible or the economics are less favourable.
[Figure: RCM decision logic. Does a failure-alerting measurable indicator
exist? If yes: is continuous monitoring feasible? If so, a continuous
on-condition task (CCT) is chosen; if not, a scheduled on-condition task
(SCT). If no indicator exists: is the aging parameter > 1? If so, a scheduled
overhaul (SOH) is chosen when overhaul is feasible, otherwise a scheduled
replacement (SRP). If not: is the function hidden? If so, a scheduled function
test (SFT) is chosen; otherwise no PM activity is found (RTF).]
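The decision logic above can be written as a small function. The following is a minimal sketch, assuming the answers to the diagram's questions are supplied as inputs; the function and argument names are illustrative, not from the chapter:

```python
def rcm_task(indicator_exists, continuous_monitoring_feasible,
             aging_parameter, overhaul_feasible, function_hidden):
    """Walk the RCM decision logic and return the basic maintenance task."""
    if indicator_exists:
        # A failure-alerting measurable indicator exists: on-condition task
        return "CCT" if continuous_monitoring_feasible else "SCT"
    if aging_parameter > 1:
        # Aging failures: overhaul if feasible, otherwise scheduled replacement
        return "SOH" if overhaul_feasible else "SRP"
    # No indicator and no aging: test hidden functions, else run to failure
    return "SFT" if function_hidden else "RTF"

print(rcm_task(True, False, 1.5, True, False))  # prints "SCT"
```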
For a maintenance task to be assigned, two requirements must be fulfilled:
it must be applicable, and it must be effective.
Task analysis, e.g., see Kirwan and Ainsworth (1992), may be used to reveal
the risk involved with each maintenance job. See Hoch (1990) for further discussion on implementing the RCM analysis results.
4.2.12 Step 12: In-service Data Collection and Updating
The reliability data we have access to at the outset of the analysis may be scarce, or
even almost non-existent. In our opinion, one of the most significant advantages of RCM
is that we systematically analyze and document the basis for our initial decisions
and, hence, can better utilize operating experience to adjust that decision as
operating experience data is collected. The full benefit of RCM is therefore only
achieved when operation and maintenance experience is fed back into the analysis
process.
The updating process should be concentrated on three major time perspectives:
1. Short term
2. Medium term
3. Long term
For each significant failure that occurs in the system, the failure characteristics
should be compared with the FMECA. If the failure was not covered adequately in
the FMECA, the relevant part of the RCM analysis should, if necessary, be revised.
The short-term update can be considered as a revision of previous analysis
results. The input to such an analysis is updated reliability figures either due to more
data, or updated data because of reliability trends. This analysis should not require
excessive resources, since the framework for the analysis is already established. Only
Steps 5 and 8 in the RCM process will be affected by short-term updates.
The medium term update will also review the basis for the selection of
maintenance actions in Step 7. Analysis of maintenance experience may identify
significant failure causes not considered in the initial analysis, requiring an updated
FMECA in Step 6.
The long-term revision will consider all steps in the analysis. It is not sufficient
to consider only the system being analyzed; it is required to consider the entire
plant with its relations to the outside world, e.g., contractual considerations, new
laws regulating environmental protection, and so on.
2.
3.
4.
5.
A railway point is a railway switch that allows a train to go from one track to another.
A railway point is called a turnout in American English.
consider all parameters that are involved in the optimization model (see
Section 4.4).
6. Re-run the optimization procedure. Based on the new local parameters
we next re-run the optimization procedure to adjust maintenance intervals
taking local differences into account. To carry out this process we need a
computerized tool to streamline the work.
7. Document the results. The results from the local analysis are stored in a
local RCM database. This is a database where only the adjustment factors
are documented, for example, for railway points A, B, C, and D on line Y
the MTTF is 30% higher than the average. Hence the maintenance interval
is also adjusted accordingly.
The interpretation of the effective failure rate is not straightforward for hidden
functions. For such functions we also need to specify the rate at which the hidden
function is demanded. In this situation we may approximate the effective failure
rate by the product of the demand rate and the probability of failure on demand
(PFD) for the hidden function.
In the following we indicate models that may be used for modelling the
effective failure rate, and we refer to the literature for details. The aim of OptiRCM
has been:
Only the Weibull distribution is used to model aging failures in OptiRCM. There
may, of course, be situations where another distribution would be more realistic,
but our experience is that the user of such a tool rarely has data or insight that helps
him to do better than applying the Weibull model.
4.4.1.1 Effective Failure Rate in the Situation of Aging
A standard block replacement policy is considered where an aging component is
periodically replaced after intervals of length τ. Upon a failure in one interval, the
component is replaced without affecting the next planned replacement. The
effective failure rate, i.e., the average number of failures per time unit, is then given
by λE(τ) = W(τ)/τ, where W(τ) is the renewal function (e.g., see Rausand and
Høyland 2004). Approximation formulas for the effective failure rate exist if we
assume Weibull distributed failure times (e.g., see Chang et al. 2006). OptiRCM
uses the renewal equation to establish an iterative scheme for the effective failure
rate based on an initial approximation.
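A discretized version of such a scheme can be sketched in a few lines. This is a minimal illustration (not the OptiRCM implementation): it solves the renewal equation W(t) = F(t) + ∫0t W(t − u) dF(u) on a grid for Weibull distributed failure times; the shape `beta` and scale `eta` parameter names are our own:

```python
import math

def weibull_cdf(t, beta, eta):
    """Weibull cumulative distribution function."""
    return 1.0 - math.exp(-((t / eta) ** beta)) if t > 0 else 0.0

def effective_failure_rate(tau, beta, eta, n=400):
    """Effective failure rate W(tau)/tau, with the renewal function W obtained
    from the discretized renewal equation W(t) = F(t) + sum W(t-u) dF(u)."""
    h = tau / n
    F = [weibull_cdf(k * h, beta, eta) for k in range(n + 1)]
    W = [0.0] * (n + 1)
    for k in range(1, n + 1):
        W[k] = F[k] + sum(W[k - j] * (F[j] - F[j - 1]) for j in range(1, k + 1))
    return W[n] / tau

# Sanity check: with no aging (beta = 1, so MTTF = eta) the rate is 1/MTTF
print(round(effective_failure_rate(5.0, 1.0, 10.0), 3))  # prints 0.1
```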
4.4.1.2 Effective Failure Rate in the Situation of Gradual Observable Failure
Progression
The assumption behind this situation is that the failure progression, say Y(t), can
be observed as a function of time. In the simplest situation Y(t) is one-dimensional,
whereas in more complex situations Y(t) may be multidimensional.
We may also have situations where Y(t) denotes some kind of a signal where, for
example, the fast Fourier transform of the signal is available. In OptiRCM a very
simple situation is considered, where Y(t) is monotonically increasing. As Y(t)
increases, the probability of failure also increases, and at a predefined level
(maintenance limit), say l, the component is replaced or overhauled. The effective
failure rate, λE(τ, l), is now a function of both the inspection interval and the
maintenance limit. In OptiRCM a Markov chain model is used to model the failure
progression (e.g., see Welte et al. 2006 for details of the Markov chain modelling,
and also an extension where it is possible to reduce the inspection intervals as we
approach the maintenance limit). In the Markov chain model it is easy to treat the
situation where Y(t) is a nonlinear function of time. If we restrict ourselves to
linear failure progression, continuous models such as the Wiener and gamma processes
may also be used.
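The Markov chain idea can be illustrated with a very small discrete-time sketch; the five-state structure, transition probability, and all numbers below are illustrative assumptions of ours, not the model of Welte et al. (2006):

```python
def state_distribution(p, steps, start=0, n_states=5):
    """Distribution over degradation states 0..n_states-1 after `steps` time
    steps; the last state is the absorbing failure state and p is the per-step
    probability of degrading one level."""
    dist = [0.0] * n_states
    dist[start] = 1.0
    for _ in range(steps):
        new = [0.0] * n_states
        for s, mass in enumerate(dist):
            if s == n_states - 1:
                new[s] += mass                # failure state is absorbing
            else:
                new[s] += mass * (1 - p)      # stay at current level
                new[s + 1] += mass * p        # degrade one level
        dist = new
    return dist

# Probability that the failure state is reached within 20 time steps
dist = state_distribution(p=0.1, steps=20)
print(round(dist[-1], 3))  # prints 0.133
```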
4.4.1.3 Effective Failure Rate in the PF Model
The assumption behind the PF model is that failure progression is not observable for
a rather long time, and then at some point of time we have a rather fast failure
progression. This is the typical situation for cracks (potential failures) that can be
initiated after a large number of load cycles. The cracks may develop rather fast, and
it is important to detect the cracks before they develop into breakages. The time from
when a crack is observable until a failure (breakage) occurs is denoted the PF-interval.
The important reliability parameters are the rate of potential failures, the mean and
standard deviation of the PF-interval, and the coverage of the inspection method. The
model implemented in OptiRCM for the PF situation is described in Vatn and Svee
(2002). See also Castanier and Rausand (2006) for a similar approach, and the more
general application of delay time models (Christer and Waller 1984).
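The interplay of these parameters can be explored with a small Monte Carlo sketch; the uniform crack-onset assumption, the normal PF-interval, and all numbers are our own illustrative choices, not the model of Vatn and Svee (2002):

```python
import random

def detection_probability(tau, pf_mean, pf_sd, coverage, n=100_000, seed=1):
    """Monte Carlo estimate of the probability that a potential failure (P) is
    detected by inspections every tau time units before the failure (F)."""
    random.seed(seed)
    detected = 0
    for _ in range(n):
        pf = max(0.0, random.gauss(pf_mean, pf_sd))  # length of the PF interval
        t = random.uniform(0.0, tau)  # time from P to the next inspection
        while t < pf:                 # inspect while the defect is observable
            if random.random() < coverage:
                detected += 1
                break
            t += tau
    return detected / n

# Shorter inspection intervals give a higher detection probability
print(detection_probability(30, 60, 15, 0.9) >
      detection_probability(90, 60, 15, 0.9))  # prints True
```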
4.4.2 System Model
Figure 4.4 shows a simplified model of the risk picture related to the component
failure being analyzed. In order to quantify the risk related to safety, we need the
following input data:
Table 4.2. PLL contribution (PLLj) and monetary value for each consequence class

Consequence class | PLLj (PLL contribution) | Monetary value
C1 | 0.01 | 2,000
C2 | 0.05 | 30,000
C3 | 0.1 | 300,000
C4: 1 fatality | 0.7 | 1,600,000
C5 | 4.5 | 13,000,000
C6 | 30 | 160,000,000
(4.1)
where PCj is the probability that the TOP event results in consequence class Cj.
We will later indicate how we can model Equation 4.1 as a function of the
maintenance interval, τ.
In some situations we also assign a cost, and/or a PLL (potential loss of life)
contribution to the various cost elements. PLL denotes the annual, statistically
expected number of fatalities in a specified population. Proposed values adopted by
the Norwegian National Rail Administration are given in Table 4.2. Please see
discussion by Vatn (1998) regarding what it means to assign monetary values to
safety.
The total PLL contribution related to the component failure being analyzed is
then

PLL = PTE-S Σ_(j=1)^6 (PCj PLLj) λE(τ)   (4.2)

With the monetary values, Cj, of Table 4.2, the corresponding safety cost per
time unit is

CS(τ) = PTE-S Σ_(j=1)^6 (PCj Cj) λE(τ)   (4.3)

and the punctuality cost per time unit is

CP(τ) = PTE-P CP(TOP) λE(τ)   (4.4)

where CP(TOP) is the cost of the punctuality TOP event.
This procedure may, if required, be repeated for other dimensions like environment, material damage, and so on.
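As a numeric illustration, Equation 4.2 can be evaluated with the PLLj values of Table 4.2 and, for instance, the derailment row of Table 4.3; the values of PTE-S and of the effective failure rate below are made-up example numbers:

```python
# PLL contribution (Equation 4.2); PTE_S and lambda_e are assumed examples
PLL_j = [0.01, 0.05, 0.1, 0.7, 4.5, 30]    # Table 4.2, classes C1..C6
PC_j = [0.1, 0.1, 0.1, 0.1, 0.05, 0.01]    # Table 4.3, derailment row
PTE_S = 2e-5       # assumed probability that all safety barriers fail
lambda_e = 0.5     # assumed effective failure rate (failures per year)

pll = PTE_S * sum(pc * pl for pc, pl in zip(PC_j, PLL_j)) * lambda_e
print(f"{pll:.2e}")  # prints 6.11e-06 (expected fatalities per year)
```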
4.4.3 Total Cost and Interval Optimization
The approach to interval optimization is based on minimizing the total cost related
to safety, punctuality, availability, material damage, etc. Within an ALARP regime
(e.g., see Vatn 1998) this requires that the risk is not unacceptable. Assuming that
risk is acceptable, we proceed by calculating the total cost per time unit:
C(τ) = CS(τ) + CP(τ) + CPM(τ) + CCM(τ)   (4.5)
where CS(τ) and CP(τ) are given by Equations 4.3 and 4.4, respectively. Further,

CPM(τ) = PM cost / τ   (4.6)
where PM Cost is the cost per preventive maintenance activity. Note that for
condition-based tasks we distinguish between the cost of monitoring the item, and
the cost of physically improving the item by some restoration or renewal activity.
This complicates Equation 4.6 slightly because we have to calculate the average
number of renewals.
Further, if CM cost is the cost of a corrective maintenance activity, we have

CCM(τ) = CM cost · λE(τ)   (4.7)
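Putting Equations 4.5–4.7 together, the optimal interval can be found by a simple grid search. A minimal sketch with made-up cost figures, where the effective failure rate is approximated by the Weibull-based expression λE(τ) ≈ Γ(1 + 1/α)^α τ^(α−1)/MTTF^α (all parameter values are illustrative assumptions):

```python
import math

def lambda_e(tau, alpha, mttf):
    """Approximate effective failure rate for Weibull failure times."""
    return math.gamma(1 + 1 / alpha) ** alpha * tau ** (alpha - 1) / mttf ** alpha

def total_cost(tau, alpha=3.0, mttf=10.0, risk_cost=5000.0,
               pm_cost=200.0, cm_cost=2000.0):
    """Total cost per time unit (Equation 4.5): risk_cost*rate stands in for
    CS + CP, pm_cost/tau is Equation 4.6, cm_cost*rate is Equation 4.7."""
    rate = lambda_e(tau, alpha, mttf)
    return risk_cost * rate + pm_cost / tau + cm_cost * rate

# Grid search over candidate maintenance intervals
taus = [0.1 * k for k in range(1, 101)]
best = min(taus, key=total_cost)
print(round(best, 1))  # prints 2.7
```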
Table 4.3. Generic probabilities, PCj, of consequence class Cj for the different TOP events

TOP event | PC1 | PC2 | PC3 | PC4 | PC5 | PC6
Derailment | 0.1 | 0.1 | 0.1 | 0.1 | 0.05 | 0.01
Collision train-train | 0.02 | 0.03 | 0.05 | 0.5 | 0.3 | 0.1
Collision train-object | 0.1 | 0.2 | 0.3 | 0.15 | 0.01 | 0.001
Fire | 0.1 | 0.2 | 0.2 | 0.1 | 0.02 | 0.005
… | 0.3 | 0.3 | 0.2 | 0.05 | 0.01 | 0.001
… | 0.1 | 0.2 | 0.3 | 0.3 | 0.09 | 0.01
… | 0.2 | 0.2 | 0.2 | 0.3 | 0.1 | 0.0001
λE(τ) ≈ Γ(1 + 1/α)^α · τ^(α−1) / MTTF^α   (4.8)
The total cost C(τ) in Equation 4.5 can now be found as a function of τ; see
Figure 4.7 for a graphical illustration. The optimum interval is found to be 7.5
million km. The maintenance action is scheduled replacement of the pump; see
Figure 4.5.
4.5 Conclusions
The main parts of the RCM approach that we have described in this chapter are
compatible with common practice and with most of the RCM standards. We are,
however, using a more complex FMECA where we also record data that are
necessary during maintenance interval optimization. The novel parts of our approach
are related to the use of so-called generic RCM analysis and to maintenance interval
optimization. The use of generic RCM analysis will significantly reduce the
workload of a complete RCM analysis. Maintenance optimization is, generally, a
very complex task, and only a brief introduction is presented in this chapter. For
maintenance personnel to be able to use the proposed methods, they need to have
access to simple computerized tools where the mathematically complex methods are
hidden. This was our objective in developing the OptiRCM tool. Maintenance
optimization modules are, more or less, non-existent in standard RCM tools.
OptiRCM is not a replacement for these tools, but rather a supplement. OptiRCM is
still in the development stage, and we are currently trying to implement several new
features into OptiRCM. Among these are additional methods related to maintenance
strategies, and grouping of maintenance tasks.
4.6 References
ABS, (2003) Guide for Survey Based on Reliability-Centered Maintenance. American
Bureau of Shipping, Houston.
ABS, (2004) Guidance Notes on Reliability-Centered Maintenance. American Bureau of
Shipping, Houston.
Blanchard BS, Fabrycky WJ, (1998) Systems Engineering and Analysis, 3rd ed. Prentice
Hall, Englewood Cliffs, NJ.
Blanche KM, Shrivastava AB, (1994) Defining failure of manufacturing machinery and
equipment. Proceedings from the Annual Reliability and Maintainability Symposium,
pp. 69–75.
Castanier B, Rausand M, (2006) Maintenance optimization for subsea oil pipelines.
International Journal of Pressure Vessels and Piping 83:236–243.
Chang KP, (2005) Reliability-centered maintenance for LNG ships. ROSS report 200506,
NTNU, Trondheim, Norway.
Chang KP, Rausand M, Vatn J, (2006) Reliability Assessment of Reliquefaction Systems on
LNG Carriers. Submitted for publication in Reliability Engineering and System Safety.
Cho DI, Parlar M, (1991) A survey of maintenance models for multi-unit systems. European
Journal of Operational Research 51:1–23.
Christer AH, Waller WM, (1984) Delay time models of industrial inspection maintenance
problems. Journal of the Operational Research Society 35:401–406.
DEF-STD 02-45 (NES 45), (2000) Requirements for the application of reliability-centred
maintenance technique to HM ships, submarines, Royal fleet auxiliaries and other naval
auxiliary vessels. Defence Standard, U.K. Ministry of Defence, Bath, England.
Gertsbakh I, (2000) Reliability Theory with Applications to Preventive Maintenance.
Springer, New York.
Hoch R, (1990) A practical application of reliability centered maintenance. The American
Society of Mechanical Engineers, 90-JPGC/Pwr-51, Joint ASME/IEEE Power
Generation Conference, Boston, MA, 21–25 October.
Part C
5 Condition-based Maintenance Modelling
Wenbin Wang
5.1 Introduction
The use of condition monitoring techniques in industry to direct maintenance
actions has increased rapidly over recent years to the extent that it has marked the
beginning of what is likely to prove a new generation in production and maintenance management practice. There are both economic and technological reasons
for this development driven by tight profit margins, high outage costs and an
increase in plant complexity and automation. Technical advances in condition
monitoring techniques have provided a means to achieve high availability and to
reduce scheduled and unscheduled production shutdowns. In all cases, the
measured condition information does, in addition to potentially improving decision
making, have a value-added role for a manager in that there is now a more objective
means of explaining actions if challenged.
In November 1979, the consultants Michael Neal & Associate Ltd published A
Guide to Condition Monitoring of Machinery for the UK Department of Trade and
Industry (Neal et al. 1979). This groundbreaking report illustrated the difference in
maintenance strategies (e.g., breakdown, planned, etc.) and suggested that condition
based maintenance, using a range of techniques, would offer significant benefits to
industry. By the late 1990s condition based maintenance had become widely
accepted as one of the drivers to reduce maintenance costs and increase plant
availability. With the advent of e-procurement, business to business (B2B), customer
to business (C2B), business to customer (B2C) etc., industry is fast moving towards
enterprise wide information systems associated with the internet. Today, plant asset
management is the integration of computerised maintenance management systems
and condition monitoring in order to fulfil the business objectives. This enables
significant production benefits through objective maintenance prediction and
scheduling. This positions the manufacturer to remain competitive in a dynamic
market.
Today there exists a large and growing variety of condition monitoring techniques for machine condition monitoring and fault diagnosis. A particularly popular
one for rotating and reciprocating machinery is vibration analysis. However, irrespective
of the particular condition monitoring technique used, the working principle of
condition monitoring is the same, namely condition data become available which
need to be interpreted and appropriate actions taken accordingly. There are generally
two stages in condition based maintenance. The first stage is related to condition
monitoring data acquisition and their technical interpretations. There have been
numerous papers contributing to this stage, as evidenced by the proceedings of
COMADEM over recent years. This stage is characterised by engineering skill,
knowledge and experience. Much of the effort at this stage has gone into
determining the appropriate variables to monitor, Chen et al. (1994), the design of
systems for condition monitoring data acquisition, Drake et al. (1995), signal
processing, Wong et al. (2006), Samanta et al. (2006), Harrison (1995), Li and Li
(1995), and how to implement computerised condition monitoring, Meher-Homji et
al. (1994). These are just a few examples, and in them no modelling explicitly enters
into the maintenance decision process based upon the results of condition monitoring. For
detailed technical aspects of condition monitoring and fault diagnosis, see Collacott
(1997). The second stage is maintenance decision making, namely what to do now
given that condition information data and their interpretations are available. The
decision at this stage can be complicated and entails consideration of cost, downtime,
production demand, preventive maintenance shutdown windows, and most importantly, the likely survival time of the item monitored. Compared with the extensive literature on condition monitoring techniques and their applications, relatively
little attention has been paid to the important problem of modelling appropriate
decision making in condition based maintenance.
This chapter focuses on the second stage of condition monitoring, namely
condition based maintenance modeling as an aid to effective decision making. In
particular, we will highlight a modelling technique used recently in condition based
maintenance, e.g. residual life modelling via stochastic filtering (Wang and
Christer 2000). This is a key element in modeling the decision making aspect of
condition based maintenance. The chapter is organised as follows. Section 5.2
gives a brief introduction to condition monitoring techniques. Section 5.3 focuses
on condition based maintenance modelling and discusses various modelling techniques used. Section 5.4 presents the modelling of the residual life conditional on
observed monitoring information using stochastic filtering. Section 5.5 concludes
the chapter with a discussion of topics for future research.
industrial equipment some measurements can be taken and the likely condition of
the plant assessed.
Today there exists a large and growing variety of condition monitoring
techniques for machine condition monitoring and fault diagnosis. Understanding the
nature of each monitoring technique and the type of information measured will certainly
help us when establishing a decision model. Here we briefly introduce five main
techniques; among them, vibration and oil analysis are the two most
popular.
5.2.1 Vibration Based Monitoring
Vibration based monitoring is the mainstream of current applications of condition
monitoring in industry. It is an on-line (or off-line) technique used to detect
system malfunction based on measured vibration signals.
Generally speaking, vibration is the variation with time of the magnitude of a
quantity that is descriptive of the motion or position of a mechanical system, when
the magnitude is alternatively greater than and smaller than some average value or
reference.
Vibration monitoring consists essentially in identifying two quantities: the
magnitude of the vibration and its frequency content.
The magnitude is basically used for establishing the severity of the vibration
and the frequency content for the cause or origin. Vibration velocity has been seen
as the most meaningful magnitude criterion for assessing machine condition,
though displacement or acceleration is also used. The magnitude of vibration is
usually measured in root mean square (rms). If T denotes the period of vibration
and V (t ) is the vibration (say, velocity) measured at time t, then
Vrms = √( (1/T) ∫_0^T (V(t))² dt ),
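For a digitally sampled signal the integral is replaced by a sum. A minimal sketch (the rectangular-rule discretization is our choice):

```python
import math

def vrms(samples, dt):
    """Root mean square of a signal sampled every dt time units over the
    period T = len(samples) * dt (rectangular-rule integration)."""
    T = len(samples) * dt
    return math.sqrt(sum(v * v for v in samples) * dt / T)

# A pure sine of amplitude A over one full period has rms A / sqrt(2)
A, n = 2.0, 1000
samples = [A * math.sin(2 * math.pi * k / n) for k in range(n)]
print(round(vrms(samples, dt=1.0 / n), 3))  # prints 1.414
```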
which causes the abnormal signals, but not vice versa (Wang 2002). This factor
plays an important role when selecting an appropriate model for describing such a
relationship.
5.2.2 Oil Based Monitoring
A detailed analysis of a sample of engine, transmission and hydraulic oils is a
valuable preventive maintenance tool for machines. In many cases it enables the
identification of potential problems before a major repair is necessary, has the
potential to reduce the frequency of oil changes, and increase the resale value of
used equipment.
Oil based monitoring involves sampling and analyzing oil for various properties
and materials to monitor wear and contamination in an engine, transmission or
hydraulic system etc. Sampling and analyzing on a regular basis establishes a
baseline of normal wear and can help indicate when abnormal wear or contamination
is occurring. Oil analysis works as follows. Oil that has been inside any moving
mechanical apparatus for a period of time reflects the possible condition of that
assembly. Oil is in contact with engine or mechanical components as wear metallic
trace particles enter the oil. These particles are so small they remain in suspension.
Many products of the combustion process will also become trapped in the circulating
oil. The oil becomes a working history of the machine. Particles caused by normal
wear and operation will mix with the oil. Any externally caused contamination also
enters the oil. By identifying and measuring these impurities, one can get an
indication of the rate of wear and of any excessive contamination. An oil analysis
also will suggest methods to reduce accelerated wear and contamination.
The typical oil analysis tests for the presence of a number of different materials
to determine sources of wear, find dirt and other contamination, and even check for
the use of appropriate lubricants. Today there exists a variety of forms of oil based
condition monitoring methods and techniques to check the volume and nature of
foreign particles in oil for equipment health monitoring. These include spectrometric oil
analysis, scanning electron microscopy/energy-dispersive X-ray analysis, energy-dispersive
X-ray fluorescence, low-powered optical microscopy, and ferrous debris
quantification. One purpose of the oil analysis is to provide a means of predicting
possible impending failure without dismantling the equipment. One can look
inside an engine, transmission or hydraulic system without taking it apart.
For oil based monitoring there is no such clear cut distinction between normal
and abnormal operation based on observed particle information in the oil samples.
The foreign particles that accumulate in the lubricant oil increase monotonically,
so we may not be able to see a two-stage failure process as seen in vibration
based monitoring. The causal relationship between the measured amount of particles
in the oil and the state of the plant may also be bilateral in that, for example,
wear may cause an increase of observed metals in the oil, but the metals and
other contaminants in the oil may also accelerate the wear. This marks a difference
when modelling the state of the plant in oil based monitoring compared to vibration
based monitoring.
is only carried out when it becomes necessary, utilizing available condition
information. But in reality, all too often we see effort and money spent on monitoring
equipment for faults which rarely occur, and we also see planned maintenance
being carried out when the equipment is perfectly healthy even though the monitored
information indicates something is wrong. A study of oil based condition monitoring
of gear boxes of locomotives used by Canadian Pacific Railway (Aghjagan
1989) indicated that since condition monitoring was commissioned (entailing 3–4
samples per locomotive per week, 52 weeks per year), the incidence of gear box
failures while in use fell by 90%. This is a significant achievement. However, when
the gear boxes were subsequently stripped down for reconditioning/overhaul, there was
nothing evidently wrong in 50% of cases. Clearly, condition monitoring can be highly
effective, but may also be very inefficient at the same time. Modelling is necessary to
improve the cost effectiveness and efficiency of condition monitoring.
5.3.1 The Decision Model
This is an extension to the age-based replacement model in that the replacement
decision will be made dependent not only upon the age, but also upon the
monitored information, plus other cost or downtime parameters. If we take the cost
model as an example, then the decision model amounts to minimising the long run
expected cost per unit time. We use the following notation:
cf : the mean cost per failure
cp : the mean cost per preventive replacement
cm : the mean cost per condition monitoring check
The long term expected cost per unit time, C(t), given that a preventive
replacement is scheduled at time t > ti, is given by (Wang 2003)

C(t) = [ (cf − cp) P(t − ti | ℑi) + cp + i·cm ] / [ ti + (t − ti)(1 − P(t − ti | ℑi)) + ∫_0^(t−ti) xi pi(xi | ℑi) dxi ]   (5.1)

where ℑi denotes the monitoring history obtained up to time ti, Xi the residual
life at ti, and P(t − ti | ℑi) = P(Xi < t − ti | ℑi) = ∫_0^(t−ti) pi(xi | ℑi) dxi is the probability
of a failure before t conditional on ℑi. The right hand side of Equation 5.1 is the
expected cost per unit time formulated as a renewal reward function, though the
lifetimes are independent but not identical.
117
The time point t is usually bounded within the time period from the current to
the next monitoring since a new decision shall be made once a new monitoring
reading becomes available at time ti +1 .
In general, if a minimum of C (t ) is found within the interval to the next
monitoring in terms of t , then this t should be the optimal replacement time. If no
minimum is found, then the recommendation would be to continue to use the plant
and evaluate Equation 5.1 at the next monitoring point when new information
becomes available. For a graphical illustration of the above principle see Figure 5.1.
[Figure 5.1. The expected cost C(t) plotted from the current time ti; if an
interior minimum exists it identifies the recommended replacement time t*,
otherwise no replacement is recommended before the next monitoring point.]
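This decision rule is easy to sketch; the quadratic cost function below is a made-up stand-in for the C(t) of Equation 5.1:

```python
def recommend(cost, t_i, t_next, steps=200):
    """Evaluate the cost rate C(t) on a grid over (t_i, t_next] and return a
    recommended replacement time if an interior minimum exists, else None
    (meaning: wait and re-evaluate at the next monitoring point)."""
    h = (t_next - t_i) / steps
    ts = [t_i + h * k for k in range(1, steps + 1)]
    cs = [cost(t) for t in ts]
    k = min(range(len(cs)), key=cs.__getitem__)
    return ts[k] if 0 < k < len(cs) - 1 else None

# Hypothetical convex cost rate with its minimum inside the interval
best_t = recommend(lambda t: (t - 12.0) ** 2 + 3.0, t_i=10.0, t_next=15.0)
print(round(best_t, 2))  # prints 12.0
```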
There are two problems with proportional hazards modeling or accelerated life
models in condition based maintenance. The first is that the current hazard is
determined partially by the current monitoring measurements and the full
monitoring history is not used. The second is the assumption that the hazard or the
life is a function of the observed monitoring data which acts directly on the hazard
via a covariate function. Both problems relate to the modeling assumption rather
than the technique. The first can be overcome if some sort of transformation of the
observed data is used. The second problem remains unless the nature of the
monitoring indicates such a direct relationship. It is noted, however, that for
most condition monitoring techniques,
the observed monitoring measurements are concomitant types of information
which are a function of the underlying plant state. A typical example is in vibration
monitoring where a high level of vibration is usually caused by a hidden defect but
not vice versa as we have discussed earlier. In this case the observed vibration
signals may be regarded as concomitant variables which are caused by the plant
state. Note that in oil based monitoring things are different as the metal particles
and other contaminants observed in the oil can be regarded both as concomitant
If the severity of the defect is represented by the length of the residual life, the
relationship between the residual life and observed condition related variables
follows.
Xi = Xi−1 − (ti − ti−1)  if Xi−1 > ti − ti−1;  not defined otherwise.   (5.2)
[Figure: the residual lives x1, x2, x3 at successive monitoring times t1, t2, t3,
with monitored readings y1, y2, y3 reflecting the unobserved underlying state;
failure occurs when the threshold level is reached.]
pi(xi | ℑi) = p(xi | yi, ℑi−1) = p(xi, yi | ℑi−1) / p(yi | ℑi−1)   (5.3)

(5.4)

(5.5)

p(yi | ℑi−1) = ∫ p(xi, yi | ℑi−1) dxi = ∫ p(yi | xi) p(xi | ℑi−1) dxi   (5.6)
p(x_i | ℑ_{i-1}) = p_{i-1}(g(x_i) | ℑ_{i-1}, X_{i-1} > t_i - t_{i-1}) dg(x_i)/dx_i   (5.7)

where g(x_i) = x_i + t_i - t_{i-1}. Since dg(x_i)/dx_i = 1 and

p_{i-1}(g(x_i) | ℑ_{i-1}, X_{i-1} > t_i - t_{i-1}) = p_{i-1}(g(x_i) | ℑ_{i-1}) / ∫_{t_i - t_{i-1}}^∞ p_{i-1}(x_{i-1} | ℑ_{i-1}) dx_{i-1},   (5.8)

we finally have

p(x_i | ℑ_{i-1}) = p_{i-1}(x_i + t_i - t_{i-1} | ℑ_{i-1}) / ∫_{t_i - t_{i-1}}^∞ p_{i-1}(x_{i-1} | ℑ_{i-1}) dx_{i-1}   (5.9)
Substituting Equations 5.5 and 5.9 into Equation 5.3, the normalizing integrals cancel and we obtain

p_i(x_i | ℑ_i) = p(y_i | x_i) p_{i-1}(x_i + t_i - t_{i-1} | ℑ_{i-1}) / ∫_0^∞ p(y_i | x_i) p_{i-1}(x_i + t_i - t_{i-1} | ℑ_{i-1}) dx_i   (5.10)

and, for the first monitoring point,

p_1(x_1 | ℑ_1) = p(y_1 | x_1) p_0(x_1 + t_1 - t_0 | ℑ_0) / ∫_0^∞ p(y_1 | x_1) p_0(x_1 + t_1 - t_0 | ℑ_0) dx_1   (5.11)
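The recursion of Equations 5.10 and 5.11 can be sketched numerically on a discretized grid. Everything below (the Weibull initial distribution, the observation model in the form of Equation 5.12, and all parameter values) is an illustrative assumption, not the chapter's fitted model:

```python
import math

# Illustrative parameters (assumed, not the chapter's fitted values)
ALPHA, BETA = 100.0, 2.0             # Weibull scale/shape of the initial residual life
A, B, C, ETA = 7.0, 27.0, 0.05, 4.5  # observation-model constants

dx = 0.5
grid = [dx * (j + 0.5) for j in range(600)]   # candidate residual lives 0..300 h

def weibull_pdf(x, scale, shape):
    return (shape / scale) * (x / scale) ** (shape - 1) * math.exp(-(x / scale) ** shape)

def p_y_given_x(y, x):
    """Reading y given residual life x: a Weibull whose scale A + B*exp(-C*x)
    shrinks as the residual life shortens (the floating-scale idea of Eq. 5.12)."""
    return weibull_pdf(y, A + B * math.exp(-C * x), ETA)

# p_0(x_0): density of the initial residual life on the grid
density = [weibull_pdf(x, ALPHA, BETA) for x in grid]

# Equation 5.10 at each monitoring point: shift the density by the elapsed
# time (survival), weight by the likelihood of the new reading, renormalize.
for dt, y in [(10.0, 8.0), (10.0, 12.0), (10.0, 20.0)]:
    shift = round(dt / dx)
    shifted = density[shift:] + [0.0] * shift          # p_{i-1}(x + t_i - t_{i-1})
    unnorm = [p_y_given_x(y, x) * s for x, s in zip(grid, shifted)]
    total = sum(unnorm) * dx
    density = [u / total for u in unnorm]

mean_residual = sum(x * d for x, d in zip(grid, density)) * dx
print(round(sum(density) * dx, 6), round(mean_residual, 1))
```

Each pass shifts the density to account for survival over the interval and reweights it by the likelihood of the new reading, which is exactly the structure of Equation 5.10.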
p(y_i | x_i) = (η / (A + B e^{-C x_i})) (y_i / (A + B e^{-C x_i}))^{η-1} exp{-(y_i / (A + B e^{-C x_i}))^η}   (5.12)
This is a concept called floating scale parameter, which is particularly useful in this
case (Wang 2002). There are other choices for modelling the relationship between y_i
and x_i, but these will not be discussed here; they can be found in Wang (2006a).
5.4.3 Estimating the Model Parameters Within p_i(x_i | ℑ_i)
To calculate the actual p_i(x_i | ℑ_i) we need to know the values of the model
parameters, namely the parameters of p_0(x_0) and p(y_i | x_i). The most popular
way to estimate them is the method of maximum likelihood.
At each monitoring point t_i, two pieces of information are available, namely y_i
and the survival event X_{i-1} > t_i - t_{i-1}, both conditional on ℑ_{i-1}. The pdf for
y_i | ℑ_{i-1} is given by Equation 5.6 and the probability of X_{i-1} > t_i - t_{i-1} | ℑ_{i-1} is given by

P(X_{i-1} > t_i - t_{i-1} | ℑ_{i-1}) = ∫_{t_i - t_{i-1}}^∞ p_{i-1}(x_{i-1} | ℑ_{i-1}) dx_{i-1}   (5.13)
If the item monitored failed at time t f after the last monitoring at time t n , the
complete likelihood function is then given by
L(Θ) = { ∏_{i=1}^n p(y_i | ℑ_{i-1}) ∫_{t_i - t_{i-1}}^∞ p_{i-1}(x_{i-1} | ℑ_{i-1}) dx_{i-1} } p_n(t_f - t_n | ℑ_n)   (5.14)

where Θ denotes the set of model parameters.
The initial point of the second stage in these bearings is identified using a
control chart called the Shewhart average level chart and the threshold levels of the
bearings are shown in Table 5.1 (Zhang 2004).
Table 5.1. Threshold level for each bearing

Bearing            1      2      3      4      5      6
Threshold level    5.06   5.62   4.15   5.14   3.92   4.9
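The chapter identifies the start of the second stage with a Shewhart average level chart. The sketch below is a hypothetical illustration of that idea; all readings and limits are invented, not the bearings' data:

```python
# A reading is flagged once it exceeds the upper control limit computed from
# an in-control (healthy-stage) reference sample.
reference = [4.1, 3.9, 4.0, 4.2, 3.8, 4.0, 4.1, 3.9]   # healthy-stage readings
n = len(reference)
mean = sum(reference) / n
sd = (sum((r - mean) ** 2 for r in reference) / (n - 1)) ** 0.5
ucl = mean + 3 * sd                                     # upper control limit

readings = [4.0, 4.1, 3.9, 4.3, 5.2, 5.6, 6.1]          # later monitored values
first_alarm = next((i for i, r in enumerate(readings) if r > ucl), None)
print(round(ucl, 3), first_alarm)
```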
Assuming a Weibull pdf for the initial residual life, p_0(x_0) = αβ(αx_0)^{β-1} e^{-(αx_0)^β} (x_0 > 0), and taking p(y_i | x_i) from Equation 5.12, repeated application of Equation 5.10 gives

p_i(x_i | ℑ_i) = (x_i + t_i)^{β-1} e^{-(α(x_i + t_i))^β} ∏_{k=1}^i κ_k(x_i, t_i) / ∫_0^∞ (z + t_i)^{β-1} e^{-(α(z + t_i))^β} ∏_{k=1}^i κ_k(z, t_i) dz   (5.15)

where

κ_k(z, t_i) = (1 / (A + B e^{-C(z + t_i - t_k)})) (y_k / (A + B e^{-C(z + t_i - t_k)}))^{η-1} e^{-(y_k / (A + B e^{-C(z + t_i - t_k)}))^η}.
Table 5.2. Estimated parameter values

α        β        A        B         C        η
0.011    1.873    7.069    27.089    0.053    4.559
Based on the estimated parameter values in Table 5.2 and Equation 5.15 the
predicted residual life at some monitoring points given the history information of
bearing 6 in Figure 5.3 is plotted in Figure 5.4.
In Figure 5.4 the actual residual lives at those checking points are also plotted
with the symbol *. It can be seen that the actual residual lives are well within the
predicted residual life distribution, as expected.
Given the estimated values for the parameters and associated costs such as
c_f = 6000, c_p = 2000 and c_m = 30 (Wang and Jia 2001), we have the expected
cost per unit time for one of the bearings at various checking times t, shown in
Figure 5.5.
Figure 5.5. Expected cost per unit time vs. planned replacement time in hours from the
current time t
It can be seen from Figure 5.5 that at t = 116.5 and 129 h planned replacements are recommended within the next 30 h.
To illustrate an alternative decision chart in terms of the actual condition
monitoring reading, we transformed the cost related decision into actual reading in
Figure 5.6 where the dark grey area indicates that if the reading falls within this area
a preventive replacement is required within the planning period of consideration.
The advantage of Figure 5.6 is that it can not only tell us whether a preventive
replacement is needed but also show us how far the reading is from the area of preventive replacement so that appropriate preparation can be done before the actual
replacement.
[Figure 5.6: observed CM reading (0-14) against the checking times 80.5, 92.5, 104, 116.5 and 129 h; the dark grey area marks readings requiring a preventive replacement within the planning period.]
system state. A model which can handle both types of information is ideal, but very
few attempts have been made (Hussin and Wang 2006).
5.7 References
Aghjagan, H.N., (1989) Lubeoil analysis expert system, Canadian Maintenance Engineering Conference, Toronto.
Aven, T., (1996) Condition based replacement policies - a counting process approach, Rel. Eng. & Sys. Safety, 51(3), 275-281.
Banjevic, D., Jardine, A.K.S., Makis, V. and Ennis, M., (2001) A control-limit policy and software for condition based maintenance optimization, INFOR, 39(1), 32-50.
Baruah, P. and Chinnam, R.B., (2005) HMM for diagnostics and prognostics in machining processes, I. J. Prod. Res., 43(6), 1275-1293.
Black, M., Brint, A.T. and Brailsford, J.R., (2005) A semi-Markov approach for modelling asset deterioration, J. Opl. Res. Soc., 56(11), 1241-1249.
Bunks, C., McCarthy, D. and Al-Ani, T., (2000) Condition based maintenance of machines using hidden Markov models, Mech. Sys. & Sig. Pro., 14(4), 597-612.
Chen, D. and Trivedi, K.S., (2005) Optimization for condition based maintenance with semi-Markov decision process, Rel. Eng. & Sys. Safety, 90(1), 25-29.
Chen, W., Meher-Homji, C.B. and Mistree, F., (1994) COMPROMISE: an effective approach for condition-based maintenance management of gas turbines, Engineering Optimization, 22, 185-201.
Christer, A.H., Wang, W. and Sharp, J.M., (1997) A state space condition monitoring model for furnace erosion prediction and replacement, Euro. J. Opl. Res., 101, 1-14.
Christer, A.H. and Wang, W., (1992) A model of condition monitoring inspection of production plant, I. J. Prod. Res., 30, 2199-2211.
Christer, A.H. and Wang, W., (1995) A simple condition monitoring model for a direct monitoring process, E. J. Opl. Res., 82, 258-269.
Collacott, R.A., (1977) Mechanical fault diagnosis and condition monitoring, Chapman and Hall Ltd., London.
Dong, M. and He, D., (2004) Hidden semi-Markov models for machinery health diagnosis and prognosis, Trans. North Amer. Manu. Res. Ins. of SME, 32, 199-206.
Drake, P.R., Jennings, A.D., Grosvenor, R.I. and Whittleton, D., (1995) Data acquisition system for machine tool condition monitoring, Quality and Reliability Engineering International, 11, 15-26.
Freund, J.E., (2004) Mathematical statistics with applications, Pearson Prentice Hall, London.
Harrison, N., (1995) Oil condition monitoring for the railway business, Insight, 37, 278-283.
Hontelez, J.A.M., Burger, H.H. and Wijnmalen, D.J.D., (1996) Optimum condition based maintenance policies for deteriorating systems with partial information, Rel. Eng. & Sys. Safety, 51(3), 267-274.
Hussin, B. and Wang, W., (2006) Conditional residual time modelling using oil analysis: a mixed condition information using accumulated metal concentration and lubricant measurements, to appear in Proc. 1st Main. Eng. Conf., Chengdu, China.
Jardine, A.K.S., Makis, V., Banjevic, D., Braticevic, D. and Ennis, M., (1998) A decision optimization model for condition based maintenance, J. Qua. Main. Eng., 4(2), 115-121.
Jensen, U., (1992) Optimal replacement rules based on different information levels, Naval Res. Log., 39, 937-955.
Kalbfleisch, J.D. and Prentice, R.L., (1980) The Statistical Analysis of Failure Time Data, Wiley, New York.
Kumar, D. and Westberg, U., (1997) Maintenance scheduling under age replacement policy using proportional hazard modelling and total-time-on-test plotting, Euro. J. Opl. Res., 99, 507-515.
Li, C.J. and Li, S.Y., (1995) Acoustic emission analysis for bearing condition monitoring, Wear, 185, 67-74.
Lin, D. and Makis, V., (2003) Recursive filters for a partially observable system subject to random failures, Adv. Appl. Prob., 35(1), 207-227.
Lin, D. and Makis, V., (2004) Filters and parameter estimation for a partially observable system subject to random failures with continuous-range observations, Adv. Appl. Prob., 36(4), 1212-1230.
Love, C.E., Zhang, Z.G., Zitron, M.A. and Guo, R., (2000) A discrete semi-Markov decision model to determine the optimal repair/replacement policy under general repairs, Euro. J. Opl. Res., 125(2), 398-409.
Love, C.E. and Guo, R., (1991) Using proportional hazard modelling in plant maintenance, Quality and Reliability Engineering International, 7, 7-17.
Makis, V. and Jardine, A.K.S., (1991) Computation of optimal policies in replacement models, IMA J. Maths. Appl. Business & Industry, 3, 169-176.
Matthew, C. and Wang, W., (2006) A comparison study of proportional hazard and stochastic filtering when applied to vibration based condition monitoring, submitted to Int. Tran. OR.
Meher-Homji, C.B., Mistree, F. and Karandikar, S., (1994) An approach for the integration of condition monitoring and multi-objective optimization for gas turbine maintenance management, International Journal of Turbo and Jet Engines, 11, 43-51.
Neal, M. and Associates, (1979) Guide to the condition monitoring of machinery, DTI, London.
Reeves, C.W., (1998) The vibration monitoring handbook, Coxmoor Publishing Company, Oxford.
Samanta, B., Al-Balushi, K.R. and Al-Araimi, S.A., (2006) Artificial neural networks and genetic algorithm for bearing fault detection, Soft Computing, 10(3), 264-271.
Wang, W., (2002) A model to predict the residual life of rolling element bearings given monitored condition monitoring information to date, IMA J. Management Mathematics, 13, 3-16.
Wang, W., (2003) Modelling condition monitoring intervals: a hybrid of simulation and analytical approaches, J. Opl. Res. Soc., 54, 273-282.
Wang, W., (2006a) A prognosis model for wear prediction based on oil based monitoring, to appear in J. Opl. Res. Soc.
Wang, W., (2006b) Modelling the probability assessment of the system state using available condition information, to appear in IMA J. Management Mathematics.
Wang, W. and Christer, A.H., (2000) Towards a general condition based maintenance model for a stochastic dynamic system, J. Opl. Res. Soc., 51, 145-155.
Wang, W. and Jia, Y., (2001) A multiple condition information sources based maintenance model and associated prototype software development, Proceedings of COMADEM 2001, Eds. A. Starr and Raj B.K.N. Rao, Elsevier, 889-898.
Wang, W. and Zhang, W., (2005) A model to predict the residual life of aircraft engines based on oil analysis data, Naval Research Logistics, 52, 276-284.
Wong, M.L.D., Jack, L.B. and Nandi, A.K., (2006) Modified self-organising map for automated novelty detection applied to vibration signal monitoring, Mech. Sys. & Sig. Proc., 20(3), 593-610.
Zhang, W., (2004) Stochastic modeling and applications in condition based maintenance, PhD thesis, University of Salford, UK.
6
Maintenance Based on Limited Data
David F. Percy
6.1 Introduction
Reliability applications often suffer from a lack of data with which to make informed maintenance decisions. Indeed, the very purpose of maintenance is to prevent
failures, and hence observed failure data, from arising!
This effect is particularly noticeable for high reliability systems such as aircraft
engines and emergency vehicles, and when new production lines are established or
warranty schemes are planned. The evaluation of such systems is a learning process and knowledge is continually updated as more information becomes available.
Such issues are of great importance when selecting and fitting mathematical
models to improve the accuracy and utility of these decisions.
This chapter investigates why reliability data are so limited, identifies the
problems that this causes and proposes statistical methods for dealing with these
difficulties. In particular, it considers graphical and numerical summaries, appropriate methods for model development and validation, and the powerful approach of
subjective Bayesian analysis for including expert knowledge about the application
area, such as information pertaining to a particular manufacturing process and experience of similar operational systems.
Many reliability problems involve making strategic decisions under risk or uncertainty. Stochastic models involving unknown parameters are often adopted for
this purpose and our concern is how to make inference about, and arising from,
these unknown parameters. The easiest approach involves skilfully guessing the
parameter values by subjective means, which is fine so long as there is sufficient
expert knowledge to perform this task well. More commonly, the parameters are
estimated from observed data and decisions are then made by assuming the
parameters equal to their estimates. This frequentist approach to inference is very
good if there are sufficient data to estimate the parameters well.
However, few data are available in many areas of maintenance and replacement; see Percy et al. (1997) and Kobbacy et al. (1997) for example. There are
several reasons why data are scarce in these situations. New systems and processes
naturally offer scant historical data about their performance and reliability. Poor
and incomplete maintenance records are often kept, as the engineers and managers
do not always appreciate the potential benefits that can be achieved through
quantitative modelling and analysis. Of equal importance, many observations of
failure times tend to be censored due to maintenance interventions.
Typical applications take the form of reliability analysis, such as modelling a
critical system's time to failure, and scheduling problems, such as determining
efficient policies for scheduling capital replacement and preventive maintenance,
all of which are considered elsewhere in this book. Other applications include
determining appropriate thresholds for condition monitoring and specifying
warranty schemes for new products. Under these circumstances, it is important to
allow for the uncertainty about the unknown model parameters. This is readily
achieved by adopting the Bayesian approach to inference, as described by
Bernardo and Smith (2000) and O'Hagan (2004).
The structure for the remainder of this chapter is as follows. Section 6.2
explains the need for Bayesian analysis and Section 6.3 introduces the concepts,
beginning with Bayes' theorem, which is of great importance in its own right.
Section 6.4 discusses the construction of prior and posterior distributions, whilst
Section 6.5 considers the role of predictive distributions and Section 6.6 considers
techniques for setting the hyperparameters of prior distributions. One of the great
strengths of the Bayesian approach, particularly in relation to practical problems in
reliability and maintenance, is its ability to improve the quality of decision
analysis, as described in Section 6.7. Section 6.8 presents a review of the Bayesian
approach to maintenance and Section 6.9 includes specific case studies that demonstrate these methods. Finally, Section 6.10 suggests topics for future research and
possible new applications.
For convenience, there follows a list of symbols and acronyms that are used
throughout this chapter.
P(·) : Probability
E(·) : Expected value
p(·) : Probability mass function
f(·) : Probability density function
R(·) : Reliability function
L(·) : Likelihood function
g(·) : Prior or posterior probability density function
Be(θ) : Bernoulli distribution
Po(λ) : Poisson distribution
Ge(θ) : Geometric distribution
Ex(λ) : Exponential distribution
No(μ, λ) : Normal distribution
Ga(a, b) : Gamma distribution
We(α, β) : Weibull distribution
Figure 6.1. The link between fundamental aspects of maintenance modelling and analysis
R(x | α, β) = exp(-αx^β)   (6.1)

so the likelihood of n right-censored observations x_1, …, x_n is

∏_{i=1}^n R(x_i | α, β) = exp(-α Σ_{i=1}^n x_i^β).   (6.2)

Setting the partial derivatives of the log-likelihood with respect to α and β to zero gives

-Σ_{i=1}^n x_i^β = 0;   (6.3)

-α Σ_{i=1}^n x_i^β log x_i = 0.   (6.4)

These have no finite solutions for α and β, so our analysis has been thwarted by
the lack of uncensored data.
P(B | A) = P(A | B) P(B) / P(A)   (6.5)

where it is sometimes useful to evaluate the probability of A using the law of total
probability

P(A) = P(A | B) P(B) + P(A | B̄) P(B̄)   (6.6)

where the event B̄ is the complement of the event B; that is, the event that B
does not occur. Bayes' theorem can be interpreted as a way of transposing the
conditionality from P(A | B) to P(B | A), or as a way of updating the prior probability P(B) to give the posterior probability P(B | A).
Example 6.2 An aircraft warning light comes on if the landing gear on either
side is faulty. Suppose we know that faults only occur 0.4% of the time, that they
are detected with 99.9% reliability and that false alarms only occur 0.5% of the
time when the landing gear is operational. Defining events W = warning light
comes on and L = landing gear faulty, this information can be summarized as

P(L) = 0.004,   P(W | L) = 0.999,   P(W | L̄) = 0.005.   (6.7)

The law of total probability and Bayes' theorem then give

P(L | W) = P(W | L) P(L) / P(W) = (0.999 × 0.004) / (0.999 × 0.004 + 0.005 × 0.996) = 0.003996 / 0.008976 = 0.45   (6.8)
to two decimal places. This result implies that most (55%) of these warning lights are
false alarms, despite the apparent accuracy of the alarm system! The reason for this
paradoxical outcome is that the landing gear is operational for the vast majority of the
time. If we were to specify P ( L ) = 0.04 instead, we would obtain P ( L W ) = 0.89 ,
which is far more acceptable. Similar patterns of behaviour apply to medical
screening procedures: in order to reduce the incidence of misdiagnoses, only
patients deemed to be at risk of an illness are routinely screened for it.
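The arithmetic of Example 6.2 is easy to reproduce; the sketch below also re-runs it with the alternative value P(L) = 0.04 discussed above:

```python
def fault_given_warning(p_fault, p_warn_given_fault=0.999, p_false_alarm=0.005):
    """Bayes' theorem (6.5) combined with the law of total probability (6.6)."""
    p_warn = p_warn_given_fault * p_fault + p_false_alarm * (1 - p_fault)
    return p_warn_given_fault * p_fault / p_warn

print(round(fault_given_warning(0.004), 2))  # -> 0.45
print(round(fault_given_warning(0.04), 2))   # -> 0.89
```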
Only in the mid-twentieth century were the real benefits of Bayes' theorem
appreciated, though. Not only does it apply to probabilities, but also to random
variables. For example, suppose X is a discrete random variable and Y is a continuous random variable. Then the conditional probability density function of Y
given X can be determined using Bayes' theorem, if we know the marginal distributions of X and Y, and the conditional distribution of X given Y:
f(y | x) = p(x | y) f(y) / p(x)   (6.9)
This rule for transposing the conditionals has proven to be crucial in a variety of
important applications, including quality control, fault diagnosis, image processing,
medical screening and criminal trials.
Even more importantly, we can apply Bayes' theorem to unknown model
parameters. This is the foundation of the Bayesian approach to statistical inference
and has had an enormous and profound impact on the subject over the last few
decades. Suppose that a continuous random variable X has a probability distribution
that depends on an unknown parameter θ. For example, X might represent the fire-breach time of a door in minutes and it might have an exponential distribution with
unknown mean μ = 1/λ.
A naïve approach to statistical inference would simply replace θ by a good
guess based on expert opinions. However, this is inherently inaccurate and can lead
to poor decisions. A better method is the frequentist approach to inference, whereby we evaluate an estimate θ̂ for the unknown parameter θ based on a set of observed data D = {x_1, x_2, …, x_n}, which might consist of a random sample of actual
fire-breach times for the above example. Subsequent analyses generally invoke the
approximation θ = θ̂, which can again lead to poor decisions.
In contrast, the Bayesian approach does not involve any guesses or estimates of
unknown parameters in the model. Rather, it uses Bayes' theorem to update our
prior beliefs about θ in response to the observed data D thus:
g(θ | D) = f(D | θ) g(θ) / f(D)   (6.10)
This enables us to make any inference we wish about θ. We can also use our
posterior beliefs about θ for any subsequent inference involving X. The price that
we pay for obtaining exact answers and avoiding approximations in this way
comes in two parts: the need to assume a prior distribution for θ and the increase
in algebraic complexity. This chapter shows how to resolve these issues.
Example 6.3 Suppose the unknown parameter θ represents the proportion of
car batteries that fail within two years and our prior beliefs about θ can be expressed in terms of the probability density function

g(θ) = 2(1 - θ); 0 < θ < 1.   (6.11)
Suppose also that we observe three car batteries, one of which fails within two
years and two of which do not. Then we can express the likelihood of these data
using the binomial probability mass function
p(D | θ) = 3θ(1 - θ)²,   (6.12)

and multiplying the prior by the likelihood and normalizing gives

g(θ | D) = 20θ(1 - θ)³   (6.13)

for 0 < θ < 1, so our posterior beliefs about the unknown parameter can be
expressed as a beta distribution θ | D ~ Be(2, 4). We elaborate on this process
further in Section 6.4.
Given a random sample D = {x_1, x_2, …, x_n}, the likelihood function is

L(θ; D) = ∏_{i=1}^n f(x_i | θ),   (6.14)

and Bayes' theorem then reduces to

g(θ | D) ∝ L(θ; D) g(θ),   (6.15)
or, in words,
posterior is proportional to likelihood times prior.
This is the fundamental rule for Bayesian inference.
Example 6.4 Previously, in Example 6.3, we considered the proportion of car
batteries that fail within two years. This involved the use of Bayes' theorem for
this unknown model parameter and was an illustration of how the fundamental
rule "posterior is proportional to likelihood times prior" can be applied. To clarify
this demonstration, the likelihood function takes the form
L(θ; D) = ∏_{i=1}^3 p(x_i | θ) = θ(1 - θ)²   (6.16)
where the probability mass function p(x_i | θ) corresponds to a Bernoulli distribution. Consequently, the posterior probability density function of θ given the data
D has the form
g(θ | D) ∝ L(θ; D) g(θ) ∝ θ(1 - θ)³   (6.17)
for 0 < θ < 1, which agrees with the result we obtained previously. The
corresponding prior and posterior probability density functions are graphed for
comparison in Figure 6.2.
Figure 6.2. Prior and posterior probability density functions for Example 6.4
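The conjugate bookkeeping behind Examples 6.3 and 6.4 can be sketched in a few lines; the Be(1, 2) prior and the data are taken from the text, while the helper itself is illustrative:

```python
from math import gamma

# Bernoulli/beta conjugate update: the prior Be(1, 2) has density 2(1 - theta),
# and the data are one failure and two survivals among n = 3 batteries.
a, b = 1, 2
n, failures = 3, 1

a_post = a + failures        # beta posterior: a + sum of the x_i
b_post = b + n - failures    # and b + n - sum of the x_i
print(a_post, b_post)        # -> 2 4, i.e. theta | D ~ Be(2, 4)

def posterior_density(theta):
    """Normalized posterior density, 20 * theta * (1 - theta)**3."""
    beta_const = gamma(a_post) * gamma(b_post) / gamma(a_post + b_post)
    return theta ** (a_post - 1) * (1 - theta) ** (b_post - 1) / beta_const

print(round(posterior_density(0.5), 4))  # -> 1.25
```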
Having evaluated a posterior distribution using this rule, we can evaluate the
posterior mode θ̂ such that

g(θ̂ | D) ≥ g(θ | D) for all θ,   (6.18)
(6.19)
However, to find the median or mean, and to use this posterior density to make any
further inference, we need to determine the constant of proportionality in the
fundamental rule above. In standard situations, we can recognise the functional
form of L(θ; D) g(θ) and hence quote published work on probability distributions
to determine this constant of proportionality and so derive g(θ | D) explicitly. In
non-standard situations, we determine this constant of proportionality using
numerical quadrature or simulation, both of which we discuss later.
6.4.1 Reference Priors
There are two main types of prior distribution, which loosely correspond with
objective priors and subjective priors. As strictly objective priors do not exist, the former
category is generally known as reference priors; these are used when little prior
information is available and as a benchmark against which to compare the output
from subjective priors. This offers a default Bayesian analysis that is not
dependent upon any personal prior knowledge. The simplest reference prior is
proposed by the Bayes-Laplace postulate, which simply recommends the use of a
uniform or locally-uniform prior g(θ) ∝ 1 for all θ in the region of support R.
The Jeffreys prior takes the form

g(θ) ∝ √I(θ)   (6.20)

where I(θ) is the Fisher information

I(θ) = -E_X[d² log f(x | θ) / dθ²]   (6.21)
where
(6.22)
so that the posterior density has the same functional form as the prior density. This
property is particularly appealing, as our prior knowledge can be regarded as
posterior to some previous information. Again, we tend to suppose that components in multi-parameter problems are independent, so that their joint prior
density is the product of corresponding univariate marginal priors.
Such closed priors exist, and are called natural conjugate priors, for sampling
distributions f(x | θ) that belong to the exponential family. This family includes
Bernoulli, binomial, geometric, negative binomial, Poisson, exponential, gamma,
normal and lognormal models. For a model in the exponential family with scalar
parameter θ, we can express the probability density or mass function in the form
f(x | θ) = exp{a(x) b(θ) + c(x) + d(θ)}   (6.23)

and the natural conjugate prior then takes the form

g(θ) ∝ exp{k_1 b(θ) + k_2 d(θ)}   (6.24)

for suitable constants k_1 and k_2. However, any conjugate prior of the form

g(θ) ∝ h(θ) exp{k_1 b(θ) + k_2 d(θ)}   (6.25)
(6.26)
If necessary, linear transformations of the parameters ensure that these priors are
sufficient for modelling all situations. They match with the natural conjugate priors
for simple models and extend to deal with more complicated models. Mixtures of
these priors can be used if multimodality is present and prior independence can be
assumed for multiparameter situations.
Bayesian approach to inference makes statements about the parameters given the
data, which are precisely what is required. O'Hagan (1994) commented that the
Bayesian approach is fundamentally sound, very flexible, produces clear and
direct inferences and makes use of all the available information. In contrast, he
noted that the Classical approach suffers from some philosophical flaws, has a
restrictive range of inferences with rather indirect meanings and ignores prior
information.
One of the most important and useful features of the Bayesian approach arises
when we wish to make predictions about future values of the random variable X
where f(x | θ) is specified. If θ is unknown, the prior predictive probability
density function of X is
f(x) = ∫ f(x | θ) g(θ) dθ.   (6.27)
Similarly, having observed data D, the posterior predictive probability density function of X is

f(x | D) = ∫ f(x | θ) g(θ | D) dθ ∝ ∫ f(x | θ) L(θ; D) g(θ) dθ.   (6.28)
For example, suppose the time X to breakdown of a pulper has an exponential distribution f(x | λ) = λ exp(-λx) and we adopt the reference prior g(λ) ∝ 1/λ. Then the prior predictive density is

f(x) ∝ ∫_0^∞ λ exp(-λx) λ^{-1} dλ = 1/x   (6.29)
for x > 0 , which is improper. However, this does provide information about the
relative likelihoods for different values of X . For example, the ratio of probabilities
that X lies in the intervals (5,10) and (10,20) is given by
P(5 < X < 10) / P(10 < X < 20) = ∫_5^10 x^{-1} dx / ∫_10^20 x^{-1} dx = (log 10 - log 5) / (log 20 - log 10) = 1   (6.30)
so the time to breakdown of this pulper is equally likely to lie in these two intervals
without taking account of any subjective or empirical information that might
be available. Even if we subsequently observe a random sample of lifetimes
D = {x_1, x_2, …, x_n}, the posterior predictive density

f(x | D) ∝ ∫_0^∞ λ exp(-λx) λ^{n-1} exp(-λ Σ_{i=1}^n x_i) dλ = n! / (x + Σ_{i=1}^n x_i)^{n+1}; x > 0   (6.31)
is still improper, though we can evaluate relative likelihoods as we did for the prior
predictive density. In contrast, a frequentist approach would merely generate the
approximation X | D ~ Ex(1/x̄) and could do no better than guess a value for X
before observing any data.
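As a quick check of Equation 6.30, the two logarithmic integrals can be compared directly (a trivial sketch):

```python
import math

# Relative likelihood of the two intervals under the improper prior predictive
# density f(x) proportional to 1/x (Equations 6.29-6.30): both integrals are
# differences of logarithms.
p_5_10 = math.log(10) - math.log(5)    # integral of 1/x over (5, 10)
p_10_20 = math.log(20) - math.log(10)  # integral of 1/x over (10, 20)
ratio = p_5_10 / p_10_20
print(round(ratio, 6))  # -> 1.0
```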
Alternatively, adopting the natural conjugate prior λ ~ Ga(a, b), the prior predictive density becomes

f(x) = ∫_0^∞ λ exp(-λx) (b^a / Γ(a)) λ^{a-1} exp(-bλ) dλ = a b^a / (x + b)^{a+1}; x > 0   (6.32)
which corresponds with a special form of gamma-gamma distribution. If we subsequently observe a random sample of lifetimes D = {x_1, x_2, …, x_n}, the posterior
predictive density is given by

f(x | D) ∝ ∫_0^∞ λ exp(-λx) λ^{a+n-1} exp(-(b + Σ_{i=1}^n x_i) λ) dλ = Γ(a + n + 1) / (x + b + Σ_{i=1}^n x_i)^{a+n+1}; x > 0   (6.33)
failure of a pulper, as before, or the downtime X incurred as a result of a computer system failure) with a gamma prior is given by
f(x) = a b^a / (x + b)^{a+1}; x > 0.   (6.34)
The corresponding predictive cumulative distribution function is

F(x) = ∫_0^x [a b^a / (u + b)^{a+1}] du = 1 - (b / (x + b))^a; x > 0.   (6.35)

Setting the lower and upper tertiles of this predictive distribution at 2,500 and 7,500 h gives the simultaneous equations

1 - (b / (2,500 + b))^a = 1/3;   (6.36)

1 - (b / (7,500 + b))^a = 2/3.   (6.37)
There are many algorithms for solving simultaneous nonlinear equations and
several computer packages that contain these algorithms. Mathcad gives the values
a = 3.5240 and b = 20,502, so the prior distribution for the exponential parameter λ
is specified completely as λ ~ Ga(3.5240, 20,502).
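The Mathcad step can be reproduced with nothing more than bisection: dividing the logarithms of Equations 6.36 and 6.37 eliminates a, leaving a monotone function of b (a sketch, not the chapter's procedure):

```python
import math

# Tertiles from the text: F(2500) = 1/3 and F(7500) = 2/3, i.e.
# (b/(b+2500))^a = 2/3 and (b/(b+7500))^a = 1/3.
t1, t2 = 2500.0, 7500.0
target = math.log(2.0 / 3.0) / math.log(1.0 / 3.0)  # ratio of the log equations

def ratio(bb):
    return math.log(bb / (bb + t1)) / math.log(bb / (bb + t2))

lo, hi = 1000.0, 100000.0            # bracket: ratio decreases through target
for _ in range(100):                 # bisection on the monotone ratio
    mid = 0.5 * (lo + hi)
    if ratio(mid) > target:
        lo = mid
    else:
        hi = mid
b = 0.5 * (lo + hi)
a = math.log(2.0 / 3.0) / math.log(b / (b + t1))
print(round(a, 4), round(b))  # close to the quoted a = 3.5240, b = 20,502
```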
analysis to allow for the uncertainty attached to these parameters. This effect is
particularly important when dealing with limited amounts of data, a common
problem in the area of reliability and maintenance and the subject of this chapter.
For example, the author recently acquired a set of data relating to the performance of an industrial valve subject to corrective and preventive maintenance. Only 12
uncensored lifetime observations were available, despite the fact that this represents
six years of data collection. From a frequentist point of view, it would be unwise to
fit any model involving more than three parameters to these data. However, the
Bayesian is not constrained in this manner, as prior knowledge gleaned from experience of similar systems can be incorporated in the analysis. Of course, parsimony
still dictates that models with fewer parameters are more robust for predictive
purposes, even if they provide better fits to the observed data. We can resolve such
issues using model comparison methods based on prior odds, Bayes factors and posterior
odds, which we do not discuss here.
Consider a set of possible decisions d with associated utility function
u(d, θ), which depends on an unknown parameter θ. The best decision is that
which maximizes the prior expected utility

E{u(d, θ)} = ∫ u(d, θ) g(θ) dθ   (6.38)

or, equivalently, minimizes the prior expected loss

E{l(d, θ)} = ∫ l(d, θ) g(θ) dθ.   (6.39)

If data D are available, the best decision instead maximizes the posterior expected utility

E{u(d, θ) | D} = ∫ u(d, θ) g(θ | D) dθ   (6.40)

where

g(θ | D) ∝ L(θ; D) g(θ)   (6.41)
(6.42)
(6.43)
Then

E{l(i, λ_i)} = c_i E(λ_i)   (6.44)

and we choose the system i which minimizes this expected loss, where E(λ_i) is the
prior mean.
Table 6.1. Sampling distributions commonly encountered in maintenance analysis, with their natural conjugate prior and posterior distributions

Model                                      Prior             Posterior
Bernoulli Be(θ) (θ = probability)          beta Be(a, b)     beta Be(a + nx̄, b + n(1 - x̄))
Poisson Po(λ) (λ = mean)                   gamma Ga(a, b)    gamma Ga(a + nx̄, b + n)
Geometric Ge(θ) (θ = probability)          beta Be(a, b)     beta Be(a + n, b + n(x̄ - 1))
Exponential Ex(λ) (λ = hazard)             gamma Ga(a, b)    gamma Ga(a + n, b + nx̄)
Normal No(μ, λ) (μ = mean; known
  precision λ)                             normal No(a, b)   normal No((ab + nλx̄)/(b + nλ), b + nλ)
Normal No(μ, λ) (λ = precision; known
  mean μ)                                  gamma Ga(c, d)    gamma Ga(c + n/2, d + n(x̄ - μ)²/2 + (n - 1)s²/2)
Table 6.1 presents a summary of the probability distributions commonly encountered in maintenance analysis, together with details of their natural conjugate
prior and posterior distributions. For models with two parameters, including the
unconstrained normal, gamma and Weibull sampling distributions, the analysis is
less straightforward and readers are referred to Section 6.4.2 for guidance.
Among the published research that applies this methodology to maintenance
modelling is the extensive book on Bayesian reliability analysis by Martz and
Waller (1982). Journal papers that address specific issues include those by Soland
(1969), Bury (1972) and Canavos and Tsokos (1973), who are concerned particularly with analysis of the Weibull distribution. Singpurwalla (1988) and Percy
(2004) are concerned with prior elicitation for reliability analysis and O'Hagan
(1998) presents an accessible, general discussion of Bayesian methods.
There are many other academic publications dealing with Bayesian approaches
in maintenance and a representative sample of recent articles includes those by van
Noortwijk et al. (1992), Mazzuchi and Soyer (1996), Chen and Popova (2000),
Apeland and Aven (2000), Kallen and van Noortwijk (2005) and Celeux et al.
(2006). The general aim is to determine optimal policies for maintenance scheduling and operation, by combining subjective prior knowledge with observed data
using Bayes' theorem and employing belief networks for larger systems.
g(θ) = [1 / B(a, b)] θ^{a-1} (1 - θ)^{b-1}; 0 < θ < 1   (6.45)
(6.46)
so the posterior probability that a randomly chosen box from the shipment is
defective is P ( X = 1 D ) = 0.033 , or 1 in 30.
Figure 6.3. Prior and posterior probability density functions for digital set top boxes
g(λ) = (40^10 / 9!) λ^9 exp(-40λ); λ > 0.   (6.47)
She runs an experiment for one day, replacing each flat battery by an identical fully
charged battery after failure, so that the total number of failures X has a Poisson
distribution with probability mass function
p(x | λ) = ((24λ)^x / x!) exp(-24λ); x = 0, 1, 2, ….   (6.48)
(6.48)
Figure 6.4. Prior and posterior probability density functions for rechargeable tool batteries
6.10 Conclusions
Bayesian inference represents a methodology for mathematical modelling and
statistical analysis of random variables and unknown parameters. It provides an
excellent alternative to the frequentist approach which gained immense popularity
throughout the twentieth century. Whereas the frequentist approach is based upon
the restrictive inference of point estimates, confidence intervals, significance tests,
p-values and asymptotic approximations, the Bayesian approach is based upon
probability theory and provides complete solutions to practical problems.
Advocates of the Bayesian approach regard it as superior to the frequentist
approach in most circumstances and infinitely superior in some. However, it does
depend upon the existence and specification of subjective probability to represent
individual beliefs, whereas the frequentist approach is almost completely objective.
Partial resolution of these difficulties was addressed in Section 6.6 and continues to
be improved upon, particularly with regard to eliciting subjective prior knowledge
for multiparameter models. The approach advocated here also involves more analytical and computational complexity, though this is not much of a hindrance with
modern computing power.
In particular, this approach often involves intractable integrals of the forms

g(θ | D) = L(θ; D) g(θ) / ∫ L(θ; D) g(θ) dθ      posterior densities;   (6.49)

f(x) = ∫ f(x | θ) g(θ) dθ      predictive densities;   (6.50)

E{u(d, θ)} = ∫ u(d, θ) g(θ) dθ      expected utilities.   (6.51)
Monte Carlo simulation can be used to approximate any integral of this form by
generating many pseudo-random numbers θ_1, θ_2, …, θ_n from the prior or posterior
density in the integrand and evaluating the unbiased estimator

∫ s(θ) g(θ) dθ ≈ (1/n) Σ_{i=1}^{n} s(θ_i) ,   (6.52)
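A minimal sketch of this estimator, using a Beta(2, 3) density for g(θ) and s(θ) = θ² purely as an illustration (the exact value of ∫ θ² g(θ) dθ is then 0.2):

```python
import random

def monte_carlo_integral(s, sampler, n=100_000):
    # Unbiased estimator of integral s(theta) g(theta) dtheta (Equation 6.52):
    # draw theta_1, ..., theta_n from g and average s(theta_i)
    return sum(s(sampler()) for _ in range(n)) / n

random.seed(1)
estimate = monte_carlo_integral(lambda th: th * th,
                                lambda: random.betavariate(2, 3))
print(round(estimate, 3))  # close to the exact value 0.2
```

The same estimator applies unchanged to posterior expectations and expected utilities, since each is an integral of this form against a prior or posterior density.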
6.11 References
Apeland S, Aven T, (2000) Risk based maintenance optimization: foundational issues.
Reliability Engineering & System Safety 67:285–292
Bernardo JM, Smith AFM, (2000) Bayesian Theory. Chichester: Wiley
Bury KV, (1972) Bayesian decision analysis of the hazard rate for a two-parameter Weibull
process. IEEE Transactions on Reliability 21:159–169
Canavos GC, Tsokos CP, (1973) Bayesian estimation of life parameters in the Weibull
distribution. Operations Research 21:755–763
Celeux G, Corset F, Lannoy A, Ricard B, (2006) Designing a Bayesian network for
preventive maintenance from expert opinions in a rapid and reliable way. Reliability
Engineering & System Safety 91:849–856
Chen TM, Popova E, (2000) Bayesian maintenance policies during a warranty period.
Communications in Statistics 16:121–142
Jeffreys H, (1998) Theory of Probability. Oxford: University Press
Kallen MJ, van Noortwijk JM, (2005) Optimal maintenance decisions under imperfect
maintenance. Reliability Engineering & System Safety 90:177–185
Kobbacy KAH, Percy DF, Fawzi BB, (1995) Sensitivity analyses for preventive-maintenance
models. IMA Journal of Mathematics Applied in Business and Industry 6:53–66
Kobbacy KAH, Percy DF, Fawzi BB, (1997) Small data sets and preventive maintenance
modelling. Journal of Quality in Maintenance Engineering 3:136–142
Lee PM, (2004) Bayesian Statistics: an Introduction. London: Arnold
Martz HF, Waller RA, (1982) Bayesian Reliability Analysis. New York: Wiley
Mazzuchi TA, Soyer R, (1996) A Bayesian perspective on some replacement strategies.
Reliability Engineering & System Safety 51:295–303
O'Hagan A, (1994) Kendall's Advanced Theory of Statistics Volume 2B: Bayesian Inference.
London: Arnold
O'Hagan A, (1998) Eliciting expert beliefs in substantial practical applications. The
Statistician 47:21–35
Percy DF, (2002) Bayesian enhanced strategic decision making for reliability. European
Journal of Operational Research 139:133–145
Percy DF, (2004) Subjective priors for maintenance models. Journal of Quality in
Maintenance Engineering 10:221–227
Percy DF, Kobbacy KAH, Fawzi BB, (1997) Setting preventive maintenance schedules
when data are sparse. International Journal of Production Economics 51:223–234
Singpurwalla ND, (1988) An interactive PC-based procedure for reliability assessment
incorporating expert opinion and survival data. Journal of the American Statistical
Association 83:43–51
Soland RM, (1969) Bayesian analysis of the Weibull process with unknown scale and shape
parameters. IEEE Transactions on Reliability 18:181–184
van Noortwijk JM, Dekker A, Cooke RM, Mazzuchi TA, (1992) Expert judgement in
maintenance optimization. IEEE Transactions on Reliability 41:427–432
7
Reliability Prediction and Accelerated Testing
E. A. Elsayed
7.1 Introduction
Reliability is one of the key quality characteristics of components, products and
systems. It cannot be directly measured and assessed like other quality characteristics but can only be predicted for given times and conditions. Its value depends on
the use conditions of the product as well as the time at which it is to be predicted.
Reliability prediction has a major impact on critical decisions such as the optimum
release time of the product, the type and length of warranty policy and associated
duration and cost, and the determination of the optimum maintenance and replacement schedules. Therefore, it is important to provide accurate reliability predictions
over time in order to determine accurately the repair, inspection and replacements
strategies of products and systems.
Reliability predictions are based on testing a small number of samples or prototypes of the product. Predicting reliability is further complicated by
many limitations, such as the available time to conduct the test and budget constraints, among others. Testing products at design conditions requires extensive
time, a large number of units and high cost. Clearly some kind of reliability testing, other
than testing at normal design conditions, is needed. One of the most commonly used
approaches for testing products within the above stated constraints is accelerated life
testing (ALT), where units or products are subjected to stress conditions more severe
than normal operating conditions to accelerate their failure times; the test
results are then used to predict (extrapolate) the reliability at design conditions. This chapter will
address the determination of the optimum maintenance schedule at normal operating
conditions while utilizing the results from accelerated testing.
We classify the ALT into two types: accelerated failure time testing (AFTT) and
accelerated degradation testing (ADT). The AFTT is conducted when accelerated
conditions result in the failure of test units without experiencing failure mechanisms
different from those occurring at normal operating conditions and when there are
enough units to be tested at different conditions. Moreover, the economics of
conducting AFTT need to be justified, as the test is destructive and its duration is
directly related to the reliability of the test units and the applied stresses. Finally, testing
at stresses far from normal makes it difficult to predict reliability accurately at
normal conditions, as in some cases few or no failures are observed even under
accelerated conditions, making reliability inference via failure-time analysis highly
inaccurate, if not impossible. On the other hand, ADT is a viable alternative to
AFTT when the product's physical characteristics or performance indices leading to
failure (e.g. drift in the resistance value of a resistor, change in light intensity of light
emitting diodes (LED) and loss of strength of a bridge structure) experience
degradation over time. Moreover, significant degradation data can be obtained by
observing the degradation of a small number of units over time. Degradation testing
may also be conducted at either normal or accelerated conditions, and no actual
failure is required for reliability inference (Liao 2004).
In this chapter, we address the issues associated with conducting accelerated life
testing and describe how the reliability models obtained from ALT are used in the
determination of the optimum maintenance schedules at normal operating conditions.
This chapter is organized as follows. Section 7.1 provides an overview of the role of
reliability prediction and the importance of accelerated life testing. In Section 7.2 we
present the two most commonly used accelerated life testing types in reliability
engineering. The approaches and models for predicting reliability using accelerated
life testing are described in Section 7.3 while Section 7.4 focuses on mathematical
formulation and solution of the design of accelerated life testing plans. Section 7.5
shows how accelerated life testing is related to maintenance decisions at normal
operating conditions. Models to determine the optimum preventive maintenance
schedules for both failure time models and degradation models are presented in
Section 7.6. A summary of the chapter is presented in Section 7.7. We begin by
describing the ALT types.
made about the relationship of the failure time distributions at both the accelerated
and the normal conditions. However, it is not always true that the number of cycles
to failure at high usage rate is the same as that of the normal usage rate. Moreover,
the effect of aging is ignored. Therefore, this type of testing must be run with
special care to ensure that product operation and stress remain normal in all regards
except usage rate, and that the effect of aging is taken into account, if possible.
An alternative to the above accelerated failure time testing is to accelerate
stress (apply stresses more severe than that of the normal conditions) to shorten
product or component life. Typical accelerating stresses are temperature, voltage,
humidity, pressure, vibration, and fatigue cycling. It is important to recognize the
type of stress which indeed accelerates product or component life. Suitable
accelerating stresses need to be determined. One may also wish to know how
product life depends on several stresses operating simultaneously. In accelerated
life testing, the test stress levels should also be controlled. They cannot be so high
as to produce failure modes that rarely occur, or are unlikely to occur, at normal
conditions. Yet the levels should be high enough to yield enough failures similar to
those that exist at the design (operating) stress. The limited range of the stress
levels needs to be specified in the test plans to avoid invalid or biased estimates of
reliability. The stress application loading can be constant, increase (or decrease)
continuously or in steps, vary cyclically, or vary randomly or combinations of
these loadings. The choice of such stress loading depends on how the product is
loaded in service and on practical and theoretical limitations (Shyur 1996).
7.2.2 Accelerated Degradation Testing
In some cases, applying high stresses might not induce failures or yield
sufficient data, and reliability inference via failure time analysis becomes highly
inaccurate, if not impossible. However, if a product's physical characteristics or
performance indices leading to failure experience degradation over time, then
degradation analysis can be a viable alternative to traditional failure time analysis. The advantages of degradation modeling over time-to-failure modeling are
significant. Indeed, degradation data may provide more reliability information than
would otherwise be available from time-to-failure data with censoring. Moreover,
degradation testing may be conducted either at normal or accelerated conditions,
and no actual failure is required for reliability inference.
Degradation data needed for reliability inference may be obtained from two
sources: the first is field application and the second is degradation testing
experiments. The first requires an extensive data collection system over a
long time. Since the collected data are often subject to a highly random stress environment and to human errors, they may exhibit significant volatility and their
accuracy is sometimes questionable, limiting their use for reliability inference and prediction. The
second source supports prognostics, a process of predicting the future state of a product
(or component). Degradation data analysis might be used in this process to minimize field failures and reduce life-cycle expenses by recommending condition-based maintenance of observed components or systems. Moreover, degradation
testing is usually conducted to demonstrate a product's reliability and helps in
revealing the main failure mechanisms and the major failure-causing stress factors.
Etezadi-Amoli 1985; Etezadi-Amoli and Ciampi 1987; Shyur et al. 1999) is proposed to combine the PH and AFT models into one form:

λ(t; z) = λ_0(t e^{α′z}) exp(β′z)   (7.6)

The unknowns of this model are the regression coefficients α and β, and the
unspecified baseline hazard function λ_0(t). The model reflects that the covariate z
has both a time-scale changing effect and a hazard multiplicative effect. It becomes
the PH model when α = 0 and the AFT model when α = β.
Elsayed et al. (2006) propose a new model called Extended Linear Hazard
Regression (ELHR) model. The ELHR model (e.g., with one covariate) assumes
those coefficients to be changing linearly with time:

λ(t; z) = λ_0(t e^{(α_0 + α_1 t)z}) exp((β_0 + β_1 t)z)   (7.7)
The model considers the proportional hazards effect, time scale changing effect
as well as time-varying coefficients effect. It encompasses all previously developed
models as special cases. It may provide a refined model fit to failure time data and
a better representation regarding complex failure processes.
Since the covariate coefficients and the unspecified baseline hazard cannot be
expressed separately, the partial likelihood method is not suitable for estimating the
unknown parameters. Elsayed et al. (2006) propose the maximum likelihood
method which requires the baseline hazard function to be specified in a parametric
form. In the EHR model, the baseline hazard function has two specific forms; one
is a quadratic function and the other is a quadratic spline. In the proposed ELHR
model, we assume the baseline hazard function λ_0(t) to be a quadratic function:

λ_0(t) = γ_0 + γ_1 t + γ_2 t²   (7.8)

λ(t; z) = (γ_0 + γ_1 t + γ_2 t²) exp((β_0 + β_1 t)z)   (7.9)
Λ(t; z) = ∫_0^t λ(u; z) du = ∫_0^t γ_0 e^{β_0 z + β_1 z u} du + ∫_0^t γ_1 u e^{β_0 z + β_1 z u} du + ∫_0^t γ_2 u² e^{β_0 z + β_1 z u} du

= (γ_0 / (β_1 z)) e^{β_0 z + β_1 z t} − (γ_0 / (β_1 z)) e^{β_0 z}
  + (γ_1 t / (β_1 z)) e^{β_0 z + β_1 z t} − (γ_1 / (β_1 z)²) e^{β_0 z + β_1 z t} + (γ_1 / (β_1 z)²) e^{β_0 z}
  + (γ_2 t² / (β_1 z)) e^{β_0 z + β_1 z t} − (2γ_2 t / (β_1 z)²) e^{β_0 z + β_1 z t} + (2γ_2 / (β_1 z)³) e^{β_0 z + β_1 z t} − (2γ_2 / (β_1 z)³) e^{β_0 z}   (7.10)
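The closed form of Equation 7.10 can be checked against direct numerical integration of the hazard λ(u; z) = (γ_0 + γ_1 u + γ_2 u²) e^{β_0 z + β_1 z u}; the γ and β values used below are illustrative only, not fitted values:

```python
import math

def elhr_hazard(u, z, g0, g1, g2, b0, b1):
    # Hazard (g0 + g1*u + g2*u^2) * exp(b0*z + b1*z*u)
    return (g0 + g1 * u + g2 * u * u) * math.exp(b0 * z + b1 * z * u)

def cum_hazard_closed(t, z, g0, g1, g2, b0, b1):
    # Closed form of Equation 7.10
    c, k = math.exp(b0 * z), b1 * z
    e = math.exp(k * t)
    term0 = g0 * c * (e - 1.0) / k
    term1 = g1 * c * (t * e / k - (e - 1.0) / k**2)
    term2 = g2 * c * (t * t * e / k - 2.0 * t * e / k**2 + 2.0 * (e - 1.0) / k**3)
    return term0 + term1 + term2

def cum_hazard_numeric(t, z, g0, g1, g2, b0, b1, n=100_000):
    # Trapezoidal rule for the integral of the hazard over (0, t)
    h = t / n
    total = 0.5 * (elhr_hazard(0.0, z, g0, g1, g2, b0, b1)
                   + elhr_hazard(t, z, g0, g1, g2, b0, b1))
    total += sum(elhr_hazard(i * h, z, g0, g1, g2, b0, b1) for i in range(1, n))
    return total * h

params = dict(g0=1.0, g1=0.5, g2=0.2, b0=0.3, b1=0.1)
closed = cum_hazard_closed(3.0, 2.0, **params)
numeric = cum_hazard_numeric(3.0, 2.0, **params)
print(abs(closed - numeric) / closed < 1e-6)  # True
```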
e(t | z) = e_0(t) exp(β′z)   (7.11)
We refer to this model as the proportional mean residual life regression model,
which is used to model accelerated life testing. Clearly e_0(t) serves as the MRL
corresponding to a baseline reliability function R_0(t) and is called the baseline
mean residual function; e(t | z) is the conditional mean residual life function of
T − t given T > t and Z = z. Here z^T = (z_1, z_2, …, z_p) is the vector of covariates,
β^T = (β_1, β_2, …, β_p) is the vector of coefficients associated with the covariates,
and p is the number of covariates. Typically, we can experimentally obtain
{(t_i, z_i), i = 1, 2, …, n}, the set of failure times and vectors of covariates for each
unit (Zhao and Elsayed, 2005). The main assumption of this model is the proportionality of mean residual lives with applied stresses. In other words, the mean
residual life of a unit subjected to high stress is proportional to the mean residual
life of a unit subjected to low stress.
7.3.2.4 Proportional Odds Model
In many applications, however, it is often unreasonable to assume that the effects of
covariates on the hazard rates remain fixed over time. Brass (1971) observes that
the ratio of the death rates, or hazard rates, of two populations under different
stress levels (for example, one population of smokers and the other of non-smokers) is not constant with age, or time, but follows a more complicated course,
in particular converging closer to unity for older people. So the PH model is not
suitable for this case. Brass (1974) proposes a more realistic model: the proportional odds (PO) model. The proportional odds model has been used successfully
in categorical data analysis (McCullagh 1980; Agresti and Lang 1993) and in
survival analysis (Hannerz 2001) in the medical field. The PO model makes a distinctly
different assumption about proportionality, and is complementary to the PH model. It
has not previously been used in reliability analysis of accelerated life testing. Zhang and
Elsayed (2005) extend this model for reliability estimation using ALT data.
We describe the PO model as follows. Let T > 0 be a failure time associated
with stress level z and cumulative distribution function F(t; z), and let the ratio
F(t; z) / (1 − F(t; z)), or equivalently (1 − R(t; z)) / R(t; z), be the odds on failure by
time t. The PO model is then expressed as

F(t; z) / (1 − F(t; z)) = exp(βz) F_0(t) / (1 − F_0(t))   (7.12)

so that the ratio of the odds functions is

θ(t; z) / θ_0(t) = exp(βz)   (7.13)

which is independent of the baseline odds function θ_0(t) and of the time t. Hence,
the odds functions are constantly proportional to each other. The baseline odds
function could be any monotone increasing function of time t with the property
θ_0(0) = 0. When θ_0(t) = t, the PO model presented by Equation 7.13 becomes the
log-logistic accelerated failure time model (Bennett 1983), which is a special case
of the general PO models.
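As a small check of this special case, take baseline odds θ_0(t) = t and an assumed coefficient β = 0.8 (an illustrative value): the resulting cdf is log-logistic, and the odds ratio between stress levels stays constant in t, as Equation 7.13 requires:

```python
import math

def po_cdf(t, z, beta=0.8):
    # PO model with baseline odds theta0(t) = t:
    # odds(t; z) = exp(beta * z) * t, so F = odds / (1 + odds) (log-logistic cdf)
    odds = math.exp(beta * z) * t
    return odds / (1.0 + odds)

def odds_of(t, z):
    F = po_cdf(t, z)
    return F / (1.0 - F)

# Odds ratio between stress levels z = 1 and z = 0 is exp(0.8) at every t:
for t in (0.5, 2.0, 10.0):
    print(round(odds_of(t, 1) / odds_of(t, 0), 6))
```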
In order to utilize the PO model in predicting reliability at normal operating
conditions, it is important that both the baseline odds function and the covariate
parameter β be estimated accurately. Since the baseline odds function of the
general PO models could be any monotone increasing function, it is important to
define a viable baseline odds function structure that approximates most, if not all,
possible odds functions. In order to find such a universal baseline odds
function, we investigate the properties of the odds function and its relation to the
hazard rate function.
The odds function θ(t) is defined by

θ(t) = F(t) / (1 − F(t)) = (1 − R(t)) / R(t) = 1/R(t) − 1   (7.14)
From the properties of the reliability function and its relation to the odds function
shown in Equation 7.14, we can easily derive the following properties of the odds
function θ(t):

1. θ(0) = 0 and θ(∞) = ∞
2. θ(t) is a monotone increasing function of t
3. θ(t) = (1 − exp[−Λ(t)]) / exp[−Λ(t)] = exp[Λ(t)] − 1, and Λ(t) = ln[θ(t) + 1]
4. λ(t) = θ′(t) / (θ(t) + 1)
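These properties can be verified numerically; the sketch below uses a Weibull distribution with shape 2 and scale 1, chosen purely for illustration, for which Λ(t) = t² and λ(t) = 2t:

```python
import math

def weibull_reliability(t):
    # R(t) for a Weibull distribution with shape 2 and scale 1
    return math.exp(-t * t)

def odds(t):
    # theta(t) = 1/R(t) - 1 (Equation 7.14)
    return 1.0 / weibull_reliability(t) - 1.0

t = 1.7
# Property 3: cumulative hazard Lambda(t) = ln(theta(t) + 1) = t**2
print(abs(math.log(odds(t) + 1.0) - t * t) < 1e-9)  # True

# Property 4: hazard lambda(t) = theta'(t) / (theta(t) + 1), here 2t,
# with theta'(t) approximated by central differences
h = 1e-6
theta_prime = (odds(t + h) - odds(t - h)) / (2.0 * h)
print(abs(theta_prime / (odds(t) + 1.0) - 2.0 * t) < 1e-4)  # True
```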
ln — natural logarithm
ML — maximum likelihood
n — total number of test units
z_H, z_M, z_L — high, medium, low stress levels respectively
z_D — specified design stress
p_1, p_2, p_3 — proportion of test units allocated to z_L, z_M and z_H, respectively
T — pre-specified period of time over which the reliability estimate is of interest
R(t; z) — reliability at time t, for given z
f(t; z) — pdf at time t, for given z
F(t; z) — cdf at time t, for given z
λ(t; z) = λ_0(t) e^{βz} ,  λ_0(t) = γ_0 + γ_1 t + γ_2 t²

Min ∫_0^T Var[(γ̂_0 + γ̂_1 t) e^{β̂ z_D}] dt

subject to

Σ = F⁻¹
expected number of failures at each stress level ≥ MNF
0 < p_i < 1, i = 1, 2, 3
Σ_{i=1}^{3} p_i = 1
where MNF is the minimum number of failures and Σ is the inverse of
Fisher's information matrix.
Other objective functions can be formulated, resulting in different designs of
the test plans. These include the D-optimal design, which provides efficient
estimates of the parameters of the distribution. It allows relatively efficient determination of all quantiles of the population, but the estimates are distribution dependent.
7.4.1.2 Numerical Example
An accelerated life test is to be conducted at three temperature levels for MOS
capacitors in order to estimate their life distribution at a design temperature of 50°C.
The test needs to be completed in 300 h. The total number of items to be placed
under test is 200 units. To avoid the introduction of failure mechanisms other than
those expected at the design temperature, it has been decided, through engineering
judgment, that the testing temperature should not exceed 250°C. The minimum
number of failures for each of the three temperatures is specified as 25. Furthermore, the experiment should provide the most accurate reliability estimate over a
10-year period of time.
Consider three stress levels; then the formulation of the objective function and
the test constraints follow the same formulation given in the above section. The
optimum plan derived (Elsayed and Jiao 2002) that optimizes the objective
function and meets the constraints is shown as follows:
z_L = 160°C, z_M = 190°C, z_H = 250°C
The corresponding allocations of units to each temperature level are:
p1 = 0.5, p2 = 0.4, p3 = 0.1
7.4.1.3 Concluding Remarks
Design of ALT plans plays a major role in providing accurate estimates of
reliability, mean time to failure and the variance of failure time at normal operating
conditions. These estimates have a major impact on many decisions during the
product life cycle such as maintenance schedules, warranty and repair policies and
replacement times. Therefore, the test plans should be robust (Pascual 2006), i.e.,
they should be:

1. Robust to planning values of the model parameters. This implies that ALTs
conducted at three or more stresses are more robust than those conducted at
two stresses. Allocating more units at the low stress level will also improve
the robustness of the plan.
2. Robust to the type of the underlying distribution. In other words, misspecification of the underlying distribution should not result in significant errors
in calculating reliability characteristics.
3. Robust to the underlying stress-life relationship. The commonly used
concept that higher stresses result in more failures might lead to the
wrong stress-life relationship. For example, testing circuit packs at higher
temperature reduces humidity, which in turn results in fewer failures than
those at field conditions. In essence, this is a deceleration test (higher stresses
show fewer failures).
Figure 7.1. Distributions of the time to failure at stress and normal conditions
Figure 7.2. Distributions of degradation paths with time at different stress levels (degradation paths at 60°C and 80°C approaching a failure threshold; axes: degradation path versus time)
where λ_0(t) = γ_0 + γ_1 t + γ_2 t²

3. A baseline experiment is conducted to obtain initial estimates for the model
parameters. These values are: γ_0 = 0.0001, γ_1 = 0.5, γ_2 = 0, β_1 = 3800,
and β_2 = 10.

Approximating γ_0 by zero, we write the hazard rate function as

λ(t; T, V) = 0.5 t exp(−(3800/T + 10/V))   (7.15)
The reliability and the probability density function (pdf) expressions are respectively

R(t; T, V) = exp(−0.25 t² exp(−(3800/T + 10/V)))   (7.16)

f(t; T, V) = 0.5 t exp(−(3800/T + 10/V)) exp(−0.25 t² exp(−(3800/T + 10/V)))   (7.17)
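Taking the hazard of Equation 7.15, λ(t; T, V) = 0.5 t exp(−(3800/T + 10/V)), the reliability and pdf follow directly from R(t) = exp(−∫_0^t λ(u) du) and f(t) = λ(t)R(t); the temperature and voltage values below are illustrative, not the chapter's design conditions:

```python
import math

def hazard(t, T, V):
    # Hazard rate of Equation 7.15
    return 0.5 * t * math.exp(-(3800.0 / T + 10.0 / V))

def reliability(t, T, V):
    # R(t) = exp(-0.25 * t^2 * exp(-(3800/T + 10/V)))
    return math.exp(-0.25 * t * t * math.exp(-(3800.0 / T + 10.0 / V)))

def pdf(t, T, V):
    # f(t) = hazard(t) * R(t)
    return hazard(t, T, V) * reliability(t, T, V)

# Consistency check: f(t) should match -dR/dt (central differences)
t, T, V = 2.0, 400.0, 5.0
h = 1e-4
numeric = -(reliability(t + h, T, V) - reliability(t - h, T, V)) / (2.0 * h)
print(abs(numeric - pdf(t, T, V)) < 1e-8)  # True
```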
(Replacement cycle diagram: failure replacements during the cycle (0, t_p], followed by a preventive replacement at t_p.)
Let c(t_p) be the total replacement cost per unit time as a function of t_p. Then

c(t_p) = (total expected replacement cost in (0, t_p]) / t_p   (7.20)
The total expected cost in the interval (0, t_p] is the sum of the expected cost of
failure replacements and the cost of the preventive replacement. During the interval
(0, t_p], one preventive replacement is performed at a cost of c_p and M(t_p) failure
replacements are performed at a cost of c_f each, so

c(t_p) = (c_p + c_f M(t_p)) / t_p   (7.21)
c(t_p) = (10 + 1200 ∫_0^{t_p} f_n(t) dt) / t_p   (7.22)
Calculated values of the cost per unit time are shown in Table 7.1 and plotted in
Figure 7.4. The optimum preventive maintenance schedule at normal operating
conditions is 0.18 unit times.
Table 7.1. Time vs. cost per unit time values (minimum cost at t_p = 0.18)

Time:  0.13   0.14   0.15   0.16   0.17   0.18   0.19   0.20
Cost:   —     885    862    847    839    836    840    848
Figure 7.4. Optimum preventive maintenance schedule (cost per unit time versus preventive replacement time t_p)
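The optimization behind results like Figure 7.4 can be sketched numerically. The failure-time distribution below is an assumed Weibull with shape 2 and scale 1, not the chapter's fitted model, and M(t_p) is approximated by the cdf, as in Equation 7.22:

```python
import math

def failure_cdf(t, shape=2.0, scale=1.0):
    # Assumed Weibull time-to-failure distribution (illustrative only)
    return 1.0 - math.exp(-((t / scale) ** shape))

def cost_rate(tp, cp=10.0, cf=1200.0):
    # Equation 7.21 with M(tp) approximated by F(tp)
    return (cp + cf * failure_cdf(tp)) / tp

# Grid search for the optimum preventive replacement interval
grid = [i / 1000.0 for i in range(10, 1001)]
tp_opt = min(grid, key=cost_rate)
print(round(tp_opt, 3), round(cost_rate(tp_opt), 1))
```

With a small preventive cost relative to the failure cost, the minimizing interval is short; in practice the fitted pdf at normal operating conditions would replace the assumed Weibull cdf.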
(7.23)
where D_i = 1.41 in. is the initial reinforcing bar diameter and t is the elapsed time.
Note that t ≥ T_1 and D(t) ≥ 0. For more details of Equation 7.23 the reader is
referred to Enright and Frangopol (1998).
The time-variant reinforced concrete strength, Mp(t), can now be evaluated
using the conventional design equations in Enright and Frangopol (1998):
M_p = n A_s f_y (d − a/2)   (7.24)

a = (n A_s f_y) / (0.85 f′_c b)   (7.25)
Note that A_s = π D(t)² / 4. The reinforcing steel and the concrete strengths are
f y and f c` , respectively. The number of reinforcing bars is n. The effective depth
and the width of the beam are d and b, respectively. For the current example, the
reliability of the strength exceeding a threshold x at time t is expressed as

R_x(t) = exp[ − x^γ / (b exp(at)) ]   (7.26)
The likelihood function for the observations x_ij (the jth observation at inspection
time t_i, with n_i observations at time t_i) is

L(γ, a, b, t) = ∏_i (γ / (b exp(a t_i)))^{n_i} ∏_i ∏_{j=1}^{n_i} x_ij^{γ−1} exp(−x_ij^γ / (b exp(a t_i)))   (7.27)
and the log likelihood is

ln L = Σ_i n_i ln γ − Σ_i n_i ln b − a Σ_i n_i t_i + (γ − 1) Σ_i Σ_j ln x_ij − Σ_i Σ_j x_ij^γ / (b exp(a t_i))   (7.28)
Maximizing the log likelihood yields the parameter estimates, and the fitted
reliability function is

R_x(t) = exp[ − x^γ / (b exp(at)) ]

or

R_x(t) = exp[ − x^{1.49} / (1.1346598 × 10⁷ exp(−0.12 t)) ] .
The reliability for different threshold values of the strength is shown in Figure
7.5. The times to failure for threshold values of 4800, 4000, 3500, 3000, and 2500
are 25.04, 27.25, 28.88, 30.76, and 33.0 years, respectively.
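The fitted model can be evaluated directly to reproduce the shape of these curves; the sketch below uses the fitted values γ = 1.49, b = 1.1346598 × 10⁷ and a = −0.12 quoted above:

```python
import math

def strength_reliability(x, t, gamma=1.49, b=1.1346598e7, a=-0.12):
    # R_x(t) = exp(-x**gamma / (b * exp(a * t))): reliability that the
    # degrading strength still exceeds threshold x at time t (years)
    return math.exp(-(x ** gamma) / (b * math.exp(a * t)))

# Reliability falls with time and with higher strength thresholds:
for s in (2500, 3000, 3500, 4000, 4800):
    print(s,
          round(strength_reliability(s, 10), 3),
          round(strength_reliability(s, 30), 3))
```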
Figure 7.5. Reliability for different threshold levels (reliability versus time in years for s = 2500, 3000, 3500, 4000 and 4800)
The next step is to determine the optimum preventive maintenance schedule for
every threshold level and select the schedule corresponding to the smallest cost
among all optimum cost values. This will represent both the optimum threshold
level and the corresponding optimum preventive maintenance schedule.
We demonstrate this for two threshold levels (S = 4800 and S = 2500), assuming
c_p = 10 and c_f = 1200; we utilize Equation 7.21 as follows:
c(t_p) = (10 + 1200 ∫_0^{t_p} f(x; t) dt) / t_p   (7.29)
where

f(x; t) = (γ / θ(t)) x^{γ−1} exp(−x^γ / θ(t)) ,  t > 0 ,  θ(t) = b e^{at}   (7.30)
As shown in Figure 7.6, the optimum t_p values for S = 4800 and S = 2500 are 17
and 16 years respectively. The minimum of the two is the one corresponding to
S = 2500. Therefore, the optimum threshold is 2500 and the corresponding optimum
maintenance schedule is 16 years.
Figure 7.6. Cost per unit time versus time for threshold levels S = 4800 and S = 2500
7.7 Summary
In this chapter we present the common approaches for predicting reliability using
accelerated life testing. The models are classified as accelerated life testing models
(ALT) and accelerated degradation models (ADT). The ALT models are also
classified as accelerated failure time models, with assumed failure time distributions, and distribution-free models. We also modify the proportional odds model
to be used for reliability prediction with multiple stresses. Most of the research in
the literature does not extend the use of accelerated life testing beyond reliability
predictions at normal conditions. This is the first work that links the ALT to
maintenance theory and maintenance scheduling. We develop optimum preventive
maintenance schedules for both ALT models and degradation models. We
demonstrate how the reliability prediction models obtained from ALT can be used
in obtaining the optimum maintenance schedules. We also demonstrate the link
between the optimum degradation threshold level and the optimum maintenance
schedule. This work can be further extended to include other maintenance cost or
insurance of minimum availability level of a system. Further work is needed to
investigate the relationship between threshold levels at accelerated conditions and
those at normal conditions. Moreover, the models need to include the repair rate as
well as spares availabilities.
177
7.8 References
Agresti, A. and Lang, J.B., (1993) Proportional odds model with subject-specific effects for
repeated ordered categorical responses, Biometrika, 80, pp. 527–534
Bennett, S. (1983) Log-logistic regression models for survival data, Applied Statistics, 32,
165–171
Brass, W., (1971) On the scale of mortality, In: Brass, W., editor. Biological Aspects of
Mortality, Symposia of the Society for the Study of Human Biology. Volume X. London:
Taylor & Francis Ltd.: 69–110
Brass, W., (1974) Mortality models and their uses in demography, Transactions of the
Faculty of Actuaries, Vol. 33, 122–133
Ciampi, A. and Etezadi-Amoli, J., (1985) A general model for testing the proportional
hazards and the accelerated failure time hypotheses in the analysis of censored survival
data with covariates, Commun. Statist. - Theor. Meth., Vol. 14, pp. 651–667
Cox, D.R., (1972) Regression models and life tables (with discussion), Journal of the Royal
Statistical Society B, Vol. 34, pp. 187–208
Cox, D.R., (1975) Partial likelihood, Biometrika, Vol. 62, pp. 269–276
Eghbali, G. and Elsayed, E.A., (2001) Reliability estimate using degradation data, in
Advances in Systems Science: Measurement, Circuits and Control, Mastorakis, N. E.
and Pecorelli-Peres, L. A. (Editors), Electrical and Computer Engineering Series, WSES
Press, pp. 425–430
Elsayed, E.A., (1996) Reliability Engineering, Addison-Wesley Longman, Inc., New York
Elsayed, E.A. and Jiao, L., (2002) Optimal design of proportional hazards based accelerated
life testing plans, International Journal of Materials & Product Technology, Vol. 17,
Nos. 5/6, 411–424
Elsayed, E.A. and Zhang, H., (2005) Design of optimum simple step-stress accelerated life
testing plans, Proceedings of 2005 International Workshop on Recent Advances in
Stochastic Operations Research, Canmore, Canada
Elsayed, E.A. and Zhang, H., (2006) Design of PH-based accelerated life testing plans under
multiple-stress-type, to appear in Reliability Engineering and Systems Safety
Elsayed, E.A., Liao, H., and Wang, X., (2006) An extended linear hazard regression model
with application to time-dependent-dielectric-breakdown of thermal oxides, IIE
Transactions on Quality and Reliability Engineering, Vol. 38, No. 4, 329–340
Enright, M.P. and Frangopol, D.M., (1998) Probabilistic analysis of resistance degradation
of reinforced concrete bridge beams under corrosion, Engineering Structures, Vol. 20,
No. 11, pp. 960–971
Etezadi-Amoli, J. and Ciampi, A., (1987) Extended hazard regression for censored survival
data with covariates: a spline approximation for the baseline hazard function,
Biometrics, Vol. 43, pp. 181–192
Ettouney, M. and Elsayed, E.A., (1999) Reliability estimation of degraded structural
components subject to corrosion, Fifth ISSAT International Conference, Las Vegas,
Nevada, August 11–13
Hannerz, H., (2001) An extension of relational methods in mortality estimation,
Demographic Research, Vol. 4, pp. 337–368
Kalbfleisch, J.D. and Prentice, R.L., (2002) The Statistical Analysis of Failure Time Data,
John Wiley & Sons, New York
Liao, H., Elsayed, E.A., and Ling-Yau Chan, (2005) Maintenance of continuously monitored
degrading systems, European Journal of Operational Research, Vol. 75, No. 2, 821–835
Liao, H., (2004) Degradation models and design of accelerated degradation testing plans,
Ph.D. Dissertation, Department of Industrial and Systems Engineering, Rutgers
University
McCullagh, P., (1980) Regression models for ordinal data, Journal of the Royal Statistical
Society, Series B, Vol. 42, No. 2, 109–142
Meeker, W.Q. and Escobar, L.A., (1998) Statistical Methods for Reliability Data, John Wiley
& Sons, New York
Nelson, W., (2004) Accelerated Testing: Statistical Models, Test Plans, and Data Analyses,
John Wiley & Sons, New York
Oakes, D. and Dasu, T. (1990) A note on residual life, Biometrika, 77, pp. 409–410
Pascual, F.G., (2006) Accelerated life test plans robust to misspecification of the stress-life
relation, Technometrics, Vol. 48, No. 1, 11–25
Shyur, H-J., (1996) A general nonparametric model for accelerated life testing with
time-dependent covariates, Ph.D. Dissertation, Department of Industrial and Systems
Engineering, Rutgers University
Shyur, H-J., Elsayed, E.A. and Luxhoj, J.T., (1999) A general model for accelerated life
testing with time-dependent covariates, Naval Research Logistics, Vol. 46, 303–321
Tobias, P. and Trindade, D., (1986) Applied Reliability, Van Nostrand Reinhold Company,
New York
Zhang, H. and Elsayed, E.A., (2005) Nonparametric accelerated life testing based on
proportional odds model, Proceedings of the 11th ISSAT International Conference on
Reliability and Quality in Design, St. Louis, Missouri, USA, August 4–6
Zhao, W. and Elsayed, E.A., (2005) Optimum accelerated life testing plans based on
proportional mean residual life, Quality and Reliability Engineering International
8
Preventive Maintenance Models for Complex Systems
David F. Percy
8.1 Introduction
Preventive maintenance (PM) of repairable systems can be very beneficial in
reducing repair and replacement costs, and in improving system availability, by
reducing the need for corrective maintenance (CM). Strategies for scheduling PM
are often based on intuition and experience, though considerable improvements in
performance can be achieved by fitting mathematical models to observed data; see
Handlarski (1980), Dagpunar and Jack (1993) and Percy and Kobbacy (2000) for
example.
For systems comprising few components, and systems comprising many identical components, modelling and analysis using compound renewal processes
might be possible. Such situations are considered by Dekker et al. (1996) and Van
der Duyn Schouten (1996). However, many systems comprise a large variety of
different components and are too complicated for applying this methodology. We
refer to these as complex repairable systems.
This chapter reviews basic models for complex repairable systems, explaining
their use for determining optimal PM intervals. Then it describes advanced
methods, concentrating on generalized proportional intensities models, which have
proven to be particularly useful for scheduling PM. Computational difficulties are
addressed and practical illustrations are presented, based on sub-systems of oil
platforms and refineries.
The motivation is that, for complex systems, one needs to build models for failures based on the available maintenance history (PM and CM). Once a model is built, one can evaluate different PM strategies to determine the best one. The focus is on examining different models and determining which model is best supported by the historical data.
Section 8.2 presents some real examples of complex systems with historical
data sets. In each case, it discusses current maintenance policies and any problems
with collection or accuracy of the data. Section 8.3 considers the effects of PM and
CM actions upon system reliability and availability, so justifying the need for
modelling the operating situations in order to determine suitable scheduling strategies. In Section 8.4, we review the models that can be used for this purpose. We
also assess the relevance, strengths and weaknesses of each model and provide
references where readers can find more details.
The remainder of the chapter presents general recommendations for modelling
of complex systems in order to schedule PM in practice. Section 8.5 describes the
generalized proportional intensities model, Section 8.6 reviews the method of
maximum likelihood for estimating unknown model parameters, Section 8.7
addresses the problem of model selection, and considers statistical tests for this
purpose, and Section 8.8 looks at the scheduling problem. Finally, Section 8.9
applies these methods to some of the data of Section 8.2 and Section 8.10 presents
some concluding remarks.
For convenience, we now present a list of symbols and acronyms that are used
throughout this chapter.
PM          Preventive maintenance
CM          Corrective maintenance
ROCOF       Rate of occurrence of failures
NHPP        Nonhomogeneous Poisson process
T1, T2, …   Failure times of a system
X1, X2, …   Inter-failure times of a system
N(t)        Number of failures up to time t
H(t)        History of process up to time t
ι(t)        Intensity function
ι0(t)       Baseline intensity function
Po(μ)       Poisson distribution
F(x)        Cumulative distribution function
f(x)        Probability density function
R(x)        Reliability or survivor function
h(x)        Hazard function
h0(x)       Baseline hazard function
DRP         Delayed renewal process
DARP        Delayed alternating renewal process
VAM         Virtual age model
PHM         Proportional hazards model
IRM         Intensity reduction model
PIM         Proportional intensities model
GPIM        Generalized proportional intensities model
MLE         Maximum likelihood estimate
AIC         Akaike information criterion
BIC         Bayes information criterion
Table 8.1. Inter-failure times in days for three systems

Happy system    Sad system    Noncommittal system
15              177           51
27              65            43
32              51            27
43              43            177
51              32            15
65              27            65
177             15            32
Example 8.2 Percy et al. (1998) published a set of data relating to the reliability
and maintenance history of a valve in a petroleum refinery, as displayed in Table 8.2.
The two columns successively represent the times in days between maintenance
actions and the types of actions, where 0 indicates no failure (PM) and 1 indicates
failure (CM).
At first glance, this would appear to be a noncommittal system. However, on
further inspection, there appear to be fewer failures later on and more preventive
actions. Whether the PM is proving to be effective or the system is generally happy
is not easy to determine. Modelling can provide these answers though. Based on
these data, our ultimate goal is to decide how often to perform PM in future or on
similar systems.
When collecting such data, it is very important to record all PM and CM events
accurately, as errors of omission or commission can result in wrong decisions. For
example, if the first failure were not recorded, the average time until system failure
over the first 94 days would appear to be twice its actual value, perhaps suggesting
that PM is not required.
Table 8.2. Reliability and maintenance history of a petroleum refinery valve

Time since last action (days): 71, 186, 23, 14, 64, 207, 112, 136, 57, 66, 28, 37, 119, 139, 250, 206, 250, 144
(The accompanying type-of-action column, 0 for PM and 1 for CM, was lost in extraction.)
Example 8.3 Kobbacy et al. (1997) published a set of historical reliability and
maintenance data collected from a main pump at an oil refinery over a period of
nearly seven years. These data are reproduced in Table 8.3, with consecutive
observations reading down the columns successively from left to right.
Table 8.3. Reliability and maintenance history of a main oil refinery pump

Times since last actions (days), where an asterisk marks a PM action and the remaining times ended in failure (CM): 34*, 37, 22, 14, 28, 51, 21, 81*, 13, 38, 51, 86*, 27, 20, 15, 26, 156*, 28*, 18, 15*, 20*, 148*, 44, 35, 96*, 92, 26*, 44*, 47*, 13, 56, 37, 61, 45*, 13, 64, 36, 84*, 97, 67*, 8*, 12, 88*, 29, 62*, 12*, 65*, 30, 12, 27, 43*, 46, 102
(53 of the 65 recorded times survive in this extraction.)
such information. Much preventive maintenance is less specific in terms of particular systems but not in terms of the work involved, and applies more generally. For
example, motor vehicles might be serviced annually according to a strict checklist
procedure. The actual work conducted during PM can involve many tasks, such as
cleaning surfaces, lubricating joints, sharpening blades, replacing fluids, removing
waste, cooling down and redecorating. As for CM, we incur costs of PM due to
parts, labour and downtime, though these tend to be substantially less than for
repairs.
The challenge is to balance the costs of preventive maintenance with the
supposed improvements in system reliability. Too few PM actions means we incur
big CM costs and small PM costs, whereas too many PM actions means we incur
small CM costs and big PM costs. Unfortunately, there is no simple explanation of
how CM and PM affect system reliability. By modelling the failure patterns of
these systems mathematically, we can gain valuable insights about cost-effective
strategies for maintenance and replacement.
We generally model the time to first failure using a familiar lifetime probability
distribution or hazard function. However, this approach is inadequate for modelling
other times to failure, as the inter-failure times are neither independent nor
identically distributed in general (Ascher and Feingold 1984). Stochastic processes
form the appropriate basis for models to use under these circumstances. We are
interested in the probability that a system fails in the interval (t, t + δt], given the history of the process up to time t. We describe the behaviour of the failure process by the intensity function (identified here by the Greek letter iota):

ι(t) = lim_{δt→0} P{ N(t + δt) − N(t) ≥ 1 | H(t) } / δt   (8.1)
For an orderly process, where simultaneous failures are impossible, the intensity
function is equal to the derivative of the conditional expected number of failures:
ι(t) = (d/dt) E[ N(t) | H(t) ] .   (8.2)
Nonhomogeneous
Poisson process
CM
PM
Comments
References
Repair back to
(or replace by)
new item
Only CM actions,
zero repair times
Watson (1970)
Delayed renewal
process
Distributions for
failures after PM
and CM actions,
zero downtimes
Delayed alternating
renewal process
Fixed or random
downtimes
Proportional hazards
model
Different hazard
functions for
failures after PM
and CM actions
Cox (1972a);
Jardine et al. (1987);
Newby (1994);
Lutigheid et al. (2004)
Intensity reduction
model
CM minimal repair,
Doyen and Gaudoin
PM reduction in
(2004)
intensity function
Proportional intensities
model
Takes account of
covariates, CM as
minimal repairs
Cox (1972b);
Percy et al. (1998b)
Generalized proportional
intensities model
Both CM and PM
affect the intensity
function
(ii) { N(t) − N(s) } is independent of N(s)   [independence of increments]
(iii) { N(t) − N(s) } ~ Po( ∫_s^t ι(u) du )

(i) f(x; λ) = λ exp(−λx), x > 0   [exponential]
(ii) f(x; λ, α) = λ^α x^{α−1} exp(−λx) / Γ(α), x > 0   [gamma]
(iii) f(x; λ, α) = αλ (λx)^{α−1} exp{ −(λx)^α }, x > 0   [Weibull]
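As a quick numerical check of these parameterizations, the three densities can be coded directly; with α = 1 both the gamma and Weibull forms collapse to the exponential, the limiting case also noted later in the chapter. This is a minimal sketch and the function names are ours:

```python
import math

def exponential_pdf(x, lam):
    """f(x) = lam * exp(-lam * x), x > 0."""
    return lam * math.exp(-lam * x)

def gamma_pdf(x, lam, alpha):
    """f(x) = lam**alpha * x**(alpha-1) * exp(-lam * x) / Gamma(alpha), x > 0."""
    return lam**alpha * x**(alpha - 1) * math.exp(-lam * x) / math.gamma(alpha)

def weibull_pdf(x, lam, alpha):
    """f(x) = alpha * lam * (lam*x)**(alpha-1) * exp(-(lam*x)**alpha), x > 0."""
    return alpha * lam * (lam * x)**(alpha - 1) * math.exp(-(lam * x)**alpha)

# With alpha = 1, the gamma and Weibull densities both reduce to the exponential.
x, lam = 2.0, 0.5
print(exponential_pdf(x, lam))     # 0.5 * exp(-1)
print(gamma_pdf(x, lam, 1.0))      # same value
print(weibull_pdf(x, lam, 1.0))    # same value
```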
The form of the hazard function is precisely the same as the form of the intensity
function if we were to use a stochastic process to model the complex system. For a
nonhomogeneous Poisson process, this intensity function applies beyond the first
failure. However, successive hazard functions for inter-failure times have different
forms, which correspond to shifted and truncated versions of the distribution for
time to first failure.
Imperfect maintenance models must allow for the dynamic evolution of a system
and take account of hypothesized and observed knowledge about the effectiveness
of repairs. As mentioned above, this section reviews a variety of existing models for
repairable systems and describes suitable adaptations for systems that are subject to
preventive maintenance. In passing, we remark that time is used as the only scale of
measurement here. Some applications use running time instead, or both, such as the
flight time of an aircraft or the mileage and age of a car. Further details of such
variations are described by Baik et al. (2004) and Jiang and Jardine (2006).
8.4.1 Renewal Process (Maximal Repair)

ι(t) = ι0( t − t_{N(t)} )   (8.3)

where ι0(t) is the baseline intensity function, which would prevail if there were no system failures. As this is a renewal process, the baseline intensity function is equal to the hazard function for the inter-failure times: ι0(x) = h(x). The baseline intensity function can take many forms, including:

(i) ι0(t) = λ   [constant]
(ii) ι0(t) = λ exp(νt)   [loglinear]
(iii) ι0(t) = λ t^ν   [power-law]
The renewal process is a plausible first-order model for components or parts when the repair time is negligible, since complete replacement of a component after failure implies renewal instead of repair. Conversely, the renewal process is a poor model for complex systems, where repairs involve replacing or restoring just a fraction of the system's components. If a large portion of a system needs to be restored, it is often more economical to replace the entire system. Even if a repair restores the system's performance to its original specification, the presence of predominantly aged components implies that system reliability is not renewed.
8.4.2 Nonhomogeneous Poisson Process (Minimal Repair)
The assumptions underlying this model imply that, when a repair is carried out, a
system assumes the same condition that it was in immediately before failure. The
nonhomogeneous Poisson process (NHPP) differs from the homogeneous Poisson
process only in that the rate of occurrence of failures varies with time rather than
being constant. As mentioned early in this section, it is the fundamental model for
repairable systems. The NHPP is also the most appropriate model for the reliability of a complex system comprising infinitely many components. However, for a finite number of components, this model can only serve as an approximation, often a poor one, as the intensity function changes following each repair. In this model, the inter-arrival times X1, X2, X3, … are neither independent nor identically distributed.
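An NHPP with a monotone baseline intensity is easy to simulate by the standard Lewis–Shedler thinning method, which the chapter does not describe but which is useful for checking the models below. A minimal sketch with a power-law intensity ι(t) = λt^ν, assuming ν ≥ 0 so the intensity is bounded on the observation window; names and parameter values are illustrative:

```python
import random

def simulate_nhpp(lam, nu, horizon, rng):
    """Simulate failure times of an NHPP with power-law intensity
    i(t) = lam * t**nu on (0, horizon], by Lewis-Shedler thinning.
    Assumes nu >= 0, so i(t) <= lam * horizon**nu on the window."""
    bound = lam * horizon**nu
    t, times = 0.0, []
    while True:
        t += rng.expovariate(bound)   # candidate event from an HPP at the bounding rate
        if t > horizon:
            return times
        if rng.random() < (lam * t**nu) / bound:   # accept with probability i(t)/bound
            times.append(t)

rng = random.Random(1)
times = simulate_nhpp(lam=0.05, nu=0.5, horizon=100.0, rng=rng)
# Expected number of failures is lam * T**(nu+1) / (nu+1) = 0.05 * 1000 / 1.5, about 33.
print(len(times))
```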
ι(t) = ι0(t)   (8.4)

8.4.3 Delayed Renewal Process
Corrective maintenance (CM) corresponds to major or minor repair work and may
involve replacing the damaged components, whereas preventive maintenance (PM)
usually corresponds to minor interventions such as lubrication, cleaning and
inspection.
Given this structure, we assume that failure times after corrective operations are
independent and identically distributed, as are failure times after preventive
operations. However, we allow for different probability distributions in the two
cases and this defines the delayed renewal process (DRP). This is not a simple
renewal process, because of the different lifetime distributions following the two
types of action. However, the simple renewal process could be regarded as a
limiting case of the DRP, if corrective operations were to repair the system to the
same state as preventive operations. Maximal repairs involve restoring the system upon failure to its as-new condition. Similarly, if corrective operations were to
restore the system to the state immediately before failure, minimal repairs would
result. This is not strictly a special case of the delayed renewal process, but a
computer program could easily allow for this assumption if required. However, we
believe that minimal repairs are convenient for mathematical modelling but are not
always valid in practice.
As shown in Figure 8.2, define the random variables U and V to be the lifetimes
after PM and CM respectively. Their probability density functions, conditional
upon known parameters, are fU ( u ) and fV ( v ) respectively. These distributions
might take the exponential, gamma or Weibull forms defined earlier, to achieve the
required flexibility. Note that the exponential distribution is a limiting case of the gamma as α → 1 and of the Weibull as α → 1. The DRP assumes that downtimes are
negligible compared with the costs of parts and labour. We now consider the
effects of non-ignorable downtimes.
8.4.4 Delayed Alternating Renewal Process
The delayed renewal process described above assumes that the downtimes for
preventive and corrective maintenance are negligible when compared with the
lifetimes. It also assumes that the costs associated with these downtimes are
dominated by the costs of parts and labour. The model and analysis are further
complicated when we allow for periods of downtime, when maintenance actions
take place. In many applications involving continuous-process industries, the
principal costs are not due to parts and labour, but are due to lost production whilst
the system is down. Consequently, we must consider downtime costs and durations
when determining cost-effective strategies for scheduling PM.
This extension results in the delayed alternating renewal process (DARP), for
which analytical solution is not even feasible in practice. The downtimes following
preventive and corrective maintenance can be fixed or random. Since analytical
solution of the optimisation problems is not possible and we are adopting a
simulation approach here, either of these can be included in the calculations with
ease. In the following work, we consider them fixed to avoid confusion. Another
benefit of simulation over numerical solution of the renewal equations is that
anomalies are readily catered for, such as switching from CM to PM if the system
is in the failed state when PM is due. The DARP is illustrated in Figure 8.3.
The delayed alternating renewal process is appropriate when the time to replace
(or repair back to new) a failed item is non-zero. In this case, we have working and
failed states and these alternate. So far, we have only allowed for systems that
display no long-term trends, corresponding to improvement or deterioration. We
now discuss age-based models that allow for such trends. These models can also be
used for stationary and non-stationary systems when concomitant information is
available. We discuss these benefits later, as the need for including such extra
sources of information is described.
8.4.5 Virtual Age Model (Rejuvenation)
The virtual age model (VAM) modifies the hazard function for a system's inter-failure times at each corrective maintenance action. For these repairs, the system's virtual age at any given time is determined by a variety of additive or multiplicative age-reduction factors. This resets the system to a younger state, which is
only an approximation for reasons mentioned earlier. The intensity function of a
point process under the age reduction model may be additive
ι(t) = ι0( t − Σ_{i=1}^{N(t)} s_i )   (8.5)

or multiplicative

ι(t) = ι0( t ∏_{i=1}^{N(t)} s_i )   (8.6)

where the s_i are constants representing the age-reduction factors and ι0(t) is the baseline intensity function again.
In order to evaluate the intensity function for a sequence of failures under age
reduction, the renewal function governs the system failure pattern. The additive
model can generate negative intensities but the multiplicative model is suitable if
replacement components are infallible. The age-reduction model has been applied
to systems under a block replacement policy. A critical defect of the age-reduction
model and its many variants is that they do not provide a realistic description of the
failure processes. For example, replacing a corroded exhaust pipe does not reduce a
car's age, as very many other components are no less likely to fail.
8.4.6 Proportional Hazards Model
The proportional hazards model (PHM) is more flexible than the renewal process,
DRP and DARP, as it allows for non-stationarity. It is also more flexible than the
virtual age model because it allows for concomitant information. In principle, this
model appears to be inappropriate for representing a complex system, because
hazards naturally relate to lifetimes of components rather than inter-failure times of
processes. We cannot physically justify this model as readily as the proportional
intensities model described later. However, this does not invalidate its use in this
context as a statistical model rather than a mathematical model and considerable
h(u) = h0(u) exp( βᵀy )   (8.7)

and after CM

h(v) = h0(v) exp( γᵀz )   (8.8)
We might consider other factors and covariates for inclusion here, representing the
concomitant information mentioned earlier. These could include:
Temporal, or continuously time-varying, covariates (time since last PM and time
since last CM) cause substantial computational difficulties. These may be avoided
by choosing baseline hazard functions that are sufficiently flexible. The vectors β and γ contain the regression coefficients, which generally take the form of unknown parameters. The results from extensive analyses demonstrate that this proportional hazards model is flexible, easy to use and of considerable practical value,
despite its doubtful mathematical suitability for modelling repairable systems.
8.4.7 Intensity Reduction Model (Correction)
Improvement factors feature in additive and multiplicative intensity reduction
models (IRM) for imperfect maintenance. Perhaps the most suitable of these is an
intensity reduction model that involves a multiplicative scaling of the intensity
function upon each failure and repair. This is the natural model for systems that are
improving or deteriorating with time and provides a perfect description of the
physical situation. This model can be expressed as an NHPP with intensity function
ι(t) = ι0(t) ∏_{i=1}^{N(t)} s_i   (8.9)

where the s_i are constants representing the intensity reduction factors and ι0(t) is the baseline intensity function again. We later generalize this model by supposing the s_i are simple functions of i, or are random variables that are independent of the
failure and repair process. Having concluded that this model is ideally suited to
modelling complex repairable systems, this chapter later considers how to extend it
to allow for preventive maintenance and concomitant information.
8.4.8 Proportional Intensities Model
Whilst the proportional hazards model offered a valuable generalization of the
delayed renewal process and delayed alternating renewal process to allow for nonstationarity and concomitant information, it is not the natural model for repairable
systems. The natural model takes the form of a nonhomogeneous Poisson process
and is the essence of the proportional intensities model (PIM), which is the subject
of this subsection and is a generalization of the intensity reduction model described
above.
Define the random variable N ( t ) as the number of system failures by time t .
Then the NHPP is characterised by conditionally independent increments, corresponding with conditionally independent times between failures that occur with
intensity
ι(t) = lim_{δt→0} P{ N(t + δt) − N(t) ≥ 1 | H(t) } / δt   (8.10)
at system age t units, where H ( t ) is the history of the process. However, the NHPP
corresponds with minimal repair as in Section 8.4.2 and makes no allowances for
system improvement, or even deterioration, arising from maintenance actions.
Hence, we modify the intensity function by introducing a multiplicative factor,
so that we can express the intensity function as
ι(t) = ι0(t) exp( βᵀx_t ) ,   (8.11)
where the baseline intensity ι0(t) has a standard form such as constant, loglinear or power-law. Furthermore, the parameter vector β represents the regression
coefficients and the observation vector x t contains factors and covariates relating
to the system, such as the cumulative observations and concomitant information
mentioned in Section 8.4.6.
An alternative option arises when using the PIM to model a complex repairable
system subject to PM. Rather than adopting a global time scale for the baseline
intensity function as implied above, we could reset the time scale of the baseline
intensity function to zero upon each PM action. This introduces an element of
ι(t) = ι0(t) ∏_{i=1}^{M(t)} r_i ∏_{j=1}^{N(t)} s_j exp( βᵀx_t ) .   (8.12)
Here, ι0(t) is the baseline intensity function, whilst r_i > 0 and s_j > 0 are the intensity scaling factors for preventive maintenance (PM) and corrective maintenance (CM) actions respectively. Furthermore, M(t) and N(t) are the total numbers of PM and CM actions, whilst x_t is a vector of predictor variables and β is an unknown parameter vector of regression coefficients. One might expect the r_i and s_j to be less than one for a deteriorating system and greater than one for an improving system, though replacing failed components with used parts and accidentally introducing faults during maintenance can produce the opposite effects.
System copies can have different forms of baseline intensity function. For
reduction of intensity, the scaling factors can take the forms of positive constants,
random variables, deterministic functions of time ( t ) and events ( i and j ) or
stochastic functions of time and events. As for the intensity reduction model described in Section 8.4.7, a reasonable assumption for initial analysis is that r_i = ρ for i = 1, 2, …, M(t) and s_j = σ for j = 1, 2, …, N(t), in which case the GPIM corresponds with the PIM of Section 8.4.8. The vector of predictor variables might
include:
(i) the quality of the last maintenance action
(ii) the time since the last maintenance action
(iii) condition indicators, when available
The quality of maintenance affects the functionality of a system and its future
performance. Our justification for including the time since last maintenance here is
to allow for the possibility that maintenance interventions can introduce problems
similar to the burn-in of new components. The first of these is a discrete function
of time, whereas the second is a continuous function of time. Condition indicators,
when available, give direct and very strong guidance on the likely occurrence of
failures. They are typically discrete functions of time that vary at, and between,
maintenance actions.
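Equation (8.12) is straightforward to evaluate numerically once the counts M(t) and N(t), the covariates and the parameters are given. The sketch below assumes a power-law baseline and constant scaling factors ρ and σ; all names and numerical values here are illustrative assumptions, not estimates from the chapter:

```python
import math

def gpim_intensity(t, lam, nu, rho, sigma, m_t, n_t, beta, x_t):
    """GPIM intensity i(t) = i0(t) * rho**M(t) * sigma**N(t) * exp(beta . x_t),
    with a power-law baseline i0(t) = lam * t**nu and constant scaling
    factors r_i = rho (PM) and s_j = sigma (CM)."""
    baseline = lam * t**nu
    regression = math.exp(sum(b * x for b, x in zip(beta, x_t)))
    return baseline * rho**m_t * sigma**n_t * regression

# Illustrative values: 3 PM and 2 CM actions so far, one covariate.
i = gpim_intensity(t=200.0, lam=7e-4, nu=0.01, rho=0.67, sigma=0.73,
                   m_t=3, n_t=2, beta=[8e-3], x_t=[10.0])
print(i)   # each PM multiplies the intensity by rho, each CM by sigma
```

With ρ = σ = 1 and β = 0 the expression reduces to the bare baseline intensity, recovering the NHPP of Section 8.4.2.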
D = { (u_i, v_{ij}); i = 1, …, n; j = 1, …, n_i }   (8.13)

and

(8.14)

(8.15)

The likelihood function is then

L(β, γ; D) ∝ ∏_{i=1}^{n} { f(u_i) }^{1−c_i} { R(u_i) }^{c_i} ∏_{j=1}^{n_i} { f(v_{ij}) }^{1−d_{ij}} { R(v_{ij}) }^{d_{ij}}   (8.16)

This factorizes as

L(β, γ; D) = L(β; D) L(γ; D)   (8.17)

where

L(β; D) ∝ ∏_{i=1}^{n} { f(u_i) }^{1−c_i} { R(u_i) }^{c_i}   (8.18)

and

L(γ; D) ∝ ∏_{i=1}^{n} ∏_{j=1}^{n_i} { f(v_{ij}) }^{1−d_{ij}} { R(v_{ij}) }^{d_{ij}}   (8.19)

For the NHPP, with failures observed at times t_1, …, t_n over (0, T], the likelihood of the parameters θ is

L{ θ; H(T) } ∝ ∏_{i=1}^{n} ι(t_i) exp{ −∫_0^T ι(t) dt }   (8.20)
l{ θ; H(T) } = const. + Σ_{i=1}^{N(T)} log ι(t_i) − ∫_0^T ι(t) dt .   (8.21)
Therefore, once we specify the formulation of ( t ) , we can obtain estimates for its
unknown parameters via likelihood-based methods.
Example 8.4 Assuming T = t_{N(T)}, so that observation ceases at a failure, the maximum likelihood estimates (MLEs) can be determined analytically for the power-law process (NHPP with power-law intensity). With ι(t) = λt^ν and n = N(T), the MLEs are

ν̂ = n / Σ_{i=1}^{n} log( T / t_i ) − 1   (8.22)

and

λ̂ = n ( ν̂ + 1 ) / T^{ν̂+1} .   (8.23)

For a particular system, successive arrival times (not inter-arrival times) were observed to be 15, 42, 74, 117, 168, 233 and 410 days. With n = 7, T = 410 and t_1 = 15, …, t_7 = 410, we have ν̂ ≈ −0.3007 and then λ̂ ≈ 0.07288. As ν̂ < 0, the intensity is a strictly decreasing function of time; this is a happy system that seems to improve with age.
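The closed-form estimates of Example 8.4 can be checked in a few lines; the sketch below codes the reconstructed formulas (8.22) and (8.23) and reproduces the quoted values:

```python
import math

def power_law_mle(times):
    """MLEs for an NHPP with intensity i(t) = lam * t**nu, where observation
    ends at the final failure time T = t_n (Equations 8.22 and 8.23)."""
    n = len(times)
    T = times[-1]
    nu_hat = n / sum(math.log(T / t) for t in times) - 1.0
    lam_hat = n * (nu_hat + 1.0) / T**(nu_hat + 1.0)
    return lam_hat, nu_hat

lam_hat, nu_hat = power_law_mle([15, 42, 74, 117, 168, 233, 410])
print(round(nu_hat, 4), round(lam_hat, 5))   # -0.3007 0.07288
```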
Analysis of the intensity based models follows by extending this likelihood
function corresponding to the NHPP. Consider the generalized proportional intensities model of Section 8.5. The choice of which predictor variables to include
depends upon the sample size (history of failures) and the results of standard
selection procedures based on analyses of deviance for nested models. Only important predictors should be included in order to produce a robust model. We can
estimate the parameters in the model by maximum likelihood, on extending the
NHPP likelihood presented above, whereby the log-likelihood is given by
l{ θ; H(T) } = const. + Σ_{i=1}^{n} log ι(t_i) − Σ_{k=0}^{M(T)+N(T)} ρ^{M(t_k)} σ^{N(t_k)} ∫_{t_k}^{t_{k+1}} ι0(t) exp( βᵀx_t ) dt   (8.24)

where t_0 = 0 and t_{M(T)+N(T)+1} = T, the t_k being the successive maintenance epochs.
This corresponds to the simple case where the scaling factors are constant: minor
changes are needed for the more general cases.
U = ( Σ_{i=1}^{n} t_i − nT/2 ) / ( T √( n/12 ) )   (8.25)

with standard normal critical values, rejecting the null hypothesis of no trend if U ∉ ( −z_{p/2}, z_{p/2} ) for a hypothesis test at the 100p% level of significance, where the proportion p represents the size of the test. For a 5% significance test, the critical values are given by ±z_{p/2} = ±1.960.
If we decide that a system is nonstationary, we could use the VAM or PHM,
which are easier to fit to data than the stochastic processes considered next, but are
less robust because of their statistical rather than mathematical derivation.
However, all of these models require numerical computation to some extent. The
VAM and PHM might provide a better fit to the observed data on occasions,
If model M2 is nested within model M1, which has p1 > p2 parameters, then under the null hypothesis that the extra parameters are zero,

2 log( L1 / L2 ) ~ χ²( p1 − p2 )   (8.26)
and so we can test whether the extra parameters are significant. This is particularly
beneficial when choosing which elements to include in a linear predictor.
If the models M 1 and M 2 are not nested, we cannot use this formal test and
simply compare the log-likelihood functions log L1 and log L2 , choosing the model
with the larger log-likelihood. This is appropriate for choosing between gamma and
Weibull baseline hazard functions, for example. However, it is only valid if p1 = p2 ,
as a model with more parameters often fits better than a model with fewer parameters, by definition. To compare non-nested models with different numbers of parameters, we usually apply a correction factor to the log-likelihood functions.
Two common modified forms are the Akaike information criterion (AIC),
which suggests that we compare log L1 p1 with log L2 p2 , and the Schwarz
criterion, or Bayes information criterion (BIC), which suggests that we compare
log L1 ( p1 log n ) 2 with log L2 ( p2 log n ) 2 where n is the number of observations in the data set. The latter arises as the limiting case of the posterior odds
resulting from a Bayesian analysis with reference priors. In each case, the best
model to choose is the one that maximizes the information criterion.
Example 8.5 Suppose we fit two non-nested models to a set of lifetime data, based on n = 31 observed failures. The first model contains three parameters and has a likelihood of L1 = 8.742 × 10⁻¹⁸. The second model contains five parameters and has a likelihood of L2 = 3.110 × 10⁻¹⁷. The Bayes information criterion for the first model is log L1 − ( p1 log n )/2 ≈ −44.43 and for the second model it is log L2 − ( p2 log n )/2 ≈ −46.59, so we prefer the first, simpler model here.
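The arithmetic of Example 8.5 is easily verified; taking log as the natural logarithm reproduces the quoted criteria. A minimal sketch:

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Schwarz/Bayes information criterion in the form used here:
    log L - (p * log n) / 2; the larger value indicates the preferred model."""
    return log_likelihood - n_params * math.log(n_obs) / 2.0

n = 31
bic1 = bic(math.log(8.742e-18), 3, n)   # three-parameter model
bic2 = bic(math.log(3.110e-17), 5, n)   # five-parameter model
print(round(bic1, 2), round(bic2, 2))   # -44.43 -46.59
```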
K̄ = (1/m) Σ_{i=1}^{m} K_i   (8.27)

represents an unbiased estimator for the total cost per PM interval, where K_1, …, K_m are the simulated costs of m independent PM intervals. This enables us to estimate the expected cost per unit time as K̄ / t.
Now we must repeat the whole simulation for different values of t , using an
efficient search algorithm, to determine the value of t that minimises this expected
cost per unit time. This is the recommended PM interval duration. We advocate
direct search algorithms for practical implementation, such as golden-section search.
For practical purposes, t is unlikely to vary continuously and discrete values will
dominate. Convenient multiples of days, weeks or months provide suitable units of
measurement for practical implementation.
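As a loose illustration of this simulation approach (not the chapter's DARP model), the sketch below estimates cost per unit time for several candidate PM intervals under an assumed policy of renewal after every action, Weibull lifetimes and illustrative costs; every numerical value here is an assumption:

```python
import random

def sim_cost_rate(tau, horizon, pm_cost, cm_cost, shape, scale, rng):
    """One simulated history over `horizon` time units: PM every tau units,
    CM on failure, renewal after every action, Weibull(scale, shape)
    lifetimes, zero downtimes. Returns total cost / horizon."""
    t = cost = 0.0
    while t < horizon:
        life = rng.weibullvariate(scale, shape)  # time to next failure if unmaintained
        if life < tau:          # failure occurs before the next planned PM
            t += life
            cost += cm_cost
        else:                   # PM pre-empts the failure
            t += tau
            cost += pm_cost
    return cost / horizon

rng = random.Random(7)
for tau in (30, 60, 90, 180, 365):   # candidate PM intervals in days
    mean_rate = sum(sim_cost_rate(tau, 3650.0, 1.0, 10.0, 2.5, 120.0, rng)
                    for _ in range(200)) / 200
    print(tau, round(mean_rate, 4))
```

In practice the candidate intervals would be the convenient multiples of days, weeks or months mentioned above, with a direct search over them.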
For the NHPP, the number of failures in the interval (t, t + τ] is Poisson distributed:

P{ N(t + τ) − N(t) = n | H(t) } = [ Λ_t(τ) ]^n exp{ −Λ_t(τ) } / n!   (8.28)

where

Λ_t(τ) = ∫_t^{t+τ} ι(u) du .   (8.29)

In particular, the reliability over the interval is

R_t(τ) = P{ N(t + τ) − N(t) = 0 | H(t) } = exp{ −Λ_t(τ) } ,   (8.30)

(8.31)
This allows us to simulate the process as before, evaluate expected costs over a
finite horizon, and so deduce the most economical time for the next preventive
maintenance. This decision can be made at any specific event, such as during PM
or CM, or even between events, so long as the intensity function is known.
Next we consider the proportional hazards model. To avoid referring separately to the hazard functions for the lifetimes U and V, consider a general hazard function h(x). For the purposes of simulation in order to schedule PM in the future, the reliability function can be determined as

R(x) = exp{ −∫_0^x h(w) dw } ,   (8.32)

and the corresponding probability density function as

f(x) = −R′(x) = h(x) exp{ −∫_0^x h(w) dw } ,   (8.33)
8.9 Applications
We now apply some of these models to the data sets in Section 8.2.
Example 8.6 For each system, we fitted the intensity reduction model using
constant, loglinear and power-law baseline intensities with constant reduction
factors. Its goodness of fit is measured by the log-likelihoods in Table 8.5, obtained
using Mathcad software. For comparison, we also display the log-likelihoods for
the extremes of renewal process (maximal repairs) and nonhomogeneous Poisson
process (minimal repairs).
Table 8.5. Log-likelihoods for models fitted to the happy, sad and noncommittal systems

Model                 Baseline     Happy    Sad      Noncommittal
intensity reduction   constant     −33.7    −33.7    −35.5
                      loglinear    −32.4    −28.5    −33.4
                      power-law    −29.4    −32.0    −34.7
maximal repair        constant     −35.5    −35.5    −35.5
                      loglinear    −34.8    −34.8    −34.8
                      power-law    −35.1    −35.1    −35.1
minimal repair        constant     −35.5    −35.5    −35.5
                      loglinear    −34.8    −32.0    −35.2
                      power-law    −35.0    −31.8    −35.3
As expected, the intensity reduction model provides a good fit to all three
systems, preferring the power-law baseline intensity for the happy system and the
loglinear baseline intensity for the sad and noncommittal systems. Figure 8.4
shows that these baseline intensities are all increasing functions and any apparent
happiness is due to the high quality of repairs rather than a self-improving system.
Figure 8.4. Best fitting models for the happy, sad and noncommittal systems, respectively (intensity functions plotted against time from 0 to 410 days, vertical scale 0 to 0.18)
U = ( Σ_{i=1}^{n} t_i − nT/2 ) / ( T √( n/12 ) ) = ( 21,901 − 22 × 2,128 / 2 ) / ( 2,128 √( 22/12 ) ) ≈ −0.5230 .   (8.34)
As −1.960 < U < 1.960, the test is not significant at the 5% level and we conclude
that this test provides no evidence of non-stationarity for these data. Consequently,
the delayed renewal process might provide an adequate fit to these data, without
the need for a more complicated model. However, we might consider using the
DARP if downtime is important or one of the later models if concomitant
information is also available.
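The computation in Equation (8.34) is easily reproduced; the raw failure times are not listed here, so the sketch reuses the aggregates n, T and Σt_i from the worked example:

```python
import math

def laplace_u(failure_times, T):
    """Laplace trend test statistic
    U = (sum t_i - n*T/2) / (T * sqrt(n/12)) (Equation 8.25)."""
    n = len(failure_times)
    return (sum(failure_times) - n * T / 2.0) / (T * math.sqrt(n / 12.0))

# Aggregates from the worked example: n = 22 failures, sum t_i = 21,901,
# observation period T = 2,128 days.
n, T, total = 22, 2128.0, 21901.0
U = (total - n * T / 2.0) / (T * math.sqrt(n / 12.0))
print(round(U, 4))   # -0.523, inside (-1.96, 1.96): no evidence of trend
```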
Example 8.8 Here the data comprise 65 event observations collected over seven
years. In the first half of this period, there were 15 CM and 11 PM actions. In the
second half of this period, there were 29 CM actions and 10 PM actions. Hence,
this is a sad system, which might benefit from preventive maintenance. We fit the
generalized proportional intensities model to these data with explanatory variables
representing quality of last maintenance and time since last maintenance. A
loglinear baseline with constant reduction factors generates the results in Table 8.6.
Table 8.6. Log-likelihoods and parameter estimates for GPIM analyses of oil pump data

Predictor variables          Log-likelihood   Parameter estimates
(none)                       −211.7           5 × 10⁻⁴, 1.01, 0.719, 0.740
quality of last action       −210.2           6 × 10⁻⁴, 1.01, 0.699, 0.745, 6 × 10⁻³
time since last action       −210.8           7 × 10⁻⁴, 1.01, 0.666, 0.728, 8 × 10⁻³
quality of last action and
time since last action       −209.5           8 × 10⁻⁴, 1.01, 0.653, 0.734, 6 × 10⁻³, 7 × 10⁻³

(The signs of the regression coefficients could not be recovered from this extraction.)
The best model includes both quality of last maintenance action and time
since last maintenance action as predictor variables. This is not surprising, as it
contains six parameters whereas the model with no predictor variables has only
four. As the associated PM reduction factor is about two-thirds, preventive
maintenance reduces the intensity of critical failures for this system and so
improves its reliability. Although slightly less impressive, corrective maintenance
reduces the intensity function too. Hence, the maintenance workforce appears to be
very effective for this application! A graph of the intensity function for the GPIM
with both covariates follows in Figure 8.5, based on the corresponding parameter
estimates in the last row of Table 8.6.
Figure 8.5. Intensity function for GPIM analysis of oil pump data with two covariates (plot of λ(t; a, b, r, s, c₁, c₂) for 0 ≤ t ≤ 2487, intensity axis from 0 to 0.1)
We now perform a simulation analysis for this last model based on the methods
described in Section 8.8, in order to determine an optimal strategy for scheduling
preventive maintenance. Several convenient PM intervals are considered for our
calculations, including weekly, monthly, two-monthly, quarterly, biannually, annually and biennially. The minimum cost per unit time over a ten-year fixed horizon
is achieved with monthly PM and generates a projected 80% saving over annual
PM, though this estimated reduction in costs is sensitive to the choice of model.
The previously implemented policy averages about three PM actions per year, which our simulation estimates would cost about four times as much as the optimal policy of monthly PM.
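The simulation approach used here can be sketched in outline. The fragment below is a minimal Monte Carlo illustration, not the chapter's fitted GPIM: it assumes a hypothetical Weibull time-to-failure model in which both PM and CM renew the unit, with invented cost figures, and simply compares the cost per unit time of several candidate PM intervals over a fixed horizon.

```python
import math
import random

def cost_rate(pm_interval, beta, eta, cost_pm, cost_cm,
              horizon=3650.0, n_runs=500, seed=1):
    """Estimated cost per unit time of an age-based PM policy over a
    fixed horizon. Failures follow a (hypothetical) Weibull distribution
    with shape beta and scale eta; PM and CM both renew the unit."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        clock = 0.0
        cost = 0.0
        while clock < horizon:
            # Inverse-transform sample of a Weibull time to failure
            ttf = eta * (-math.log(1.0 - rng.random())) ** (1.0 / beta)
            if ttf < pm_interval:        # failure first: corrective action
                clock += ttf
                cost += cost_cm
            else:                        # unit survives to its scheduled PM
                clock += pm_interval
                cost += cost_pm
        total += cost / clock
    return total / n_runs

# Compare convenient PM intervals (in days), echoing the chapter's search
for days in (7, 30, 61, 91, 183, 365, 730):
    rate = cost_rate(days, beta=2.5, eta=200.0, cost_pm=1.0, cost_cm=20.0)
    print(days, round(rate, 4))
```

With a wear-out shape (beta > 1) and corrective actions much dearer than preventive ones, the cost rate typically has an interior minimum: very frequent PM wastes money, very sparse PM incurs too many failures.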
8.10 Conclusions
This chapter discussed the ideas of modelling complex repairable systems, with the
intention of scheduling preventive maintenance to improve operational efficiency
and reduce running costs. It started by emphasising the importance of improved,
accurate and complete data collection in practice. It then presented the renewal
process, delayed renewal process and delayed alternating renewal process as
reasonable models for systems that exhibit stationary failure patterns.
The virtual age model and proportional hazards model were described as
suitable for systems that do not exhibit stationarity and for systems where predictor
variables such as condition monitoring observations are also measured. The nonhomogeneous Poisson process, intensity reduction model and proportional intensities model, with a promising generalization, were described next. We claim that these models offer natural interpretations of the underlying physical reliability and maintenance processes.
Finally, this chapter demonstrated some applications of these ideas using
reliability and maintenance data taken from the oil industry and reviewed several
methods for model selection and goodness-of-fit testing, including graphs, Laplace
trend test, likelihood ratios and the Akaike and Bayes information criteria. The use
of mathematical modelling and statistical analysis in this fashion can improve, and
has improved, the quality of PM scheduling. This can then result in considerable
cost savings and help to improve system availability.
8.11 References
Ascher HE, Feingold H, (1984) Repairable Systems Reliability: Modeling, Inference, Misconceptions and their Causes. New York: Marcel Dekker
Baik J, Murthy DNP, Jack N, (2004) Two-dimensional failure modeling with minimal repair. Naval Research Logistics 51:345–362
Cox DR, (1972a) Regression models and life tables (with discussion). Journal of the Royal Statistical Society Series B 34:187–220
Cox DR, (1972b) The statistical analysis of dependencies in point processes. In Stochastic Point Processes (Lewis PAW, ed). New York: Wiley
Crowder MJ, Kimber AC, Smith RL, Sweeting TJ, (1991) Statistical Analysis of Reliability Data. London: Chapman and Hall
Dagpunar JS, Jack N, (1993) Optimizing system availability under minimal repair with non-negligible repair and replacement times. Journal of the Operational Research Society 44:1097–1103
Dekker R, Frenk H, Wildeman RE, (1996) How to determine maintenance frequencies for multi-component systems? A general approach. In Reliability and Maintenance of Complex Systems (Ozekici S, ed). Berlin: Springer
Doyen L, Gaudoin O, (2004) Classes of imperfect repair models based on reduction of failure intensity or virtual age. Reliability Engineering and System Safety 84:45–56
Handlarski J, (1980) Mathematical analysis of preventive maintenance schemes. Journal of the Operational Research Society 31:227–237
Jack N, (1998) Age-reduction model for imperfect maintenance. IMA Journal of Mathematics Applied in Business and Industry 9:347–354
Jardine AKS, Anderson PM, Mann DS, (1987) Application of the Weibull proportional hazards model to aircraft and marine engine failure data. Quality and Reliability Engineering International 3:77–82
Jiang R, Jardine AKS, (2006) Composite scale modeling in the presence of censored data. Reliability Engineering and System Safety 91:756–764
Kobbacy KAH, Fawzi BB, Percy DF, Ascher HE, (1997) A full history proportional hazards model for preventive maintenance scheduling. Quality and Reliability Engineering International 13:187–198
Lindqvist BH, Elvebakk G, Heggland K, (2003) The trend-renewal process for statistical analysis of repairable systems. Technometrics 45:31–44
9
Artificial Intelligence in Maintenance
Khairy A. H. Kobbacy
9.1 Introduction
Over the past two decades there has been substantial research and development in operations management, including maintenance. Kobbacy et al. (2007) argue that the continuous research in these areas implies that solutions have not been found to many problems. This was attributed to the fact that many of the proposed solutions were for well-defined problems, that the solutions assumed accurate data were available and that the solutions were too computationally expensive to be practical. Artificial intelligence (AI) has been recognised by many researchers as a potentially powerful tool, especially when combined with OR techniques, for tackling such problems. Indeed, there has been vast interest in the applications of AI in the maintenance area, as witnessed by the large number of publications in the area. This chapter reviews the application of AI in maintenance management and planning and introduces the concept of developing an intelligent maintenance optimisation system.
The outline of the chapter is as follows. Section 9.2 deals with various maintenance issues, including maintenance management, planning and scheduling. Section 9.3 introduces a brief definition of AI, some of its techniques that have applications in maintenance, and decision support systems. A review of the literature is then presented in Section 9.4, covering the applications of AI in maintenance. We have focused on five AI techniques, namely knowledge-based systems, case-based reasoning, genetic algorithms, neural networks and fuzzy logic. This review also covers hybrid systems, where two or more of the above-mentioned AI techniques are used in an application. Other AI techniques seem to have very few applications in maintenance to date. A discussion of the development of the prototype hybrid intelligent maintenance optimisation system (HIMOS), which was developed to evaluate and enhance preventive maintenance (PM) routines of complex engineering systems, follows in Section 9.5. HIMOS uses a knowledge-based system to identify suitable models to schedule PM activities and case-based reasoning to add the capability to utilise past experience in model selection. Future developments and
[...]
great promise and is indeed being investigated for application in more complex PM situations, e.g. multiple PM routines.
9.3 AI Techniques
AI is a branch of computer science that develops programmes to allow machines to perform functions normally requiring human intelligence (Microsoft ENCARTA College Dictionary 2001). The goal of AI is to teach machines to think, to a certain extent, under special conditions (Firebaugh 1988). There are many AI techniques; those most used in maintenance decision support are as follows.
Knowledge-based systems (KBS): use domain-specific rules of thumb or heuristics (production rules) to identify a potential outcome or suitable course of action.
Case-based reasoning (CBR): utilises past experiences to solve new problems. It uses case index schemes, similarity functions and adaptation. It provides machine learning through updating of the case base.
Genetic algorithms (GAs): these are based on the principle that solutions can evolve. Potentially promising solutions evolve through mutation, and weaker solutions become extinct.
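The evolutionary loop just described can be made concrete with a minimal sketch. Everything below (population size, mutation rate, the toy "one-max" fitness that counts 1-bits) is an illustrative assumption, not any particular maintenance application.

```python
import random

def evolve(fitness, pop_size=30, n_genes=16, generations=60, seed=42):
    """Minimal GA sketch: truncation selection (the weaker half goes
    extinct each generation), one-point crossover and occasional
    bit-flip mutation on fixed-length bit strings."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]       # weaker solutions go extinct
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_genes)
            child = a[:cut] + b[cut:]          # one-point crossover
            if rng.random() < 0.2:             # bit-flip mutation
                i = rng.randrange(n_genes)
                child[i] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Toy "one-max" problem: fitness is simply the number of 1-bits
best = evolve(sum)
print(sum(best))
```

In a maintenance setting the bit string would instead encode a candidate schedule and the fitness would be a cost or availability measure, as in the GA scheduling papers cited later in this chapter.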
Neural networks (NNs): use the back-propagation algorithm to emulate the behaviour of the human brain. Both NNs and GAs are capable of learning how to classify, cluster and optimise.
Fuzzy logic (FL): allows the representation of information of an uncertain nature. It provides a framework in which membership of a category is graded, and hence quantifies such information for mathematical modelling, etc.
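Graded membership can be illustrated with triangular membership functions. The categories and breakpoints below, for a hypothetical "time since last maintenance" variable measured in days, are invented for illustration.

```python
def triangular(a, b, c):
    """Triangular fuzzy membership function: rises from zero at a,
    peaks (membership 1) at b, and falls back to zero at c."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

# Hypothetical graded categories for "time since last maintenance" (days)
short = triangular(-1, 0, 30)
medium = triangular(15, 45, 75)
long_ = triangular(60, 90, 365)

x = 40.0
memberships = {"short": short(x), "medium": medium(x), "long": long_(x)}
print(memberships)   # x = 40 belongs to "medium" to degree 25/30, the others 0
```

Unlike a crisp threshold, a value such as 40 days holds partial membership of a category, which is what allows fuzzy rules to blend smoothly in the applications surveyed below.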
There are several other AI techniques and these include Data Mining, Robotics
and Intelligent Agents. However, to date very few publications are available about
their applications in maintenance.
9.3.1 Intelligent Decision Support Systems
A useful definition of DSS is as follows: a computer-based system that helps decision makers confront ill-structured problems through direct interaction with data, analysis and models (Sprague and Watson 1986).
The result of integrating an AI technique within a DSS is referred to in this
chapter as an Intelligent DSS. This is essentially a DSS as defined above, but has
the additional capabilities to understand, suggest and learn in dealing with
managerial tasks and problems. The method of integration and the features of the
end product depend very much on the area of application.
9.4 AI in Maintenance
AI techniques have been used successfully in the past two decades to model and
optimise maintenance problems. Since the resurgence of AI in the mid-1980s, researchers have considered the applications of AI in this field. The article by
213
Dhaliwal (1986) is one of the early ones that argued for the appropriateness of
using AI techniques for addressing the issues of operating and maintaining large
and complex engineering systems. Kobbacy (1992) discusses the useful role of
knowledge-based systems in the enhancement of maintenance routines. Over the years the applications of AI in maintenance have grown to cover a very wide area, using a variety of AI techniques. This can be explained by the individual nature of each technique. For example, GAs and NNs have the advantage of being useful in optimising complex and nonlinear problems and overcome the limitations of the classic black-box approaches, where an attempt is made to identify the system by relating system outputs to inputs without understanding and modelling the underlying process. Hence the widespread applications in the scheduling area and also in fault diagnosis.
In this section, an up-to-date survey is presented covering the area of application of AI techniques in maintenance, including fault diagnosis. This chapter will
only refer to some of the references in the vast applications of AI in fault diagnosis.
Interested readers can refer to the recent comprehensive review by Kobbacy et al.
(2007) on applications of AI in Operations.
9.4.1 Case Based Reasoning (CBR)
CBR is an interesting AI technique which adds learning capabilities to DSS. This may explain the lack of publications on using CBR on its own in maintenance. Instead, there are a few hybrid applications which utilise CBR together with other AI techniques. Details of the CBR technique are discussed while presenting the case study in Section 9.5.3.
Yu et al. (2003) present a problem-oriented multi-agent-based E-service system (POMAESS). The system uses a CBR-based decision support function. The case study, which is discussed later in this chapter, deals with a hybrid KBS/CBR maintenance optimisation system (HIMOS).
More publications are found in fault diagnosis, including papers on its application in locomotive diagnostics, e.g. Varma and Roddy (1999). Xia and Rao (1999) argue for the need to develop dynamic CBR, which introduces new mechanisms such as time-tagged indexes and dynamic and multiple indexing to help solve problems accurately, taking into account system dynamics and fault propagation phenomena. Cunningham et al. (1998) describe an incremental CBR mechanism that can initiate the fault diagnosis process with only a few features.
There are also papers on hybrid CBR systems in fault diagnosis, including the use of CBR with Petri nets for induction motor fault diagnosis (Tang et al. 2004), CBR with FL in fault diagnosis of modern commercial aircraft (Wu et al. 2004), CBR with NN in a web-based intelligent fault diagnosis system (Hui et al. 2001), CBR with heuristic reasoning and hypermedia for incident monitoring (Rao et al. 1998) and CBR with KBS in a pattern search problem in fault diagnosis (Kohno et al. 1997).
[...] combines FL and NNs for the classification of defects by extracting features in segmented buried pipe images.
Applications of FL in fault diagnosis include fault diagnosis of railway wheels (Skarlatos et al. 2004), thrusters for an open-frame underwater vehicle (Omerdic and Roberts 2004), chemical processes (Dash et al. 2003) and rolling element bearings in machinery (Mechefske 1998).
[...]
Figure 9.1 illustrates the conceptual structure of HIMOS, which is divided into two areas. The DSS contribution area contains a database to store maintenance historical data, a model base for data analysis models and optimisation models, and a user interface to communicate with the user. In the AI contribution area, there are two bases which contain experts' knowledge: the knowledge base and the case base.
9.5.3.1 HIMOS Procedure
Figure 9.2 illustrates the model selection for a data set consisting of a sequence of preventive maintenance (PM) and corrective action (CO) events, to enable calculation of the optimal PM interval. HIMOS has the ability to use a set of production rules to select and then optimise a suitable model, in order to provide an evaluation of the current maintenance routine and to propose an optimal policy. These rules are acquired from experts' knowledge and may require subjective judgements to be made. The processor of HIMOS identifies data patterns through a data analysis procedure and then selects the most appropriate model for a given data set by consulting the rule base. If a data set cannot be matched by any of the KBS rules, then the system attempts to use CBR to identify a suitable model.
Data Formatting and Analysis After reading data from the input data file, the
system formats and checks the data to create a suitable data set for the next step of
analysis. Suspect or missing items of data are flagged in order to be sorted out by
the system or investigated by the user.
The analysis consists of five steps: recognition of PM and CO patterns, calculation of current availability, Weibull distribution fitting to failure times, trend tests of frequency and severity to establish whether or not the data are stationary with respect to frequency and severity, and, if applicable, analysis of multi-PM cases. In the first
step a basic analysis is carried out to identify the features of the data set such as the
numbers of PM and CO events and the mean lives to failure, so that the data set
can be compared with characteristic data patterns in the model selection process.
The data produced in this process are referred to as metadata.
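As a rough sketch of this first analysis step, the fragment below extracts illustrative "metadata" features from a chronological event history. The field names and the definition of mean life used here are assumptions for illustration, not HIMOS's actual implementation.

```python
def metadata(events):
    """Extract simple summary features ("metadata") from a chronological
    event history [(kind, time), ...] with kind in {"PM", "CO"}.

    Counts PM and CO events and takes the mean gap between successive
    CO (failure) events as an illustrative mean life to failure."""
    pm_times = [t for kind, t in events if kind == "PM"]
    co_times = [t for kind, t in events if kind == "CO"]
    gaps = [b - a for a, b in zip(co_times, co_times[1:])]
    mean_life = sum(gaps) / len(gaps) if gaps else None
    return {"n_pm": len(pm_times), "n_co": len(co_times), "mean_life": mean_life}

history = [("PM", 30), ("CO", 55), ("PM", 90), ("CO", 130), ("CO", 160)]
print(metadata(history))   # {'n_pm': 2, 'n_co': 3, 'mean_life': 52.5}
```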
Model Base The model base contains two sets of models: the data analysis models
and the PM scheduling optimisation models. The data analysis models identify a
data pattern which, together with the RBR/CBR, helps to select an optimisation model.
RULE 1:
If not matched
and there are multi-PMs
Then apply Multi-PM model
Matched

RULE 2:
If not matched
and trend test statistic of frequency is significantly large
and trend test statistic of severity of CO is significantly large
and trend test statistic of severity of PM is significantly large
Then apply NHPPScoSpm model
Matched

RULE 3:
If not matched
and trend test statistic of frequency is significantly large
and trend test statistic of severity of CO is significantly large
Then apply NHPPSco model
Matched
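In code, a rule base of this kind is simply an ordered list of condition/action pairs tried in turn. The sketch below is a hedged illustration of the three rules above; the metadata field names, model labels and the critical value are assumptions for illustration, not HIMOS internals.

```python
def select_model(meta, critical=1.96):
    """Try the production rules in order; the first rule whose
    conditions all hold selects a model. Falling through every rule
    (returning None) corresponds to handing over to the CBR phase."""
    rules = [
        (lambda m: m["multi_pm"],
         "Multi-PM"),
        (lambda m: m["trend_freq"] > critical
                   and m["trend_sev_co"] > critical
                   and m["trend_sev_pm"] > critical,
         "NHPPScoSpm"),
        (lambda m: m["trend_freq"] > critical
                   and m["trend_sev_co"] > critical,
         "NHPPSco"),
    ]
    for condition, model in rules:
        if condition(meta):
            return model
    return None   # no rule matched: fall back to case-based reasoning

meta = {"multi_pm": False, "trend_freq": 2.4,
        "trend_sev_co": 2.1, "trend_sev_pm": 0.3}
print(select_model(meta))   # NHPPSco
```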
Model Selection Using CBR CBR is an approach to problem solving that utilises past experiences to solve new problems. The first step in the operation of a CBR system is retrieval, in which the inputs are analysed to determine the critical features to use in retrieving past cases from the case database. Among the well-known methods for case retrieval is the nearest neighbour, which is used in HIMOS. To find the nearest neighbour matching the case being considered, the case with the largest weighted average of similarity functions for selected features is selected. In HIMOS four features were selected and all were given equal weights. These features are: number of PM events, number of CO events, trend value and variability of PM cycle length. These features were selected because they were found to be the main causes of failure to select a suitable model using the rule-based system. The similarity function was selected as the difference between the values of a feature in the current and retrieved cases divided by the standard deviation of the feature.
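The retrieval step can be sketched as follows. The similarity form used here (one minus the standardised absolute difference, averaged with equal weights over the four features) is an illustrative reading of the description above, not the exact HIMOS function, and the feature names and example cases are invented.

```python
def retrieve_nearest(current, case_base, features, stdev):
    """Nearest-neighbour retrieval: return the stored case with the
    largest equally weighted average of per-feature similarities,
    each based on the standardised difference from the current case."""
    def similarity(case):
        scores = [1.0 - abs(current[f] - case[f]) / stdev[f] for f in features]
        return sum(scores) / len(scores)   # equal weights on all features
    return max(case_base, key=similarity)

features = ["n_pm", "n_co", "trend", "pm_cycle_var"]
stdev = {"n_pm": 5.0, "n_co": 8.0, "trend": 1.0, "pm_cycle_var": 2.0}
case_base = [
    {"n_pm": 12, "n_co": 20, "trend": 2.1, "pm_cycle_var": 1.0, "model": "NHPPSco"},
    {"n_pm": 4,  "n_co": 6,  "trend": 0.2, "pm_cycle_var": 0.5, "model": "RP"},
]
current = {"n_pm": 5, "n_co": 7, "trend": 0.4, "pm_cycle_var": 0.6}
best = retrieve_nearest(current, case_base, features, stdev)
print(best["model"])   # RP
```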
Once the best-matching case has been retrieved, adaptation is carried out to reduce any prominent difference between the retrieved case and the current case through the derivational replay method. Thus, in the CBR phase, the system uses rules similar to those used in the KBS phase to find a solution. However, some critical values in the adaptation rules are more relaxed than in the original rules.
In the evaluation step the system displays multiple candidate models (possible solutions) with their critical features for the current case (adaptation results). The user can then evaluate these alternatives and select one using their expertise.
For the non-expert user, the system itself provides the "Recommended Model" as a result of evaluation. Here the system compares the results of adaptation with the results of retrieval. If there is no matching model then no recommendation is made; otherwise the system recommends the matching model. If there is more than one matching model, the system simply recommends the first-ranked (nearest neighbour) model.
9.5.3.2 Results and Validation of HIMOS
HIMOS results for a component include some basic statistics, e.g. the numbers of PM and CO events, current availability, etc. The most important result from the decision-maker's point of view is the recommended PM interval. The optimal availability gives an estimate of the availability which might be achieved if the recommended PM policy is implemented.
Table 9.1 shows the percentage success rate of HIMOS in modelling a large number of components. As can be seen, around two thirds of the components could not be modelled because no rule matched the data to a specific model. The introduction of case-based reasoning can add to the success rate of modelling components. The table also shows that the introduction of CBR reduces the percentage of cases where no suitable model was identified from 68.6% to 52.7%. Given the self-learning nature of CBR, where the case base expands with use, it is possible to improve the success rate with the extended use of the system in certain environments.
Table 9.1. Percentage use of maintenance models for HIMOS when applied to large systems, 1633 components in three data files (Kobbacy 2004)

                          HIMOS*
Model                 RBR      RBR+CBR
Stochastic:
  RP                  6.6      12.8
  NHPP                1.6      1.6
  NRP                 2.3      2.3
  Total stochastic    10.5     16.4
Geometric I           15.7     23.5
Geometric II          1.7      1.8
Weibull               1.7      3.7
Deterministic         1.8      1.9
No model suitable     68.6     52.7
HIMOS was validated using test cases by comparing the results of analysis of selected cases by HIMOS with the recommendations of an expert panel.
For the validation of HIMOS, eight data sets were used and a panel of five experts was involved. In general there was agreement between HIMOS and the experts. The experts had a measure of disagreement in their advice, as a result of making different assumptions in their analyses. The experts also made useful suggestions for the operation of the system. Table 9.2 is a typical example of HIMOS and the experts' recommendations.
Table 9.2. Example of validation of HIMOS: recommendations of HIMOS and Experts A-E for Data Set 3 (recommendations include "Increase PM interval")
[...]
Figure 9.3. Outline design of AMMCM (Adaptive Maintenance Measurement and Control Model)
[...]
9.8 Acknowledgments
The author wishes to acknowledge the contributions of those who collaborated at
the various stages of the development of IMOS and HIMOS. In particular I wish to acknowledge the significant contribution of A.W. Labib in developing the proposal for the AMMCM presented in Section 9.6.
9.9 References
Acosta, G.G., Verucchi, C.J. and Gelso, E.R. (2006), A current monitoring system for diagnosing electrical failures in induction motors, Mechanical Systems and Signal Processing, 20, 953–965.
Ahmed, K., Langdon, A. and Frieze, P.A. (1991), An expert system for offshore structure inspection and maintenance, Computers and Structures, 40, 143–159.
Al-Garni, A.Z., Jamal, A., Ahmad, A.M., Al-Garni, A.M. and Tozan, M. (2006), Neural network-based failure rate prediction for De Havilland Dash-8 tires, Engineering Applications of Artificial Intelligence, 19, 681–691.
Al-Najjar, B. and Alsyouf, I. (2003), Selecting the most efficient maintenance approach using fuzzy multiple criteria decision making, International Journal of Production Economics, 84, 85–100.
Ascher, H.E. and Kobbacy, K.A.H. (1995), Modelling preventive maintenance for deteriorating repairable systems, IMA Journal of Mathematics Applied in Business & Industry, 6, 85–99.
Bansal, D., Evans, D.J. and Jones, B. (2004), A real-time predictive maintenance system for machine systems, International Journal of Machine Tools and Manufacture, 44, 759–766.
Baroni, P., Canzi, U. and Guida, G. (1997), Fault diagnosis through history reconstruction: an application to power transmission networks, Expert Systems with Applications, 12, 37–52.
Batanov, D., Nagarue, N. and Nitikhunkasem, P. (1993), EXPERT-MM: A knowledge-based system for maintenance management, Artificial Intelligence in Engineering, 8, 283–291.
Beaulah, S.A. and Chalabi, Z.C. (1997), Intelligent real-time fault diagnosis of greenhouse sensors, Control Engineering Practice, 5, 1573–1580.
Booth, C. and McDonald, J.R. (1998), The use of artificial neural networks for condition monitoring of electrical power transformers, Neurocomputing, 23, 97–109.
Braglia, M., Frosolini, M. and Montanari, R. (2003), Fuzzy criticality assessment model for failure modes and effects analysis, International Journal of Quality & Reliability Management, 20, 503–524.
Cavory, G., Dupas, R. and Goncalves, G. (2001), A genetic approach to the scheduling of preventive maintenance tasks on a single product manufacturing production line, International Journal of Production Economics, 74, 135–146.
Chan, F.T.S., Chung, S.H., Chan, L.Y., Finke, G. and Tiwari, M.K. (2006), Solving distributed FMS scheduling problems subject to maintenance: Genetic algorithms approaches, Robotics and Computer-Integrated Manufacturing, 22, 493–504.
Chen, Q., Chan, Y.W. and Worden, K. (2003), Structural fault diagnosis and isolation using neural networks based on response-only data, Computers & Structures, 81, 2165–2172.
Chootinan, P., Chen, A., Horrocks, M.R. and Bolling, D. (2006), A multi-year pavement maintenance program using a stochastic simulation-based genetic algorithm approach, Transportation Research Part A: Policy and Practice, 40, 725–743.
Cunningham, P., Smyth, B. and Bonzano, A. (1998), An incremental retrieval mechanism for case-based electronic fault diagnosis, Knowledge-Based Systems, 11, 239–248.
Dash, S., Rengaswamy, R. and Venkatasubramanian, V. (2003), Fuzzy-logic based trend classification for fault diagnosis of chemical processes, Computers & Chemical Engineering, 27, 347–362.
de Brito, J., Branco, F.A., Thoft-Christensen, P. and Sorensen, J.D. (1997), An expert system for concrete bridge management, Engineering Structures, 19, 519–526.
Dendronic Decisions Ltd (2003), www.dendronic.com/articles.htm.
Dhaliwal, D.S. (1986), The use of AI in maintaining and operating complex engineering systems, in Expert Systems and Optimisation in Process Control, A. Mamdani and J.E. Pstachion, eds, 28–33, Gower Technical Press, Aldershot.
Dragan, A.S., Walters, G.A. and Knezevic, J. (1995), Optimal opportunistic maintenance policy using genetic algorithms, 1: formulation, Journal of Quality in Maintenance Engineering, 1, 34–49.
Drury, C.G. and Prabhu, P. (1996), Information requirements of aircraft inspection: framework and analysis, International Journal of Human-Computer Studies, 45, 679–695.
Eldin, N.N. and Senouci, A.B. (1995), Use of neural networks for condition rating of joint concrete pavements, Advances in Engineering Software, 23, 133–141.
Feldman, R.M., William, M.L., Slade, T., McKee, L.G. and Talbert, A. (1992), The development of an integrated mathematical and knowledge-based maintenance delivery system, Computers & Operations Research, 19, 425–434.
Firebaugh, M.W. (1988), Artificial Intelligence: A Knowledge-based Approach, Boyd & Fraser Publishing Co., Danvers, MA, USA.
Frank, P.M. and Ding, X. (1997), Survey of robust residual generation and evaluation methods in observer-based fault detection systems, Journal of Process Control, 7, 403–427.
Frank, P.M. and Koppen-Seliger, B. (1997), New developments using AI in fault diagnosis, Engineering Applications of Artificial Intelligence, 10, 3–14.
Garcia, E., Guyennet, H., Lapayre, J.C. and Zerhouni, N. (2004), A new industrial cooperative tele-maintenance platform, Computers & Industrial Engineering, 46, 851–864.
Gilabert, E. and Arnaiz, A. (2006), Intelligent automation systems for predictive maintenance: A case study, Robotics and Computer-Integrated Manufacturing, 22, 543–549.
Gits, C.W. (1984), On the maintenance concept for a technical system, PhD Thesis, Eindhoven Technische Hogeschool, Eindhoven.
Gromann de Araujo Goes, A., Alvarenga, M.A.B. and Frutuoso e Melo, P.F. (2005), NAROAS: a neural network-based advanced operator support system for the assessment of systems reliability, Reliability Engineering & System Safety, 87, 149–161.
Hissel, D., Pera, M.C. and Kauffmann, J.M. (2004), Diagnosis of automotive fuel cell power generators, Journal of Power Sources, 128, 239–246.
Huang, S.J. (1998), Hydroelectric generation scheduling: an application of genetic-embedded fuzzy system approach, Electric Power Systems Research, 48, 65–72.
Hui, S.C., Fong, A.C.M. and Jha, G. (2001), A web-based intelligent fault diagnosis system for customer service support, Engineering Applications of Artificial Intelligence, 14, 537–548.
Jeffries, M., Lai, E., Plantenberg, D.H. and Hull, J.B. (2001), A fuzzy approach to the condition monitoring of a packaging plant, Journal of Materials Processing Technology, 109, 83–89.
Jha, M.K. and Abdullah, J. (2006), A Markovian approach for optimising highway life-cycle with genetic algorithms by considering maintenance of roadside appurtenances, Journal of the Franklin Institute, 343, 404–419.
Jota, P.R.S., Islam, S.M., Wu, T. and Ledwich, G. (1998), A class of hybrid intelligent system for fault diagnosis in electric power systems, Neurocomputing, 23, 207–224.
Khoo, L.P., Ang, C.L. and Zhang, J. (2000), A fuzzy-based genetic approach to the diagnosis of manufacturing systems, Engineering Applications of Artificial Intelligence, 13, 303–310.
Kobbacy, K.A.H. (1992), The use of knowledge-based systems in evaluation and enhancement of maintenance routines, International Journal of Production Economics, 24, 243–248.
Kobbacy, K.A.H. (2004), On the evolution of an intelligent maintenance optimisation system, Journal of the Operational Research Society, 55, 139–146.
Kobbacy, K.A.H. and Jeon, J. (2001), The development of a hybrid intelligent maintenance optimisation system (HIMOS), Journal of the Operational Research Society, 52, 762–778.
Kobbacy, K.A.H., Percy, D.F. and Fawzi, B.B. (1995a), Sensitivity analysis for preventive maintenance models, IMA Journal of Mathematics Applied in Business & Industry, 6, 53–66.
Kobbacy, K.A.H., Proudlove, N.L. and Harper, M.A. (1995b), Towards an intelligent maintenance optimisation system, Journal of the Operational Research Society, 46, 229–240.
Kobbacy, K.A.H., Fawzi, B.B., Percy, D.F. and Ascher, H.E. (1997), A full history proportional hazards model for preventive maintenance modelling, Quality and Reliability Engineering International, 13, 187–198.
Kobbacy, K.A.H., Percy, D.F. and Sharp, J.M. (2005), Results of preventive maintenance survey, unpublished report, University of Salford.
Kobbacy, K.A.H., Vadera, S. and Rasmy, M.H. (2007), AI and OR in management of operations: history and trends, Journal of the Operational Research Society, 58, 10–28.
Kohno, T., Hamada, S., Araki, D., Kojima, S. and Tanaka, T. (1997), Error repair and knowledge acquisition via case-based reasoning, Artificial Intelligence, 91, 85–101.
Kuo, H-C. and Chang, H-K. (2004), A new symbiotic evolution-based fuzzy-neural approach to fault diagnosis of marine propulsion systems, Engineering Applications of Artificial Intelligence, 17, 919–930.
Labib, A.W. (1998), World class maintenance using a computerised maintenance management system, Journal of Quality in Maintenance Engineering, 4, 66–75.
Lee, C-K. and Kim, S-K. (2007), GA-based algorithm for selecting optimal repair and rehabilitation methods for reinforced concrete (RC) bridge decks, Automation in Construction, 16, 153–164.
Leung, D. and Romagnoli, J. (2002), An integration mechanism for multivariate knowledge-based fault diagnosis, Journal of Process Control, 12, 15–26.
Lin, C-C. and Wang, H-P. (1996), Performance analysis of rotating machinery using enhanced cerebellar model articulation controller (E-CMAC) neural networks, Computers and Industrial Engineering, 30, 227–242.
Luxhoj, J.T. and Williams, T.P. (1996), Integrated decision support for aviation safety inspectors, Finite Elements in Analysis and Design, 23, 381–403.
Marseguerra, M., Zio, E. and Podofillini, L. (2004), A multiobjective genetic algorithm approach to optimisation of the technical specifications of a nuclear safety system, Reliability Engineering & System Safety, 84, 87–99.
Martland, C.D., McNeil, S., Acharya, D., Mishalani, R. and Eshelby, J. (1990), Applications of expert systems in railroad maintenance: Scheduling rail relays, Transportation Research Part A: General, 24, 39–52.
Mechefske, C.K. (1998), Objective machinery fault diagnosis using fuzzy logic, Mechanical Systems and Signal Processing, 12, 855–862.
Microsoft ENCARTA College Dictionary (2001), St Martin's Press, New York.
Miller, D., Mellichamp, J.M. and Wang, J. (1990), An image enhanced knowledge based expert system for maintenance trouble shooting, Computers in Industry, 15, 187–202.
Milne, R., Nicole, C. and Trave-Massuyes, L. (2001), TIGER with model based diagnosis: initial deployment, Knowledge-Based Systems, 14, 213–222.
Morcous, G. and Lounis, Z. (2005), Maintenance optimisation of infrastructure networks using genetic algorithms, Automation in Construction, 14, 129–142.
Nam, D.S., Jeong, C.W., Choe, Y.J. and Yoon, E.S. (1996), Operation-aided system for fault diagnosis of continuous and semi-continuous processes, Computers & Chemical Engineering, 20, 793–803.
Oke, S.A. and Charles-Owaba, O.E. (2006), Application of fuzzy logic control model to Gantt charting preventive maintenance scheduling, International Journal of Quality & Reliability Management, 23, 441–459.
Omerdic, E. and Roberts, G. (2004), Thruster fault diagnosis and accommodation for open-frame underwater vehicles, Control Engineering Practice, 12, 1575–1598.
Patel, S.A., Kamrani, A.K. and Orady, E. (1995), A knowledge-based system for fault diagnosis and maintenance of advanced automated systems, Computers & Industrial Engineering, 29, 147–151.
Percy, D.F., Kobbacy, K.A.H. and Ascher, H.E. (1998), Using proportional intensities models to schedule preventive maintenance intervals, IMA Journal of Mathematics Applied in Business & Industry, 9, 289–302.
Rao, M., Yang, H. and Yang, H. (1998), Integrated distributed intelligent system architecture for incidents monitoring and diagnosis, Computers in Industry, 37, 143–151.
Ruiz, D., Canton, J., Nougues, J.M., Espuna, A. and Puigjaner, L. (2001), On-line fault diagnosis system support for reactive scheduling in multipurpose batch chemical plants, Computers & Chemical Engineering, 25, 829–837.
Ruiz, R., Garcia-Diaz, C. and Maroto, C. (2006), Considering scheduling and preventive maintenance in the flowshop sequencing problem, Computers & Operations Research, 34, 3314–3330.
Saranga, H. (2004), Opportunistic maintenance using genetic algorithms, Journal of Quality in Maintenance Engineering, 10, 66–74.
Scenna, N.J. (2000), Some aspects of fault diagnosis in batch processes, Reliability Engineering & System Safety, 70, 95–110.
Sergaki, A. and Kalaitzakis, K. (2002), Reliability Engineering & System Safety, 77, 19–30.
Sharma, R., Singh, K., Singhal, D. and Ghosh, R. (2004), Neural network applications for detecting process faults in packed towers, Chemical Engineering and Processing, 43, 841–847.
Shayler, P.J., Goodman, M. and Ma, T. (2000), The exploitation of neural networks in automotive engine management systems, Engineering Applications of Artificial Intelligence, 13, 147–157.
Shyur, H.J., Luxhoj, J.T. and Williams, T.P. (1996), Using neural networks to predict component inspection requirements for aging aircraft, Computers & Industrial Engineering, 30, 257–267.
Simani, S. and Fantuzzi, C. (2000), Fault diagnosis in power plant using neural networks, Information Sciences, 127, 125–136.
Sinha, S.K. and Fieguth, P.W. (2006), Neuro-fuzzy network for the classification of buried pipe defects, Automation in Construction, 15, 73–83.
Skarlatos, D., Karakasis, K. and Trochidis, A. (2004), Railway wheel fault diagnosis using a
fuzzy-logic method, Applied Acoustics, 65, 951966.
230
K. Kobaccy
Sortrakul, N., Nachtmann, H.L. and Cassady, C.R. (2005), Genetic algorithms for integrated
preventive maintenance planning and production schedulling for a single machine,
Computers in Industry,56, 161168.
Spoerre, J.K. (1997), Application of the cascade correlation algorithm (CCA) to bearing
fault classification problems. Computers in Industry 32, 295304.
Sprague, R.H. and Watson, H.J. (1986) Decision support systems putting theory into
practice, Prentice Hall, Englewood Cliffs, New Jersey.
Srinivasan, D., Liew, A.C., Chen, J.S.P. and Chang, C.S. (1993) Intelligent maintenance
scheduling of distributed system components with operating constraints, Electric Power
Systems Research, 26, 203209.
Sudiaros, A. and Labib, A.W. (2002) A fuzzy logic approach to an integrated maintenance/
production scheduling algorithm, International Journal of Production Research, 40,
31213138.
Tan, J.S. and Kramer, M.A. (1997), A general framework for preventive maintenance
optimization in chemical process operations. Computers & Chemical Engineering 21,
14511469.
Tang, B-S., Jeong, S.K., Oh, Y-M. and Tan, A.C.C. (2004), Case-based reasoning system
with Petri nets for induction motor fault diagnosis, Expert Systems with Applications, 27,
301311.
Tarifa, E.E., Humana, D., Franco, S., Martinez, S.l. Nunez, A.F. and Scenna, N.J. (2003)
Fault diagnosis for MSF using neural networks, Desalination, 152, 215222.
Tsai, Y-T., Wang, K-S. and Teng, H-Y. (2001), Optimizing preventive maintenance for
mechanical components using genetic algorithms. Reliability Engineering & System
Safety 74, 8997.
Varde, P.V., Sankar, S. and Verma, A.K. (1998), An operator support system for research
reactor operations and fault diagnosis through a connectionist framework and PSA based
knowledge based system, Reliability Engineering and System safety, 60, 5369.
Varma, A. and Roddy, N. (1999), ICARUS: design and deployment of a case-based
reasoning system for locomotive diagnostics, Engineering Applications of Artificial
Intelligence 12, 681690.
Villanueva, H. and Lamba, H. (1997). Operator guidance system for industrial plant
supervision, Expert systems withy Applications, 12, 441454.
Wen, F. and Chang, C.S. (1998), A new approach to fault diagnosis in electrical distribution
networks using a genetic algorithm. Artificial Intelligence in Engineering 12, 6980.
Wu, H., Liu, Y., Ding, Y. and Qiu, Y. (2004), Fault diagnosis expert system for modern
commercial aircraft, Aircraft Engineering and Aerospace Technology, 76, 398403
Xia, Q. and Rao, M. (1999), Dynamic case-based reasoning for process operation support
systems. Engineering Applications of Artificial Intelligence 12, 343361.
Yang, B-S. and Kim, K.J. (2006) Applications of Dempster-Shafer theory in fault diagnisis
of induction motors, Mechanical systems and Signal Processing, 20, 403420.
Yang, B-S., Han, T. and Kim, Y-S (2004), Integration of ART-Kohonen neural network and
case-based reasoning for intelligent fault diagnosis, Expert Systems with Applications,
26, 387395.
Yang, B-S., Lim, D-S. and Tan, A.C.C. (2005), VIBEX : an expert system for vibtation fault
diagnosis of rotating machinery using decision tree and decision table, Expert Systems
with Applications, 28, 735742.
Yangping, Z., Bingquan, Z. and DongXin, W. (2000), Application of genetic algorithms to
fault diagnosis in nuclear power plants. Reliability Engineering & System Safety, 67,
153160.
Yu, R., Iung, B. and Panetto, H. (2003), A Multi-Agents based E-maintenance system with
case-based reasoning decision support, Engineering Applications of Artificial Intelligence, 16, 321333.
231
Zhang, H.Y., Chan, C.W., Cheung, K.C. and Ye, Y.J. (2001) Fuzzy artmap neural network
and its application to fault diagnosis of navigation systems, Automatica, 37, 10651070.
Zhao, Z. and Chen, C. (2001), concrete bridge deterioration diagnosis using fuzzy inference
system, Advances in Engineering Software, 32, 317325.
Part D
10
Maintenance of Repairable Systems
Bo Henry Lindqvist
10.1 Introduction
A commonly used definition of a repairable system (Ascher and Feingold 1984)
states that this is a system which, after failing to perform one or more of its
functions satisfactorily, can be restored to fully satisfactory performance by any
method other than replacement of the entire system. In order to cover more realistic
applications, and to cover much recent literature on the subject, we need to extend
this definition to include the possibility of additional maintenance actions which
aim at servicing the system for better performance. This is referred to as preventive
maintenance (PM), where one may further distinguish between condition based
PM and planned PM. The former type of maintenance is due when the system exhibits inferior performance while the latter is performed at predetermined points in
time.
Traditionally, the literature on repairable systems is concerned with modeling of
the failure times only, using point process theory. A classical reference here is
Ascher and Feingold (1984). The most commonly used models for the failure process of a repairable system are renewal processes (RP), including the homogeneous
Poisson processes (HPP), and nonhomogeneous Poisson processes (NHPP). While
such models are often sufficient for simple reliability studies, the need for more
complex models is clear. In this chapter we consider some generalizations and
extensions of the basic models, with the aim to arrive at more realistic models which
give better fit to data. First we consider the trend renewal process (TRP) introduced
and studied in Lindqvist et al. (2003). The TRP includes NHPP and RP as special
cases, and the main new feature is to allow a trend in processes of non-Poisson
(renewal) type.
As exemplified by some real data, in the case where several systems of the
same kind are considered, there may be unobserved heterogeneity between the
systems which, if overlooked, may lead to non-optimal or possibly completely
wrong decisions. We will consider this in the framework of the TRP process,
which in Lindqvist et al. (2003) is extended to the so-called HTRP model, which incorporates unobserved heterogeneity between systems.
The last extension of the basic models to be considered in the present chapter
consists of using Markov models to model the behavior of periodically inspected
systems in between inspections, with the use of separate Markov models for the
maintenance tasks at inspections.
Recent review articles concerning repairable systems and maintenance include Peña (2006) and Lindqvist (2006). A review of methods for analysis of recurrent events with a medical bias is given by Cook and Lawless (2002). General books on statistical models and methods in reliability, covering many of the topics considered here, are Meeker and Escobar (1998) and Rausand and Høyland (2004).
For a process of this kind, the conditional intensity function given the history $\mathcal{H}_t$ is defined by

$$\lambda_j(t) = \lim_{\Delta t \downarrow 0} \frac{\Pr(\text{event of type } j \text{ in } [t, t+\Delta t) \mid \mathcal{H}_t)}{\Delta t}. \qquad (10.1)$$

For a system with a single type of event (failures) we write simply $\lambda(t)$. From this we obtain an expression for the likelihood function, which is needed for statistical inference. Suppose that a single system as described above is observed from time 0 to time $\tau$, resulting in observations $T_1, T_2, \ldots, T_{N(\tau)}$. The likelihood function is then given by (Andersen et al. 1993, Section II.7)

$$L = \left\{ \prod_{i=1}^{N(\tau)} \lambda(T_i) \right\} \exp\left\{ -\int_0^\tau \lambda(u)\,du \right\}. \qquad (10.2)$$

If the system is repaired to the state of a new system at each failure (a perfect repair), the failure process is a renewal process and the conditional intensity is $\lambda(t) = z(t - T_{N(t-)})$, where $z$ is the hazard rate of the distribution of the times between failures and $t - T_{N(t-)}$ is the time since the last failure strictly before time $t$.
Suppose instead that after a failure, the system is repaired only to the state it had immediately before the failure, called a minimal repair. This means that the conditional intensity of the failure process immediately after the failure is the same as it was immediately before the failure, and hence is exactly as it would be if no failure had ever occurred. Thus we must have $\lambda(t) = z(t)$, which is the intensity of an NHPP.

It can be shown (Lindqvist et al. 2003) that the conditional intensity function, given the history $\mathcal{H}_t$, for the TRP$(F, \lambda(\cdot))$ is

$$\gamma(t) = z\bigl(\Lambda(t) - \Lambda(T_{N(t-)})\bigr)\,\lambda(t), \qquad (10.3)$$

where $z$ is the hazard rate of $F$ and $\Lambda(t) = \int_0^t \lambda(u)\,du$.
Inserting Equation 10.3 into the general likelihood formula, Equation 10.2, gives the likelihood of the TRP$(F, \lambda(\cdot))$ observed on $[0, \tau]$,

$$L = \left\{ \prod_{i=1}^{N(\tau)} z\bigl(\Lambda(T_i) - \Lambda(T_{i-1})\bigr)\,\lambda(T_i) \right\} \exp\left\{ -\int_0^\tau z\bigl(\Lambda(u) - \Lambda(T_{N(u-)})\bigr)\,\lambda(u)\,du \right\}. \qquad (10.4)$$

For the NHPP$(\lambda(\cdot))$ we have $z(t) \equiv 1$, so the likelihood simplifies to the well known expression (Crowder et al. 1991, p 166)

$$L = \left\{ \prod_{i=1}^{N(\tau)} \lambda(T_i) \right\} \exp\left\{ -\int_0^\tau \lambda(u)\,du \right\}.$$

An alternative form of the likelihood of the TRP is

$$L = \left\{ \prod_{i=1}^{N(\tau)} f\bigl[\Lambda(T_i) - \Lambda(T_{i-1})\bigr]\,\lambda(T_i) \right\} \bigl\{ 1 - F\bigl[\Lambda(\tau) - \Lambda(T_{N(\tau)})\bigr] \bigr\}. \qquad (10.5)$$

This latter form of the likelihood of the TRP follows directly from the definition, since the conditional density of $T_i$ given $T_1 = t_1, \ldots, T_{i-1} = t_{i-1}$ is $f[\Lambda(t_i) - \Lambda(t_{i-1})]\,\lambda(t_i)$, and the probability of no failures in the time interval $(T_{N(\tau)}, \tau]$, given $T_1, \ldots, T_{N(\tau)}$, is $1 - F[\Lambda(\tau) - \Lambda(T_{N(\tau)})]$.

This again simplifies if $\lambda(t) \equiv 1$, in which case it gives the likelihood of an RP$(F)$ observed on $[0, \tau]$.
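The defining property of the TRP — that the transformed times $\Lambda(T_i)$ form a renewal process with distribution $F$ — gives a direct way both to simulate such a process and to evaluate the likelihood (10.5) numerically. The following sketch uses an illustrative power trend $\Lambda(t) = ct^b$ and a Weibull renewal distribution scaled to mean 1 (the same parametric family used in the data examples below); all parameter values are assumptions for illustration, not values from the chapter.

```python
import math, random

def trp_simulate(tau, shape, scale, c, b, rng):
    """Simulate a TRP(F, lambda(.)) on [0, tau]: the times Lambda(T_i) form a
    renewal process with Weibull(shape, scale) gaps, where Lambda(t) = c*t**b."""
    times, s = [], 0.0
    while True:
        s += rng.weibullvariate(scale, shape)   # next renewal epoch on the Lambda scale
        t = (s / c) ** (1.0 / b)                # T_i = Lambda^{-1}(S_i)
        if t > tau:
            return times
        times.append(t)

def trp_loglik(times, tau, shape, scale, c, b):
    """Log of likelihood (10.5): Weibull density of the gaps on the Lambda
    scale, times the trend, plus the no-failure tail after the last event."""
    Lam = lambda t: c * t ** b
    lam = lambda t: c * b * t ** (b - 1.0)
    logf = lambda x: (math.log(shape / scale)
                      + (shape - 1.0) * math.log(x / scale)
                      - (x / scale) ** shape)
    ll, prev = 0.0, 0.0
    for t in times:
        ll += logf(Lam(t) - prev) + math.log(lam(t))
        prev = Lam(t)
    ll += -((Lam(tau) - prev) / scale) ** shape   # log(1 - F(.)) for the Weibull
    return ll

rng = random.Random(1)
shape = 0.68                                      # decreasing hazard, as in the valve-seat fit
scale = 1.0 / math.gamma(1.0 + 1.0 / shape)       # so that the mean gap is 1
times = trp_simulate(tau=100.0, shape=shape, scale=scale, c=0.5, b=1.3, rng=rng)
print(len(times), trp_loglik(times, 100.0, shape, scale, 0.5, 1.3))
```

With $b > 1$ the events become more frequent over time even though each gap, on the transformed scale, is an ordinary renewal.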
10.2.4 Observations from Several Similar Systems

Suppose that $m$ systems of the same kind are observed, where the $j$-th system ($j = 1, 2, \ldots, m$) is observed in the time interval $[0, \tau_j]$. For the $j$-th system, let $N_j$ denote the number of failures that occur during the observation period, and let the specific failure times be denoted $T_{1j} < T_{2j} < \cdots < T_{N_j j}$. Figure 10.3 illustrates the notation and explains the information given in a so-called event plot which is provided by computer packages for analysis of this kind of data (see examples below).

Figure 10.3. Observation of failure times of $m$ systems. The $j$-th system is observed over the time interval $[0, \tau_j]$, with $N_j \geq 0$ observed failures
Figure 10.4. Event plot for times of valve seat replacements for 41 diesel engines, taken
from Nelson (1995)
When data are available for m systems as described above, one will typically
assume that the systems behave independently but with the same probability laws
(i.i.d. rules). The total likelihood for the data will then be the product of the
likelihoods at Equations 10.4 or 10.5, one factor for each of the m systems.
Figure 10.5. Event plot for times of external leakage from nuclear plant valves, taken from
Bhattacharjee et al. (2003). In addition, 88 valves had no failures in 3286 days (9 years)
However, even if the $m$ systems are considered to be of the same type, they may well exhibit different probabilistic failure mechanisms. For example, systems may be used under varying environmental or operational conditions.
To cover such cases we shall assume that failures of the $j$-th system follow the process TRP$(F, \lambda_j(\cdot))$, $j = 1, \ldots, m$, where the renewal distribution $F$ is fixed and differences between systems are modeled by letting the trend functions $\lambda_j(t)$ vary from system to system. The assumption of a fixed $F$ parallels the NHPP case, where $F$ is the unit exponential distribution.
Assuming that systems work independently of each other, we obtain from Equation 10.5 the total likelihood

$$L = \prod_{j=1}^{m} \left\{ \prod_{i=1}^{N_j} f\bigl[\Lambda_j(T_{ij}) - \Lambda_j(T_{i-1,j})\bigr]\,\lambda_j(T_{ij}) \right\} \bigl\{ 1 - F\bigl[\Lambda_j(\tau_j) - \Lambda_j(T_{N_j j})\bigr] \bigr\}. \qquad (10.7)$$

In the heterogeneous model the trend function of the $j$-th system is taken to be $\lambda_j(t) = a_j \lambda(t)$, where the $a_j$ are unobserved positive random variables drawn independently from a distribution $H$ with expected value 1. Conditional on $a_j$, the contribution of the $j$-th system to the likelihood is

$$L_j(a_j) = \left\{ \prod_{i=1}^{N_j} f\bigl[a_j\{\Lambda(T_{ij}) - \Lambda(T_{i-1,j})\}\bigr]\,a_j \lambda(T_{ij}) \right\} \bigl\{ 1 - F\bigl[a_j\{\Lambda(\tau_j) - \Lambda(T_{N_j j})\}\bigr] \bigr\}. \qquad (10.8)$$

However, since the $a_j$ are unobservable, we need to take the expectation with respect to the $a_j$, giving

$$L_j = E[L_j(a_j)] = \int L_j(a_j)\,dH(a_j)$$
as the contribution to the likelihood from the j-th system. The total likelihood is
then the product
$$L = \prod_{j=1}^{m} L_j. \qquad (10.9)$$
We shall use the notation HTRP$(F, \lambda(\cdot), H)$ for the model with the likelihood at Equation 10.9. Here the renewal distribution $F$ and the heterogeneity distribution $H$ are distributions corresponding to positive random variables with expected value 1, while the basic trend function $\lambda(t)$ is a positive function defined for $t \geq 0$.
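For a parametric submodel the expectation in the likelihood can sometimes be computed in closed form, which gives a convenient check on a Monte Carlo evaluation of $L_j = \int L_j(a_j)\,dH(a_j)$. The sketch below does this for the HNHPP corner of the cube, assuming the multiplicative frailty form $\lambda_j(t) = a_j\lambda(t)$ with a power trend and a gamma $H$; the function names and parameter values are illustrative assumptions, not part of the chapter.

```python
import math, random

def loglik_given_a(a, times, tau, c, b):
    """Conditional NHPP log-likelihood of one system given its frailty a:
    intensity a*lambda(t) with lambda(t) = c*b*t**(b-1), Lambda(t) = c*t**b."""
    return (sum(math.log(a * c * b * t ** (b - 1.0)) for t in times)
            - a * c * tau ** b)

def marginal_loglik_mc(times, tau, c, b, v, n, rng):
    """log E_H[L_j(a)] with H = Gamma(shape=1/v, scale=v), so E(a)=1, Var(a)=v."""
    draws = [rng.gammavariate(1.0 / v, v) for _ in range(n)]
    return math.log(sum(math.exp(loglik_given_a(a, times, tau, c, b))
                        for a in draws) / n)

def marginal_loglik_exact(times, tau, c, b, v):
    """Gamma mixture in closed form:
    E[a^N exp(-a*Lam)] = Gamma(alpha+N) / (Gamma(alpha) theta^alpha (Lam+1/theta)^(alpha+N))."""
    alpha, theta = 1.0 / v, v
    N, Lam = len(times), c * tau ** b
    base = sum(math.log(c * b * t ** (b - 1.0)) for t in times)
    return (base + math.lgamma(alpha + N) - math.lgamma(alpha)
            - alpha * math.log(theta) - (alpha + N) * math.log(Lam + 1.0 / theta))

rng = random.Random(5)
times, tau, c, b, v = [0.5, 1.2, 1.9], 2.0, 0.8, 1.1, 0.5
mc = marginal_loglik_mc(times, tau, c, b, v, 200_000, rng)
exact = marginal_loglik_exact(times, tau, c, b, v)
print(round(mc, 3), round(exact, 3))   # the two values should agree closely
```

The closed form is what makes gamma a popular heterogeneity distribution; for other choices of $H$ the Monte Carlo (or a numerical quadrature) route remains available.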
A useful feature of the HTRP model is that several important models for repairable systems are easily represented as submodels. With the notation HPP, NHPP, RP and TRP used as before, we define corresponding models with heterogeneity as at Equation 10.8 by putting an H in front of the abbreviations. Specifically, from a full model, HTRP$(F, \lambda(\cdot), H)$, we can identify the seven submodels described in Table 10.1.

Table 10.1. The seven submodels of HTRP$(F, \lambda(\cdot), H)$. Here "exp" means the unit exponential distribution and "1" means the distribution degenerate at 1. The third column contains references to work on the corresponding models or special cases of them.
Submodel | HTRP-formulation
HPP$(\lambda)$ | HTRP$(\exp, \lambda, 1)$
RP$(F, \lambda)$ | HTRP$(F, \lambda, 1)$
NHPP$(\lambda(\cdot))$ | HTRP$(\exp, \lambda(\cdot), 1)$
TRP$(F, \lambda(\cdot))$ | HTRP$(F, \lambda(\cdot), 1)$
HHPP$(\lambda, H)$ | HTRP$(\exp, \lambda, H)$
HRP$(F, \lambda, H)$ | HTRP$(F, \lambda, H)$
HNHPP$(\lambda(\cdot), H)$ | HTRP$(\exp, \lambda(\cdot), H)$
The HTRP and the seven submodels may also be represented in a cube, as
illustrated in Figures 10.6 and 10.7. Each vertex of the cube represents a model,
and the lines connecting them correspond to changing one of the three coordinates in the HTRP notation. Going to the right corresponds to introducing a time
trend, going upwards corresponds to entering a non-Poisson case, and going backwards (inwards) corresponds to introducing heterogeneity. In analyzing data by
parametric HTRP models we shall see below how we use the cube to facilitate the
presentation of maximum log-likelihood values for the different models in a convenient, visual manner. The log-likelihood cube was introduced in Lindqvist et al.
(2003).
Example 1 (continued) Figure 10.6 shows the log-likelihood cube of the valve-seat
data. It should be noted that each arrow points in a direction where exactly one
parameter is added (see text of Figure 10.6 for definitions of parameters). Using
standard asymptotic likelihood theory we know that if this parameter has no
influence in the model, then twice the difference in log likelihood is approximately
chi-square distributed with one degree of freedom. For example, if twice the
difference is larger than 3.84, then the p-value of no significant difference is less
than 5% and we have an indication that the extra parameter in fact has some
relevance. Note that adding an extra parameter will always lead to a larger value of
the maximum log likelihood, but from what we just argued, the difference needs to
be more than, say, 3.84 / 2 = 1.92 to be of real interest.
Figure 10.6. The log-likelihood cube for the Nelson valve seat data of Nelson (1995), fitted with a parametric HTRP$(F, \lambda(\cdot), H)$ model and its sub-models. Here $F$ is a Weibull distribution with expected value 1 and shape parameter $s$, $\lambda(t) = cbt^{b-1}$ is a power function of $t$, and $H$ is a gamma distribution with expected value 1 and variance $v$. The maximum value of the log likelihood is denoted $\ell$
Looking at the valve-seat data cube in Figure 10.6 we note first that in going from a vertex of the front face to the corresponding vertex of the back face (adding H in front of the model acronym) there is never much to gain (1.17 at most, from HPP to HHPP). This indicates no apparent heterogeneity between the various engines. By comparing the left and right faces we conclude, however, that there seems to be a gain in including a time trend. Having already excluded heterogeneity we are thus faced with the possibilities of either NHPP or TRP. Here the latter model wins, since the difference in log-likelihood is as large as $(-343.66) - (-346.49) = 2.83$, and twice the difference equals 5.66, corresponding to an approximate p-value of 0.017.
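The chi-square computation behind these p-values needs nothing more than the complementary error function, since for one degree of freedom $P(X > x) = \mathrm{erfc}(\sqrt{x/2})$. A quick check of the numbers quoted above:

```python
import math

def chi2_sf_1df(x):
    """P(X > x) for a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Twice the log-likelihood gain of TRP over NHPP for the valve-seat data:
lr = 2.0 * ((-343.66) - (-346.49))
p = chi2_sf_1df(lr)
print(round(lr, 2), round(p, 3))        # 5.66 and approximately 0.017

# The 5% threshold quoted in the text:
print(round(chi2_sf_1df(3.84), 2))      # approximately 0.05
```

This reproduces both the 3.84 critical value and the p-value of roughly 0.017 for the TRP versus NHPP comparison.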
The resulting estimated TRP is seen to have a renewal distribution which is Weibull with shape parameter 0.6806, which implies a decreasing failure rate. This means that the conditional intensity function will jump upward at each failure, which may be explained by burn-in problems at each valve-seat replacement. Further, there will be an estimated time trend of the form

$$\hat{\lambda}(t) = 3.26 \times 10^{-6} \times 1.929\, t^{0.929} = 6.29 \times 10^{-6}\, t^{0.929},$$

which increases with $t$, so that replacements are becoming more and more frequent.
Example 2 (continued) For the closing valve failures considered by Bhattacharjee et al. (2003), previous studies had shown significant variations in the number of failures between valves. The expected numbers of failures in 3286 days are hence 0.125 and 4.99, respectively, for the good and bad valves.
Figure 10.7. The log-likelihood cube for the data of Bhattacharjee et al. (2003) concerning failures of motor operated closing valves in nuclear reactor plants in Finland, fitted with a parametric HTRP$(F, \lambda(\cdot), H)$ model and its sub-models. Here $F$ is a Weibull distribution with expected value 1 and shape parameter $s$, $\lambda(t) = cbt^{b-1}$ is a power function of $t$, and $H$ is a two-point distribution with unit expectation, giving probability $p$ for the value "low" and $1 - p$ for the value "high". The maximum value of the log likelihood is denoted by $\ell$
Doyen and Gaudoin (2006) recently presented a point process approach for
modeling of such competing risks situations between failure and PM. A general
setup for this kind of process is furthermore suggested in the review paper
Lindqvist (2006).
For simplicity we shall in this chapter consider only the case where the
component or system is perfectly repaired or maintained at the end of each sojourn.
This will lead to the observation of independent copies of the competing risks
situation in the same way as for a renewal process. We will therefore in the following consider only a single sojourn and hence suppress the subscripts of the observed times. Thus we let $X$ and $Z$ be, respectively, the potential time to failure and the potential time to PM of a single sojourn. Then $Y = \min(X, Z)$ is the observed sojourn, and in addition we observe the indicator variable $\delta$, which we define to be 1 if there is a PM ($Y = Z$) and 0 if there is a failure ($Y = X$). This situation has been
extensively studied by Cooke (1993, 1996), Bedford and Cooke (2001), Langseth
and Lindqvist (2003, 2006), Lindqvist et al. (2006) and Lindqvist and Langseth
(2005).
Thus note that the observable result is the pair $(Y, \delta)$, rather than the underlying times $X$ and $Z$, which may often be the times of interest. For example, knowing the marginal distribution of the potential failure time $X$ would be of considerable interest when judging the effect of a PM policy.
Define the sub-distribution functions $F_X^*(t) = P(X \le t, X < Z)$ and $F_Z^*(t) = P(Z \le t, Z < X)$, together with the corresponding conditional distribution functions

$$\tilde{F}_X(t) = F_X^*(t)/F_X^*(\infty), \qquad \tilde{F}_Z(t) = F_Z^*(t)/F_Z^*(\infty).$$

It is important to note that the functions $F_X^*$, $F_Z^*$, $\tilde{F}_X$, $\tilde{F}_Z$ are identifiable from data of the form $(Y, \delta)$, since they are given in terms of probabilities of events that can be expressed by $(Y, \delta)$. For example, $F_X^*(t) = P(Y \le t, \delta = 0)$, and can hence be estimated consistently from a sample of values of $(Y, \delta)$.

On the other hand, as already mentioned, the marginal distribution functions $F_X$, $F_Z$ are not identifiable in general since they are not probabilities of events that can be expressed by $(Y, \delta)$.
We now show that the marginal distribution of X is identifiable under random
signs censoring. In fact this follows directly from the definition, since we must have
$$\tilde{F}_X(t) = P(X \le t \mid X < Z) = P(X \le t) = F_X(t) \qquad (10.10)$$

by independence of $X$ and the event $X < Z$, which is precisely the requirement of random signs censoring. As verified above, $\tilde{F}_X(t)$ can always be estimated consistently from data, and thus this holds for $F_X(t)$ as well by Equation 10.10. Hence we have the somewhat surprising result under random signs censoring that the marginal distribution of $X$ is the same as the distribution of the observed occurrences of $X$.
Cooke (1993) showed that under random signs censoring we have

$$\tilde{F}_X(t) < \tilde{F}_Z(t) \quad \text{for all } t > 0. \qquad (10.11)$$

Moreover, he showed the kind of inverse statement that whenever Equation 10.11 holds, there exists a joint distribution of $(X, Z)$ satisfying the requirements of random signs censoring and giving the same sub-distribution functions.

On the other hand, if $\tilde{F}_X(t) \ge \tilde{F}_Z(t)$ for some $t$, then there is no joint distribution of $(X, Z)$ for which the random signs requirement holds. For more discussion on random signs censoring and its applications we refer to Cooke (1993, 1996) and Bedford and Cooke (2001, Chapter 9). One idea is to estimate the functions $\tilde{F}_X(t)$ and $\tilde{F}_Z(t)$ from data to check whether Equation 10.11 may possibly hold, and when this is the case to suggest a model that satisfies the random signs property.
10.3.3 The Repair Alert Model
Lindqvist et al. (2006) introduced the so-called repair alert model which extends
the idea of random signs censoring by defining an additional repair alert function
which describes the alertness of the maintenance crew as a function of time. The
definition can be given as follows:
The pair $(X, Z)$ of life variables satisfies the requirements of the repair alert model provided the following two conditions both hold:

(i) $Z$ is a random signs censoring of $X$;

(ii) There exists an increasing function $G$ defined on $[0, \infty)$ with $G(0) = 0$, such that for all $x > 0$,

$$P(Z \le z \mid Z < X, X = x) = \frac{G(z)}{G(x)}, \quad 0 < z \le x.$$

The function $G$ is called the cumulative repair alert function. Its derivative $g$ (when it exists) is called the repair alert function. The repair alert model is hence a specialization of random signs censoring, obtained by introducing the cumulative repair alert function $G$.
Part (ii) of the above definition means that, given that there would be a failure at time $X = x$, and given that the maintenance crew will perform a PM before that time (i.e. given that $Z < X$), the conditional density of the time $Z$ of this PM is proportional to the repair alert function $g$.
Lindqvist et al. (2006) showed that whenever Equation 10.11 holds there is a
unique repair alert model giving the same sub-distribution functions. Thus, restricting to repair alert models we are able to strengthen the corresponding result for
random signs censoring which does not guarantee uniqueness.
The repair alert function is meant to reflect the reaction of the maintenance crew. More precisely, $g(t)$ ought to be high at times $t$ for which failures are expected and the alert therefore should be high. Langseth and Lindqvist (2003) simply put $g(t) = \lambda(t)$, where $\lambda(t)$ is the failure rate of the marginal distribution of $X$. This choice of $g(t)$ of course simplifies analyses since it reduces the number of parameters, but at the same time it seems fairly reasonable given a competent maintenance crew. In a subsequent paper, Langseth and Lindqvist (2006) present ways to test whether $g(t)$ can be assumed equal to the hazard rate $\lambda(t)$.
It follows from the construction in Lindqvist et al. (2006) that the repair alert model is completely determined by the marginal distribution function $F_X$ of $X$, the cumulative repair alert function $G$, the probability $q \equiv P(Z < X)$, and the assumption that $X$ is independent of the event $\{Z < X\}$ (i.e. random signs censoring). Thus, given statistical data, the inference problem consists of estimating $F_X(t)$ (possibly in parametric form), the repair alert function $g$ (or $G$), and the probability $q$ of PM. We refer to Lindqvist et al. (2006) and Lindqvist and Langseth (2005) for details on such statistical inferences.
The following is a simple example of a repair alert model.
Example 3 Let $(X, Z)$ be a pair of life variables with joint density, parameterized by $\lambda > 0$ and $0 < q < 1$,

$$f_{XZ}(x, z; \lambda, q) = (q\lambda/x)\,e^{-\lambda x} \quad \text{for } x > 0,\ 0 < z < x/q.$$

It is easily checked that $X$ is exponentially distributed with failure rate $\lambda$, that $q = P(Z < X)$, and that

$$P(Z \le z \mid Z < X, X = x) = \frac{z}{x}, \quad 0 < z \le x,$$

so this is a repair alert model with cumulative repair alert function $G(t) = t$.

The practical interpretation of this example is as follows. We consider a component or system with lifetime $X$ which is exponentially distributed with failure rate $\lambda$. With probability $q$ a PM is performed before $X$, at a time which for given $X = x$ is uniformly distributed on the interval from 0 to $x$.
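Example 3 can be checked by direct simulation: drawing $Z$ uniformly on $(0, x/q)$ given $X = x$ makes $P(Z < X \mid X = x) = q$ for every $x$, so $X$ is independent of the PM event, exactly as random signs censoring requires. A small Monte Carlo sketch with illustrative parameter values:

```python
import random

def simulate_example3(lam, q, n, rng):
    """Monte Carlo version of Example 3: X ~ Exp(lam) and, given X = x,
    Z ~ Uniform(0, x/q), so that P(Z < X | X = x) = q for every x."""
    pm, failures = 0, []
    for _ in range(n):
        x = rng.expovariate(lam)
        z = rng.uniform(0.0, x / q)
        if z < x:
            pm += 1               # PM preempts the failure (delta = 1)
        else:
            failures.append(x)    # failure observed: Y = X (delta = 0)
    return pm / n, failures

rng = random.Random(42)
frac_pm, failures = simulate_example3(lam=2.0, q=0.3, n=200_000, rng=rng)
mean_fail = sum(failures) / len(failures)
print(frac_pm, mean_fail)   # close to q = 0.3 and to E(X) = 0.5
```

That the mean of the observed failure times matches $E(X) = 1/\lambda$ is the identifiability result of Equation 10.10 in action: under random signs, the observed failures have the same distribution as $X$ itself.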
10.3.4 Further Properties of the Repair Alert Model

The following formula (taken from Lindqvist et al. 2006) shows in particular why Equation 10.11 holds under the repair alert model:

$$\tilde{F}_Z(t) = F_X(t) + G(t) \int_t^\infty \frac{f_X(y)}{G(y)}\,dy. \qquad (10.12)$$

Note that for random signs, and hence for the repair alert model, we have $\tilde{F}_X(t) = F_X(t)$.
We next discuss some implications of the repair alert model, in particular how
the parameters q and G influence the observed performance of PM and failures.
In order to help intuition, we sometimes consider the power version $G(t) = t^\beta$, where $\beta > 0$ is a parameter. Then $g(t) = \beta t^{\beta-1}$, so $\beta = 1$ means a constant repair alert function, while $\beta < 1$ and $\beta > 1$ correspond to, respectively, a decreasing and increasing repair alert function.

Under the random signs assumption, the parameter $q = P(Z < X)$ is connected to the ability to discover signals regarding a possibly approaching failure. More precisely, $q$ is understood as the probability that a failure is avoided by a preceding PM.

Given that there will be a PM, one should ideally have the time of PM immediately before the failure. It is seen that this issue is connected to the function $G$. For example, large values of $\beta$ will correspond to distributions with most of their mass near $x$.
Moreover, it follows from Equation 10.12 that

$$E(Z \mid Z < X) = \int_0^\infty \bigl(1 - \tilde{F}_Z(z)\bigr)\,dz = E(X) - E\!\left[\frac{M(X)}{G(X)}\right], \qquad (10.13)$$

where $M(x) = \int_0^x G(t)\,dt$; for the power case $G(t) = t^\beta$ this equals $\beta E(X)/(\beta+1)$. Similarly, the expected length of a sojourn is

$$E(Y) = E(X) - qE\!\left[\frac{M(X)}{G(X)}\right].$$

Furthermore, if $G(t) = t^\beta$, then

$$E(Y) = E(X)\left(1 - \frac{q}{\beta+1}\right). \qquad (10.14)$$
We finally give a simple illustration of how the parameters $q$ and $\beta$ (assuming $G(t) = t^\beta$ for simplicity) influence the long run cost per time unit under the repair alert model. Let $C_{PM}$, $C_F$ be the costs of PM and failure, respectively, for a single sojourn. Assume now that following an event (PM or failure), the operation is restarted with a system assumed to be as good as new, and that this process continues. This leads to a sequence of observations of $(Y, \delta)$, which we shall assume are independent and identically distributed. The theory of renewal reward processes (e.g. Ross 1983, p 78) implies that the expected cost per unit time in the long run equals the expected cost per sojourn divided by the expected length of a sojourn, i.e.

$$\frac{qC_{PM} + (1-q)C_F}{E(X)\left(1 - \frac{q}{\beta+1}\right)}.$$
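Formula 10.14 and the cost expression above are easy to explore numerically. The sketch below also checks $E(Y)$ by simulation, generating the PM time as $Z = X\,U^{1/\beta}$ with $U$ uniform on $(0,1)$, which reproduces the conditional distribution $G(z)/G(x) = (z/x)^\beta$; all parameter and cost values are illustrative assumptions.

```python
import random

def cost_rate(q, beta, ex, c_pm, c_f):
    """Long run expected cost per unit time: (q*C_PM + (1-q)*C_F) divided by
    E(Y) = E(X)*(1 - q/(beta+1)), per the renewal reward argument above."""
    return (q * c_pm + (1.0 - q) * c_f) / (ex * (1.0 - q / (beta + 1.0)))

def simulate_mean_sojourn(q, beta, lam, n, rng):
    """Estimate E(Y): with probability q the sojourn ends in PM at
    Z = X * U**(1/beta) (conditional cdf (z/x)**beta), otherwise at X."""
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(lam)
        if rng.random() < q:
            total += x * rng.random() ** (1.0 / beta)   # sojourn ends at PM
        else:
            total += x                                  # sojourn ends at failure
    return total / n

rng = random.Random(7)
q, beta, lam = 0.4, 2.0, 1.0
theory = (1.0 / lam) * (1.0 - q / (beta + 1.0))   # E(Y) from (10.14)
sim = simulate_mean_sojourn(q, beta, lam, 100_000, rng)
print(sim, theory)                                 # the two should nearly agree
print(cost_rate(q, beta, 1.0 / lam, 1.0, 10.0))    # illustrative cost rate
```

Increasing $\beta$ (a sharper repair alert) lengthens the sojourns and, when $C_{PM} < C_F$, lowers the long run cost rate for a given $q$.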
Consider now a system which is maintained preventively at the times $\tau, 2\tau, 3\tau, \ldots$, called PM epochs. Here $\tau > 0$ is the length of what we shall call the PM interval.
10.4.1 The Markov Model
Let $X(t) \in S$ denote the state of the system at time $t$, where the set $S$ of possible states is finite. It is assumed that $X(t)$ behaves like a time homogeneous Markov chain as long as time runs inside PM intervals, i.e. inside time intervals $n\tau \le t < (n+1)\tau$ for $n = 0, 1, \ldots$. This Markov chain is governed by an infinitesimal intensity matrix $A$, where the entry $a_{jk}$ of $A$ for $j \ne k$ is the transition intensity from state $j$ to state $k$; see for example Taylor and Karlin (1984, p 254). An example of an intensity matrix $A$ is given by Equation 10.15, an illustration of which is provided by the state diagram in Figure 10.9. Let

$$P_{jk}(t) = P(X(t) = k \mid X(0) = j); \quad j, k \in S,\ t > 0$$

denote transition probabilities for the Markov chain governed by $A$, and let

$$P(t) = (P_{jk}(t);\ j, k \in S)$$

denote the corresponding matrix of transition probabilities.
Let $Y_n = X(n\tau-)$, which is the state of the system immediately before the $n$-th PM epoch. The effect of PM at time $n\tau$ is to change the state of the system from $Y_n$ to $Z_n$ according to a transition matrix $R = (R_{jk})$, where

$$P(Z_n = k \mid Y_n = j) = R_{jk}; \quad j, k \in S.$$

The model description is completed by defining the initial state of the Markov chain $X(t)$ running inside the PM interval $[n\tau, (n+1)\tau)$ to be $X(n\tau) = Z_n$ ($n = 0, 1, \ldots$), where $Z_0$ is the initial state of the system, usually the perfect state in $S$. It is furthermore assumed that the Markov chain $X(t)$ on $[n\tau, (n+1)\tau)$, given its initial state $Z_n$, is independent of all transitions occurring before time $n\tau$.

Let the distribution of $Z_0 = X(0)$ be denoted $\pi = (\pi_j;\ j \in S)$, where $\pi_j = P(Z_0 = j)$. Then for any $k \in S$,

$$P(Y_1 = k) = P(X(\tau-) = k) = \sum_{j \in S} P(X(\tau) = k \mid X(0) = j)\,P(X(0) = j) = \sum_{j \in S} \pi_j P_{jk}(\tau) = [\pi P(\tau)]_k.$$
More generally, the sequence $(Y_n)$ is a Markov chain with transition probabilities

$$P(Y_{n+1} = k \mid Y_n = j) = \sum_{\ell \in S} P(Y_{n+1} = k \mid Z_n = \ell, Y_n = j)\,P(Z_n = \ell \mid Y_n = j) = \sum_{\ell \in S} R_{j\ell}\,P_{\ell k}(\tau) = [RP(\tau)]_{jk}.$$

Similarly, $(Z_n)$ is a Markov chain with transition probabilities

$$P(Z_{n+1} = k \mid Z_n = j) = \sum_{\ell \in S} P(Z_{n+1} = k \mid Y_{n+1} = \ell, Z_n = j)\,P(Y_{n+1} = \ell \mid Z_n = j) = \sum_{\ell \in S} P_{j\ell}(\tau)\,R_{\ell k} = [P(\tau)R]_{jk}.$$
Let $G \subset S$ denote the set of critical (failed) states, and consider the average probability of being in a critical state during the $n$-th PM interval,

$$U_n = \frac{1}{\tau} \int_{n\tau}^{(n+1)\tau} P(X(t) \in G)\,dt = \frac{1}{\tau} \sum_{j \in S} \int_0^\tau P_{jG}(t)\,P(Z_n = j)\,dt,$$

where $P_{jG}(t) = \sum_{k \in G} P_{jk}(t)$.
Following Hokstad and Frøvig (1996) we shall define the critical safety unavailability (CSU) of the system by

$$\mathrm{CSU} = \lim_{n \to \infty} U_n = \frac{1}{\tau} \sum_{j \in S} \int_0^\tau P_{jG}(t)\,\pi_j^*\,dt = \sum_{j \in S} \pi_j^* Q_j,$$

where $\pi^* = (\pi_j^*;\ j \in S)$ is the limiting distribution of the chain $(Z_n)$ and

$$Q_j = \frac{1}{\tau} \int_0^\tau P_{jG}(t)\,dt$$

is the critical safety unavailability given that the system state is $j$ at the beginning of the PM interval.
10.4.3 The Failure Model of Hokstad and Frøvig

As an illustration we shall reconsider the most general failure model of Hokstad and Frøvig (1996), namely their Failure Mechanism III. Here the state space is

$$S = \{O, D, K_I, K_{II}\},$$

where $O$ = the system is as good as new, $D$ = the system has a failure classified as degraded (noncritical), $K_I$ = the system has a failure classified as critical, caused by a sudden shock, and $K_{II}$ = the system has a failure classified as critical, caused by the degradation process.

It is assumed that the Markov chain $X(t)$ is defined by the state diagram of Figure 10.9, and thus has infinitesimal transition matrix
256
B. Lindqvist
d k
0
A=
d
k dk
k
k
0
dk
0
(10.15)
Figure 10.9. State diagram for the failure mechanism of Hokstad and Frøvig (1996)
The model assumes that no repairs are done in the time intervals between PM epochs. Moreover, since $A$ is upper triangular, we can obtain $P(t) = e^{tA}$ rather easily. It is clear that $P(t)$ can be written

$$P(t) = \begin{pmatrix} P_{OO}(t) & P_{OD}(t) & P_{OK_I}(t) & P_{OK_{II}}(t) \\ 0 & P_{DD}(t) & P_{DK_I}(t) & P_{DK_{II}}(t) \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},$$
where expressions for the entries are found in Lindqvist and Amundrustad (1998).
In practice it is of interest to quantify the effect of various forms of preventive
maintenance. This can be done in the presented framework by means of the repair
matrix R . Some examples are given below.
If all failures are repaired at PM epochs, then the PM always returns the system back to state $O$, and we have

$$R = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}.$$
Next, if only critical failures are repaired at PM epochs, then the appropriate $R$ matrix is

$$R = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}.$$
More generally one may consider an extension of this by assuming that all critical failures are repaired, while degraded failures are repaired with probability $1 - r$ and remain unrepaired with probability $r$, $0 \le r \le 1$. The repair strategy is thus determined by the parameter $r$. This clearly leads to the matrix

$$R = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1-r & r & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}.$$
Finally, repair of the critical failures at PM epochs may itself be imperfect, leading to

$$R = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1-r & r & 0 & 0 \\ 1-r_{k1} & 0 & r_{k1} & 0 \\ 1-r_{k2} & 0 & 0 & r_{k2} \end{pmatrix}.$$

Here $r$ has the same meaning as before, while $1 - r_{k1}$ is the probability of successful repair of a $K_I$ failure and $1 - r_{k2}$ is the similar probability for $K_{II}$.
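The whole pipeline — $P(\tau) = e^{\tau A}$, the $(Z_n)$ chain with transition matrix $P(\tau)R$, its limiting distribution, and finally the CSU — can be sketched in a few dozen lines. The rate and repair-probability values below are purely illustrative, and the $A$ and $R$ matrices follow the reconstructed forms above; at these magnitudes a plain Taylor series is adequate for the matrix exponential.

```python
from typing import List

Matrix = List[List[float]]

def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(a: Matrix, t: float, terms: int = 60) -> Matrix:
    """P(t) = exp(t*A) via a truncated Taylor series (fine for small t*rates)."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = mat_mul(term, [[t * x / k for x in row] for row in a])
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# States in the order O, D, K_I, K_II; illustrative rates only.
lam_d, lam_k, lam_dk = 0.05, 0.01, 0.02
A = [[-(lam_d + lam_k), lam_d, lam_k, 0.0],
     [0.0, -(lam_k + lam_dk), lam_k, lam_dk],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]

# Last repair matrix above: degraded failures repaired w.p. 1-r,
# critical failures repaired imperfectly (success prob. 1-r_k1, 1-r_k2).
r, rk1, rk2 = 0.3, 0.1, 0.1
R = [[1.0, 0.0, 0.0, 0.0],
     [1.0 - r, r, 0.0, 0.0],
     [1.0 - rk1, 0.0, rk1, 0.0],
     [1.0 - rk2, 0.0, 0.0, rk2]]

tau = 10.0
PR = mat_mul(expm(A, tau), R)          # transition matrix of the chain (Z_n)

pi = [1.0, 0.0, 0.0, 0.0]              # start in the perfect state O
for _ in range(500):                   # power iteration -> limiting distribution
    pi = [sum(pi[i] * PR[i][j] for i in range(4)) for j in range(4)]

# Q_j = (1/tau) * integral over [0, tau] of P_{jG}(t), with G = {K_I, K_II}
steps = 200
Q = [0.0] * 4
for m in range(steps):
    P = expm(A, (m + 0.5) * tau / steps)   # midpoint rule
    for j in range(4):
        Q[j] += (P[j][2] + P[j][3]) / steps

csu = sum(pi[j] * Q[j] for j in range(4))
print(round(csu, 4))
```

Varying $\tau$, $r$, $r_{k1}$ and $r_{k2}$ in this sketch quantifies the effect of the various repair strategies discussed above on the critical safety unavailability.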
checking (see for example the consideration of maximum log likelihoods in the
examples of Section 10.2.5). Another way of extending the NHPP processes is via
the large class of imperfect repair models. The classical model is here the one
suggested by Brown and Proschan (1983) (see the review paper Lindqvist 2006 for
an introduction to the subsequent literature). Imperfect repair models combine two
basic ingredients, a hazard rate z (t ) of a new system together with a particular
repair strategy which governs a so called virtual age process. The idea is that the
virtual age of the system is reduced at repairs by a certain amount which depends
on the repair strategy. The extreme cases are the perfect repair (renewal) models
where the virtual age is set to 0 after each repair, and the minimal repair (NHPP)
models where the virtual age is not reduced at repairs and hence always equals the
actual age.
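A minimal simulation of such a virtual age process, under the Brown and Proschan (1983) rule that each repair is perfect with probability p and minimal with probability 1 − p, and assuming (purely for illustration) a Weibull cumulative hazard Z(t) = t^β:

```python
import random

def simulate_failures(horizon, beta, p_perfect, rng):
    """Failure times in [0, horizon] under a Brown-Proschan virtual age
    process with cumulative hazard Z(t) = t**beta (unit scale, assumed)."""
    t, v = 0.0, 0.0          # calendar time and virtual age
    times = []
    while True:
        # Time to next failure given virtual age v:
        # solve Z(v + x) - Z(v) = E with E ~ Exp(1).
        e = rng.expovariate(1.0)
        x = (v**beta + e) ** (1.0 / beta) - v
        t += x
        if t > horizon:
            return times
        times.append(t)
        v += x                          # virtual age at failure
        if rng.random() < p_perfect:
            v = 0.0                     # perfect repair: good as new
        # else minimal repair: virtual age unchanged, i.e. keeps ageing

rng = random.Random(42)
minimal = [len(simulate_failures(10.0, 2.0, 0.0, rng)) for _ in range(2000)]
renewal = [len(simulate_failures(10.0, 2.0, 1.0, rng)) for _ in range(2000)]
mean_minimal = sum(minimal) / len(minimal)   # NHPP limit: about Z(10) = 100
mean_renewal = sum(renewal) / len(renewal)   # renewal limit: far fewer failures
```

The two extreme parameter choices recover exactly the limiting cases named above: p = 0 gives the minimal repair (NHPP) model and p = 1 the perfect repair (renewal) model.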
Second, we have put some emphasis on the consideration of possible heterogeneity between systems of the same kind. Recall our Example 2 based on data
from Bhattacharjee et al. (2003). The authors write in their conclusion: "The
heterogeneity of failure behaviour of safety related components, such as valves in
our case study, may have important implications for reliability analysis of safety
systems. If such heterogeneity is not identified and taken into account, the decisions made to maintain or to enhance safety can be non-optimal or even erroneous.
This non-optimality is more serious if the safety related decisions are made on the
basis of failure histories of the components." Still it is believed that heterogeneity
has been neglected in many reliability applications. In fact, analyses of reliability
data will often lead to an apparent decreasing failure rate which is counterintuitive
in view of wear and ageing effects. Proschan (1963) pointed out that such observed
decreasing rates could be caused by unobserved heterogeneity. Proschan presented
failure data from 17 air conditioner systems on Boeing 720 airplanes, concluding
that an HPP model was appropriate for each plane, but that the rates differed from
plane to plane. This is a classical example of heterogeneity in reliability. If times
between failures had been treated as independent and identically distributed across
planes, the conclusion would have been that these times between failures had a
decreasing failure rate.
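Proschan's effect is easy to reproduce: pooling exponential lifetimes with different rates yields a mixture whose hazard rate decreases over time, even though each underlying unit has a constant rate. A small numerical check, with two arbitrarily chosen rates standing in for two "planes":

```python
import math

# Two units with constant (HPP) rates; their pooled lifetimes follow a
# mixture of exponentials, whose hazard rate is decreasing.
lam1, lam2, w = 0.5, 2.0, 0.5   # illustrative rates, equal mixing weight

def hazard(t):
    s = w * math.exp(-lam1 * t) + (1 - w) * math.exp(-lam2 * t)            # survival
    f = w * lam1 * math.exp(-lam1 * t) + (1 - w) * lam2 * math.exp(-lam2 * t)  # density
    return f / s

hs = [hazard(t / 10) for t in range(0, 50)]
# starts at the mean rate (1.25 here) and decays toward min(lam1, lam2)
```

So a sample pooled across heterogeneous units can look like it has a decreasing failure rate even when no individual unit improves with age.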
It has long been known in biostatistics that neglecting individual heterogeneity
may lead to severe bias in estimates of lifetime distributions. The idea is that
individuals have different frailties, and that those who are most frail will die or
fail earlier than the others. This in turn leads to a decreasing population hazard,
which has often been misinterpreted in the same manner as mentioned for the
reliability applications. Important references on heterogeneity in the biostatistics
literature are Vaupel et al. (1979), Hougaard (1984) and Aalen (1988). It should be
noted that heterogeneity is in general unidentifiable if it is treated as an individual quantity. For identifiability it is necessary that frailty is common to several
individuals, for example in family studies in biostatistics, or if several events are
observed for each individual, such as for the repairable systems considered in this
paper. The presence of heterogeneity is often apparent for data from repairable
systems if there is a large variation in the number of events per system. However, it
is not really possible to distinguish between heterogeneity and dependence of the
intensity on past events for a single process.
The third point to be mentioned regards the use, or lack of use, of methods for
competing risks in reliability applications. The following is a citation from
Crowder (2004) appearing in the article on Competing Risks in the Encyclopedia of
Actuarial Science: "If something can fail, it can often fail in one of several ways
and sometimes in more than one way at a time. In the real world, the cause, mode,
or type of failure is usually just as important as the time to failure. It is therefore
remarkable that in most of the published work to date in reliability and survival
analysis there is no mention of competing risks. The situation hitherto might be
referred to as a lost case." Fortunately, some work has been done recently in order
to include competing risks in the study of repaired and maintained systems. Much
of this work, partly reviewed in Section 10.3, has been motivated by the work of
Cooke (1996) and his collaborators. His point of departure was formulated in the
conclusion of Cooke (1996): "The main themes of Parts I and II of this article are
that current RDB (Reliability Data Bank) designs: 1. are not giving RDB users
what they need; 2. are not doing a good job of analyzing competing risk data; 3. are
not doing a good job in handling uncertainty. Improvements in all these areas are
possible." However, it must be acknowledged that the models and methods presented here merely scratch the surface. It is therefore appropriate to conclude with
a summary of open issues...
The final section of the present chapter considers an example of an approach
which in some sense generalizes the competing risks issue, namely using Markov
chains to model failure mechanisms of various equipment.
The chapter has mostly considered the modeling of repairable systems, with
less mention of statistical methods. It is believed that much of future research on
maintenance of repairable systems will still be centered around modeling, possibly
with an increased emphasis on point process models including multiple types of
events (see for example Doyen and Gaudoin 2006). More detailed models of the
underlying failure and maintenance mechanisms may indeed be of great value for
planning and optimization of maintenance actions. On the other hand, the new
advances in modeling certainly lead to considerable statistical challenges. This
point was touched on by Cooke (1996) as cited above, and it is clear that the information in reliability databases could and should be handled by more sophisticated methods than the ones that are traditionally used. Here there is much to learn
from the biostatistics literature where there has for a long time been an emphasis
on nonparametric methods and on regression methods using covariate information.
10.6 References
Aalen OO, (1988) Heterogeneity in survival analysis. Statistics in Medicine 7:1121–1137.
Andersen P, Borgan O, Gill R, Keiding, N, (1993) Statistical Models Based on Counting
Processes. Springer, New York.
Ascher H, Feingold H, (1984) Repairable Systems Reliability: Modeling, Inference,
Misconceptions and Their Causes. Marcel Dekker, New York.
Bedford T, Cooke RM, (2001) Probabilistic Risk Analysis: Foundations and Methods;
Cambridge University Press: Cambridge.
Proschan F, (1963) Theoretical explanation of observed decreasing failure rates. Technometrics 5:375–383.
Rausand M, Høyland A, (2004) System reliability theory: Models, statistical methods, and
applications. 2nd ed. Wiley-Interscience, Hoboken, N.J.
Ross SM, (1983) Stochastic Processes. Wiley, New York.
Taylor HM, Karlin S, (1984) An introduction to stochastic modeling. Academic Press,
Orlando.
Vaupel JW, Manton KG, Stallard E, (1979) The impact of heterogeneity in individual frailty
on the dynamics of mortality. Demography 16:439–454.
11
Optimal Maintenance of Multi-component Systems:
A Review
Robin P. Nicolai and Rommert Dekker
11.1 Introduction
Over the last few decades the maintenance of systems has become more and more
complex. One reason for this is that systems consist of many components which
depend on each other. On the one hand, interactions between components complicate the modelling and optimization of maintenance. On the other hand, interactions
also offer the opportunity to group maintenance which may save costs. It follows
that planning maintenance actions is a big challenge and it is not surprising that
many scholars have studied maintenance optimization problems for multi-component systems. In some articles new solution methods for existing problems are
proposed, in other articles new maintenance policies for multi-component systems
are studied. Moreover, the number of papers with practical applications of optimal
maintenance of multi-component systems is still growing.
Cho and Parlar (1991) give the following definition of multi-component
maintenance models: "Multi-component maintenance models are concerned with
optimal maintenance policies for a system consisting of several units of machines
or many pieces of equipment, which may or may not depend on each other
(economically/stochastically/structurally)." In these models, then, the aim is to find
an optimal maintenance plan for systems consisting of components that interact
with each other. We will come back later to the concepts of optimality and interaction. For now it is important to remember that the condition of the system depends
on (the state of) the components which will only function if adequate maintenance
actions are performed.
In this chapter we will give an up-to-date review of the literature on multi-component maintenance optimization. Let us start with a brief summary of the
overview articles that have appeared in the past. Cho and Parlar (1991) review
articles from 1976 to 1991. The authors divide the literature into five topical
categories: machine-interference/repair models, group/block/cannibalization/opportunistic models, inventory/maintenance models, other maintenance/replacement
models and inspection/maintenance models. Dekker et al. (1996) deal exclusively
fixed cost (the set-up cost) and variable costs. In the articles discussed below, this
will not be different.
Castanier et al. (2005) consider a two-component series system. Economic
dependence between the two components is present in the following way. The setup cost for inspecting or replacing a component is charged only once if the actions
on both components are combined. That is, joint maintenance of components saves
costs. In this article the condition of the components is modelled by a stochastic
process and it is monitored by non-periodic inspections. In the opportunistic
maintenance policy several thresholds are defined for doing inspections, corrective
and preventive replacements, and opportunistic maintenance. These thresholds are
decision variables. Many articles on this type of model have appeared, but most of
these articles only consider single component models.
The articles of Scarf and Deara (1998, 2003) consider both economic and
stochastic dependence between components in a series system. This combination is
scarce in the literature. Positive economic dependence is modelled on the basis that
the cost of replacement of one or more components includes a one-off set-up cost
whose magnitude does not depend on the number of components replaced. We will
discuss these articles in more detail in Section 11.4.
In one of the few case studies found in the literature, Van der Duyn Schouten et
al. (1998) investigate the problem of replacing light bulbs in traffic control signals.
Each installation consists of three compartments for the green, red, and yellow
lights. Maintenance of light bulbs means replacement, either correctively or
preventively. First, positive economic dependence is present in the form of set-up
cost, because each replacement action requires a fixed cost in the form of
transportation of manpower and equipment. Second, the failure of individual bulbs
is an opportunity for doing preventive maintenance on other bulbs. The authors
propose two types of maintenance policies. In the first policy, also known as the
standard indirect-grouping strategy (introduced in maintenance by Goyal and Kusy
1985; for a review of this strategy we refer to Dekker et al. 1996), corrective and
preventive replacements are strictly separated. Economies of scale can thus only be
achieved by combining preventive replacements of the bulbs. The authors also
propose the following opportunistic age-based grouping policy. Upon failure of a
light bulb, the failed bulbs and all other bulbs older than a certain age are replaced.
Budai et al. (2006) consider a preventive maintenance scheduling problem
(PMSP) for a railway system. In this problem (short) routine activities and (long)
unique projects for one track have to be scheduled in a certain period. To reduce
costs and inconvenience for the travellers and operators, these activities should be
scheduled together as much as possible. With respect to the latter, maintenance of
different components of one track simultaneously requires only one track possession.
Time is discretized and the PMSP is written as a mixed-integer linear programming
model. Positive dependence is taken into account by the objective function, which is
the sum of the total track possession cost and the maintenance cost over a finite
horizon. To reduce possible end-of-horizon effects an end-of-horizon valuation is
also incorporated in the objective function. Note that the possession cost can be seen
as a downtime cost. The cost is modelled as a fixed set-up cost, which is the reason
that the problem is classified in this category. Besides this positive dependence there also exists
negative dependence between components, since some activities exclude each other.
these policies the system is replaced at the time of the m-th failure, every T time
units, and at the minimum time of these events, respectively. These policies were
first introduced by Assaf and Shanthikumar (1987), Okumoto and Elsayed (1983)
and Ritchken and Wilson (1990), respectively. Popova and Wilson (1999) assume
that downtime costs are incurred when failed components are not repaired or
replaced. So, when the system operates there is also negative dependence between
the components. After all, when the components are left in a failed condition, with
the intention to group corrective maintenance, then downtime costs are incurred. In
the maintenance policies a trade-off between the downtime costs and the advantages of grouping (corrective) maintenance is made.
Sheu and Jhang (1996) propose a new two-phase opportunistic maintenance
policy for a group of independent identical repairable units. Their model takes into
account downtime costs and the maintenance policy includes minimal repair,
overhaul, and replacement. In the first phase, (0,T], minor failures are removed by
minimal repairs and catastrophic failures by replacements. In the second phase,
(T,T+W], minor failures are also removed by minimal repairs, but catastrophic
failures are left idle. Group maintenance is conducted at time T+W or when the k-th
unit becomes idle, whichever comes first. The generalized group maintenance policy requires
inspection at either the fixed time T+W or the time when exactly k units are left
idle, whichever comes first. At an inspection, all idle components are replaced with
new ones and all operating components are overhauled so that they become as
good as new.
Higgins (1998) studies the problem of scheduling railway track maintenance
activities and crews. In this problem positive economic dependence is present in the
following way. The occupancy of track segments due to maintenance prevents all
train movements on those segments. The costs associated with this can be regarded
as downtime costs. The maintenance scheduling problem is modelled as a large scale
0-1 programming problem with many (non-linear) restrictions. The objective is to
minimize expected interference delay with the train schedule and prioritized finishing
time. The downtime costs are modelled by including downtime probabilities in the
objective function. The author proposes tabu search to solve the problem. The
neighbourhood, which plays a prominent role in local search techniques, is easily defined by swapping the order of activities or maintenance crews.
The article of Sriskandarajah et al. (1998) discusses the maintenance scheduling
of rolling stock. Multiple train units have to be overhauled before a certain due date.
The aim is to find a suitable common due date for each train so that the due dates of
individual units do not deviate too much from the common due date. Maintenance
carried out too early or too late is costly since this may cause loss of use of a train.
A genetic algorithm is proposed to solve this scheduling problem.
11.3.2 Negative Economic Dependence
Negative economic dependence between components occurs when maintaining
components simultaneously is more expensive than maintaining components individually. There can be several reasons for this:
Manpower restrictions
Safety requirements
Redundancy/production-loss
increases more than linearly with the number of pieces out of action. The objective
is to minimize the loss of production cost, which is incurred when a piece is
overhauled. The optimal policy is found by a relative value successive approximation algorithm.
In Langdon and Treleaven (1997) the problem of scheduling maintenance for
electrical power transmission networks is studied. There is negative economic
dependence in the network due to redundancy/production-loss. Grouping certain
maintenance activities in the network may prevent a cheap electricity generator
from running, so requiring a more expensive generator to be run in its place. That
is, some parts of the network should not be maintained simultaneously. These
exclusions are modelled by adding restrictions to the MIP formulation of the problem. The authors propose several genetic algorithms and other heuristics to solve
the problem.
11.3.3 k-out-of-n Systems
In this section we discuss the different dependencies in the k-out-of-n system in
more detail. This system is a typical example of a system with both positive and
negative economic dependence between components. A k-out-of-n system functions if at least k components function. If k = 1, then it is a parallel system; if k = n,
then it is a series system. Let us for the moment distinguish between the cases k = n
and k < n.
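For identical, independent components, the reliability of a k-out-of-n system follows directly from the binomial distribution. The helper below is the standard textbook formula, not taken from any article reviewed here:

```python
from math import comb

def k_out_of_n_reliability(k, n, p):
    """Probability that at least k of n i.i.d. components work,
    where each component works with probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

p = 0.9
r_2of3 = k_out_of_n_reliability(2, 3, p)   # 2-out-of-3 voting system
r_par = k_out_of_n_reliability(1, 3, p)    # parallel: 1 - (1 - p)**3
r_ser = k_out_of_n_reliability(3, 3, p)    # series: p**3
```

The two boundary cases reproduce the parallel and series structures mentioned above, which is a quick sanity check on the formula.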
In the series system (k = n), there is positive economic dependence due to
downtime opportunities. The failure of one component results in an expensive
downtime of the system and this time can be used to group preventive and corrective maintenance. Negative economic dependence is not explicitly present in the
series system.
If k < n, then there is redundancy in the system and it fails less often than its
individual components. This way a specified reliability can be guaranteed. Typically, the components of this system are identical which allows for economies of
scale in the execution of maintenance activities. It is not only possible to obtain
savings by grouping preventive maintenance, but also by grouping corrective
maintenance. Note that the latter form of grouping is not advantageous in series
systems. In other words, the redundant components introduce additional positive
dependence in the system. Whereas positive economic dependence is present upon
failure of a component, negative economic dependence plays a role as long as the
system operates. A single failure of a component may not always be an opportunity
to combine maintenance activities. First, grouping corrective and preventive maintenance upon the failure of the component increases the probability of system
failure and costly production losses. Second, leaving components in a failed
condition for some time, with the intention to group corrective maintenance at a
later stage, has the same effect. So, there is a trade-off between the potential loss
resulting from a system failure and the benefit of joint maintenance.
One problem of optimizing (age-based) maintenance in k-out-of-n systems is
the determination of downtime costs, as a failure does not directly result in system
failure. Smith and Dekker (1997) derive the uptime, downtime and costs of maintenance in a 1-out-of-n system (with cold standby), but in general it is very difficult
274
to assess the availability and the downtime costs of a k-out-of-n system. In their
article, Smith and Dekker (1997) optimize the following age-replacement policy. A
component is taken out for preventive maintenance and replaced by a stand-by one,
if its age has reached a certain value Tpm. Moreover, they determine the number of
redundant components needed in the system.
In the maintenance policies considered in the articles below, an attempt is made
to balance the negative aspects of downtime costs and the positive aspects of
grouping (corrective) maintenance. The opportunistic maintenance policies proposed
in these articles are age-based and also contain a threshold for the number of failures
(except for the policy introduced by Sheu and Kuo 1994).
In Dekker et al. (1998b) the maintenance of light-standards is studied. A light
standard consists of n independent and identical lamps screwed on a lamp assembly. To guarantee a minimum luminance, the lamps are replaced if the number of
failed lamps reaches a pre-specified number m. In order to replace the lamps the
assembly has to be lowered. This set-up activity is an opportunity to combine
corrective and preventive maintenance. Several opportunistic age-based variants of
the m-failure group replacement policy (in its original form only corrective maintenance is grouped) are considered in this paper. Simulation optimization is used to
determine the optimal opportunistic age threshold.
Pham and Wang (2000) introduce imperfect PM and partial failure in a k-out-of-n system. They propose a two-stage opportunistic maintenance policy for the
system. In the first stage failures are removed by minimal repair; in the second
stage failed components are jointly replaced with operating components when m
components have failed, or the entire system is replaced at time T, whichever
occurs first. Positive economic dependence is of an opportunistic nature. Joint
maintenance requires less time than individual maintenance.
Sheu and Kuo (1994) introduce a general age replacement policy for a k-out-of-n system. Their model includes minimal repair, planned and unplanned replacements, and general random repair costs. The system is replaced when it reaches age
T. The long-run expected cost rate is obtained. The aim of the paper is to find the
optimal age replacement time T that minimizes the long-run expected cost per unit
time of the policy.
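For orientation, the classical age replacement cost rate, a much simplified special case of the Sheu and Kuo policy (no minimal repair, deterministic repair costs), can be evaluated and minimized numerically: C(T) = [c_p S(T) + c_f (1 − S(T))] / ∫₀ᵀ S(t) dt, where S is the component survival function. All parameter values below are illustrative:

```python
import math

def cost_rate(T, beta=2.0, eta=1.0, c_p=1.0, c_f=5.0, n=2000):
    """Long-run expected cost per unit time of age replacement at T for a
    Weibull(beta, eta) lifetime; c_p = planned cost, c_f = failure cost."""
    S = lambda t: math.exp(-((t / eta) ** beta))   # survival function
    h = T / n
    # expected cycle length: integral of S over [0, T], trapezoid rule
    mean_cycle = h * (0.5 * S(0.0) + sum(S(i * h) for i in range(1, n)) + 0.5 * S(T))
    mean_cost = c_p * S(T) + c_f * (1.0 - S(T))
    return mean_cost / mean_cycle

# crude grid search for the cost-minimizing replacement age
Ts = [0.05 * i for i in range(1, 100)]
T_star = min(Ts, key=cost_rate)
```

With an increasing failure rate (beta > 1) and c_f > c_p, the cost rate is large for very small T (too-frequent planned replacement) and for very large T (mostly failure replacements), so the optimum lies at an interior age.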
The article of Sheu and Liou (1992) will be discussed in Section 11.4, because
they assume stochastic dependence between the components of a k-out-of-n system.
Instead, we want to give insight into the different ways of modelling failure
interaction between components and explain the implications of certain approaches
and assumptions with respect to practical applicability.
Stochastic dependence, also referred to as failure interaction or probabilistic
dependence, implies that the state of components can influence the state of the
other components. Here, the state can be given by the age, the failure rate, state of
failure or any other condition measure. In their seminal work on stochastic dependence, Murthy and Nguyen (1985b) introduce three different types of failure
interaction in a two-component system.
Type I failure interaction implies that the failure of a component can induce a
failure of the other component with probability p (q), and has no effect on the other
component with probability 1 − p (1 − q). It follows that there are two types of
failures: natural and induced. The natural failures are modelled by random
variables and the induced failures are characterized by the probabilities p and q. In
variables and the induced failures are characterized by the probabilities p and q. In
Murthy and Nguyen (1985a) the authors extend type I failure interaction to systems
with multiple components. It is assumed that whenever a component fails it
induces a total failure of the system with probability p and has no effect on the
other components with probability 1 − p. In this chapter we will consider this to
be the definition of type I failure interaction.
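A quick simulation illustrates the induced-failure mechanism in this multi-component type I model. Assuming, for simplicity, that component failures form a pooled Poisson stream of rate n·λ (failed components instantly replaced — an extra assumption made here, not part of the definition above), the time to the first induced total failure is exponential with rate n·λ·p:

```python
import random

def time_to_system_failure(n, lam, p, rng):
    """Type I failure interaction sketch: component failures arrive as a
    pooled Poisson stream of rate n*lam; each one brings the whole system
    down with probability p."""
    t = 0.0
    while True:
        t += rng.expovariate(n * lam)   # next component failure
        if rng.random() < p:            # induced total failure
            return t

rng = random.Random(7)
n_comp, lam, p = 4, 0.5, 0.1
samples = [time_to_system_failure(n_comp, lam, p, rng) for _ in range(20000)]
mean_time = sum(samples) / len(samples)   # theory: 1 / (n*lam*p) = 5.0
```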
Type II failure interaction in a two-component system is defined as follows.
The failure of component 2 can induce a failure of component 1 with probability q,
whereas every failure of component 1 acts as a shock to component 2, without
inducing an instantaneous failure, but affecting its failure rate.
Type III failure interaction implies that the failure of each component affects
the failure rate of the other component. That is, every failure of one of the components acts as a shock to the other component.
A potential problem of the failure rate interaction defined by the last two types is
determining the size of the shock. In practice it is very difficult to assess the
effect of a failure of one component on the failure rate of another component.
Usually there is not much data on the course of the failure rate of a component
after the occurrence of a shock. Shocks can also be modelled by adding a (random)
amount of damage to the state of another component. Natural failures then occur if
the state of a component (measured by the cumulative damage) exceeds a certain
level. In this chapter we will bring this modelling of type II and III failure interaction together into one definition; that is, we redefine type II failure interaction
for multi-component systems. It reads as follows. The system consists of
several components and the failure of a component affects either the failure rate of
or causes a (random) amount of damage to the state of one or more of the
remaining components. It follows that we regard a mixture of induced failures and
shock damage as type II failure interaction. Models with type II failure interaction
will also be called shock damage models.
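A minimal sketch of a shock damage model in this sense, assuming Poisson shock arrivals and exponentially distributed damage increments (both illustrative choices):

```python
import random

def failure_time(rate, mean_damage, level, rng):
    """Shock damage sketch: shocks arrive as a Poisson process of the given
    rate, each adds an Exp(mean mean_damage) amount of damage, and the
    component fails when cumulative damage exceeds `level`."""
    t, damage = 0.0, 0.0
    while damage <= level:
        t += rng.expovariate(rate)                    # next shock
        damage += rng.expovariate(1.0 / mean_damage)  # its damage increment
    return t

rng = random.Random(1)
times = [failure_time(2.0, 1.0, 5.0, rng) for _ in range(20000)]
mean_time = sum(times) / len(times)
# with Exp(1) damage, the mean number of shocks to exceed level L is L + 1,
# so the theoretical mean failure time is (5 + 1) / 2 = 3.0
```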
In general, the maintenance policies considered in the literature on stochastic
dependence are of an opportunistic nature, since the failure of one component is
potentially harmful for the other component(s). Modelling failure interaction
appears to be quite elaborate. Therefore, most articles only consider two-component systems. Below we review the articles on failure interaction in the following
order. First, we will discuss the type I interaction models. For this type of interaction different opportunistic versions of the well-known age and block replacement policies have been proposed. Second, the articles on type II interaction will
be reviewed. We will see that in most of these articles the occurrence of shocks is
modelled as a non-homogeneous Poisson process (NHPP) or that the failure rate of
components is adjusted upon failure of other components. Third, we pay attention
to articles that consider both types of failure interaction. Finally, we discuss other
forms of modelling failure interaction.
11.4.1 Type I Failure Interaction
Murthy and Nguyen (1985a) consider two maintenance policies in a multi-component system with type I failure interaction. Under the first policy all failed
components are replaced by new ones. When there is no total system failure, then
only the single failed component is replaced. Under the second policy all components, also the functioning component(s), are replaced. When there is no total
system failure, then the single failed component is subjected to minimal repair and
made operational. The failure rate of the failed component after repair is the same
as that just before failure. The authors deduce both the expected cost of keeping the
system operational for a finite time period as well as the expected cost per unit
time, of keeping the system operational for an infinite time period.
Sheu and Liou (1992) consider an optimal replacement policy for a k-out-of-n
system subject to shocks. Shocks arrive according to a NHPP. The system is
replaced preventively whenever it reaches age T > 0 at a fixed cost c0. If the m-th
shock arrives at age Sm < T, it can cause the simultaneous failure of i components
with probability pi(Sm) for i = 0, 1, ..., n, where p0(Sm) + p1(Sm) + ... + pn(Sm) = 1.
Lai and Chen (2006) consider a two-component system with failure rate
interaction. The lifetimes of the components are modelled by random variables
with increasing failure rates. Component 1 is repairable and it undergoes minimal
repair at failures. That is, component 1 failures occur according to a NHPP. Upon
failure of component 1 the failure rate of component 2 is modified (increased).
Failures of component 2 induce the failure of component 1 and consequently the
failure of the system. The authors propose the following maintenance policy. The
system is completely replaced upon failure, or preventively replaced at age T,
whichever occurs first. The expected average cost per unit time is derived and the
policy is optimized with respect to parameter T. The optimum turns out to be
unique.
Barros et al. (2006) introduce imperfect monitoring in a two-component
parallel system. It is assumed that the failure of component i is detected with
probability 1 − pi and is not detected with probability pi. The components have
exponential lifetimes and when a component fails the extra stress is placed on the
surviving one for which the failure rate is increased. Moreover, independent shocks
occur according to a Poisson process. These shocks correspond to common cause
failures and induce a system failure. The following maintenance policy is proposed.
Replace the system upon failure (either due to a shock or failure of the components
separately), or preventively at time T, whichever occurs first. Assuming that
preventive replacement is cheaper, the total expected discounted cost over an
unbounded horizon is minimized. Numerical examples show the relevance of taking
into account monitoring problems in the maintenance model. The model is applied
to a parallel system of electronic components. When one component fails, the
surviving one is overworked so as to keep the delivery rate unaffected.
11.4.3 Types I and II Failure Interaction
Murthy and Nguyen (1985b) derive the expected cost of operating a two-component system with type I or type II failure interaction for both a finite and an
infinite time period. They consider a simple, non-opportunistic maintenance
policy: always replace failed components immediately. This means that the system
is only renewed if a natural failure induces a failure of the other component.
Nakagawa and Murthy (1993) elaborate on the ideas of Murthy and Nguyen
(1985b). They consider two types of failure interaction between two components.
In the first case the failure of component 1 induces a failure of component 2 with a
certain probability. In the second case the failure of component 1 causes a random
amount of damage to the other component. In the latter case the damage
accumulates and the system fails when the total damage exceeds a specified level.
Failures of component 1 are modelled as an NHPP with increasing intensity
function. The following maintenance policy is examined. The system is replaced at
failure of component 2 or at the N-th failure of component 1, whichever occurs
first. For both models the optimal number of failures before replacing the system,
chosen to minimize the expected cost per unit time over an infinite horizon, is derived. The
maintenance policy for the shock damage model is extended as follows: the system
is also replaced at time T. This results in a two-parameter maintenance policy,
which is also optimized. The authors give an application of their models to the
The optimization methods applied to finite horizon models are either exact
methods or heuristics. Exact methods always find the global optimum solution of
a problem. If the complexity of an optimization problem is high and the computing
time of the exact method increases exponentially with the size of the problem, then
heuristics can be used to find a near-optimal solution in reasonable time.
The scheduling problem studied by Grigoriev et al. (2006) appears to be NP-hard. Instead of defining heuristics, the authors choose to work on a relatively fast
exact method. Column-generation and a branch-and-price technique are utilized to
find the exact solution of larger-sized problems. The problem considered by
Papadakis and Kleindorfer (2005) is first modelled as a mixed integer linear programming problem, but it appears that it can also be formulated as a max-flow
min-cut problem in an undirected network. For this problem efficient algorithms
exist and thus, an exact method is applicable.
Langdon and Treleaven (1997), Sriskandarajah et al. (1998), Higgins (1998)
and Budai et al. (2006) propose heuristics to solve complex scheduling problems.
The first two articles utilize genetic algorithms. Higgins (1998) applies tabu search
and Budai et al. (2006) define different heuristics that are based on intuitive
arguments. In all four articles the heuristics perform well; a good solution is found
within reasonable time.
11.7.1 Trends
In the last few years several articles have appeared on optimal maintenance of
systems with stochastic dependence. In particular, the shock-damage models have
received much attention. One explanation for this is that type II failure interaction
can be modelled in several ways, whereas there is not much room for extensions in
the type I failure model. Another reason is that since the field of stochastic dependence is not very broad yet, it is easy to add a new feature such as minimal
repair or imperfect monitoring to an existing model. Third, many existing opportunistic maintenance policies for systems with economic dependence have not yet
been applied to systems with (type II) failure interaction.
Another upcoming field in multi-component maintenance modelling is the class
of finite horizon maintenance scheduling problems. Finite horizon models can be
11.8 Conclusions
In this chapter we have reviewed the literature on optimal maintenance of multi-component systems. We first classified articles on the basis of the type of
dependence between components: economic, stochastic and structural dependence.
Subsequently, we subdivided these classes into new categories. For example, we
have introduced the categories positive and negative economic dependence. We
have paid attention to articles with both forms of interaction. Moreover, we have
defined several subcategories in the class of models with positive economic dependence. With respect to articles in the class of stochastic dependence, we are the
first to review these articles systematically.
Another classification has been made on the basis of the planning horizon
models and optimization methods. We have focussed our attention on the use of
heuristics and exact methods in finite horizon models. We have concluded that this
is a promising open research area.
We have discussed the trends and the open areas of research reported in the
literature on multi-component maintenance. We have observed a shift from infinite
horizon models to finite horizon models and from economic to stochastic dependence. This immediately defines the open research areas, which also include topics
such as case studies, modelling combinations of dependencies between components and modelling multiple set-up activities.
11.9 References
Assaf D, Shanthikumar J (1987) Optimal group maintenance policies with continuous and periodic inspections. Management Science 33:1440–1452
Barros A, Bérenguer C, Grall A (2006) A maintenance policy for two-unit parallel systems based on imperfect monitoring information. Reliability Engineering and System Safety 91:131–136
Budai G, Huisman D, Dekker R (2006) Scheduling preventive railway maintenance activities. Journal of the Operational Research Society 57:1035–1044
Castanier B, Grall A, Bérenguer C (2005) A condition-based maintenance policy with non-periodic inspections for a two-unit series system. Reliability Engineering & System Safety 87:109–120
Cho D, Parlar M (1991) A survey of maintenance models for multi-unit systems. European Journal of Operational Research 51:1–23
Dekker R, Plasmeijer R, Swart J (1998a) Evaluation of a new maintenance concept for the preservation of highways. IMA Journal of Mathematics Applied in Business and Industry 9:109–156
Dekker R, van der Duyn Schouten F, Wildeman R (1996) A review of multi-component maintenance models with economic dependence. Mathematical Methods of Operations Research 45:411–435
Dekker R, van der Meer J, Plasmeijer R, Wildeman R (1998b) Maintenance of light-standards: a case-study. Journal of the Operational Research Society 49:132–143
Goyal S, Kusy M (1985) Determining economic maintenance frequency for a family of machines. Journal of the Operational Research Society 36:1125–1128
Grigoriev A, van de Klundert J, Spieksma F (2006) Modeling and solving the periodic maintenance problem. European Journal of Operational Research 172:783–797
Gürler Ü, Kaya A (2002) A maintenance policy for a system with multi-state components: an approximate solution. Reliability Engineering & System Safety 76:117–127
Higgins A (1998) Scheduling of railway track maintenance activities and crews. Journal of the Operational Research Society 49:1026–1033
Jhang J, Sheu S (2000) Optimal age and block replacement policies for a multi-component system with failure interaction. International Journal of Systems Science 31:593–603
Lai M, Chen Y (2006) Optimal periodic replacement policy for a two-unit system with failure rate interaction. The International Journal of Advanced Manufacturing Technology 29:367–371
Langdon W, Treleaven P (1997) Scheduling maintenance of electrical power transmission networks using genetic programming. In: Warwick K, Ekwue A, Aggarwal R (eds) Artificial intelligence techniques in power systems. Institution of Electrical Engineers, Stevenage, UK, 220–237
Murthy D, Nguyen D (1985a) Study of a multi-component system with failure interaction. European Journal of Operational Research 21:330–338
Murthy D, Nguyen D (1985b) Study of two-component system with failure interaction. Naval Research Logistics Quarterly 32:239–247
Nakagawa T, Murthy D (1993) Optimal replacement policies for a two-unit system with failure interactions. RAIRO Recherche Opérationnelle / Operations Research 27:427–438
Okumoto K, Elsayed E (1983) An optimum group maintenance policy. Naval Research Logistics Quarterly 30:667–674
Özekici S (1988) Optimal periodic replacement of multicomponent reliability systems. Operations Research 36:542–552
Papadakis I, Kleindorfer P (2005) Optimizing infrastructure network maintenance when benefits are interdependent. OR Spectrum 27:63–84
12
Replacement of Capital Equipment
P.A. Scarf and J.C. Hartman
12.1 Introduction
Businesses require equipment in order to function and deliver their outputs. In the
global, competitive environment, this equipment is critical to success. However,
equipment generally degrades with age and usage, and investment is required to
maintain the functional performance of equipment. For example, in mass urban
transportation, annual expenditure on equipment replacement for the Hong Kong
underground is of the order of $50 million, and further, the Hong Kong underground
network is a fraction of the size of that in London, Paris or New York. Where
equipment replacement impacts significantly on the bottom line of a corporation and
decision-making about such expenditure is under the control of the company
executive, the modelling of such decision making is within the scope of this chapter.
Capital equipment investment projects are typically driven by operating cost
control, technical obsolescence, requirements for performance and functionality
improvements, and safety. That is, rational decision-making about capital equipment replacement will take account of engineering, economic, and safety requirements. In this chapter we will assume that the engineering requirements concerning
replacement will define certain choices for equipment replacement. For example,
engineers would normally propose a number of options for providing the continuity
of equipment function: retain the current equipment as is, refurbish the equipment in
order to improve operation and functionality, or replace the equipment with new improved technology. We will further assume that safety requirements are addressed
when these options are analysed by engineers. Consequently, we argue that rational
choice between the defined replacement options is an economic question. Thus, a
logistics corporation may be considering replacement of certain assets in its road
transportation fleet. The organisation may have to raise capital to fund such
replacement. There is the expectation that engineers for the corporation will offer a
number of choices for replacement (e.g. buy tractors from company X or Y, buy
tractors now or in N years time, or scrap or retain existing tractors as spares) that
meet future functional and safety requirements. In this way, decision making about
288
replacement then necessarily considers the costs of the replacement options over
some suitable planning horizon. As capital equipment replacement potentially incurs significant costs, the cost of capital is a factor in the decision problem and
models to support decision making typically take account of the time value of
capital through discounting.
Capital equipment is a significant asset of a business. It consists of necessarily
complex systems and a business would typically own or operate a fleet of equipment:
the Mass Transit Railway Corporation Limited of Hong Kong operates hundreds of
escalators; FedEx Express, the cargo airline, operates more than 600
aircraft; electricity distribution systems comprise thousands of kilometres of cable
and hundreds of thousands of items such as transformers and switches; water supply
networks are on a similar scale. We can appeal to the law of large numbers and
assume with some justification that the economic costs that enter capital equipment
replacement decisions are deterministic. Consequently, we consider deterministic
models in this chapter and model rational decision making throughout using net
present value techniques (e.g. see Arnold 2006; Northcott 1985).
When considering optimal equipment replacement in an uncertain environment,
authors have argued the case for using real options (Dixit and Pindyck 1994; Bowe
and Lee 2004). Whenever replacement decisions may be exercised continuously, it
is argued that the choice to replace an existing asset with a new asset at a specified
time is characteristic of an American call option; this approach seeks to value the
opportunity to replace the asset. Such a modelling approach would be valuable
when considering expansion of assets, for example, through the building of a new
transportation link for which the likely return on investment would be highly
uncertain. However, we do not consider this approach in this chapter.
We do not consider problems of component replacement in which the functionality of repairable systems is optimized either on a cost basis or a required
reliability basis. Such maintenance does not typically involve capital expenditure,
and the models used are often stochastic in nature: times to failure are considered to be random. For a recent review of such models, see Wang (2002).
The outline of the chapter is as follows. In Section 12.2 we describe the framework for the classification of models that are discussed in this chapter. This framework considers the nature of capital equipment replacement problems in general
and presents further detail regarding the nature of cost factors that contribute to
replacement decisions. Section 12.3 looks at economic life models and discusses
several models and an application of one of the models. Section 12.4 deals with
replacement of a network system. Dynamic programming models are discussed in
Section 12.5 and the chapter concludes with a discussion of topics for future
research in Section 12.6.
ment, or entire fleet replacement (Scarf and Christer 1997). The capital replacement models that are considered in this chapter may be classified as economic life
models or dynamic programming models. The former are concerned with determining the optimal lifetime of an item of equipment, taking account of costs over
some planning horizon. The latter considers replacement decisions dynamically,
determining whether plant should be retained or replaced after each period. Economic life models may be further classified according to the length of the planning
horizon: infinite, variable finite (with length of the horizon a function of decision
variables), or fixed (with a variable number of replacement cycles). Dynamic
programming models generally require a finite horizon, but may be used to identify
the optimal time zero decision for an infinite horizon.
Early models (e.g. Eilon et al. 1966) were formulated in continuous time with
optimum policy obtained using calculus. More complex models are simpler to
implement under a discrete time formulation. In the case of economic life models,
optimization may be performed using a crude search when there exists a small
number of decision variables. For fleets with many items, the discrete time
formulation naturally gives rise to mathematical programming problems. Dynamic
programming models necessarily require a discrete time formulation. Real options
models are formulated in continuous time.
We begin by looking at simple economic life models. These are applied in a
case study on escalator replacement. Economic life models are then extended to
consider first an inhomogeneous fleet and second a network system viewed as an
inhomogeneous fleet with interacting items. A number of different dynamic programming models are introduced for singular systems and then expanded to homogeneous and inhomogeneous fleets and networks of assets.
It is assumed that data relating to maintenance are available and sufficient for
modelling purposes. Data on other age related operating costs, such as fuel costs
and failures (breakdowns), would also ideally be available. Where usage of plant is
non-uniform, particularly if decreasing with age, usage data are also required for
replacement policy to be meaningful. This is because, for example, maintenance
costs for older plant may be artificially low due to under-utilization or neglect of good maintenance practice for plant near the end of their useful life. Some plant may even be retired as occasional spares. Under-reporting, and thus bias, of maintenance cost data may also be significant (Scarf 1994). Replacement models have
also been considered when cost information is obtained subjectively (Apeland and
Scarf 2003).
Penalty costs play a role in all replacement decisions (Christer and Scarf 1994).
It is only the extent to which penalty cost is quantified in the modelling process
that varies. Rather than attempt to estimate the values of difficult to quantify
parameters such as penalty cost and then determine optimal policy, the influence of
these parameters on the decision should be quantified. In this latter approach,
threshold values that lead to a step-change in optimum policy can be investigated
and presented and the decision makers can then consider whether they believe that
such values are realistic within the context of the problem. Thus, the penalty cost
can be used to measure in part the subjective component of a replacement choice.
All costs considered in the modelling will be discounted to net present value
through the use of a constant discount factor. We refer the reader to Kobbacy and
Nicol (1994) for a detailed discussion of the role of discounting in capital replacement. Appropriate functions describing resale values are assumed to be known, as
are purchase costs. Tax considerations in particular contexts should be taken into
account and modelled.
c(T) = \Big\{ \int_0^T m_0(t)\,dt + R \Big\} \Big/ T,   (12.1)

where m_0(t) is the operating cost rate and R is the replacement cost, and assuming no residual value. From Equation 12.1, it follows that the economic life T^* is the solution of

\int_0^{T^*} m_0(t)\,dt + R = T^* m_0(T^*),
provided it exists. In its discrete-time form the total cost per unit time is c(T) = \{ \sum_{i=1}^{T} m_{0i} + R \}/T, where m_{0i} is the operating cost in time period i. With a discount factor \alpha, discounting to year end, and a residual value function S(T), the net present value (NPV) of all future costs in perpetuity is

c_{NPV}(T) = (1 + \alpha^T + \alpha^{2T} + \cdots) \{ \sum_{i=1}^{T} m_{0i}\alpha^i + \alpha^T [R - S(T)] \}
           = (1 - \alpha^T)^{-1} \{ \sum_{i=1}^{T} m_{0i}\alpha^i + \alpha^T [R - S(T)] \},

whence the corresponding rent (equivalent annual cost) is

c_{rent}(T) = \frac{1 - \alpha}{1 - \alpha^T} \{ \sum_{i=1}^{T} m_{0i}\alpha^i + \alpha^T [R - S(T)] \}.
Notice that as \alpha \to 1, c_{rent}(T) \to c(T), the total cost per unit time. The economic life can be obtained by minimising c_{rent}(T), typically using a spreadsheet by considering a range of values of T.
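The spreadsheet search just described can be sketched in a few lines of code. This is a minimal illustration of minimising the discounted rent over a range of T; the operating cost law, replacement cost, residual value function, and discount factor below are all illustrative assumptions, not values from the text.

```python
# Minimal sketch: economic life by minimising the rent c_rent(T).
# All cost inputs are illustrative assumptions.

def c_rent(T, m0, R, S, alpha):
    """Rent (equivalent annual cost) for replacement at age T.

    m0    : m0[i-1] is the operating cost in year i
    R     : replacement cost
    S     : residual value function S(T)
    alpha : one-year discount factor
    """
    npv = sum(m0[i - 1] * alpha**i for i in range(1, T + 1))
    npv += alpha**T * (R - S(T))
    return (1 - alpha) / (1 - alpha**T) * npv

# Illustrative inputs: operating cost rising 10% a year, replacement
# cost 100, resale value decaying geometrically with age.
m0 = [10 * 1.1**i for i in range(40)]
R = 100.0
S = lambda T: 50 * 0.8**T
alpha = 0.95

# Search over a range of T, exactly as one would in a spreadsheet.
T_star = min(range(1, 40), key=lambda T: c_rent(T, m0, R, S, alpha))
print(T_star, round(c_rent(T_star, m0, R, S, alpha), 2))
```

For T = 1 the leading factor equals one, and the rent reduces to the first year's discounted operating cost plus the discounted net replacement cost, which gives a quick hand check of the implementation.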
12.3.2 Analysing Technological Change Using a Two-cycle Model
The economic life model can be adapted to consider technological change in a
number of ways. One can consider economic factors for new models of equipment
(future operating costs) in a parametric fashion, specifying a model for technological change which then implies operating cost functions, replacement cost and residual values for each replacement cycle into the future (Elton and Gruber 1976).
Alternatively, one can model replacement over a limited time scale, either by
fixing the time horizon, or by fixing the number of replacement cycles. Christer
(1984) did the latter and described a two-cycle model which models the immediate
replacement decision problem by considering existing plant as having age \tau and age-related operating cost m_{0i}, and new plant as having operating cost m_{1i}. In its discrete form, the annuity for this model is

c^2_{rent}(K, L) = \{ \sum_{i=1}^{K} m_{0(i+\tau)}\alpha^i + \alpha^K [R_1 - S_0(K+\tau)] + \sum_{i=K+1}^{K+L} m_{1(i-K)}\alpha^i + \alpha^{K+L} [R_1 - S_1(L)] \} \Big/ \sum_{i=1}^{K+L} \alpha^i.   (12.2)
Here K and L are decision variables, with K modelling the time (from now) to
replacement of the existing asset; K+L is the time to second replacement. The
advantage of this model is that one only need estimate the operating cost of the
existing and new assets (as functions of age), the capital cost for the new asset, R_1, and the age-related resale or residual values of the new and existing assets, S_0, S_1.
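A crude grid search over K and L, of the kind suggested earlier for economic life models, suffices to minimise the two-cycle annuity of Equation 12.2. The sketch below assumes illustrative cost functions and an existing asset age of five years; none of the numbers come from the text.

```python
# Minimal grid search for the two-cycle annuity c2_rent(K, L) of
# Equation 12.2; all cost inputs are illustrative assumptions.

def c2_rent(K, L, tau, m0, m1, R1, S0, S1, alpha):
    """Annuity for replacing the existing asset (age tau) after K years
    and its successor after a further L years, each replacement
    costing R1."""
    npv = sum(m0(i + tau) * alpha**i for i in range(1, K + 1))
    npv += alpha**K * (R1 - S0(K + tau))
    npv += sum(m1(i - K) * alpha**i for i in range(K + 1, K + L + 1))
    npv += alpha**(K + L) * (R1 - S1(L))
    return npv / sum(alpha**i for i in range(1, K + L + 1))

# Illustrative cost structure: the current asset is 5 years old and
# dearer to run than its replacement.
tau = 5
m0 = lambda age: 8.0 + 1.5 * age        # operating cost, existing asset
m1 = lambda age: 5.0 + 1.0 * age        # operating cost, new asset
R1 = 60.0
S0 = lambda age: max(30.0 - 2.0 * age, 0.0)
S1 = S0
alpha = 0.94

grid = [(K, L) for K in range(1, 21) for L in range(1, 21)]
K_star, L_star = min(grid, key=lambda kl: c2_rent(kl[0], kl[1], tau, m0, m1, R1, S0, S1, alpha))
print(K_star, L_star)
```

With only two integer decision variables, exhaustive search over a modest grid is entirely adequate, which is why the literature treats this as a spreadsheet-level calculation.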
12.3.3 A Fixed Planning Horizon Model
In the financial appraisal of projects, a standard approach fixes the time horizon
and determines the NPV of future costs over this horizon (e.g. Northcott 1985).
This fixed horizon model has been studied by Scarf and Hashem (2003) and its
simplicity lends itself to application in complex contexts (e.g. Scarf and Martin
2001). The annuity for this model can be derived from Equation 12.2 above simply
by setting X = K and K + L = h , the length of the planning horizon, and then
considering h as fixed. Whence, there is only one decision variable, X, the time to
replacement. Given the possibility that X = h, that is, no replacement over the planning horizon (in which case we retain the current asset), the annuity function has a
discontinuity at X = h , and X * = h implies that it is not optimal to undertake the
(replacement) project. Furthermore, since the replacement at the end of the horizon
has a fixed cost (with respect to the decision variable X) its inclusion or exclusion
has no effect on the optimal time to replacement. It is natural not to include the
replacement cost at the horizon-end since a standard financial appraisal approach
would only account for revenue costs up to project execution, capital costs at
project execution, subsequent revenue costs up to the horizon-end, and residual
values. Including the replacement at h on the other hand allows cost comparisons
with the two-cycle model and the associated rent, Equation 12.2. We take the
former approach here however and the annuity is
c^h_{rent}(X) = \{ \sum_{i=1}^{X} m_{0(i+\tau)}\alpha^i + \alpha^X [R_1 - S_0(X+\tau)] + \sum_{i=X+1}^{h} m_{1(i-X)}\alpha^i - \alpha^h S_1(h-X) \} \Big/ \sum_{i=1}^{h} \alpha^i, \quad X < h,   (12.3)

and

c^h_{rent}(X) = \{ \sum_{i=1}^{h} m_{0(i+\tau)}\alpha^i - \alpha^h S_0(h+\tau) \} \Big/ \sum_{i=1}^{h} \alpha^i, \quad X = h.   (12.4)
and, with no discounting and constant operating costs m_0 and m_1 for the existing and new assets respectively, the two-cycle model (with its replacement at the end of the second cycle) implies that immediate replacement, K^* = 0, is optimal only if

l_{max}(m_0 - m_1) > 2R.   (12.5)

We can consider a similar argument for the fixed horizon model. Here

c^h_{rent}(X) = [X m_0 + R + (h - X) m_1]/h, \quad X < h, \qquad c^h_{rent}(X) = m_0, \quad X = h,

so that dc^h_{rent}(X)/dX = (m_0 - m_1)/h for X < h, and so dc^h_{rent}(X)/dX > 0 if m_0 > m_1. However, since c^h_{rent}(X) has a discontinuity at X = h, X^* = 0 is optimal only if m_0 > m_1 and c^h_{rent}(0) < c^h_{rent}(h). That is, if (R + h m_1)/h < m_0, that is, if

h(m_0 - m_1) > R.   (12.6)
Thus, comparison of the inequalities at Equations 12.5 and 12.6 shows that the two models have different properties in terms of the behaviour of the optimal policy as a function of the cost parameters: the two-cycle model is inconsistent with standard financial models. However, a simple modification to the model will correct this inconsistency. Scarf et al. (2006) suggest simply omitting the replacement at the end of the second cycle. For the constant-cost case above, the rent becomes c^2_{rent}(K, L) = (K m_0 + R + L m_1)/(K + L), and the optimal policy would be K^* = 0 (L^* = l_{max}) if l_{max}(m_0 - m_1) > R, which is consistent with the fixed horizon model and hence with standard financial appraisal models.
However, it would appear that the two-cycle model, with its two replacements (at t = K and at t = K + L), is applicable in the case of increasing operating costs, and that the modified two-cycle model with one replacement (at t = K only) is applicable for operating costs that are constant or increasing only slowly. However, this issue can be resolved. When operating costs are increasing only slowly, typically L^* does not exist, and in practice L must be constrained such that L \le l_{max} (as pointed out above), since numerically we can only search for L^* over a finite space. In constraining L \le l_{max} under the two-replacements formulation, we impose a replacement at l_{max} when in fact there should not be a second replacement, since L^* does not exist. This suggests that the two-cycle replacement model should be modified in the following subtle way: if there does not exist an L such that c^2_{rent}(K, L) has a minimum strictly within the search space, that is, within \{(K, L) : 0 < K < K_{max}, 0 < L < l_{max}\}, then, when determining the K which minimises c^2_{rent}(K, l_{max}), no replacement cost should be incurred at t = K + l_{max}. Thus the model should be modified so that there is only one replacement. Otherwise the cost hurdle for replacement of the current asset will be set artificially high (inequality at Equation 12.5). Thus, in all practical situations for which operating costs are increasing only slowly, one should use this modified two-cycle model, or the fixed horizon model as a special case.
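The modification described above amounts to a simple decision rule. Below is a minimal sketch for the undiscounted, constant-cost case (the cost figures are illustrative assumptions); it exhibits the threshold l_max(m_0 - m_1) > R at which immediate replacement becomes optimal under the one-replacement formulation.

```python
# Modified two-cycle model: undiscounted, constant-cost sketch.
# With constant operating costs the annuity has no minimum strictly
# inside the search space, so the replacement cost at t = K + l_max
# is dropped and the rent is (K*m0 + R + l_max*m1) / (K + l_max).

def best_K(m0, m1, R, k_max, l_max):
    """K minimising the one-replacement rent over 0 <= K <= k_max."""
    rent = lambda K: (K * m0 + R + l_max * m1) / (K + l_max)
    return min(range(0, k_max + 1), key=rent)

# Immediate replacement (K* = 0) exactly when l_max*(m0 - m1) > R:
print(best_K(m0=12.0, m1=10.0, R=15.0, k_max=30, l_max=10))  # threshold exceeded
print(best_K(m0=12.0, m1=10.0, R=25.0, k_max=30, l_max=10))  # threshold not met
```

In the first call 10 × (12 − 10) = 20 > 15, so replacing immediately is optimal; in the second the hurdle is not met and the existing asset is retained for as long as the search space allows.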
12.3.5 Discussion of Finite Horizon Replacement Models
Using the fixed horizon model or equivalently using the modified two-cycle model
with a finite search space may lead to significant end-of-horizon effects (since
costs beyond the horizon-end are ignored). Thus time to first replacement will
depend on h (or equivalently lmax ). Choice of h (or lmax ) will need to be considered carefully; in practice the horizon may be specified by company policy on
accounting methods and discounting may reduce those costs incurred in the distant
determine the scale of refurbishment of older assets and the level of major parts
replacement and supply within the negotiated contract.
For the presentation of the modelling work in this example, it is necessary to
consider the asset management options open to the corporation in a simple manner,
and a homogeneous sub-fleet of the escalators is considered, with modelling carried out for a typical escalator; this is a reasonable simplification since all escalators in
the sub-fleet were installed at approximately the same time. For this group,
replacement, although crudely costed, was not really a viable option: economic costs were too high and the disruption unacceptable given the duration of replacement
work. Refurbishment by the original manufacturer, replacing worn parts, upgrading
the control system and maintenance access was being carefully considered by the
corporation as a viable strategy for managing the asset life. Cost savings could be
achieved through a reduction in the annual maintenance contract price subsequent
to refurbishment. Thus, put simply, for the escalator group, the corporation was
faced with the decision: continue with the current relatively higher-price maintenance contract or refurbish and benefit from a new relatively lower-priced
maintenance contract. Other benefits would also accrue from refurbishment for
both contractor and the corporation. For the contractor, improved access and safety
for maintenance was part of the refurbishment package. For the corporation, upgrade of the control system would result in fewer unplanned escalator stoppages.
We consider four asset management options: do nothing (continue with the high-price maintenance contract); refurb (renew worn parts, retro-fit a new control system, and proceed with the lower-price maintenance contract); delay refurb (delay refurbishment for up to n years); and replace (a full replacement option, with nominal costs included for comparison purposes). The costs of refurbishment (per escalator) in the present study were obtained from initial quotations from the respective manufacturers: these are $63K for refurbishment. On-going
annual maintenance contract costs (per escalator) are: $9K pre-refurbishment; $7K
post-refurbishment. Prior to refurbishment the cost of replacement of major parts is
in addition to the annual maintenance contract and major parts are replaced on the
basis of condition. Post-refurbishment, the annual maintenance contract includes
replacement of major parts at no extra cost. Given that we might expect major parts
to be replaced somewhat less frequently than dictated by their recommended lives,
we introduce a cost parameter to model such life-extension; this is called the effective life factor, \lambda. \lambda = 1 implies that major parts are replaced at a frequency corresponding to their recommended life (for example, once every 25 years for the steps at a cost of $48K), with the replacement frequency scaled by 1/\lambda (\lambda = 2 implies
replacement of steps once every 50 years). The cost of a replacement ($170K) is a
nominal figure and used mainly for crude comparison with refurbishment. In
practice, replacement may cost significantly more than this.
The corporation recommend a discount rate of r = 0.11 and a projected inflation rate of i = 0.05. This corresponds to an effective discount rate, \rho, of 0.057 (1/(1 + \rho) = (1 + i)/(1 + r)). Integral to the refurbishment option is the upgrading of the escalator control system to allow power-dip ride-through; this facility prevents unnecessary emergency stops caused by momentary power loss that can cause injuries to passengers. However, the effectiveness of the ride-through facility is uncertain; hence we introduce another cost parameter, control system
Table 12.1. Annuities ($000s per escalator per year) for the modified two-cycle model, as a function of K, the length of the first cycle (years), and L, the length of the second cycle (years)

 K \ L      1       3       5      17      19      21
  1      491.8   298.0   233.0   146.6   142.5   139.2
  3      302.4   236.1   202.7   143.8   140.5   137.8
  5      241.1   206.8   186.2   142.5   139.7   137.4
  7      211.6   190.2   176.1   142.1   139.7   137.7
  9      193.9   179.3   169.0   141.7   139.6   137.9
 11      182.2   171.6   163.7   141.4   139.6   138.1
 13      173.9   165.9   159.7   141.2   139.6   138.3
 15      167.8   161.5   156.5   140.9   139.6   138.4
 17      163.1   158.0   154.0   140.8   139.6   138.5
 19      159.4   155.3   151.9   140.6   139.5   138.6
 21      156.4   153.0   150.2   140.4   139.5   138.7
The cost parameters in Table 12.1 and Figure 12.1 are held at intermediate
values. In Figure 12.2, we present annuities for a number of replacement options
as a function of each of the cost parameters. These replacement options correspond
to those considered by the corporation, with refurb referring to immediate refurbishment (in year 1), and delay refurb referring to refurbishment in year 10 (from
time of study). Given the size of the fleet, a constraint on the number of escalators
that can be refurbished at any one time and the duration of refurbishment, we
would expect the refurbishment programme to last some 15 years and therefore a
significant proportion of the fleet would experience this kind of delay prior to
refurbishment. Therefore we include it as a particular policy for indicative purposes. We use the fixed horizon model here in order to make comparisons between
annuities; this is because one would wish to compare the cost of different options
over the same horizon. Equivalently, we could use the modified two-cycle model
with the additional constraint K+L= h= 22 (years), say.
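A stripped-down version of this comparison can be sketched numerically. Only the quoted contract prices ($9K and $7K per year), the refurbishment cost ($63K), the effective discount rate (0.057), and the horizon (h = 22) are taken from the text; major-parts replacement, penalty costs, and residual values are ignored, so the resulting annuities are indicative only (in particular, omitting the pre-refurbishment major-parts costs understates the benefit of refurbishment).

```python
# Stripped-down annuity comparison over a fixed horizon h = 22 years:
# "do nothing" (maintenance $9K/yr) vs. "refurb now" ($63K refurbishment
# plus maintenance $7K/yr). Major-parts, penalty and residual costs are
# deliberately omitted, so the figures are indicative only.

h = 22
rho = 0.057                         # effective discount rate
alpha = 1.0 / (1.0 + rho)           # one-year discount factor
discount_sum = sum(alpha**i for i in range(1, h + 1))

def annuity(costs):
    """Equivalent annual cost of a stream costs[i-1] paid at year i."""
    npv = sum(c * alpha**i for i, c in enumerate(costs, start=1))
    return npv / discount_sum

do_nothing = annuity([9.0] * h)
refurb_now = annuity([63.0 + 7.0] + [7.0] * (h - 1))

print(round(do_nothing, 2), round(refurb_now, 2))
```

In this deliberately incomplete version refurbishment does not pay, which illustrates why the condition-based major-parts costs and penalty costs carried in the full model are decisive for the comparison.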
Figure 12.1a,b. Annuities (HK$000s per escalator per year) for the modified two-cycle model with refurbishment at K years from now and operation for a further L years (curves for L = 10, 12, 14, 16, 18, 21), together with annuities for the fixed horizon model with h = 22 years (X < 22: bold, solid curve). Cost parameters as Table 12.1 except: a effective discount rate equals 0.06 (equivalent to an inflation rate of approximately 0.05 and a discount rate of 0.11); b no discounting.
From Figure 12.2 we can see that the optimum policy is certainly sensitive to these cost factors, with the influence of the cost parameters as expected. Threshold values that
lead to a step-change in the optimum policy (option) can be observed from these
figures. Thus while estimation of the penalty cost of failure, for example, may be
difficult and contentious, the importance of its effect can be observed. This may
then provide an incentive for further investigation of this parameter or discussion
about whether its true value is above or below the threshold of policy change.
As a final note for the escalator replacement problem in particular, one could argue that the cost of the differing options or policies will reflect the maintenance contractor's profit requirement, whatever the details of the arrangement, and that therefore the total costs of the options would be expected to vary very little. What can differ, however, is that some options may lead to lower risk (for example, where the contractor bears the cost of major parts wear-out, which may be subject to significant uncertainty), and lower risk is certainly desirable from the point of view of the operator.
Figure 12.2a–d. Annuities (per escalator) as a function of cost parameters for the fixed horizon model, with h = 22 years, for the various refurbishment/replacement options (do nothing; refurb; delay refurb; replace): a annuity vs. effective life parameter; b annuity vs. penalty cost of failure; c annuity vs. nominal discount rate; d annuity vs. horizon length h. Cost parameter values when not varying are set at: effective life, 1.5; penalty cost of failure, $5K; nominal discount rate, 0.11; control system retro-fit effectiveness, 75%; refurbishment delay cost, $10K.

Consider now a fleet consisting of sub-fleets classified on the basis of class (e.g. vehicle-type) and age (or condition), so that the operator of the fleet is concerned with the replacement of sub-fleets, and not with the replacement of individual equipment or of the entire fleet. For this fleet, it is natural to focus on the replacement of particular sub-fleet(s). The economic life models of the previous section must be extended, given that the replacement of particular sub-fleets has cost implications for the rest of the fleet.
The total discounted cost of the replacement schedule L = (L_1, ..., L_N) over the horizon is then

c_{tdc}(N, L; h) = \sum_{i=1}^{N} \Big\{ \sum_{s=t_{i-1}+1}^{t_i} m_i(s)\nu^s + \nu^{t_i} [ n_{r+i} R_{r+i} - S_i(t_i) ] \Big\},   (12.7)

where t_i = \sum_{j=0}^{i} L_j with L_0 = 0. Here m_i(\cdot) is the age-related operating cost of the whole fleet in cycle i; S_i(\cdot) is the age-related resale value of plant in sub-fleet i; R_{r+i} is the cost of each replacement plant in sub-fleet r+i; and \nu is the discount factor. The costs m_i(\cdot) and S_i(\cdot) may be expressed as

m_i(s) = \sum_{k=i}^{r+i-1} \sum_{j=1}^{n_k} M_k(\tau_{kj} + s), \quad (i = 1, ..., N),

S_i(L_i) = \sum_{j=1}^{n_i} S_i^1(\tau_{ij} + L_i), \quad (i = 1, ..., N),

where M_k(\cdot) is the age-related operating cost per unit time for an individual plant in sub-fleet k (k = 1, ..., r+N), and S_i^1(\cdot) is the age-related resale value for an individual plant in sub-fleet i. (Also, \tau_{kj} = 0 for k > r.) Appropriate penalty costs,
associated with failures, may be incorporated into the operating costs.
The annuity, c_{tdc}(N, L; h) / \sum_{i=1}^{h} \nu^i, or other suitable objective function may then be minimized subject to the constraint \sum_{i=1}^{N} L_i = h. Technological change is allowed for in that costs relating to proposed plant for cycles 2, ..., N may be
assigned as appropriate. The optimum replacement schedule may be obtained by
minimizing the objective function over all possible schedules. In practice the range
of possible schedules would be narrowed greatly by the experience of the operator.
However, as the decision-maker will not have a firm value for the horizon length,
the optimum policy must be robust to variation in h. Furthermore, because the fleet
is mixed, both different replacement schedules and different planning horizon
lengths will give rise to different age compositions of the fleet at the end of the
horizon. Thus replacement policies may need to be compared not just on the basis
of cost but also on the basis of the age composition of the fleet at the end of the
planning horizon. This final age composition can be considered as quantifying the
end-of-horizon effect.
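To make the computation concrete, Equation 12.7 can be evaluated for a candidate schedule along the following lines. This is a minimal sketch: the cost functions, fleet sizes and discount factor passed in are hypothetical placeholders, not data from the application of Section 12.3.8.

```python
# Sketch: total discounted cost of Equation 12.7 for a candidate schedule
# L = (L1, ..., LN). All arguments are supplied by the caller; nothing here
# is taken from the coach fleet study discussed later.
def total_discounted_cost(L, m, R, S, n, r, v):
    """m[i](s): operating cost of the whole fleet in period s of cycle i;
    S[i](t): resale value of sub-fleet i when replaced at time t;
    R[k], n[k]: unit replacement cost and size of sub-fleet k;
    r: number of existing sub-fleets; v: one-period discount factor."""
    cost, t_prev = 0.0, 0
    for i, L_i in enumerate(L, start=1):
        t_i = t_prev + L_i
        # operating costs over cycle i, discounted to time zero
        cost += sum(m[i](s) * v**s for s in range(t_prev + 1, t_i + 1))
        # replacement outlay less resale value at the end of cycle i
        cost += v**t_i * (n[r + i] * R[r + i] - S[i](t_i))
        t_prev = t_i
    return cost
```

The optimum schedule would then be found by minimizing this quantity (or the corresponding annuity) over all vectors L whose elements sum to h.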
Non-uniform usage, particularly between sub-fleets, may be allowed for by
varying the fleet size at replacements. For example, if older plant are underutilized, a smaller number of new plant would be required to meet the demand
currently placed on an older sub-fleet. This effectively reduces the replacement cost for that sub-fleet by a factor equal to the ratio of the utilization of the old sub-fleet to that of the new. Of course, other more complex methods of accounting for differing
usage may be considered. Given sufficient data, operating costs could be quantified
in terms of usage and optimum policy may be obtained given forecasts for usage of
sub-fleets over the planning horizon.
The models may be extended to the case in which sub-fleets are retired as
spares. The number of sub-fleets would simply increase by one at each replacement, with the costs associated with the retired sub-fleet added. Predicting operating costs for a retired sub-fleet would be difficult, however, as it is likely that no data
would be available for this. Also it is assumed that equipment is bought new: in
principle it is a simple matter to extend Equation 12.7 to the case in which used
equipment may be purchased.
Note that the formulation as presented allows for the possibility for a sub-fleet
to be composed of a single unit of equipment. This may be appropriate if the fleet
is small. The complexity of the computational problem increases rapidly as the
number of sub-fleets increases. However we do not consider efficient algorithms
for determining optimum policy here.
12.3.8 Application of a Replacement Model for an Inhomogeneous Fleet
Example 12.2
Scarf and Hashem (1997) consider the inter-city coach fleet operated by Express
National Berhad in Malaysia. The fleet comprised 160 vehicles of five vehicle types of varying ages, with maintenance cost modelled as $M(\tau) = a\tau^b$ and resale values $S(\tau) = 0.6R(0.81)^{\tau}$, for replacement cost $R$ (Table 12.2). The data available
were not sufficient for obtaining the maintenance cost model for all vehicle types; for example, for the MAN, only data relating to their first year of operation
were available. Furthermore, for older vehicles the costs appeared to be decreasing.
This could perhaps be put down to under-utilization (partial retirement) and also
neglect of vehicles reaching the end of their useful life. It was therefore necessary
to pool the data to obtain reasonable cost models. The fitted maintenance cost
models for the Cummins, Isuzu CJR and MAN were obtained by first fitting an
overall cost model to data on vehicles up to eight years old, and then scaling this
model to the costs of the individual vehicle types in the manner described in
Christer (1988). The costs for the older sub-fleets, the Mitsubishi and Isuzu CSA,
were taken as constant. Penalty costs for breakdowns on the road were also
modelled; see Scarf and Hashem (1997) for a full discussion of this. It was known
that the Mitsubishi and Isuzu sub-fleets were in partial retirement and candidates
for immediate replacement, capital expenditure permitting. The usage of sub-fleets
was unknown, although with a daily requirement for 125 vehicles, it was reasonable to suppose that the usage level for the Mitsubishi and Isuzu sub-fleets was about half that of the other newer sub-fleets. This assumption led to the null optimal policy (replace the Mitsubishi and Isuzu CSA sub-fleets as soon as possible), which is uninteresting from a model validation point of view. Therefore, in order to illustrate the replacement model, we consider the following sub-problem in detail: investigate replacement policy for the fleet comprising the
Cummins, Isuzu CJR and MAN, assuming a fixed fleet size (93 vehicles) and
uniform usage.
Table 12.2. Fleet composition by vehicle type showing purchase cost, R, maintenance cost
parameters and age distribution at time of replacement study
Vehicle type    R (M$000s)    a       b       Age (years)    Number
Isuzu CSA       750           55.6    0       >12            30
Mitsubishi      800           57.8    0       10–11          28
Cummins         500           24.7    0.72    8–9 / 6–7      18 / 8
Isuzu CJR       300           11.1    0.72    4–5            45
MAN             450           18.4    0.72    2–3            22

The parameters $a$ and $b$ refer to the maintenance cost model $M(\tau) = a\tau^b$.
For the three sub-fleets problem, optimal policy is presented for each of the six
replacement schedules in Table 12.3. Horizon lengths, h, of 120, 150 and 180
months (15 years) are considered. For the fleet as a whole it is difficult to determine optimal replacement policy as two sub-fleets are partially retired, and usage
levels are unknown. The problem is made more difficult because it is likely that
the maintenance of these sub-fleets is less thorough than that for the newer sub-fleets. Under simple usage assumptions, the optimum policy is to replace the Mitsubishi and Isuzu CSA sub-fleets immediately. For the particular sub-problem relating to the Cummins, Isuzu CJR and MAN, it appears that the optimum replacement schedule depends on the length of the horizon. Also, the end-of-horizon effect, as represented by the mean age of the fleet, varies with the replacement schedule. The choice of optimal policy is therefore not straightforward. Over a fifteen year planning horizon, there is little to choose between the three schedules Cummins-IsuzuCJR-MAN, Cummins-MAN-IsuzuCJR and MAN-Cummins-IsuzuCJR, both in terms of cost and age. Sensitivity to model parameters is considered more fully in Scarf and Hashem (1997).
Table 12.3. Optimum policy for each schedule for various horizon lengths, h=120,150,180
months; penalty cost, M$2000, annual discount rate 0.97. Cost of equivalent rent (M$000s
per month for whole fleet), average age of fleet at end of horizon, and optimum cycle
lengths. Replacement schedules: CIM Cummins-IsuzuCJR-MAN, etc.
h      Schedule    Cost/month (M$000s)    Age/years
120    CIM         745.8                  7.0
120    CMI         763.5                  9.9
120    ICM         816.7                  8.4
120    IMC         816.7                  8.4
120    MCI         782.1                  5.1
120    MIC         838.1                  6.5
150    CIM         779.4                  8.6
150    CMI         767.3                  5.6
150    ICM         844.7                  4.6
150    IMC         848.1                  4.4
150    MCI         782.8                  5.7
150    MIC         850.6                  7.7
180    CIM         787.9                  6.0
180    CMI         778.1                  6.0
180    ICM         841.0                  5.2
180    IMC         843.9                  5.4
180    MCI         794.6                  6.7
180    MIC         859.9                  4.4

[Optimum cycle lengths L1–L4 omitted.]
in year $t$. For network expansion projects $f_t^0 = 0$. Let $C$ (>0) be the capital cost of project P. Assume income cashflows are negative and expenditure cashflows are positive, and that all cashflows are incurred at the year end and discounted at rate $v$. If project P is released in year $x$ from now then the total cashflow over $h$ years from now will be

$$\sum_{t=1}^{x-1} f_t^0 v^t + v^x \Bigl[ \sum_{t=0}^{h-x} f_t^1 v^t + C \Bigr]. \qquad (12.8)$$

If project P is not released then the cashflow over the horizon will be $\sum_{t=1}^{h} f_t^0 v^t$. Define the gain from releasing project P in year $x$ to be the difference between these cashflows:

$$g_P(x; h) = \sum_{t=1}^{h} f_t^0 v^t - \Bigl[ \sum_{t=1}^{x-1} f_t^0 v^t + v^x \Bigl\{ \sum_{t=0}^{h-x} f_t^1 v^t + C \Bigr\} \Bigr] = \sum_{t=x}^{h} \bigl( f_t^0 - f_{t-x}^1 \bigr) v^t - v^x C.$$
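The last expression for the gain translates directly into code. A small sketch, with hypothetical cashflows (a project costing 100 that reduces annual expenditure from 30 to 10 over a ten-year horizon):

```python
# g_P(x; h) = sum_{t=x}^{h} (f0[t] - f1[t-x]) v^t - v^x C, the gain from
# releasing project P in year x. Expenditure is positive, income negative.
def gain(f0, f1, C, v, x, h):
    return sum((f0[t] - f1[t - x]) * v**t for t in range(x, h + 1)) - v**x * C

# Hypothetical data: without the project the network costs 30 per year,
# with it 10 per year; the project costs 100, and v corresponds to 8%.
f0 = {t: 30.0 for t in range(0, 11)}
f1 = {t: 10.0 for t in range(0, 11)}
print(gain(f0, f1, C=100.0, v=1 / 1.08, x=1, h=10))  # positive: worth releasing
```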
released subject to the constraints that the capital investment budget is not exceeded in each year. That is, maximize

$$\sum_{i=1}^{n} \sum_{j=1}^{k} x_{ij}\, g_{ij}(h)$$

subject to

$$\sum_{i=1}^{n} C_i x_{ij} \le B_j, \quad j = 1, \ldots, k, \qquad (12.9)$$

$$\sum_{j=1}^{k} x_{ij} \le 1, \quad i = 1, \ldots, n, \qquad (12.10)$$

$$x_{ij} = 0, 1,$$

where $x_{ij} = 1$ if project $i$ is released in year $j$, $g_{ij}(h)$ is the corresponding gain, $C_i$ is the capital cost of project $i$, and $B_j$ is the capital budget for year $j$.
Constraint set Equation 12.9 ensures that the budget for year j is not exceeded.
Constraint set Equation 12.10 ensures that project i is released at most once over
the planning horizon. Note that if an individual project has negative gain whatever
its execution time, then the contribution to the objective function from this project
will be greatest when this project is not released over (0, k). Typically such planning may be informative over the planning horizon, but only decisions relating to
the immediate future (one to two years) would be acted on. Therefore policy would
be continually updated, implying a rolling horizon approach. Where a network
consists of many identical components, the modelling of project planning may be
extended to the case in which a proportion of similar projects are released in a
given year. This could be done by formulating the capital rationing model (CRM)
as a mixed integer programming problem.
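For small instances the 0-1 programme of Equations 12.9 and 12.10 can be solved by enumeration; a sketch, with hypothetical gains, costs and budgets (a realistic instance would go to an integer programming solver):

```python
from itertools import product

def best_release_plan(g, C, B, k):
    """g[i][j]: gain from releasing project i in year j (1..k); C[i]: capital
    cost of project i; B[j]: capital budget in year j. Returns the best total
    gain and the release year of each project (0 = not released)."""
    n = len(g)
    best, best_plan = 0.0, (0,) * n
    for plan in product(range(k + 1), repeat=n):
        # Equation 12.9: the budget must not be exceeded in any year
        if any(sum(C[i] for i in range(n) if plan[i] == j) > B[j]
               for j in range(1, k + 1)):
            continue
        # Equation 12.10 holds by construction: one release year per project
        value = sum(g[i][plan[i]] for i in range(n) if plan[i] > 0)
        if value > best:
            best, best_plan = value, plan
    return best, best_plan

g = [{1: 50.0, 2: 40.0}, {1: 45.0, 2: 35.0}]       # hypothetical gains
print(best_release_plan(g, C=[60.0, 60.0], B={1: 100.0, 2: 100.0}, k=2))
# -> (85.0, (1, 2)): the budget forces the two projects into different years
```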
Consider now dependence between projects. For example, a major expansion
project, while not replacing existing assets, may have significant operating cost or
performance implications for particular assets: the building of a large ring-main in
a water supply network is one such example. Essentially, if two projects P1 and P2 interact in this way, then new projects P1′ = (P1, not P2), P2′ = (not P1, P2), and P12′ = (P1, P2) would have to be introduced, along with the constraint to ensure that at most one of P1′, P2′, and P12′ is released over the planning horizon. While this
approach may lead to a significant increase in the number of projects in the
model, in principle the solution procedure would remain unchanged. The existence
of future-cost dependencies between projects would have to be identified by the
network owner. This may be extremely difficult in practice. However such dependency would very much characterize the network replacement problem, and therefore the approach described is an advance over current methods. A similar approach has been taken by Santhanam and Kyparisis (1996) in modelling dependency in the project release of information systems. Capital costs may be considered
simply using the concept of shared set-up. It is possible that it may be optimal to
release both P1 and P2 during the planning horizon, but not simultaneously. This
presents a more difficult modelling task, at least without introducing many pseudo-projects. For example, we could consider: release P1 at time $s$ and P2 at time $t$; however, for $k = 10$, say, this would mean the introduction of 25 variables, $x_{(P_1 P_2)(s,t)}$, for the P1, P2 decision alone!
where x is the execution time for the project under capital rationing. The marginal
increase in revenue expenditure would be found by summing over all projects. In a
similar manner, the marginal increase in revenue expenditure due to projects delayed
in year j could be found by summing over all projects with x = j , and this measure
indicates how much more capital investment would be required to reduce revenue
expenditure to the optimum level.
Uncertainty in the cashflow/performance model parameter estimates, reflecting
the extent of currently available information about particular components and
potential projects, and the extent of technological developments (new materials and
techniques), may be propagated through into uncertainty in the gain function, g (.) .
This would be most easily done using the delta method; see Baker and Scarf (1995)
for an example of this in maintenance. The variance of the gain, as well as the
expected gain, may then be used to produce the project priority list and those
projects for which the expected gain is high and the uncertainty in the gain
(variance of the gain) is low are candidates for release; these projects would be
viewed as sound investments. Markowitz (1952) is the classic reference here; for a
more recent discussion see Booth and King (1998). Also, a real options approach
might be taken (e.g. Bowe and Lee 2004). Where there are no data regarding a
potential project, there will be no objective basis for determining if and where the
project lies on the project priority list. One possible approach to this problem
would be to use data relating to other projects that are similar in design. Also
subjective data may be collected, and used to update component data for the whole
network in the manner described in OHagan (1994) and Goldstein and OHagan
(1996) in the context of sewer networks. These methods are particularly useful for
multi-component systems in which there are only limited data for a limited number
of individual components. On the other hand, it may be that the income cashflow
may be deterministic in some situations. For example, expansion of the network
may be initiated by legislation, and the compensation for the investment costs is fixed and predetermined per customer connection.
The age-based recursion may be written

$$f_t(n) = \min \begin{cases} K: & v \left[ C_{t+1}(n+1) + f_{t+1}(n+1) \right] \\ R: & P_t - S_t(n) + v \left[ C_{t+1}(1) + f_{t+1}(1) \right] \end{cases} \qquad (12.11)$$

[Figure 12.3. Network representation of the age-based replacement model, with maximum age N = 5 and horizon T = 4; K arcs denote keeping the asset and R arcs replacing it.]
If the n-period old asset is kept (K), the operating and maintenance (O&M) cost
Ct+1(n+1) is incurred for the asset in the following period. As the asset is age n+1
at the end of the period, ft+1(n+1) defines the costs going forward. (This is why ft
is often referred to as the cost-to-go function in dynamic programming.) If the
asset is replaced (R), then a salvage value St(n) is received and a purchase price Pt
is paid for a new asset. The new asset is utilized for the period as the state
transitions to an age of 1, defined by costs ft+1(1) going forward. If the asset
reaches the maximum age of N, then only the replace decision is feasible.
When the horizon time T is reached, the asset is salvaged such that
$$f_T(n) = -S_T(n) \qquad (12.12)$$
(12.12)
Table 12.4. Values of $f_t(n)$ for Example 12.3 (a dash marks an unreachable or infeasible state)

t\n         1            2            3            4            5
0           –            –            –         $47,996        –
1        $16,332         –            –            –         $35,602
2         –$407        $6,292         –            –            –
3       –$17,411     –$12,455      –$7,353         –            –
4       –$35,000     –$31,500     –$28,350     –$25,515        –
Table 12.4 shows the results of solving the dynamic programming algorithm.
The values in the final row (f4 (n)) are the negative salvage values received for a
given asset of age n at that time. To illustrate a calculation at t = 3, consider n = 1.
Substituting into Equation 12.11:
$$f_3(1) = \min \begin{cases} K: & 0.893\,(\$12{,}000 - \$31{,}500) = -\$17{,}411 \\ R: & \$50{,}000 - \$35{,}000 + 0.893\,(\$10{,}000 - \$35{,}000) = -\$7{,}321 \end{cases} = -\$17{,}411$$
The recursion continues in this fashion until $f_0(4)$ is evaluated, with the decision
to replace the asset immediately with a new asset. This new asset is retained through
the horizon. The net present value cost of this sequence of decisions is $47,996.
The benefit of using this model, in addition to allowing for replacements after
each period, is that periodic costs are explicitly modeled on each arc in the network.
This allows for detailed cost modelling of technological change, as in Regnier et al.
(2004) or those costs associated with after-tax analysis, as in Hartman and Hartman
(2001).
A similar line of models has also been developed in which the condition of the asset, not its age, is tracked (e.g. Derman 1963). As opposed to moving from state to state by increasing the age of the asset, there is some probability that the asset will degrade to a lower condition during a period. The work assuming stochastic deterioration has been extended to include technological change (Hopp and Nair 1994) or to consider probabilistic utilization (Hartman 2001).
12.5.2 Period Based Model
Wagner (1975) offered an alternative dynamic programming formulation for the
equipment replacement problem in which the state of the system is the time period
and the decision at each period is the length of time to retain an asset. This model
is described in the network in Figure 12.4. The nodes represent the state of the
system (time period) and the arcs connecting two nodes represent the decision to
keep an asset in service between those time periods.
The objective is to find the sequence of service lives that minimizes costs from
time 0 through time T. (As previously, T = 4 in the figure.) Assuming costs along
an arc connecting node t to node t+n are defined as net present value costs at time
t, the optimal sequence of decisions can be determined by solving the following
recursion:
$$f(t) = \min_{n \le N,\; t+n \le T} \left\{ c_{tn} + v^n f(t+n) \right\}, \quad t = 0, 1, \ldots, T-1 \qquad (12.13)$$
where $c_{tn}$ represents the cost of retaining the asset for $n$ periods from period $t$. Using our previous notation, $c_{tn}$ is defined as

$$c_{tn} = P_t + \sum_{j=1}^{n} v^j C_{t+j}(j) - v^n S_{t+n}(n). \qquad (12.14)$$
This model can be solved similarly to the age-based model, assuming that
f(T) = 0 is substituted into Equation 12.13. Note that the network in Figure 12.4
assumes that a new asset is purchased at time 0. To include the option to keep or
replace an asset owned at time zero, another set of arcs must be drawn, emanating
from node 0, representing the length of time to retain the owned asset with its
associated costs. As these arcs parallel those illustrated in Figure 12.4, the higher
cost parallel arcs can be deleted, as they will not reside on the optimal path. This
can be completed in a pre-processing step, with the recursion ensuing as defined.
Example 12.4
Utilizing the same data from Example 12.3, the network in Figure 12.4 represents
the options associated with purchasing a new asset in each period. We would add
an arc from node 0 to node 1 to represent the decision to retain the four-period old
asset for one additional period (to its maximum feasible age of 5).
Table 12.5 provides the net present value costs (at time t) on the arcs from node
t to node t+n. The arc from node 0 to 1 represents the cost of retaining the four-period old asset for one period, as this is cheaper than salvaging the used asset and
purchasing a new asset for one period of use. The values of c02, c03, and c04 include
the revenue received for salvaging the four-period old asset at time zero. With the
values in Table 12.5, the dynamic programming recursion in Equation 12.13 can be
solved.
Table 12.5. Net present value costs (at time t) on the arcs from node t to node t+n

t \ t+n       1           2           3           4
0         –$1,989     $17,868     $33,051     $47,996
1                     $27,679     $43,383     $58,566
2                                 $27,679     $43,383
3                                             $27,679
For example,

$$f(2) = \min \left\{ \$43{,}383,\; \$27{,}679 + 0.893\,(\$27{,}679) \right\} = \$43{,}383,$$
defining that it is cheaper to keep the asset for two periods (from the end of period
2 to the end of the horizon) rather than replacing it after one period of use. Continuing in this manner, it is found that f(0) = $47,996, signaling that the four-period old asset should be sold and the new asset should be retained through the horizon. This is the same solution found with Bellman's model.
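The same example can be checked against Equations 12.13 and 12.14 directly. The sketch below rebuilds the arc costs of Table 12.5 (including the time-zero arcs for the owned four-period old asset) and solves the recursion with the exact discount factor 1/1.12:

```python
P = 50_000
OM = [10_000, 12_000, 14_400, 17_280, 20_736]     # O&M in years 1..5 of a cycle
SV = [50_000, 35_000, 31_500, 28_350, 25_515, 22_964]
N, T = 5, 4
v = 1 / 1.12

def c(n):
    """NPV cost of buying new, keeping for n periods, then salvaging (Eq. 12.14)."""
    return P + sum(v**j * OM[j - 1] for j in range(1, n + 1)) - v**n * SV[n]

# Arc costs from node t to t+n; the time-zero arcs account for the owned asset:
# either keep it one more period (0 -> 1), or salvage it now and buy new.
cost = {(t, n): c(n) for t in range(T) for n in range(1, min(N, T - t) + 1)}
for n in range(2, T + 1):
    cost[(0, n)] = c(n) - SV[4]                    # salvage the old asset now
cost[(0, 1)] = min(cost[(0, 1)] - SV[4],           # replace immediately, or
                   v * (OM[4] - SV[5]))            # keep the old asset 1 period

f = {T: 0.0}
for t in range(T - 1, -1, -1):                     # Equation 12.13
    f[t] = min(cost[(t, n)] + v**n * f[t + n]
               for n in range(1, min(N, T - t) + 1))
print(round(f[0]))  # -> 47996, as found with Bellman's model
```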
While this model can be shown to be more computationally efficient than the
age-based model, it is the ease with which multiple challengers (as parallel arcs) or
technological change is modelled that has led to numerous extensions in the
literature. See Oakford et al. (1984), Bean et al. (1985, 1994), and Hartman and
Rogers (2006).
12.5.3 Cumulative-usage Based Model
Recently, Hartman and Murphy (2006) offered a third dynamic programming
formulation for the equipment replacement problem following the form of the
classical knapsack model. The model determines the number of times an asset is
used for a given length of time over some horizon.
The dynamic program is described by the network in Figure 12.5. The y-axis defines the periods, 1 through T, while the x-axis identifies the stages, in each of which retaining an asset for a given length of time (1 through N) is evaluated. In the
figure, the order is ages 4, 3, 2, and then 1. Thus, in the first stage of the dynamic
program, the number of times to retain an asset for four consecutive periods is
analyzed. (For this small example with T = 4, the asset can only be retained for
four periods once.) In the second stage, it is evaluated whether an asset should be
retained for three periods. In the third stage, it is evaluated whether an asset should
be retained for two periods either once (for two periods of total service) or twice
(for four periods of total service).
[Figure 12.5. Network representation of the cumulative-usage model, T = 4.]
A node in the network represents the cumulative service that has been accrued
through a given stage. For example, after the first stage in Figure 12.5, either 0 or 4
periods of service have been reached. As the horizon is 4, a solution must ultimately
result in 4 periods of service. As with the other dynamic programming models, the goal is to find the minimum cost path from the initial node, representing no service at time zero, to the final node, representing an entire horizon's worth of service after the
final stage.
To determine an optimal solution it is assumed that the costs are stationary and
the stages (lengths of service) are ordered according to increasing annualized costs.
Thus, before the recursion can be solved, the annualized costs of keeping an asset
for each possible service life must be computed such that the stages can be ordered
accordingly.
Example 12.5
We revisit the previous examples. From the given costs, the annual equivalent costs are computed as given in Table 12.6. For example, retaining the asset for two years costs the equivalent of $25,670 per year, assuming a 12 percent interest rate. The net present value (NPV) costs are also given. We restrict the set of decisions to those of a new asset, namely how many to purchase and how long to retain them
over the finite horizon.
Table 12.6. Annual equivalent costs of keeping the asset for up to five years
Age       O&M         SV          AEC         NPV
0           –        $50,000        –           –
1        $10,000     $35,000     $31,000     $27,679
2        $12,000     $31,500     $25,670     $43,383
3        $14,400     $28,350     $24,384     $58,566
4        $17,280     $25,515     $24,202     $73,511
5        $20,736     $22,964     $24,540     $88,462
Given the information in Table 12.6, the stages are ordered according to ages 4,
3, 5, 2, and 1, as the annual equivalent costs increase accordingly. As an asset is
only required for four periods, the age 5 cost can be ignored.
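The annual equivalent costs in Table 12.6, and hence the stage ordering, can be reproduced from the O&M and salvage data alone; a sketch at a 12% interest rate:

```python
P, r = 50_000, 0.12
OM = [10_000, 12_000, 14_400, 17_280, 20_736]   # O&M in years 1..5 of ownership
SV = [50_000, 35_000, 31_500, 28_350, 25_515, 22_964]  # salvage value at age n
v = 1 / (1 + r)

def npv(n):
    """NPV cost of buying new, keeping for n years, then salvaging."""
    return P + sum(v**j * OM[j - 1] for j in range(1, n + 1)) - v**n * SV[n]

def aec(n):
    """Annual equivalent cost: NPV times the capital recovery factor."""
    return npv(n) * r * (1 + r)**n / ((1 + r)**n - 1)

order = sorted(range(1, 6), key=aec)
print(order)  # -> [4, 3, 5, 2, 1], the stage ordering used in Example 12.5
```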
According to Figure 12.5, an asset can be retained a maximum of one time for
four years, at a cost of $73,511. Thus, the states in the first stage and their values
are
f1 (0) = 0,
f1 (4) = $73,511.
Similar reasoning defines f2(0)=0, f2(3)=$58,566, and f2(4)=$73,511. For the third
stage, the decisions are more interesting because an asset can be retained for two
years twice in the sequence. Thus
f_3(0) = 0,
f_3(2) = $43,383,
f_3(3) = $58,566,
f_3(4) = $73,511.
The final stage evaluates using assets for a single period with previous combinations (three-period and two-period aged assets). It can be shown that the optimal decision is to retain the asset for all four periods at a net present value cost of
$73,511. Note that this is the same decision found with the two previous formulations, as $73,511 less the salvage value of the four-period old asset ($25,515) is
$47,996.
This recursion was not developed in order to provide another computational
approach to the equipment replacement problem. Rather, it was developed to illustrate the relationship between the infinite and finite horizon solutions under stationary costs. Specifically, as the optimal solution to the infinite horizon problem is to
repeatedly replace an asset at its economic life (age which minimizes equivalent
annualized costs), the question being investigated was whether the solution (replacing at the economic life) translates to the finite horizon case.
It was shown that using the infinite horizon solution provides a good answer
when O&M costs increase over the life of an asset more drastically than salvage
values decline. In the case when the salvage value declines are more drastic than
the O&M cost increases, it is generally better to retain the final asset in the sequence for a period longer than the economic life of the asset. For the cases when
O&M cost increases and salvage value declines are similar, then it is beneficial to
solve a dynamic programming recursion to find the optimal policy.
12.5.4 Infinite Horizon Considerations
The solution of a dynamic programming algorithm assumes that the horizon is
finite. In the case of an infinite horizon in which an asset is expected to remain in
service indefinitely, it may be possible to identify an optimal time zero decision.
Bean et al. (1985) show that if the time zero decision for an equipment replacement problem does not change for N consecutive horizons, where N is the maximum age of an asset, then the decision is optimal for any length of horizon, including an infinite horizon. Unfortunately, this does not guarantee the existence of an
optimal time zero decision.
For the age or period based dynamic programming recursions, the models must
be solved over T, T+1, T+2, ..., T+N horizons. If the time zero decision does not
change for these problems, then the optimal time-zero decision is found. If this is
not the case, the progression must continue until N consecutive time zero decisions
are identified. This may be more easily facilitated using a forward recursion. In the
period-based model, this requires defining f(t) as a function of f(t−1), f(t−2), etc.,
with f(0) = 0. We illustrate by revisiting Example 12.4.
Example 12.6
We illustrate the first few stages of the forward recursion, as its implementation is
better suited for infinite horizon analysis. As noted earlier, the recursion is
initialized with f(0) = 0. Stepping forward in time, it is assumed that T = 1. Using the values from Table 12.5, it is clear that the only feasible decision is to retain the four-period old asset for one period such that f(1) = −$1,989. For the second stage,
there are two feasible decisions to evaluate, such that
$$f(2) = \min \left\{ 0.893\,(\$27{,}679) - \$1{,}989,\; \$17{,}868 + \$0 \right\} = \$17{,}868.$$
The first decision evaluates using the new asset for one period, assuming (from
stage 1) that the four-period old asset is retained for one period. The second
decision assumes the four-period old asset is retired immediately and a new asset is
used for two periods. This process moves forward in time, increasing the value of T
in each step. The process stops when, in this case, five consecutive solutions (with
increasing T) result in the same time zero decision.
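A sketch of this forward procedure for the running example (exact discount factor 1/1.12); the data and the keep/replace labelling of the time-zero arcs follow Examples 12.3 and 12.4.

```python
P = 50_000
OM = [10_000, 12_000, 14_400, 17_280, 20_736]
SV = [50_000, 35_000, 31_500, 28_350, 25_515, 22_964]
N = 5
v = 1 / 1.12

def c_new(n):
    """NPV cycle cost for a new asset kept n periods, then salvaged."""
    return P + sum(v**j * OM[j - 1] for j in range(1, n + 1)) - v**n * SV[n]

def first_decision(T):
    """Solve f(t) forward for horizon T and return the time-zero decision
    on the optimal path: 'keep' the four-period old asset or 'replace' it."""
    f, best = {0: 0.0}, {}
    for t in range(1, T + 1):
        opts = []
        for n in range(1, min(N, t) + 1):
            s = t - n                              # last asset bought at time s
            if s == 0:                             # arcs leaving node 0
                opts.append((c_new(n) - SV[4], n, 'replace'))
                if n == 1:                         # keep the old asset (age 4 -> 5)
                    opts.append((v * (OM[4] - SV[5]), n, 'keep'))
            else:
                opts.append((f[s] + v**s * c_new(n), n, 'replace'))
        f[t], n_star, label = min(opts)
        best[t] = (n_star, label)
    t = T                                          # backtrack to node 0
    while True:
        n, label = best[t]
        if t == n:
            return label
        t -= n

run, T = [], 0                                     # Bean et al. stopping rule
while len(run) < N:
    T += 1
    d = first_decision(T)
    run = run + [d] if (not run or run[-1] == d) else [d]
print(T, run[-1])  # the time-zero decision stabilises at 'replace'
```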
12.5.5 Modeling Complex Systems
The presented dynamic programming algorithms are designed for single asset
systems. More complex systems are obviously defined by multiple assets which are
not independent; otherwise the presented models would be sufficient. The most straightforward case is where all assets operate in parallel, such as a fleet.
Jones et al. (1991) offered the first dynamic programming recursion for the
parallel machine replacement model, which can be used to analyze fleet replacement decisions. Machines are assumed to operate in parallel and thus the capacity
of the system is equal to the sum of the individual asset capacities. In addition to
defining the capacity of the system, the assets are often linked economically. Jones
et al. focused on the assumption that a fixed cost would be charged in any period in
which a replacement occurs (in addition to the typical per unit charges for each
asset replaced). This provides an incentive to replace multiple assets together over
time so as to reduce the number of times the fixed charge is incurred over some
horizon.
To model replacement decisions for this system, the state of the system is
defined as the number of assets aged 1 through N, represented as a vector, [m1, m2, ..., mN].

[Figure 12.6. State space for the parallel replacement problem; states such as [3,2,1] count the six assets by age (maximum age 3), with T = 3.]
Define n as the decision of what minimum aged assets are to be replaced for a
given state at time t. That is, all assets of age n and older are replaced while the
remaining assets are retained. We can model the recursion in general as follows:
$$f_t(m_1, \ldots, m_N) = \min_{n} \Bigl\{ \sum_{j=n}^{N} m_j \bigl[ P_t - S_t(j) + C_{t+1}(1) \bigr] + \sum_{j=1}^{n-1} m_j\, C_{t+1}(j+1) + K_t\, \delta_n + v\, f_{t+1}(m') \Bigr\} \qquad (12.15)$$

where $\delta_n = 1$ if at least one asset is replaced and 0 otherwise, and $m'$ is the resulting state, with $\sum_{j=n}^{N} m_j$ assets of age 1 and each retained cohort one period older.
Examining the recursion, a purchase price is paid and a salvage value is received
for all assets that are replaced. All of the newly purchased assets (the total number
of assets is the sum of mn+mn+1++ mN) incur the O&M cost of a new asset while
the O&M costs of the retained assets are incurred according to their age. A fixed
charge Kt is paid if at least one group of assets is replaced (n>1), captured by the
indicator function. The resulting state is a group of new assets (age 1) with all other
assets incrementing one period in age.
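A memoised sketch with the structure of Equation 12.15, on a deliberately tiny instance. Every number here (prices, salvage values, O&M costs, fixed charge, discount factor) is invented for illustration, and assets are assumed to be salvaged at the horizon.

```python
from functools import lru_cache

N, T = 3, 3                        # hypothetical: 3 age classes, 3-period horizon
P, K = 100.0, 40.0                 # unit purchase price, fixed replacement charge
S = {1: 50.0, 2: 30.0, 3: 15.0}    # salvage value by age (hypothetical)
om = {1: 10.0, 2: 20.0, 3: 35.0}   # O&M for the period in which an asset reaches this age
v = 0.9

@lru_cache(maxsize=None)
def f(t, m):
    """Minimum cost from period t with m = (m1, ..., mN) assets counted by age."""
    if t == T:
        return -sum(m[j - 1] * S[j] for j in range(1, N + 1))  # salvage everything
    best = float('inf')
    for n in range(1, N + 2):      # replace all assets of age n and older
        if n == N + 1 and m[N - 1] > 0:
            continue               # assets at the maximum age must be replaced
        repl = sum(m[j - 1] for j in range(n, N + 1))
        cost = sum(m[j - 1] * (P - S[j]) for j in range(n, N + 1))
        cost += repl * om[1]       # O&M of the newly purchased assets
        cost += sum(m[j - 1] * om[j + 1] for j in range(1, n) if m[j - 1])
        cost += K if repl > 0 else 0.0   # fixed charge if anything is replaced
        nxt = tuple(([repl] + [m[j - 1] for j in range(1, n)] + [0] * N)[:N])
        best = min(best, cost + v * f(t + 1, nxt))
    return best

print(round(f(0, (0, 2, 1)), 1))   # minimum cost for the initial mixed-age group
```

The state and decision spaces here are enumerated in full, which is exactly the exponential growth discussed below; the Jones et al. theorems restrict the decisions that need to be examined.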
A number of extensions to this model have been published in the literature,
although many utilize integer programming modeling techniques to deal with the
large state space. Chand et al. (2000) focus on the use of dynamic programming and
include capacity expansion decisions with the replacement decisions. Unfortunately,
capital budgeting constraints greatly complicate the problem as it cannot be assumed
that groups of assets must be kept or replaced together.
While the theorems presented in Jones et al. (1991) greatly reduce the computational difficulties of solving the dynamic program for the parallel replacement
problem, it should be clear that using dynamic programming to address replacement
decisions for more complex systems may be difficult due to computational complexities that arise due to the number of combinations of replacement alternatives.
(See Hartman and Ban (2002) and the references therein for a discussion of these
issues.)
Consider a more complex system in which a number of machines are used in
series (such as a production line) and there are a number of lines in parallel, such
as the one given in Figure 12.7. The lines are labeled 1, 2 and 3 while the machines
are labeled a, b, c, and d.
The capacity of a line is now defined by the machine in the line with the minimum capacity. However, the capacity of the system is raised due to the parallel
design. The capacity of the system is defined by the sum of the capacity of each
line. Therefore, it is defined by the sum of the minimum capacity asset in each line.
Reliability is measured similarly to a capacity, in that it is reduced by the series
structure but increased with parallel (redundant) structure. For a given series, the
reliability of the line is equal to the product of the reliability of each individual
asset. That is because if one asset is down, the line is down. The reliability of the
system, assuming only one line must be up and running, is increased as the system
is operating even if two of the three lines are down.
If one defines minimum system capacity or reliability constraints, these can be
incorporated into a dynamic programming recursion that evaluates the possibility of
replacing any combination of assets in each period over some horizon. Presumably,
newer assets would have higher capacity or reliability, either due to technological
change or due to the fact that they are new (and have not deteriorated), and thus
would increase the respective capacity or reliability of the system (in order to meet
the defined constraints).
The difficulty with using a dynamic programming recursion to evaluate these
decisions is not in capturing the capacity or reliability constraints. Rather, the
difficulty is in the exponential growth in the number of possible combinations of
replacements in each period. Consider the 12 assets shown in Figure 12.7. In the
most general problem, each asset and each combination of assets can be replaced in
each period, totaling 2^12 combinations each period for each state of the system.
This system could easily become more complicated, merely by defining a, b, c, and
d as processes, each of which may have a number of assets in parallel (or in series).
In the parallel machine replacement problem, a similar problem was encountered, but the number of possible decisions was reduced to N (the maximum allowable age for an asset) for each possible state in each period with the two theorems
introduced by Jones et al., without sacrificing optimality. Unfortunately, the interaction of the assets may prohibit the application of these theorems to other systems.
In fact, defining the state of the system is not entirely clear.
For the system described in Figure 12.7, we could define the state of the system as
a matrix of asset ages. Each row would be defined by the age of each machine in a given
line, with a row defined for each line. If an asset is replaced, its age resets
to 1 in the next stage, while it merely increments by one period if the
machine is retained. This modeling approach could be extended to the case of
multiple machines in a given process by expanding the size of the matrix.
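As a minimal illustration of this state representation and its transition (the function name and flat age-vector encoding are our own, not from the chapter), the successor states of a vector of asset ages can be enumerated as follows:

```python
from itertools import product

def next_states(ages):
    """Enumerate successor age-states for a collection of assets.

    ages: sequence of current asset ages (the age matrix, flattened).
    Keeping an asset increments its age by one period; replacing it
    resets the age to 1.  With n assets there are 2**n keep/replace
    decisions, which is the combinatorial explosion discussed above.
    """
    n = len(ages)
    for decision in product((False, True), repeat=n):  # True = replace
        successor = tuple(1 if replace else age + 1
                          for replace, age in zip(decision, ages))
        yield decision, successor

# A small 4-asset system already has 2**4 = 16 successors per state;
# the 12-asset system of Figure 12.7 has 2**12 = 4096.
states = dict(next_states([3, 5, 2, 4]))
assert len(states) == 16
```

Enumerating all successors like this is only feasible for small systems; restricting which decisions are examined, as discussed next, is what makes larger instances tractable.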
Again, the difficulty lies in restricting the number of decisions to evaluate
for each state in a given period. Following the approach of Jones et al. (1991),
older assets would be replaced first (possibly further restricted to assets above
a certain age) and similarly aged assets of the same type would be
replaced in the same time period. Another approach would be to consider
replacing only assets whose replacement increases the system capacity or reliability; replacements
could then be examined in order of either increasing capacity or increasing reliability. Whether these heuristic approaches provide a good solution for a given
problem instance would require extensive numerical testing.
12.7 References
Apeland, S. and Scarf, P.A. (2003) A fully subjective approach to capital equipment
replacement. Journal of the Operational Research Society 54, 371–378.
Arnold, G. (2006) Essentials of Corporate Financial Management. Pearson, London.
Baker, R.D. and Scarf, P.A. (1995) Can models fitted to small data samples lead to
maintenance policies with near-optimal cost? IMA Journal of Mathematics Applied in
Business and Industry 6, 3–12.
Bean, J.C., Lohmann, J.R. and Smith, R.L. (1985) A dynamic infinite horizon replacement
economy decision model. The Engineering Economist 30, 99–120.
Bean, J.C., Lohmann, J.R. and Smith, R.L. (1994) Equipment replacement under
technological change. Naval Research Logistics 41, 117–128.
Bellman, R.E. (1955) Equipment replacement policy. Journal of the Society for
Industrial and Applied Mathematics 3, 133–136.
Booth, P. and King, P. (1998) The relationship between finance and actuarial science. In
Hand, D.J. and Jacka, S.D. (Eds), Statistics in Finance, Arnold, London, pp. 7–40.
Bowe, M. and Lee, D.L. (2004) Project evaluation in the presence of multiple embedded
real options: evidence from the Taiwan High-Speed Rail Project. Journal of Asian
Economics 15, 71–98.
Brint, A.T., Hodgkins, W.R., Rigler, D.M. and Smith, S.A. (1998) Evaluating strategies for
reliable distribution. IEEE Computer Applications in Power 11, 43–47.
Chand, S., McClurg, T. and Ward, J. (2000) A model for parallel machine replacement with
capacity expansion. European Journal of Operational Research 121, 519–531.
Christer, A.H. (1984) Operational research applied to industrial maintenance and
replacement. In Eglese, R.W. and Rand, G.K. (Eds), Developments in Operational
Research (pp. 31–58). Pergamon Press, Oxford.
Christer, A.H. (1988) Determining economic replacement ages of equipment incorporating
technological developments. In Rand, G.K. (Ed.), Operational Research '87 (pp. 343–
354). Elsevier, Amsterdam.
Christer, A.H. and Scarf, P.A. (1994) A robust replacement model with applications to
medical equipment. Journal of the Operational Research Society 45, 261–275.
Derman, C. (1963) Inspection-maintenance-replacement schedules under Markovian
deterioration. In Mathematical Optimization Techniques, University of California Press,
Berkeley, CA, pp. 201–210.
Dixit, A.K. and Pindyck, R.S. (1994) Investment Under Uncertainty. Princeton University
Press, New Jersey.
Eilon, S., King, J.R. and Hutchinson, D.E. (1966) A study in equipment replacement.
Operational Research Quarterly 17, 59–71.
Elton, D.J. and Gruber, M.J. (1976) On the optimality of an equal life policy for equipment
subject to technological change. Operational Research Quarterly 22, 93–99.
Goldstein, M. and O'Hagan, A. (1996) Bayes linear sufficiency and systems of expert
posterior assessments. Journal of the Royal Statistical Society Series B 58, 301–316.
Hartman, J.C. (1999) A general procedure for incorporating asset utilization decisions
into replacement analysis. The Engineering Economist 44(3), 217–238.
Hartman, J.C. (2001) An economic replacement model with probabilistic asset utilization.
IIE Transactions 33, 717–729.
Hartman, J.C. (2004) Multiple asset replacement analysis under variable utilization and
stochastic demand. European Journal of Operational Research 59, 145–165.
Hartman, J.C. and Ban, J. (2002) The series-parallel replacement problem. Robotics and
Computer Integrated Manufacturing 18, 215–221.
Hartman, J.C. and Hartman, R.V. (2001) After-tax replacement analysis. The Engineering
Economist 46, 181–204.
Hartman, J.C. and Murphy, A. (2006) Finite horizon equipment replacement analysis. IIE
Transactions 38, 409–419.
Hartman, J.C. and Rogers, J.L. (2006) Dynamic programming approaches for equipment
replacement problems with continuous and discontinuous technological change. IMA
Journal of Management Mathematics 17, 143–158.
Hopp, W.J. and Nair, S.K. (1991) Timing replacement decisions under discontinuous
technological change. Naval Research Logistics 38, 203–220.
Hopp, W.J. and Nair, S.K. (1994) Markovian deterioration and technological change. IIE
Transactions 26, 74–82.
Jones, P.C., Zydiak, J.L. and Hopp, W.J. (1991) Parallel machine replacement. Naval
Research Logistics 38, 351–365.
Karabakal, N., Lohmann, J.R. and Bean, J.C. (1994) Parallel replacement under capital
rationing constraints. Management Science 40, 305–319.
Kobbacy, K. and Nicol, D. (1994) Sensitivity of rent replacement models. International
Journal of Production Economics 36, 267–279.
Markowitz, H.M. (1952) Portfolio selection. Journal of Finance 7, 77–91.
Northcott, D. (1985) Capital Investment Decision Making. Dryden Press, London.
Oakford, R.V., Lohmann, J.R. and Salazar, A. (1984) A dynamic replacement economy
decision model. IIE Transactions 16, 65–72.
319
13
Maintenance and Production: A Review
of Planning Models
Gabriella Budai, Rommert Dekker and Robin P. Nicolai
13.1 Introduction
Maintenance is the set of activities carried out to keep a system in a condition
in which it can perform its function. Quite often these systems are production systems
whose outputs are products and/or services. Some maintenance can be done
during production and some can be done during regular production stops in
evenings, weekends and holidays. However, in many cases production units
need to be shut down for maintenance. This may lead to tension between the
production and maintenance departments of a company. On the one hand, the production
department needs maintenance for the long-term well-being of its equipment;
on the other hand, maintenance shuts down operations and causes loss of production. It
will be clear that both can benefit from decision support based on mathematical
models.
In this chapter we give an overview of mathematical models that consider the
relation between maintenance and production. The relation exists in several ways.
First of all, when planning maintenance one needs to take production into account.
Second, maintenance can itself be seen as a production process which needs to be
planned. Finally, one can develop integrated models for maintenance and production. Apart from giving a general overview of models, we will also discuss some
sectors in which the interactions between maintenance and production have been
studied.
Many review articles have been written on maintenance, e.g. Cho and Parlar
(1991), but to our knowledge only one on the combination of maintenance
and production, Ben-Daya and Rahim (2001). This review differs from that one in
several respects. First of all, we also consider models which take production restrictions into account, rather than only integrated models. Second, we discuss some specific
sectors. Finally, we discuss the more recent articles published since that review.
Maintenance is related to production in several ways. First of all, maintenance is
intended to enable production, yet to execute maintenance, production often has to be
stopped. This negative effect therefore has to be considered in maintenance planning
and optimization. It comes specifically to the fore in the costing of downtime and
in opportunity maintenance. All articles taking the effect of production on maintenance explicitly into account fall into this category.
Second, maintenance can also be seen as a production process which needs to be
planned. Planning in this respect implies determining appropriate levels of capacity
(e.g. manpower) in relation to the demand.
Third, we are concerned with production planning in which one needs to take
maintenance jobs into account. The point is that the maintenance jobs take production capacity away and hence they need to be planned together with production.
Maintenance has to be done either because of a failure or because the quality of the
produced items is not high enough. In this third category we also consider the
integrated planning of production and maintenance.
The relation between maintenance and production is also determined by the
business sector. We consider the following sectors: railways, road, airlines and
electrical power system maintenance.
The outline of the rest of this chapter is as follows. In Section 13.2 we
present an overview of the main elements of maintenance planning as these are
essential to understand the rest of this chapter. Following our classification scheme,
in Section 13.3 we review articles in which maintenance is modelled explicitly and
where the needs of production are taken into account. Since these needs differ
between business sectors, we discuss in Section 13.4 the relation between production and maintenance for some specific business sectors. In Section 13.5 we
consider the second category in our classification scheme: maintenance as a production process which needs to be planned. In Section 13.6 we are concerned with
production planning in which one needs to take maintenance jobs into account
(integrated production and maintenance planning). Trends and open research areas
are discussed in Section 13.7 and, finally, conclusions are drawn in Section 13.8.
the subsystems, what information is available and what elements can be easily
replaced. These are typical maintainability aspects, but they have little to do with
production.
In the tactical phase, with a horizon usually between a month and a year, one plans the major
maintenance/upgrade of major units, and this has to be done in cooperation with the
production department. Accordingly, specific decision support is needed in this
respect. Another tactical problem concerns the capacity of the maintenance crew.
Is there enough manpower to carry out the preventive maintenance program?
These questions can be addressed by use of models as will be indicated later on.
In the short term scheduling phase one determines the moment and order of
execution, given an amount of outstanding corrective or preventive work. This is
typically the domain of work scheduling where extensive model-based support can
be given.
We will next consider another important aspect of maintenance, namely the
type of maintenance. A typical distinction is made between corrective and preventive maintenance work. The first is carried out after a failure, which is defined as
the event by which a system stops functioning in a prescribed way. Preventive
work, however, is carried out to prevent failures. Although this distinction is often
made, we would like to remark that the difference is not as clear as it may seem. This is
due to the definition of failure: an item may be in a bad state while still functioning, and one may or may not consider this a failure. In any case, an important
distinction between the two is that corrective maintenance is usually not plannable,
whereas preventive maintenance typically is.
The execution of maintenance can also be triggered by condition measurements,
in which case we speak of condition-based maintenance. This has often been advocated
as more effective and efficient than time-based preventive maintenance. Yet it is
very hard to predict failures well in advance, and hence condition-based maintenance is often unplannable. Instead of calendar time, one can also base
preventive maintenance on utilisation (run hours, mileage) as a more appropriate indicator of wear-out.
Finally, one may also have inspections, which can be done by sight or with instruments
and often do not affect operation. They do not improve the state of a system, however,
but only the information about it. This can be important in case machines start
producing items of bad quality. There are inspection-quality problems where inspection optimization is connected to quality control.
Another distinction concerns the amount of work. Often there are small jobs,
grouped into maintenance packages. They may start with inspection and cleaning, followed
by improvement actions such as lubricating or replacing some parts. These
are typically part of the preventive maintenance program attached to a system and
have to be done on a repetitive basis (monthly, quarterly, yearly or two-yearly).
Next, one has replacements of parts or subsystems, and overhauls or refurbishments
in which a substantial part of a system is improved. The latter are planned well in advance and
carried out as projects with individual (or separate) budgets.
A traditional optimization problem has been the choice and trade-off between
preventive and corrective maintenance. The typical motivation is that preventive
maintenance is cheaper than corrective maintenance. Maintenance costs are usually due to man-hours,
materials and indirect costs. The difference between corrective and preventive
maintenance costs lies especially in the latter category: indirect costs represent loss of
production and environmental damage or safety consequences. Costing these
consequences can be a difficult problem and is tackled in Section 13.3.1. It will
also be clear that preventive maintenance should be done when production is least
affected. This can be achieved by using opportunities, which has given rise to a specific
class of models dealt with in a separate section (Section 13.3.2).
Failures, and hence repairs, of other units/components. The failure of one component is often an opportunity to preventively maintain other components.
Especially if the failure causes the breakdown of the production system, it is
favourable to perform preventive maintenance on other components: after
all, little or no production is lost beyond that resulting from the original failure.
An example is given in Van der Duyn Schouten et al. (1998), who consider
the replacement of traffic lights at an intersection.
Other interruptions of production. Production processes are not only interrupted by failures or repairs. Several outside events may create an opportunity as well: market interruptions, or other work for which
production needs to be stopped (e.g. replacing catalysts), are also
opportunities at which preventive maintenance can be carried out.
According to the foregoing discussion there are two approaches to opportunities. The first models a whole multi-component system in which, upon a failure,
preventive maintenance can be carried out on other components as well. In the
second stream the opportunities are modelled as outside events at which one may
do maintenance. In the simplest form one considers a single component whose maintenance may be done at opportunities, possibly also at a forced shutdown.
Bäckert and Rippin (1985) consider the first type of opportunistic maintenance
for plants subject to breakdowns. In this article three methods are proposed to solve
the problem. In the first two cases the problem is formulated as a stochastic
decision tree and solved using a modified branch and bound procedure. In the third
case the problem is formulated as a Markov decision process. The planning period
is discretised, resulting in a finite state space to which a dynamic programming
procedure can be applied.
In Wijnmalen and Hontelez (1997) a multi-component system is considered
where failures of one component may create an opportunity, but the opportunity
process is approximated by an independent process with the same mean rate. In
this way they circumvent the problem of dimensionality which appears in the study
of Bäckert and Rippin (1985).
There are several articles considering the other stream. Tan and Kramer (1997)
propose a general framework for preventive maintenance optimization in chemical
process operations. The authors combine Monte Carlo simulation with a genetic
algorithm. Opportunities are the failure of other components.
In Dekker and Dijkstra (1992) and Dekker and Smeitink (1991) it is assumed
that the opportunity-generating process is completely independent of the failure
process and is modelled as a renewal process. Dekker and Smeitink (1994)
consider multi-component maintenance at opportunities of restricted duration and
determine priorities of what preventive maintenance to do at an opportunity.
326
In Dekker and Van Rijn (1996) a decision-support system (PROMPT) for opportunity-based preventive maintenance is discussed. PROMPT was developed to
take care of the random occurrence of opportunities of restricted duration. Here,
opportunities are not only failures of other components, but also preventive maintenance on (essential) components. Many of the techniques developed in the
articles of Dekker and Smeitink (1991), Dekker and Dijkstra (1992) and Dekker
and Smeitink (1994) are implemented in the decision-support system. In PROMPT
preventive maintenance is split up into packages. For each package an optimum
policy is determined, which indicates when it should be carried out at an opportunity. From the separate policies a priority measure is derived that determines which maintenance package should be executed at a given opportunity.
In Dekker et al. (1998b) the maintenance of light-standards is studied. A light-standard consists of n independent and identical lamps screwed on a lamp assembly.
To guarantee a minimum luminance, the lamps are replaced if the number of failed
lamps reaches a prespecified number m. In order to replace the lamps the assembly
has to be lowered. As a consequence, each failure is an opportunity to combine
corrective and preventive maintenance. Several opportunistic age-based variants of
the m-failure group replacement policy (in its original form only corrective maintenance is grouped) are considered. Simulation optimization is used to determine the
optimal opportunistic age threshold.
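A discrete-time Monte Carlo sketch of such an opportunistic m-failure group policy is given below. All names, the simple wear-out law and the parameter values are illustrative and not taken from Dekker et al. (1998b):

```python
import random

def simulate(n, m, age_threshold, periods, failure_prob, seed=1):
    """Simulate an opportunistic m-failure group replacement policy.

    n identical lamps; in each period a working lamp fails with a
    probability that grows with its age (a crude wear-out model).
    When m lamps have failed, the assembly is lowered: all failed
    lamps are replaced correctively, and working lamps older than
    age_threshold are replaced preventively at the same opportunity.
    Returns (number of assembly visits, number of lamps replaced).
    """
    rng = random.Random(seed)
    ages = [0] * n            # age of each lamp, in periods
    failed = [False] * n
    visits = replaced = 0
    for _ in range(periods):
        for i in range(n):
            if not failed[i]:
                ages[i] += 1
                if rng.random() < failure_prob * ages[i]:
                    failed[i] = True
        if sum(failed) >= m:                    # lower the assembly
            visits += 1
            for i in range(n):
                if failed[i] or ages[i] > age_threshold:
                    failed[i], ages[i] = False, 0
                    replaced += 1
    return visits, replaced
```

Lowering `age_threshold` trades more preventive replacements for fewer assembly visits; simulation optimization, as in the article, would search over this threshold for the cost-minimizing value.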
Dagpunar (1996) introduces a maintenance model where replacement of a component within a system is possible when some other part of the system fails, at a
cost of c2. The opportunity process is Poisson. A component is replaced at an
opportunity if its age exceeds a specified control limit t. Upon failure a component
is replaced at cost c4 if its age exceeds a specified control limit x, otherwise it is
minimally repaired at cost c1. In case of a minimal repair the age and failure rate of
the component after the repair is as it was immediately before failure. There is also
a possibility of a preventive or interrupt replacement at cost c3 if the component
is still functioning at a specified age T. A procedure to optimise the control limits t
and T is given in Dekker and Plasmeijer (2001).
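The control-limit structure of this policy can be encoded as a simple decision rule (an illustrative sketch; the function and event names are our own, while the cost labels c1–c4 follow the text):

```python
def action(event, age, t, x, T):
    """Opportunity-based replacement rule with control limits t, x, T.

    event: "opportunity" (some other part failed; Poisson process),
           "failure"     (the component itself failed), or
           "age_T"       (the component reaches the preventive age T).
    Returns (action, cost label), with cost labels as in the text:
    c1 minimal repair, c2 opportunity replacement,
    c3 preventive/interrupt replacement, c4 failure replacement.
    """
    if event == "opportunity":
        return ("replace", "c2") if age > t else ("continue", None)
    if event == "failure":
        return ("replace", "c4") if age > x else ("minimal repair", "c1")
    if event == "age_T":
        return ("replace", "c3")
    raise ValueError(f"unknown event: {event}")
```

Note that a minimal repair leaves the age and failure rate unchanged, so the component re-enters the rule with the same `age`; only the control limits t and T are optimized in Dekker and Plasmeijer (2001).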
13.3.3 Maintenance Scheduling in Line with Production
Here we consider models where the effect of production on maintenance is explicitly taken into account. These models address only maintenance decisions; they
do not give advice on how to plan production.
The models developed in the articles in this category show that a good
maintenance plan, one that is integrated with the production plan, can result in
considerable cost savings. This integration with production is crucial because
production and maintenance have a direct relation. Any breakdown in machine
operation results in disruption of production and leads to additional costs due to
downtime, loss of production, decrease in productivity and quality, and inefficient
use of personnel, equipment and facilities. Below we review articles following this
stream of research in chronological order.
Dedopoulos and Shah (1995) consider the problem of determining the optimal
preventive maintenance policy parameters for individual items of equipment in
multipurpose plants. In order to formulate maintenance policies, the benefits of
maintenance, in the form of reduced failure rates, must be weighed against the
costs. The approach in this study first attempts to estimate the effect of the failure
rate of a piece of equipment on the overall performance/profitability of the plant.
An integrated production and maintenance planning problem is also solved to
determine the effects of PM on production. Finally, the results of these two
procedures are then utilized in a final optimization problem that uses the relationship between profitability and failure rate as well as the costs of different maintenance policies to select the appropriate maintenance policy.
Vatn et al. (1996) present an approach for identifying the optimal maintenance
schedule for the components of a production system. Safety, health and environment objectives, maintenance costs and costs of lost production are all taken into
consideration, and maintenance is thus optimized with respect to multiple objectives. The approach is flexible, as it can be carried out at various levels of detail,
e.g. adapted to available resources and to the management's willingness to give
detailed priorities with respect to objectives on safety vs. production loss.
Frost and Dechter (1998) formulate the scheduling of preventive maintenance of
power generating units within a power plant as a constraint satisfaction problem.
The purpose of determining a maintenance schedule is to fix the duration and sequence of outages of power generating units over a given time period,
while minimizing operating and maintenance costs over the planning period.
Vaurio (1999) develops unavailability and cost rate functions for components
whose failures can occur randomly. Failures can only be detected through periodic
testing or inspections. If a failure occurs between consecutive inspections, the unit
remains failed until the next inspection. Components are renewed by preventive
maintenance periodically, or by repair or replacement after a failure, whichever
occurs first (age-replacement). The model takes into account finite repair and
maintenance durations as well as costs due to testing, repair, maintenance and lost
production or accidents. For normally operating units the time-related penalty is
loss of production. For standby safety equipment it is the expected cost of an
accident that can happen when the component is down due to a dormant failure,
repair or maintenance. The objective is to minimize the total cost rate with respect
to the inspection and the replacement interval. General conditions and techniques
are developed for solving optimal test and maintenance intervals, with and without
constraints on the production loss or accident rate. Insights are gained into how the
optimal intervals depend on various cost parameters and reliability characteristics.
Van Dijkhuizen (2000) studies the problem of clustering preventive maintenance jobs in a multiple set-up multi-component production system. This article
has been reviewed in Chapter 11, which gives an overview of multi-component
maintenance models.
Cassady et al. (2001) introduce the concept of selective maintenance. Often
production systems are required to perform a sequence of operations with finite
breaks between each operation. The authors establish a mathematical programming
framework for assisting decision-makers in determining the optimal subset of maintenance activities to perform prior to beginning the next operation. This decision
making process is referred to as selective maintenance.
The article of Haghani and Shafahi (2002) deals with the problem of scheduling
bus maintenance activities. A mathematical programming approach to the problem
is proposed. This approach takes as input a given daily operating schedule for all
buses assigned to a depot along with available maintenance resources. Then a daily
inspection and maintenance schedule is designed for the buses that require
inspection, so as to minimize the interruptions in the daily bus-operating schedule,
maximize the reliability of the system, and efficiently utilize the maintenance
facilities.
Charles et al. (2003) examine the interaction effects of maintenance policies on
batch plant scheduling in a semiconductor wafer fabrication facility. The purpose
of the work is the improvement of the quality of maintenance department activities
by the implementation of optimized preventive maintenance (PM) strategies, and
comes within the scope of a total productive maintenance (TPM) strategy. The
production of semiconductor devices is carried out in a wafer fab; in this production environment, equipment breakdown or procedure drifting usually induces unscheduled production interruptions.
Cheung et al. (2004) consider a plant with several units of different types.
There are several shutdown periods for maintenance. The problem is to allocate
units to these periods in such a way that production is least affected. Maintenance
is not modelled in detail, but incorporated through frequency or period restrictions.
downtime required for maintenance. The main question is when to carry out
maintenance such that the inconvenience for the train operators, the disruption to
and from the scheduled trains, and the infrastructure possession time for maintenance
are minimized and the maintenance cost is as low as possible. For a more detailed
overview of techniques used in planning railway infrastructure maintenance we
refer to Dekker and Budai (2002) and Improverail (2002). In some articles (see,
e.g. Higgins 1998, Cheung et al. 1999 and Budai et al. 2006) the track possession
is modelled in between operations. This can be done for occasionally used tracks,
which is the case in Australia and some European countries. If tracks are used
frequently, one has to perform maintenance during nights, when the train traffic is
almost absent, or during weekends (with possible interruption of the train services),
when there are fewer disturbances for the passengers. In the first case one can either
make a cyclic static schedule, which is done by Den Hertog et al. (2005) and Van
Zante-de Fokkert (2001) for the Dutch situation, or a dynamic schedule with a
rolling horizon, which is done in Cheung et al. (1999). The latter schedule has to
be made regularly.
Some other articles deal with grouping railway maintenance activities to reduce
costs, downtime and inconvenience for the travellers and operators. Here we
mention the study of Budai et al. (2006) in which the preventive maintenance
scheduling problem is introduced. This problem arises in other public/private
sectors as well, since preventive maintenance of other technical systems (machines,
roads, airplanes, etc.) also comprises small routine works and large projects.
13.4.2 Road Maintenance
Road maintenance has many common characteristics with railway maintenance.
Failures are often indirect, in the sense that norms are surpassed, but there may not
be any consequences. The production function is indirect, but that does not mean
that it is not felt by many. Governments may define a cost penalty per vehicle-hour of
waiting caused by congestion due to road maintenance. As with
railway maintenance, one sees that work is shifted to nights, or a lot of work is combined into a large project about which the public is informed long before it is started.
The night work causes high logistics costs for maintenance, yet it is useful for
small repairs or patches.
Other similarities with railroads are the large number of identical parts (a road
is typically split up into lane sections of about 100 meters, for each of which information is stored). Vans
with complex road-analysing equipment are used to assess the road quality; for
railways, special trains with complex measuring equipment are used. Videos are
used in both cases. Next, both roads and rails have multiple failure modes. Furthermore, the assets to be maintained are spread out geographically, which results in
high logistics costs for maintenance. This is also true for airline and truck maintenance. Both road and rail need much maintenance, and as a result large budgets
need to be allocated for both.
Although several articles have been written on road maintenance, few take the
production or user consequences into account. We would like to mention Dekker et
al. (1998a), who compare two concepts for road maintenance: one with small
projects carried out during nights, and the other where large road segments (some
4 km) are overhauled in one stretch. In the latter case the traffic is diverted to other
lanes or the side of the road. It is shown that the latter is both advantageous for the
traffic and cheaper, provided the volume of traffic on the road is not too
high. Another interesting contribution is from Rose and Bennett (1992), who provide a model to locate and decide on the size (or capacity) of road maintenance
depots for corrective maintenance.
13.4.3 Airline Maintenance
Maintenance costs are a substantial part of an airline's costs; estimates are that
20% of total cost is due to maintenance. Maintenance is crucial because of safety
reasons and because of high downtime costs. Apart from a crash, the worst event
for an airline is an aircraft on ground (AOG) because of failures. Accordingly, a lot
of technology has been developed to facilitate maintenance. We would like to mention
in-flight diagnosis, so that quick actions can be taken on the ground, and a very high
level of modularity, so that failed components can easily be replaced. Yet in an
aircraft there is still a high level of time-based preventive maintenance rather than
condition-based maintenance. A plane has to undergo several checks, ranging from
an A check taking about an hour after each flight, to a monthly B check, a yearly C
check and a five-yearly D check, where it is completely overhauled and which can
take a month. The presence of the monthly check implies that planes cannot always
fly the same route, but need to be rotated on a regular basis. It also implies that
airlines need multiple units of a type in order to provide a consistent service.
Several studies have addressed the issue of fleet allocation and maintenance
scheduling. In the fleet allocation one decides which planes fly which route and at
which time. One would preferably make an allocation which remains fixed for a
whole year, but due to the regular maintenance checks this is not possible. Gopalan
and Talluri (1998) give an overview of mathematical models on this problem.
Moudani and Mora-Camino (2000) present a method to do both flight assignment
and maintenance scheduling of planes. It uses dynamic programming and heuristics. A case of a charter airline is considered. Sriram and Haghani (2003) also
consider the same problem. They solve it in two phases. Finally, Feo and Bard
(1989) consider the problem of maintenance base planning in relation to an airline's
fleet rotation, while Cohn and Barnhart (2003) consider the relation between crew
scheduling and key maintenance routing decisions.
In another line of research, Dijkstra et al. (1994) develop a model to assess
maintenance manpower scheduling and requirements for performing inspection checks (A type) between flight turnarounds. It appears that the workload is
quite peaked because many flights arrive more or less at the same time (so-called banks) in order to allow fast passenger transfers.
The same problem is also tackled by Yan et al. (2004). The articles in this line
of research consider in effect the production planning of maintenance, a topic also
addressed in Section 13.5.
As the last article in this category we would like to mention Cobb (1995) who
presents a simulation model to evaluate current maintenance system performance
or the positive effect of ad hoc operating decisions on maintenance turn times (i.e.
the time maintenance takes to carry out a check or to do a repair).
Christer (1997) consider the problem of manpower planning for hospital building
maintenance.
Another typical production planning problem is with respect to layout planning.
A case study for a maintenance tool room is described in Rosa and Feiring (1995).
The study by Rose and Bennett (1992), which was discussed in Section 13.4, also
falls into this category.
which determine both the optimal design as well as the production and maintenance plans simultaneously. In this framework, the basic process and system
reliability-maintainability characteristics are determined in the design phase with
the selection of system structure, components, etc. The remaining characteristics
are determined in the operation phase with the selection of appropriate operating
and maintenance policies. Therefore, the optimization of process system effectiveness depends on the simultaneous identification of optimal design, operation and maintenance policies, with their interactions properly accounted for. In Goel et
al. (2003) a reliability allocation model is coupled with the existing design, production, and maintenance optimization framework. The aim is to identify the optimal
size and initial reliability for each unit of equipment at the design stage. They
balance the additional design and maintenance costs with the benefits obtained due
to increased process availability.
13.6.2 EMQ Problems
In the classical economic manufacturing quantity (EMQ) model items are produced
at a constant rate p and the demand rate for the items is equal to d < p. The aim of
the model is to find the production uptime that minimizes the sum of the average inventory holding cost and the fixed setup (ordering) cost per unit time. This model is an extension of
the well known economic order quantity (EOQ) model, the difference being that in
the EOQ model orders are placed when there is no inventory. Note that the EMQ
model is also referred to as economic production quantity (EPQ) model.
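The optimal lot size in this classical model has the well known closed form Q* = sqrt(2Kd / (h(1 − d/p))), with production uptime Q*/p. Below is a minimal sketch, assuming a fixed setup cost K per production run and a holding cost h per unit per unit time (symbols not introduced in the text above):

```python
import math

def emq(d, p, K, h):
    """Classical EMQ/EPQ: return the optimal lot size Q* and the uptime Q*/p.

    d: demand rate, p: production rate (d < p),
    K: fixed setup (ordering) cost per run, h: holding cost per unit per time.
    """
    if not d < p:
        raise ValueError("EMQ requires demand rate d < production rate p")
    q = math.sqrt(2 * K * d / (h * (1 - d / p)))
    return q, q / p

def eoq(d, K, h):
    """Classical EOQ lot size: the p -> infinity limit of the EMQ model."""
    return math.sqrt(2 * K * d / h)
```

As the production rate p grows, the EMQ lot size tends to the EOQ value, which mirrors the relation between the two models described above.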
In the extensive literature on production and inventory problems, it is often
assumed that the production process does not fail, that it is not interrupted and that
it only produces items of acceptable quality. Unfortunately, in practice this is not
always the case. A production process can be interrupted due to a machine breakdown or because the quality of the produced items is not acceptable anymore. The
EMQ model has been extended to deal with these aspects and we thus divide the
literature on EMQ models into two categories. First, we consider EMQ problems
that take into account the quality aspects of the items produced. The second
category of EMQ models analyzes the effects of (stochastic machine) breakdowns
on the lot sizing decision.
13.6.2.1 EMQ Problems with Quality Aspects
One of the reasons why a production process is interrupted is the (lack of) quality
of the items produced. Obviously, items of inferior quality can only be sold at a
lower revenue or cannot be sold at all. Thus, the production of these items results
in a loss (or a lower profit) for the firm. This type of interruption is usually
modelled as follows. It is assumed that at the start of the production cycle the
production is in an in-control state, producing items of acceptable quality. After
some time the production process may then shift to an out-of-control state. In
this state a certain percentage of the items produced are defective or of substandard quality. The elapsed time for the process to be in the in-control state,
before the shift occurs, is a random variable. Once a shift to the out-of-control state
has occurred, it is assumed that the production process stays in that state unless it is
minimally repaired and put back into commission. Okamura et al. (2001) generalize the model of Srinivasan and Lee (1996) by assuming that both the demand and the production processes are continuous-time renewal counting processes. Furthermore, they suppose that machine breakdowns occur according to a non-homogeneous Poisson process. In Lee and Srinivasan (2001) the demand and production
rates are considered constant and a production run begins as soon as the inventory
drops to zero. If the facility fails during operation, it is assumed to be repaired, but
restoring the facility only to the condition it was in before the failure. Lee and
Srinivasan (2001) consider an (S, N) policy, where the control variable N specifies
the number of production cycles the machine should go through before it is set
aside for preventive maintenance overhaul, which restores the facility to its original
condition.
Recently, Lin and Gong (2006) determined the effect of breakdowns on the
decision of optimal production uptime for items subject to exponential deterioration under a no-resumption policy. Under this policy, a production run is executed
for a predetermined period of time provided that no machine breakdown has
occurred in this period. Otherwise, the production run is immediately aborted. The
inventories are built up gradually during the production uptime and a new production run starts only when all on-hand inventories are depleted. If a breakdown
occurs then corrective maintenance is carried out and this takes a fixed amount of
time. If the inventory build-up during the production uptime is not enough to meet
the demand during the entire period of the corrective maintenance, shortages (lost
sales) will occur. Maintenance restores the production system to the same initial
working conditions.
13.6.3 Deteriorating Production System with Buffer Capacity
In order to reduce the negative effect of a machine breakdown on the production
process, a buffer inventory may be built up during the production uptime (as it is
done in the EMQ model). The role of this buffer inventory is that if an unexpected
failure of the installation occurs then this inventory is used to satisfy the demand
during the period that corrective maintenance is carried out. One of the earliest
works on this subject is Van der Duyn Schouten and Vanneste (1995). In their
model the demand rate is constant and equal to d (units/time), and as long as the fixed buffer capacity (K) is not reached the installation operates at a constant rate of p units/time (p > d), the excess output being stored in the buffer. When the buffer is full, the installation reduces its speed from p to d. Upon failure corrective
maintenance starts and the installation becomes as good as new. It is possible to
perform preventive maintenance, which takes less time than repair and it also
brings the installation back to the as-good-as-new condition. The decision to start a preventive maintenance action is based not only on the condition of the installation, but also on the level of the buffer. The criterion is to minimize the average inventory level and the average number of backorders. Since the optimal policy is difficult to implement, the authors develop suboptimal (n, N, k) control-limit policies. Under this policy, if the buffer is full, preventive maintenance is undertaken at age n. If the buffer is not full but contains at least k items, preventive main-
A recent article by Kenne et al. (2006) considers the effects of both preventive maintenance policies and machine age on optimal safety stock levels. As the machine ages, larger safety stocks hedge against more frequent random failures. The objective of the study is to determine when to perform preventive maintenance on the machine and the level of safety stock to be maintained.
13.6.4 Production and Maintenance Rate Optimization
An integrated production and maintenance planning can also be made by
optimizing the production and maintenance rates of the machines under consideration. In this line of research we mention the work of Gharbi and Kenne (2000,
2005), Kenne and Boukas (2003) and Kenne et al. (2003). In these articles a
multiple-identical-machine manufacturing system with random breakdowns, repairs and preventive maintenance activities is studied. The objective of the control
problem is to find the production and the preventive maintenance rates of the machines so as to minimize the total cost of inventory/backlog, repair and preventive
maintenance.
13.6.5 Miscellaneous
Finally, we list some articles that deal with integrated maintenance and production planning but whose modelling approaches or problem settings differ from those in the categories discussed earlier. For instance, the
model presented in Ashayeri et al. (1996) deals with the scheduling of production
and preventive maintenance jobs on multiple production lines, where each line has
one bottleneck machine. The model indicates whether or not to produce a certain
item in a certain period on a certain production line.
In Kianfar (2005) the manufacturing system is composed of one machine that
produces a single product. The failure rate of the machine is a function of its age
and the demand of the manufacturing product is time-dependent. Its rate depends
on the level of advertisement of the product. The objective is to maximize the
expected discounted total profit of the firm over an infinite time horizon.
Sarper (1993) considers the following problem. Given a fixed repair/maintenance capacity, how many of each of the low demand large items (LDLIs) should
be started so that there are no incomplete jobs at the end of the production period?
The goal is to ensure that the portion of the total demand started will be completed
regardless of the amount by which some machines may stay idle due to insufficient
work. A mixed-integer model is presented to determine what portion of the
demand for each LDLI type should be rejected as lost sales so that the remaining
portion can be finished completely.
been published, with the majority dating from the 1990s and the new millennium. The most popular area in this review is also the oldest one, i.e. integrated models for maintenance and production. Many papers continue to appear in that area, and the models become more and more complex, with more decision parameters and more aspects.
The topics of opportunity maintenance and of scheduling maintenance in line with production have also been popular, though perhaps more in the past than today. We expected to find more studies on specific business sectors, but found many only for the airline sector. That sector seems to be the most popular, as it combines extensive interaction between maintenance and production with high costs. The interaction is present in the other sectors as well, so perhaps more papers will appear in the future. The remaining sections are interesting but small in terms of papers published.
In general, the demands on maintenance are increasing, as the public and companies are less willing to accept failures, poor-quality products or non-performance. At the same time, society's inventory of capital goods is growing and, in western societies, ageing. This is very much the case for roads, railways, electric power generation, transport, and aircraft. As there is continuous pressure on maintenance budgets, we foresee a need for research supporting maintenance and production decisions, also because decision support software is gaining in popularity and more data is becoming available electronically. A theory is therefore needed for such decision support systems. Since several case studies have taught us that practical problems have many complex aspects, there is a great need for more theory that can help us understand and improve complex maintenance decision-making.
13.8 Conclusions
In this chapter we have given an overview of planning models for production and
maintenance. These models are classified on the basis of the interactions between
maintenance and production. First, although maintenance is intended to enable production, production is often stopped during maintenance. The question arises when to do maintenance such that production is least affected. In order to answer this question planning models should take into account the needs of production.
These needs are business sector specific and thus applications of planning models
in different areas have been considered. In comparison with other specific sectors,
much work has been done on modelling maintenance for the airline sector. Second,
maintenance itself can also be seen as a production process which needs to be
planned. Models for maintenance production planning mainly address allocation
and manpower determination problems. Finally, maintenance also affects the production process since it takes capacity away. In production processes maintenance
is mostly initiated by machine failures or low quality items. Maintenance and
production should therefore be planned in an integrated way to deal with these
aspects. Indeed, integrated maintenance and production planning models determine
optimal lot sizes while taking into account failure and quality aspects. We observe continued attention to such models, which take more and more real-world aspects into account.
Although many articles have been written on the interaction between production and maintenance, a careful reader will detect several open issues in this review. The theory developed thus far is far from complete, and any real application is likely to reveal many more open issues.
13.9 Acknowledgements
The authors would like to thank Georgios Nenes, Sophia Panagiotidou, and the
editors for their helpful suggestions and comments.
13.10 References
Al-Zubaidi H, Christer A, (1997) Maintenance manpower modelling for a hospital building complex. European Journal of Operational Research 99:603–618
Ashayeri J, Teelen A, Selen W, (1996) A production and maintenance planning model for the process industry. International Journal of Production Research 34:3311–3326
Bäckert W, Rippin D, (1985) The determination of maintenance strategies for plants subject to breakdown. Computers and Chemical Engineering 9(2):113–126
Ben-Daya M, Makhdoum M, (1998) Integrated production and quality model under various preventive maintenance policies. Journal of the Operational Research Society 49(8):840–853
Ben-Daya M, Rahim M, (2001) Integrated production, quality & maintenance models: an overview. In: Rahim M, Ben-Daya M (eds), Integrated models in production planning, inventory, quality, and maintenance, Kluwer Academic Publishers, 3–28
Beng G, (1994) Telecommunications systems maintenance. Computers and Operations Research 21:337–351
Budai G, Huisman D, Dekker R, (2006) Scheduling preventive railway maintenance activities. Journal of the Operational Research Society 57:1035–1044
Cassady C, Pohl E, Murdock W, (2001) Selective maintenance modeling for industrial systems. Journal of Quality in Maintenance Engineering 7(2):104–117
Charles A, Floru I, Azzaro-Pantel C, Pibouleau L, Domenech S, (2003) Optimization of preventive maintenance strategies in a multipurpose batch plant: application to semiconductor manufacturing. Computers and Chemical Engineering 27:449–467
Chelbi A, Ait-Kadi D, (2004) Analysis of a production/inventory system with randomly failing production unit submitted to regular preventive maintenance. European Journal of Operational Research 156:712–718
Cheung B, Chow K, Hui L, Yong A, (1999) Railway track possession assignment using constraint satisfaction. Engineering Applications of AI 12(5):599–611
Cheung K, Hui C, Sakamoto H, Hirata K, O'Young L, (2004) Short-term site-wide maintenance scheduling. Computers and Chemical Engineering 28:91–102
Cho D, Parlar M, (1991) A survey of maintenance models for multi-unit systems. European Journal of Operational Research 51:1–23
Chung K, (2003) Approximations to production lot sizing with machine breakdowns. Computers & Operations Research 30:1499–1507
Cobb R, (1995) Modeling aircraft repair turntime: simulation supports maintenance marketing. Journal of Air Transport Management 2:25–32
Goel H, Grievink J, Weijnen M, (2003) Integrated optimal reliable design, production, and maintenance planning for multipurpose process plant. Computers and Chemical Engineering 27:1543–1555
Gopalan R, Talluri K, (1998) Mathematical models in airline schedule planning: a survey. Annals of Operations Research 76(1):155–185
Groenevelt H, Pintelon L, Seidmann A, (1992a) Production batching with machine breakdowns and safety stocks. Operations Research 40(5):959–971
Groenevelt H, Pintelon L, Seidmann A, (1992b) Production lot sizing with machine breakdowns. Management Science 38(1):104–123
Haghani A, Shafahi Y, (2002) Bus maintenance systems and maintenance scheduling: model formulations and solutions. Transportation Research Part A 36:453–482
Higgins A, (1998) Scheduling of railway maintenance activities and crews. Journal of the Operational Research Society 49:1026–1033
Improverail (2002) http://www.tis.pt/proj/improverail/downloads/d6final.pdf (accessed September 26, 2006)
Iravani S, Duenyas I, (2002) Integrated maintenance and production control of a deteriorating production system. IIE Transactions 34:423–435
Kenne J, Boukas E, (2003) Hierarchical control of production and maintenance rates in manufacturing systems. Journal of Quality in Maintenance Engineering 9:66–82
Kenne J, Boukas E, Gharbi A, (2003) Control of production and corrective maintenance rates in a multiple-machine, multiple-product manufacturing system. Mathematical and Computer Modelling 38:351–365
Kenne J, Gharbi A, Beit M, (2006) Age-dependent production planning and maintenance strategies in unreliable manufacturing systems with lost sale. European Journal of Operational Research 178(2):408–420
Kianfar F, (2005) A numerical method to approximate optimal production and maintenance plan in a flexible manufacturing system. Applied Mathematics and Computation 170:924–940
Knight P, Jullian F, Jofre L, (2005) Assessing the size of the prize: developing business cases for maintenance improvement projects. Proceedings of the International Physical Asset Management Conference, 284–302
Kralj B, Petrovic R, (1988) Optimal preventive maintenance scheduling of thermal generating units in power systems: a survey of problem formulations and solution methods. European Journal of Operational Research 35:1–15
Kyriakidis E, Dimitrakos T, (2006) Optimal preventive maintenance of a production system with an intermediate buffer. European Journal of Operational Research 168:86–99
Lam K, Rahim M, (2002) A sensitivity analysis of an integrated model for joint determination of economic design of x̄-control charts, economic production quantity and production run length for a deteriorating production system. Quality and Reliability Engineering International 18:305–320
Langdon W, Treleaven P, (1997) Scheduling maintenance of electrical power transmission networks using genetic programming. In: Warwick K, Ekwue A, Aggarwal A, (eds), Artificial intelligence techniques in power systems, Institution of Electrical Engineers, Stevenage, UK, 220–237
Lee H, (2005) A cost/benefit model for investments in inventory and preventive maintenance in an imperfect production system. Computers and Industrial Engineering 48:55–68
Lee H, Rosenblatt M, (1987) Simultaneous determination of production cycle and inspection schedules in a production system. Management Science 33:1125–1137
Lee H, Rosenblatt M, (1989) A production and maintenance planning model with restoration cost dependent on detection delay. IIE Transactions 21(4):368–375
Van Zante-de Fokkert J, den Hertog D, van den Berg F, Verhoeven J, (2001) Safe track maintenance for the Dutch Railways, Part II: Maintenance schedule. Technical report, Tilburg University, the Netherlands
Vatn J, Hokstad P, Bodsberg L, (1996) An overall model for maintenance optimization. Reliability Engineering and System Safety 51:241–257
Vaurio J, (1999) Availability and cost functions for periodically inspected preventively maintained units. Reliability Engineering and System Safety 63:133–140
Wang C, (2006) Optimal production and maintenance policy for imperfect production systems. Naval Research Logistics 53:151–156
Wang C, Sheu S, (2003) Determining the optimal production-maintenance policy with inspection errors: using a Markov chain. Computers & Operations Research 30:1–17
Weinstein L, Chung C, (1999) Integrating maintenance and production decisions in a hierarchical production planning environment. Computers & Operations Research 26:1059–1074
Wijnmalen D, Hontelez A, (1997) Coordinated condition-based repair strategies for components of a multi-component maintenance system with discounts. European Journal of Operational Research 98:52–63
Yan S, Yang T, Chen H, (2004) Airline short-term maintenance manpower supply planning. Transportation Research Part A 38:615–642
Yao X, Xie X, Fu M, Marcus S, (2005) Optimal joint preventive maintenance and production policies. Naval Research Logistics 52:668–681
14
Delay Time Modelling
Wenbin Wang
14.1 Introduction
In this chapter we present a modelling tool that was created to model the problems
of inspection maintenance and planned maintenance interventions, namely delay
time modelling (DTM). This concept provides a modelling framework readily
applicable to a wide class of actual industrial maintenance problems of assets in
general, and inspection problems in particular.
The concept of the delay time was first mentioned by Christer (1976) in the context of building maintenance. It was not until 1984 that the concept was first applied to an industrial maintenance problem (Christer and Waller 1984). Since then, a series of research papers has appeared on the theory and applications of delay time modelling of industrial asset inspection problems; see Christer (1999) for a detailed review. The delay time concept itself is simple: it defines the failure process of an asset as a two-stage process. The first stage is the normal operating stage, from new to the point at which a hidden defect could first be identified. The second stage, the failure delay time, runs from this point of defect identification to failure. It is the existence of such a failure delay time that provides the opportunity for preventive maintenance to remove or rectify identified defects before failure. With appropriate modelling of the durations of these two stages, optimal inspection intervals can be identified to optimise a criterion function of interest.
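The two-stage idea can be illustrated with a small Monte Carlo sketch: a defect becomes identifiable at a random initial time u, fails a delay time h later, and the failure is prevented if some inspection epoch falls inside the window (u, u + h). The exponential distributions and their means below are purely illustrative assumptions:

```python
import math
import random

def prevented_fraction(T, mean_u=10.0, mean_h=5.0, n=100_000, seed=1):
    """Estimate the fraction of defects caught before failure when the item
    is inspected at times T, 2T, 3T, ... (perfect inspections assumed)."""
    rng = random.Random(seed)
    caught = 0
    for _ in range(n):
        u = rng.expovariate(1.0 / mean_u)        # initial point of the defect
        h = rng.expovariate(1.0 / mean_h)        # delay time of the defect
        first_inspection = math.ceil(u / T) * T  # first inspection at or after u
        caught += first_inspection < u + h       # falls inside (u, u + h)?
    return caught / n
```

Shorter inspection intervals catch a larger share of defects; the delay time models in this chapter quantify exactly this trade-off against the cost of inspecting.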
The delay time concept is similar in definition to the well known potential failure (PF) interval in reliability centred maintenance (Moubray 1997). Two differences between these definitions, however, mark a fundamental difference in modelling maintenance inspection of assets. First, the delay time is random in Christer's definition, while the PF interval is assumed to be constant. Second, the initial point of defect identification is very important to setting an appropriate inspection interval, but is ignored by Moubray. Moreover, Moubray did not provide any means of modelling the inspection practice, while DTM
example, the case for a single component will be considered in Section 14.3. The
interaction between inspection and equipment performance may be captured using
the delay time concept presented below.
Let the item of an asset be maintained on a breakdown basis. The time history
of breakdown or failure events is a random series of points; see Figure 14.1. For
any one of these failures, the likelihood is that, had the item been inspected at some point just prior to failure, the inspection could have revealed a defect which, though the item was still working, would ultimately lead to a failure. Signals of such a defect include excessive vibration, unusual noise, excessive heat, surface staining, smell, reduced output, increased quality variability, etc. The first instance where the presence of a defect might reasonably be expected to be recognised by an inspection, had it taken place, is called the initial point u of the defect, and the time h to failure from u is called the delay time of the defect; see Figure 14.2. Had an inspection taken place in (u, u + h), the presence of a defect could have been noted and corrective actions
taken prior to failure. Given that a defect arises, its delay time represents a window
of opportunity for preventing a failure. Clearly, the delay time h is a characteristic
of the item concerned, the type of defect, the nature of any inspection, and perhaps
the person inspecting. For example, if the item was a vehicle, and the maintenance
practice was to respond when the driver reported a problem, then there is in effect a
form of continuous monitoring inspection of cab related aspects of the vehicle,
with a reasonably long delay time consistent with the rate of deterioration of the
defect. However, should the exhaust collapse because a support bracket was
corroded through, the likely warning period for the driver, the delay time, would be
virtually zero, since he would not normally be expected to look under the vehicle.
At the same time, had an inspection been undertaken by a service mechanic, the
delay time may have been measured in weeks or months. Had the exhaust
collapsed because securing bolts became loose before falling out, then the driver
could have had a warning period of excessive vibration, and perhaps noise, and the
defects would have had a drive related delay time measured in days or weeks.
To see why the delay time concept is of use, consider Figure 14.3 incorporating
the same failure point pattern as Figure 14.1 along with the initial points associated
with each failure arising under a breakdown system. Had an inspection taken place
at point (A), one defect could have been identified and the seven failures could
have been reduced to six. Likewise, had inspection taken place at points (B) and
point (A), four defects could have been identified and the seven failures could have
been reduced to three. Figure 14.3 demonstrates that provided it is possible to model the way defects arise, that is, the rate of arrival of defects λ(u), and their associated delay time h, then the delay time concept can capture the relationship between the inspection frequency and the number of plant failures.
We are assuming for now that inspections are perfect, that is, a defect is recognised if, and only if, it is there, and is removed by corrective action. Delay time modelling is still possible if these assumptions are not valid; this more complex case is discussed in Section 14.3.2.
D(T) = (d_f E[N_f(T)] + d_s) / (T + d_s)   (14.1)

E[N_f(T)] = λ ∫_0^T F(t) dt   (14.2)

C(T) = (c_f E[N_f(T)] + c_s) / (T + d_s)   (14.3)

where λ is the rate of occurrence of defects, F the delay time distribution function, d_f and d_s the average downtimes per failure and per inspection, and c_f and c_s the corresponding costs.
The original model developed in Christer and Waller (1984) for Equation 14.2
uses a different approach, but leads to the same result.
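As an illustration, when the delay time is exponential, F(t) = 1 − e^(−βt), the integral in Equation 14.2 has a closed form and the downtime criterion of Equation 14.1 can be evaluated directly. The parameter values below are illustrative assumptions only:

```python
import math

lam, beta = 2.0, 0.03    # assumed defect arrival rate (per day) and delay-time rate
d_f = d_s = 30 / 1440    # assumed failure and inspection downtimes, in days (30 min)

def expected_failures(T):
    """Equation 14.2 with exponential delay time:
    E[N_f(T)] = lam * (T - (1 - exp(-beta*T)) / beta)."""
    return lam * (T - (1.0 - math.exp(-beta * T)) / beta)

def downtime(T):
    """Equation 14.1: expected downtime per unit time for inspection interval T."""
    return (d_f * expected_failures(T) + d_s) / (T + d_s)
```

A short interval T is dominated by the inspection downtime d_s, a long one by failures, so the criterion has an interior minimum.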
14.3.2 Imperfect Inspections
Section 14.3.1 outlined a basic delay time model under perfect inspections. It is established under a set of assumptions, some of which may not be valid in practical situations. These assumptions greatly simplify the mathematics involved, but also restrict wider use of the models developed. Perhaps the most restrictive assumption is that of perfect inspections: in almost all the case studies conducted using the delay time concept, we found none that supported the perfect inspection assumption. Another assumption of concern is the HPP for defect arrivals in the case of a complex system: one would naturally expect an ageing system to produce more defect arrivals than a younger one. In this section, we introduce one delay time model that relaxes the perfect inspection assumption. Delay time models using an NHPP are presented in Christer and Wang (1995) and Wang and Christer (2003). These models are mainly developed for complex systems, but a non-perfect inspection single-component delay time model can also be developed along similar lines (Baker and Wang 1991).
All the assumptions proposed in Section 14.3.1 will hold except the perfect inspection one. Assume for now that if a defect is present at an inspection, then there is a probability r that the defect will be identified, which implies a probability 1 − r that the defect will go unnoticed. Figure 14.4 depicts such a process.
Figure 14.4. Failure process of a multi-component system subject to three non-perfect inspections at points A, B, and C; two potential failures were removed and two missed
It has been proved that the failure process over each inspection interval is still an NHPP (Christer and Wang 1995), though not identical over the earlier inspection intervals of the system. It can be shown that as the number of inspections increases, the number of failures over each inspection interval becomes stable and identical, so we study the asymptotic behaviour of the failure process assuming the number of previous inspections is very large.
Let
i – the i-th inspection
U – random variable of the initial time u
r – probability of a perfect inspection
v_i(t) – ROCOF (rate of occurrence of failures) at time t, t ∈ [(i−1)T, iT)
E[N_f((i−1)T, iT)] – expected number of failures over [(i−1)T, iT)
E[N_s(iT)] – expected number of defects identified at iT
It can be shown (Christer et al. 1995; Christer and Wang 1995) that v_i(t) is given by

v_i(t) = λ { Σ_{n=1}^{i−1} (1 − r)^{i−n} [F(t − (n−1)T) − F(t − nT)] + F(t − (i−1)T) },  t ∈ [(i−1)T, iT)   (14.4)
The expected number of failures over [(i−1)T, iT) then follows as

E[N_f((i−1)T, iT)] = ∫_{(i−1)T}^{iT} v_i(t) dt   (14.5)
The expected number of defects identified at iT is

E[N_s(iT)] = λ r Σ_{n=1}^{i−1} (1 − r)^{i−n} ∫_{(n−1)T}^{nT} [1 − F(iT − u)] du + λ r ∫_{(i−1)T}^{iT} [1 − F(iT − u)] du   (14.6)
The expected downtime is given by Equation 14.1 with the expected number of failures given by Equation 14.5, so that

D(T) = (d_f E[N_f((i−1)T, iT)] + d_s) / (T + d_s)   (14.7)
The use of Equation 14.7 assumes that the system is already in a steady state with i → ∞. For computation purposes we can select a large i, and then let n start from the first k for which (1 − r)^{i−k} ≤ ε, where ε is a very small number.
Equation 14.7 is established assuming that the defects identified at an inspection will always be removed without any extra downtime or cost. This assumption can be relaxed. Let d_r be the mean downtime per defect repaired. Then, using the same approach as before, the expected downtime is given by

D(T) = (d_f E[N_f((i−1)T, iT)] + d_r E[N_s(iT)] + d_s) / (T + d_r E[N_s(iT)] + d_s)   (14.8)
If the objective function is the expected cost per unit time, we obtain it by simply substituting the downtime parameters in Equation 14.7 or 14.8 with the corresponding cost parameters.
Example 14.1 Assume that the rate of occurrence of defects is two per day, and the delay time distribution is exponential with scale parameter 0.03 measured in days. The downtime measures are d_f = 30 and d_s = 30 minutes respectively. The probability of a perfect inspection is assumed to be 0.7. Using Equations 14.5 and 14.7, we obtain the expected downtime against the inspection interval as shown in Figure 14.5, from which it can be seen that a weekly inspection interval is the best.
Figure 14.5. Expected downtime per unit time vs. inspection interval (in days)
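The downtime curve of Figure 14.5 can be reproduced numerically. The sketch below implements the imperfect-inspection ROCOF and Equations 14.5 and 14.7 with the parameters of Example 14.1; the constant defect arrival rate λ, the large value of i used to approximate the steady state, and the convention that a defect arising in the n-th interval has survived i − n inspections are all modelling assumptions of this sketch:

```python
import math

lam, beta, r = 2.0, 0.03, 0.7   # Example 14.1: defects/day, delay rate, P(spot)
d_f = d_s = 30 / 1440           # 30 min each, expressed in days

def F(h):
    """Exponential delay-time cdf, zero for non-positive argument."""
    return 1.0 - math.exp(-beta * h) if h > 0 else 0.0

def rocof(t, T, i):
    """Failure rate on [(i-1)T, iT): defects from interval n survived their
    i - n intervening inspections with probability (1-r)**(i-n), plus
    defects arising in the current interval (not yet inspected)."""
    s = sum((1 - r) ** (i - n) * (F(t - (n - 1) * T) - F(t - n * T))
            for n in range(1, i))
    return lam * (s + F(t - (i - 1) * T))

def expected_failures(T, i=60, m=400):
    """Trapezoidal integration of the ROCOF over [(i-1)T, iT] (Eq. 14.5)."""
    a, b = (i - 1) * T, i * T
    step = (b - a) / m
    s = 0.5 * (rocof(a, T, i) + rocof(b, T, i))
    s += sum(rocof(a + k * step, T, i) for k in range(1, m))
    return s * step

def downtime(T):
    """Expected downtime per unit time (Eq. 14.7), steady state via large i."""
    return (d_f * expected_failures(T) + d_s) / (T + d_s)
```

Evaluating downtime(T) over a grid of intervals traces a U-shaped curve like Figure 14.5; the exact minimiser depends on the indexing convention assumed here.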
Figure 14.6. Failure process of a multi-component system, with initial points and failure points marked
For the system in Figure 14.6, the system may be renewed at inspection points if these inspections are perfect and the rate of arrival of defects is constant. For the system in Figure 14.7, however, the system can be renewed either at a failure or at an inspection. We present the case under a perfect inspection assumption; an imperfect inspection delay time model for a single component can be found in Baker and Wang (1991, 1993).
We need the following additional assumptions and notation:
1. The system is renewed either at a failure repair or at a repair carried out at an inspection when a defect is identified.
2. After either a failure renewal or an inspection renewal the inspection process re-starts.
3. The initial time, U, to the appearance of a random defect has probability density function g(u).
4. A defective component identified at an inspection is renewed, either by repair or by replacement, at an average cost of c_r and downtime d_r.
14.4.1 Inspection Model Based on an Exponentially Distributed Initial Time
We first consider a simple case in which an inspection renews the system regardless of whether a defect was identified or not. This effectively assumes an exponential distribution for the initial time U.
Since each failure or inspection renews the system with associated downtimes or costs, the process is a renewal reward process, and the long-term expected cost per unit time, C(T), is given by (Ross 1983):

C(T) = E(CC) / E(CL),
where CC is the renewal cycle cost and CL is the renewal cycle length, that is, the interval between two consecutive renewals. There are two possible renewal cycles: one ends with a failure renewal and the other with an inspection renewal.
Taking the expected cost per renewal cycle as an example: since a failure costs c_f and occurs with probability P(X < T), the expected cost due to a failure renewal within T is

c_f P(X < T) = c_f ∫_0^T g(u) F(T − u) du,   (14.9)
the expected cost due to an inspection renewal with a defect identified within T is

(c_r + c_s) ∫_0^T g(u){1 − F(T − u)} du,   (14.10)

and finally the expected cost due to an inspection renewal without a defect being identified at T is given by

c_s P(U ≥ T) = c_s ∫_T^∞ g(u) du.   (14.11)
As to the expected cycle length, we model two possibilities. The first is that the cycle ends at a failure before T. Define p(t) as the density function of the time to failure, which is readily given by

p(t) = (d/dt) P(X ≤ t) = ∫_0^t g(u) f(t − u) du.   (14.12)

The expected cycle length is then

E(CL) = ∫_0^T t ∫_0^t g(u) f(t − u) du dt + T (1 − ∫_0^T g(u) F(T − u) du).   (14.13)
For the detailed derivation of Equations 14.9–14.13 see Baker and Wang (1991, 1993).
Finally, the expected cost per unit time is given by

C(T) = [c_f ∫_0^T g(u) F(T − u) du + (c_r + c_s) ∫_0^T g(u){1 − F(T − u)} du + c_s ∫_T^∞ g(u) du]
       / [∫_0^T t ∫_0^t g(u) f(t − u) du dt + T (1 − ∫_0^T g(u) F(T − u) du)].   (14.14)
Example 14.2 Assume both the initial time and delay time distributions are exponential, with scale parameters 0.6 and 0.75 respectively. The time unit is 100 days
and the cost parameter values are c f = 1000, cr = 150 and cs = 15 respectively.
Using Equation 14.14, the calculated expected cost per unit time as a function of T
is shown in Figure 14.8.
Figure 14.8. Expected cost per unit time vs. inspection interval
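Equation 14.14 is straightforward to evaluate numerically. The sketch below does so for the parameter values of Example 14.2 using simple trapezoidal quadrature; the quadrature scheme, grid and search range are implementation choices of this sketch, not of the original study:

```python
import math

# Example 14.2: exponential initial time (rate 0.6) and delay time (rate 0.75);
# time unit = 100 days; cost parameters c_f = 1000, c_r = 150, c_s = 15.
a, b = 0.6, 0.75
c_f, c_r, c_s = 1000.0, 150.0, 15.0

g = lambda u: a * math.exp(-a * u)      # initial time pdf
f = lambda h: b * math.exp(-b * h)      # delay time pdf
F = lambda h: 1.0 - math.exp(-b * h)    # delay time cdf

def trapz(fn, lo, hi, n=400):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / n
    return h * (0.5 * (fn(lo) + fn(hi)) + sum(fn(lo + i * h) for i in range(1, n)))

def C(T):
    """Expected cost per unit time, Equation 14.14."""
    I1 = trapz(lambda u: g(u) * F(T - u), 0.0, T)           # failure renewal probability
    I2 = trapz(lambda u: g(u) * (1.0 - F(T - u)), 0.0, T)   # defect found at inspection T
    I3 = math.exp(-a * T)                                   # no defect arisen by T
    num = c_f * I1 + (c_r + c_s) * I2 + c_s * I3
    p = lambda t: trapz(lambda u: g(u) * f(t - u), 0.0, t, n=100)  # failure time density
    cycle = trapz(lambda t: t * p(t), 0.0, T, n=100) + T * (1.0 - I1)
    return num / cycle

Ts = [0.1 * k for k in range(1, 24)]    # inspection intervals from 10 to 230 days
best = min(Ts, key=C)
print(round(best, 2), round(C(best), 1))
```

On this grid the minimum lies near T ≈ 0.3–0.4 (roughly 30–40 days), with a cost rate within the range plotted in Figure 14.8.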
The expected downtime due to a failure occurring in the i-th inspection interval is

((i − 1)d_s + d_f) ∫_{(i−1)T}^{iT} g(u) F(iT − u) du.   (14.15)

This is because inspections are perfect, so if a failure occurs at time X, the initial time U must be bounded within [(i − 1)T, X), X < iT. There are (i − 1) inspections with no defect identified before the failure, so (i − 1) times the inspection downtime d_s are added.
Equation 14.15 models only one of the possibilities; a failure can occur in any of the inspection intervals, so summing over all possible intervals i from 1 to infinity gives the expected downtime due to a failure:
Σ_{i=1}^∞ ((i − 1)d_s + d_f) ∫_{(i−1)T}^{iT} g(u) F(iT − u) du.   (14.16)
Equation 14.16 is always finite, since all the probability terms for large i tend to zero because g(u) tends to zero for u > (i − 1)T when i is large.
Similarly the expected downtime due to an inspection renewal with a defect
identified is
Σ_{i=1}^∞ ((i − 1)d_s + d_r) ∫_{(i−1)T}^{iT} g(u){1 − F(iT − u)} du.   (14.17)
Summing Equations 14.16 and 14.17 gives the complete expected downtime
per renewal cycle:
E(CD) = Σ_{i=1}^∞ { [(i − 1)d_s + d_r] ∫_{(i−1)T}^{iT} g(u) du + (d_f − d_r) ∫_{(i−1)T}^{iT} g(u) F(iT − u) du }.   (14.18)
Similarly, the expected renewal cycle length is

E(CL) = Σ_{i=1}^∞ { ∫_{(i−1)T}^{iT} t ∫_{(i−1)T}^{t} g(u) f(t − u) du dt + iT ∫_{(i−1)T}^{iT} g(u){1 − F(iT − u)} du }.   (14.19)
Finally, the expected downtime per unit time is

D(T) = [Σ_{i=1}^∞ { [(i − 1)d_s + d_r] ∫_{(i−1)T}^{iT} g(u) du + (d_f − d_r) ∫_{(i−1)T}^{iT} g(u) F(iT − u) du }]
       / [Σ_{i=1}^∞ { ∫_{(i−1)T}^{iT} t ∫_{(i−1)T}^{t} g(u) f(t − u) du dt + iT ∫_{(i−1)T}^{iT} g(u){1 − F(iT − u)} du }].   (14.20)
Following a discussion with the chief technician, it seemed best to focus on the following items, to ensure a sample of similar machine types, under heavy and constant use, with a usefully long history of failures, and with reasonably well-defined modes of failure. Two pumps were chosen, namely volumetric infusion pumps and peristaltic pumps, all from the intensive-care, neurosurgery and heart-care units. There were 105 volumetric pumps, and the most frequent failure mode was the failure of the pressure transducer. There were 35 peristaltic pumps, and the most frequent failure mode was battery failure. For a detailed description of the case, data and model fitting see Baker and Wang (1991). Several distributions were tried for the initial and delay time distributions of both pumps, and it turned out that in both cases a Weibull distribution was best for the initial time distribution and an exponential distribution for the delay time distribution. The estimated parameter values, obtained from historical data using the maximum likelihood method for both pumps, are shown in Table 14.1.
Table 14.1. Estimated parameter values for the pumps

Pump                  Initial time g(u) = αβ(αu)^{β−1} e^{−(αu)^β}   Delay time f(h) = λe^{−λh}
Volumetric infusion   α = 0.0017, β = 1.42                           λ = 0.0174
Peristaltic           α = 0.0007, β = 2.41                           λ = 0.0093
Although the cost data were not recorded, it was relatively easy to estimate the cost of an inspection (called preventive maintenance in the hospital) and the cost of an inspection repair if a defect was identified. However, it was extremely difficult to estimate the failure cost, since if a pump failed to work when needed, the penalty cost could be very high compared with the cost of the pump itself. Nevertheless, some estimates were provided; these are shown in Table 14.2.
Table 14.2. Cost estimates

Pump                  Inspection cost   Inspection repair cost   Failure cost
Volumetric infusion   15                50                       2000
Peristaltic           15                70                       1000
This time we cannot derive an analytical formula for the expected cost because of the use of the Weibull distribution, so numerical integration has to be used to calculate Equation 14.20. We did this using the mathematics software package MathCad, and the results are shown in Figures 14.9 and 14.10.
Figure 14.9. Expected cost per unit time vs. inspection interval for the volumetric infusion
pump
Figure 14.10. Expected cost per unit time vs. inspection interval for the peristaltic pump
Time is given in days in Figures 14.9 and 14.10: the optimal inspection interval for the volumetric infusion pump is about 30 days, and for the peristaltic pump around 70 days. The hospital at the time checked the pumps at six-month intervals, so clearly for both pumps the inspection interval should be shortened. However, it has to be pointed out that the model is sensitive to the failure cost; had a different estimate been provided, the recommendation would have been different.
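The MathCad worksheets are not reproduced here, but the cost analogue of Equation 14.20 (downtimes replaced by the Table 14.2 costs) can be evaluated with elementary numerical integration. The sketch below does this for the volumetric infusion pump using the Table 14.1 estimates. The truncation horizon, quadrature sizes, search grid, and the assumption that the cost of the renewing inspection is absorbed into the repair cost c_r are modelling choices of this sketch, not of the original study:

```python
import math

# Volumetric infusion pump, Tables 14.1 and 14.2 (time in days):
alpha, beta = 0.0017, 1.42       # Weibull initial time parameters
lam = 0.0174                     # exponential delay time scale
c_s, c_r, c_f = 15.0, 50.0, 2000.0

def g(u):                        # Weibull initial time pdf
    if u <= 0.0:
        return 0.0
    return alpha * beta * (alpha * u) ** (beta - 1) * math.exp(-(alpha * u) ** beta)

f = lambda h: lam * math.exp(-lam * h)      # delay time pdf
F = lambda h: 1.0 - math.exp(-lam * h)      # delay time cdf

def trapz(fn, lo, hi, n=40):
    h = (hi - lo) / n
    return h * (0.5 * (fn(lo) + fn(hi)) + sum(fn(lo + i * h) for i in range(1, n)))

def cost_rate(T):
    """Cost analogue of Equation 14.20, sums truncated at a 2800-day horizon."""
    num = cyc = 0.0
    for i in range(1, int(2800.0 / T) + 2):
        lo, hi = (i - 1) * T, i * T
        gi = trapz(g, lo, hi)                               # defect arises in i-th interval
        gF = trapz(lambda u: g(u) * F(hi - u), lo, hi)      # ... and fails before iT
        num += ((i - 1) * c_s + c_r) * gi + (c_f - c_r) * gF   # cost version of Eq. 14.18
        tp = trapz(lambda t: t * trapz(lambda u: g(u) * f(t - u), lo, t, n=20),
                   lo, hi, n=20)
        cyc += tp + hi * (gi - gF)                          # Eq. 14.19
    return num / cyc

best_T = min(range(10, 101, 10), key=cost_rate)
print(best_T, round(cost_rate(best_T), 2))
```

The search returns an optimum in the region of 20–30 days with a cost rate of roughly 1.4 per day, of the same order as the 30-day optimum read from Figure 14.9; the small shift reflects the simplified inspection-cost accounting noted above.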
HLA: how long ago the defect could first have been noticed;
HML: how much longer the defect could be left unattended before a repair was essential.
The estimates are given by h = HLA for a failure, and h = HLA + HML for an inspection repair; see Figure 14.11a, b. f(h) is then estimated from the data set {h}.
Figure 14.11. Delay time estimates: (a) failure, h = HLA; (b) inspection repair, h = HLA + HML
The following phases for estimating the delay time were suggested by Wang (1997).

The problem identification phase This is for the identification of all major failure types and the possible causes of the failures. It is normally done via a failure mode and criticality analysis, so that a list of dominant failures can be obtained. The process entails a series of discussions with the maintenance engineers to clarify any hidden issues. If failure data exist they should be used to validate the list; otherwise a questionnaire should be designed and forwarded to the person concerned to obtain a list of dominant failure types.
Expert identification and choice phase The term expert is not defined by any quantitative measure of resident knowledge. However, it is clear in this case that a person who is regarded by others as one of the most knowledgeable about the machine should be chosen as the expert. The shop floor fitters or any maintenance technicians or engineers who maintain the machine would be the desired experts (Christer and Waller 1984). After the set of experts is identified, a choice is made of which experts to use in the study. Full discussion with management is necessary in order to select the persons who know the machine best. On psychological grounds, five or fewer experts are expected to take part in the exercise, but not fewer than three.
The question formulation phase The questions we want to ask in this case concern the rate of occurrence of defects (assuming we are modelling a complex plant) and the delay time distribution. For the rate of arrival of a defect type, we can simply ask for a point estimate, since it is not a random variable. Without maintenance interventions, this would, in the long term, be equal to the average number of the same failure type per unit time. For example, we may ask "how many failures of this type will occur per year, month, week or day?". It is noted that this quantity is usually observable. In fact, our focus is mainly on the delay time estimates.
Given the amount of uncertainty inherent in making a prediction of the delay time, the experts may feel uncomfortable about giving a point estimate, and may prefer to communicate something about the range of their uncertainty. Accepting these points, perhaps the best the experts could do in this case is to give their subjective probability mass function for the quantity in question. In other words, they could provide an estimate over intervals such that the mass above each interval is proportional to their subjective probability measure. Alternatively, three point estimates can be requested, such as the most likely, the minimum and the maximum durations of the delay time for a particular type of failure.
The term delay time was not used in the questions, since it would take some effort to explain what the delay time is. Instead, we just asked a similar question, such as HLA. Based upon our case experience, even this question was difficult for the experts to understand. The lesson learned is to demonstrate an example for them before starting the session.
The elicitation phase Elicitation should be performed with each expert individually and, if possible, with the analyst present, which proved to be vital in our case studies. The above-mentioned histogram was used to draw the answer from the experts, so that the experts have a visual overview of their estimates, and a smooth histogram can be achieved if the experts are advised accordingly. The maximum number of histogram intervals is set to five, as advised by psychological experiments.
The calibration phase Roughly speaking, calibration is intended to measure the extent to which a set of probability mass functions corresponds to reality. Reviewing the problem, we have concluded that subjective calibration is not recommended due to its time-consuming nature. If any objective data are available, we may calibrate the experts' opinions by a Bayesian approach, as discussed by many others. Another approach is to calibrate the estimates by matching observed statistics. If a significant difference is found, the estimates must be revised.
The combination phase Expert resolution, or combining probabilities from experts, has received some attention. Here we use one of the simplest approaches, namely the weighting method: a weighted average of the estimates of all experts. The weights need to be selected carefully according to each expert's level of expertise, and their sum should be equal to one. Other more complicated methods are available; see Wang (1997).

It is noted that the combined delay time distribution obtained from this phase is in the form of a discrete probability distribution, whereas a continuous delay time distribution is needed in delay time inspection modelling. To achieve this, based upon the number of delay times in each interval, an estimate F̂(h) of the continuous delay time distribution F(h) can be obtained by fitting a distribution from a known family of failure distributions, such as the exponential or Weibull, using the least squares or maximum likelihood method.
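As a concrete sketch of the combination and fitting steps, the following combines weighted expert histograms and then fits an exponential CDF by least squares. All expert names, probability masses, weights and bin edges are invented for illustration:

```python
import math

# Hypothetical elicited delay time histograms (days, at most five intervals):
# each expert gives probability mass for [0,10), [10,20), ..., [40,50).
bins = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
expert_pmfs = {
    "fitter_A":   [0.40, 0.30, 0.15, 0.10, 0.05],
    "engineer_B": [0.50, 0.25, 0.15, 0.05, 0.05],
    "fitter_C":   [0.30, 0.35, 0.20, 0.10, 0.05],
}
weights = {"fitter_A": 0.4, "engineer_B": 0.4, "fitter_C": 0.2}  # sum to one

# Combination phase: weighted average of the experts' mass functions.
combined = [sum(weights[e] * pmf[k] for e, pmf in expert_pmfs.items())
            for k in range(len(bins))]

# Fit F(h) = 1 - exp(-lam*h) to the combined empirical CDF at the bin
# upper edges, by least squares over a grid of candidate rates.
edges = [hi for (_, hi) in bins]
emp_cdf = [sum(combined[:k + 1]) for k in range(len(bins))]

def sse(lam):
    return sum((1 - math.exp(-lam * h) - c) ** 2 for h, c in zip(edges, emp_cdf))

lam_hat = min((k * 0.001 for k in range(1, 501)), key=sse)
print(round(lam_hat, 3))
```

Maximum likelihood fitting on the binned counts would be a drop-in alternative to the least-squares criterion used here.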
The updating phase This phase applies once failures occur and recorded findings become available. In a sense it is a form of calibration.
A case study using the above method is detailed in Akbarov et al. (2006).
14.5.2.3 An Empirical Bayesian Approach for Estimating the DTM Parameters Based on Subjective Data
In previous subjective-data-based delay time estimation approaches (Christer and Waller 1984; Wang 1997; Akbarov et al. 2006), some direct subjective estimates of the delay time are required, which has been found to be extremely difficult for the experts, since the delay time is not usually observable and is difficult to explain (Akbarov et al. 2006).

We now introduce a recently developed approach which starts with subjective data and then updates the estimates when objective data become available. The initial estimates are made using the empirical Bayesian method, matching a few subjective summary statistics provided by the experts. These statistics should be designed to be easy to obtain from the experience of the experts and from observed practice, rather than from unobservable delay times. Then the updating
mechanism enters the process when objective data become available; this requires repeated evaluation of the likelihood function, which will be introduced later. In the framework of Bayesian statistics, and assuming no objective data are available at the beginning, we first assume a prior on the parameters which characterise the underlying defect and failure arrival processes. When objective data become available, we calculate the joint posterior distribution of the parameters, and may then use this posterior distribution to evaluate the expected cost or downtime per unit time conditional on the observed data.
Assume for now that we are interested in the rate of arrival of defects, λ, and the delay time pdf, f(h), which is characterised by a two-parameter distribution f(h | α, β). Unlike the methods proposed in Christer and Waller (1984) and Wang (1997), here we treat the parameter λ and the α and β in f(h | α, β) as random variables. The classical Bayesian approach is used here to define the prior distributions for the model parameters λ, α and β as f(λ | ω_λ), f(α | ω_α) and f(β | ω_β), where ω_λ is the set of hyper-parameters within f(λ | ω_λ), and similarly for ω_α and ω_β.

Once those are available, the point estimates of λ, α and β are their expected values,

λ̂ = ∫ λ f(λ | ω_λ) dλ,  α̂ = ∫ α f(α | ω_α) dα  and  β̂ = ∫ β f(β | ω_β) dβ,

and the expectation of any function g(λ, α, β) of the parameters is

E[g(λ, α, β)] = ∫∫∫ g(λ, α, β) f(λ | ω_λ) f(α | ω_α) f(β | ω_β) dλ dα dβ.   (14.21)

If we can obtain a subjective estimate of E[g(λ, α, β)] provided by the experts, denoted by g_s, then letting E[g(λ, α, β)] = g_s we have

g_s = ∫∫∫ g(λ, α, β) f(λ | ω_λ) f(α | ω_α) f(β | ω_β) dλ dα dβ.   (14.22)
Equation 14.22 is only one such equation; if several different subjective estimates were provided, we would have a set of equations like Equation 14.22. The hyper-parameters can then be estimated by solving these equations, provided the number of equations is at least the number of hyper-parameters in ω. We now demonstrate this in our case.

Suppose that the experts can provide us with the following subjective statistics for estimating λ.
The expected total number of failures and defects identified over a period of length T, n_f + n_d, satisfies

n_f + n_d = ∫∫∫ λT f(λ | ω_λ) f(α | ω_α) f(β | ω_β) dλ dα dβ = ∫ λT f(λ | ω_λ) dλ.   (14.23)

Similarly, from the property of the HPP, that is, P(N_d(0, T) = n | λ) = e^{−λT}(λT)^n / n!, we have

p_nd = ∫ P(N_d(0, T) = 0 | λ) f(λ | ω_λ) dλ = ∫ e^{−λT} f(λ | ω_λ) dλ,   (14.24)
where N_d(0, T) is the number of defects arising in [0, T) and p_nd is the experts' estimate of the probability that no defect arises over T. If we have only two hyper-parameters in ω_λ, then solving Equations 14.23 and 14.24 simultaneously gives the estimated values of the hyper-parameters in ω_λ. Note that λ is independent of α and β, so the integrals over f(α | ω_α) and f(β | ω_β) drop out of Equation 14.23. Similarly, if more subjective estimates were provided, the hyper-parameters in ω_α and ω_β could be obtained. For a detailed description of such an approach to estimating delay time model parameters see Wang and Jia (2007).
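As a worked sketch, suppose the prior f(λ | ω_λ) is a Gamma(k, θ) distribution (an assumption of this sketch; the chapter does not prescribe a family). Equation 14.23 then reads kθT = n_f + n_d, and Equation 14.24 reads (1 + θT)^{−k} = p_nd, so the two hyper-parameters can be found by elimination and bisection. The subjective statistics below are hypothetical:

```python
import math

# Hypothetical subjective statistics over an observation window T:
T = 100.0        # days
n_total = 8.0    # experts' estimate of failures + defects over T  (n_f + n_d)
p_nd = 0.01      # experts' probability of no defect at all over T

# Gamma(k, theta) prior for the defect arrival rate lambda:
#   Eq. 14.23:  E[lambda]*T = k*theta*T            = n_total
#   Eq. 14.24:  E[exp(-lambda*T)] = (1+theta*T)^-k = p_nd
# Eliminate theta = n_total/(k*T); h(k) is decreasing, so bisect for its root.
def h(k):
    theta = n_total / (k * T)
    return (1.0 + theta * T) ** (-k) - p_nd

lo, hi = 1e-3, 100.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if h(lo) * h(mid) <= 0:   # sign change: root lies in [lo, mid]
        hi = mid
    else:
        lo = mid
k_hat = 0.5 * (lo + hi)
theta_hat = n_total / (k_hat * T)
print(round(k_hat, 3), round(theta_hat, 5))
```

A Gamma prior has the added convenience of being conjugate to the Poisson counts used in the later updating step.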
Obviously this approach improves on the previously developed subjective methods in terms of the ease of data collection and the accuracy of the estimated parameters. It also links naturally, via Bayes' theorem, to the objective method for estimating DTM parameters presented in the next section, if objective data become available (Wang and Jia 2007).
14.5.3 Objective Data Method
Objective data for complex systems under regular inspections should consist of the failures (and associated times) in each interval of operation between inspections, and the number of defects found in the system at each inspection. From this information, we estimate the parameters for the chosen form of the delay time model.
Initially, we consider a simple case of the estimation problem for the basic delay time model, where only the number of failures, m_i, occurring in each cycle [(i − 1)T, iT) and the number of defects, j_i, found and repaired at each inspection (at time iT) are required. We do not know the actual failure times within the cycles. The probability of observing m_i failures in [(i − 1)T, iT) is

P(N_f((i − 1)T, iT) = m_i) = e^{−E[N_f((i−1)T, iT)]} E[N_f((i − 1)T, iT)]^{m_i} / m_i!,   (14.25)

and the probability of observing j_i defects at the inspection at iT is

P(N_s(iT) = j_i) = e^{−E[N_s(iT)]} E[N_s(iT)]^{j_i} / j_i!.   (14.26)
As the observations are independent, the likelihood of observing the given data set is just the product of the Poisson probabilities of observing each cycle of data, m_i and j_i. As such, the likelihood function for K intervals of data is

L(Θ) = Π_{i=1}^{K} [e^{−E[N_f((i−1)T, iT)]} E[N_f((i − 1)T, iT)]^{m_i} / m_i!] [e^{−E[N_s(iT)]} E[N_s(iT)]^{j_i} / j_i!],   (14.27)
where Θ is the set of parameters within the delay time model. The likelihood function is optimised with respect to the parameters to obtain the estimated values. This process can be simplified by taking natural logarithms. The log-likelihood function is

ℓ(Θ) = Σ_{i=1}^{K} { m_i ln E[N_f((i − 1)T, iT)] − E[N_f((i − 1)T, iT)] + j_i ln E[N_s(iT)] − E[N_s(iT)] } − Σ_{i=1}^{K} (ln m_i! + ln j_i!),   (14.28)

where the final summation term is irrelevant when maximising the log-likelihood, as it is a constant and therefore not a function of any of the parameters under investigation.
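To make Equations 14.25–14.28 concrete, the sketch below fits a basic delay time model in which defects arrive as an HPP with rate λ and have exponential delays with rate μ, so that under perfect inspections E[N_f((i−1)T, iT)] = λ(T − B(μ)) and E[N_s(iT)] = λB(μ), with B(μ) = (1 − e^{−μT})/μ. The data are synthetic, the grid search is a deliberately crude optimiser, and all parameter values are assumed for illustration:

```python
import math
import random

random.seed(1)
T, K = 7.0, 200                  # weekly inspections, K observed cycles
lam_true, mu_true = 2.0, 0.1     # assumed defect rate and delay rate (per day)

def B(mu):
    """integral_0^T (1 - F(v)) dv for an exponential delay time."""
    return (1.0 - math.exp(-mu * T)) / mu

def poisson(mean):
    """Simple Poisson sampler by inversion (adequate for modest means)."""
    l, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= random.random()
        if p <= l:
            return k
        k += 1

m = [poisson(lam_true * (T - B(mu_true))) for _ in range(K)]  # failures per cycle
j = [poisson(lam_true * B(mu_true)) for _ in range(K)]        # defects found at PM

def negloglik(lam, mu):
    """Negative of Eq. 14.28, dropping the constant factorial term."""
    ef, es = lam * (T - B(mu)), lam * B(mu)
    return (sum(ef - mi * math.log(ef) for mi in m)
            + sum(es - ji * math.log(es) for ji in j))

grid = [(l_ / 50.0, mu_ / 500.0)
        for l_ in range(75, 126)      # lambda in [1.5, 2.5]
        for mu_ in range(25, 76)]     # mu in [0.05, 0.15]
lam_hat, mu_hat = min(grid, key=lambda p: negloglik(*p))
print(lam_hat, mu_hat)
```

On this synthetic data the grid search recovers estimates close to the generating values λ = 2 and μ = 0.1; a gradient-based optimiser would be used in practice.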
When the times of failures are available, it is often necessary to refine the likelihood function of Equation 14.27 by considering the detailed pattern of behaviour within each interval, in terms of the number of failures and their associated times. Defining t_ij as the time of the j-th failure in the i-th inspection interval, the likelihood is given by (Christer et al. 1998a)

L(Θ) = Π_{i=1}^{K} [Π_{j=1}^{m_i} v_i(t_ij)] e^{−E[N_f((i−1)T, iT)]} e^{−E[N_s(iT)]} E[N_s(iT)]^{j_i} / j_i!,   (14.29)

where v_i(t) denotes the rate of occurrence of failures at time t within the i-th interval.
In the case study of Christer et al. (1995), only the daily numbers of failures were available, and they formulated a different likelihood taking account of this pattern of data. It was done essentially by formulating the probability of a particular number of failures for each day over each inspection interval; the likelihood for a particular inspection interval is then the product of these probabilities and the probability of observing some number of defects at the inspection. See Christer et al. (1995) for details.
14.5.4 A Case Example
A copper works in the north-west of England has used the same extrusion press for over 30 years, and the plant is a key item in the works since 70% of its products go through this press at some stage of their production. The machine comprises a 1700-ton oil-hydraulic extrusion press with one 1700 kW induction heater and completely mechanised gear for the supply of billets to the press and for the removal of the extruded products. The machine was operated 15–18 h a day (two shifts), five days a week, excluding holidays and maintenance downtime. Preventive maintenance (PM) has been carried out on this machine since 1993, consisting of a thorough inspection of the machinery, along with any subsequent adjustments or repairs if the defects found can be rectified within the PM period. Any major defects which cannot be rectified during the PM time are supposed to be dealt with during non-production hours. PM lasts about 2 h and is performed once a week at the beginning of each week.
Questions of concern are: (i) whether PM is or could be effective for this machine; (ii) whether the current PM period is the right choice, particularly the one-week PM interval, which was based upon the maintenance engineers' subjective judgement; and (iii) whether PM is efficient, i.e. whether it can identify most defects present and reduce the number of failures caused by those defects.
In this case study, the delay time model introduced earlier was used to address the above questions. The first question can also be answered in part by comparing the total downtime per week under PM with the total downtime per week in the previous years without PM. A parallel study carried out by the company revealed that PM has lowered the total downtime: the proportion of downtime was reduced from 7.8% to 5.8%.
To establish the relationship between the downtime measure and the PM
activities using the delay time concept, the first task is to estimate the parameters
of the underlying delay time distribution from available data, and hence build a
model to describe the failure and PM processes. The type of delay time model used
in the study is the non-perfect inspection model.
In the original study (Christer et al. 1995), a number of different candidate delay time distributions were considered, including the exponential and Weibull distributions. The chosen form for the delay time distribution is a mixed distribution consisting of an exponential distribution (scale parameter λ) with a proportion P of defects having a delay time of 0. The CDF is given by

F(h) = 1 − (1 − P)e^{−λh}.
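The mixed distribution is straightforward to evaluate. The sketch below uses the estimates quoted for this case (P = 0.5546, λ = 0.0178); the time unit for λ is assumed here to be hours:

```python
import math

# Mixed delay time distribution: a proportion P of defects fail immediately
# (delay time 0); the rest have an exponential delay with rate `scale`.
# Parameter values are the case-study estimates quoted in the text.
P, scale = 0.5546, 0.0178      # `scale` assumed to be per hour

F = lambda h: 1.0 - (1.0 - P) * math.exp(-scale * h)

print(round(F(0.0), 4))        # mass at zero equals P
print(round(F(168.0), 4))      # probability the delay is at most one week (168 h)
```

The point mass at zero represents defects that give no usable warning, which is why inspection alone cannot eliminate all failures for this press.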
The maximum likelihood estimates were 1.3561 for the defect arrival rate, r = 0.902, P = 0.5546 and scale parameter λ = 0.0178, with coefficients of variation 0.0832, 3.4956, 0.4266 and 1.1572 respectively.
Inserting the optimal parameter estimates into the log-likelihood function gives an ML value of −101.86. See Christer et al. (1995) for the analysis and the fit of the model to the data.
14.7 Conclusion
There is considerable scope for advances in maintenance modelling that impact
productivity upon current maintenance practice. This chapter reports upon one
methodology for modelling inspection practice. The power of mathematics and
statistics is used to exploit an elementary mathematical construct of failure process
to build operational models of maintenance interactions. The delay time concept is
a natural one within the maintenance engineering context. More importantly, it can
be used to build quantitative models of the inspection practice of asset items, which
have proved to be valid in practice. The theory is still developing, but so far there
has been no technical barrier to developing DTM for any plant items studied.
This chapter has introduced the delay time concept and has shown how it can be applied to various production equipment to optimise inspection intervals. To provide substance to this statement, the processes of model parameter estimation and case examples outlining the use of delay time modelling in practice were introduced. We have presented only some fundamental DTMs and associated parameter estimation procedures; interested readers can refer to the references listed at the end of the chapter for further reading.
14.8 Dedications
This chapter is dedicated to Professor Tony Christer who recently passed away.
Tony was a world class researcher with an international reputation. He was the
originator of the delay time concept and had produced in conjunction with others a
considerable number of papers in delay time modelling theory and applications. He
was a great man who enthused, mentored and guided many of us to strive for
higher quality research. He will be sadly missed by all who knew him.
14.9 References
Abdel-Hameed, M. (1995), Inspection, maintenance and replacement models, Computers and Operations Research, 22(4), 435–441.
Akbarov, A., Wang, W. and Christer, A.H. (2006), Problem identification in the frame of maintenance modelling: a case study, to appear in International Journal of Production Research.
Baker, R.D. and Wang, W. (1991), Estimating the delay time distribution of faults in repairable machinery from failure data, IMA Journal of Mathematics Applied in Business and Industry, 4, 259–282.
Baker, R.D. and Wang, W. (1993), Developing and testing the delay time model, Journal of the Operational Research Society, 44(4), 361–374.
Barlow, R.E. and Proschan, F. (1965), Mathematical Theory of Reliability, Wiley, New York.
Carr, M.J. and Christer, A.H. (2003), Incorporating the potential for human error in maintenance models, Journal of the Operational Research Society, 54(12), 1249–1253.
Christer, A.H. (1976), Innovative decision making, Proceedings of the NATO Conference on the Role and Effectiveness of Theory of Decision in Practice (eds. Bowen, K.C. and White, D.J.), Hodder and Stoughton, 368–377.
Christer, A.H. (1999), Developments in delay time analysis for modelling plant maintenance, Journal of the Operational Research Society, 50, 1120–1137.
Christer, A.H. and Redmond, D.F. (1990), A recent mathematical development in maintenance theory, International Journal of Production Economics, 24, 227–234.
Christer, A.H. and Waller, W.M. (1984), Delay time models of industrial inspection maintenance problems, Journal of the Operational Research Society, 35, 401–406.
Christer, A.H. and Wang, W. (1995), A delay time based maintenance model of a multi-component system, IMA Journal of Mathematics Applied in Business and Industry, 6, 205–222.
Christer, A.H. and Whitelaw, J. (1983), An operational research approach to breakdown maintenance: problem recognition, Journal of the Operational Research Society, 34, 1041–1052.
Christer, A.H., Wang, W., Baker, R.D. and Sharp, J.M. (1995), Modelling maintenance practice of production plant using the delay time concept, IMA Journal of Mathematics Applied in Business and Industry, 6, 67–83.
Christer, A.H., Wang, W., Sharp, J.M. and Baker, R.D. (1997), A stochastic modelling problem of high-tech steel production plant, in Stochastic Modelling in Innovative Manufacturing, Lecture Notes in Economics and Mathematical Systems (eds. Christer, A.H., Osaki, S. and Thomas, L.C.), Springer, Berlin, 196–214.
Christer, A.H., Wang, W., Choi, K. and Sharp, J.M. (1998a), The delay-time modelling of preventive maintenance of plant given limited PM data and selective repair at PM, IMA Journal of Mathematics Applied in Business and Industry, 9, 355–379.
Christer, A.H., Wang, W., Sharp, J.M. and Baker, R.D. (1998b), A case study of modelling preventive maintenance of production plant using subjective data, Journal of the Operational Research Society, 49, 210–219.
Christer, A.H., Wang, W. and Lee, C. (2000), A data deficiency based parameter estimating problem and case study in delay time PM modelling, International Journal of Production Economics, 67(1), 63–76.
Christer, A.H., Wang, W., Choi, K. and Schouten, F.A. (2001), The robustness of the semi-Markov and delay time maintenance models to the Markov assumption, IMA Journal of Management Mathematics, 12, 75–88.
Kaio, N. and Osaki, S. (1989), Comparison of inspection policies, Journal of the Operational Research Society, 40(5), 499–503.
Luss, H. (1983), An inspection policy model for production facilities, Management Science, 29(9), 1102–1109.
McCall, J. (1965), Maintenance policies for stochastically failing equipment: a survey, Management Science, 11(5), 493–524.
Moubray, J. (1997), Reliability Centred Maintenance, Butterworth-Heinemann, Oxford.
Ross, S.M. (1983), Stochastic Processes, Wiley, New York.
Taylor, H.M. and Karlin, S. (1998), An Introduction to Stochastic Modeling, 3rd edn., Academic Press, San Diego.
Thomas, L.C., Gaver, D.P. and Jacobs, P.A. (1991), Inspection models and their application, IMA Journal of Management Mathematics, 3(4), 283–303.
Wang, W. (1997), Subjective estimation of the delay time distribution in maintenance modelling, European Journal of Operational Research, 99, 516–529.
Wang, W. (2000), A model of multiple nested inspections at different intervals, Computers and Operations Research, 27, 539–558.
Wang, W. (2006), Modelling the probability assessment of the system state using available condition information, to appear in IMA Journal of Management Mathematics.
Wang, W. and Christer, A.H. (1997), A modelling procedure to optimise component safety inspection over a finite time horizon, Quality and Reliability Engineering International, 13(4), 217–224.
Wang, W. and Christer, A.H. (2003), Solution algorithms for a multi-component system inspection model, Computers and Operations Research, 30, 190134.
Wang, W. and Jia, X. (2007), A Bayesian approach in delay time maintenance model parameters estimation using both subjective and objective data, Quality and Reliability Engineering International, 23, 95–105.
Part E
Management
15
Maintenance Outsourcing
D.N.P. Murthy and N. Jack
15.1 Introduction
Every business (mining, processing, manufacturing, and service-oriented businesses such as transport, health, utilities and communication) needs a variety of equipment to deliver its outputs. Equipment is an asset that is critical for business success in the fiercely competitive global economy. However, equipment degrades with age and usage, ultimately becomes non-operational, and businesses incur heavy losses when their equipment is not in full operational mode. For example, in open-cut mining, the loss in revenue resulting from a typical dragline being out of action is around one million dollars per day, and the loss in revenue from a 747 aircraft being out of action is roughly half a million dollars per day. Non-operational equipment leads to delays in the delivery of goods and services, and this in turn causes customer dissatisfaction and loss of goodwill.
Rapid changes in technology have resulted in equipment becoming more complex and expensive. Maintenance actions can reduce the likelihood of such equipment becoming non-operational (preventive maintenance) and can also restore a non-operational unit to an operational state (corrective maintenance). For most businesses it is no longer economical to carry out maintenance in-house. There are a variety of reasons for this, including the need for a specialist workforce and diagnostic tools that often require constant upgrading. In these situations it is more economical to outsource the maintenance (in part or in total) to an external agent through a service contract. Campbell (1995) gives details of a survey which reported that 35% of North American companies had considered outsourcing some of their maintenance.
Consumer durables (products such as kitchen appliances, televisions, automobiles, computers, etc.) that are bought by individuals are certainly getting more complex: a 1990 automobile is immensely more complex than its 1950 counterpart. Customers need assurance that a new product will perform satisfactorily over its lifetime. In the case of consumer durables, manufacturers have used warranties to provide this assurance during the early part of a product's useful life. Under warranty the manufacturer repairs all failures that occur within the warranty period, often at no cost to the customer. The warranty period for most consumer durables has been increasing, and warranty terms have been becoming more favourable to the customer. For example, the typical warranty period for an automobile was 90 days in 1930, 1 year in 1970, and 3 years in 1990. A warranty is tied to the sale of a product, and the cost of servicing the warranty is factored into the sale price. For customers who need assurance beyond the warranty period, manufacturers and/or third parties (such as financial institutions, insurance companies and independent operators) offer extended warranties (or service contracts) at an additional cost to the customer. Extended warranties of 5–7 years are now fairly common for automobiles.
Governments (local, state or national) own infrastructure (roads, rail and communication networks, public buildings, dams, etc.) that was traditionally maintained by in-house maintenance departments. Here there is a growing trend towards outsourcing these maintenance activities to external agents so that governments can focus on their core activities.
In all the above cases, we have an asset (complex equipment, a consumer durable or an element of public infrastructure) that is owned by the first party (the owner), and the asset maintenance is outsourced to the second party (the service agent, also referred to as the contractor in many technical papers) under a service contract. This chapter deals with maintenance outsourcing from the perspectives of both the owner (the customer for the maintenance service) and the service agent (the service provider). We focus on the first case (where the customer is a business): we develop a framework to indicate the different issues involved, carry out a review of the literature, and indicate topics that need further investigation and research.
The outline of the chapter is as follows. Section 15.2 deals with the customer
and the agent perspectives. In Section 15.3, we propose a framework to study maintenance outsourcing. Section 15.4 reviews the relevant literature on maintenance
outsourcing and on extended warranties. Section 15.5 deals with a game theoretic
approach to maintenance outsourcing and extended warranties. In Section 15.6 we
briefly discuss agency theory and its relevance to maintenance outsourcing and, in
Section 15.7 we conclude with a brief discussion of future research in maintenance
outsourcing.
Maintenance Outsourcing
15.2.1.1 Businesses
Businesses (producing products and/or services) need to come up with new
solutions and strategies to develop and increase their competitive advantage.
Outsourcing is one of these strategies that can lead to greater competitiveness
(Embleton and Wright 1998). It can be defined as a managed process of transferring activities performed in-house to some external agent. The conceptual basis
for outsourcing (see Campbell 1995) is as follows:
1. Domestic (in-house) resources should be used mainly for the core competencies of the company.
2. All other (support) activities that are not considered strategic necessities, or for which the company does not possess adequate competences and skills, should be outsourced (provided there is an external agent who can carry out these activities in a more efficient manner).
Most businesses tend not to view maintenance as a core activity and have moved towards outsourcing it. Outsourcing maintenance offers a number of advantages.
For very specialised (and custom built) products, the knowledge to carry out the
maintenance and the spares needed for replacement need to be obtained from the
original equipment manufacturer (OEM). In this case, the customer is forced into
having a maintenance service contract with the OEM, and this can result in a non-competitive market. In the USA, Section II of the Sherman Act (Khosrowpour
1995) deals with this problem by making it illegal for OEMs to act in this manner.
When the maintenance service is provided by an agent other than the original equipment manufacturer (OEM), the cost of switching often prevents customers from changing their service agent. In other words, customers get locked in and are unable to do anything about it without major financial consequences.
Figure 15.1. Different parties that need to be considered in the maintenance of infrastructures: regulator, government, owner, operator, service agent, the public, and the asset (infrastructure) itself
[Figure: the owner (customer) and the service agent linked by the contract, showing past usage, past maintenance, nominated and actual usage rates, nominated and actual maintenance, penalties/incentives, and the asset degradation rate.]
ment. The stress can be thermal, mechanical, electrical, etc., and the reliability decreases as the stress increases and/or the environment gets harsher.
When a failure occurs, the asset can be restored to an operational state through
corrective maintenance (CM). In the case of equipment, this involves repairing or
replacing the failed components. In the case of the road example, the CM involves
filling the potholes and resealing a section of the road. The degradation in the asset
state can be controlled through use of preventive maintenance (PM) and, in the
case of equipment, this involves regular monitoring and replacing of components
before failure.
The asset state at any given time (subsequent to it being put into operation) is a
function of its inherent reliability and past history of usage and maintenance. This
information is important in the context of maintenance service contracts for used
assets. The information that the service agent (and the customer) has can vary from
very little to a lot (if detailed records of past usage and maintenance have been kept).
Finally, for some assets, the delivery of maintenance requires the service agent
to visit the site where the asset is located (for example, lifts in buildings and roads)
and for others (most consumer durables and some industrial equipment) the failed
asset can be brought to a service centre to carry out the maintenance actions.
15.3.2 Maintenance
15.3.2.1 Corrective Maintenance (CM)
These are corrective actions performed when the asset fails. The most common form of CM is minimal repair, where the state of the asset after repair is nearly the same as that just before failure. The other extreme is as good as new repair, and this is seldom possible unless one replaces the failed asset with a new one. Any repair action that restores the asset state to better than that before failure but not as good as that of a new asset is referred to as imperfect repair.
15.3.2.2 Preventive Maintenance (PM)
In the case of equipment or consumer durables, PM actions are carried out at component level where components are replaced based on age, usage and/or condition.
As a result, there are several different kinds of PM policies; see Blischke and Murthy (2000) for some of the more commonly used ones.
System level modeling. If only CM and no PM is used, and the time to repair is very much smaller than the time between failures, then one can model failures over time as a stochastic point process with an intensity function λ(t) that is increasing with t (time or age) to capture the degradation with time (see Rigdon and Basu 2000). The effect of operating stress and operating environment can be modeled through a Cox regression model where the intensity function is modified to g(z)λ(t), where z is the vector of covariates representing the stress and environmental variables (see Cox and Oakes 1984).
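As an illustration of this modeling approach, the sketch below computes a power-law (Weibull-type) baseline intensity and its Cox-regression modification. The power-law form, the parameter values, and the exponential link g(z) = exp(γ·z) are illustrative assumptions, not values from the chapter.

```python
import math

def baseline_intensity(t, alpha=0.5, beta=2.0):
    """Power-law intensity lambda0(t) = alpha*beta*t^(beta-1), increasing in t for beta > 1."""
    return alpha * beta * t ** (beta - 1)

def cox_intensity(t, z, gamma, alpha=0.5, beta=2.0):
    """Cox-modified intensity g(z)*lambda0(t), with g(z) = exp(gamma . z)."""
    g = math.exp(sum(gi * zi for gi, zi in zip(gamma, z)))
    return g * baseline_intensity(t, alpha, beta)

def expected_failures(T, z, gamma, alpha=0.5, beta=2.0):
    """Expected failures over [0, T]: the integral of the intensity, here g(z)*alpha*T^beta in closed form."""
    g = math.exp(sum(gi * zi for gi, zi in zip(gamma, z)))
    return g * alpha * T ** beta
```

A harsher environment (larger covariate values with positive coefficients) scales the whole intensity up, which matches the qualitative statement in the text.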
The effect of PM actions can be modeled through a reduction in the intensity function, as shown in Figure 15.3. The level of PM (indicated in the figure) determines the reduction in the intensity function, and the cost of a PM action increases with the level of PM.

[Figure 15.3: the intensity function over time, with downward jumps at the PM action times T1, T2, …]
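The reduction effect of PM can be illustrated with a simple virtual-age sketch: each PM action rolls back a fraction delta of the ageing accumulated since the previous PM. The virtual-age mechanism and the power-law cumulative intensity are assumptions chosen for illustration; the chapter does not commit to a specific reduction model.

```python
def cum_intensity(t, alpha=0.5, beta=2.0):
    """Cumulative baseline intensity Lambda0(t) = alpha * t**beta."""
    return alpha * t ** beta

def expected_failures_with_pm(L, tau, delta, alpha=0.5, beta=2.0):
    """Expected failures over [0, L] with PM every tau time units.

    Each PM rolls back the effective (virtual) age by a fraction delta of
    the time elapsed since the previous PM: delta = 0 means PM has no
    effect, delta = 1 restores the asset to 'as good as new'.
    """
    failures, v, t = 0.0, 0.0, 0.0
    while t < L:
        dt = min(tau, L - t)
        # failures accumulated in this interval, starting from virtual age v
        failures += cum_intensity(v + dt, alpha, beta) - cum_intensity(v, alpha, beta)
        v += (1.0 - delta) * dt   # PM removes delta*dt of effective ageing
        t += dt
    return failures
```

With delta = 0 the expression collapses to Lambda0(L) (no PM benefit), and a higher PM level yields fewer expected failures, mirroring the trade-off against PM cost described above.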
15.3.3 Contract
The contract is a legal document that is binding on both parties (customer and
service agent) and it needs to deal with technical, management and economic
issues.
15.3.3.1 Technical and Management Issues
Maintenance of an asset involves carrying out several activities, as indicated in Figure 15.4 (adapted from Dunn 1999). There are many different contract scenarios depending on how these activities are outsourced. Table 15.1 indicates three different scenarios (S-1 to S-3).
[Figure 15.4 lists the maintenance activities: work planning, work scheduling, work execution, data recording, data analysis, and the associated decisions.]

[Table 15.1 allocates the decisions D-1, D-2 and D-3 between the customer and the service agent under the scenarios S-1, S-2 and S-3.]
In scenario S-1, the service agent only provides the resources (workforce and material) to execute the work. This corresponds to the minimalist approach to outsourcing. In scenario S-2, the service agent decides how and when the work is done, while what is to be done is decided by the customer. Finally, in scenario S-3 the service agent makes all three decisions.
There is a growing trend towards functional guarantee contracts. Here the contract specifies a level for the output generated from the equipment, for example, the amount of electricity produced by a power plant, or the total length of flights and number of landings and takeoffs per year. The service agent has the freedom to decide on the maintenance needed (subject to operational constraints), with incentives if the target levels are met or exceeded and penalties if they are not. For more on this, see Kumar and Kumar (2004).
In the context of infrastructures, there is a trend towards giving the service
agent the responsibility for ongoing upgrades or the responsibility for the initial
design resulting in a BOOM (build, own, operate and maintain) contract.
The levels of risk to both parties vary with the contract scenario.
15.3.3.2 Economic Issues
There are a number of alternative contract payment structures; Dunn (1999) gives a list of them.
Each of these price structures represents a different level of risk sharing between
the customer and the service agent. According to Vickerman (2004), an increasing
issue in privatized infrastructure is the appropriate incentives needed to ensure
adequate maintenance of the infrastructure as a public resource.
15.3.3.3 Other Issues
Some other issues are as follows:
Requirements. Both parties might need to meet some stated requirement. For
example, the customer needs to ensure that the stresses on the asset do not exceed
the levels specified in the contract as this can lead to greater degradation and
higher servicing costs to the service agent. Similarly, the service agent needs to ensure proper data recording.
Contract duration. This is usually fixed with options for renewal at the end of the
contract.
Dispute resolution. This specifies the avenues to follow when there is a dispute.
Resolution can involve going to a third party (e.g., the courts).
Unless the contract is written properly and relevant data (relating to the equipment and collected by the service agent) are analysed properly by the customer, the long-term costs and risks will escalate.
15.3.4 Maintenance Outsourcing Market
Whether the maintenance outsourcing market is competitive or not depends on the
number of customers and service agents. Table 15.2 indicates the different market
scenarios. These have an impact on issues such as the types of service contracts
available to customers and the pricing of the contracts.
Table 15.2. Market scenarios (rows: number of customers; columns: number of service agents)

Number of customers    One agent    Few agents
One                    A-1          B-1
Few                    A-2          B-2
Many                   A-3          B-3
Some of the relevant papers are Campbell (1995), Judenberg (1994), Martin
(1997), Levery (1998) and Sunny (1995).
Unfortunately, cost has been the sole basis used by businesses for making maintenance outsourcing decisions. Sunny (1995) looks at which activities are to be outsourced by considering the long-term strategic dimension (core competencies) as well as the short-term cost issues.
Bertolini et al. (2004) take a quantitative approach and use the analytic hierarchy
process (AHP) to make decisions regarding the outsourcing of maintenance.
Ashgarizadeh and Murthy (2000) and Murthy and Ashgarizadeh (1998, 1999)
look at maintenance outsourcing from both customer and service agent perspec-
tives and propose game-theoretic models to determine the optimal strategies for
both parties. This approach is discussed further in Section 15.5.
On the application side, Armstrong and Cook (1981) look at clustering of
highway sections for awarding maintenance contracts to minimise the cost and use
a fixed-charge goal programming model to determine the optimal strategy.
Bevilacqua and Braglia (2000) illustrate their AHP model in the context of an
Italian brick manufacturing business having to make decisions regarding maintenance outsourcing.
Stremersch et al. (2001) look at the industrial maintenance market.
15.4.2 Extended Warranties
The literature can be broadly divided into three groups.
15.4.2.1 Group 1: Warranty Cost Analysis
The cost analysis of many different types of basic warranties can be found in
Blischke and Murthy (1994, 1996). For a review of more recent literature, see
Murthy and Djamaludin (2002). These techniques can be easily extended to obtain
the costs for extended warranties and this has been done by Sahin and Polatoglu
(1998).
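For a free-replacement warranty where every failure is rectified by replacement with a new item, the expected servicing cost over the warranty period W is the cost per replacement times the renewal function M(W), which solves M(t) = F(t) + ∫₀ᵗ M(t−x) f(x) dx. The sketch below solves this numerically; the exponential lifetime distribution, the cost figure, and the discretization step are illustrative assumptions, not values from the texts cited above.

```python
import math

def renewal_function(F, f, W, n=1000):
    """Numerically solve M(t) = F(t) + int_0^t M(t-x) f(x) dx on a grid over [0, W]."""
    h = W / n
    M = [0.0] * (n + 1)
    for i in range(1, n + 1):
        # right-endpoint rule for the convolution integral
        s = sum(M[i - j] * f(j * h) for j in range(1, i + 1))
        M[i] = F(i * h) + h * s
    return M[n]

# Exponential lifetimes with rate lam: M(W) = lam*W exactly, a handy check.
lam = 0.8
F = lambda t: 1.0 - math.exp(-lam * t)
f = lambda t: lam * math.exp(-lam * t)
W, c_repl = 2.0, 120.0          # warranty period and cost per replacement (illustrative)
expected_cost = c_repl * renewal_function(F, f, W)
```

Extending the horizon from W to W + E (the extended warranty period) and evaluating M there gives the incremental expected cost of the extension, which is the quantity an EW price must cover.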
15.4.2.2 Group 2: Warranty Servicing Strategy
When a repairable asset fails under warranty, the manufacturer has the choice of
either repairing it or replacing it with a new one. The first option costs less than the
second but a repaired asset has a greater probability of failing during the remainder
of the warranty period. It is therefore important for the manufacturer to choose an
appropriate servicing strategy in order to minimise the expected cost of servicing
the warranty per asset sold.
Servicing strategies for products sold with one-dimensional warranties have
received considerable attention. Biedenweg (1981) and Nguyen and Murthy (1986,
1989) assume that repaired items have independent and identically distributed
lifetimes different from that of a new item and considered strategies where the
warranty period is divided into distinct intervals for repair and replacement.
Nguyen (1984) introduces the first servicing model with minimal repair (see
Barlow and Hunter 1960), with the warranty period split into a replacement
interval followed by a repair interval. The length of the first interval is selected
optimally to minimize the expected warranty cost.
Jack and Van der Duyn Schouten (2000) show that this strategy is sub-optimal
and that the optimal servicing strategy is in fact characterized by three distinct
intervals [0, x), [x, y] and (y, W] where W is the warranty period. The optimal
strategy is to carry out minimal repairs in the first and last intervals and to use
either minimal repair or replacement by new in the middle interval depending on
the age of the item at failure. This strategy is difficult to implement, so Jack and
Murthy (2001) propose a near optimal strategy involving the same three intervals
but with only the first failure in the middle interval resulting in a replacement and
all other failures being minimally repaired.
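The split-interval servicing strategies described above can be compared by simulation. The sketch below estimates, by Monte Carlo, the expected warranty servicing cost of a two-interval strategy that replaces by new on failures in [0, x) and minimally repairs on failures in [x, W]; the power-law intensity and the cost parameters are illustrative assumptions, not figures from the papers cited.

```python
import math
import random

def next_failure(age, alpha=1.0, beta=2.0, rng=random):
    """Next failure time for an item of current age `age` under minimal
    repair, with power-law cumulative intensity Lambda(t) = alpha*t^beta
    (inversion: Lambda(T) = Lambda(age) + Exp(1))."""
    u = rng.random()
    return ((alpha * age ** beta - math.log(u)) / alpha) ** (1.0 / beta)

def servicing_cost(W, x, c_rep, c_min, runs=20000, seed=1):
    """Monte-Carlo expected cost: replace by new on failures in [0, x),
    minimal repair on failures in [x, W]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        t, age, cost = 0.0, 0.0, 0.0
        while True:
            dt = next_failure(age, rng=rng) - age
            t += dt
            if t >= W:
                break
            if t < x:            # replacement by new: item age resets
                cost += c_rep
                age = 0.0
            else:                # minimal repair: item age keeps advancing
                cost += c_min
                age += dt
        total += cost
    return total / runs
```

With x = 0 the strategy degenerates to minimal repair throughout, so the estimate should approach c_min times the cumulative intensity over [0, W]; sweeping x then traces out the cost curve whose minimiser the papers above characterise analytically.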
Figure 15.5. Stackelberg game formulation
Murthy and Asgharizadeh (1998, 1999) and Asgharizadeh and Murthy (2000)
use a Stackelberg game formulation for a special case where the time between
equipment failures is given by an exponential distribution so that the failures over
time occur according to a Poisson process. They consider the two options discussed earlier and consider three cases.
In case 1 the service agent has to decide the optimal number of customers to
service and in case 3 he has to decide the optimal number of repair facilities.
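A stylized version of the leader-follower logic can be sketched as follows: the service agent (leader) posts a per-repair price, each customer (follower) accepts the contract only if its expected cost beats an outside option, and the agent chooses the price anticipating these responses. This is a deliberately simplified sketch, not the Murthy-Asgharizadeh model; all function names and parameter values are hypothetical.

```python
def customer_accepts(price, rate, outside_cost):
    """Follower's best response: accept the contract if its expected cost
    (price per repair * Poisson failure rate) is no more than self-service."""
    return price * rate <= outside_cost

def agent_best_price(rates, outside_cost, repair_cost, grid):
    """Leader's problem: choose the repair price from `grid`, anticipating
    which customers accept, to maximise expected profit."""
    best_p, best_profit = None, float('-inf')
    for p in grid:
        profit = sum((p - repair_cost) * r
                     for r in rates if customer_accepts(p, r, outside_cost))
        if profit > best_profit:
            best_p, best_profit = p, profit
    return best_p, best_profit
```

Note how the leader may deliberately price out the heaviest users: serving fewer, lower-rate customers at a higher price can dominate serving everyone cheaply.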
15.5.2 Extended Warranties
Jack and Murthy (2006) consider the case where the product is complex and so the
specialist knowledge of the manufacturer is required to carry out any repairs after
the base warranty expires. The consumer must decide how long to keep the item
and how to maintain it until replacement. Two maintenance options are available:
the consumer can (i) pay the manufacturer to repair the item each time it fails, or
(ii) purchase an extended warranty (EW) from the manufacturer. These are similar
to Options 2 and 1 respectively, discussed earlier. The EW contract specifies that
the manufacturer will again rectify all failures free of charge to the consumer. The
consumer has flexibility in choosing when the EW will begin and the length of
cover. The price of the EW depends on these two variables and is set by the
manufacturer. The manufacturer also has to decide the price of each repair if the
item fails and the consumer does not have an EW. A Stackelberg game formulation
is used to determine the optimal strategies for both the consumer and the manufacturer.
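The consumer's side of this trade-off can be sketched as a risk-neutral expected-cost comparison; the Poisson failure assumption carries over from Section 15.5.1, but the function name and parameter values below are illustrative, not taken from Jack and Murthy (2006).

```python
def ew_decision(rate, horizon, repair_price, ew_price):
    """Buy the EW if its price is below the expected pay-per-repair outlay
    over the coverage period. (Risk-neutral sketch; a risk-averse consumer
    would also weight the variance of the repair bill.)"""
    pay_per_repair = rate * horizon * repair_price   # expected failures * price
    if ew_price <= pay_per_repair:
        return ('EW', ew_price)
    return ('repairs', pay_per_repair)
```

In the Stackelberg setting the manufacturer anticipates exactly this comparison when setting both the repair price and the EW price.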
[Figure: the principal–agent relationship and its key elements: the contract between principal and agent, costs, monitoring, incentives, informational asymmetry, risk preferences, moral hazard and adverse selection.]
agent and (ii) the cost of measuring the outcomes of the relationship and the transferring of risk to the agent.
Contract. The design of the contract that takes into account the issues discussed
above is the challenge that lies at the heart of the principal-agent relationship.
15.6.2 Relevance to Maintenance Outsourcing and Extended Warranties
15.6.2.1 Maintenance Outsourcing
Outsourcing of maintenance involves all the Agency Theory issues discussed in
Section 15.6.1 with the customer as the principal and the maintenance service
provider as the agent. The key factor is the contract that specifies what, when, and
how maintenance is to be carried out. This needs to be designed taking into
account all the various issues. Kraus (1996) reviews the literature on incentive
contracting.
The customer and service agent both potentially face moral hazard. This can
occur for the customer when the service agent shirks to reduce costs and doesn't do proper maintenance, and it can occur for the agent when the customer uses the asset
in a manner different to that stated in the contract. Adverse selection can also take
place when the customer chooses from a pool of potential maintenance service
providers (the B scenarios in Table 15.2). The two parties have different information about asset state, usage level, care and attention of the asset, and quality of
maintenance used, and this asymmetry will affect the outcome of their relationship.
The different market scenarios for maintenance outsourcing are as indicated in
Table 15.2. In scenario A-1, the classical principal-agent model discussed in
Section 15.6.1 is appropriate with a single principal (customer) and a single agent
(maintenance provider). This could be a large business unit, for example.
In the remaining five scenarios, there are multiple principals and/or multiple
agents. In scenarios A-2 and A-3, the equipment under consideration could be a
particular brand of lift installed in different buildings within a city. In this case, all
the equipment is maintained either by the OEM or an agent of the OEM. There is
an extensive literature dealing with the design of contracts for multiple principal/
multiple agent problems (Macho-Stadler and Perez-Castrillo 1997 and Laffont and
Martimort 2002 are two examples from this literature) and all
the issues from Section 15.6.1 are still relevant. The principal-agent models that
have been studied in the literature are static in nature and new, dynamic models
need to be formulated so that they can be applied meaningfully in the context of
maintenance outsourcing.
15.6.2.2 Extended Warranties
This case is similar to A-3. In the case of standard commercial and industrial
products and consumer durables, the EW policy is decided by the EW provider and
the customer does not have any direct input. The issues (such as moral hazard,
adverse selection, risk, monitoring, etc.) from agency theory are all relevant for EW
policies. Current EW offerings lack flexibility from the customer point of view and
there is a perception (amongst customers and EW regulators) that the pricing of
EWs is not fair. This provides an opportunity for EW providers to offer flexible
warranties to meet the different needs across the customer population. Agency
theory offers a framework to evaluate the costs of different policies taking into
account all the relevant issues.
15.8 References
Armstrong, R.D. and Cook, W.D. (1981), The contract formation problem in preventive pavement maintenance: A fixed-charge goal-programming model, Comp. Environ. Urban Systems, 6, 147–155
Ashgarizadeh, E. and Murthy, D.N.P. (2000), Service contracts – A stochastic model, Mathematical and Computer Modelling, 31, 11–20
Barlow, R.E. and Hunter, L.C. (1960), Optimum preventive maintenance policies, Operations Research, 8, 90–100
Bertolini, M., Bevilacqua, M., Braglia, M. and Frosolini, M. (2004), An analytical method for maintenance outsourcing service selection, International Journal of Quality & Reliability Management, 21, 772–788
Bevilacqua, M. and Braglia, M. (2000), The analytic hierarchy process applied to maintenance strategy selection, Reliability Engineering & System Safety, 70, 71–83
Biedenweg, F.M. (1981), Warranty Analysis: Consumer Value vs. Manufacturer's Cost, Unpublished Ph.D. Thesis, Stanford University, USA
Blischke, W.R. and Murthy, D.N.P. (1994), Warranty Cost Analysis, Marcel Dekker, New York
Blischke, W.R. and Murthy, D.N.P. (1996), Product Warranty Handbook, Marcel Dekker, New York
Blischke, W.R. and Murthy, D.N.P. (2000), Reliability, Wiley, New York
Campbell, J.D. (1995), Outsourcing in maintenance management: a valid alternative to self-provision, Journal of Quality in Maintenance Engineering, 1, 18–24
Cho, D. and Parlar, M. (1991), A survey of maintenance models for multi-unit systems, European Journal of Operational Research, 51, 1–23
Cox, D.R. and Oakes, D. (1984), Analysis of Survival Data, Chapman and Hall, New York
Day, E. and Fox, R.J. (1985), Extended warranties, service contracts and maintenance agreements – A marketing opportunity?, Journal of Consumer Marketing, 2, 77–86
Dekker, R., Wildeman, R.E. and van der Duyn Schouten, F.A. (1997), Review of multi-component models with economic dependence, ZOR/Mathematical Methods of Operations Research, 45, 411–435
Desai, P.S. and Padmanabhan, V. (2004), Durable good, extended warranty and channel coordination, Review of Marketing Science, 2, Article 2, available at www.bepress.com/romsjournal/vol2/iss1/art2
Dunn, S. (1999), Maintenance outsourcing – Critical issues, available at www.plantmaintenance.com/maintenance_articles_outsources.html
Eisenhardt, K.M. (1989), Agency theory: An assessment and review, The Academy of Management Review, 14, 57–74
Embleton, P.R. and Wright, P.C. (1998), A practical guide to successful outsourcing, Empowerment in Organizations, 6(3), 94–106
Eppen, G.D., Hanson, W.A. and Martin, R.K. (1991), Bundling – new products, new markets, low risk, Sloan Management Review, Summer, 7–14
Hollis, A. (1999), Extended warranties, adverse selection and aftermarkets, The Journal of Risk and Insurance, 66, 321–343
Iskandar, B.P. and Murthy, D.N.P. (2003), Repair-replace strategies for two-dimensional warranty policies, Mathematical and Computer Modelling, 38, 1233–1241
Iskandar, B.P., Murthy, D.N.P. and Jack, N. (2005), A new repair-replace strategy for items sold with a two-dimensional warranty, Computers and Operations Research, 32, 669–682
Jack, N. and Murthy, D.N.P. (2001), A servicing strategy for items sold under warranty, Journal of the Operational Research Society, 52, 1284–1288
Jack, N. and Murthy, D.N.P. (2006), A flexible extended warranty and related optimal strategies, Journal of the Operational Research Society (accepted for publication)
Jack, N. and Van der Duyn Schouten, F. (2000), Optimal repair-replace strategies for a warranted product, International Journal of Production Economics, 67, 95–100
Jardine, A.K.S. and Buzacott, J.A. (1985), Equipment reliability and maintenance, European Journal of Operational Research, 19, 285–296
Judenberg, J. (1994), Applications maintenance outsourcing, Information Systems Management, 11, 34–38
Khosrowpour, M. (ed.) (1995), Managing Information Technology Investments with Outsourcing, Idea Group Publishing, Harrisburg
Kraus, S. (1996), An overview of incentive contracting, Artificial Intelligence, 83, 297–346
Kumar, R. and Kumar, U. (2004), Service delivery strategy: Trends in mining industries, International Journal of Surface Mining, Reclamation and Environment, 18, 299–307
Laffont, J. and Martimort, D. (2002), The Theory of Incentives: The Principal-Agent Model, Princeton University Press
Levery, M. (1998), Outsourcing maintenance: a question of strategy, Engineering Management Journal, February, 34–40
Lutz, N.A. and Padmanabhan, V. (1994), Income variation and warranty policy, Working Paper, Graduate School of Business, Stanford University
Lutz, N.A. and Padmanabhan, V. (1998), Warranties, extended warranties and product quality, International Journal of Industrial Organization, 16, 463–493
Macho-Stadler, I. and Perez-Castrillo, D. (1997), An Introduction to the Economics of Information, Oxford University Press
Martin, H.H. (1997), Contracting out maintenance and a plan for future research, Journal of Quality in Maintenance Engineering, 3, 81–90
McCall, J.J. (1965), Maintenance policies for stochastically failing equipment: A survey, Management Science, 11, 493–524
Murthy, D.N.P. and Ashgarizadeh, E. (1998), A stochastic model for service contracts, International Journal of Reliability, Quality and Safety Engineering, 5, 29–45
Murthy, D.N.P. and Ashgarizadeh, E. (1999), Optimal decision making in a maintenance service operation, European Journal of Operational Research, 116, 259–273
Murthy, D.N.P. and Djamaludin, I. (2002), Product warranty – A review, International Journal of Production Economics, 79, 231–260
Nguyen, D.G. (1984), Studies in Warranty Policies and Product Reliability, Unpublished Ph.D. Thesis, The University of Queensland, Australia
Nguyen, D.G. and Murthy, D.N.P. (1986), An optimal policy for servicing warranty, Journal of the Operational Research Society, 37, 1081–1088
Nguyen, D.G. and Murthy, D.N.P. (1989), Optimal replace-repair strategy for servicing items sold with warranty, European Journal of Operational Research, 39, 206–212
Padmanabhan, V. (1995), Usage heterogeneity and extended warranties, Journal of Economics and Management Strategy, 4, 33–53
Padmanabhan, V. (1996), Extended warranties, in Product Warranty Handbook, W.R. Blischke and D.N.P. Murthy (eds), Marcel Dekker, New York
Padmanabhan, V. and Rao, R.C. (1993), Warranty policy and extended warranties: theory and an application to automobiles, Marketing Science, 12, 230–247
Pierskalla, W.P. and Voelker, J.A. (1976), A survey of maintenance models: The control and surveillance of deteriorating systems, Naval Research Logistics Quarterly, 23, 353–388
Pintelon, L.M. and Gelders, L. (1992), Maintenance management decision making, European Journal of Operational Research, 58, 301–317
Rigdon, S.E. and Basu, A.P. (2000), Statistical Methods for the Reliability of Repairable Systems, Wiley, New York
Ross, S.M. (1980), Stochastic Processes, Wiley, New York
Sahin, I. and Polatoglu, H. (1998), Quality, Warranty and Preventive Maintenance, Kluwer, Amsterdam
Scarf, P.S. (1997), On the application of mathematical models to maintenance, European Journal of Operational Research, 63, 493–506
Sherif, Y.S. and Smith, M.L. (1986), Optimal maintenance models for systems subject to failure – A review, Naval Research Logistics Quarterly, 23, 47–74
Stremersch, S., Wuyts, S. and Frambach, R.T. (2001), The purchasing of full-service contracts: An exploratory study within the industrial maintenance market, Industrial Marketing Management, 30, 1–12
Sunny, I. (1995), Outsourcing maintenance: making the right decisions for the right reasons, Plant Engineering, 49, 156–157
Thomas, L.C. (1986), A survey of maintenance and replacement models for maintainability and reliability of multi-item systems, Reliability Engineering, 16, 297–309
UK Competition Commission (2003), A report into the supply of extended warranties on domestic electrical goods within the UK, available at www.competition-commission.org.uk/inquiries/completed/2003/warranty/index.htm
Valdez-Flores, C. and Feldman, R.M. (1989), A survey of preventive maintenance models for stochastically deteriorating single-unit systems, Naval Research Logistics Quarterly, 36, 419–446
Van Ackere, A. (1993), The principal-agent paradigm: Its relevance to various functional fields, European Journal of Operational Research, 70, 83–103
16
Maintenance of Leased Equipment
D.N.P. Murthy and J. Pongpech
16.1 Introduction
Businesses need equipment to produce their outputs (goods/services). Equipment
degrades with age and usage, and eventually fails (Blischke and Murthy 2000).
This impacts business performance in several ways: reduced equipment availability, lower output quality, higher operating costs, increased customer dissatisfaction, etc. The degradation can be controlled through preventive maintenance (PM)
actions whilst corrective maintenance (CM) actions restore failed equipment to its
working state.
Prior to 1970, businesses owned the equipment, and maintenance was done in
house. Since 1970, there has been a shift towards outsourcing of maintenance. This
was primarily due to a change in the management paradigm where activities in a
business were classified as either core or non-core, with the non-core activities to
be outsourced to external agents if this was deemed to be cost effective. Also, as
technology became more complex, it was no longer economical to carry out in-house maintenance due to the need for expensive maintenance equipment and
highly trained maintenance staff.
Since 1990, there has been an increasing trend towards leasing rather than
owning equipment. According to Fishbein et al. (2000) there are several reasons
for this. Some of these are as follows:

- Rapid technological advances have resulted in improved equipment appearing on the market, making the earlier generation of equipment obsolete at an ever-increasing pace.
- The cost of owning equipment has been increasing very rapidly.
- Businesses view maintenance as a non-core activity.
- It is often economical to lease equipment, rather than buy, as this involves less initial capital investment and there are often tax benefits that make it attractive.
The leasing industry grew from 1990 until the last quarter of 2001, when it experienced an economic downturn due to the impact of 9/11. In 2002, the Department of Commerce predicted equipment leasing volumes of $208 billion for 2003 and $218 billion for 2004.
The ELA Online Focus Groups Report (ELA 2002b) states that 60% of leasing
benefits come from maintenance options. This is because some equipment leases
come with maintenance as an integral part of the lease so that the physical equipment is bundled with maintenance service and offered as a package under a lease
contract. This implies that the lessee can focus on the core activities of the business
and not be distracted by equipment maintenance.
Maintenance of leased equipment raises several new issues for both the lessor
and the lessee (Desai and Purohit 1998; Kleiman 2001). The strategic issues deal
with the size and composition of the equipment fleet, the number and the location
of lease centers, workshop facilities, warehouse for spares, etc. The operational
issues include logistics, pricing, marketing, and maintenance strategies. In this
chapter we touch on these issues and then focus our attention on maintenance
strategies for leased equipment.
The outline of the chapter is as follows. Section 16.2 starts with a general introduction to equipment leasing and then the different types of leases are discussed.
Section 16.3 deals with a framework to study equipment leasing and reviews the
relevant literature. In Section 16.4, we look at the maintenance of equipment under
operational lease. We discuss the modeling issues and propose various maintenance policies. Section 16.5 looks at the analysis of two of these policies and the
optimal selection of the policy parameters. We conclude with a brief discussion of
topics for future research in Section 16.6. We use the following abbreviations and
notation.
Abbreviations
AFT:    Accelerated failure time
PH:     Proportional hazard
NHPP:   Non-homogeneous Poisson process
ROCOF:  Rate of occurrence of failure
CM:     Corrective maintenance
PM:     Preventive maintenance
Notation
F(t):        Failure distribution for the time to first failure of new equipment
f(t), r(t):  Failure density and hazard functions associated with F(t)
λ0(t):       Intensity function with only CM actions
λ(t):
A:
x:
L:
j:
tj:
N(L):
Y:
G(y):
Cp(·):
Cu(x):
Cf:
Cn:
Ct:
USA, the Internal Revenue Code defines a true lease as a transaction that allows
the lessor to claim ownership and the lessee to claim rental payments as tax deductions.
The advantages and disadvantages of an operating lease from the lessee's perspective are as follows:
Advantages
The lessee can obtain new equipment (based on the latest technologies) and
thus avoid the risks associated with equipment obsolescence.
The lessee usually gets maintenance and other supports from the lessor so
that the business can focus on core activities.
Equipment disposal is the lessors responsibility.
Disadvantages
If the lessee's needs change over the lease period, then premature termination of the lease agreement can incur penalties.
The risk that the lessor does not provide the level of maintenance needed.
The advantages and disadvantages of a finance lease from the lessee's perspective are as follows:
Advantages
The lessee is able to spread the payments over the lease period (no need for initial cash at purchase).
It offers greater flexibility, as the lessee can choose from a range of lease options, especially in the consumer product market where several institutions offer different types of leases.
Disadvantages
If the lessee fails to make lease payments as per schedule, the leased equipment can be repossessed and sold by the lessor to recover the payments
due.
Maintenance is often not a part of the lease agreement so that the lessee has
to provide for this separately.
The overall cost to the lessee is significantly higher than the purchase price of the equipment, because the payments include not only the financing costs but also other costs associated with insurance, taxes, etc.
[Figure: the parties involved in equipment leasing. The customer (user) operates the equipment (asset) to produce outputs (products/services); the other parties are the owner, the service provider, the operator, the government, and the regulator.]
Customer: The customer is the lessee. The lessee can be an individual (purchasing a car under a finance lease), a business (operating industrial or commercial equipment under an operational lease) or a government agency (responsible for operating an infrastructure, such as a train network, under a buyback lease).
Equipment: Equipment can be an infrastructure (for example, parts of road network, railway network, sewerage and water network, electricity network, etc.);
industrial equipment (for example, trucks, cranes, plant machinery, etc.); commercial equipment (for example, office furniture, vending machines, photocopiers, etc.); and consumer products (for example, refrigerators, computers, etc.). The cost
of the equipment (or asset) can vary significantly. Ezzel and Vora (2001) give
some interesting statistics relating to sale and leaseback, and operating leases in the
USA over the period 1984–1991.
Owner: The owner is a person or agency that owns the equipment from a legal
point of view. In the case of a finance lease, the financial institution is the owner as
the equipment is mortgaged to the institution.
Service provider: In the case of an operating lease, the lessor is the service provider. However, if the lessor decides to outsource the maintenance to some external
service agent, then the agent is the service provider. In the case of a finance lease,
the lessee is responsible for the maintenance and might decide to outsource it to an
external agent.
Outputs (products/services): If the lessee is a business, then the leased equipment
is used to produce its outputs (goods and/or services), as discussed in Section 16.1.
For consumer goods, the output is the utility (in the case of a kitchen appliance) or
the satisfaction (in the case of a television) derived by the lessee.
Operator: In general, the lessee is the operator of the equipment. However, the
lessee, in turn, might hire some other business to operate the equipment and
produce the desired outputs. An example of this is a business that leases a fleet of
aircraft, then outsources the flying to another business that employs the crew and
operates the planes.
Government: Government plays an important role in the context of sale and buyback leases of infrastructure. The lessee can be a department of the government or
an independent unit acting as a proxy for the government. Decisions relating to
subsidy, tax incentives, etc., are decided by the government and have a significant
impact on the lease structure.
Regulator: This applies mainly for equipment used in certain industry sectors
(such as health, transport, energy) where public safety is of great concern. The regulator is often an independent body that monitors and makes recommendations that
can be binding on the owners and operators of equipment.
Vickerman (2004) deals with the infrastructure maintenance issues in the
context of rail and road transport in the UK and discusses the role of government
and regulators. Interested readers should consult the references cited in the paper
for more details.
16.3.1 Different Scenarios of Leasing
There are many different scenarios depending on the number of parties involved.
Table 16.1 gives three different scenarios involving four parties. Other scenarios
can include additional parties such as the government and/or the regulator.
In the remainder of the chapter we focus our attention on industrial and
commercial equipment leased under an operating lease and this corresponds to
Scenario 1.
Table 16.1. Three leasing scenarios. In Scenario 1 the lessor is both the owner and the service provider, and the lessee is the user and the operator; Scenarios 2 and 3 allocate the owner, user, service provider and operator roles differently among the parties.

[Figure: in Scenario 1 the lessor owns and maintains the equipment, and the lessee uses it.]
Lessor: The lessor is not only the owner of the leased equipment, but also the
maintenance service provider. The lessor is a business (either manufacturer or
some other entity) and as such has certain business objectives. At the strategic level
these can include issues such as ROI, market share, profits, etc. In order to achieve
these objectives, the lessor needs to have proper strategies at the strategic level (to
deal with issues such as type and number of equipment to lease, upgrade options to
¹ In the case of a finance lease, the lessee has the option of either doing the maintenance in house or outsourcing it to some third party. For more on maintenance outsourcing, see Deelen et al. (2003).
Buy vs. lease options through proper cost and benefit analysis
Selection of the most appropriate lease option
Negotiating the terms of the lease option
Administration of lease contracts
See Deelen et al. (2003) and ELA (2005) for more details.
The economics and finance oriented literature looks at both the lessor and
lessee perspectives and the leased equipment market resulting from the interaction
between these two parties. Ezzel and Vora (2001), Sharpe and Nguyen (1995),
Desai and Purohit (1998), Stremersch et al. (2001), Handa (1991) and Kim et al.
(1978) are an illustrative sample where readers can find more details.
The literature on maintenance is vast and there are many survey papers and books on the topic. They deal with a range of issues: determining optimal maintenance strategies, planning and implementation of maintenance actions, logistics
of maintenance, etc. References to these can be found in review/survey papers
(McCall 1965; Pierskalla and Voelker 1976; Sherif and Smith 1976; Jardine and
Buzacott 1985; Gits 1986; Thomas 1986; Valdez-Flores and Feldman 1989; Cho
and Parlar 1991; Pintelon and Gelders 1992; Dekker et al. 1997; Scarf 1997).
There are very few papers dealing with the maintenance of leased equipment and
these will be discussed later in the chapter.
f(t) = dF(t)/dt   and   r(t) = f(t)/[1 − F(t)]                          (16.1)

respectively. In the case of used equipment, let A denote the age at the start of the lease. Then, the time to first failure is given by the conditional failure distribution function

F(t | A) = [F(t) − F(A)]/[1 − F(A)],   t ≥ A.                           (16.2)
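As a quick check, (16.2) is easy to evaluate numerically. The sketch below assumes a Weibull failure distribution purely for illustration; the chapter does not fix F(t) at this point.

```python
import math

def weibull_cdf(t, alpha=1.0, beta=2.0):
    """F(t): Weibull failure distribution (scale alpha, shape beta)."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / alpha) ** beta))

def conditional_cdf(t, age, cdf=weibull_cdf):
    """Equation (16.2): F(t | A) = [F(t) - F(A)] / [1 - F(A)], t >= A."""
    if t < age:
        raise ValueError("t must be at least the initial age A")
    fa = cdf(age)
    return (cdf(t) - fa) / (1.0 - fa)

# A one-year-old unit is more likely to fail over a one-year horizon than a
# new one when the failure rate is increasing (beta > 1).
print(conditional_cdf(2.0, 1.0))   # P(first failure by t = 2 | age 1 at lease start)
print(weibull_cdf(1.0))            # P(a new unit fails within its first year)
```

This is why leasing out used equipment without an upgrade exposes the lessor to a higher failure risk over the lease period.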
failed then it is called minimal repair (see Barlow and Hunter 1960). This is
appropriate for complex equipment where the equipment failure is due to failure of
one or a few components. The equipment becomes operational by replacing (or repairing) the failed components. This action has very little impact on the reliability
characteristics of the equipment. If the failure rate changes (in either direction)
after repair, it is called imperfect repair. Many different types of imperfect repair
models have been proposed and for a review of such models see Pham and Wang
(1996).
The time to repair is in general a random variable and needs to be modeled by a distribution function. Typically, the time to repair is very much smaller than the time between failures (in a statistical sense), so that one can ignore it and treat repair as being instantaneous for the purpose of determining failures over time. With
this assumption, the failures over time (with only CM actions) occur according to a non-homogeneous Poisson process (NHPP) with intensity function λ0(t) = r(t), the hazard function defined earlier. The intensity function (characterizing the failures over time) is also referred to as the rate of occurrence of failure (ROCOF).
The cost of repair is also a random variable and needs to be modeled by a distribution function. Let Cf denote the average cost of each minimal repair.
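Under this assumption, failure times over a lease can be simulated directly from the NHPP by thinning (Lewis-Shedler). The sketch below assumes a Weibull ROCOF for illustration; the parameter values are not taken from the chapter.

```python
import random

def weibull_intensity(t, alpha=1.0, beta=2.0):
    """ROCOF lambda0(t) = r(t) for a Weibull time to first failure."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)

def simulate_failures(L, intensity=weibull_intensity, seed=42):
    """Simulate failure times over [0, L] by thinning (Lewis-Shedler):
    propose events at the maximum rate, accept with prob lambda(t)/lambda_max."""
    rng = random.Random(seed)
    lam_max = intensity(L)       # increasing intensity, so the maximum is at t = L
    t, times = 0.0, []
    while True:
        t += rng.expovariate(lam_max)
        if t > L:
            return times
        if rng.random() < intensity(t) / lam_max:
            times.append(t)

failures = simulate_failures(L=5.0)
print(len(failures), "failures over a 5-year lease")
# With beta = 2 and alpha = 1, E[N(L)] = (L/alpha)^beta = 25.
```

Simulation of this kind is useful for checking the analytical expressions for the expected number of failures used later in the chapter.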
16.4.1.3 Preventive Maintenance (PM) Actions
PM actions are used to control the degradation process and to reduce the likelihood
of failure occurrences. Inspection, cleaning, lubrication, adjustment and calibration, replacement of degraded components, and major overhaul are some common
tasks that are carried out under PM. The effect of PM action is to improve the
reliability of the equipment. There are several ways of modeling this improvement
and we discuss two of them (Reduction in Failure Intensity and Reduction in Age)
later in the section.
The time needed to carry out PM actions can vary and needs to be modeled
properly. For minor PM actions, the time needed is small relative to the time between failures and can be ignored. For a major overhaul, the time can be significant
and cannot be ignored. The cost of PM action comprises the administration cost,
labor cost, material cost, and spare parts inventory cost, and some of these costs are
uncertain.
Reduction in intensity function: Here a PM action results in a reduction in the intensity function (ROCOF). λ0(t) is the intensity function without any PM actions. Let λ(t) denote the intensity function with PM actions. We assume that the time for a PM action is small relative to the mean time between failures so that it can be ignored. The effect of the jth PM action (at time tj) on the intensity function is given by

λ(tj+) = λ(tj−) − δj                                                    (16.3)

with

0 ≤ δj ≤ λ(tj−) − λ0(0).                                                (16.4)

This implies that a PM action cannot make the equipment better than new.
As a result, if PM actions are carried out at time instants tj, j ≥ 1, with the reduction in the intensity function given by δj, j ≥ 1, then the intensity function is given by

λ(t) = λ0(t) − Σ_{i=0}^{j} δi,   tj ≤ t < tj+1,                         (16.5)

for j ≥ 0, with t0 = 0 and δ0 = 0. This implies that the reduction resulting from the PM action at tj lasts for all t ≥ tj, as shown in Figure 16.4.
Figure 16.4. Effect of PM action on the intensity function for new equipment
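Equation (16.5) amounts to subtracting, at any time t, the reductions from all PM actions carried out so far. A minimal sketch with assumed base intensity, PM times and reductions:

```python
import bisect

def intensity_with_pm(t, base, pm_times, deltas):
    """Equation (16.5): lambda(t) = lambda0(t) minus the reductions delta_i
    from all PM actions carried out at or before time t."""
    j = bisect.bisect_right(pm_times, t)   # number of PM actions by time t
    return base(t) - sum(deltas[:j])

# Illustrative (assumed) base intensity, PM times and reductions.
base = lambda t: 2.0 * t                   # Weibull ROCOF with alpha = 1, beta = 2
pm_times = [1.0, 2.0, 3.0]
deltas = [1.0, 1.5, 2.0]

print(intensity_with_pm(0.5, base, pm_times, deltas))   # no PM yet: 1.0
print(intensity_with_pm(2.5, base, pm_times, deltas))   # after two PMs: 5.0 - 2.5
```

The cumulative reductions must respect (16.4), so that the intensity never drops below that of new equipment.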
The cost of each PM action depends on the reduction in the intensity function. Let Cp(δ) denote the cost of a PM action; this is an increasing function of δ.
Reduction in age: Used equipment can be subjected to an upgrade (or overhaul) where components that have degraded significantly are replaced with new ones, so that the equipment is, in a sense, younger (from a reliability point of view). If the age of the equipment is A before it is subjected to the PM action, then it can be viewed as equipment of virtual age A − x after the PM action. The reduction in age is x, 0 < x < A. As a result, the intensity function decreases after the PM action, as shown in Figure 16.5.
Figure 16.5. Effect of upgrade action on the intensity function for used equipment
The cost of this type of PM action depends on the reduction in the virtual age and is modeled by a function Cu(x), which is an increasing function of x.
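A sketch of the virtual-age idea: after an upgrade of size x, equipment of age A accumulates failures as if it were of age A − x. The cost function used here is an illustrative increasing function of x, not one prescribed by the chapter.

```python
import math

def virtual_age_intensity(t, A, x, base):
    """Failure intensity t time units into the lease for used equipment of
    initial age A whose virtual age was reduced by x at an upgrade."""
    return base(A - x + t)

def upgrade_cost(x, A, w=10.0, theta=0.1):
    """An illustrative C_u(x): increasing in x, and exploding as x -> A,
    since equipment cannot cheaply be made as good as new (assumed form)."""
    return w * x / (1.0 - math.exp(-theta * (A - x)))

base = lambda t: 2.0 * t          # illustrative Weibull ROCOF (alpha = 1, beta = 2)
print(virtual_age_intensity(0.0, 5.0, 2.5, base))   # lease starts at virtual age 2.5
print(upgrade_cost(2.5, 5.0))
```

The lessor's trade-off is between the rising upgrade cost and the lower failure intensity (and hence lower repair and penalty costs) over the lease.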
16.4.1.4 Usage Intensity and Operating Environment
Equipment is usually designed for some nominal usage intensity and operating
environment. When it is operated under these conditions, the ROCOF (with no PM
actions) is given by 0 (t ) . If the equipment is used in a more intense mode and/or
the operating environment becomes harsher, then the ROCOF can increase significantly. As a result, failures occur more frequently. Many different models have
been proposed to model this change. Two of the well-known ones are (i) the accelerated failure time (AFT) model and (ii) the proportional hazard (PH) model. For more on this, see Blischke and Murthy (2000).
16.4.2 Penalties
Both the lessor and the lessee can incur penalties if they violate the terms of the
contract. In the case of the lessee, it could be the usage intensity exceeding that
specified in the contract (provided the lessor can monitor this). In the case of the
lessor, the penalties are linked to equipment failures and the time to repair failed
equipment.
Two simple forms of penalty are as follows.
Penalty 1: Let N(L) denote the number of equipment failures over the lease period L. If N(L) exceeds γ (a pre-specified value), the lessor incurs a penalty. The amount that the lessor pays to the lessee at the end of the contract is Cn[max{N(L) − γ, 0}].
Penalty 2: Let the random variable Y denote the time that the lessor takes to restore failed equipment to its working state. If Y exceeds τ (a pre-specified value), then the lessor incurs a penalty given by Ct[max{(Y − τ), 0}].
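Both penalty terms are simple piecewise-linear functions of the outcomes. A sketch in which gamma (the allowed number of failures), tau (the repair-time limit) and the cost rates are illustrative values:

```python
def failure_count_penalty(n_failures, gamma, Cn):
    """Penalty 1: Cn * max{N(L) - gamma, 0}."""
    return Cn * max(n_failures - gamma, 0)

def repair_time_penalty(repair_times, tau, Ct):
    """Penalty 2, summed over all repairs: Ct * max{Y - tau, 0} for each Y."""
    return Ct * sum(max(y - tau, 0.0) for y in repair_times)

# Illustrative values: up to gamma = 5 failures, and tau = 1 day per repair,
# are tolerated without penalty.
print(failure_count_penalty(8, gamma=5, Cn=200.0))               # 3 excess failures
print(repair_time_penalty([0.5, 2.0, 1.5], tau=1.0, Ct=300.0))   # 1.5 excess days
```

These are the penalty structures that appear as the last terms of the expected cost expressions later in the chapter.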
[Figure: costs as a function of PM effort. The PM cost increases and the CM cost decreases with increasing PM effort; the total cost is minimized at an intermediate, optimal level of PM effort.]
Policy 2: The equipment is subjected to preventive maintenance actions periodically, so that the jth PM action is carried out at time tj = jT, j = 1, 2, ..., k. After each PM action the intensity function is reduced by δj. All failures over the lease period are rectified through minimal repair. The policy is characterized by the parameter set {T, δj}.
Policy 3: The equipment is subjected to a preventive maintenance action whenever the intensity function reaches a specified level ρ. Each PM action reduces the intensity function by a fixed amount δ. All failures over the lease period are rectified through minimal repair. The policy is characterized by the parameter set {ρ, δ}.
Policy 4: Let 0 < τ1 < τ2 < L. The equipment is subjected to no PM actions in the interval [0, τ1), to periodic PM actions with period ν2 in the interval [τ1, τ2), and to periodic PM actions with period ν3 in the interval [τ2, L). Each PM action reduces the intensity function to a specified level ρ. All failures over the lease period are rectified through minimal repair. The policy is characterized by the parameter set {τ1, τ2, ν2, ν3, ρ}.
16.4.4.2 Used Equipment Lease
In this case, the lessor has the additional option of subjecting the equipment to an
overhaul. This can be modeled as a reduction in the virtual age so that we now
have an additional parameter x (the reduction in age). During the lease period the lessor can use the PM policies defined in Section 16.4.4.1.
For the numerical analysis we assume that the intensity function with no PM actions is of the Weibull form

λ0(t) = βt^(β−1)/α^β,   t ≥ 0                                           (16.6)

with scale parameter α and shape parameter β, and that the repair time Y has the Weibull distribution function

G(y) = 1 − exp[−(y/n)^m],   0 ≤ y < ∞                                   (16.7)

with shape parameter m < 1 (implying decreasing repair rate) and scale parameter n > 0. We assume the following parameter values:

Intensity function: α = 1 (year) and β > 1 (implying an increasing failure rate)
Repair time: m = 0.5 and n = 0.5 (mean time to repair is one day)
Reduction in intensity function: Cp(δ) = 100 + 50δ ($)
Reduction in age: Cu(x) = wx/[1 − e^(−θ(A−x))] ($) with w = 10 and θ = 0.1
Cost parameters: Cf = 100 ($), Cn = 200 ($), Ct = 300 ($)
16.5.1 Policy 1 (New Equipment Lease)
From Jaturonnatee et al. (2005), the expected total cost is given by

J(θ) = Cf E[N(L)] + Σ_{j=1}^{k} Cp(δj) + Ct E[N(L)] ∫_τ^∞ (y − τ)g(y)dy + Cn E[max{N(L) − γ, 0}]        (16.8)

The first term on the right-hand side is the cost of rectifying failures, the second term is the PM cost, and the third and fourth terms represent the penalty costs associated with repair times and the number of failures over the lease period. The parameters, given by the set θ = {k, tj, δj, 1 ≤ j ≤ k}, need to be selected optimally to minimize J(θ).
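Since failures with minimal repair form an NHPP, E[N(L)] is just the integral of the intensity function over [0, L], so the expected cost can be evaluated numerically. The sketch below assumes the Weibull forms and parameter values used in the numerical examples, and approximates the last expectation by max{E[N(L)] − γ, 0}, which is a simplification for illustration only:

```python
import math

def expected_failures(L, pm_times, deltas, alpha=1.0, beta=2.0):
    """E[N(L)] for the reduced intensity (16.5): the Weibull integral
    (L/alpha)^beta minus each reduction delta_j applied from t_j to L."""
    return (L / alpha) ** beta - sum(d * (L - t) for t, d in zip(pm_times, deltas))

def expected_repair_penalty_per_failure(tau, Ct, m=0.5, n=0.5, steps=20000):
    """Ct * integral from tau to infinity of (y - tau) g(y) dy for a
    Weibull(m, n) repair-time density g, by crude midpoint quadrature."""
    upper = tau + 50.0                     # truncate the (light enough) tail
    h = (upper - tau) / steps
    total = 0.0
    for i in range(steps):
        y = tau + (i + 0.5) * h
        g = (m / n) * (y / n) ** (m - 1) * math.exp(-((y / n) ** m))
        total += (y - tau) * g * h
    return Ct * total

def policy1_cost(L, pm_times, deltas, Cf=100.0, Cn=200.0, Ct=300.0,
                 gamma=5.0, tau=1.0, Cp=lambda d: 100.0 + 50.0 * d):
    """Expected total cost in the spirit of (16.8), approximating
    E[max{N(L) - gamma, 0}] by max{E[N(L)] - gamma, 0} (a simplification)."""
    EN = expected_failures(L, pm_times, deltas)
    return (Cf * EN + sum(Cp(d) for d in deltas)
            + EN * expected_repair_penalty_per_failure(tau, Ct)
            + Cn * max(EN - gamma, 0.0))

print(policy1_cost(L=5.0, pm_times=[1.0, 2.0, 3.0, 4.0], deltas=[1.0, 1.0, 1.0, 1.0]))
```

Searching over k, the tj and the δj (subject to the constraint (16.9)) would then give the kind of optimal values reported in the tables below.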
Example 16.1 Table 16.2 (extracted from Table 3 of Jaturonnatee et al. 2005) shows k*, the optimal number of PM actions (the optimal values for the remaining parameters are omitted), and J*(θ*), the corresponding expected costs, for a range of τ and Cn.
The optimisation needs to take into account the following constraint:

0 < Σ_{i=0}^{j} δi ≤ λ0(tj) − λ0(0),   j ≥ 1,                           (16.9)

with t0 = 0 and δ0 = 0.
Table 16.2. Optimal maintenance under Policy 1

                       Cn = 0 ($)             Cn = 200 ($)
β          τ (days)    k*     J(θ*)           k*     J(θ*)
(low)      1           4      $1002.27        5      $1298.34
           1.5         3      $907.63         5      $1223.91
           2           3      $838.08         5      $1179.39
           3           2      $615.31         4      $1042.31
(medium)   1           7      $1992.32        10     $2531.77
           1.5         6      $1811.05        9      $2399.16
           2           6      $1693.48        9      $2317.58
           3           4      $1280.00        7      $2067.71
(high)     1           19     $7511.43        26     $8962.08
           1.5         16     $7009.92        24     $8610.66
           2           16     $6677.07        23     $8388.23
           3           10     $5437.03        20     $7712.87
Example 16.2 Table 16.3 (extracted from Pongpech and Murthy 2006) shows T* (the optimal values for the other parameters are omitted) and the corresponding expected total cost for β = 3 and L = 5.
Table 16.3. Optimal maintenance under Policy 2

             Cn = 0 ($)               Cn = 200 ($)
τ (days)     T*        J(θ*)          T*        J(θ*)
1            0.2381    $7827.21       0.1786    $9336.99
1.5          0.2778    $7312.50       0.1923    $8969.90
2            0.3125    $6968.53       0.2000    $8737.14
3            0.5000    $5750.00       0.2273    $8034.90
For used equipment, the lessor incurs the additional upgrade cost Cu(x), and the expected total cost is given by

J(θ) = Cf E[N(L)] + Σ_{j=1}^{k} Cp(δj) + Ct E[N(L)] ∫_τ^∞ (y − τ)g(y)dy + Cn E[max{N(L) − γ, 0}] + Cu(x)        (16.10)
Table 16.4. Optimal maintenance of used equipment

             Cn = 0 ($)                      Cn = 200 ($)
τ (days)     x*     k*    J(θ*)              x*     k*     J(θ*)
1            3.5    7     $8484.36           4.0    10     $11312.57
1.5          3.5    6     $7488.90           4.0    9      $10637.18
2            3.5    6     $6884.50           4.0    9      $10231.04
3            2.5    4     $4792.55           3.5    7      $8918.59
Table 16.5. Optimal upgrade and PM actions as a function of the age A

A     x*     k*    J(θ*)
1     0.0    4     $2280.00
2     0.6    4     $3111.58
3     1.2    4     $3752.68
4     2.0    4     $4290.03
5     2.5    4     $4792.55
6     3.6    4     $5198.07
7     4.2    4     $5601.10
As can be seen, x* (the reduction in age due to PM actions before the equipment is
leased out) increases with A as is to be expected since the ROCOF increases with
age. Note that no upgrade is needed when the equipment is fairly young ( A = 1 ).
Also, k* does not change when β = 2. However, when β > 2, we find that k* increases as A increases.
5. From the lessor's point of view, the size and variety of equipment to stock
for leasing are both important issues. The optimal choice of these and the
replacement decisions must take into account the needs of different lessees
and the investment needed for the purchase of new stock.
16.7 References
Baker CR, Hayes RS (1981) Lease Financing: A Practical Guide, John Wiley, New York, USA
Barlow RE, Hunter LC (1960) Optimum preventive maintenance policies, Operations Research, 8:90–100
Blischke WR, Murthy DNP (2000) Reliability Modeling, Prediction, and Optimization, John Wiley, New York, USA
Cho D, Parlar M (1991) A survey of maintenance models for multi-unit systems, European Journal of Operational Research, 51:1–23
Coyle B (2000) Leasing, Glenlake, Chicago, USA
Deelen L, Dupleich M, Othieno L, Wakelin O (2003) Leasing for small and micro enterprises: a guide for designing and managing leasing schemes in developing countries, Berold R (ed), Cristina Pierini, Turin, Italy
Dekker R, Wildeman RE, Van Der Duyn Schouten FA (1997) Review of multi-component models with economic dependence, Mathematical Methods of Operations Research, 45:411–435
Desai P, Purohit D (1998) Leasing and selling: optimal marketing strategies for a durable goods firm, Management Science, 44(11):19–34
ELA (2001) Equipment Leasing and Financial Foundation 2001 State of the Industry Report, Available on http://www.leasefoundation.org/pdfs/2001StateofIndustryRpt.pdf
ELA (2002a) Equipment Leasing and Financial Foundation 2002 State of the Industry Report, PricewaterhouseCoopers, Available on http://www.leasefoundation.org/pdfs/2002SOIRpt.pdf
ELA (2002b) Equipment Leasing Association Online Focus Groups Report, Available on http://www.chooseleasing.org/Market/2002FocusGroupsRpt.pdf
ELA (2005) The economic contribution of equipment leasing to the U.S. economy: growth, investment & jobs update, Equipment Leasing Association, Global Insight, Advisory Services Group, Available on http://www.elaonline.org/press/
Ezzel JR, Vora PP (2001) Leasing versus purchasing: Direct evidence on a corporation's motivation for leasing and consequences of leasing, The Quarterly Review of Economics and Finance, 41:33–47
Fishbein BK, McCarry LS, Dillon PS (2000) Leasing: A step toward producer responsibility, Available on http://www.informinc.org
Gits CW (1986) On the maintenance concept for a technical system: II. Literature review, Maintenance Management International, 6:181–196
Handa P (1991) An economic analysis of leasebacks, Review of Quantitative Finance and Accounting, 1:177–189
Jardine AKS, Buzacott JA (1985) Equipment reliability and maintenance, European Journal of Operational Research, 116:259–273
Jaturonnatee J, Murthy DNP, Boondiskulchok R (2005) Optimal preventive maintenance of leased equipment with corrective minimal repair, European Journal of Operational Research, Available online 30 March 2005
Kim EH, Lewellen WG, McConnell JJ (1978) Sale-and-leaseback agreements and enterprise valuation, Journal of Financial and Quantitative Analysis, 13:871–881
17
Computerised Maintenance Management Systems
Ashraf Labib
17.1 Introduction
Computerised maintenance management systems (CMMSs) are vital for the coordination of all activities related to the availability, productivity and maintainability of complex systems. Modern computational facilities have dramatically widened the scope for improved effectiveness and efficiency in maintenance, and CMMSs have existed, in one form or another, for several decades.
The software has evolved from relatively simple mainframe planning of maintenance activity to Windows-based, multi-user systems that cover a multitude of
maintenance functions. The capacity of CMMSs to handle vast quantities of data
purposefully and rapidly has opened new opportunities for maintenance, facilitating a more deliberate and considered approach to managing assets.
Some of the benefits that can result from the application of a CMMS are:
The most important factor may be the reduction of breakdowns. This is the aim of the maintenance function; the rest are secondary objectives (or by-products). This is a fundamental issue, as some system developers and vendors, as well as some users, lose focus and compromise the reduction of breakdowns in order to pursue standardisation and integration objectives, thus confusing the aim with the objectives. As a result, the majority of CMMSs on the market suffer from serious drawbacks, as will be shown in the following section.
form of the data captured and the historical nature of certain elements of it. In short, companies tend to spend a vast amount of capital on the acquisition of off-the-shelf systems for data collection, but the added value of these systems to the business is questionable.
Few books have been published on the subject of CMMSs (Bagadia 2006; Mather 2002; Cato and Mobley 2001; Wireman 1994). However, they tend to highlight the advantages of these systems rather than their drawbacks.
All CMMSs offer data collection facilities; more expensive systems offer
formalised modules for the analysis of maintenance data, and the market leaders
allow real-time data logging and networked data sharing (see Table 17.1). Yet, despite the observations made above regarding the need for information to aid maintenance management, a 'black hole' exists in the row titled 'Decision analysis' in Table 17.1, because virtually no CMMS offers decision support.¹ This is a
definite problem, because the key to systematic and effective maintenance is
managerial decision-making that is appropriate to the particular circumstances of
the machine, plant or organisation. This decision-making process is made all the
more difficult if the CMMS package can only offer an analysis of recorded data.
As an example, when a certain preventive maintenance (PM) schedule is input into
a CMMS, for example to change the oil filter every month, the system will simply
produce a monthly instruction to change the oil filter and is thus no more than a
diary.
Table 17.1. Facilities offered by commercially available CMMS packages

                     1,000+   10,000+   30,000+   40,000+
Data collection      yes      yes       yes       yes
Data analysis        no       yes       yes       yes
Real-time            no       no        yes       yes
Network              no       no        no        yes
Decision analysis    'A black hole': offered by none
All machines work in different environments and would therefore need different PMs
Machine designers often have a different experience of machine failures and means of prevention from those who operate and maintain them
Machine vendors may have a hidden agenda of maximising spare parts replacement through frequent PMs
The use of CMMSs for decision support lags significantly behind the more
traditional applications of data acquisition, scheduling and work order issuing.
While many packages offer inventory tracking and some form of stock level
monitoring, the reordering and inventory holding policies remain relatively
simplistic and inefficient. See the work of Exton and Labib (2002) and Labib and
Exton (2001). Also, there is no mechanism to support managerial decision-making
with regard to inventory policy, diagnostics or setting of adaptive and appropriate
preventive maintenance schedules.
A noticeable problem with current CMMS packages concerns the provision of decision support. Figure 17.1 illustrates how the use of CMMS for decision support
lags significantly behind the more traditional applications of data acquisition,
scheduling and work-order issuing.
[Figure 17.1: applications of CMMS modules. Decision support, 'a black hole', is used far less than data acquisition, scheduling and work-order issuing.]
It is worrying that almost half of the companies are to some degree dissatisfied with, or neutral about, their CMMS, and that the responses indicated that manufacturing plants demand more user-friendly systems.
This is further proof of the existence of a 'black hole'. To make matters worse, it appears that there is a new breed of CMMSs that are complicated and lack basic aspects of user-friendliness. Although they emphasise integration and logistics capabilities, they tend to ignore the fact that the fundamental reason for implementing a CMMS is to reduce breakdowns. These systems are difficult to handle for both production operators and maintenance engineers; they are accounting- and/or IT-orientated rather than engineering-orientated. Results of an investigation (EPSRC GM/M35291) show that managers' lack of commitment to maintenance models has been attributed to a number of reasons:
Unavailability of data
Lack of awareness about these models
Restrictive assumptions of some of these models
Finally, here is an extract from Professor Nigel Slack's (Warwick University) textbook on operations management, giving a critical commentary on ERP implementations (which may as well apply to CMMSs, as many of them nowadays tend to be classified as specialised ERP systems):
Far from being the magic ingredient which allows operations to fully integrate
all their information, ERP is regarded by some as one of the most expensive
ways of getting zero or even negative return on investment. For example, the American chemicals giant, Dow Chemical, spent almost half-a-billion dollars
and seven years implementing an ERP system which became outdated almost
as it was implemented. One company, FoxMeyer Drug, claimed that the expense and problems which it encountered in implementing ERP eventually
drove it to bankruptcy. One problem is that ERP implementation is expensive.
This is partly because of the need to customise the system, understand its
implications for the organisation, and train staff to use it. Spending on what some call the ERP ecosystem (consulting, hardware, networking and complementary applications) has been estimated as being twice the spending on the
software itself. But it is not only the expense which has disillusioned many
companies, it is also the returns they have had for their investment. Some
studies show that the vast majority of companies implementing ERP are
disappointed with the effect it has had on their businesses. Certainly many
companies find that they have to (sometimes fundamentally) change the way they organise their operations in order to fit in with ERP systems. This organisational impact of ERP (which has been described as the corporate equivalent of dental root canal work) can have a significantly disruptive effect on the organisation's operations.
Hence, theory and implementation of existing maintenance models are, to a
large extent, disconnected. It is concluded that there is a need to bridge the gap between theory and practice through intelligent optimisation systems (e.g. rule-based systems). It is also argued that the success of this type of research should be
real maintenance problems. The developed theory must be made accessible to
practitioners through IT tools. Efforts need to be made in the data capturing area to
provide necessary data for such models. Obtaining useful reliability information
from collected maintenance data requires effort. In the past, this has been referred
to as 'data mining', as if data could be extracted in its desired form if only it could be found.
In the next section we introduce a decision analysis model. We then show how
such a model has been implemented for decision support in maintenance systems.
[Figure: breakdown trends in hours per month, November to November, ranging from 0 to about 1200 h.]
[Figure: the Decision-Making Grid (DMG). The two axes are frequency (low/medium/high) and downtime (low/medium/high). Low frequency and low downtime map to operate to failure (O.T.F.); high frequency with low downtime maps to skill level upgrade (S.L.U.); low frequency with high downtime maps to condition-based maintenance (C.B.M.); high frequency and high downtime map to design out maintenance (D.O.M.); the intermediate cells map to variants of fixed time maintenance (F.T.M.) addressing the when, who, what and how of the task.]
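The grid itself can be implemented as a simple lookup table. In the sketch below the cell assignments follow the discussion in the text (OTF at low frequency/low downtime, SLU at high frequency/low downtime, CBM at low frequency/high downtime, DOM at high/high, FTM elsewhere), while the numeric band thresholds are assumptions:

```python
# Decision-Making Grid as a lookup table; cell placement follows the text,
# and the band thresholds below are illustrative assumptions.
DMG = {
    ("low",  "low"):  "OTF", ("med",  "low"):  "FTM", ("high", "low"):  "SLU",
    ("low",  "med"):  "FTM", ("med",  "med"):  "FTM", ("high", "med"):  "FTM",
    ("low",  "high"): "CBM", ("med",  "high"): "FTM", ("high", "high"): "DOM",
}

def band(value, low_cut, high_cut):
    """Map a raw measurement to a low/med/high band."""
    return "low" if value < low_cut else ("med" if value < high_cut else "high")

def dmg_strategy(frequency, downtime_h,
                 freq_cuts=(10, 20), down_cuts=(100, 300)):
    """Recommend a maintenance strategy for a machine (thresholds assumed)."""
    return DMG[band(frequency, *freq_cuts), band(downtime_h, *down_cuts)]

print(dmg_strategy(2, 50))      # rare, quick failures -> OTF
print(dmg_strategy(25, 400))    # frequent, long outages -> DOM
print(dmg_strategy(3, 380))     # infrequent but severe -> CBM
```

A fuzzy version of the same rule base, which avoids hard band boundaries, is discussed below.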
[Figure: a criticality analysis hierarchy. Criteria: downtime, frequency and spare parts. Level 2, critical machines: System A, System B, bottlenecks, System C. Level 3, critical faults: electrical, mechanical, hydraulic, pneumatic, software. Level 4, fault details: motor, limit, no power, panel, proximity, pressure and switch faults.]
[Figure: low/medium/high membership functions for the two inputs, frequency (number of times, 0-50) and downtime (hours, 0-500); the values illustrated are a frequency of 12 and a downtime of 380 h.]
The output strategies have membership functions, and we have assumed a cost (or benefit) function that is linear and follows the relationship DOM > CBM > SLU > FTM > OTF, as shown in Figure 17.9a. The rules are then constructed based on the DMG, giving the nine rules shown in Figure 17.9b.
Figure 17.9. a Output (strategies) membership function. b The nine rules of the DMG
The fuzzy decision surface is shown in Figure 17.10. In this figure, given any combination of frequency (x-axis) and downtime (y-axis), one can determine the most appropriate strategy to follow (z-axis).
It can be noticed from Figure 17.11 that the relationship (DOM > CBM > SLU > FTM > OTF) is maintained. As illustrated in Figure 17.11, given a downtime of 380 h and a frequency of 12, the suggested strategy to follow is CBM.
Figure 17.11. The fuzzy decision surface showing the regions of different strategies
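A minimal Mamdani-style implementation of the nine rules can reproduce this behaviour. The triangular membership functions below are assumptions (the chapter's exact membership functions differ), but the rule base follows the DMG:

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Assumed low/med/high fuzzy sets over frequency (0-50) and downtime (0-500 h).
FREQ = {"low": (-1, 0, 25), "med": (0, 25, 50), "high": (25, 50, 101)}
DOWN = {"low": (-1, 0, 250), "med": (0, 250, 500), "high": (250, 500, 1001)}

RULES = {  # (frequency band, downtime band) -> strategy, following the DMG
    ("low", "low"): "OTF", ("med", "low"): "FTM", ("high", "low"): "SLU",
    ("low", "med"): "FTM", ("med", "med"): "FTM", ("high", "med"): "FTM",
    ("low", "high"): "CBM", ("med", "high"): "FTM", ("high", "high"): "DOM",
}

def fuzzy_dmg(freq, down):
    """Fire all nine rules (min for AND, max to aggregate per strategy) and
    return the strategy with the highest aggregate firing strength."""
    strength = {}
    for (fb, db), strat in RULES.items():
        w = min(tri(freq, *FREQ[fb]), tri(down, *DOWN[db]))
        strength[strat] = max(strength.get(strat, 0.0), w)
    return max(strength, key=strength.get)

print(fuzzy_dmg(12, 380))   # CBM, matching the example in the text
```

Unlike a crisp grid lookup, the fuzzy version degrades gracefully near band boundaries, which is the point of the decision surface in Figure 17.10.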
17.5.5 Discussion
The concept of the DMG was originally proposed by Labib (1996). It was then
implemented in a company that has achieved a world-class status in maintenance
(Labib 1998a). The DMG model has also been extended to be used as a technique
to deal with crisis management in an award winning paper (Labib 1998b).
The DMG can be used as a practical continuous improvement process because, once the machines in the top ten have been addressed, they will, if and only if appropriate action has been taken, move down the list of the top ten worst machines. When they move down the list, other machines reveal that they need improvement, and resources can then be directed towards the new offenders. If this practice is applied continuously, then eventually all machines will be running optimally.
If problems are chronic, i.e. regular, minor and usually neglected, some of these
could be due to the incompetence of the user and thus skill level upgrading would
be an appropriate solution. However, if machines tend towards RCM then the
problems are more sporadic and, when they occur, could be catastrophic. Use of maintenance techniques such as FMEA and FTA can help determine the cause and may help predict failures, thus allowing a prevention scheme to be devised.
Figure 17.12 shows when to apply TPM and RCM. TPM is appropriate at the
SLU range since skill level upgrade of machine tool operators is a fundamental
concept of TPM, whereas RCM is applicable for machines exhibiting severe
failures (high downtime and low frequency). CBM and FMEA will also be ideal
for this kind of machine, and hence an RCM policy will be most applicable. The
significance of this approach is that RCM and TPM are brought together in one
unified model rather than treated as two competing concepts.
It can be concluded that the challenges and research questions facing research
and development (R&D) concerning next generation maintenance systems are:
Emphasis on CMMS and ERP systems in the market, as well as their use
and limitations
Design awareness in maintenance and design for maintainability
Learning from failures across different industries and disciplines
Emphasis on prognostics rather than diagnostics
e-Maintenance and remote maintenance, including self-powered sensors
Modelling and simulation using OR tools and techniques
AI applications in maintenance
As the success of systems implementation is based on two factors, human and
systems, it is important to develop and nurture skills as well as to use advanced
technologies.
In this chapter we have investigated the characteristics of computerised maintenance management systems (CMMSs), highlighted the need for them in
industry, and identified their current deficiencies.
A proposed model was then presented to provide the decision analysis capability
that is often missing in existing CMMSs. The effect of such a model is to contribute
towards optimising the functionality and scope of CMMSs for enhanced
decision analysis support.
We have also demonstrated the use of AI techniques in CMMSs and showed
how this work integrates with that of Kobbacy in Chapter 9. Finally, we have
outlined features of next generation maintenance systems.
17.8 References
Bagadia, K. (2006), Computerized Maintenance Management Systems Made Easy,
McGraw-Hill.
Brashaw, C. (1998), Characteristics of acoustic emission (AE) signals from ill fitted copper
split bearings, Proceedings of the 2nd International Conference on Planned Maintenance, Reliability and Quality.
Ben-Daya, M., Duffuaa, S.O. and Raouf, A. (eds) (2001), Maintenance Modelling and
Optimisation, Kluwer Academic Publishers, London.
Boznos, D. (1998), The Use of CMMSs to Support Team-Based Maintenance, MPhil thesis,
Cranfield University.
18
Risk Analysis in Maintenance
Terje Aven
18.1 Introduction
This chapter discusses the use of risk analysis to support decision making on
maintenance activities. In recent years there has been a growing interest in the use
of risk analysis and risk based (informed) approaches for guiding decisions on
maintenance, see, e.g., Vatn et al. (1996), Clarotti et al. (1997), Dekker (1996) and
Cepin (2002). This topic has also been given much attention in industry; see, for
example, van Manen et al. (1997), Knoll et al. (1996), Perryman et al. (1995) and
Podofillini et al. (2006). This chapter provides a critical review of some of the key
building blocks of the theories and methods developed. We also discuss some
critical factors for ensuring a successful use of risk analysis for maintenance
applications. The issues discussed include risk descriptions and categorisations,
uncertainty assessments, risk acceptance and risk-informed decision making, and
the selection of appropriate methods and tools. An example of a detailed risk
analysis is also presented, showing the effect of maintenance efforts on risk.
The chapter is organised as follows. First in Section 18.2 we review the basic
elements of risk management and risk management processes, and clarify the risk
perspective adopted in this chapter. Then in Section 18.3 we address the use of risk
analysis to support decisions on maintenance. Various types of decision situations
and analyses are covered. Section 18.4 presents the case mentioned above. In
Section 18.5 we discuss key building blocks of the theories and methods developed, as well as the critical factors for ensuring a successful use of risk analysis for
maintenance applications. Section 18.6 concludes. When not otherwise stated, we
use terminology from ISO (2002).
T. Aven
List of abbreviations:
PLL: Potential loss of life (expected number of fatalities per year)
FAR: Fatal accident rate (expected number of fatalities per 100 million exposed hours)
ETA: Event tree analysis
FTA: Fault tree analysis
CCA: Cause consequence analysis
FMECA: Failure mode, effect and criticality analysis
HAZOP: Hazard and operability studies
RIF: Risk influencing factor
BORA: Barrier and operational risk analysis
RCM: Reliability centred maintenance
HMI: Human machine interface
TTS: Technical condition safety
[Figure residue: the risk management process comprises identifying risks, analysing risks and evaluating risks (together constituting risk assessment), followed by treating risks.]
of risk analysis we obtain when models are developed to represent cause and/or
consequence scenarios. The standard tools used are FTA (fault tree analysis) and
ETA (event tree analysis) and the combination of the two, CCA (cause consequence analysis). These models are important elements in a qualitative risk analysis, and provide the basis for a quantitative risk analysis. These are all standard risk
analysis methods and we refer to textbooks for a description and discussion of these
methods; see, e.g., Aven (1992) and Modarres (1993).
The models are used to identify critical systems, and thus provide a basis for
selecting appropriate maintenance activities. To illustrate this, let R be a risk index,
for example expressing the expected number of fatalities (PLL) or the probability
of a system failure, and let Ri be the risk index when subsystem i is in the
functioning state. Then a common way of ranking the different subsystems is to
compute the risk improvement potential (also referred to as the risk achievement
worth) Ii = Ri − R, i.e. the maximum potential risk improvement that can be
obtained by improving system i (Aven 1992; Haimes 1998). The potential Ii is
referred to as a risk importance measure. An application of this approach is
presented in Brewer and Canady (1999). Criteria are established based on such a
ranking to identify when maintenance improvements are needed to reduce risks.
Identifying critical items is an important basis for maintenance management, and is
one of the key steps in various maintenance frameworks, e.g. the RCM (reliability
centred maintenance) approach (Andersen and Neri 1990).
In risk analysis, the maintenance efforts are incorporated by:
1. Showing the relation between maintenance effort and component performance
2. Showing the relation between component performance and overall risk
indices.
An example demonstrating level 1 (the component level) is the periodical testing of a
component, where the component has a failure rate λ and the testing interval is τ.
Then the unavailability of the component is approximated by λτ/2, expressing the
mean fractional down time of the component. We refer to the literature for further
details on this example and related models and methods, including Markov
methods; see, e.g., Aven (1992), Rausand and Høyland (2003) and Modarres
(1993). The component measures often express features of the performance
of safety barriers, reflected in the event trees. In this way a link is established
between the component performance level and risk (level 2). For the periodical
testing example, suppose that the component is a safety barrier modelled as a
branching event of the event tree. Then the unavailability λτ/2 expresses the probability that this barrier is not functioning on demand.
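The λτ/2 approximation can be checked numerically. For a component with constant failure rate λ whose failures are only revealed at tests spaced τ apart, the exact mean fractional downtime is 1 − (1 − e^(−λτ))/(λτ), which is close to λτ/2 when λτ is small; the numbers below are assumed for illustration.

```python
import math

# Mean fractional downtime of a periodically tested component with
# constant failure rate lam (per hour) and test interval tau (hours).
# The component is down from the (hidden) failure until the next test.

def mfdt_exact(lam, tau):
    # Exact value: 1 - (1 - exp(-lam*tau)) / (lam*tau)
    return 1.0 - (1.0 - math.exp(-lam * tau)) / (lam * tau)

def mfdt_approx(lam, tau):
    # First-order approximation used in the text.
    return lam * tau / 2.0

lam, tau = 1e-4, 1000.0        # assumed example numbers (lam*tau = 0.1)
print(mfdt_exact(lam, tau))    # ~0.0484
print(mfdt_approx(lam, tau))   # 0.05
```

For λτ = 0.1 the approximation overshoots the exact value by less than 0.002, which illustrates why λτ/2 is an adequate barrier unavailability for small λτ.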
In Figure 18.2 we present a model for integrating maintenance activities and
risk analysis, taken from Apeland and Aven (2000), which also shows the two
levels 1 and 2 mentioned above. On the low system level we have maintenance,
component and operating characteristics, describing alternative maintenance
actions and strategies, alternative components available and relevant operating
patterns.
[Figure 18.2 (diagram residue): the model links a low system level (maintenance characteristics, component characteristics, operating characteristics), through a low analysis level (component performance, drawing on historical data, expert opinions and suitable models), an intermediate analysis level (maintenance performance, system performance) and a high analysis level (system attributes, comparison of alternatives), up to the high system level (main objectives).]
Figure 18.2. Model showing the relationship between maintenance efforts and risk
(Apeland and Aven 2000)
Traditionally, risk analyses using FTA and ETA have not had the level of detail
that is necessary to support many decisions related to maintenance. However, recent
developments within risk analysis allow for more detailed analysis taking into
account risk influencing factors, for example maintenance activities. In Section 18.4
we will look closer into this type of risk analysis and show how maintenance
activities can be incorporated. Here we summarise the basic features of the method,
using a cause analysis based example as an illustration:
1. Identify top events A that summarise essential barrier performance. An
example is ignition, or avoided ignition, given a specific leakage scenario.
The event A must be precisely defined; no ambiguity can exist.
2. Establish a deterministic model that links A and events Bi and quantities Xi
on a more detailed level. A fault tree is an example of such a model.
3. Specify a set of operational and management factors Fi that could influence
the performance of the barriers, and which have not been included in the
fault tree model. Examples of such factors are the quality of the maintenance work, the level of competence and the adequacy of organisation.
18.4 A Case
In this section we present a risk analysis incorporating operational and maintenance factors. The presentation is based on Sklet et al. (2005); the method is referred to
as the BORA (barrier and operational risk analysis) approach. The approach is
inspired by the I-Risk method (Papazoglou et al. 2003). The case relates to an
offshore installation, and releases of hydrocarbons.
The BORA approach consists of the following steps:
1. Development of a basic risk model.
2. Assignment of industry average frequencies/probabilities of initiating events
and basic events.
3. Identification of risk influencing factors (RIFs) and development of risk influence diagrams.
4. Assessment of the status of RIFs.
5. Calculation of installation specific frequencies/probabilities.
6. Calculation of installation specific risk, incorporating the effect of technical
systems, technical conditions, human factors, operational conditions, and
organizational factors.
[Figure 18.3 (event tree residue): the initiating event is "Valve(s) in wrong position after maintenance"; the barrier functions cover detection of valve(s) in wrong position, through self control/checklists (isolation plan) and a leak test; the end events are a safe state (failure revealed) and release of hydrocarbons.]
As seen in Figure 18.3, several of the barriers are non-physical by nature, thus
requiring human and operational factors to be included in the risk model.
In order to perform a quantitative risk analysis, frequencies/probabilities of
three main types of events need to be quantified:
1. The frequency of the initiating event, i.e. in the example case the frequency of valve(s) in wrong position after maintenance.
2. The probability of failure of the barrier systems, which for the example case
includes: i) failure to reveal valve(s) in wrong position after maintenance by
self control/use of checklists, ii) failure to reveal valve(s) in wrong position
after maintenance by third party control of work, and iii) failure to detect
potential release during leak test prior to start-up.
3. The (end event) frequency of release of hydrocarbons due to valve in wrong
position (needed for further analysis of the effect of the consequence
barriers).
The frequency of the initiating event is in our example a function of the annual
number of maintenance operations where valve(s) may be set in wrong position in
hydrocarbon systems, and the probability of setting a valve in wrong position per
maintenance operation.
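The quantification above can be sketched in a few lines. All numbers here are our own illustrative assumptions, not the values of the BORA case; the barrier failure probabilities are treated as independent for simplicity.

```python
# Sketch of the event-tree quantification. The initiating-event frequency
# is the annual number of relevant maintenance operations times the
# probability of leaving a valve in the wrong position per operation; the
# release frequency multiplies in the failure probability of each barrier.

n_ops_per_year = 120        # assumed annual maintenance operations
p_wrong_position = 0.05     # assumed probability per operation

f_initiating = n_ops_per_year * p_wrong_position   # events per year

# Assumed failure probabilities of the three barriers (self control,
# third party control, leak test), treated as independent:
barriers = [0.2, 0.3, 0.4]

f_release = f_initiating
for p_fail in barriers:
    f_release *= p_fail

print(f_initiating)   # 6.0 initiating events per year
print(f_release)      # 0.144 releases per year
```

The end-event frequency then feeds the further analysis of the consequence barriers, as noted in point 3 above.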
In order to determine the probability of failure of barrier systems, the barrier
systems may be further analyzed by use of fault trees as shown in Figure 18.4.
[Figure 18.4 (fault tree residue): top event "Failure to reveal valve(s) in wrong position after maintenance by self control/use of checklists", with basic events A11, A12 and A13.]
Corresponding analysis may be performed for all barriers for all the identified
release scenarios. For further illustration of the quantification methodology in the
BORA project, we consider the initiating event and the basic events shown in
Figures 18.3 and 18.4:
Valve(s) in wrong position after maintenance that may cause release (the
initiating event).
Use of self control/checklists not specified in program (basic event A11).
Use of self control/checklists specified, but not performed (basic event A12).
The operator fails to detect valve(s) in wrong position by self control/use of
checklists (basic event A13).
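As a rough sketch of how these basic events combine, one may treat A11, A12 and A13 as independent inputs to an OR gate. The independence assumption is ours for illustration (the actual BORA quantification may treat the events differently), and the probabilities follow the industry averages assigned in the example case, assuming the listed order (0.1, 0.05, 0.06).

```python
# Sketch: top-event probability for the fault tree of Figure 18.4,
# treating the basic events A11, A12 and A13 as independent inputs to an
# OR gate (a simplifying assumption).

def or_gate(probs):
    # P(at least one event occurs) = 1 - product of complements
    q = 1.0
    for p in probs:
        q *= (1.0 - p)
    return 1.0 - q

p_a11, p_a12, p_a13 = 0.1, 0.05, 0.06
p_top = or_gate([p_a11, p_a12, p_a13])
print(round(p_top, 4))

# Rare-event approximation: the simple sum of basic-event probabilities.
print(p_a11 + p_a12 + p_a13)
```

The exact OR-gate value (about 0.196) and the rare-event sum (0.21) are close here because the individual probabilities are small.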
18.4.2 Assignment of Average Frequencies/Probabilities
The first step in the quantification process is to assign industry average frequencies
and probabilities for all the initiating events in the event trees and basic events in
the fault trees.
Generic data may be found in generic databases or company internal databases.
Alternatively, industry average values can be established by use of expert judgment. For our example case, Table 18.1 shows the assigned industry average frequencies and probabilities for the initiating events and basic events in Figure 18.4.
Table 18.1. Assigned average frequencies (F) and probabilities (P)

Event description                                                  Assigned values
Valve(s) in wrong position after maintenance (initiating event)    F = 6
Use of self control/checklists not specified in program (A11)      P = 0.1
Use of self control/checklists specified, but not performed (A12)  P = 0.05
The operator fails to detect valve(s) in wrong position (A13)      P = 0.06
[Figure 18.5 (influence diagram residue): the RIFs shown are HMI, maintainability/accessibility, time pressure, competence of area technician, procedures for self control, and work permit.]
Figure 18.5. Influence diagram for the basic event "Operator fails to detect a valve in wrong position by self check/checklist"
Table 18.2 shows the RIFs for all the relevant events in our example case.
Table 18.2. Proposed RIFs for basic events in the example case

Event description: Valve in wrong position after maintenance
RIFs: Process complexity; Maintainability/accessibility; HMI (valve labeling and position feedback features); Time pressure; Competence (of area technician); Work permit

Event description: Self control/use of checklists not specified
RIFs: Program for self control

Event description: Self control/use of checklists not performed
RIFs: Work practice (regarding use of self control/checklists); Time pressure; Work permit

Event description: Area technician fails to detect valve(s) in wrong position by self control/use of checklists
RIFs: HMI (valve labeling and position feedback features); Maintainability/accessibility; Time pressure; Competence (of area technician); Procedures for self control; Work permit
The revised (installation specific) probability Prev is obtained by adjusting the industry average Pave:

Prev = Pave × Σ(i=1..n) wi Qi    (18.1)

where wi is the normalized weight of RIF no. i, satisfying

Σ(i=1..n) wi = 1    (18.2)
Qi = Plow / Pave   if si = A
Qi = 1             if si = C
Qi = Phigh / Pave  if si = F    (18.3)
where si denotes the score or status of RIF no i. Hence if the score si is A, and Plow
is 10% of Pave, then Qi is equal to 0.1. And if the score si is F, and Phigh is ten times
higher than Pave, then Qi is equal to 10. If the score si is C, then Qi is equal to 1.
Furthermore, if all scores are C, then Prev = Pave, if all scores are A, then Prev = Plow,
and if all scores are F, then Prev = Phigh.
Note that in this study we use a fixed factor of ten to describe the variations
caused by different scores, from A to F. That is, if all scores are A, Plow is 10% of
Pave, and if all the scores si are F, then Phigh is ten times higher than Pave.
Furthermore, we have adopted the grade scores from the TTS project: A = 3,
B = 2, C = 1, D = 0, E = −2 and F = −5. Thus we have, letting Qi(j) denote the value of
Qi if the score si takes the value j, the results shown in Table 18.5.
Table 18.5. Adaptation of scores from the TTS project

Score si = j   Qi(j)
3 (A)          0.10
2 (B)          0.55
1 (C)          1
0 (D)          2.5
−2 (E)         5.5
−5 (F)         10
Weight of RIF i (wi)   Normalized weight   Status of RIF i (si)   Qi     wi × Qi
4                      0.12                B                      0.55   0.065
6                      0.18                C                      1      0.176
4                      0.12                E                      5.5    0.647
6                      0.18                D                      2.5    0.441
10                     0.29                C                      1      0.294
4                      0.12                D                      2.5    0.294
Sum: 34                1.0                                               1.918
By use of (18.1), Prev is equal to Pave × 1.918. In our example case, the RIF
analysis gave an increase of the probability of occurrence of the basic event by a
factor of 1.9 (from Pave = 0.01 to Prev = 0.019).
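The revision of Pave can be reproduced in a few lines, using the example's weights and statuses together with the Qi(j) mapping from Table 18.5:

```python
# Recomputing the example revision of Pave: Equation (18.1) with
# normalized weights per Equation (18.2), and the score-to-Q mapping
# of Table 18.5.

Q = {"A": 0.10, "B": 0.55, "C": 1.0, "D": 2.5, "E": 5.5, "F": 10.0}

weights = [4, 6, 4, 6, 10, 4]            # wi before normalization
statuses = ["B", "C", "E", "D", "C", "D"]

total = sum(weights)                     # 34
adjustment = sum(w / total * Q[s] for w, s in zip(weights, statuses))

p_ave = 0.01
p_rev = p_ave * adjustment

print(round(adjustment, 3))   # 1.918
print(round(p_rev, 3))        # 0.019
```

The computed adjustment factor of about 1.918 matches the table, giving Prev ≈ 0.019 from Pave = 0.01.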
18.4.6 Recalculation of the Installation Specific Risk
A revised value for the installation specific risk may be calculated by use of the
platform specific data (Prev) as input data in the risk model (event trees/fault trees)
described above.
18.4.7 Remarks
We refer to Sklet et al. (2005) for a detailed discussion of this approach, and
relevant references for similar methods.
Compared to a traditional QRA model, the BORA approach is more detailed
and includes considerably more risk influencing factors, which gives more
detailed information about the factors contributing to the total risk, i.e. a more detailed
risk picture. The analysis allows one to study the effect of maintenance efforts on
risk, and thus provides support for maintenance decisions. The risk analysis can be
used to identify the critical factors, as well as to express the effect of risk reducing
measures.
that the use of expected values is the appropriate criterion for determining the best
policies. The justification is the statistical property of a mean. If we consider a
large set of similar activities and Xi is the consequence of the i-th activity, then the
law of large numbers says that under certain conditions the mean of the Xi is
approximately equal to E[Xi]. Portfolio theory also supports the use of expected values; see e.g. Abrahamsen et al. (2005).
The use of traditional cost-benefit analyses to support decision making is based
on the same type of logic. Cost-benefit analysis means that we assign monetary
values to all relevant attributes, including costs and safety, and summarise the
performance of an alternative by the expected net present value, E[NPV]. The main
principle in transformation of goods into monetary values is to find out what the
maximum amount society is willing to pay to obtain an improved performance.
Use of cost-benefit analysis is seen as a tool for obtaining efficient allocation of the
resources, by identifying which potential actions are worth undertaking and in what
fashion. By adopting the cost-benefit method the total welfare is optimised. This is
the rationale for the approach. Although cost-benefit analysis was originally
developed for the evaluation of public policy issues, the analysis is also used in
other contexts, in particular for evaluating projects and activities in firms. The
same principles apply, but using values reflecting the decision maker's benefits and
costs, and the decision maker's willingness to pay.
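As a minimal sketch of this logic (with assumed, purely illustrative numbers), an alternative's E[NPV] can be computed by discounting expected yearly cash flows in which the expected accident cost has been monetised:

```python
# Minimal cost-benefit sketch: expected net present value of a safety
# investment, with the expected cost of accidents folded into the yearly
# cash flow. All numbers are illustrative assumptions.

def expected_npv(investment, yearly_benefit, p_accident, accident_cost,
                 years, rate):
    npv = -investment
    for t in range(1, years + 1):
        # Expected yearly cash flow: benefit minus expected accident loss.
        expected_cash = yearly_benefit - p_accident * accident_cost
        npv += expected_cash / (1.0 + rate) ** t
    return npv

# Assumed: a 100-unit safety investment yields 30/year in avoided
# downtime; accidents occur with probability 0.01/year costing 500.
npv = expected_npv(100.0, 30.0, 0.01, 500.0, years=5, rate=0.05)
print(round(npv, 2))   # positive -> worth undertaking under this criterion
```

Under the pure expected-value criterion the alternative is accepted if E[NPV] > 0; the discussion that follows explains why this criterion alone can be insufficient when uncertainties are large.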
However, risk is more than expected values. The most common definition of
risk in the engineering community is that risk is the combination of consequences
and probability, i.e. the combination (X, P), where P refers to probability; see e.g.
ISO (2002). We extend this definition by using the pair (X, U), where U refers to
uncertainty. Probability is a way of expressing the uncertainties. Following these
perspectives on risk, there is a need to see beyond the expected values. The arguments can be summarised as follows.
What we search for is desirable outcomes X, for example no accidents and high
profit. In practice we have a finite number of projects, and the mean numbers based
on these projects are not the same as the expected value. An accident could result
in losses that are significant also in a corporate perspective the standard deviation
of the project loss could be significant relative to the total cash flow of the firm.
And since the uncertainties in the consequences are large, the assumptions and
suppositions made in the calculation of the expected value may influence the
results to a large extent. The assessments made should be seen as considerations
based on relevant information, but there could be different assessments, different
views and different perspectives on the uncertainties. This applies in particular to
assigned, small probabilities of rare events.
A complicating factor is that safety and risk involve the balance between
different attributes, including lives and money. The above expected value approach,
for example based on cost-benefit analyses, is based on one being able to transform
all values to one unit, the economic value. And from a business perspective, firms
may argue that this is the only relevant value. All relevant values should be transformed to this unit. This means that the expected costs of accidents and lives should
be incorporated in the evaluations.
But what is the economic value of a life? For most human beings it is infinite;
most people would not be willing to give his or her life for a certain amount of
money. We say that a life has a value in itself. But of course, an individual may
accept a risk for certain money or other benefits. And for the firm, this is the way
of thinking the balance of costs and risk. The challenge is however to perform
this balance. What are reasonable numbers for the firm to use for valuing that a life
has a value in itself? Obviously there are no correct answers, as it is a managerial
and strategic issue. High values may be used if it can be justified that this would
produce high performance levels, on both safety and production.
Consequently, uncertainty needs to be considered, beyond the expected values,
which means that the principles of robustness and caution (precaution) have a role
to play. Risk-averse behaviour is often the result. The point is that we put more
weight on possible negative outcomes than the expected values support. Many
firms seem in principle to be in favour of a risk neutral strategy for guiding their
decisions, but in practice it turns out that they are often risk averse. The justification is partly based on the above arguments. In the case with a large accident, the
possible total consequences could be rather extreme the total loss for the firm in a
short and long term perspective is likely to be high due to loss of production,
penalties, loss of reputation, changes in the regulation regimes, etc. The overall
loss is difficult to quantify (the uncertainties are large) and it is seldom done in
practice, but the overall conclusion is that investments in safety are required. The
expected value is not the only basis for making this conclusion. We apply a
cautionary principle, expressing that in the face of uncertainty, caution should be a
ruling principle. For example, in a process plant, major hydrocarbon leaks might
occur, requiring investments in various safety systems and barriers to reduce the
possible consequences; we are cautious. Uncertainties in phenomena and processes justify investments in safety.
Thus to conclude on maintenance alternatives, we need an approach which
provides decision support beyond expected values. We recommend an assessment
process following a structure as summarized in the following (Aven and Vinnem
2007).
For a specified alternative, say A, we assess the consequences or effects of this
alternative seen in relation to the defined attributes (safety, costs, reputation, etc.).
Hence we first need to identify the relevant attributes (X1, X2, ...) and then assess
the consequences of the alternative for these attributes. These assessments could
involve qualitative or quantitative analysis. Regardless of the level of quantification, the assessments need to consider both what the expected consequences are, as
well as uncertainties related to the possible consequences. Often the uncertainties
could be large. In line with the adopted perspective on risk, we recommend a structure for the assessment according to the following scheme:
1. Identify the relevant attributes (safety, costs, reputation, alignment with
main concerns, ...).
2. What are the assigned expected consequences, i.e. E[Xi] given the available
knowledge and assumptions?
3. Are there special features of the possible consequences? In addition to
assessing the consequences on the quantities Xi, some aspects of the possible
consequences might need special attention. Examples may be
the temporal extension, or aspects of the consequences that could cause social
is to what extent it is appropriate to adjust the value of a (statistical) life and adjust
the discount rate to take into account the uncertainties.
In maintenance applications there is often reference to the use of risk acceptance
criteria, as upper limits of risk acceptance expressed for example by the PLL or
FAR values; see e.g. Khan and Haddara (2003). We are sceptical of the prevailing
thinking concerning risk acceptance criteria; see Aven and Vinnem (2005, 2007).
We all agree on the need for considering risk as a basis for making decisions under
uncertainty. Such considerations must however be seen in relation to other concerns, costs and benefits. Care should be shown when using pre-determined risk
acceptance criteria in order to obtain good arrangements, plans and measures, as
they easily lead to the wrong focus: using risk analysis to verify that these limits
are met, with no drive for risk reduction and safety improvements.
The use of risk acceptance criteria cannot replace managerial review and
judgement. The decision support analyses need to be evaluated in the light of the
premises, assumptions and limitations of these analyses. The analyses are based on
background information that must be reviewed together with the results of the
analyses. Risk analysis provides decision support, not hard decisions. We refer to
Aven and Vinnem (2007).
18.6 Conclusions
This chapter has presented and discussed the use of risk analysis for the selection
and prioritisation of maintenance activities. The chapter has reviewed some critical
aspects of risk analysis important for the successful implementation of such analyses in maintenance. This relates to risk descriptions and categorisations, uncertainty assessments, risk acceptance and risk informed decision making, as well as
selection of appropriate methods and tools.
In the risk analysis, the maintenance efforts are incorporated by:
Showing the relation between maintenance effort and component performance
Showing the relation between component performance and overall risk indices.
An example is shown in Section 18.4. This example demonstrates some of the
problems related to incorporating the maintenance efforts into the risk analysis.
The analysis needs to be rather detailed to support the decision making. Developing suitable methodology is not straightforward, for example regarding how to assign
installation specific probabilities, based on the information available (including
reliability and maintenance data). Further research is undoubtedly required to give
confidence in the methods to be used. A detailed analysis requires substantial input
data, and the data must be relevant. Such analyses cannot be performed without
extensive use of expert judgment. However, expert judgment is not to be seen as
something negative. The risk analysis is a tool for summarising the information
available (including uncertainties), and expert judgment constitutes an important
part of this information.
18.7 References
Abrahamsen, E.B., Aven, T., Vinnem, J.E. and Wiencke, H.S. (2005) Safety management
and the use of expected values. Risk, Decision and Policy, 9, 347–358.
Andersen, R.T. and Neri, L. (1990) Reliability-Centred Maintenance: Management and
Engineering Methods, Elsevier Applied Science, London.
Apeland, S. and Aven, T. (2000) Risk based maintenance optimization: foundational issues.
Reliability Engineering and System Safety, 67, 285–292.
Aven, T. (1992) Reliability and Risk Analysis, Elsevier Applied Science, London.
Aven, T. and Jensen, U. (1999) Stochastic Models in Reliability, Springer-Verlag, New
York.
Aven, T. and Kristensen, V. (2005) Perspectives on risk: review and discussion of the
basis for establishing a unified and holistic approach. Reliability Engineering and
System Safety, 90, 1–14.
Aven, T. and Vinnem, J.E. (2005) On the use of risk acceptance criteria in the offshore oil
and gas industry. Reliability Engineering and System Safety, 90, 15–24.
Aven, T., Vinnem, J.E. and Wiencke, H.S. (2007) A decision framework for risk
management. Reliability Engineering and System Safety, 92, 433–448.
Aven, T. and Vinnem, J.E. (2007) Risk Management, with Applications from the Offshore
Oil and Gas Industry, Springer-Verlag, New York.
Brewer, H.D. and Canady, K.S. (1999) Probabilistic safety assessment support for the
maintenance rule at Duke Power Company. Reliability Engineering and System Safety,
63, 243–249.
Cepin, M. (2002) Optimization of safety equipment outages improves safety. Reliability
Engineering and System Safety, 77, 71–80.
Clarotti, C.A., Lannoy, A. and Procaccia, H. (1997) Probabilistic risk analysis of ageing
components which fail on demand; a Bayesian model: application to maintenance
optimization of diesel engine linings. In Proceedings of Ageing of Materials and
Methods for the Assessment of Lifetimes of Engineering Plant, Cape Town, pp. 85–94.
Dekker, R. (1996) Applications of maintenance optimization models: a review and analysis.
Reliability Engineering and System Safety, 51, 229–240.
Faber, M.H. (2002) Risk-based inspection: an introduction. Structural Engineering
International, 12, 186–194.
Haimes, Y.Y. (1998) Risk Modeling, Assessment, and Management, Wiley, New York.
ISO (2002) Risk management vocabulary. ISO/IEC Guide 73.
Khan, F.I. and Haddara, M.M. (2003) Risk-based maintenance (RBM): a quantitative
approach for maintenance/inspection scheduling and planning. Journal of Loss
Prevention, 16, 561–573.
Knoll, A., Samanta, P.K. and Vesely, W.E. (1996) Risk based optimization of the frequency
of EDG on-line maintenance at Hope Creek. In Proceedings of Probabilistic Safety
Assessment, Park City, pp. 378–384.
Modarres, M. (1993) What Every Engineer Should Know about Reliability and Risk
Analysis, Marcel Dekker, New York.
van Manen, S.E., Janssen, M.P. and van den Bunt, B. (1997) Probability-based optimization
of maintenance of the River Maas Weir at Lith. In Proceedings of the European Safety
and Reliability Conference (ESREL), Lisbon, pp. 1741–1748.
Papazoglou, I.A., Bellamy, L.J., Hale, A.R., Aneziris, O.N., Post, J.G. and Oh, J.I.H. (2003)
I-Risk: development of an integrated technical and management risk methodology for
chemical installations. Journal of Loss Prevention in the Process Industries, 16, 575–591.
Paté-Cornell, E.M. and Murphy, D.M. (1996) Human and management factors in
probabilistic risk analysis: the SAM approach and observations from recent applications.
Reliability Engineering and System Safety, 53, 115–126.
Perryman, L.J., Foster, N.A. and Nicholls, D.R. (1995) Using PRA in support of
maintenance optimization. International Journal of Pressure Vessels & Piping, 61,
593–608.
Podofillini, L., Zio, E. and Vatn, J. (2006) Risk-informed optimisation of railway tracks
inspection and maintenance procedures. Reliability Engineering and System Safety, 91,
20–35.
PSA (2004) Trends in Risk Levels on the Norwegian Continental Shelf, Main Report Phase
4, 2003 (in Norwegian). The Petroleum Safety Authority Norway, Stavanger.
Rausand, M. and Høyland, A. (2003) System Reliability Theory, Wiley, New York.
Renn, O. and Klinke, A. (2002) A new approach to risk evaluation and management:
risk-based, precaution-based and discourse-based strategies. Risk Analysis, 22,
1071–1094.
Sandøy, M., Aven, T. and Ford, D. (2005) On integrating risk perspectives in project
management. Risk Management: An International Journal, 7, 7–21.
Sklet, S., Hauge, S., Aven, T. and Vinnem, J.E. (2005) Incorporating human and
organizational factors in risk analysis for offshore installations. In Proceedings of
ESREL 2005, pp. 1839–1847.
Thomassen, O. and Sørum, M. (2002) Mapping and monitoring the safety level. SPE 73923,
Society of Petroleum Engineers.
Vatn, J., Hokstad, P. and Bodsberg, L. (1996) An overall model for maintenance
optimization. Reliability Engineering and System Safety, 51, 241–257.
19
Maintenance Performance Measurement (MPM)
System
Uday Kumar and Aditya Parida
19.1 Introduction
Maintenance is an important support function in businesses with significant investments in physical assets, and it plays an important role in achieving organizational goals. However, the cost of maintenance and downtime is too high for many industries. For example, the cost of maintenance in a highly mechanized mine can be 40–60% of the operating cost (Campbell 1995), maintenance spending in the UK's manufacturing industry ranges from 12 to 23% of total factory operating costs (Cross 1988) and, as per a study in Germany, the annual spending on maintenance in Europe is around 1500 billion euros (Altmannshoffer 2006). All this has motivated senior managers and maintenance engineers to measure the contribution of maintenance towards total business goals, in terms of return on investment, etc.
Prior to the 1940s, maintenance was considered a necessary evil and the general attitude to maintenance was "It costs what it costs". During 1950–80, with the advent of techniques like preventive maintenance and condition monitoring, the perception changed to "maintenance is an important support function and it can be planned and controlled". Today maintenance is considered an integral part of the business process and it is perceived as follows: "It creates additional value" (Liyanage and Kumar 2003). The creation of additional value by maintenance is expressed in terms of increased productivity, better utilisation of plant and systems, lower accident rates and a better working environment. With increasing awareness that maintenance creates additional value in the business process, more and more companies are treating maintenance as an integral part of the business process, and the maintenance function has become an essential element of the strategic thinking of many companies in the service and manufacturing industries. With this change in the mindset of senior asset managers and owners, it has become essential to measure the performance of the manufacturing process to understand the tangible and, if possible, intangible contribution of maintenance towards business goals. However, without any formal measures of performance, it is difficult to plan, control
and improve the maintenance process. With this, the focus has shifted to measuring the performance of maintenance. Maintenance performance needs to be measured to evaluate, control and improve maintenance activities, to ensure the achievement of organizational goals and objectives.
In recent years, maintenance performance measurement (MPM) has received a great amount of attention from researchers and practitioners due to a paradigm shift in maintenance. This chapter deals with the broad topic of performance measurement (PM) and the metrics and measures for MPM, reviews the existing MPM frameworks, and discusses various issues and challenges associated with the development and implementation of an MPM system. The outline of the chapter is as follows: an overview of various PM frameworks and their development is presented in Section 19.2. Definitions of the maintenance performance indicator (MPI) and the MPM system, and their salient features, are discussed in Section 19.3. The important issues associated with the development of an MPM system are discussed in Section 19.4, while the MPIs under different criteria are explained in Section 19.5. The MPM system and the framework are explained in Section 19.6. Some of the MPIs and MPM systems in different industries are discussed in Section 19.7. The final section concludes the chapter with limitations of the current literature and practice.
Figure 19.1. Linkages between objective outcomes at the operational level to the strategic level and breaking down of goals into objective targets
As shown in the figure, while cascading down the corporate goals of a mining company with an installed capacity of 0.6 million tons per month, the monthly production target of 0.51 million tons per month of iron ore pellets cascades down to a system availability of 96% at the tactical level, which must be translated into a maximum allowed planned stop of 20 h per month and an unplanned plant stop of 8.8 h per month. Similarly, when aggregating upwards, MPIs such as planned and unplanned stops are aggregated to a higher level in terms of availability and capacity utilization. The calculations are as follows:
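The cascade can be sketched numerically as follows. The capacity, production target and stop figures are those quoted above; the 30-day month used to obtain the hours per month is an assumption for illustration:

```python
# Sketch of the cascade in Figure 19.1 (assumes a 30-day month; the capacity
# and stop figures are those quoted in the text).
hours_per_month = 30 * 24                    # 720 h
planned_stop_h = 20.0                        # maximum allowed planned stop per month
unplanned_stop_h = 8.8                       # maximum allowed unplanned stop per month

# Tactical-level system availability implied by the allowed stops
availability = (hours_per_month - planned_stop_h - unplanned_stop_h) / hours_per_month
# (720 - 28.8) / 720 = 0.96, i.e. the 96% availability quoted in the text

# Strategic-level capacity utilization implied by the production target
installed_capacity = 0.60                    # million tons per month
production_target = 0.51                     # million tons per month
capacity_utilization = production_target / installed_capacity   # 0.85
```

Reading the figures downwards, the 96% availability is exactly the stop budget of 28.8 h expressed against a 720 h month.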
Since the actual production and the OEE level have gone down, the management now has to take remedial measures and make appropriate decisions to achieve the desired level of OEE and production.
19.5.2 Multiple Criteria of MPM System
The objectives of the organizational decision makers are expressed in terms of different criteria. For example, at the beginning of the twentieth century, financial cost was the single criterion used by managers. After the 1980s, management felt that a single criterion was unable to meet all their objectives, and the concept of multiple criteria evolved. When there are a number of criteria, the multi-criteria choice problem arises, which is solved by obtaining information about all the criteria and their relative priorities. For the MPM system, different MPIs are grouped under different criteria as per the organization's requirements, based on the stakeholders' needs. The multiple criteria of the MPIs can be considered from a balanced and integrated point of view. Besides the four perspectives (customer, financial, internal processes, and innovation and learning) of Kaplan and Norton (1992), three more criteria, namely HSE, employee satisfaction and maintenance task related, are considered and included in the MPM framework. Some of the MPIs thus grouped under the seven criteria associated with the development of the MPM framework are selected to improve the productivity, quality and safety of the organization (Parida et al. 2005). The seven criteria considered are discussed below.
19.5.2.1 Plant/Equipment Related Indicators
The indicators under this criterion measure performance pertaining to the plant and equipment of the organization. These MPIs provide relevant information to the management at different hierarchical levels for appropriate decision making. Some of the MPIs under this criterion are:
- Down-time for the number of minor and major stops. This is expressed in hours and minutes for the total number of stops or for each minor and major stop.
- Rework. Rework due to maintenance lapses (for example, not sharpening the tools), expressed in time (hours and minutes), the number of pieces on which rework has been carried out and the cost of the rework undertaken.
- Maintenance cost/unit
- Production cost per unit
- Total maintenance cost
for checking the frequency of failure and the time taken to fix the failure. Some of the MPIs considered under this criterion are:
- Number of incidents/accidents
- Lost time due to HSE issues
- Number of legal cases
- Number of compensation cases/amount of compensation paid
- Number of HSE complaints
- Employee absenteeism
- Employee complaints
- Employee retention
Multi-criteria hierarchical MPM framework (hierarchical levels: Level 1, strategic/top management; Level 2, tactical/middle management; Level 3, functional/operational):

Front-end process, external effectiveness: timely delivery; quality; HSE issues; customers/stakeholders; compliance with regulations.
Back-end process, internal effectiveness: reliability; productivity; efficiency; growth and innovation; process stability; supply chain; HSE.

Equipment/process related
- Level 1: capacity utilization
- Level 2: availability; OEE; production rate; quality; number of stops
- Level 3: production rate; number of defects/rework; number of stops/downtime; vibration and thermography
Cost related
- Level 1: maintenance/production cost per ton
- Levels 2 and 3: maintenance/production cost
Maintenance task related
- Level 1: quality of maintenance task; change-over time
- Level 2: change-over time; planned maintenance task; unplanned maintenance task
- Level 3: planned maintenance task; unplanned maintenance task
Learning, growth and innovation
- Levels 1, 2 and 3: generation of a number of new ideas; skill improvement training
Customer satisfaction related
- Quality complaint numbers; quality returns; customer satisfaction; customer retention
Health, safety, security and environment (HSSE)
- Level 1: number of accidents; number of legal cases; HSSE losses; HSSE complaints
- Level 2: number of accidents/incidents; number of legal cases; compensation paid; HSSE complaints
- Level 3: number of accidents/incidents; HSSE complaints
Employee satisfaction
- Level 1: employee satisfaction
- Level 2: employee complaints; employee absenteeism
- Level 3: employee complaints
The MPIs at the functional and tactical levels get aggregated as KPIs at the strategic level. For example, MPIs like availability, performance (production rate) and quality at the operational level aggregate to OEE at the tactical level, and to capacity utilization at the strategic level, under the plant/equipment criterion.
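That aggregation can be sketched as follows; the operational-level figures here are assumed for illustration, with OEE taken as the usual product of availability, performance rate and quality rate:

```python
# Hypothetical operational-level MPIs for one month
availability = 0.96        # from planned and unplanned stops
performance_rate = 0.92    # actual vs. rated production rate
quality_rate = 0.97        # good output / total output

# Tactical level: the three operational MPIs aggregate to OEE
oee = availability * performance_rate * quality_rate   # about 0.857

# Strategic level: actual production against installed capacity
capacity_utilization = 0.51 / 0.60                     # 0.85
```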
Attribute: 1. Operates smoothly
- Overall indicator: 1. Operating performance
  - Strategic indicator 1: Forced power reductions and outages
    Specific indicators: 1. Number of forced power reductions and outages due to internal causes; 2. Number of forced power reductions and outages due to external causes
  - Strategic indicator 2: State of structures, systems and components
    Specific indicators: 1. Corrective work orders issued; 2. Material condition; 3. State of the barriers
Production
- Produced volume (Sm3)
- Planned production (Sm3)
Technical integrity
- Backlog preventive maintenance (man-hours)
- Backlog corrective maintenance (man-hours)
Maintenance
- Maintenance man-hours total
- Maintenance man-hours safety systems
Deferred production
- Due to maintenance (Sm3)
- Due to operation (Sm3)
- Due to drilling/well operations (Sm3)
- Weather and other causes (Sm3)
ance measurement frameworks. There is further scope to study the impact of different cultures and the human behavioral aspects associated with MPM.
19.8 References
Abran, A. and Buglione, L. (2003), A multidimensional performance model for consolidating Balanced Scorecards, Advances in Engineering Software, 34, pp. 339–349
Åhren, T. and Kumar, U. (2004), Use of maintenance performance indicators: a case study at Banverket, Conference proceedings of the 5th Asia-Pacific Industrial Engineering and Management Systems Conference (APIEMS 2004), Gold Coast, Australia
Altmannshoffer, R. (2006), Industrielles FM, Der Facility Manager (in German), April issue, pp. 12–13
Al-Turki, U. and Duffuaa, S. (2003), Performance measures for academic departments, International Journal of Educational Management, Vol. 17, No. 7, pp. 330–338
Andersen, B. and Fagerhaug, T. (2002), Eight steps to a new performance measurement system, Quality Progress, 35, 2, pp. 11–25
Campbell, J.D. (1995), Uptime: Strategies for Excellence in Maintenance Management, Portland, OR, Productivity Press
Chandler, A.D. (1977), The Visible Hand: The Managerial Revolution in American Business, Boston, MA, Harvard University Press, p. 417
Cross, M. (1988), Raising the value of maintenance in the corporate environment, Management Research News, Vol. 11, No. 3, pp. 8–11
DOE-HDBK-1148-2002 (2002), Work Smart Standard (WSS) Users' Handbook, Department of Energy, USA, www.eh.doe.govt/tecgstds/standard/hdbk1148/hdbk11482002.pdf
Fitzgerald, L., Johnson, R., Brignall, S., Silvestro, R. and Voss, C. (1991), Performance Measurement in Service Businesses, London, CIMA
IAEA, International Atomic Energy Agency (2000), A Framework for the Establishment of Plant Specific Operational Safety Performance Indicators, Report, Austria
Kaplan, R.S. and Norton, D.P. (1992), The balanced scorecard: measures that drive performance, Harvard Business Review, January–February, pp. 71–79
Keegan, D., Eiler, R. and Jones, C. (1989), Are your performance measures obsolete? Management Accounting, June, pp. 45–50
Kennerley, M. and Neely, A. (2003), Measuring performance in a changing business environment, International Journal of Operations and Production Management, Vol. 23, No. 2, pp. 213–229
Kumar, U. and Ellingsen, H.P. (2000), Development and implementation of maintenance performance indicators for the Norwegian oil and gas industry, Conference proceedings of the 15th European Maintenance Conference (Euromaintenance 2000), Gothenburg, Sweden
Lingle, J.H. and Schiemann, W.A. (1996), From balanced scorecard to strategy gauge: is measurement worth it? Management Review, March, pp. 56–62
Liyanage, J.P. and Kumar, U. (2003), Towards a value-based view on operations and maintenance performance management, Journal of Quality in Maintenance Engineering, Vol. 9, pp. 333–350
Lynch, R.L. and Cross, K.F. (1991), Measure Up!: The Essential Guide to Measuring Business Performance, London, Mandarin
Medori, D. and Steeple, D. (2000), A framework for auditing and enhancing performance measurement systems, International Journal of Operations and Production Management, Vol. 20, No. 5, pp. 520–533
Meyer, M.W. and Gupta, V. (1994), The performance paradox, in Staw, B.M. and Cummings, L.L. (Eds), Research in Organizational Behavior, Vol. 16, Greenwich, CT, JAI Press, pp. 309–369
Miles, M.B. and Huberman, A.M. (1994), Qualitative Data Analysis, Sage Publications, California, USA
Murthy, D.N.P., Atrens, A. and Eccleston, J.A. (2002), Strategic maintenance management, Journal of Quality in Maintenance Engineering, Vol. 8, No. 4, pp. 287–305
Neely, A.D. (1999), The performance measurement revolution: why now and what next? International Journal of Operations and Production Management, Vol. 19, No. 2, pp. 205–228
Neely, A., Adams, C. and Kennerley, M. (2002), The Performance Prism, Prentice Hall, Financial Times, Harlow, UK
Parida, A., Chattopadhyay, G. and Kumar, U. (2005), Multi criteria maintenance performance measurement: a conceptual model, in Proceedings of the 18th International Congress of COMADEM, 31st Aug–2nd Sep 2005, Cranfield, UK, pp. 349–356
Parida, A. and Kumar, U. (2006), Maintenance performance measurement (MPM): issues and challenges, Journal of Quality in Maintenance Engineering, Vol. 12, No. 3, pp. 239–251
Tsang, A.H.C. (1998), A strategic approach to managing maintenance performance, Journal of Quality in Maintenance Engineering, Vol. 4, No. 2, pp. 87–94
Wealleans, D. (2000), Organizational Measurement Manual, Abingdon, Oxon, GBR, Ashgate Publishing Limited
Wireman, T. (1998), Developing Performance Indicators for Managing Maintenance, New York, Industrial Press, Inc.
Wongrassamee, S., Gardiner, P.D. and Simmons, J.E.L. (2003), Performance measurement tools: the balanced scorecard and the EFQM Excellence Model, Measuring Business Excellence, Vol. 7, pp. 14–29
20
Forecasting for Inventory Management
of Service Parts
John E. Boylan and Aris A. Syntetos
20.1 Introduction
Service parts are ubiquitous in modern societies. Their need arises whenever a
component fails or requires replacement. In some sectors, such as the aerospace
and automotive industries, a very wide range of service parts are held in stock, with
significant implications for availability and inventory holding. Their management
is therefore an important task.
A distinction should be drawn between preventive maintenance and corrective
maintenance. Demand arising from preventive maintenance is scheduled and is
deterministic, at least in principle. Demand arising from corrective maintenance,
after a failure has occurred, is stochastic and requires forecasting.
Fortuin and Martin (1999) categorise the contexts for service logistics as follows:
- Technical systems under client control (e.g. machines in production departments, transport vehicles in a warehouse);
- Technical systems sold to customers (e.g. telephone exchange systems, medical systems in hospitals);
- End products used by customers (e.g. TV sets, personal computers, motor cars).
In the first context, there is usually a specialist department within the client
organization performing maintenance activities and managing service parts inventories. In the second context, a specialist department within the vendor organization will generally undertake these tasks. In both cases, a large amount of information is known by the vendor, or can be shared with the vendor. This information
may include scheduled (preventive) maintenance activities, times between failures,
usage rates and condition of equipment.
When a wealth of data is available, it is possible to identify explanatory
variables which may be used to predict the demand of service parts. For example,
480
Ghobbar and Friend (2002) showed that the average demand interval for aircraft
spare parts depends on the aircraft utilization rate, the component overhaul life and
the type of primary maintenance process. In a further study, Ghobbar and Friend
(2003) showed how forecast accuracy depends on various characteristics of the
demand process, including the seasonal period length, as well as the primary
maintenance process. Hua et al. (2006) used two zero-one explanatory variables,
plant overhaul and equipment overhaul, to help predict demand of spare parts
in the petrochemical industry. In other cases, explanatory variables have been used
to predict part of the demand for a stock keeping unit (SKU). For example,
Kalchschmidt et al. (2006) identified clusters of customers whose sales were
correlated with promotional activities and clusters of customers that were unaffected, using appropriate forecasting methods for each group.
In the third context, parts are used by consumers and much less information is
available. Fortuin and Martin (1999, p. 957) commented, "Clients are anonymous, their usage of consumer products and their maintenance concept are not known."
Most demand arises from purely corrective maintenance (e.g. on TV sets, personal
computers) required in the case of a defect. Even when preventive maintenance
occurs (e.g. on motor cars), prediction is complicated by the maintenance concept
of consumers being unknown. For example, customers may not bring in their cars
at the correct time for a service, or may not bring them in at all. In many practical
situations where end products are used by consumers, the vendor must gauge
demand for service parts from the demand history alone. Such demand patterns are
often sporadic, with occasional spikes of demand. Alternatively, demand for an
SKU may be decomposed into regular and irregular components (Kalchschmidt et
al. 2006). In both cases, sporadic demand for service parts poses a considerable
challenge to those responsible for managing inventories. It is this challenge that
will be addressed in this chapter.
The remainder of the chapter is structured as follows. In the next section we
address issues pertinent to the classification of service parts for forecasting and
inventory management related purposes. Parametric and non-parametric approaches
to forecasting service parts requirements are then discussed in Sections 20.3 and
20.4 respectively. In Section 20.5, we present various metrics appropriate for
measuring the performance of the inventory management system whereas in
Section 20.6 we review the limited number of studies that provide empirical
evidence on: i) the performance of forecasting methods for service parts and ii) the
empirical fit of statistical distributions to the corresponding underlying demand
patterns. Finally, the conclusions of our work are summarized in Section 20.7.
A classification of service parts according to the product life cycle can assist in choosing the better approach. As discussed in the first section of this chapter, the choice of forecasting approach is mainly determined by the availability of data on explanatory variables, such as the timing of preventive maintenance activities. However, the forecasting approach is also driven by the availability of demand history data which, in turn, is determined by the stage of the service part's life cycle.
Causal methods are particularly useful in the initial phase, when the part is
introduced, since the lack of an adequate length of demand history precludes the
use of extrapolative time-series methods. Models linking sales to promotional
expenditure, for example, can be applied. In the normal phase, which is the focus
of this chapter, causal methods are used when maintenance activities are under the
control of the vendor or the client (if the client is not an end-consumer). For consumer clients, historical data for the explanatory variables are usually not available,
and time-series methods are used to forecast service parts requirements. In the final phase, when an "all-time buy" from a supplier is required, extrapolative methods can be applied. For example, a regression model on the logarithm of sales against time may be used, assuming an exponential decline in demand over time.
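A minimal sketch of such a final-phase extrapolation follows; the monthly demand figures are made up for illustration, and the fit is an ordinary least-squares regression on the logarithm of demand:

```python
import math

# Hypothetical monthly demand for a service part in its final (decline) phase
demand = [80, 66, 54, 44, 36, 30, 24, 20]
t = list(range(1, len(demand) + 1))
y = [math.log(d) for d in demand]           # log-transform the demand

# Ordinary least-squares fit of y = a + b*t
n = len(t)
t_bar, y_bar = sum(t) / n, sum(y) / n
b = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) \
    / sum((ti - t_bar) ** 2 for ti in t)
a = y_bar - b * t_bar

# b < 0 corresponds to an exponential decline in demand over time;
# back-transform to forecast demand (in units) for month 9
forecast_month_9 = math.exp(a + b * 9)
```

Summing such back-transformed forecasts over the remaining horizon would give one rough basis for sizing the all-time buy.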
20.2.4 Forecasting Method
Faster moving service parts are commonly forecast using time-series methods. The
specific method that should be employed depends on the characteristics of the
Figure 20.1. Examples of slow and lumpy demand patterns (demand in units, monthly periods from January to December)
The first and second factors determine the intermittence of demand. In response
to this intermittence, for those SKUs with very few customers, it may become
feasible to liaise directly with them, and to enhance forecasts accordingly.
The third and fourth factors determine the erraticness of demand. As orders
become more irregular, exploiting early information at the customer level becomes
more attractive. Of course, such early indications are not always available. This
will often be the case when addressing consumer demand. It is also possible that
early confirmed orders may give a good indication of final orders. This is particularly useful when there is a strong correlation between customers' demands.
The five factors, and their effect on intermittence, erraticness and lumpiness,
are summarised in Figure 20.2.
Fig. 20.2. Categorization based on the sources of demand characteristics: numerousness of customers and frequency of individual orders drive intermittence; heterogeneity of customers and variety of customers' requests drive erraticness; intermittence and erraticness together produce lumpiness
For those items without early indicators, forecasting must be undertaken using
a purely time-series approach. This is usually linked to a demand distribution, so
that inventory levels may be set to achieve high percentage service level targets.
Many inventory management systems make distributional assumptions of demand
according to the ABC classification. For example, A and B items may be taken to
be normally distributed, whilst C items are assumed to be Poisson. In practice,
however, many service parts have demand that is more erratic than Poisson
(sometimes known as over-dispersed). The Poisson dispersion index (ratio of the
variance to the mean of demand, including zero demands) can be used to classify
SKUs as Poisson or non-Poisson. If the index is close to unity, then a Poisson
distribution is indicated; if the index is greater than unity, then other distributions,
such as the negative binomial, may be more appropriate, or a non-parametric
approach may be required, as discussed in Section 20.4.
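This screening step can be sketched as follows; the 12-month demand histories are hypothetical, and treating an index "close to unity" as Poisson is a judgement left to the analyst:

```python
from statistics import mean, pvariance

def dispersion_index(history):
    """Poisson dispersion index: ratio of the variance to the mean of
    demand per period, with zero-demand periods included."""
    return pvariance(history) / mean(history)

# Hypothetical monthly demand histories for two SKUs
sku_a = [2, 0, 1, 3, 0, 2, 1, 0, 2, 1, 0, 2]     # index close to 1: Poisson plausible
sku_b = [0, 0, 12, 0, 0, 0, 9, 0, 0, 14, 0, 0]   # index well above 1: over-dispersed,
                                                 # e.g. negative binomial instead
```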
An obvious way to classify service parts is by frequency of demand. As demand
occurrence becomes more infrequent, with some periods having no demand at all, a
number of difficulties emerge. From a forecasting perspective, methods such as
The SKUs are divided into four quadrants by two break-points: a mean inter-demand interval of p = 1.34 and a squared coefficient of variation of demand sizes of CV² = 0.28. Smooth SKUs (low p, low CV²) are best served by SES; erratic SKUs (low p, high CV²), intermittent SKUs (high p, low CV²) and lumpy SKUs (high p, high CV²) are best served by Croston's method.
Fig. 20.3. Categorisation of SKUs by forecast accuracy
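The quadrants of Figure 20.3 translate directly into a classification rule. The function below is a sketch using the break-points shown in the figure (p = 1.34, CV² = 0.28); the category names and method recommendations follow the figure:

```python
def categorize_sku(p, cv2):
    """Assign a demand category from the mean inter-demand interval (p)
    and the squared coefficient of variation of demand sizes (cv2),
    using the break-points p = 1.34 and cv2 = 0.28."""
    if p <= 1.34:
        return "smooth" if cv2 <= 0.28 else "erratic"      # SES / Croston
    return "intermittent" if cv2 <= 0.28 else "lumpy"      # Croston / Croston
```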
In summary, service parts may be in the initial, normal or final phases of their
life cycle. In this chapter, we focus attention on the normal phase. Although service
parts may be classified as A, B or C in a Pareto analysis, it is likely that most parts
will be categorized as C. The service requirements for the part may be guided by
criticality and cost considerations, as well as the ABC classification. Further
refinements are necessary to the Pareto classification in order to allocate the most
appropriate forecasting methods to each SKU.
Enhancement of the ABC classification in the manner described above gives a coherent approach to classification according to forecasting performance and a foundation for theoretically informed usage of terms such as "erratic", as shown in Figure 20.4 (after Syntetos 2001, adapted by Boylan et al. 2006).
Figure 20.4 defines the categories as follows: intermittent demand has a high mean inter-demand interval; slow demand is intermittent and has a low mean demand size; erratic demand has a high coefficient of variation of demand sizes; lumpy demand is both intermittent and erratic; clumped demand is intermittent and non-erratic (low coefficient of variation of demand sizes).
the variance and the average of the demand history data of each item. The resulting distribution of demand per period was called a "package Poisson" distribution. The same distribution has appeared in the literature under the name "hypothetical SKU (h-SKU) Poisson" distribution (Williams 1984), where demand is treated as if it occurs as a multiple of some constant, or "clumped Poisson" distribution, for multiple item orders for the same SKU of a fixed "clump" size (Ritchie and Kingsman 1985) (please also refer to Figure 20.4, where a definition of clumped demand is offered). In an earlier work, Friend (1960) also discussed the use of a Poisson distribution for demand occurrence, combined with demands of constant size. The package Poisson distribution requires, like the Poisson distribution itself, an estimate of the mean demand only.
If demand occurs as a Bernoulli process and orders follow the logarithmic-Poisson distribution (which is not the same as the Poisson-logarithmic process that yields NBD demand), then the resulting distribution of total demand per period is the log-zero-Poisson (Kwan 1991). The log-zero-Poisson is a three-parameter distribution and requires a rather complicated estimation method. Moreover, it was found by Kwan (1991) to be empirically outperformed by the NBD. Hence, the log-zero-Poisson cannot be recommended for practical applications. One other compound binomial distribution that has appeared in the literature is that involving normally distributed demand sizes (Croston 1972, 1974). However, as discussed above, a normality assumption is unrealistic and therefore the distribution is not recommended for practical applications.
20.3.2 Estimation of Mean Demand: Size Interval Methodology
Single exponential smoothing (SES) and simple moving averages (SMA) are often used in practice to forecast intermittent demand. Both methods have been shown to perform satisfactorily on real service parts data. However, the standard forecasting method for such items is considered to be Croston's method (Croston 1972, as corrected by Rao 1973). Croston suggested treating the size of orders (z_t) and the intervals between them (p_t) as two separate series and combining their exponentially weighted moving averages (obtained using SES) to achieve a forecast of the demand per period. (Recently, some adaptations of Croston's method have appeared in the literature that rely upon SMA rather than SES estimates; such modifications are further discussed later in this section.)
In Croston's work, both demand sizes and intervals were assumed to have constant means and variances, for modelling purposes, and demand sizes and demand intervals were assumed to be mutually independent. Demand was assumed to occur as a Bernoulli process. Consequently, the inter-demand intervals are geometrically distributed (with mean p). The demand sizes were assumed to follow the normal distribution (with mean μ and variance σ²).
These assumptions have been challenged in respect of their realism (see, for example, Willemain et al. 1994) and they have also been challenged in respect of their theoretical consistency with Croston's forecasting method. The latter issue is further discussed in Section 20.3.3.
Croston's method works in the following way: SES estimates of the average size of the demand (ẑ_t) and the average interval between demand incidences (p̂_t) are made after demand occurs (using the same smoothing constant value, α). If no demand occurs, the estimates remain exactly the same. The forecast of demand per period (Ŷ_t) is given by Ŷ_t = ẑ_t / p̂_t. If demand occurs in every time period, Croston's estimator is identical to SES. For constant lead times of length L, the mean lead-time demand estimate (Ŷ_L) is then obtained as follows:

Ŷ_L = L Ŷ_t    (20.1)

The SES updates, applied only in periods where a demand of size z_t occurs after an interval of p_t periods, are:

ẑ_t = ẑ_{t-1} + α(z_t - ẑ_{t-1})    (20.2a)

p̂_t = p̂_{t-1} + α(p_t - p̂_{t-1})    (20.2b)
According to Croston, the expected estimate of demand per period in that case would be E(Ŷ_t) = E(ẑ_t / p̂_t) = E(ẑ_t) / E(p̂_t) = μ/p (i.e. the method is unbiased). If it is assumed that the estimators of demand size and demand interval are independent, then

E(ẑ_t / p̂_t) = E(ẑ_t) E(1 / p̂_t)    (20.3)

but

E(1 / p̂_t) ≠ 1 / E(p̂_t)    (20.4)

and therefore Croston's method is biased. It is clear that this result does not depend on Croston's assumptions of stationarity and geometrically distributed demand intervals.
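The size-interval scheme just described can be sketched in code as follows. The initialisation choices (first observed demand and elapsed interval) are assumptions, and the optional bias correction shown is the Syntetos-Boylan factor 1 - α/2 discussed in the surrounding text:

```python
def croston(demand, alpha=0.1, corrected=False):
    """Croston's size-interval method on a per-period demand history
    (zeros allowed). Returns one forecast of demand per period for each
    period, made after observing that period. corrected=True multiplies
    the ratio by the bias-correction factor 1 - alpha/2."""
    z_hat = p_hat = None   # smoothed demand size and inter-demand interval
    q = 1                  # periods elapsed since the last demand
    out = []
    for d in demand:
        if d > 0:
            if z_hat is None:                # initialise on the first demand
                z_hat, p_hat = float(d), float(q)
            else:                            # SES updates, same alpha for both series
                z_hat += alpha * (d - z_hat)
                p_hat += alpha * (q - p_hat)
            q = 1
        else:
            q += 1                           # no update in a zero-demand period
        f = 0.0 if z_hat is None else z_hat / p_hat
        out.append(f * (1 - alpha / 2) if corrected else f)
    return out
```

On a series with demand in every period the estimator coincides with SES, as the text notes; the corrected variant simply shrinks every ratio by 1 - α/2.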
More recently, Boylan and Syntetos (2003), Syntetos and Boylan (2005) and Shale et al. (2006) presented correction factors to overcome the bias associated with Croston's approach. Some of these papers discuss: i) Croston's application under a Poisson demand arrival process and ii) estimation of demand sizes and intervals using an SMA (using the ratio of the former to the latter as an estimate of demand per period). The correction factors are summarized in Table 20.1 (where k is the length of the moving average and α is the smoothing constant for SES).
Table 20.1. Correction factors for the bias of Croston's method

Demand arrivals   SES estimation                              SMA estimation
Bernoulli         1 - α/2 (Syntetos and Boylan 2005)          k/(k + 1) (Boylan and Syntetos 2003)
Poisson           2(1 - α)/(2 - α) (Shale et al. 2006)        (k - 1)/k (Shale et al. 2006)
At this point it is important to note that SMA and SES are often treated as equivalent when the average age of the data in the estimates is the same (Brown 1963). A relationship links the number of points in an arithmetic average (k) with the smoothing parameter of SES (α) for stationary demand. Hence it may be used to relate the correction factors presented in Table 20.1 for each of the two demand generation processes considered. The linking equation is

k = (2 - α)/α    (20.5)
ing attention to models implying Croston's method. For example, Poisson autoregressive models have been suggested as potentially useful by Shenstone and Hyndman (2005).
When using SES, under the above model formulation, the standard deviation of the lead-time forecast error was shown (Johnston and Harrison 1986) to be correctly calculated as follows:

σ_L = √[L + α(L - 1)L(1 + α(2L - 1)/6)] σ_t    (20.9)
Under the stationary mean model assumption (the demand level is assumed to
be constant) the forecast error correlation still exists because of the uncertainty
associated with the variance of the forecasts, which is carried forward from one
period to another. In addition, if a biased estimator is in place to forecast future
demand requirements, the auto-correlation can also be attributed to the bias. This
issue has been analytically addressed by Strijbosch et al. (2000) and Syntetos et al.
(2005).
argue that slow-moving service parts should attract a higher percentage cost, since
these parts are at the highest risk of obsolescence.
Service level is generally interpreted as "off the shelf" availability, but the way in which it is measured varies. Three common measures are defined as follows (Silver et al. 1998):
The fill rate is probably the measure with the greatest appeal to practitioners,
since it relates most directly to customer satisfaction. Care is needed with its application, since different results are obtained if it is calculated over a lead-time or
over all time. If unsatisfied demand is back-ordered, Brown (1967) showed that

P2 = 1 - (LS/Q)(1 - P2^LT)    (20.10)

where P2^LT is the measure over the lead-time, P2 is the measure over all time, L is the lead-time, S is total demand in a year, Q is the order quantity and it is assumed that LS < Q.
Ronen (1982) showed that, if unsatisfied demand is lost, then
P2 = 1/[(LS/Q)(1 − P2LT) + 1]   (20.11)
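One plausible reading of Equations 20.10 and 20.11 is sketched below; the numerical values in the usage comment are purely illustrative, not data from the chapter.

```python
def fill_rate_backorder(p2_lt, L, S, Q):
    # Eq. 20.10 (Brown 1967): all-time fill rate when unsatisfied demand
    # is back-ordered; assumes L*S < Q.
    return 1 - (L * S / Q) * (1 - p2_lt)

def fill_rate_lost_sales(p2_lt, L, S, Q):
    # Eq. 20.11 (Ronen 1982): all-time fill rate when unsatisfied demand
    # is lost.
    return 1 / ((L * S / Q) * (1 - p2_lt) + 1)

# Illustrative values: 90% lead-time fill rate, lead-time 0.1 years,
# annual demand 100 units, order quantity 50 units.
p2_backorder = fill_rate_backorder(0.9, 0.1, 100, 50)
p2_lost = fill_rate_lost_sales(0.9, 0.1, 100, 50)
```

Note that with back-ordering the all-time measure is a linear correction of the lead-time measure, whereas with lost sales the relation is a ratio.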
These measures are based on the fraction of units satisfied from stock. Some
organizations also use measures that relate to the successful completion of an
order-line for a number of units of the same SKU. Typically, these are based on
the fraction of order-lines completely satisfied (partial satisfaction does not count).
Boylan and Johnston (1994) identified relationships between such measures and
fill-rates.
In addition to these standard measures, other suggestions have been made.
Gardner (1990) recommended the use of trade-off curves, showing the effect of
inventory investment on the average delay in filling backorders. Separate curves
are drawn for each forecasting method, allowing the manager to see at a glance if
one method dominates the others.
Sani and Kingsman (1997) proposed the use of average regret measures. The
service regret is the amount each method falls short of the maximum service level
over all methods for that SKU. (The method may be a forecasting or inventory
method.) The regret is then divided by the maximum service level and the ratios
are averaged across all SKUs. A cost regret measure is defined similarly. This
approach allows more detailed assessments of the interaction between forecasting
and inventory methods. Eaves and Kingsman (2004) suggested assessment of forecast performance according to implied stock-holdings. These are based on a calculation of the exact safety margin providing a maximum stock-out of zero. The
advantage of this approach is that it gives monetary values of stock-savings. However, these savings may not be achieved in practice using a standard stock control
method based on the mean and variance of lead-time forecasts.
Whilst it is essential to assess stock-holding costs and service levels, it is also
important to be able to diagnose the reasons for any deterioration in these measures.
Boylan and Syntetos (2006) argue that, since this may arise as a result of forecasting
methods or inventory rules (see Figure 20.5), the accuracy of forecasting methods
should also be monitored.
Figure 20.5. The forecasting method and the inventory rules feed the stock
management system, which determines the stock-holding costs and the service level
MAPE = (100/n) Σt=1..n |Yt − Ŷt|/Yt   (20.12)

sMAPE = (100/n) Σt=1..n |Yt − Ŷt| / [(|Yt| + |Ŷt|)/2]   (20.13)
ME = (1/n) Σt=1..n (Yt − Ŷt)   (20.14)
This measure is simple to interpret: if it is close to zero, then the forecast method
is unbiased; negative values indicate that forecasts are consistently too high, while
positive values show that forecasts are too low. Its use is recommended for intermittent service parts.
If a forecast method has high forecast errors, but is approximately unbiased, then
the positive and negative errors cancel one another out, yielding a mean error close
to zero. To capture the degree of error, regardless of sign, other error measures are
required. The mean absolute error (MAE) is often used for an individual SKU and is
defined as follows:
MAE = (1/n) Σt=1..n |Yt − Ŷt|   (20.15)
This error measure should not be averaged over a whole set of parts, since it
may be dominated by a few SKUs with large errors. To avoid this problem, four
alternatives have been suggested: the MAE: mean ratio, the geometric mean absolute error, the percentage better measure and the mean absolute scaled error.
Each of these measures will be reviewed in turn.
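The two single-SKU measures just defined (Equations 20.14 and 20.15) can be sketched as follows; the demand history and forecasts are made up for illustration.

```python
def mean_error(actuals, forecasts):
    # Eq. 20.14: signed mean error; near zero suggests an approximately
    # unbiased method, negative values mean forecasts are consistently too high.
    return sum(a - f for a, f in zip(actuals, forecasts)) / len(actuals)

def mean_absolute_error(actuals, forecasts):
    # Eq. 20.15: mean absolute error for a single SKU.
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)

# An intermittent demand history and a constant forecast of one unit:
demand = [0, 3, 0, 0, 2]
forecast = [1, 1, 1, 1, 1]
me = mean_error(demand, forecast)
mae = mean_absolute_error(demand, forecast)
```

Here the positive and negative errors cancel exactly, so the mean error is zero even though the MAE is 1.2, illustrating the point made above about unbiased methods with high errors.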
Hoover (2006) proposed the application of the MAE: mean ratio for intermittent demand:

MAE:Mean = [(1/n) Σt=1..n |Yt − Ŷt|] / [(1/n) Σt=1..n Yt] = Σt=1..n |Yt − Ŷt| / Σt=1..n Yt   (20.16)
cism stands. Therefore, the MAE: mean ratio can be recommended for non-trended
intermittent service parts.
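A minimal sketch of the MAE: mean ratio (Equation 20.16), where the 1/n factors cancel, leaving a ratio of sums:

```python
def mae_mean_ratio(actuals, forecasts):
    # Eq. 20.16 (Hoover 2006): MAE scaled by mean demand; because both the
    # MAE and the mean divide by n, the ratio reduces to a ratio of sums.
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / sum(actuals)
```

Scaling by mean demand makes the measure comparable across SKUs with different demand volumes, which is why it can be averaged over a set of non-trended intermittent parts.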
A second alternative to the MAE is the geometric mean absolute error (GMAE)
defined below; for a single series:
GMAE = (Πt=1..n |Yt − Ŷt|)^(1/n)   (20.17)
This can be generalized across series by taking the geometric mean again to
obtain the geometric mean (across series) of the geometric mean (across time) of
the absolute errors (GMGMAE):
GMGMAE = [Πi=1..N (Πt=1..n |Yit − Ŷit|)^(1/n)]^(1/N)   (20.18)
where Yit is the observation for the i-th SKU at time t, Ŷit is the forecast of
demand for the i-th SKU at time t, and N is the number of SKUs.
An outlying observation, producing a large error by any statistical method, will
affect the GMAE similarly for all methods, and so the ratio of the GMAE for one
method to another will be robust to outliers. (The same robustness property applies
to the GMGMAE). This was first shown by Fildes (1992), using a general argument, and applied to intermittent data by Syntetos and Boylan (2005). In fact, these
authors used a slightly more complex measure, the geometric root mean square
error (GRMSE); however, Hyndman (2006) pointed out that the GRMSE and the
GMAE are identical.
Although the measure is robust to outliers, it is sensitive to zero errors (Boylan
and Syntetos 2006). Just one exact forecast will yield a zero error and a zero
GMAE, regardless of the size of the other errors. This problem may be overcome,
for stationary errors, by using the geometric mean (across series) of the arithmetic
mean (across time) of the absolute errors (GMAMAE):
GMAMAE = [Πi=1..N ((1/ni) Σt=1..ni |Yit − Ŷit|)]^(1/N)   (20.19)
This measure collapses to zero only if a series has zero forecast errors for all
periods of time, and so is more robust to zero errors than the GMGMAE. The
measure is also robust to occasional large forecast errors, provided the remaining
errors are stable, and are not unduly affected by trend or seasonality. It can therefore be recommended for application in these cases.
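The geometric-mean measures above can be sketched as follows; note how a single exact forecast drives the GMAE of a series to zero, while the GMAMAE only collapses if a whole series is forecast exactly.

```python
import math

def gmae(actuals, forecasts):
    # Eq. 20.17: geometric mean (across time) of absolute errors.
    errors = [abs(a - f) for a, f in zip(actuals, forecasts)]
    return math.prod(errors) ** (1 / len(errors))

def gmamae(series):
    # Eq. 20.19: geometric mean (across SKUs) of the arithmetic mean
    # (across time) of absolute errors; `series` is a list of
    # (actuals, forecasts) pairs, one per SKU.
    maes = [sum(abs(a - f) for a, f in zip(act, fc)) / len(act)
            for act, fc in series]
    return math.prod(maes) ** (1 / len(maes))
```

Taking the arithmetic mean across time before the geometric mean across SKUs is what gives the GMAMAE its robustness to occasional zero errors.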
Another approach, which is simple to use and interpret, is the percentage better
method. According to this approach, for each service part, one forecast method is
compared to another according to a criterion such as mean error or geometric root
mean square error (Syntetos and Boylan 2005). The percentage better shows the
percentage of series for which one method has the lower error. This approach is
robust to large forecast errors and the results can be subjected to formal statistical
tests (Syntetos 2001). It is a useful measure, although it does not quantify the
degree of improvement in forecast error.
Hyndman (2006) recently suggested a new error measure for intermittent demand. This measure, known as the mean absolute scaled error (MASE), is defined
as follows:
MASE = mean(|qt|)   (20.20a)

qt = (Yt − Ŷt) / [(1/(n − 1)) Σi=2..n |Yi − Yi−1|]   (20.20b)
The errors are scaled based on the in-sample MAE from the naïve forecasting
method (i.e. the forecast for the next period is this period's observation). The
measure is robust to outliers, and is valid for all non-constant series.
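Under the reading of Equations 20.20a-b given above, the MASE calculation can be sketched as:

```python
def mase(actuals, forecasts):
    # Scale: in-sample MAE of the naive forecast (Eq. 20.20b denominator),
    # i.e. the mean absolute one-step change in the series.
    n = len(actuals)
    scale = sum(abs(actuals[i] - actuals[i - 1]) for i in range(1, n)) / (n - 1)
    # Mean of the absolute scaled errors (Eq. 20.20a).
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / (n * scale)
```

A MASE below one indicates a method that, on this scaling, outperforms the in-sample naïve forecast.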
Hyndman (2006) gave an example of the application of the MASE on
intermittent data from a major Australian lubricant manufacturer. He compared the
out-of-sample MASE of four methods: naïve, overall mean, single exponential
smoothing and Croston's method. The naïve method has the lowest MASE. This
result is valid statistically, but is counter-intuitive from an inventory-management
perspective. Boylan and Syntetos (2006) commented that the naïve method is
sensitive to large demands and will generate high forecasts. Its use will almost
certainly lead to over-stocking and possibly to obsolescence. This example highlights the danger of relying on statistical error measures alone. As noted earlier in
this section, attention should always be paid to the stock-holding and service
implications of different forecasting methods. Improvements in forecasting accuracy do not necessarily translate into improved stock-control performance. However, if stock-control performance has deteriorated, then forecast error measures
can be used to diagnose problems with forecasting methods, and to suggest alternatives.
geometric and the negative exponential distribution were found to provide a good fit
to the demand patterns observed. The geometric distribution was also found to be a
reasonable approximation to the distribution of inter-demand intervals, for real
demand data, by Dunsmuir and Snyder (1989) and Willemain et al. (1994). Janssen
(1998) tested the Bernoulli demand generation process on a set of empirical data
obtained from a Dutch wholesaler of fasteners. The results indicated that the
Bernoulli demand generation process is a reasonable approximation for intermittent
demand processes. Finally, Eaves (2002) examined the demand patterns associated
with 6795 service parts from the Royal Air Force (UK). The findings of this detailed
study provide support for both Poisson and Bernoulli processes. In particular, the
geometric distribution was found to provide a statistically significant fit (5% significance level) to 91% of his sample whereas the negative exponential distribution
fitted 88% of the demand histories examined.
Kwan (1991) tested the empirical fit of the log-zero-Poisson (lzP) and negative
binomial (NBD), amongst other possible underlying demand distributions. The
NBD was found to be the best, fitting 90% of the SKUs. Boylan (1997) tested the
goodness-of-fit of four demand distributions (NBD, lzP, condensed negative binomial distribution (CNBD) and gamma distribution) on real demand data. The CNBD
arises if we consider a condensed Poisson incidence distribution (censored Poisson
process in which only every second event is recorded) assuming that the mean rate
of demand incidence is not constant, but varies according to a gamma distribution.
The empirical sample used for testing goodness-of-fit contained the six-month
histories of 230 SKUs, demand being recorded weekly. The analysis showed strong
support for the NBD. The results for the gamma distribution were also encouraging,
although, for slow-moving SKUs, not as good as those for the NBD.
service inventory items. They concluded that the bootstrap method was the most
accurate forecasting method and that Croston's method had no significant advantage over SES. As discussed in Section 20.4, some reservations have been
expressed regarding the study's methodology. Nevertheless, the bootstrapping
approach is intuitively appealing for very lumpy demand items. More empirical
studies are needed to substantiate its forecast accuracy in comparison with other
methods.
Syntetos and Boylan (2005) conducted an empirical investigation to compare
the forecast accuracy of SMA, SES, Croston's method and a bias-corrected
adaptation of Croston's estimator (termed the Syntetos-Boylan approximation,
SBA; please refer to Table 20.1). The forecast accuracy of these methods was
tested, using a wide range of forecast accuracy metrics, on 3000 service parts from
the automotive industry. The results demonstrated quite conclusively the superior
forecasting performance of the SBA method. In a later project, Syntetos and
Boylan (2006) assessed the empirical stock control implications of the same estimators on the same 3000 SKUs. The results demonstrated that the increased
forecast accuracy achieved by using the SBA method (also known as the Approximation Method) is translated to a better stock control performance (service level
achieved and stock volume differences). A similar finding was reported in an
earlier research project conducted by Eaves and Kingsman (2004). They compared
the empirical stock control performance (implied stock holdings given a specified
service level) of the estimators discussed above on 18750 service parts from the
Royal Air Force (UK). They concluded that "the best forecasting method for a
spare parts inventory is deemed to be the approximation method" (Eaves and
Kingsman 2004, p 436).
20.7 Conclusions
Service parts, particularly those subject to corrective maintenance, present a
considerable challenge for both forecasting and inventory management. If stocking
decisions are made injudiciously, then the result will be poor service or excessive
stock-holdings, possibly leading to obsolescence. Conversely, effective forecasting
and stock control will lead to cost savings and improved customer service.
A number of stock-control methods may be employed for slow-moving service
parts. Sani and Kingsman (1997) recommended the (R, s, S) policy, based on its
inventory cost and service performance in an empirical study. However, empirical
evidence is not extensive, and further research is needed in this area.
Classification of service parts is an essential element in their management. Four
purposes are served by classification:
Determining service level requirements may be supported by a criticality classification, undertaken using management judgment or a more formal approach, such
as an assessment of the risk and severity of a part failure. Alternatively, a Pareto
classification may be used as a proxy for criticality, with the A items being deemed
the most important. A variation on this approach is to use a matrix of cost and sales
volume to determine service requirements.
Inventory decisions relate directly to a product life cycle classification. Initial
provisioning decisions must be taken during the initial phase of the life cycle.
Decisions should be taken regarding stocking locations, too. In the normal phase,
the inventory policy and the appropriate parameters must be determined. The inventory rules depend on forecasts of demand over lead-time, so the most appropriate forecasting method should be chosen. In the final phase, an all-time buy
requires a decision on the final order quantity.
The product life cycle can also be used to help determine the forecasting approach: causal or time-series. Causal methods are often used in the initial phase,
because of the lack of data on demand history. In the normal phase, causal methods
also have an important role, if data on explanatory variables are available. Some
models have been proposed, for example, that link the sales rate with the renewal
function associated with the part replacement in order to derive the demand for
spares (Blischke and Murthy 1994). If such data are not available, then time-series
methods are used, usually based on exponential smoothing. In the final phase,
regression-based extrapolations have been recommended, assuming an exponential
decline of demand.
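A sketch of such a regression-based extrapolation: fit log(demand) = a + b·t by least squares and extrapolate exp(a + b·t). The data and function names are hypothetical, and strictly positive demand observations are assumed.

```python
import math

def fit_exponential_decline(periods, demands):
    # Least-squares fit of log(demand) on time, so demand_t ~ exp(a) * exp(b*t),
    # with b < 0 for declining demand. Assumes all demands are positive.
    n = len(periods)
    logs = [math.log(d) for d in demands]
    t_bar = sum(periods) / n
    y_bar = sum(logs) / n
    b = (sum((t - t_bar) * (y - y_bar) for t, y in zip(periods, logs))
         / sum((t - t_bar) ** 2 for t in periods))
    a = y_bar - b * t_bar
    return a, b

def extrapolate(a, b, t):
    # Demand forecast for a future period t on the fitted exponential curve.
    return math.exp(a + b * t)
```

Fitting in log space makes the decline rate a single regression slope, at the cost of requiring positive observations in every fitted period.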
A further aim of classification, in the normal phase, is to determine the most
appropriate forecasting method. By examining the sources of intermittence and
erraticness of demand, it may be possible to identify parts with few customers and
to forecast using advance information or to predict responses to promotional
activity. For SKUs where this is not possible, two approaches have been proposed:
bootstrapping and distribution-based. In the former case, no distributional assumptions are made and the lead-time distribution of demand is generated by re-sampling from previous observations. In the latter case, a demand distribution must be
determined and its parameters estimated. Classification by the shape of the demand
distribution allows the system to determine whether the Poisson, the compound
Poisson or some other distribution should be used. Classification by demand
frequency and by demand size variability allows the system to choose between
smoothing methods such as single exponential smoothing (for non-intermittent,
non-lumpy data) and methods such as Croston's (for intermittent or lumpy data).
The time-series method that should be employed for service parts depends on
the characteristics of the demand pattern. For non-intermittent demand, exponential
smoothing methods should be employed, with appropriate variants being chosen
for trended, damped trended and seasonal data. For intermittent demand series,
Croston's method is the standard approach and has been adopted by a number of
forecast packages. The performance of Crostons method can be improved by
applying an appropriate adjustment factor to reduce the bias of the forecast. This
has been shown by Eaves and Kingsman (2004) and by Syntetos and Boylan
(2006) to improve the inventory performance of the system.
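The adjustment mentioned above can be illustrated with a minimal sketch of Croston's method plus the Syntetos-Boylan bias correction (the (1 − α/2) factor of Table 20.1); the initialisation choices here are illustrative assumptions, not the chapter's.

```python
def croston_sba(demand, alpha=0.1):
    # Croston's method: smooth non-zero demand sizes (z) and inter-demand
    # intervals (p) separately; per-period forecast = z / p.
    # SBA multiplies the ratio by (1 - alpha/2) to reduce its bias.
    z = p = None       # smoothed demand size and inter-demand interval
    q = 1              # periods since the last non-zero demand
    for d in demand:
        if d > 0:
            if z is None:              # initialise at first demand (assumption)
                z, p = float(d), float(q)
            else:
                z += alpha * (d - z)
                p += alpha * (q - p)
            q = 1
        else:
            q += 1
    if z is None:                      # no demand observed yet
        return 0.0
    return (1 - alpha / 2) * z / p
```

Only updating the estimates in periods with demand is what distinguishes Croston's method from applying SES directly to the raw intermittent series.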
20.8 References
Bartezzaghi E, Verganti R, Zotteri G, (1996) A framework for managing uncertain lumpy
demand. Paper presented at the 9th International Symposium on Inventories, Budapest,
Hungary
Blischke WR, Murthy DNP, (1994) Warranty cost analysis. Marcel Dekker, Inc., New York
Boylan JE, (1997) The centralisation of inventory and the modelling of demand. Unpublished
Ph.D. Thesis, University of Warwick, UK
Boylan JE, Johnston FR, (1994) Relationships between service level measures for inventory
systems. Journal of the Operational Research Society 45: 838–844
Boylan JE, Syntetos AA, (2003) Intermittent demand forecasting: size-interval methods
based on average and smoothing. Proceedings of the International Conference on
Quantitative Methods in Industry and Commerce, Athens, Greece
Boylan JE, Syntetos AA, (2006) Accuracy and accuracy-implication metrics for intermittent
demand. Foresight: the International Journal of Applied Forecasting 4: 39–42
Boylan JE, Syntetos AA, Karakostas GC, (2006) Classification for forecasting and stock-control: a case-study. Journal of the Operational Research Society: in press
Brown RG, (1963) Smoothing, forecasting and prediction of discrete time series. Prentice-Hall, Inc., Englewood Cliffs, N.J.
Brown RG, (1967) Decision rules for inventory management. Holt, Rinehart and Winston,
Chicago
Burgin TA, (1975) The gamma distribution and inventory control. Operational Research
Quarterly 26: 507–525
Burgin TA, Wild AR, (1967) Stock control experience and usable theory. Operational
Research Quarterly 18: 35–52
Croston JD, (1972) Forecasting and stock control for intermittent demands. Operational
Research Quarterly 23: 289–304
Croston JD (1974) Stock levels for slow-moving items. Operational Research Quarterly 25:
123–130
Department of Defense USA, (1980) Procedures for performing a Failure Mode, Effects and
Criticality Analysis. MIL-STD-1629A
Dunsmuir WTM, Snyder RD, (1989) Control of inventories with intermittent demand.
European Journal of Operational Research 40: 16–21
Eaves AHC, (2002) Forecasting for the ordering and stock holding of consumable spare
parts. Unpublished Ph.D. thesis, Lancaster University, UK
Eaves A, Kingsman BG, (2004) Forecasting for ordering and stock holding of spare parts.
Journal of the Operational Research Society 55: 431–437
Ehrhardt R, Mosier C, (1984) A revision of the power approximation for computing (s, S)
inventory policies. Management Science 30: 618–622
Fildes R, (1992) The evaluation of extrapolative forecasting methods. International Journal
of Forecasting 8: 81–98
Fortuin L, (1980) The all-time requirements of spare parts for service after sales –
theoretical analysis and practical results. International Journal of Operations and
Production Management 1: 59–69
Fortuin L, Martin H, (1999) Control of service parts. International Journal of Operations and
Production Management 19: 950–971
Friend JK, (1960) Stock control with random opportunities for replenishment. Operational
Research Quarterly 11: 130–136
Gallagher DJ, (1969) Two periodic review inventory models with backorders and stuttering
Poisson demands. AIIE Transactions 1: 164–171
Gardner ES, (1990) Evaluating forecast performance in an inventory control system.
Management Science 36: 490–499
Gardner ES, Koehler AB, (2005) Correspondence: Comments on a patented bootstrapping
method for forecasting intermittent demand. International Journal of Forecasting 21:
617–618
Ghobbar AA, Friend CH, (2002) Sources of intermittent demand for aircraft spare parts
within airline operations. Journal of Air Transport Management 8: 221–231
Ghobbar AA, Friend CH, (2003) Evaluation of forecasting methods for intermittent parts
demand in the field of aviation: a predictive model. Computers and Operations Research
30: 2097–2114
Hoover J, (2006) Measuring forecast accuracy: omissions in today's forecasting engines and
demand-planning software. Foresight: the International Journal of Applied Forecasting
4: 32–35
Hua ZS, Zhang B, Yang J, Tan DS, (2006) A new approach of forecasting intermittent
demand for spare parts inventories in the process industries. Journal of the Operational
Research Society: in press
Hyndman RJ, (2006) Another look at forecast-accuracy metrics for intermittent demand.
Foresight: the International Journal of Applied Forecasting 4: 43–46
Janssen FBSLP, (1998) Inventory management systems; control and information issues.
Published Ph.D. thesis, Centre for Economic Research, Tilburg University, The
Netherlands
Johnston FR, (1980) An interactive stock control system with a strategic management role.
Journal of the Operational Research Society 31: 1069–1084
Johnston FR, Boylan JE, (1996) Forecasting for items with intermittent demand. Journal of
the Operational Research Society 47: 113–121
Johnston FR, Harrison PJ, (1986) The variance of lead-time demand. Journal of the
Operational Research Society 37: 303–308
Kalchschmidt M, Verganti R, Zotteri G, (2006) Forecasting demand from heterogeneous
customers. International Journal of Operations and Production Management 26: 619–638
Kwan HW, (1991) On the demand distributions of slow moving items. Unpublished Ph.D.
thesis, Lancaster University, UK
Makridakis S, (1993) Accuracy measures: theoretical and practical concerns. International
Journal of Forecasting 9: 527–529
Makridakis S, Hibon M, (2000) The M3-Competition: results, conclusions and implications.
International Journal of Forecasting 16: 451–476
Naddor E, (1975) Optimal and heuristic decisions in single and multi-item inventory
systems. Management Science 21: 1234–1249
Quenouille MH, (1949) A relation between the logarithmic, Poisson and negative binomial
series. Biometrics 5: 162–164
Rao AV, (1973) A comment on: Forecasting and stock control for intermittent demands.
Operational Research Quarterly 24: 639–640
Ritchie E, Kingsman BG, (1985) Setting stock levels for wholesaling: performance
measures and conflict of objectives between supplier and stockist. European Journal of
Operational Research 20: 17–24
Ronen D, (1982) Measures of product availability. Journal of Business Logistics 3: 45–58
Sani B, Kingsman BG, (1997) Selecting the best periodic inventory control and demand
forecasting methods for low demand items. Journal of the Operational Research Society
48: 700–713
Shale EA, Boylan JE, Johnston FR, (2006) Forecasting for intermittent demand: the
estimation of an unbiased average. Journal of the Operational Research Society 57:
588–592
Shenstone L, Hyndman RJ, (2005) Stochastic models underlying Croston's method for
intermittent demand forecasting. Journal of Forecasting 24: 389–402
Silver EA (1970) Some ideas related to the inventory control of items having erratic demand
patterns. CORS Journal 8: 87–100
Silver EA, Pyke DF, Peterson R, (1998) Inventory management and production planning
and scheduling (3rd edition). John Wiley & Sons, New York
Snyder R, (2002) Forecasting sales of slow and fast moving inventories. European Journal
of Operational Research 140: 684–699
Strijbosch LWG, Heuts RMJ, van der Schoot EHM, (2000) A combined forecast-inventory
control procedure for spare parts. Journal of the Operational Research Society 51:
1184–1192
Syntetos AA, (2001) Forecasting of intermittent demand. Unpublished PhD Thesis,
Buckinghamshire Chilterns University College, Brunel University, UK
Syntetos AA, Boylan JE, (2001) On the bias of intermittent demand estimates. International
Journal of Production Economics 71: 457–466
Syntetos AA, Boylan JE, (2005) The accuracy of intermittent demand estimates.
International Journal of Forecasting 21: 303–314
Syntetos AA, Boylan JE, Croston JD, (2005) On the categorization of demand patterns.
Journal of the Operational Research Society 56: 495–503
Syntetos AA, Boylan JE (2006) On the stock control performance of intermittent demand
estimators. International Journal of Production Economics 103: 36–47
Teunter RH, (1998) Inventory control of service parts in the final phase. Published PhD
Thesis, University of Groningen, The Netherlands
Teunter RH, Fortuin L, (1998) End-of-life-service: a case-study. European Journal of
Operational Research 107: 19–34
Vereecke A, Verstraeten P, (1994) An inventory management model for an inventory
consisting of lumpy items, slow movers and fast movers. International Journal of
Production Economics 35: 379–389
Ward JB, (1978) Determining re-order points when demand is lumpy. Management Science
24: 623–632
Watson RB, (1987) The effects of demand-forecast fluctuations on customer service and
inventory cost when demand is lumpy. Journal of the Operational Research Society 38:
75–82
Willemain TR, Smart CN, Shockor JH, DeSautels PA, (1994) Forecasting intermittent
demand in manufacturing: a comparative evaluation of Croston's method. International
Journal of Forecasting 10: 529–538
Willemain TR, Smart CN, Schwarz HF, (2004) A new approach to forecasting intermittent
demand for service parts inventories. International Journal of Forecasting 20: 375–387
Williams TM, (1984) Stock control with sporadic and slow-moving demand. Journal of the
Operational Research Society 35: 939–948
Part F
21
Maintenance in the Rail Industry
Jørn Vatn
21.1 Introduction
This chapter presents two case studies of maintenance optimization in the rail
industry. The first case study discusses grouping of maintenance activities into
maintenance packages. The second case study uses a life cycle cost approach to
prioritize between maintenance and renewal projects under budget constraints.
Grouping of maintenance activities into maintenance packages is an important
issue in maintenance planning and optimization. This grouping is important both
from an economic point of view in terms of minimization of set-up costs, and also
with respect to obtaining administratively manageable solutions. If several maintenance activities can be specified as one work-order in the computerized maintenance management system, we would have fewer work-orders to administer. The
maintenance intervals are usually determined by considering the various components or activities separately, and then the activities are grouped into maintenance
packages. By executing several activities at the same time, the set-up costs may be
shared by several activities. However, this will require that we have to shift the
intervals for the individual activities. If we try to put too many activities into the
same group, the gain with respect to set-up costs may be dominated by the costs of
changing the intervals for the individual activities. The case study we present for
maintenance grouping is related to train maintenance, with a particular focus on
activities related to components in the bogie.
Another problem most industries are facing is the limited resources available
for maintenance and renewal, implying that optimization has to be conducted under
budget constraints. Two main questions should then be addressed: first, whether
the budget constraints should be relaxed to some extent by putting more resources
into maintenance and renewal if there are more good projects than resources. The
second question is how to prioritize, given the budget
constraints. In the case study we present an approach to cost-benefit analysis of the
various projects. This gives a ranked list of projects to consider for execution. The
proposed method has been implemented by the Norwegian National Rail Administration (JBV), which is responsible for the Norwegian railway network.
Section 21.2 presents some general information about rail maintenance in
Norway as a basis for the two case studies. The first case study in Section 21.3
discusses grouping of maintenance activities into maintenance packages. The
second case study in Section 21.4 uses a life cycle cost approach to prioritize between maintenance and renewal projects under budget constraints.
are Carretero et al. (2003), Zoeteman (2003), Veit and Wogowitsch (2003), Vatn
et al. (2003), Zarembski and Palese (2003), Pedregal et al. (2004), Meier-Hirmer
et al. (2005), Budai et al. (2005) and Reddy et al. (2006). Railway research related
to maintenance is, however, dominated by wear modelling. Especially wheel-rail
wear models and track degradation models are important because the major maintenance and renewal costs of a railway line are due to track components. Some
important references are Bing and Gross (1983), Li and Selig (1995), Sato (1995),
Bogdanski et al. (1996), Ferreira and Murray (1997), Zhang et al. (1997), Kay
(1998), Zakharov et al. (1998), Salim (2004), Telliskivi and Olofsson (2004),
beyond the scope of this chapter.
We often distinguish between the static and the dynamic planning regimes. In
the static regime the grouping is fixed during the entire system lifetime, whereas in
the dynamic regime the groups are re-established over and over again. The static
grouping situation may be easier to implement than the dynamic one, and the maintenance effort is constant, or at least predictable. The advantage of dynamic
grouping is that it can accommodate new information, unforeseen events, etc., which
may require a new grouping and changed plans. For an introduction to maintenance grouping we
refer to Wildeman (1996) who discusses these different regimes in detail. In the
example that follows we illustrate some aspects of dynamic grouping related to
maintenance activities on a train bogie.
21.3.2 Modelling Framework for the Grouping of Maintenance Activities
The trains are regularly taken out of service and sent to the maintenance depot for
execution of maintenance. Several subsystems are maintained at the same time,
and this makes the definition of set-up costs rather complicated when we develop
grouping strategies. In principle, some of the set-up costs are related to the fact that
the train is sent to the depot for maintenance, whereas some other parts of the set-up costs are specific for one subsystem. In the following, we will simplify and only
consider costs related to the bogie, i.e. we assume one fixed set-up cost related to
the bogie. We also assume that the train is available at the maintenance depot at
any time. This is also a simplification, since each train follows a schedule, and can
only enter the maintenance depot at some of the end stations for the different
services. In order to get access to the various components in the bogie some disassembling is required before maintenance can be executed, and also some reassembling is required after execution of maintenance. The costs of disassembling
and re-assembling are here included in the set-up cost. In the model presented we
also assume that the set-up costs are the same for all activities. It is further assumed
that there is one and only one maintenance activity related to each component. This
simplifies notation because we then may alternate between failure of component i
and executing maintenance activity i where there is a unique relation between
component and activity. The basic notation to be used is as follows.
Notation
ciP, cUi, S, ΦE,i(x), Mi(x)
Φi(x,k) = [ciP + S/k + Mi(x)]/x: average costs per unit time if x is the length of the interval between planned maintenance, and the set-up costs are shared by k activities in total.
Φ*i,k: the minimum value of Φi(x,k), i.e., minimization over x.
x*i,k: the x-value that minimizes Φi(x,k).
ki,Av: average number of components sharing the set-up costs for the i-th component, i.e. the i-th component is on average maintained together with ki,Av − 1 other components.
Φ*i,Av: average minimum costs per unit time over all k-values.
x*i,Av: optimum value of xi over all k-values. x*i,Av is measured in million kilometres since the last maintenance on component i.
t0: point of time when we are planning the next group of activities. Initially t0 = 0. t0 is measured in running (million) kilometres since t = 0.
xi: age of component i at time t0, i.e., time since preventive maintenance.
t*i,Av = t0 + x*i,Av − xi: optimum time in running (million) kilometres.
Kk: candidate group, i.e. the set of the first k components to be maintained according to the individual schedules with t*i,Av as the basis for due time.
N: number of activities/components.
T: end of planning horizon, i.e. we are planning from t0 = 0 to T.
φi(x,k) = [ciP + S/k + Mi(x)]/x    (21.1)
If the grouping was fixed, i.e. static grouping, the optimization problem would
just be to minimize Σi φi(x,k) for all k components maintained at the same time.
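As an illustration, Equation 21.1 can be minimized numerically. The power-law form of Mi(x) used below is an assumption made only for this sketch (a minimal-repair type model), and all parameter values are illustrative rather than taken from the case study:

```python
import numpy as np

def phi(x, cP, cU, S, k, alpha, beta):
    """Average costs per unit time, Eq. 21.1: (cP + S/k + M(x)) / x.
    M(x) is here ASSUMED to follow a power law, M(x) = cU * (x/alpha)**beta."""
    M = cU * (x / alpha) ** beta
    return (cP + S / k + M) / x

# grid search for the minimiser x*_{i,k} (illustrative parameters only)
xs = np.linspace(0.05, 5.0, 2000)            # interval length, 10^6 km
costs = phi(xs, cP=960.0, cU=6740.0, S=500.0, k=10, alpha=2.0, beta=3.5)
x_star = xs[np.argmin(costs)]
phi_star = costs.min()
print(f"x* = {x_star:.3f} million km, phi* = {phi_star:.0f} Euro per million km")
```

The same search, repeated per component, gives the individual optima x*i,k used by the grouping heuristic later in the section.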
J. Vatn
Static grouping will not be discussed, but we present an approach for dynamic
grouping. Mathematically, the challenge now is to establish the grouping either in a
finite or infinite time horizon. In addition to the grouping, we also have to schedule
the execution time for each group (maintenance package). The grouping and the
scheduling cannot be done separately. Generally, such optimization problems are NP-hard (see Garey and Johnson 1979 for a definition), and heuristics are required. Before we propose our heuristic we present some motivating results.
Let φ*i,k be the minimum average costs when one component is considered
individually, and let x*i,k be the corresponding optimum x-value. It is then easy to
prove that mi(x*i,k) = Mi′(x*i,k) = φ*i,k, meaning that when the instantaneous expected
unplanned costs per unit time, mi(x), exceed the average costs per unit time,
maintenance should be carried out. The way to use the result is now the following.
Assume we are going to determine the first point of time to execute the
maintenance, i.e. to find t = x*i,k starting at t = 0. Further, assume that we know the
average costs per unit time (φ*i,k) but that we have for some reason lost or forgotten the value of x*i,k. What we can then do is to find t such that mi(t) = Mi′(t) = φ*i,k, yielding the first point of time for maintenance. Then from time t and for the
remaining planning horizon we can pay φ*i,k as the minimum average costs per unit
time. This is the traditional marginal costs approach to the problem, and brings the
same result as minimizing Equation 21.1. The advantage of the marginal thinking
is that we are now able to cope with the dynamic grouping. Assume that the time
now is t0, and xi is the age (time since last maintenance) for component i in the
group we are considering for the next execution of maintenance. Further, assume
that the planning horizon is [t0,T). The problem now is to determine the point of
time t (≥ t0) when the next maintenance is to be executed. The total costs of
executing the maintenance activities in a group is S + Σi ciP, which we pay at time t.
Further, the expected unplanned costs in the period [t0, t) are Σi Mi(t − t0 + xi) − Σi Mi(xi).
For the remaining time of the planning horizon the total costs are (T − t)·Σi φ*i,k,
provided that each component i can be maintained at perfect match with k − 1 other
activities for the rest of the period. Since φ*i,k depends on how many components
share the set-up cost, which we do not know at this time, we use some average
value φ*i,Av. We assume that we know this average value at the first planning. To
determine the next point of time for maintaining a given group of components we
thus minimize:
c1(t; k) = S + Σ{i∈Kk} [ciP + Mi(t − t0 + xi) − Mi(xi) + (T − t)·φ*i,Av]    (21.2)
For the activities not included in the candidate group Kk, the costs for the rest of the planning horizon are

c2(t; k) = Σ{i∉Kk} (T − t0)·φ*i,Av    (21.3)

provided they can be maintained at perfect match with other activities, i.e. the
set-up costs are shared with ki,Av − 1 activities, and executed at time t*i,Av. The total
optimization problem related to the next group of activities is therefore to
minimize:
c(t; k) = c1(t; k) + c2(t; k)
        = S + Σ{i∈Kk} [ciP + Mi(t − t0 + xi) − Mi(xi) + (T − t)·φ*i,Av] + Σ{i∉Kk} (T − t0)·φ*i,Av    (21.4)
The idea is simple: we first determine the best group to execute next, and the
best time to execute it. Further, we assume that subsequent activities can be
executed at their local optima. One could expect to do better by also taking the
second grouping into account when planning the first group, rather than treating the
subsequent activities individually. See, e.g., Budai et al. (2005) for more advanced
heuristics in situations similar to those presented here. The heuristic is as follows.
Step 0: Initialization. This means to find initial estimates of ki,Av, and use these k-values as the basis for minimization of Equation 21.1. This gives initial estimates
for x*i,Av and φ*i,Av. Finally, the time horizon for the scheduling is specified, i.e., we
set t0 = 0 and choose an appropriate end of the planning horizon (T).
Step 1: Prepare for defining the group of activities to execute next. First calculate
t*i,Av = t0 + x*i,Av − xi and sort the activities in increasing order of t*i,Av.
Step 2: Establish the candidate groups, i.e. for k = 1 to N we use the ordered t*i,Av
values to find a candidate group of size k to be executed next. If t*k,Av > mini<k (t*i,Av + x*i,Av),
at least one activity in the candidate group would need to be executed twice
before the last one is scheduled, which does not make sense. Hence, in this situation
the last candidate group is dropped and we do not search for more candidate
groups for the time being.
Step 3: For each candidate group Kk, minimize c(t,k) in Equation 21.4 with respect
to execution time t. Next choose the candidate group Kk that gives the minimum
cost. This group should then be executed at the corresponding optimum time t.
Step 4: Prepare for the next group, i.e. we assume that all activities in the chosen
candidate group are executed at time t. This corresponds to setting xi = 0 for i ∈ Kk
and xi = xi + t − t0 for i ∉ Kk, and then updating the current time, i.e. t0 = t. If t0 < T
go to Step 1, else we are done.
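Steps 0 to 4 can be sketched in code. Everything numerical here is an assumption for illustration: the power-law form of Mi, the cost parameters, and the initial estimate of ki,Av; the grid searches stand in for whatever optimizer would be used in practice, and the data are not those of the case study.

```python
import numpy as np

S, T = 500.0, 15.0                        # set-up cost and horizon (10^6 km), assumed
N = 6                                     # number of activities/components
rng = np.random.default_rng(1)
cP = rng.uniform(400.0, 1000.0, N)        # preventive costs (assumed)
cU = cP + 5000.0                          # unplanned costs (assumed)
alpha = rng.uniform(1.5, 3.0, N)          # scale of the assumed M_i power law
beta = 3.5                                # aging parameter (assumed)

def M(i, x):
    """Assumed expected unplanned costs over an interval of length x."""
    return cU[i] * (np.maximum(x, 0.0) / alpha[i]) ** beta

def individual_opt(i, k):
    """Step 0: minimise Eq. 21.1 for component i when k activities share S."""
    xs = np.linspace(0.05, 10.0, 4000)
    phi = (cP[i] + S / k + M(i, xs)) / xs
    j = np.argmin(phi)
    return xs[j], phi[j]                   # x*_{i,Av}, phi*_{i,Av}

k_init = 3                                 # initial estimate of k_{i,Av}
x_av = np.empty(N); phi_av = np.empty(N)
for i in range(N):
    x_av[i], phi_av[i] = individual_opt(i, k_init)

t0, x_age = 0.0, np.zeros(N)               # current time and component ages
schedule = []
while t0 < T and len(schedule) < 50:       # iteration cap as a safety guard
    # Step 1: due times t*_{i,Av} = t0 + x*_{i,Av} - x_i, sorted
    t_due = t0 + x_av - x_age
    order = np.argsort(t_due)
    # Steps 2-3: build candidate groups K_k and minimise Eq. 21.4 over t
    best = None
    for k in range(1, N + 1):
        K = order[:k]
        if t_due[K[-1]] > (t_due[K] + x_av[K]).min():
            break                          # an activity would be due twice first
        ts = np.linspace(t0, T, 2001)[1:]  # candidate execution times in (t0, T]
        c = S + sum(cP[i] + M(i, ts - t0 + x_age[i]) - M(i, x_age[i])
                    + (T - ts) * phi_av[i] for i in K)
        j = np.argmin(c)
        if best is None or c[j] < best[0]:
            best = (c[j], ts[j], K)
    _, t_exec, K = best
    schedule.append((round(float(t_exec), 3), sorted(int(i) for i in K)))
    # Step 4: reset executed components, age the others, advance time
    x_age = x_age + (t_exec - t0)
    x_age[K] = 0.0
    t0 = t_exec
print(schedule)
```

Each schedule entry is an execution time together with the group executed then; re-running Step 0 with the k-values observed in the schedule corresponds to the improvement discussed below.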
There are several ways to improve the algorithm. One intuitive improvement is
to refine the estimates of ki,Av, and the corresponding x*i,Av and φ*i,Av, specified in
Step 0. This is easy, since in Step 4 we get a new value of k for those activities
included in the candidate group, and when the algorithm terminates we simply set
ki,Av to the average for each activity i in the period [0,T). We may then start over
again at Step 0 with these new values of ki,Av.
#  | Component | Function | Failure type | Failure effect
–  | – | – | Crack | Potential reduction of antitilting
–  | Flexible coupling bearing (CENTA) | Coupling between diesel engine/gear; power transfer | Defect of gear | –
–  | Aeration valve | Pressure balance | Locked | –
–  | Diesel engine Cummins N14-R | Actuation of half train set | Functional failure or lower compression of engine | –
–  | Engine attachment (bearing NS3.59) | Engine seat; damping of vibrations | Fissure and damaged rubber of silent blocks | –
10 | Primary damper | Absorbing the vibration between axle box and bogie | Functional failure | Reduced dynamic characteristics
11 | Horizontal damper, motor bogie | Absorbing the vibration between bogie and car body | Functional failure | Reduced dynamic characteristics
12 | Horizontal damper, motor bogie | Absorbing the vibration between bogie and car body | Functional failure | Reduced dynamic characteristics
13 | Vertical damper, motor bogie | Absorbing the vibration between bogie and car body | Functional failure | Reduced dynamic characteristics
14 | Vertical damper, motor bogie | Absorbing the vibration between bogie and car body | Functional failure | Reduced dynamic characteristics
15 | Longitudinal car body damper | Absorbing vibrations between car bodies | Functional failure | Reduced dynamic characteristics
16 | Brake beam support bush | – | – | Increased gap between pin and bush
17 | – | – | – | Increased gap between pin and bush
18 | – | Reduction of wear between bolts and brake unit support | – | Increased gap between pin and bush
19 | Cylindrical roller bearing, actuation side | Bearing rotor of generator | Rotor of generator blocks | –
20 | Cardan shaft | – | Fracture joint bearing | –
In Step 2 we establish candidate groups. For k = 12 we note that t*12 > t*1 + x*1,Av,
which means that we only process candidate groups with k < 12.
In Step 3 we calculate c(t,k), and the minimum values are shown in Table 21.3.
The minimum is found for k = 10. Further, c(t,10) has its minimum for t* = 0.829
million kilometres. We observe that for those activities included in the first group,
the t*i values are rather close to 0.829 million kilometres.
In Step 4 we now proceed, and set xi to 0 for those activities which are executed
(i.e. i ≤ 10), whereas xi = xi + 0.829 million kilometres for i > 10. Finally we set
t0 = 0.829 million kilometres before we go to Step 1 again. The next group of
activities is similarly found to be executed at t* = 1.606 million kilometres. This next
group comprises some activities not included in the first group, but also some
activities that were executed in the first group and are now executed for the second
time. We proceed until t0 > 15.
When the procedure terminates, we have a total cost of 1.2 million Euros. We
have also recorded the average values of ki,Av which in this example ranges from
13.5 to 17 which is slightly higher than the initial assessment of ki,Av = 13. By repeating the entire procedure with the new values for ki,Av a small reduction in costs
of 1% is obtained.
Table 21.2. Cost figures and reliability parameters

#  | CP (€) | CU (€) | MTTF (106 km) | Aging, β | x*i,Av (106 km)
1  | 960    | 6,740  | 2.56  | 3.5 | 1.38
2  | 9,600  | 22,400 | 3.33  | –   | 2.48
3  | 680    | 6,230  | 1.33  | 3.5 | 0.67
4  | 632    | 5,960  | 2.22  | 3.5 | 1.12
5  | 720    | 6,320  | 10.00 | –   | 4.76
6  | 400    | 5,720  | 2.11  | 3.5 | 0.98
7  | 37,000 | 72,500 | 2.00  | 3.5 | 7.90
8  | 520    | 5,960  | 4.17  | 3.5 | 2.01
9  | 780    | 6,440  | 12.50 | 3.5 | 6.46
10 | 664    | 6,236  | 1.60  | 3.5 | 0.80
11 | 424    | 5,786  | 1.61  | 3.5 | 0.75
12 | 384    | 5,711  | 1.61  | 3.5 | 0.74
13 | 384    | 5,711  | 1.78  | 3.5 | 0.82
14 | 184    | 5,336  | 1.78  | 3.5 | 0.74
15 | 600    | 6,116  | 1.78  | 3.5 | 0.88
16 | 1,440  | 7,580  | 2.67  | 3.5 | 1.53
17 | 4,060  | 12,590 | 2.67  | 3.5 | 1.77
18 | 1,160  | 7,130  | 2.67  | 3.5 | 1.48
19 | 6,080  | 16,220 | 1.61  | 2.5 | 1.22
20 | 6,400  | 16,700 | 1.33  | 3.5 | 0.93
Table 21.3. Minimum costs c(t*,k) and optimum execution times t* for the candidate groups

k  | Activity | PM/Wait | t*i,Av (106 km) | c(t*,k) (106 €) | t* (106 km)
1  | –  | PM   | 0.674 | 1.2009 | 0.659
2  | –  | PM   | 0.740 | 1.2007 | 0.682
3  | 10 | PM   | 0.742 | 1.2005 | 0.690
4  | 11 | PM   | 0.751 | 1.2002 | 0.700
5  | 12 | PM   | 0.805 | 1.2000 | 0.718
6  | 13 | PM   | 0.819 | 1.1998 | 0.728
7  | 14 | PM   | 0.879 | 1.1996 | 0.743
8  | 15 | PM   | 0.932 | 1.1995 | 0.814
9  | 20 | PM   | 0.979 | 1.1993 | 0.820
10 | –  | PM   | 1.120 | 1.1991 | 0.829
11 | –  | Wait | 1.221 | 1.1993 | 0.872
12 | 18 | Wait | 1.375 | – | –
13 | –  | Wait | 1.475 | – | –
14 | 16 | Wait | 1.534 | – | –
15 | –  | Wait | 1.769 | – | –
16 | 17 | Wait | 2.013 | – | –
17 | –  | Wait | 2.483 | – | –
18 | –  | Wait | 4.760 | – | –
19 | 19 | Wait | 6.461 | – | –
20 | –  | Wait | 7.904 | – | –
Upon a failure requiring the set-up cost to be paid, it is rather obvious that
activities that were already due if treated individually according to
Equation 21.1 should be executed at this opportunity. Further, activities not
scheduled in the next group (maintenance package) should not be executed, since
they were not even included in a group to be executed later than the time of this
opportunity. The basic question is thus which of the remaining activities in the next
due group should be executed at this opportunity. Let Kk be the set of k
activities in this group. Assume that we have found that it is favourable to execute
the first i − 1 < k activities at this opportunity. The procedure to test whether or not
activity i should also be executed is as follows:
First perform a scheduling by starting at Step 1 in Section 21.3.2, assuming
that all activities up to i are executed at this opportunity, i.e. xj = 0 for
j ≤ i, and xj is set to the time since activity j was executed for j > i.
Let C1 be the minimum value of c(t,k) obtained in Step 3, plus the marginal
cost ciP of executing activity i.
Next, assume that only the activities up to i − 1 are executed, i.e. xj = 0 for j ≤ i − 1,
and xj is set to the time since activity j was executed for j ≥ i.
Let C2 be the minimum value of c(t,k) obtained in Step 3 this second time.
If C1 > C2 it is not beneficial to do activity i.
If it was beneficial to do activity i at t0, we test again for i = i + 1, as long as i ≤ k.
The procedure is demonstrated by the following example.
We assume that a failure occurs at time t = 0.8 million km. From Table 21.3 we
observe that the first 10 activities were scheduled for execution at time 0.829
million km. Since the set-up cost is already paid by the corrective activity, it is
obvious that the first four activities, i.e. those with individual optimum less than
t = 0.8 million km, should be done. Then we test whether activity 5 (t*5 = 0.805)
should be done at this opportunity. We calculate C1 = 1.188267 million Euros and
C2 = 1.188274 million Euros; hence activity 5 should be done. Then we proceed
similarly, and find that activity 6 should also be executed. For activity 7 (t*7 = 0.879)
we find that it is not cost effective to execute this activity. Since the first six
activities have been executed at this opportunity, the next planned maintenance
can be postponed from the original t = 0.829 million km to t = 0.985 million km.
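The test loop itself can be sketched as follows. The cost model used here (a quadratic penalty for deviating from each activity's individual due time, together with an assumed marginal cost cP) is only a stand-in for the real minimization of c(t,k) in Step 3; the due times are those of the first ten activities in the example.

```python
t0 = 0.8                                     # failure time (10^6 km)
t_due = [0.674, 0.740, 0.742, 0.751, 0.805,
         0.819, 0.879, 0.932, 0.979, 1.120]  # individual due times
cP = 0.01                                    # assumed marginal PM cost (10^6 Euro)

def remaining_cost(executed):
    """Stand-in for min_t c(t, k): the remaining activities share one common
    execution time; deviation from each due time is penalised quadratically."""
    rest = [t for j, t in enumerate(t_due) if j not in executed]
    if not rest:
        return 0.0
    t = sum(rest) / len(rest)                # cost-minimising common time
    return sum((t - tj) ** 2 for tj in rest)

executed = {j for j, t in enumerate(t_due) if t <= t0}   # already due: execute
i = len(executed)
while i < len(t_due):
    C1 = remaining_cost(executed | {i}) + cP      # execute activity i now
    C2 = remaining_cost(executed)                 # leave activity i in the group
    if C1 > C2:
        break                   # not beneficial; stop testing further activities
    executed.add(i)
    i += 1
print(sorted(executed))
```

With a faithful implementation of c(t,k) in place of the penalty model, the loop reproduces the C1/C2 comparison of the procedure above.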
Ballast cleaning when the ballast is polluted and stones are crushed
Rail grinding when the rail surface is rough
Tamping and leveling when track geometry is degraded
Sandblasting of bridges exposed to corrosion
Renewal of overgrown ditches
Point replacement of rails, e.g. in curvatures with high wear factor
Notation
C/B = cost-benefit ratio, i.e. the net present value of the benefits divided by the net present value of the costs of the project
RC(t) = portfolio costs of renewals without the project
RC*(t) = portfolio costs of renewals with the project
{T*} = set of renewal times with the project
{T} = set of renewal times without the project
c(t) = time dependent cost at point of time t (from now)
c*(t) = time dependent cost when a maintenance or renewal project is executed
d = factor to describe increase in time dependent cost due to degradation, i.e. the increase from one year to another is d·100%
LCC = life cycle cost
N = calculation period for net present value calculations
r = discount rate
RIF = risk influencing factor, i.e. a factor that influences the risk level
RLL = residual lifetime without the project
RLL* = residual lifetime with the project
[Figure 21.1. Cost curves c(t) and c*(t) against time up to T, with the renewal costs and the savings from executing a project indicated]
Special attention will be paid to projects that aim at extending the lifelength of
a railway system. A typical example is rail grinding for lifelength extension of the
rail, but the fastenings, sleepers and ballast will also benefit from the rail
grinding. Figure 21.2 shows how a smart activity may suppress the increase
in c(t) and thereby extend the point of time before the costs explode and a renewal
is necessary.
From a modelling point of view the situation is rather complex because
different projects are interconnected. For example, by executing a ballast cleaning
project the track quality is increased, reducing the need for tamping and leveling.
On the other hand, by tamping and point-wise supplement of ballast in pumping
areas (surface water) we may postpone the much more expensive ballast cleaning.
A third factor to take into account is the fact that for each tamping cycle there is
some stone crushing, and hence we should also be reluctant to do too much
tamping. Despite the fact that railways have existed for over 160 years there is a
lack of documented mathematical models describing the interaction between
different components in the railway, and the effect of the various maintenance
activities. When developing a tool for prioritization it has therefore been necessary
to base the model on model parameters specified by the maintenance planners and
their experts. In the future, it is planned to improve the models based on the
findings from a joint research project between Norway and Austria.
In the following we describe the basic input for performing the cost benefit
analysis. The numerical calculations are supported by a computerized tool (PriFo).
21.4.2.1 Qualitative Information
The situation leading up to each proposed project is described. This is typically
information from measurements and analysis of track quality, trends, etc. It is
important to describe the situation qualitatively before any quantitative parameters
are assessed. It is, however, a great challenge to transform the qualitative problem
description to quantitative numbers. In the future this can be supported by the
expected results from various research projects on deterioration models.
21.4.2.2 Safety Related Information
A general risk model has been derived where important risk influencing factors
(RIFs) have been identified. The RIFs relate both to the accident frequency such as
number of cracks in the rails, but also to the accident consequences such as speed,
terrain description, etc.
Table 21.4 shows an example related to the derailment frequency. In the
modelling, f0 corresponds to the average derailment frequency related to rail
problems. The value of f0 is found by analysing statistics over derailments in
Norway, where we find f0 = 3·10^(−4) per kilometre per year.
[Figure 21.2. Variable cost curves c(t) and c*(t) against time; the renewal times (Renewal and Renewal*) and the residual lifelengths RLL and RLL* are indicated, and the marked points are smart maintenance activities, e.g. rail grinding]
The variation width (w) in Table 21.4 shows the maximum negative or positive
effect of each RIF. In this model the values of the various RIFs are standardised,
which means that −1 represents the worst value of the RIF, 0 represents the base
case, and +1 represents the best value of the RIF. The interpretation of w is as
follows: if one RIF equals −1, then the derailment frequency is w times higher than
for the base case, and if the RIF equals +1 then the derailment frequency is w times
lower than for the base case. Assuming that the various RIFs act independently of each
other, an influence model for the derailment frequency may be written
f = f0 · Πi wi^(−RIFi)    (21.5)
where wi is the variation width of RIF number i, and RIFi is the value of RIF
number i. By using Equation 21.5 with the generic weights from Table 21.4, we
may easily assess the derailment frequency only by assessing the values of the
RIFs for a given railway line or section.
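Numerically the influence model works as follows; the weight for the number of failures/cracks and all the RIF values below are assumed for illustration (Table 21.4 only gives w = 1.5 for gradient and horizontal geometry):

```python
f0 = 3e-4                        # base derailment frequency (per km per year)
w = {"failures": 2.0,            # ASSUMED weight for number of failures/cracks
     "gradient": 1.5,
     "horizontal_geometry": 1.5}
rif = {"failures": -1.0,         # worst value
       "gradient": 0.0,          # base case
       "horizontal_geometry": 1.0}  # best value (illustrative assessments)

f = f0
for name, width in w.items():
    # RIF = -1 (worst) -> w times higher; RIF = +1 (best) -> w times lower
    f *= width ** (-rif[name])
print(f)
```

Here the first factor doubles the frequency, the second leaves it unchanged, and the third reduces it by a factor 1.5, matching the stated interpretation of the variation widths.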
In addition to the current value of the risk, the future increase also has to be
described, corresponding to the two cost curves c(t) and c*(t) in Figure 21.1. For
example, we might use an exponential growth of the form c(t) = f·(1+d)^(t−1), where d
is the degradation from one year to the next. The rationale behind an exponential
growth is that the forces driving the track deterioration are often assumed proportional to the deviation from an ideal track. A simple differential equation argument would then show an exponential growth.
Table 21.4. Risk influencing factors for the derailment frequency

RIF | Variation width, w
Number of failures/cracks | –
Gradient | 1.5
Horizontal geometry | 1.5
The discount rate is r = 4%. Note that we here introduce the discount rate
as the difference between the interest rate and the inflation rate.
Monetary values for safety consequence classes as given in Table 21.5.
Costs per kiloton freight delayed 1 min = 160 Euros.
Table 21.5. Monetary values in Euros for each safety consequence class

Class | Safety consequence | Monetary value (€)
C1 | Minor injury | 2,000
C2 | Medical treatment | 33,000
C3 | Serious injury | 330,000
C4 | 1 fatality | 1.7 million
C5 | 2-10 fatalities | 11 million
C6 | >10 fatalities | 175 million
Costs per passenger delayed 1 min = 0.4 Euros. A train with 250 passengers
then gives 100 Euros per minute delayed.
The change in safety related costs is then given by

ΔLCCS = Σ{t=1..N} [c(t) − c*(t)]·(1 + r)^(−t)    (21.6)
where r is the discount rate, and N is the calculation period. N is here the residual
lifelength (RLL) if nothing is done. This means that we compare the situation with
and without the project in the period from now till we have to do something in any
case. Similarly we obtain the change in punctuality costs, ΔLCCP, and the change in
maintenance and operational costs, ΔLCCM&O.
To calculate Equation 21.6 we may in some special situations find closed
formulas. For example, if c(t) is constant, i.e. c(t) = c, the formula for the sum of a
geometric series yields

Σ{t=1..N} c·(1 + r)^(−t) = c·[1 − (1 + r)^(−N)]/r    (21.7)
Further, if c(t) the first year is c1 and c(t) increases by a factor (1+d) each year, we
have

Σ{t=1..N} c1·(1 + d)^(t−1)·(1 + r)^(−t) = c1·[1 − ((1 + d)/(1 + r))^N]/(r − d)    (21.8)
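Both closed forms are easily checked against direct summation; the rate, growth factor and period below are arbitrary illustrative values:

```python
r, d, N, c1 = 0.04, 0.02, 25, 100.0

# Eq. 21.7: constant yearly cost c
direct = sum(c1 * (1 + r) ** (-t) for t in range(1, N + 1))
closed = c1 * (1 - (1 + r) ** (-N)) / r

# Eq. 21.8: cost growing by a factor (1 + d) each year
direct_g = sum(c1 * (1 + d) ** (t - 1) * (1 + r) ** (-t) for t in range(1, N + 1))
closed_g = c1 * (1 - ((1 + d) / (1 + r)) ** N) / (r - d)

print(closed, closed_g)
```

Note that Equation 21.8 requires r ≠ d; for r = d the sum reduces to N·c1/(1+r).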
The change in renewal costs over the calculation period is given by

ΔLCCRLT = Σ{t∈{T}} RC(t)·(1 + r)^(−t) − Σ{t∈{T*}} RC*(t)·(1 + r)^(−t)    (21.9)
(21.10)
The cost benefit ratio, or more precisely the benefit cost ratio, is given by

C/B = (ΔLCCS + ΔLCCP + ΔLCCM&O + ΔLCCRLT)/LCCI    (21.11)
ΔLCCS = 0.3
ΔLCCP = 0.4
ΔLCCM&O = 2.3
ΔLCCRLT = 12.9
LCCI = 2.2
This yields a cost benefit ratio of C/B = 7.2, meaning that for each Euro put
into rail grinding, the payback is 7 Euros.
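The reported ratio can be reproduced from the listed figures; that the four ΔLCC terms simply add up in the numerator is an assumption, but it is consistent with the stated result:

```python
# benefit terms and project cost as listed above (assumed to be million Euros)
benefits = {"safety": 0.3, "punctuality": 0.4, "M&O": 2.3, "RLT": 12.9}
lcc_i = 2.2                     # net present value of the project cost
cb = sum(benefits.values()) / lcc_i
print(round(cb, 1))             # 7.2
```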
By calculating the cost benefit ratio for the various maintenance and renewal
projects, we get a sorted list of the most promising projects. In principle, we should
execute those projects having a cost benefit ratio, C/B, higher than one. If the
budget constraints imply that we cannot execute all projects with C/B higher than
one, it would be necessary to have a thorough discussion of the budget for
maintenance and renewal. Since most organizations suffer from the short-term
cost-cutting syndrome, it is a hard struggle to argue for spending more money now
in order to save money in a five to ten year perspective.
Even if we cannot do much about the budget situation, we may use the results
from the cost-benefit analysis to prioritize between the various projects.
21.5 Conclusions
The two case studies presented elaborate on some of the challenges in Norwegian
rail maintenance. Both the railway undertaking (NSB) and the infrastructure
manager (JBV) aim at implementing more proactive strategies for maintenance and
renewal based on more formal methods such as RCM and NPV/CBA. These
methods require reliability parameters of a much higher level of detail than the
current experience databases can offer today. Therefore both NSB and JBV have
started the process of restructuring databases, and emphasize the importance of
proper failure reporting. Due to the lack of experience data it has up to now been
necessary to utilize expert judgment to a great extent. It is further important to
emphasize that optimization models like the ones presented here should be considered as decision support, rather than decision rules. In order to improve on these
areas we believe that more systematic collection and analysis of reliability data is
an important factor, and here the rail industry may learn from the offshore industry
where joint data collection exercises have been run for 25 years (OREDA 2002).
Another challenge of such modelling is the lack of consistent degradation
models. For example, for the track there is a good qualitative understanding of
factors affecting degradation such as water in the track, contamination, geometry
failures, heavy axles, etc. However, the quantitative models for degradation taking
these factors into account are not very well developed. Research has paid much
attention to design problems to ensure long service life but it is difficult to use the
research results for maintenance and renewal considerations. More empirical research on degradation mechanisms will also be important in the future.
21.6 References
Bing AJ, Gross A, (1983) Development of Railroad Track Degradation Models.
Transportation Research Record 939, Transportation Research Board, National
Research Council, National Academy Press, Washington, D.C., USA.
Bogdanski S, Olzak M, Stupnicki J, (1996) Numerical stress analysis of rail rolling contact
fatigue cracks. Wear 191:14–24
Braghin F, Lewis R, Dwyer-Joyce RS, Bruni S, (2006) A mathematical model to predict
railway wheel profile evolution due to wear. Accepted for publication in Wear.
Budai G, Huisman D, Dekker R, (2005) Scheduling Preventive Railway Maintenance
Activities. Accepted for publication in Journal of the Operational Research Society.
Carretero J, Perez JM, Garcia-Carballeira F, Calderon A, Fernandez J, Garcia JD,
Lozano A, Cardona L, Cotaina N, Prete P, (2003) Applying RCM in large scale
systems: a case study with railway networks. Reliability Engineering and System Safety
82:257–273
Dekker R, Wildeman RE, Van der Duyn Schouten FA, (1997) A Review of Multi-Component Maintenance Models with Economic Dependence. Mathematical Methods
of Operations Research 45:411–435
Ferreira L, Murray M, (1997) Modelling rail track deterioration and maintenance: current
practices and future needs. Transport Reviews 17(3):207–221
Garey MR, Johnson DS, (1979) Computers and Intractability: a Guide to the Theory of NP-Completeness. W.H. Freeman and Company, New York.
Grassie SL, (2005) Rolling contact fatigue on the British railway system: treatment. Wear
258:1310–1318
Hecke A, (1998) Effects of future mixed traffic on track deterioration. Report TRITA-FKT
1998:30, Railway Technology, Department of Vehicle Engineering, Royal Institute of
Technology, Stockholm.
Kay AJ, (1998) Behaviour of Two Layer Railway Track Ballast under Cyclic and Monotonic
Loading. PhD Thesis, University of Sheffield, UK.
Li D, Selig ET, (1995) Evaluation of railway subgrade problems. Transportation Research
Record 1489:17–25
Meier-Hirmer C, Sourget F, Roussignol M, (2005) Optimising the strategy of track
maintenance. In: Kolowrocki K (ed) Advances in Safety and Reliability. Taylor & Francis
Group, London.
OREDA, (2002) Offshore Reliability Data, 4th ed. OREDA Participants. Available from
Det Norske Veritas, NO-1322 Hovik, Norway.
Pedregal DJ, Garcia FP, Schmid F, (2004) RCM2 predictive maintenance of railway
systems based on unobserved components models. Reliability Engineering and System
Safety 83:103–110
Podofillini L, Zio E, Vatn J, (2006) Risk-informed optimization of railway tracks inspection and
maintenance procedures. Reliability Engineering and System Safety 91:20–30
Reddy V, Chattopadhyay G, Larsson-Kraik PO, Hargreaves DJ, (2006) Modelling and
analysis of rail maintenance cost. Accepted for publication in International Journal of
Production Economics.
Salim W, (2004) Deformation and degradation aspects of ballast and constitutive modeling
under cyclic loading. PhD Thesis, University of Wollongong, Australia.
Sato Y, (1995) Japanese studies on deterioration of ballasted track. Vehicle System Dynamics
24:197–208
Sriskandarajah C, Jardine AKS, Chan CK, (1998) Maintenance scheduling of rolling stock
using a genetic algorithm. European J. Oper. Res., 35:115
Telliskivi T, Olofsson U, (2004) Wheel-rail wear simulation. Wear 257:1145–1153
Vatn J, Podofillini L, Zio E, (2003) A risk based approach to determine type of ultrasonic
inspection and frequencies in railway applications. World Congress on Railway
Research, Edinburgh, Scotland, 28 September to 1 October 2003.
Veit P, Wogowitsch M, (2003) Track Maintenance based on life-cycle cost calculations. In:
Innovations for a cost effective Railway Track.
www.promain.org/images/publications/Innovations-LCC.pdf
Welte T, Vatn J, Heggset J, (2006) Markov state model for optimization of maintenance and
renewal of hydro power components. 9th International Conference on Probabilistic
Methods Applied to Power Systems, KTH, Stockholm, 11–15 June 2006.
Wildeman RE, (1996) The art of grouping maintenance. PhD Thesis, Erasmus University
Rotterdam, Faculty of Economics.
Zakharov S, Komarovsky I, Zharov I, (1998) Wheel flange/rail head wear simulation. Wear
215:18–24
Zarembski AM, Palese JW, (2003) Risk Based Ultrasonic Rail Test Scheduling: Practical
Application in Europe and North America. 6th International Conference on Contact
Mechanics and Wear of Rail/Wheel Systems (CM2003), Gothenburg, Sweden, June
10–13, 2003.
Zhang YJ, Murray MH, Ferreira L, (1997) Railway track performance models: degradation
of track structures. Road and Transport Research 6(2):4–19
Zoeteman A, (2003) Life Cycle Management Plus. In: Innovations for a cost effective Railway
Track. www.promain.org/images/publications/Innovations-LCC.pdf
22
Condition Monitoring of Diesel Engines
Renyan Jiang, Xinping Yan
22.1 Introduction
The engine is the heart of the ship, and the lubricant is the lifeblood of the engine.
Wear is one of the main causes of engine failures. It is desirable to avoid
engine breakdowns for reasons of safety and economy. This has led to an increasing interest in engine condition monitoring and performance modeling so as to
provide useful information for maintenance decisions.
Generally, an engine goes through three phases: (i) a running-in phase with a
decreasing wear rate, (ii) a normal operational phase with a roughly constant wear
rate and (iii) a wear-out phase with a quickly increasing wear rate. The wear state
can be effectively monitored by a number of techniques. The most popular technique is lubrication oil testing and analysis. Other techniques such as vibration and
acoustic emission analyses also provide evidence of the wear state. A more
effective way may be an integrated use of various monitoring techniques. In this
chapter we confine our attention to oil analysis.
Oil analysis techniques fall into the following three types. The first is concentration analysis of wear particles in lubricant. This can be conducted in the field or
the laboratory. The second is wear debris analysis. This deals with examination of
the shape, size, number, composition, and other characteristics of the wear particles
so as to identify the wear state. This is usually conducted in the laboratory. The
third is lubricant degradation analysis. This is used to analyze physical and chemical characteristics of lubricant and determine the state of lubricant. This can be
conducted in the field or the laboratory.
To avoid the use of expensive laboratory instrumentation for wear state identification, a usual practice is to build a quantitative relation (or discriminant model)
between the condition variables (e.g. concentrations of wear particles) and the wear
state using an observation sample obtained from both field and laboratory analysis.
Once such a relation is built and verified, only field analysis is needed in practical
applications. As a result, a key issue is to develop an effective and quantitative
condition monitoring model.
In this chapter we present a case study, which deals with applying oil analysis
techniques to condition monitoring of marine diesel engines. We present a systematic approach to identify the important condition variables, construct a multivariate
control chart, build the quantitative relation between the condition variables and
the wear state, and establish the state discrimination criterion or critical value. The
proposed approach is formulated based on intuitive reasoning, optimization techniques and real data.
The chapter is organized as follows. Section 22.2 presents a literature review on
condition-based maintenance (CBM) and its applications to diesel engines. Section
22.3 provides the background details and presents the monitoring and experimental
results. The results are analyzed and modeled in Section 22.4. Finally, we conclude
the chapter with a summary and discussion in Section 22.5.
Notation and Acronyms
AE
Acoustic emission
AI
Artificial intelligence
CBM Condition-based maintenance
CM
Condition monitoring
CV
Coefficient of variation
TBM Time-based maintenance
f(x)
Pdf of X
F(x) Cdf of X
m
Mean
r
Correlation coefficient
V
Variance
φ()
Standard normal pdf
Φ()
Standard normal cdf
α, β
Distribution model parameters, and so on
In this chapter, we summarize the CBM literature from the following five perspectives:
Data acquisition
Data processing
Diagnosis and prognostics
Maintenance decision-making
Computerized CBM management system
accelerometers, laser vibrometers, microphones, acoustic emission sensors, ferrography, spectroscopy, thermography, thermocouples, etc. Finally, the knowledge
and experience of experts provide important information in determining system
state and importance of relevant factors, indices or measures.
CM can be continuous or intermittent. The former is often expensive and
probably inaccurate; the latter may be more cost effective and accurate, but probably misses some failure events. Thus, it has been an important issue to determine
the optimal monitoring (or inspection or sampling) interval.
22.2.1.2 Data Processing
According to Jardine et al. (2006), CM data fall into three categories: value type,
waveform type, and image type. Data processing for value-type data is called data
analysis; data processing for waveform and image data is called signal processing;
and the procedure of extracting useful information from raw signals is called
feature extraction.
Two commonly used techniques for analyzing value type data are trend
analysis and time series modeling. When the problem involves a number of variables, dimension reduction appears very important. In a CBM setting, a reliability
model with covariates can combine event data (e.g. times to failure) with CM data
(or covariates). One such model is the proportional hazards model.
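A minimal sketch of such a model follows, with a Weibull baseline hazard and a single oil-analysis covariate; the choice of covariate (iron concentration), the baseline parameters and the regression coefficient are all illustrative assumptions:

```python
import math

def hazard(t, z, shape=2.0, scale=1000.0, beta=(0.04,)):
    """Proportional hazards: h(t | z) = h0(t) * exp(beta . z),
    with an assumed Weibull baseline h0."""
    h0 = (shape / scale) * (t / scale) ** (shape - 1)
    return h0 * math.exp(sum(b * zi for b, zi in zip(beta, z)))

# a higher iron concentration (ppm) in the oil scales the hazard up
h_low = hazard(500.0, z=(10.0,))
h_high = hazard(500.0, z=(50.0,))
print(h_high / h_low)
```

The key property shown is that the covariates scale the baseline hazard multiplicatively, independently of operating time t.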
Another well known approach is some two-interval models, where the failure
process is divided into two intervals: the time interval from working state to the
initiation of the defect, and the time interval from the initiation of the defect to
failure. Moubray (1997) describes the latter as P-F interval, where P means potential failure and F means functional failure. Goode et al. (2000) describes the
former as I-P interval, which is the time interval from machine installation to its
potential failure. Each of the intervals can be represented by a certain distribution.
Based on the fitted distributions and the outcomes of condition monitoring, machine prognosis can be derived.
Waveform data analysis includes three main categories: time-domain analysis,
frequency-domain analysis and time-frequency analysis. Time-domain analysis
calculates some descriptive statistics such as mean, standard deviation, root mean
square (RMS), skewness, kurtosis, time synchronous average, etc., based on the
time waveform. More advanced approaches include time series, autoregressive and
autoregressive moving average models. The most widely used frequency-domain analysis is spectrum analysis by means of the fast Fourier transform. A typical time-frequency analysis is the wavelet transform.
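As an illustration, the time-domain indicators listed above can be computed directly from a sampled waveform. The following is a minimal Python sketch; the function name and the particular set of features are illustrative, not taken from the chapter:

```python
import math

def time_domain_features(x):
    """Basic time-domain condition indicators for a sampled waveform."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    rms = math.sqrt(sum(v * v for v in x) / n)
    std = math.sqrt(sum(v * v for v in centered) / n)
    # Skewness and kurtosis are the normalized 3rd and 4th central moments;
    # kurtosis in particular rises sharply when impulsive faults appear.
    skew = sum(v ** 3 for v in centered) / (n * std ** 3)
    kurt = sum(v ** 4 for v in centered) / (n * std ** 4)
    return {"mean": mean, "rms": rms, "std": std, "skewness": skew, "kurtosis": kurt}

# A pure sine over whole periods has RMS = amplitude/sqrt(2) and kurtosis 1.5
sine = [math.sin(2 * math.pi * k / 256) for k in range(2560)]
feats = time_domain_features(sine)
```

In practice such features would be tracked over time and compared against baseline values for the healthy machine.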
Image processing is similar to waveform signal processing but more complicated.
22.2.1.3 Diagnostics and Prognostics
Diagnostics. Diagnostics deals with detection, isolation and identification of faults.
It maps the monitoring information and extracted features to machine faults. This
mapping process is usually called pattern recognition. Typical fault diagnostic approaches are model-based (or first principles; see Grimmelius et al. 1999), statistical, and artificial intelligence (AI) approaches.
Machinery type: diesel engine or marine diesel engine
CM technique: oil, vibration, acoustic emission (AE), others (including multi-sensors)
Modeling technique: model-based, statistical, AI, others
Focus: data processing and modeling, development of sensors, and development of a CBM system
The relevant literature is summarized in Table 22.1. From the table, we can draw
the following observations:
1. CM technique: among the 23 references, 10 deal with oil analysis, 5 with
vibration analysis, 4 with AE analysis, and 7 with other techniques (mainly multi-sensor techniques). This implies that oil analysis is the most
widely used CM technique for diesel engines.
2. Analysis and modeling technique: 4 references deal with model-based
approach, 9 with statistical approach, 10 with AI approach, and 3 with
other approaches (mainly integrated approach). This implies that statistical
and AI approaches are widely used analysis techniques for CM data of
diesel engines.
3. Application type: 15 references deal with data processing and/or modeling,
4 with development of on-line sensors or measurement systems, and 4 with
development and/or application of integrated CBM systems. This implies
that data processing and modeling plays a key role in a CBM program.
Table 22.1. Summary of literature in CBM of diesel engines

| Reference | Machinery | CM technique | Modeling technique | Focus |
| Anderson et al. (1983) | — | Quantitative analytical ferrography | — | Development of a standard ferrography analysis procedure; evaluation of a high gradient magnetic separator |
| Douglas et al. (2006) | Diesel engines | Acoustic emission | Statistical, AE energy | Identification of AE signals of ring/liner |
| Gorin and Shay (1997) | — | Oil analysis | — | Development of onboard oil analysis meters |
| Grimmelius et al. (1999) | Marine diesel engines | Torsional vibration of crank shaft | First principles, feature extraction, neural networks | Demonstration of modeling techniques through two cases |
| Hargis et al. (1982) | Marine diesel engines | Oil, ferrography | — | — |
| Hofmann (1987) | Ship main engine | Vibration analysis | — | Vibration monitoring program for preventive maintenance onboard |
| Hojen-Sorensen et al. (2000) | Marine diesel engines | Vibration, acoustic emission | Neural network, discriminant methods, hidden Markov decision trees | On-line classification scheme |
| Hountalas and Kouremenos (1999) | — | Thermodynamics | Model-based, simulation model | Automatic troubleshooting method |
| Hubert et al. (1983) | Cummins VT-903 diesel engine | Ferrography | Model-based approach | Development of a testing methodology for determining wear particle generation rates and filter efficiencies |
| Jakopovic and Bozicevic (1991) | Marine engine | Oil analysis | Expert system, theory of fuzzy sets | — |
| Jardine et al. (1989) | Diesel engine | Metal concentration of engine oil | Proportional hazards model | Fit proportional hazards model to oil analysis data |
| Johnson and Hubert (1983) | Medium duty truck engine | Analytical ferrography | Statistical | Evaluation of the particle generation rate and the filtering efficiency |
| Liu et al. (2000) | Marine diesel engines | Ferrograph, grid capacitance and photoelectric sensors | — | On-line wear condition monitoring system |
| Logan (2005) | — | — | — | Intelligent diagnostic software agents operating in real time onboard naval ships |
| Pontoppidan and Larsen (2003) | Marine diesel engines | Acoustic emission | Independent component analysis | Detection of condition changes |
| Priha (1991) | Marine diesel engines | — | Discriminant score plotting technique | Identification of normal and abnormal states, condition-based inspection |
| Scherer et al. (2004) | Diesel engines | Oil | — | Development and application of a prototype of an oil condition sensor |
| Sharkey (2001) | — | Vibration, AE, cylinder pressure | Neural network | Decision fusion through a multinet system |
| Sun et al. (1996) | Diesel engines | Oil analysis | Artificial neural network | Application of multisensor fusion technology |
| Tang et al. (1998) | Marine diesel engine | Temperature, pressure, combustion air flow | Fuzzy neural network, combustion simulation model | Condition monitoring system |
| Wang and Wang (2000) | Diesel engine | — | Evidence theory, decision-layer multisensor data fusion | Approach to diagnose multiple faults of a working diesel engine |
| Wu et al. (2001) | Diesel engine | Vibration | — | — |
| Zhang et al. (2003) | Marine diesel engines | Oil spectrometric analysis | Grey system theory | Determination of the turning point |
2. There exists a close relation between oil degradation and abnormal wear. It was observed that the concentration of wear particles increases as viscosity decreases and the contaminant index increases.
3. There exist some differences among the outcomes provided by different analysis techniques; sometimes the outcomes are in disagreement.
Table 22.2. Spectrometric oil analysis data and system states

j     State   Fe      Cr     Ni     Mn     Al     Cu      Pb      Si
1     1       52.18   2.95   2.66   2.36   8.70   10.98   13.29   5.32
2     1       52.73   3.25   2.55   1.78   8.04   8.93    9.65    5.46
3     1       35.31   0.95   0.68   1.26   5.57   4.33    6.23    4.57
4     1       32.20   1.35   1.17   1.03   5.70   3.57    5.89    4.56
5     1       82.87   4.74   2.61   1.85   9.85   13.34   17.10   7.22
6     1       48.22   2.17   1.94   1.37   7.08   6.82    8.05    4.88
7     1       30.78   1.03   0.00   1.15   4.71   4.18    5.94    4.00
8     1       37.99   1.30   0.00   1.07   6.07   4.52    5.87    3.90
9     1       39.51   1.39   0.41   1.04   7.24   3.17    5.51    7.15
10    1       33.47   1.06   0.35   0.86   6.77   2.87    4.95    7.18
11    1       36.50   2.17   1.05   1.61   7.73   3.88    6.29    7.61
12    1       35.03   1.73   0.57   1.30   7.68   3.47    5.11    7.43
Mean          43.07   2.01   1.17   1.39   7.10   5.84    7.82    5.77
13    0       28.20   0.39   0.00   0.72   4.04   2.71    3.70    3.96
14    0       27.02   0.79   0.40   0.87   4.09   3.24    5.02    5.71
15    0       25.66   0.43   0.00   0.69   3.64   2.65    3.57    5.29
16    0       22.25   0.50   0.18   0.50   3.94   2.15    3.80    5.50
17    0       30.72   1.28   0.64   1.09   5.09   4.15    6.63    3.99
18    0       29.40   0.58   0.00   1.01   4.30   4.20    4.92    3.65
19    0       29.17   0.47   0.00   0.97   4.12   3.67    4.73    3.67
20    0       31.45   1.10   0.00   1.12   4.73   4.27    5.96    4.01
21    0       30.04   0.43   0.00   1.02   4.16   3.91    5.30    3.90
22    0       29.48   0.66   0.00   0.91   4.49   3.58    4.79    3.85
23    0       25.97   0.34   0.00   0.68   3.69   2.56    3.71    4.03
24    0       42.05   2.34   1.98   1.94   7.75   11.12   12.78   5.93
25    0       43.16   2.10   2.16   1.92   7.41   10.64   13.25   5.67
26    0       23.38   0.96   0.31   0.58   4.63   2.13    3.21    6.74
27    0       29.16   0.62   1.07   0.91   3.23   2.95    4.93    8.52
28    0       22.82   0.66   0.31   0.91   4.52   2.06    3.66    6.35
approach to model the data. Compared with the previous approaches, it is more straightforward and comprehensive.
22.4.1 Correlation Analysis
The correlation coefficient r is a measure of strength of linear relationship between
two variables (Blischke and Murthy 2000, p 367). In this section we conduct a correlation analysis to (a) examine whether the correlations between the elements depend on the system state, and (b) identify strongly correlated elements so that some of them can be removed from further analysis.
For the former issue we examine the correlation coefficient matrices for both
States 0 and 1. The correlation coefficient matrix associated with State 1 can be
obtained from the first 12 rows of Table 22.2. The first figure of each entry in
Table 22.3 gives the correlation coefficient for this case. Similarly, the correlation
coefficient matrix associated with State 0 can be obtained from the last 16 rows of
Table 22.2. The second figure of each entry in Table 22.3 gives the correlation
coefficient for this case.
As can be seen from the table, the two figures of each entry are close to each other, with an average relative error of about 10%, except those in the column corresponding to Si. Thus, we can roughly assume that the correlation is state-independent. In the following discussion, the correlation coefficients referred to are the first figure of each entry in Table 22.3.
To judge the significance of a linear correlation, we need to determine a critical value for the correlation coefficient r. According to Fisher (1970), given a correlation coefficient r, the significance of the linear correlation of two variables can be tested by transforming the correlation coefficient to a Student's t-value using the following statistic:

t = r / √(1 − r²).    (22.1)

The critical value of t associated with the 95% level, one tail, and 12 − 1 = 11 degrees of freedom is 1.7959. This implies that the critical value of r is 0.8737. Namely, the linear relation between two variables is significant in this application if their correlation coefficient is larger than 0.8737.
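Equation 22.1 can be inverted to recover the critical correlation coefficient from the critical t-value, r = t/√(1 + t²). A quick numerical check of the chapter's figures (the function name is illustrative):

```python
import math

def critical_r(t_crit):
    # Invert Equation 22.1, t = r / sqrt(1 - r^2):  r = t / sqrt(1 + t^2)
    return t_crit / math.sqrt(1.0 + t_crit ** 2)

# t(95% level, one tail, 11 degrees of freedom) = 1.7959, as quoted in the text
r_crit = critical_r(1.7959)  # approximately 0.8737
```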
As can be seen from Table 22.3, there are eight correlation coefficients that are
larger than 0.8737. They are:
r(Cu, Pb) = 0.98, r(Fe, Cr) = 0.95, r(Fe, Pb) = 0.94, r(Fe, Cu) = 0.93,
r(Cr, Cu) = r(Cr, Pb) = 0.92, r(Cr, Al) = r(Cu, Ni) = 0.88.
(22.2)
Equation 22.2 involves six elements: Fe, Cr, Ni, Al, Cu, and Pb. Among them, Ni and Al appear only once and the corresponding correlation coefficients (= 0.88) are very close to the critical value. Thus, we may classify the elements into three groups:
Table 22.3. Correlation coefficients (first figure: State 1; second figure: State 0)

      Cr          Ni          Mn          Al          Cu          Pb          Si
Fe    0.95/0.84   0.78/0.80   0.66/0.96   0.81/0.85   0.93/0.96   0.94/0.96   0.24/-0.04
Cr                0.87/0.88   0.78/0.89   0.88/0.96   0.92/0.91   0.92/0.93   0.32/0.23
Ni                            0.84/0.83   0.75/0.82   0.88/0.85   0.84/0.89   0.11/0.50
Mn                                        0.73/0.90   0.84/0.97   0.81/0.97   0.11/0.05
Al                                                    0.74/0.93   0.76/0.93   0.64/0.06
Cu                                                                0.98/0.99   0.02/0.05
Pb                                                                            0.12/0.10
When two variables are strongly correlated and their means differ significantly, one can ignore the variable with the smaller mean and simply use the one with the larger mean. Using this reasoning, we may delete some of the elements in the strong-correlation group. Consider the first three correlation coefficients of Equation 22.2, which have the largest r values. According to the first correlation coefficient and the means given in Table 22.2, Cu may be deleted. Similarly, Cr and Pb may be deleted based on the second and third correlation coefficients, respectively. As a result, only five elements (Fe, Ni, Mn, Al and Si) are retained for further analysis.
A physical interpretation of the correlation in this case study is that the wear debris may not be pure metal and can come from different parts. Its mathematical interpretation is that an increase or decrease of the readings of one element implies a possible increase or decrease (decrease or increase) of the readings of a positively (negatively) correlated element. When the absolute value of the readings is very small, e.g. some of the readings of Si, the correlation should be considered insignificant.
22.4.2 State Discrimination Capability of Condition Variables
Each condition variable contributes partial information for identifying the state of
the monitored system. By quantitatively examining their contributions, we can identify those variables which carry more state information. This study develops a method to quantitatively evaluate the contributions of the condition variables. It starts
with building the marginal distributions associated with the abnormal and normal
states for each condition variable.
22.4.2.1 Marginal Distribution Associated with State 1
We use index i, 1 ≤ i ≤ 5, to denote the elements (Fe, Ni, Mn, Al, Si), respectively. For a given element i, the entries of the corresponding column in Table 22.2 form a censored sample, denoted {xij, j = 1, 2, …, 28}. Assume that Xi follows a certain distribution F1(i)(x). The data associated with State 0 can be viewed as right-censored. Namely, if the observed value associated with State 0 is xij+, then the corresponding value of x associated with State 1 meets the relation x > xij+. Its likelihood function is given by 1 − F1(i)(xij+). The overall maximum likelihood function is given by

L = ∏_{j=1}^{12} f1(i)(xij) · ∏_{j=13}^{28} [1 − F1(i)(xij+)].    (22.3)
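The likelihood in Equation 22.3 can be maximized numerically. The sketch below fits a lognormal marginal to the Fe column of Table 22.2 by a crude grid search, treating the 16 State-0 readings as right-censored. The lognormal choice for Fe and the grid-search optimizer are assumptions for illustration, so the estimates need not reproduce the chapter's fitted values exactly:

```python
import math

# Fe readings from Table 22.2: 12 State-1 values (uncensored) and
# 16 State-0 values (right-censored in the sense of Equation 22.3)
x1 = [52.18, 52.73, 35.31, 32.20, 82.87, 48.22, 30.78, 37.99, 39.51, 33.47, 36.50, 35.03]
x0 = [28.20, 27.02, 25.66, 22.25, 30.72, 29.40, 29.17, 31.45,
      30.04, 29.48, 25.97, 42.05, 43.16, 23.38, 29.16, 22.82]

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_lik(mu, sigma):
    """Log of Equation 22.3 for a lognormal F1: log-pdf terms for the
    uncensored points plus log(1 - F1(x)) terms for the censored points."""
    ll = 0.0
    for x in x1:
        z = (math.log(x) - mu) / sigma
        ll += -math.log(x * sigma * math.sqrt(2 * math.pi)) - 0.5 * z * z
    for x in x0:
        z = (math.log(x) - mu) / sigma
        ll += math.log(max(1.0 - norm_cdf(z), 1e-300))
    return ll

# Crude grid search for the maximum likelihood estimates
best = max((log_lik(m / 1000, s / 1000), m / 1000, s / 1000)
           for m in range(3500, 4001, 5) for s in range(100, 351, 5))
_, mu_hat, sigma_hat = best
```

A real implementation would use a proper optimizer rather than a grid, but the structure of the objective function is the same.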
The truncated Gumbel distribution has cdf

F(x) = 1 − exp[−e^{(x−μ)/σ}] exp(e^{−μ/σ}),  x ≥ 0.    (22.4)
[Figure 22.1. WPP plots of the data of Fe, Ni, Mn, Al and Si]
546
Their WPP plots are shown in Figure 22.2. Clearly, for each WPP plot of data in
Figure 22.1 one can find a shape that matches one of the WPP plots in Figure 22.2.
Thus, an appropriate model can be found from these three models for each variable.
Once the model type is determined, the maximum likelihood method can be used to obtain the estimates of the model parameters. The estimated parameters, (μ1(i), σ1(i)), are shown in Table 22.4.
In later analysis, we need to know the means and variances of the fitted
marginal distributions. For the truncated normal distribution, the mean and variance are given by
m = μ + σ φ(μ/σ)/[1 − Φ(−μ/σ)],  V = σ² − m(m − μ),    (22.5)
where μ and σ are the model parameters, and φ(·) and Φ(·) are the pdf and cdf of the standard normal distribution, respectively. For the truncated Gumbel distribution, the mean and variance are given by
m = μ + σ exp(e^{−μ/σ}) I1,  V = (m − μ)[(μ − m) + σ I2/I1],    (22.6)
where
I1 = ∫_{e^{−μ/σ}}^{∞} ln(s) e^{−s} ds,  I2 = ∫_{e^{−μ/σ}}^{∞} ln²(s) e^{−s} ds.    (22.7)
(22.8)
Figure 22.2. WPP plots of truncated normal, lognormal, and truncated Gumbel distributions
Table 22.4. Fitted marginal distributions and state discrimination results

                i = 1, Fe    i = 2, Ni    i = 3, Mn    i = 4, Al    i = 5, Si
Model           Lognormal    Truncated    Truncated    Truncated    Lognormal
                             Gumbel       normal       normal
μ1(i)           3.7554       2.0401       1.5644       7.3574       1.8317
σ1(i)           0.1897       0.7985       0.4440       1.3679       0.1964
m1(i)           43.5269      1.7723       1.5648       7.3574       6.3660
√V1(i)          8.3335       1.1040       0.4433       1.3679       1.2624
μ0(i)           3.3532       1.0286       0.0548       4.4977       1.5194
σ0(i)           0.1243       0.6361       0.8948       1.1658       0.1880
m0(i)           28.8153      0.9530       0.7342       4.4980       4.6511
√V0(i)          3.5965       0.8064       0.5494       1.1653       0.8820
xc(i)           34.3485      1.5703       1.0493       5.9022       5.3512
Err1(i)         0.0701       0.1171       0.2540       0.1142       0.2005
Err2(i)         0.1244       0.3797       0.1230       0.1437       0.2159
P(xc(i))        0.0973       0.2484       0.1885       0.1289       0.2082
Rank            1            5            3            2            4

22.4.2.2 Marginal Distribution Associated with State 0
The overall maximum likelihood function is given by

L = ∏_{j=13}^{28} f0(i)(xij) · ∏_{j=1}^{12} F0(i)(xij−).    (22.9)
A careful examination has been carried out to determine the model type of F0(i)(x). We found that F0(i)(x) has the same model type as F1(i)(x). The maximum likelihood estimates of the model parameters, (μ0(i), σ0(i)), are also shown in Table 22.4.
22.4.2.3 Critical Value Between States 0 and 1
For a given element i and observation value xi, we need to establish a state discrimination criterion based on a certain critical value xc(i). Namely, we classify the state as normal if xi < xc(i); otherwise, as abnormal, and accordingly initiate an alarm. The two types of misjudgment error and the overall misjudgment probability are given by

Err1(i) = 1 − F0(i)(xc(i)),    (22.10)

Err2(i) = F1(i)(xc(i)),    (22.11)

P(xc(i)) = [Err1(i) + Err2(i)]/2.    (22.12)

[Figure: pdfs f0(x) and f1(x) of the two states, with the alarm threshold xa and the critical value xc marked]

The critical value is taken as the point where the two pdfs cross:

f1(i)(xc(i)) = f0(i)(xc(i)).    (22.13)
The specific values of the relevant parameters (Err1(i), Err2(i), P(xc(i)), xc(i)) for each element are shown in Table 22.4.
22.4.2.4 Discussion
P( xc(i ) ) is a measure of misjudgment probability. The smaller it is, the better is the
discrimination capability of variable i, namely, the variable contains more state
information. Using it as an importance criterion, we can rank the condition variables. The last row of Table 22.4 shows the rank number of each variable.
As can be seen from the table, Fe has the best discrimination capability. This is
consistent with the result of correlation analysis, which shows that it is highly
correlated with (Cr, Cu, Pb). Namely, the concentration of Fe comprehensively
reflects the concentrations of Cr, Cu, Pb and itself, and hence the reading of Fe
reflects the wear state to a great extent. The second most significant element is Al.
This also appears reasonable since debris of Al and Cr (the latter is reflected by Fe)
mainly comes from piston and piston rings, which are the main wear parts. Mn and
Si have almost the same discrimination capability. This appears reasonable due to
their independence. Finally, it is noted that Ni has the worst state discrimination
capability. This can be explained by the dispersion of its readings (see Table 22.2),
and the fact that the wear of the transmission gears may not be a major problem.
22.4.3 Construction of a Multivariate Control Chart
A multivariate control chart can intuitively display the results of condition monitoring and the evolution trend. It is therefore especially important to set an alarm threshold and an abnormal threshold. Usually, the thresholds are optimized in a CBM model. Here, our focus is on the construction of such a control chart, and hence we only present a simple method to set the thresholds when the optimal thresholds are unavailable.
We define xc(i) as the abnormal threshold, and define the alarm threshold xa(i) by

F1(i)(xa(i)) = 5%.    (22.14)
The control chart has the following features:
- It is displayed in the x–y plane, with the elements ordered from the most important to the least important
- The abnormal thresholds are normalized to 1 for all the elements
- The alarm thresholds are transformed to the same value, δ, for all the elements
- The overall state is represented along the y-axis
550
To achieve the second and third features, we use the following relation to
transform an observed concentration xi into a normalized concentration yi without
changing the relative magnitude of the original readings:
yi = ai + bi xi , bi > 0.
(22.15)
1 = ai + bi xc(i),  δ = ai + bi xa(i).    (22.16)

Let

ai = [δ xc(i) − xa(i)]/[xc(i) − xa(i)],  bi = (1 − δ)/[xc(i) − xa(i)],    (22.17)

and let

Σ_{i=1}^{5} ai = 0    (22.18)

so as to decrease the influence of the constant terms in Equation 22.15. This yields

δ = Σ_{i=1}^{5} [xa(i)/(xc(i) − xa(i))] / Σ_{i=1}^{5} [xc(i)/(xc(i) − xa(i))].    (22.19)
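Given the abnormal thresholds xc(i) and alarm thresholds xa(i), the transformation coefficients and the common alarm level δ can be computed directly. The following sketch uses the threshold values tabulated in this chapter; variable names are illustrative:

```python
# Abnormal thresholds xc (Table 22.4) and alarm thresholds xa (Table 22.5)
# in the order Fe, Ni, Mn, Al, Si
xc = [34.3485, 1.5703, 1.0493, 5.9022, 5.3512]
xa = [31.2898, 0.4048, 0.8342, 5.1075, 4.5206]

# With b_i = (1 - delta)/(xc_i - xa_i) and a_i = 1 - b_i*xc_i, the constraint
# sum(a_i) = 0 (Equation 22.18) gives delta in closed form (Equation 22.19)
num = sum(av / (cv - av) for cv, av in zip(xc, xa))
den = sum(cv / (cv - av) for cv, av in zip(xc, xa))
delta = num / den

b = [(1.0 - delta) / (cv - av) for cv, av in zip(xc, xa)]
a = [1.0 - bv * cv for bv, cv in zip(b, xc)]
```

With these numbers δ comes out at about 0.84, and each pair (ai, bi) maps xc(i) to 1 and xa(i) to δ, as Equation 22.16 requires.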
Table 22.5. Alarm thresholds and transformation coefficients

            i = 1, Fe   i = 2, Ni   i = 3, Mn   i = 4, Al   i = 5, Si
xa(i)       31.2898     0.4048      0.8342      5.1075      4.5206
ai          -0.7925     0.7849      0.2215      -0.1855     -0.0284
bi          0.0522      0.1370      0.7419      0.2008      0.1922
combined scale is expected to have better failure (or abnormal state) prediction
capability than individual scales. Two typical models are the linear and multiplicative ones. Their parameters are determined by minimizing the sample coefficient
of variation (CV) of the composite scale. The minimum CV approach is hard to
apply in the presence of censored data. In this context, Jiang and Jardine (2006)
propose a simple method to estimate the model parameters in the presence of
censored data. The method transforms censored data into complete data by adding
a mean residual value to a censored datum for each scale. Such a new data set, thus
obtained, is called an equivalent complete data set and will be used for the parameter estimation using the minimal CV approach under the assumption that the
transformation does not significantly impact the composite scale model to be built.
They also conclude that a small value of CV is a necessary but insufficient condition of a good prediction capability of failure for the composite scale model.
Therefore, they consider more than one alternative model, use the minimum CV
method to estimate the parameters of the alternative models, and determine the best
model based on the prediction capability of the models.
[Figure 22.3. The multivariate control chart: rescaled concentrations of Fe, Al, Si, Mn and Ni, with the alarm and abnormal thresholds and the overall state; observations No. 12 and No. 13 are marked]
Under these assumptions, the misjudgment probability can be directly represented by a function of the parameters of the composite scale and the means and
variances of the condition variables. As a result, the parameters of the composite
scale and the misjudgment probability can be simultaneously determined by minimizing the misjudgment probability. The critical value is then established using the
approach presented in Section 22.4.2.
22.4.4.1 Determination of a Composite Scale
Consider the following linear model:
y = Σ_{i=1}^{5} ci xi,  Σ_{i=1}^{5} ci = 1.    (22.20)
If we want to exclude a certain variable, say xk, from the model, we just need to set
ck = 0.
According to the above assumptions, the composite scale, Y, is a normal random variable. For State 1, the mean and variance of Y are given by
m1 = Σ_{i=1}^{5} ci m1(i),  V1 = Σ_{i=1}^{5} ci² V1(i).    (22.21)

Similarly, for State 0,

m0 = Σ_{i=1}^{5} ci m0(i),  V0 = Σ_{i=1}^{5} ci² V0(i).    (22.22)
According to Equation 22.13, the critical value of the composite scale, yc, meets
the following relation:
φ((yc − m0)/√V0)/√V0 = φ((yc − m1)/√V1)/√V1.    (22.23)

Solving this equation for yc yields

yc = m1 + √V1 [√(d² + 2(s² − 1) ln(s)) − s d]/(s² − 1),    (22.24)

where

s = √(V1/V0),  d = (m1 − m0)/√V0.    (22.25)

The overall misjudgment probability is

P(yc) = [1 − Φ((yc − m0)/√V0) + Φ((yc − m1)/√V1)]/2.    (22.26)
Since m0, V0, m1, and V1 are functions of the decision variables {ci}, P(yc) is a
function of {ci}. As a result, {ci} can be optimally determined by directly minimizing P(yc).
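Equations 22.21–22.26 can be verified numerically for one candidate model. The sketch below evaluates yc and P(yc) for the model using (Fe, Al, Si), with the marginal variances taken as the squares of the tabulated √V values (this reading of Table 22.4 is an assumption):

```python
import math

def npdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def ncdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Model using (Fe, Al, Si): coefficients from Table 22.6; marginal means and
# standard deviations (the tabulated sqrt(V) values) from Table 22.4
c = [0.1149, 0.4739, 0.4112]
m1v, s1v = [43.5269, 7.3574, 6.3660], [8.3335, 1.3679, 1.2624]
m0v, s0v = [28.8153, 4.4980, 4.6511], [3.5965, 1.1653, 0.8820]

m1 = sum(ci * mi for ci, mi in zip(c, m1v))            # Equation 22.21
V1 = sum(ci ** 2 * si ** 2 for ci, si in zip(c, s1v))
m0 = sum(ci * mi for ci, mi in zip(c, m0v))            # Equation 22.22
V0 = sum(ci ** 2 * si ** 2 for ci, si in zip(c, s0v))

s = math.sqrt(V1 / V0)                                 # Equation 22.25
d = (m1 - m0) / math.sqrt(V0)
yc = m1 + math.sqrt(V1) * (math.sqrt(d * d + 2 * (s * s - 1) * math.log(s))
                           - s * d) / (s * s - 1)      # Equation 22.24
P = (1 - ncdf((yc - m0) / math.sqrt(V0))
     + ncdf((yc - m1) / math.sqrt(V1))) / 2            # Equation 22.26

# yc should sit where the two normal densities cross (Equation 22.23)
gap = abs(npdf((yc - m0) / math.sqrt(V0)) / math.sqrt(V0)
          - npdf((yc - m1) / math.sqrt(V1)) / math.sqrt(V1))
```

Under these inputs the computation reproduces the chapter's figures for this model (yc near 8.91 and P(yc) near 0.032), and the same routine could be wrapped in an optimizer over {ci}.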
22.4.4.2 Candidate Linear Models
By considering all linear models that include at least three variables, we have ten three-parameter models, five four-parameter models, and one five-parameter model. If we always include the two most important elements, Fe and Al, in all the models, then we need to consider only three three-parameter models, three four-parameter models, and one five-parameter model. We take the latter approach.
We first consider the five-parameter model. Using the approach outlined in
Section 22.4.4.1, we obtained the model parameters and objective function value
shown in the second row of Table 22.6. The third row of the table shows the values
of ci m1(i ) , which reflects the contribution of each element to the composite scale.
The larger it is, the more important is the element. Based on this criterion, we
rerank the elements and the results are shown in the fourth row. Comparing these
results with those shown in Table 22.4, we can find that the ranks are basically
consistent except that the positions of Mn and Si are exchanged.
By eliminating one of (Ni, Mn, Si), we obtain three four-parameter models, whose parameters and objective function values are shown in rows 5–7 of Table 22.6. Similarly, by eliminating two of (Ni, Mn, Si), we obtain three three-parameter models, whose parameters and objective function values are shown in rows 8–10 of Table 22.6.
As can be seen from rows 5–10 of the table, the objective function value obtained from a model excluding a less important element is smaller than that obtained from a model excluding a more important element. This confirms the reasonableness of the new ranking.
Table 22.6. Parameters of composite scale models

Model No.    c1, Fe   c2, Ni   c3, Mn   c4, Al   c5, Si   P(yc)    rI
1            0.0529   0.1182   0.4009   0.2310   0.1969   0.0201   12.4486
ci m1(i)     2.3039   0.2096   0.6274   1.6994   1.2534
Rank         1        5        4        2        3
2            0.0602   0        0.4542   0.2620   0.2235   0.0224   14.8925
3            0.0917   0.1974   0        0.3810   0.3299   0.0290   11.5072
4            0.0664   0.1476   0.4981   0.2879   0        0.0294   11.3420
5            0.1149   0        0        0.4739   0.4112   0.0323   15.4769
6            0.0782   0        0.5837   0.3381   0        0.0328   15.2289
7            0.1394   0.2948   0        0.5658   0        0.0425   11.7512
(22.27)
rI = 1/[(n − 1) P(yc)],    (22.28)

where n is the number of variables included in the model.
It comprehensively reflects the above two requirements. A large value for rI implies a better model. We use this criterion to select the best model. The last column
of Table 22.6 shows the values of rI. As can be seen from the table, the best model
is the three-parameter model that includes the three important elements (Fe, Al, Si).
Also to be noted is that the second best model is the three-parameter model that
includes the elements (Fe, Al, Mn). Once more, it shows that Mn and Si have almost the same importance as indicated in the correlation analysis.
22.4.4.4 Rescaling of the Best Model
To display the state discrimination result on the control chart, we normalize the state critical value yc (= 8.9081) to 1. To do so, all the coefficients of the composite condition variable are divided by yc. Similarly, we may set an alarm threshold for y
as below:

yα = m1 + √V1 Φ⁻¹(α).    (22.29)

In the current case, we take α = 1%. This yields y0.01 = 8.1536. The rescaled alarm threshold for the composite scale equals 0.9156, which is not equal to the rescaled alarm threshold (= δ) for the elements; see Figure 22.4.
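The alarm threshold in Equation 22.29 behaves like a lower percentile of the abnormal-state distribution of the composite scale. A sketch using Python's statistics.NormalDist follows; the mean and standard deviation below are reconstructed from the tabulated marginal statistics, so small rounding differences from the chapter's 8.1536 are expected:

```python
from statistics import NormalDist

m1, sd1 = 11.1058, 1.2675   # composite-scale State-1 mean and std (reconstructed)
yc = 8.9081                 # state critical value quoted in the chapter

y_alarm = NormalDist(m1, sd1).inv_cdf(0.01)  # y with P(Y < y | State 1) = 1%
rescaled_alarm = y_alarm / yc                # alarm level after normalizing yc to 1
```

The rescaled value lands near 0.916, matching the 0.9156 quoted in the text.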
4. The composite scale modeling approach based on minimizing the misjudgment probability is a useful technique to combine multiple variables.
The proposed information criterion for selecting the best model appears reasonable.
Some issues that need to be considered in the future are as follows:
1. Some additional work is needed to validate the proposed model. This can be done by examining the agreement between the model prediction results and the actual observations in the field.
2. The alarm threshold and oil sampling interval can be optimized so as to
obtain a balance between the acquired information and the effort involved.
3. To provide a more accurate assessment of engine condition, it appears necessary to use multiple monitoring techniques. Thus, fusion of multi-sensor data and aggregation of multi-state measures are important topics that need further study.
4. An optimal maintenance decision model and a computerized implementation software package need to be developed to promote greater use of this approach in industry.
22.6 Acknowledgement
The authors wish to thank Prof. D.N.P. Murthy for his constructive comments on
an earlier version of this chapter.
22.7 References
Anderson DN, Hubert CJ, Johnson JH, (1983) Advances in quantitative analytical ferrography and the evaluation of a high gradient magnetic separator for the study of diesel engine wear. Wear 90(2): 297–333
Blischke WR, Murthy DNP, (2000) Reliability: modeling, prediction, and optimization. John Wiley, New York
Douglas RM, Steel JA, Reuben RL, (2006) A study of the tribological behaviour of piston ring/cylinder liner interaction in diesel engines using acoustic emission. Tribology International 39(12): 1634–1642
Fisher RA, (1970) Statistical methods for research workers. Oliver and Boyd, Edinburgh
Goode KB, Moore J, Roylance BJ, (2000) Plant machinery working life prediction method utilizing reliability and condition-monitoring data. Proceedings of the Institution of Mechanical Engineers, Part E: Journal of Process Mechanical Engineering 214: 109–122
Gorin N, Shay G, (1997) Diesel lubricant monitoring with new-concept shipboard test equipment. TriboTest 3(4): 415–430
Grimmelius HT, Meiler PP, Maas HLMM, Bonnier B, Grevink JS, van Kuilenburg RF, (1999) Three state-of-the-art methods for condition monitoring. IEEE Transactions on Industrial Electronics 46(2): 407–416
Hargis SC, Taylor H, Gozzo JS, (1982) Condition monitoring of marine diesel engines through ferrographic oil analysis. Wear 90(2): 225–238
Hofmann SL, (1987) Vibration analysis for preventive maintenance: a classical case history. Marine Technology 24(4): 332–339
Hojen-Sorensen PAdFR, de Freitas N, Fog T, (2000) On-line probabilistic classification with particle filters. Proceedings of the 2000 IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing X, 1: 386–395
Hountalas DT, Kouremenos AD, (1999) Development and application of a fully automatic troubleshooting method for large marine diesel engines. Applied Thermal Engineering 19(3): 299–324
Hubert CJ, Beck JW, Johnson JH, (1983) A model and the methodology for determining wear particle generation rate and filter efficiency in a diesel engine using ferrography. Wear 90(2): 335–379
Jakopovic J, Bozicevic J, (1991) Approximate knowledge in LEXIT, an expert system for assessing marine lubricant quality and diagnosing engine failures. Computers in Industry 17(1): 43–47
Jardine AKS, Ralston P, Reid N, Stafford J, (1989) Proportional hazards analysis of diesel engine failure data. Quality and Reliability Engineering International 5(3): 207–216
Jardine AKS, Lin D, Banjevic D, (2006) A review on machinery diagnostics and prognostics implementing condition-based maintenance. Mechanical Systems and Signal Processing 20(7): 1483–1510
Jiang R, Jardine AKS, (2006) Composite scale modeling in the presence of censored data. Reliability Engineering and System Safety 91(7): 756–764
Johnson JH, Hubert CJ, (1983) An overview of recent advances in quantitative ferrography as applied to diesel engines. Wear 90(2): 199–219
Liu Y, Liu Z, Xie Y, Yao Z, (2000) Research on an on-line wear condition monitoring system for marine diesel engine. Tribology International 33(12): 829–835
Logan KP, (2005) Operational experience with intelligent software agents for shipboard diesel and gas turbine engine health monitoring. 2005 IEEE Electric Ship Technologies Symposium: 184–194
Lu S, Lu H, Kolarik WJ, (2001) Multivariate performance reliability prediction in real-time. Reliability Engineering and System Safety 72: 39–45
Moubray J, (1997) Reliability-centred maintenance. Butterworth-Heinemann, Oxford
Murthy DNP, Xie M, Jiang R, (2003) Weibull models. Wiley
Pontoppidan NH, Larsen J, (2003) Unsupervised condition change detection in large diesel engines. 2003 IEEE XIII Workshop on Neural Networks for Signal Processing: 565–574
Priha I, (1991) FAKS, an on-line expert system based on hyperobjects. Expert Systems with Applications 3(2): 207–217
Raadnui S, Roylance BJ, (1995) Classification of wear particle shape. Lubrication Engineering 51(5): 432–437
Roylance BJ, Albidewi IA, Laghari MS, Luxmoore AR, Deravi F, (1994) Computer-aided vision engineering (CAVE): quantification of wear particle morphology. Lubrication Engineering 50(2): 111–116
Roylance BJ, Raadnui S, (1994) Morphological attributes of wear particles: their role in identifying wear mechanisms. Wear 175(1–2): 115–121
Saranga H, (2002) Relevant condition-parameter strategy for an effective condition-based maintenance. Journal of Quality in Maintenance Engineering 8(1): 92–105
Scherer M, Arndt M, Bertrand P, Jakoby B, (2004) Fluid condition monitoring sensors for diesel engine control. Proceedings of IEEE Sensors 2004, 1: 459–462
Sharkey AJC, (2001) Condition monitoring, diesel engines, and intelligent sensor processing. Proceedings of the DERA/IEE Workshop on Intelligent Sensor Processing: 1/1–1/6
Sun C, Pan X, Li X, (1996) The application of multisensor fusion technology in diesel engine oil analysis. Proceedings of the 3rd International Conference on Signal Processing, 2: 1695–1698
Tang T, Zhu Y, Li J, Chen B, Lin R, (1998) A fuzzy and neural network integrated intelligence approach for fault diagnosing and monitoring. UKACC International Conference on Control, 2: 975–980
Wang HF, Wang JP, (2000) Fault diagnosis theory: method and application based on multisensor data fusion. Journal of Testing and Evaluation 28(6): 513–518
Wu X, Chen J, Wang W, Zhou Y, (2001) Multi-index fusion-based fault diagnosis theories and methods. Mechanical Systems and Signal Processing 15(5): 995–1006
Zhang H, Li Z, Chen Z, (2003) Application of grey modeling method to fitting and forecasting wear trend of marine diesel engines. Tribology International 36(10): 753–756
Zhao C, Yan X, Zhao X, Xiao H, (2003) The prediction of wear model based on stepwise pluralistic regression. In: Proceedings of the International Conference on Intelligent Maintenance Systems (IMS), Xi'an, China: 66–72
23
Benchmarking of the Maintenance Process at
Banverket (The Swedish National Rail Administration)
Ulla Espling and Uday Kumar
23.1 Introduction
To sustain a competitive edge in business, railway companies all over the world are looking for ways and means to improve their maintenance performance. Benchmarking is a very effective tool that can assist management in its pursuit of continuous improvement of operations. The benefits are many, as benchmarking helps develop realistic goals and strategic targets and facilitates the achievement of excellence in operation and maintenance (Almdal 1994).
In this chapter three different benchmarking studies are presented: (1) benchmarking of the maintenance process for cross-border operations, (2) a study of the effectiveness of outsourcing of the maintenance process by different track regions in Sweden, and (3) a study of the level of transparency among the European railway administrations. In these case studies the focus is on railway infrastructure, excluding the rolling stock. The outline of the chapter is as follows. An overview of Swedish railway operation is presented in Section 23.2. The definition and methodology of benchmarking in general are discussed in Section 23.3. The special demands for benchmarking of maintenance are described in Section 23.4, and in Section 23.5 the special considerations arising from the railway context are reviewed, both generally for railways and in more detail for the Swedish context. The case studies are discussed in Sections 23.6–23.8. The discussion and conclusions are presented in Sections 23.9 and 23.10, respectively.
All the data pertinent to benchmarking of railway operation and maintenance were retrieved, classified and analyzed in close cooperation with operation and maintenance personnel from both infrastructure owners and maintenance contractors. The chapter discusses the pros and cons, the areas for improvement and the need for the development of a framework and metrics for benchmarking. The focus of this chapter is to visualize best practices in maintenance and also to propose means for improvement in the railway sector, with special reference to railway infrastructure.
Figure. The deregulation of the Swedish railways between 1998-01-01 and 2004-07-01: the former SJ and the Rail Traffic Administration were split into client organizations, traffic operators (SJ AB, Green Cargo AB, MTAB, TGOJ, Connex and other new traffic operators) and service companies (EuroMaint AB, TrainTech AB/Interfleet, TrafficCare AB, Jernhusen AB, Unigrid AB, Nordwaggon AB, ASG, Swebus, Sweferry, AB TR), with maintenance carried out either in-house or by contractors such as Svensk Banproduktion and Carillion.
23.2.1 Maintenance
Railway infrastructure is a complex system. Usually such infrastructure is technically divided into substructures, namely bridges, tunnels, the permanent way, turnouts, sleepers, electrical assets (both low and high voltage), signalling systems including systems for traffic control, and telecom systems such as systems for radio communication, telecommunications and detectors. The maintenance of all these subsystems is a complex issue, which makes maintenance tasks difficult to plan and execute. Factors such as geographical and geological features, topography and climatic conditions need to be considered when planning maintenance. Furthermore, the availability of track for maintenance is an important issue to be considered when planning the maintenance tasks to be executed. Previously, maintenance management was based on technical system characteristics instead of asset delivery functions. Maintenance is critical for ensuring safety, train punctuality, overall capacity utilization and lower costs for modern railways.
The deregulation, privatization and outsourcing processes have created new situations, new organizations and new structures for collecting appropriate data from field operations and extracting relevant information, so as to make correct decisions.
23.2.2 Need for Benchmarking in Maintenance
Many of the European railways have followed a similar evolution. Although many
of the countries of Europe are now members of the European Union, questions are
being raised concerning the transparency of the state-controlled railway sector in
order to make comparisons possible and to find the best practices followed within
the railway business. The European railway sector has gradually started to use
benchmarking so that the different actors may be able to learn from each other.
Such organizational measures are useful to service users and provide a clear system
for translating feedback from the analysis into strategy for corrective actions.
23.3.1 The Benchmarking Methodology
Successful benchmarking starts with a deep understanding and good knowledge of one's own organisation's processes; i.e. learning about one's own performance and bringing one's own core business under control before learning from others (Wireman 2004).
The most common approach to benchmarking is to compare one's own performance indicators with those of competitors or other companies in the same area. This can be accomplished using simple questionnaires completed by personnel involved in maintenance activities, with little or no expert help to conduct comprehensive studies, or with help from outside firms providing expertise in the planning, execution and implementation of such processes. Based on what is to be compared, benchmarking can be classified as performance, process or strategic benchmarking (Campbell 1995). Similarly, based on whom one makes the comparison with, benchmarking can be classified as internal, competitive, functional or generic benchmarking (Zairi and Leonard 1994).
The results obtained from benchmarking identify the gap between one's own organisation's performance and that of the organisation following best practices. These results are then used to improve and develop core competencies and core businesses, leading to lower costs, increased profit, better service towards the customers, increased quality, and continuous improvement. In order to gain benefits, an organisation has to mature in its own core competencies, and to ensure success, the ROI (return on investment) should be calculated for each benchmarking exercise (Wireman 1998, 2004).
A broad survey of the literature shows that, even though all the suggested methodologies for benchmarking are similar in their approach, they vary from a general two-step process to a more detailed 10-step process (Varcoe 1996; Ramabadron et al. 1997; Wireman 2004). All these steps can be related to Deming's famous PDCA cycle. Malano (2000) goes a little further and describes Deming's cycle as a circular process which includes the following phases: planning, analysis, integration, action and review. The operational form of these four steps for the purpose of benchmarking may look like the following:
1. Detailed planning of the benchmarking operation, keeping the goal of benchmarking in focus (for example cost reduction or productivity) and identifying suitable partners for benchmarking. This step essentially encompasses an internal audit to learn about the organisation's business indicators etc.
2. Identifying which businesses to visit and collecting the appropriate data.
3. Analysing the data and information collected to identify gaps, and sharing the information.
4. Implementation and continuous improvement.
Most of the literature points out the fact that successful benchmarking needs a
good plan specifying what to benchmark, whom to visit (to study the best practice),
when to visit, and what types of resources are required for analysis and implementation. Often simple studies are completed at little cost and generally have no
follow-up. Good benchmarking, on the other hand, is time- and resource-consuming and has well-structured follow-up plans etc. The selection of the type and scope
of the benchmarking process should be made on the basis of the impact of the
outcome on the critical success factors for the process (Mishra et al. 1998).
A benchmarking exercise is of no value, if the findings are not implemented. In
fact, without implementation it would be a waste of resources. The benefits of
benchmarking do not occur until the findings from the benchmarking project are
realized, and therefore performance improvement through benchmarking needs to
be a continuous process.
23.3.2 Metrics
Metrics for benchmarking can be indicators or KPIs, as discussed in Chapter 19. In order to make the benchmarking process a successful exercise, it is important that the areas, the process enablers and the critical success factors required for good performance are identified, so that the common denominator, or any common structure that is important to compare, can be described by indicators or other types of measurements, often presented as a percentage (%) (Wireman 2004). These performance drivers can be characterized as lead and lag indicators, lead indicators being performance drivers and lag indicators being outcome measures (Åhrén et al. 2005).
Figure. Performance measurement: data on operation, equipment state and maintenance are compared with the benchmarked value.
Wireman (2004) states that the maintenance management impact on the return
on fixed assets (ROFA) can be measured by two indicators, namely:
core business under control, since planned work vs. unplanned work may have a cost ratio as high as 1:5. Another rule of thumb concerns a high level of overtime, which indicates a reactive maintenance process. Since labour is a large cost driver for maintenance, the amount of overtime can have a large impact on maintenance costs. Another large cost driver is spare parts (Wireman 2004; Hägerby 2002).
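The rules of thumb above can be expressed as a simple screening check. The sketch below is illustrative only: the function name, the 10% overtime threshold and the input figures are assumptions, not values from the chapter (only the 1:5 planned-to-unplanned cost ratio comes from the text).

```python
# Screening check based on the rules of thumb above: unplanned work that
# dominates planned work, and a high overtime share, both signal a
# reactive maintenance process. Thresholds and names are illustrative.

def reactivity_flags(planned_cost, unplanned_cost, overtime_h, total_h,
                     overtime_limit=0.10):
    """Return a list of warnings for one maintenance organisation."""
    flags = []
    # Unplanned work may cost up to five times as much as planned work (1:5).
    if planned_cost > 0 and unplanned_cost > planned_cost:
        flags.append("unplanned work dominates (cost penalty may approach 1:5)")
    # A high overtime share indicates fire-fighting rather than planning.
    if total_h > 0 and overtime_h / total_h > overtime_limit:
        flags.append("overtime share %.0f%% exceeds %.0f%% threshold"
                     % (100 * overtime_h / total_h, 100 * overtime_limit))
    return flags

print(reactivity_flags(planned_cost=400_000, unplanned_cost=600_000,
                       overtime_h=300, total_h=2_000))
```

Such a check only flags symptoms; deciding on corrective action still requires the gap analysis described later in the chapter.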
23.4.2 Railway Context
Benchmarking approaches used by industries to improve their performance through comparison with the best in class can equally be used for benchmarking railway operations. But unlike the industrial sector, railway infrastructure consists of a larger number of individual assets, including the substructure, permanent way, signalling, electrical and telecom assets, that extend over a few or hundreds of kilometres. Furthermore, there are large differences between the structures of the different railway organisations. At present, many organisations comprise one single entity, whereas some are divided up into traffic companies and infrastructure owners, with an in-house or outsourced maintenance function. The different types of traffic on the railway tracks have different degradation characteristics and, therefore, it is difficult to compare passenger-intensive lines with heavy haul lines or lines with mixed traffic. Furthermore, the data collected from the different partners selected for benchmarking cannot always be compared without normalisation. It is also important to validate and audit the collected data to find outliers (Oliverson 2000). Some examples of the normalisation required within railway benchmarking are presented in the following.
In a benchmarking project called InfraCost, data have been collected over a number of years to compare the asset life cycle costs of different railways. A complex normalisation process has been used to bring all the information from different countries in Europe, for example maintenance costs, renewal costs, local labour costs, and the intensity and speed of trains, to a common base for comparison (see http://promain.server.de; Zoeteman and Swier 2005).
Another way to normalize data is to identify the cost drivers and try to establish
a link between performance and cost on the one hand, and performance and the age
of the assets on the other. In order to compare the assets, compensation factors
were established on the basis of the network complexity, measured in terms of
(Stalder et al. 2002):
Density of turnouts
Length of lines on bridges and in tunnels
Degree of electrification
Usage according to the average frequency of trains per year
Average gross tonnage per year (freight and passenger)
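A compensation factor of this kind can be sketched as a weighted complexity index by which raw costs are divided before comparison. The reference network, the weights and the linear form below are illustrative assumptions; the actual compensation model of Stalder et al. (2002) is not reproduced in the chapter.

```python
# Sketch of cost normalisation by network complexity, in the spirit of
# the compensation factors listed above (turnout density, bridges and
# tunnels, electrification, train frequency, gross tonnage).
# All reference values and weights are illustrative assumptions.

def complexity_index(metrics, reference, weights):
    """Weighted ratio of a network's cost drivers to a reference network."""
    return sum(w * metrics[k] / reference[k] for k, w in weights.items())

reference = {"turnouts_per_km": 0.5, "bridge_tunnel_share": 0.05,
             "electrified_share": 0.7, "trains_per_year": 20_000,
             "gross_tonnage_mt": 10.0}
weights = {"turnouts_per_km": 0.25, "bridge_tunnel_share": 0.15,
           "electrified_share": 0.15, "trains_per_year": 0.25,
           "gross_tonnage_mt": 0.20}          # weights sum to 1.0

network = {"turnouts_per_km": 0.8, "bridge_tunnel_share": 0.10,
           "electrified_share": 0.9, "trains_per_year": 30_000,
           "gross_tonnage_mt": 14.0}

ci = complexity_index(network, reference, weights)
# Dividing raw cost by the index keeps a more complex network from being
# penalised when costs per km are compared across administrations.
raw_cost_per_km = 120_000                     # SEK, illustrative
print(round(ci, 3), round(raw_cost_per_km / ci))
```

By construction, a network identical to the reference gets an index of 1.0, so its costs pass through unchanged.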
In the project, the cost drivers have been established, but the implementation of
life cycle cost (LCC) strategies for avoiding the difficulties of separating the maintenance cost from the renewal expenditures has not yet been fully realized (Stalder
et al. 2002).
When the International Union of Railways (UIC), in their benchmarking projects between the years 1996 and 2002, compared costs between Europe, the USA and Asia, they found big differences in the costs. In an attempt to understand the differences, Zoeteman and Swier (2005) developed a model that converted the benchmarked results into life cycle cost per km of track, including the maintenance cost, renewal cost and overhead cost both for the organization and the contractors. The major differences lie in purchasing power, wages, turnout density, degree of electrification, the proportion of single track and the intensity of use.
Benchmarking is not yet common practice within the railway sector, and there
is a need to build up a framework and metrics in order to compare and find out the
best practices.
The aim of using benchmarking as a tool to improve prevalent maintenance practices within the railway sector is to establish measures that make it possible to compare the results of one operation with another across railway administrations operating under different circumstances and conditions, and to identify the best practices in the area. Therefore, the benchmarking process has to be evaluated and normalised to fit the railway maintenance process. Accordingly, it is also essential to decide what kind of KPIs (key performance indicators) need to be implemented for improvement.
The case studies presented in Sections 23.623.8 have used three different
approaches concerning methodologies for data collection and classification, normalization and analysis of results. The case studies are:
1. Two neighbouring local track areas sharing a line for railway traffic on each side of the border. The aim was to compare the maintenance costs, identify differences and find areas to improve.
2. Internal benchmarking of maintenance contracts in order to find the best practice and to improve the maintenance contracts.
3. A study of the level of transparency among the European railway administrations.
The common denominator between these case studies is the benchmarking methodology used, which is examined in order to find out whether it is useful within the railway sector. The differences between the case studies lie in the main objectives of the benchmarking.
However, the following information and data relevant to the study could not be collected:
Overhead cost for the contractor (not available, due to the competition between the different contractors)
Man-hours (not available; not collected in the client's system from the invoices)
Traffic volume
Asset age, which was approximately the same (not necessary to collect, since the traffic mix and volume were the same)
Spare part costs (not available)
23.6.1.1 Normalisation
Since the organisation and accounting structure were almost the same, it was
assumed that the missing data could be disregarded. The amount of normalisation
was restricted to adjusting the currency.
23.6.2 Results and Interpretations
The available data and information were then sorted as shown in Table 23.1. The
maintenance costs were grouped into the categories snow removal, corrective
maintenance and preventive maintenance; see Table 23.2.
Table 23.1. Comparing cost per metre of track

Object                                       Track Area A   Track Area B
Total cost                                   795            290
Maintenance cost                             285            280
Track area administration cost (overhead)    220            8
Other external costs, e.g. consultancy       90             2
Charges for electric power                   200            0
Table 23.2. Difference in percentage in maintenance costs between Track Areas A and B

Maintenance activities                                            Difference from Track Area B
Snow removal                                                      +10%
Corrective maintenance, including organisation
for preparedness (emergency service)                              +32%
Preventive maintenance, including inspection                      -62%
The benchmarking result showed that the maintenance cost per track metre was approximately the same in both areas, whereas the total cost differed considerably. One of the findings was that the amount of corrective maintenance was very high in both track areas. A closer investigation showed that Track Area A had a larger amount of corrective maintenance and therefore less money left for preventive maintenance.
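The comparison behind Tables 23.1 and 23.2 is simple arithmetic on costs per metre of track. A minimal sketch using the Table 23.1 figures (the category names are shortened labels, not the systems' own field names):

```python
# Cost comparison per metre of track, using the Table 23.1 figures.
# The percentage-difference form mirrors Table 23.2 (difference relative
# to Track Area B).

area_a = {"maintenance": 285, "overhead": 220, "external": 90, "power": 200}
area_b = {"maintenance": 280, "overhead": 8, "external": 2, "power": 0}

def total(costs):
    """Total cost per metre of track."""
    return sum(costs.values())

def pct_diff_vs_b(a, b):
    """Difference of A relative to B, in percent."""
    return 100.0 * (a - b) / b

print("total A:", total(area_a), " total B:", total(area_b))
print("maintenance, A vs B: %+.1f%%"
      % pct_diff_vs_b(area_a["maintenance"], area_b["maintenance"]))
```

The near-zero maintenance difference and the large overhead gap reproduce the findings discussed above: it is the administration and external costs, not the maintenance work itself, that separate the two areas.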
Furthermore, the overhead cost and other external costs, such as travel costs, costs for consultancy etc., were much higher in Track Area A than in Track Area B. One of the explanations was the geographical isolation of Track Area A from its own administration, resulting in higher travelling costs and the necessity of buying consultancy for some services that Track Area B could obtain from its nearby regional office. Another explanation was that Track Area A had to finance all its buildings, the electrical power and the cost of the traffic control centre, while these were taken care of by a separate organization for Track Area B.
It was also possible to find those areas of work that could be mutually coordinated, for example snow removal. However, this was something that needed to
be negotiated and was therefore considered a political matter.
The implementation phase was the responsibility of the national railway administrations. The results were mainly used as arguments clarifying why the costs
were so much higher for the railway line in Country A compared with those of
other national lines.
maintenance contract by learning from the experience and knowledge of other regional track areas in this respect.
The benchmarking process followed the standard procedure recommended for benchmarking, as stated in an earlier section (Section 23.3). The study covered nine local track areas, named Track Areas A–I, and six of these (D–I) were selected for the study and follow-up qualitative interviews.
23.7.1 Metrics and Data
Before starting the collection of data and other relevant information, the existing indicators and indices used by maintenance professionals, available in the literature and through professional bodies, for example the EFNMS indices (EFNMS 2006), were examined for their suitability for the purpose of benchmarking maintenance practices in different track regions at Banverket. Most of these metrics were not found suitable for the purpose of this study, and therefore actions were initiated to establish indicators that would facilitate this benchmarking process. Furthermore, information and data which were planned to be included in the study, namely details of maintenance-related measures such as maintenance costs, maintenance hours, materials, maintenance vehicle costs, overhead costs etc., were missing or only available in aggregate form, due to the competitive situation.
As the deregulation of the railway transport system in Sweden has led to
competition among the traffic companies, it was not possible to get hold of traffic
data, i.e. how the track was used, because this information is being treated as a
business secret by the train operators.
Data from 2002 were collected from the systems for accounting, the failure
reports, the inspection remarks, and the asset information and from the train delay
reports. The following data were collected:
Asset data from BIS: total length of track, total length of operated track, total number of turnouts, total number of operated turnouts, length of electrification, and number of protected level crossings. An attempt was also made to define the standard of the assets by their age and the type of traffic they had been exposed to; this had to be skipped, as it was not possible to obtain complete data for all the assets and the different track lines. The purpose was to know the intensity of track utilization.
From the accounting system AGRESSO: snow removal and maintenance
costs for one year, defined per maintenance activity corresponding to the
maintenance contract (corrective, predetermined, condition-based etc.) and
cost per asset type (rail, sleeper, turnout etc.).
From BESSY (inspection remark system): the number of inspection remarks, classified as remarks requiring immediate attention or the deployment of corrective measures, or remarks requiring attention or correction in the near future (deferred inspection remarks).
From OFELIA (failure report system): failure reports (including asset type and type of failure, time to fault localization and time to repair, symptoms and causes, place, date and time), and the time to arrive and get established at the fault site.
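Combining these sources amounts to joining per-track-area records and normalising by asset quantities. The sketch below is purely illustrative: the field names and all figures are assumptions, since the chapter does not describe the schemas of BIS, AGRESSO, BESSY or OFELIA.

```python
# Sketch: combining records from the systems named above (asset data from
# BIS, costs from AGRESSO, failure reports from OFELIA) into normalised
# indicators per track area. All names and figures are illustrative.

bis = {"A": {"track_km": 450.0, "turnouts": 120},
       "B": {"track_km": 390.0, "turnouts": 95}}
agresso = {"A": {"maintenance_cost_sek": 53_644_000},
           "B": {"maintenance_cost_sek": 47_534_000}}
ofelia = {"A": {"failures": 1_830}, "B": {"failures": 1_420}}

def indicators(area):
    """Normalise costs and failures by asset quantities for one track area."""
    assets, costs, failures = bis[area], agresso[area], ofelia[area]
    return {
        "cost_per_track_m": costs["maintenance_cost_sek"]
                            / (assets["track_km"] * 1000.0),
        "failures_per_km": failures["failures"] / assets["track_km"],
        "failures_per_turnout": failures["failures"] / assets["turnouts"],
    }

for area in ("A", "B"):
    print(area, {k: round(v, 2) for k, v in indicators(area).items()})
```

Indicators of this per-km and per-asset form are what the comparisons in Figures 23.4–23.6 are built on.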
Figure 23.3. Share of corrective maintenance and preventive maintenance for the nine track areas studied (Espling 2004)
Figure 23.4. Maintenance cost in SEK per square metre of track area (Espling 2004)
Another comparison was made concerning the maintenance cost per metre within the framework of the maintenance contract for each track region under study. Track Areas H, C and G showed the best practice; see Figure 23.5. It was noted that the maintenance cost per asset or per track metre varies greatly among the compared track areas, due to the asset standard, type of wear, climate and type of traffic.
To compare performance, the numbers of functional failures and train delay hours were listed as failures or delay hours per metre or per cost-driving asset; see Figure 23.6. Here too, the best performance was shown by Track Areas G and H.
Figure 23.5. Maintenance cost per metre for Track Areas A–I, broken down into predetermined maintenance, maintenance inspection, failure repair and snow removal
Figure 23.6. Inspection remarks per km, failures (per level crossing, per turnout and per km) and train delay hours (per catenary, per turnout and per track km) for Track Areas A–I
All these results obtained from the comparison of the different track regions, in combination with the work specifications defined in the maintenance contracts, were used for the gap analysis. The gap analysis was conducted with the help of interviews with the track area managers for Track Areas D–I. The best practice criteria were identified with the help of interviews and survey questionnaires. The best practices were:
The best practices identified from the benchmarking study were immediately implemented in the new purchasing procedures and documents. These were used for floating tenders and for new contracts by the infrastructure manager of the local track area initiating this benchmark, and resulted in maintenance contracts at a much lower price with better control of quality and performance. The benchmarking study also identified the best practice for gaining control over backlogs by using SMS and other Internet-based tools. Besides this, the maintenance contract was also provided with information about goals, objectives and expected incentives related to the execution of the maintenance contracts.
(Table: comparison of the studied railway administrations regarding outsourced maintenance, traffic operation and traffic operators; services are either free, included or bought, and the number of traffic operators ranges from few to many.)
23.8.1 Metrics
In this study, many official documents, such as annual reports and regulation letters and documents, were studied in detail in order to gain insight into the types of measures, key performance indicators and indices used by the railway administrations investigated (Åhrén et al. 2005). The collected measures were then compared with those recommended by EFNMS, in order to see if these could be used in future benchmarking exercises. It was rather soon found that the EFNMS indices had been developed for factories and plants and were not suitable for studying or benchmarking the performance of infrastructures, as they did not consider the type of asset, the age of the asset, the asset condition or the practice of outsourcing maintenance work in an open market.
23.8.2 Normalisation
Since data were qualitative in nature, no normalisation was carried out for the purpose of this study.
When comparing the outcomes of the findings only highly aggregated measures
were used for the purpose of analysis, in terms of:
Economy
Punctuality
Safety
Number of staff employed
Track quality
Total traffic volume divided up into passenger and freight kilometres
These can be used as benchmarking measures, the lag indicators showing past performance. This indicates that these areas of interest are important for every studied railway administration. It is also important to note that the identified measures can be defined as outcome measures of the railway maintenance process. It has not been possible to find any measures reflecting the actual maintenance performance. This can probably be explained by the fact that the maintenance activities are carried out by either in-house or external maintenance contractors (Åhrén et al. 2005).
Some of the maintenance performance indicators are used by various organizations and provide railways with an opportunity to benchmark their operations internationally to improve their performance. One of the findings of the studies is that parameters are missing regarding the traffic volume, the infrastructure age and the history of the performed maintenance.
Figure. Number of measures used per category (economy, asset, asset history, labour, material, safety, environment, quality and traffic), divided into the series SDm, SDw and others.
23.9 Discussion
The reason why most plants do not enjoy best practices in maintenance is that they have no picture of how to structure a sustainable improvement process (Oliverson 2000). Benchmarking can then be a tool for waking up organisations and their management in order to find improvement areas that create more value from the business process. However, along the way there are many pitfalls to be aware of, such as starting the process without knowing the starting point and the destination (Oliverson 2000; Wireman 2004). Other pitfalls are:
The methodologies for performing benchmarking for plants are rather well
developed, but need to be adapted for infrastructure. Today it is difficult to
23.10 Conclusion
Stating that the benchmarking of maintenance provides gains with relatively little effort is a truth that needs some qualification. First of all, the theory of maintenance is a rather young science, which has resulted in a lack of a common nomenclature and of an understanding of maintenance in terms of value. This is one of the reasons why it is difficult to define what is included in maintenance and where to put the boundaries for renewal. Different structures may also be in use to describe what operation is and what maintenance is, and for grouping maintenance into preventive and corrective maintenance. Outsourcing maintenance has become popular in recent years, and this makes it difficult to obtain all the necessary measurements, especially if the outsourcing is carried out under a performance contract (lump sum, fixed price). The assets' complexity and condition are also difficult to compare and measure.
The multitude of entities involved in the railway systems after their restructuring has made it considerably more difficult to locate the organization responsible for the problems encountered and to ascertain the course of action to be taken to rectify them.
Benchmarking is of no use if its results are not implemented. The benefits from benchmarking do not occur until the findings from the benchmarking project are implemented and systematically followed up and analyzed against the set targets and goals.
The results from the three benchmarking studies presented show that benchmarking is a powerful tool and that its methodology can be used by other industries. Since the focus of these case studies is the benchmarking process and not the continuous improvement process, it is important to point out the need for empowered enablers, who will be responsible for identifying the problem, finding a solution to the problem, and implementing the solution and the continuous improvement processes. The case studies also show that some further improvement needs to be made in order to run the whole benchmarking process, including the implementation, in an integrated manner.
By giving details of the status of the assets (age and degree of wear), the total traffic volume per year and the available time on track for infrastructure maintenance. This information should be incorporated as a correction factor in the analysis.
By well-structured economic feedback reports on maintenance activities. These should be implemented so that it is possible to differentiate between the resources consumed by corrective maintenance activities and those consumed by preventive maintenance activities. The structure of the economic feedback reports on maintenance should be designed so that it is possible to differentiate between operation and corrective and preventive maintenance.
By separating specially targeted maintenance investments from normal maintenance activities; efforts to enhance punctuality in special campaign form are an example of the former.
23.12 Acknowledgements
The authors are grateful to Banverket (the Swedish Rail Administration) for
sponsoring this research work and providing information and statistics through free
access to their database.
Appendix
Table A.1. Failure and delay statistics from Track Areas A–I for the year 2003
Track
area
Train
delay
h/track km
Train
delay
h/turnout
A
B
C
D
E
F
G
H
I
1.07
0.88
0.73
0.57
0.93
0.97
0.35
0.32
1.18
0.25
0.33
0.21
0.29
0.76
0.36
0.14
0.31
0.84
Train delay
h/catenaries
km
0.15
0.61
0.45
0.1
0.25
0.41
0.05
0.14
0.14
4.2
3.7
2.5
3.6
4.7
3.8
2.8
2.0
6.5
3.5
2.9
1.68
2.24
4.59
2.22
1.28
2.24
6.1
2.5
1.9
1.5
1.3
1.5
1.0
1.3
1.1
1.9
4.7
3.1
4.2
2.7
1.4
0.9
3.5
3.0
3.2
Table A.2. Cost of various maintenance activities in thousands of SEK for each track area for the year 2003

Track area   Snow removal   Corrective maintenance   Preventive maintenance   Contract sum
A            15,325         24,189                   14,130                   53,644
B            16,801         17,792                   12,941                   47,534
C            12,908         28,728                   10,863                   52,553
D            22,085         46,772                   20,537                   89,394
E            18,074         44,168                   21,532                   83,774
F            8,250          39,181                   15,991                   63,442
G            4,336          22,050                   26,388                   52,774
H            3,041          22,854                   19,131                   45,026
I            4,976          46,414                   31,803                   83,193
Normalisation is necessary due to the investment of extra money, for one year only, to enhance the preparedness for dealing with failures causing train delays. The figures in Table A.2 are the figures before normalisation.
Table A.3. Costs in thousands of SEK for corrective maintenance due to failure reports from OFELIA for the year 2003
Track
area
A
B
C
D
E
F
G
H
I
Maintenance
organisation
(personnel,
machines,
spare parts)
Emergency
organisation
Actual cost
2880a
4416a
3732a
7,989
6,145
4,128
11,448
16,078
14,095
3512a
7,785
20,274
4701
4776
4884
Fixed price
(lump sum)
6304
Total cost
(t SEK)
10,869
10,861
7,860
16,150
20,854
18,897
12,686
11,444
28,246
SEK/ failure
5933
5273
4690
5379
5073
5530
5838
6065
145
Table A.4. Cost statistics for corrective maintenance triggered by the failure reporting system OFELIA (in thousands of SEK) after normalisation
Track
area
A
B
C
D
E
F
G
H
I
Maintenance
organisation
Emergency
organisation
2156
1472
4701
4776
4884
3512
Actual cost
Fixed price
(lump sum)
7,989
6,145
4,128
11,448
16,078
14,095
7,785
20,274
6 304
Total cost
(t SEK)
7,989
8,601
5,600
16,150
20,854
18,897
12,686
11,367
28,246
SEK/ failure
1832
2060
1676
3002
4111
3417
2173
1887
5490
Table A.5. Reported corrective maintenance caused by inspection remarks classifying faults as requiring immediate repair; also including activities such as inspection and condition-based and predetermined maintenance that should have been booked under other codes in the accounting system (before normalisation of the data)
Track
area
A
B
C
D
E
F
G
H
I
Inspection
remarks
calling for
immediate
repair
Mixes of
inspection
remarks calling
for immediate
repair and
CBM Remarks
13,320
6,931
12,355
16,361
10,864
9,963
Inspection cost
including
inspection
remarks calling
for immediate
repair
Operational
actions
due to predetermined
maintenance
Care of
electrical assets
due to predetermined
maintenance
1485
Conditionbased
maintenance
7081
7614
1962
3194
3558
4732
4289
3091
1486
2756
11,107
18,169
168
303
Total cost
13,320
6,931
20,921
30,638
19,044
20,383
9,346
11,410
18,168
Table A.6. Reported corrective maintenance caused by inspection remarks classifying faults as requiring immediate repair; also including activities such as inspection and condition-based and predetermined maintenance that should have been booked under other codes in the accounting system (after normalisation)
Track Inspection remarks
area calling for immediate
repair
A
B
C
D
E
F
G
H
I
13,320
6,931
12,355
16,361
10,864
9,963
11,107
18,169
Inspection remarks calling for immediate repair, booked under inspection
995
1904
491
799
Corrective
maintenance booked
as inspection in the
accounting system
1506
1553
8
916
13,320
6,931
13,350
19,771
12,908
10,770
9,346
11,410
19,084
Table A.7. Condition-based maintenance bought as extra orders in thousands of SEK, but including the so-called special maintenance activity

[Columns: track area (A–I) and new sum; the row alignment of the figures was lost in extraction.]
24 Integrated e-Operations–e-Maintenance: Applications in North Sea Offshore Assets
Jayantha P. Liyanage
24.1 Introduction
There is clear growth of interest today in the development and use of e-maintenance concepts for industrial facilities. This is particularly seen in the offshore oil and gas (O&G) production environment in the North Sea, in relation to a major re-engineering process termed integrated operations (IO) that began in 2004–2005 as a new development scenario for the offshore industry (OLF 2003). Major challenges to conventional operations and maintenance (O&M) practice have been seen as unavoidable under this new IO initiative. Subsequently, the industry began to develop serious interest in novel and smart solutions for O&M. The developments began in 2005, seeking long-term changes to conventional O&M practice. The change process was relatively slow during the 2005–2006 period, but by now seems to have gathered gradual and steady pace. This is a large-scale change, and hence the current plan is to realize fully functional e-operations–e-maintenance status by the years 2012–2015 or so. Even though the integrated e-operations and e-maintenance applications in the North Sea are still at their inception, the learning process and the state of current knowledge can be very valuable for similar efforts in the development and implementation of novel solutions in other industries and/or regions of the world.
Current developments in Norway exemplify that the growing smart use of advanced information and communication technology (ICT) solutions is a principal driving factor in the development and implementation of novel solutions to realize e-maintenance (Liyanage and Langeland 2007; Liyanage et al. 2006). In principle, the approach seeks to establish better offshore-onshore connectivity and interactivity, enhancing decisions and work processes. The emerging O&M practice will be based on a smart blend of application technologies, novel managerial solutions, new organizational forms, etc. to enable 24/7 online real-time operating modes. The new set of O&M solutions for North Sea offshore assets is not simply about the use of some form of core technologies for electronic data acquisition and so on, but about a large-scale re-engineering process dedicated to making a significant change to
the conventional O&M practice based on a solid technical platform. It is noteworthy that, even though the changes within O&M are by and large technology-dependent, their managerial implications are inevitable, and managerial changes have to be properly blended into the technology-based change. Such an integrated change is very critical in terms of the technical and safety integrity of assets, and the subsequent commercial impact in terms of production, plant economics, and safety and environmental performance.
Ongoing developments in Norway provide a good example of how an industry-wide re-engineering process has triggered major changes in O&M practice, leading the path towards integrated e-operations–e-maintenance. It implies that the integrated e-operations–e-maintenance initiatives in Norway are not a standalone, short-term technical change limited to O&M, but an integral part of a wider, long-term development process that combines various technical disciplines and different sectors of the industry in search of an optimum long-term solution. In this context, two salient features define the future of e-based O&M practice in the Norwegian O&G industry:
- Integration with other technical disciplines that have major roles in the realization of fully functional 24/7 online real-time operational status
- The important technological and managerial changes that the e-approach has to incorporate to ensure fail-safe status
Owing to the growing interest in and importance of e-operations–e-maintenance, learning from different application scenarios in various industries has timely significance. This chapter shares current experience and knowledge with reference to ongoing developments in the Norwegian O&G industry. The chapter highlights current offshore asset maintenance practice, the changing technical and economic environment that leads the path towards an e-approach, the development and implementation of integrated e-operations and e-maintenance solutions in the North Sea, key features of the e-approach in North Sea assets, and future challenges to becoming fully integrated and fail-safe. The specific acronyms and their application definitions are given in Section 24.2, and Section 24.3 contains some recent reflections on the work on e-maintenance. Section 24.4 covers a brief introduction to offshore asset maintenance, describing current thinking, practice, and visible trends. The technical and economic environment that guides the shift towards e-operations and e-maintenance is discussed in Section 24.5, which illustrates some of the major drivers that demand technological and managerial integration in the search for comprehensive solutions for offshore assets in the North Sea. The section that follows (Section 24.6) highlights issues related to the development and implementation of integrated e-operations and e-maintenance solutions on the Norwegian continental shelf. The major features of the e-approach for operations and maintenance are highlighted in Section 24.7, which pays specific attention to diagnostic and prognostic technologies and the emerging infrastructure (i.e. the ICT network and onshore centers) for their implementation and use. Since the emerging environment represents a step change towards a more complex operational setting, there are numerous challenges in realizing reliable, fully integrated status and remaining fail-safe. Section 24.8 briefly covers these issues and highlights the critical role and specific features of intelligent watchdog agent
technology in this context. This section also underlines some of the important non-technical issues that play pivotal roles in terms of being fully integrated and fail-safe.
Acronym   Application definition
B2B       Business-to-business
CBM       Condition-based maintenance
CMMS      Computerized maintenance management system
CV        Confidence value
D2A       Decisions-to-actions
D2D       Data-to-decisions
ERP       Enterprise resource planning
ICT       Information and communication technology
IO        Integrated operations
IT        Information technology
LAN       Local area network
NCS       Norwegian continental shelf
NOK       Norwegian crowns (Norwegian currency)
NPD       Norwegian Petroleum Directorate
OLF       Norwegian Oil Industry Association
OOC       Onshore operational center
OSC       Onshore support center
O&G       Oil and gas
O&M       Operations and maintenance
PDA       Personal digital assistant
PM        Preventive maintenance
PSA       Petroleum Safety Authority
RUL       Remaining useful life
R&D       Research and development
SAP       A commercially available business ERP system
SOIL      Secure Oil Information Link
WAN       Wide area network
The O&M process plays a major role on production platforms (so-called topsides) and rigs. In fact, production and injection wells require some maintenance as well, but this is a highly specialized technical area; this chapter mainly covers O&M aspects of the topside. O&M, inclusive of testing and inspection, is an important discipline in terms of the technical condition and the mechanical integrity of an O&G asset. Necessary functional and technical conditions are achieved through a blend of O&M strategies, programs, and technologies. A diversity of O&M strategies and management practices may be necessary during the life of an asset, which in general is in operation for 20–30 years of commercial production. The challenges to plant O&M can be quite dramatic, particularly at the beginning and end of the production life cycle, i.e. in the startup phase and in the tail-end production phase (when production begins to decline gradually). During various stages of the life cycle, demands for maintenance can also vary, for instance due to design flaws, varying operational conditions (pressure, temperature, etc.), ageing equipment, outdated O&M procedures, modifications, and so forth. The fact that a good number of production platforms on the Norwegian shelf are at the moment in the tail-end and maturity phases of production poses a significant challenge and demands novel solutions to improve maintenance practices.
Obviously, there is a common cause for performing O&M activities in various O&G production assets, i.e. commercial, or statutory and regulatory. However, there can be differences among the O&M programs and practices performed by various producers. Such variations can exist, for instance, due to the age of installations and equipment, the scale of production operations, the level of technological complexity, competence availability, budgeted operating costs, etc.
Preventive maintenance (PM) tasks account for a large portion of the maintenance work performed on offshore installations. Such PM programs can be based on industry practices, third-party recommendations, or reliability analysis. PM programs are built into running maintenance plans and thus are executed as calendar-based or periodic maintenance tasks. The planning process can, for instance, be done on a 3-month or 7-week basis, and can be frozen weekly for execution offshore. One of the major concerns with current maintenance practice is the consequence of PM on equipment in offshore plants exceeding what is actually required. Excessive PM has significant commercial implications in terms of production interruptions, although it does ensure compliance with strict regulatory requirements, particularly for safety-critical equipment. Lately, there seems to be a general preference for the use of condition monitoring techniques and risk-based methods. While methods such as risk-based inspection are already available, technology experts believe that the application of CBM techniques together with risk computation can be of great benefit, as it can greatly facilitate need-based maintenance. This implies that experts can precisely identify at which point in time certain maintenance tasks have to be performed, based on risk-conscious decisions. This is expected to bring substantial commercial benefits by prolonging maintenance intervals and thus reducing production interruptions. However, CBM techniques have conventionally not been widely applied other than on an ad hoc basis or on special rotating machinery such as turbines. It has so far been a challenge to make effective use of condition monitoring in the production facilities on the Norwegian shelf. Some applications are in use, such as vibration monitoring on heavy rotating equipment, thermography on electrical equipment, and oil analysis. However, many producers have been struggling to capitalize on the inherent potential of CBM technologies for quite some time. The underlying bottlenecks are largely related to the physical distance between offshore assets and the onshore support organization, the availability of expertise at the site at a moment's notice, and the reluctance of some producers to initiate a quick response solely based on CBM experts' opinions, since they conventionally rely much on the equipment manufacturers' recommendations and guidelines.
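The need-based decision logic that condition monitoring enables can be sketched minimally as a threshold rule on a monitored vibration level. The alarm levels, signal values, and function names below are hypothetical illustrations, not figures from the chapter; real limits are machine-specific and set by the relevant standards and manufacturers:

```python
import math

# Hypothetical alarm levels for vibration RMS (mm/s); real thresholds
# are machine- and standard-specific.
ALERT_LEVEL = 4.5    # schedule a maintenance task at the next planning window
TRIP_LEVEL = 7.1     # condition calls for immediate corrective action

def rms(samples):
    """Root-mean-square of a vibration signal segment."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def cbm_decision(samples):
    """Map a monitored vibration level onto a need-based maintenance action."""
    level = rms(samples)
    if level >= TRIP_LEVEL:
        return "immediate corrective action"
    if level >= ALERT_LEVEL:
        return "plan maintenance task (prolong no further)"
    return "no action: continue monitoring"

# A healthy signal stays below the alert level; a degraded one crosses it.
healthy = [3.0 * math.sin(0.1 * n) for n in range(1000)]
degraded = [8.0 * math.sin(0.1 * n) for n in range(1000)]
```

The point of the sketch is that maintenance is triggered by the measured condition rather than by the calendar, which is what allows intervals to be prolonged safely.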
O&M organizations, on the other hand, gradually appear to be becoming more team-based. Recent downsizing moves, an ageing workforce, and ongoing efforts to integrate maintenance and operational crews have contributed much to this trend. The way in which such teams are formed and the way they carry out work may vary from one situation to another. Work teams can be dedicated to individual plants (i.e. dedicated work teams), and certain teams may also be involved in campaign maintenance (i.e. fly-out maintenance) tasks. Campaign activities imply that while dedicated work teams carry out asset-specific tasks, there are teams (called campaign teams) with specific technical expertise (e.g. for turbine maintenance) who carry out certain specific PM tasks in addition to the dedicated maintenance personnel. They fly across platforms attending pre-assigned tasks in accordance with maintenance programs registered in the system. Administratively, while campaign teams may be responsible to the maintenance manager, function-
political change processes. The trend of deviating from conventional wisdom and practices has become more and more clear, as organizations seek to adopt creative, innovative, and smart solutions to manage complex systems for commercial advantage (During et al. 2004; Hosni and Khalil 2004; Russell and Taylor 2006). With the growth of business uncertainties, the enterprise risk profile has become more complex, demanding more flexible, collaborative, and open strategies to support various operational activities in industrial plants and facilities. The emerging commercial environment has by and large already indicated a greater reliance on new technological and managerial solutions to manage important asset processes such as O&M, establishing a new landscape for commercial activities. This seems to be a generic trend among almost all commercial business sectors, though to varying degrees, where the dependence on advanced technological solutions to manage complex technical systems is rapidly growing. The resulting environment will obviously be very dynamic, enabling key stakeholders of complex technical systems to remain intact within an extended live network (Wang et al. 2006).
The production, manufacturing, and process industries are directly impacted by the new demands and the wave of subsequent changes. Technologically complex and high-risk businesses in particular cannot afford to divert their management strategies for complex assets away from the mainstream technology-driven change. Today, different industrial sectors are adopting various novel and integrated solutions to manage their industrial assets and internal processes in order to realize major commercial benefits. More often than not, rapid advancement in information and communication technology (ICT) has been very catalytic to the progress in technology applications (e.g. diagnostic technologies) and data management solutions, particularly for complex systems such as offshore oil and gas (O&G) production platforms.
O&G activities on the NCS began in the early 1970s with the discovery of the giant Ekofisk field. Ever since, the NCS has been a major supplier of oil to the world energy market. Today, after more than 30 years of continuous production, the NCS has stepped up to its peak level. Despite the fact that the NCS foresees a gradual decline after 2010 or so, the remaining potential is known to be substantial. But the future is known to hold a unique set of challenges, with a major need to enhance recovery efficiency so that the commercial lives of major production assets can be extended by another 40–50 years. By 2003–2004, the forthcoming challenges to O&G exploration and production activities in the North Sea had become very obvious. A major part of the industry became relatively more inclined to resort to advanced application technologies to address the underlying commercial risks. At the same time, the industry has been undergoing some other challenges widely acknowledged as serious impediments to future growth on the NCS. For instance, the industry has been experiencing major setbacks in attracting talent and in centralizing core competencies. The problem has been further aggravated by an ageing workforce with no suitable remedy to close competency gaps. Industry restructuring has been seen by the majority as a feasible way to provide tighter integration and partnerships with the knowledge industry. Table 24.2 illustrates the complex set of economic and technical drivers that challenged conventional practices in the North Sea O&G production environment.
Table 24.2. Technical and economic factors that contributed to a step-change in North Sea asset management practice, also introducing changes to conventional O&M practices

Risk and uncertainty profile: the risk and uncertainty profile is seen as too large to ignore, owing to maturing assets, declining production, rising lifting costs, discovery of marginal fields, declining investment in developments, lower recovery efficiency, etc.

Commercial incentives: the underlying commercial incentives of a major change have been very convincing, mainly including substantial enhancement in production recovery of at least 10% or more, significant reduction in operating costs, and major safety and environmental benefits.

Business conditions
O&M drew the attention of the industry slightly later than two other technical disciplines, but is widely acknowledged today as a technical process with substantial improvement potential. In fact, some signs of development within O&M began to appear in the 2005–2006 period. Nevertheless, it has been known for some time that the conventional O&M process has large limitations, and some well-established O&M policies have significantly hindered cost-effective and efficient solutions. The integrated e-operations–e-maintenance concept for North Sea assets brought forward a long-term development path for the O&M process from 2005 onwards, with substantial opportunities to:
Figure 24.1. The e-approach to O&M in North Sea assets relies in principle on four aspects: advanced technologies, digital infrastructure, B2B collaborative partnerships, and active operational networks
[Figure: an offshore asset linked to a central data hub through a fiber-optic network, IP-VPN/ADSL-based access, and wireless networks and radio links, combining experience data, offline-online technical data, intelligent systems and components, and direct visualization]
The figure highlights that the functional landscape for the establishment of an e-based O&M setting in the North Sea is a relatively complex combination of various technical as well as social elements. The synergy among at least three elements is critical in the development of the necessary technical infrastructure, i.e.:
Figure 24.3. Integrated e-operations–e-maintenance brings key solutions, through integrated O&M work processes and collaborative management solutions, to enhance data-to-decision (D2D) and decision-to-action (D2A) processes
The targeted benefits of these developments within O&M, together with those in other technical disciplines, are expected over a 30–40 year time span. The key value-creation elements identified include, for example, methods and techniques to reduce uncertainty in data interpretation, reduced cycle time on decisions, better planning and work coordination procedures, and reduced offshore operating costs through offshore-onshore work re-organization and prolonged maintenance intervals. The overall commercial benefits expected include an approximately 10% increase in production, a 30–40% reduction in operating costs, and significant improvements in health and safety performance.
24.7 Key Features of the e-Approach for O&M in North Sea Assets
As aforementioned, integrated e-operations–e-maintenance is not just an effort to introduce new technologies. It in fact represents a change in the use of technical tools, advanced methods, and joint expertise to make O&M processes more effective and efficient, introducing a novel scenario for managing the process outside convention. However, the successful implementation and use of the e-approach depends heavily on the synergy between remote diagnostic and prognostic technology, onshore expert centers directly connected to offshore collaborative rooms, and net-based, web-enabled ICT solutions (Figure 24.4).
Figure 24.4. The solid foundation of the e-approach to O&M demands a synergy between three main components, including offshore-onshore expert centers, that together establish a complex and interactive technical system
- Data acquisition techniques have developed to an extent that experts can tap signals on critical equipment in real time at onshore support centers (OSCs)
- Online communication capability has allowed joint interpretation and trend analysis, for instance coupling to the asset operator's OSC and comparing against set alarm levels
- Expert centers have acquired the technological capability to secure connections to several offshore assets in a way that those assets can be served simultaneously if necessary
The use of advanced networking technologies, as opposed to offline technologies, is in fact a landmark of integrated O&M solutions for North Sea assets. It has brought some unique capabilities for sharing expertise. With the rapid uptake of portable communication technology, offshore personnel can also communicate effectively with OSCs, allowing more sensible use of data acquisition technologies. The current setting has given a new dimension to the diagnostic and prognostic efforts for North Sea assets today.
The OSC at SKF-Norway, for instance, is a CBM expert center with remote diagnostic and prognostic capabilities that serves various operators in the Norwegian and Danish O&G sectors. Over the past few years it has carried out online remote vibration monitoring of critical machinery on offshore production platforms
Figure 24.5. The landscape of onshore support centers (OSCs), with built-in collaborative and decision-support technologies (3D technologies and simulations, conferencing, and real-time monitoring landscapes), forms the active nodes of the integrated e-operations–e-maintenance environment on the NCS (courtesy: ConocoPhillips, Norway)
for expansions of substantial scale that can lead to a completely different technological setting and operating mode by the year 2010 or so. The ongoing developments would at some stage be coupled with other technologies, for instance scenario simulations of technical faults and failures using 3D technologies, intelligent watchdog agents for condition prognostics, virtual tools to train O&M crews, etc.
24.7.3 The ICT Network: Secure Oil Information Link (SOIL)
Today, advanced ICT solutions are often at the heart of the principal commercial activities of almost all industrial sectors (Chang et al. 2004; van Oostendrep et al. 2005; Mezgaar 2006). Current developments on the Norwegian shelf have also resorted to such solutions as the basis for inducing the change. Current ICT solutions are a technical blend ranging from more centralized LANs, primarily localized within organizational boundaries, to large-scale WAN solutions that open up transaction routes for complex business-to-business (B2B) traffic. In fact, the need for such robust integrated solutions in the North Sea O&G industry has largely been growing over the last 2–3 years, demanding more common platforms, for instance to manage complex O&M and other plant data. The large-scale ICT network established in the North Sea is called the Secure Oil Information Link (SOIL).
SOIL was introduced to the Norwegian E&P industry in 1998, as a result of growing demands for integrated data management and B2B communication solutions. SOIL consists of a number of application services actively connecting almost all the business sectors of the Norwegian O&G industry. This network helps establish connectivity and interactivity between different parties, for instance offshore O&M teams, operators' onshore O&M support groups, third-party CBM experts, logistics contractors, etc., through the use of fiber-optic cables and wireless communications. Real-time equipment data can be acquired and jointly analyzed, and results can be exchanged online between these parties, enhancing the ability for shared interpretation and decision-making. In this context, there are two major functional features of SOIL (see also Figure 24.6):
- A highly reliable information and knowledge-sharing network to coordinate and manage O&M activities in North Sea offshore assets remotely, regardless of geographical location
- Many-to-many simultaneous authorized connectivity, breaking with the conventional one-to-one solution and enhancing collaboration between experts, third-party services, the asset operator, and the offshore crew
The conventional one-to-one setting only enabled connectivity between two distinct parties, for example between an inspection engineer of a contractor and a maintenance planner of an asset owner. However, with the web-enabled networking solutions available today, a number of distinct groups can stay connected and interact simultaneously (i.e. many-to-many connectivity). This capability has major effects on improvements to the D2D and D2A processes of O&M in terms of time, cost, and quality.
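The contrast between one-to-one links and many-to-many connectivity can be sketched as a toy publish-subscribe channel. The party names and message format below are invented for illustration; SOIL's actual services are far more elaborate and are not specified in this chapter:

```python
class SharedChannel:
    """Toy many-to-many channel: any joined party can publish a reading,
    and every other joined party receives it at once, in contrast to a
    conventional one-to-one link between exactly two parties."""

    def __init__(self):
        self.subscribers = {}  # party name -> inbox of (sender, message) pairs

    def join(self, party):
        self.subscribers[party] = []

    def publish(self, sender, message):
        # Deliver to all parties except the sender itself.
        for party, inbox in self.subscribers.items():
            if party != sender:
                inbox.append((sender, message))

channel = SharedChannel()
for party in ["offshore crew", "onshore support", "CBM expert"]:
    channel.join(party)

# One published vibration reading reaches all other parties simultaneously.
channel.publish("offshore crew", {"tag": "pump-A", "rms": 5.2})
```

A one-to-one setting would require a separate exchange per pair of parties; the shared channel is what lets an offshore reading be jointly interpreted by the operator's support group and a third-party expert at the same time.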
Figure 24.6. SOIL's application solutions provide many-to-many connectivity and interactivity on a 24/7 online real-time basis to enhance the D2D and D2A performance of O&M
Regardless of the notable achievements during the last 1–2 years, the challenges for further development of the O&M process are numerous. In purely technological terms, the smart and cost-effective use of CBM technologies in particular still remains a significant challenge. In fact, there is no argument about the benefits of CBM in terms of being fully integrated and fail-safe. The demand by far is for more sensitive use of diagnostic and prognostic technologies as a principal means to improve, and to be in control of, the technical and safety integrity of assets. The demand in the current O&M setting is for advanced technical platforms that, for instance, combine unique signal-processing, risk-analysis, and decision-making features. In fact, the demand is for such technologies where failure
- Signal-processing technology with a series of toolboxes for signal processing and system performance evaluation, to track the health of a system/machine and provide diagnostic and prognostic information in order to achieve the goal of near-zero-downtime performance
- Application software solutions to optimally interpret monitored data signals regarding the execution of a maintenance action and to estimate remaining useful life (RUL)
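As a rough illustration of what an RUL-estimating software solution does, the sketch below fits a linear trend to a declining health indicator and extrapolates it to a failure threshold. The threshold, time units, and data are hypothetical, and real prognostic tools use far richer degradation models:

```python
def fit_line(ts, ys):
    """Least-squares slope and intercept for y = a*t + b."""
    n = len(ts)
    mt = sum(ts) / n
    my = sum(ys) / n
    a = sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / sum((t - mt) ** 2 for t in ts)
    return a, my - a * mt

def remaining_useful_life(ts, cvs, threshold=0.3):
    """Time from the last observation until the fitted health-indicator
    trend crosses the failure threshold (hypothetical value 0.3)."""
    a, b = fit_line(ts, cvs)
    if a >= 0:
        return float("inf")  # no degradation trend detected
    t_fail = (threshold - b) / a
    return max(0.0, t_fail - ts[-1])

# A health indicator degrading from 1.0 by 0.05 per time unit
# crosses the 0.3 threshold at t = 14, i.e. 10 units after t = 4.
times = [0, 1, 2, 3, 4]
cvs = [1.0, 0.95, 0.90, 0.85, 0.80]
```

An estimate like this is what allows a maintenance task to be scheduled just before the predicted crossing rather than at a fixed calendar interval.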
The requirement on the Norwegian shelf today is a CBM technology that is not limited to data acquisition but also integrates advanced signal-processing and decision-making capabilities, making it a more attractive and commercially viable solution. In a series of recent R&D efforts, the Center for Intelligent Maintenance Systems (IMS) at the University of Wisconsin-Milwaukee and the CBM Lab at the University of Toronto have developed such an integrated O&M optimization platform to provide asset owners and operators with an advanced tool for signal processing and maintenance decision-making (see Jardine et al. 1997; Banjevic et al. 2001). Figure 24.7 shows the multi-sensor performance assessment framework of this technology.
This watchdog agent constitutes a toolbox with modules for signal processing, feature extraction, degradation assessment, and performance evaluation, embedded in a common software application. It includes signal-processing and feature-extraction tools built on Fourier analysis, time-frequency distributions, wavelet packet analysis, and ARMA time-series models. The performance-evaluation component uses tools such as fuzzy logic, match matrix, neural networks, and other advanced algorithms. Functionally, the watchdog agent is in principle used to extract features from a series of signals under a given condition and compare them with a template model built from signals under a pre-identified normal condition. The performance evaluation yields a confidence value (CV), which indicates the health status of the system and is used as the basis for diagnostics and prognostics under the given circumstances. If the data can be directly associated with some failure mode, then the most recent performance signatures, obtained through the signal-processing and feature-extraction modules, can also be matched against signatures extracted from faulty-behavior data for proper decisions.
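The feature-to-CV idea can be illustrated with a deliberately simplified sketch: extract a couple of expert-style features (RMS, peak-to-peak value), compare them with a template built from a normal-condition signal, and map the distance onto a confidence value. This is only a schematic of the concept, with an invented distance-to-CV mapping; it is not the Watchdog Agent's actual algorithms:

```python
import math

def features(signal):
    """Expert-style features of the kind mentioned above: RMS and peak-to-peak."""
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    p2p = max(signal) - min(signal)
    return [rms, p2p]

def confidence_value(signal, template, scale=1.0):
    """CV in (0, 1]: 1.0 means the feature vector matches the normal-condition
    template exactly; the CV decays as the distance from the template grows.
    The exponential mapping and scale are illustrative choices."""
    dist = math.dist(features(signal), template)
    return math.exp(-dist / scale)

# Template built from a signal recorded under a known normal condition.
normal = [math.sin(0.2 * n) for n in range(500)]
template = features(normal)

healthy_cv = confidence_value(normal, template)        # matches template: CV = 1.0
faulty = [3.0 * math.sin(0.2 * n) for n in range(500)]  # amplified vibration
faulty_cv = confidence_value(faulty, template)          # larger distance: lower CV
```

In the real toolbox the features would come from Fourier, time-frequency, wavelet, or ARMA analysis, and the evaluation from fuzzy logic or neural networks, but the principle is the same: a low CV flags deviation from the learned normal condition.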
Figure 24.7. The potential for further enhancement in the use of advanced CBM technologies such as intelligent watchdog agents is very evident for North Sea assets (courtesy: CBM Lab, University of Toronto, Canada). Feature extraction draws on time-frequency/wavelet moments and PCA, wavelet frequency bands, AR model roots, and expert-extracted features (intensity, peak-to-peak value, RMS)
There is much to do to make sure that the new integrated e-operations–e-maintenance setting is fully functional and fail-safe. Perhaps the greater concern is that the marvel of the success brought by ad hoc technological solutions may easily lead to a miscalculation of the underlying risks of process re-engineering tasks. With this
realization, a major portion of the industry has begun to adapt along a more cautious, synchronized, and incremental development path. Initiatives by authorities (e.g. NPD, PSA) and by socio-political sources (e.g. OLF) are critical to establish a more harmonized setting that ensures the necessary levels of safety and security. Even though a systematic strategy may prolong the integration plan, the argument is that such a systematic move will yield a substantial long-term payback, whereas a rapid solution would eventually expose major stakeholders to unforeseen events requiring ad hoc solutions or quick fixes that would be too costly to bear.
24.9 Conclusion
Commencing in 2003–2004, the Norwegian O&G industry launched a dedicated program to overcome obvious commercial risks on the NCS. Termed the third efficiency leap, it has directly supported the implementation of integrated e-operations/e-maintenance solutions for offshore assets in the North Sea.
This new practice has greatly challenged the conventional practices of many disciplines, particularly O&M, demanding technological as well as managerial change. The new O&M practice places major emphasis on the more active exploitation of application technologies and of new data and knowledge management techniques. The change process has also begun to re-engineer the industry infrastructure to actively integrate the O&M expertise of O&G producers with that of the external knowledge-based industry. The large-scale ICT network called the Secure Oil Information Link and the onshore support centers mainly facilitate the rapid development of the O&M process. The new setting has already brought major commercial benefits by streamlining D2D and D2A processes with substantial improvements in work processes. However, some critical challenges still remain to be addressed, and the socio-political organizations and authorities are keen on ensuring fully functional and fail-safe operations. The demand and the interest are to complete the rest of the journey through more cautious and systematic strategies, sustaining commercial benefits beyond the year 2050 without exposing the industry to unwanted or hidden risks that would be too costly to bear.
24.10 References
Arnaiz, A., Arana, R., Maurtua, I., et al., (2005), Maintenance: future technologies, Proceedings of the IMS (Intelligent Manufacturing System) International Forum, IMS Forum 2004, Como, Italy, May 17–19, pp. 300–307.
Bangemann, T., Rebeuf, X., Reboul, D., et al., (2006), PROTEUS – creating distributed maintenance systems through an integration platform, Computers in Industry, 57(6), pp. 539–551.
Banjevic, D., Jardine, A.K.S., Makis, V. and Ennis, M., (2001), A control-limit policy and software for condition-based maintenance optimization, INFOR, 39, pp. 32–50.
Bonissone, G., (1995), Soft computing applications in equipment maintenance and service, ISIE '95, Proceedings of the IEEE International Symposium, 2, pp. 10–14.
Booher, H.R. (ed.) (2003). Handbook of human systems integration, Wiley-Interscience.
Chande, A., Tokekar, R., (1998), Expert-based maintenance: a study of its effectiveness, IEEE Transactions on Reliability, 47, pp. 53–58.
Chang, Y.S., Makatsoris, H.C., Richards, H.D., (2004), Evolution of supply chain management: symbiosis of adaptive value networks and ICT, Boston: Kluwer Academic Publishers.
Djurdjanovic, D., Ni, J., Lee, J., (2002), Time-frequency based sensor fusion in the assessment and monitoring of machine performance degradation, Proceedings of the 2002 ASME International Mechanical Engineering Congress and Exposition, paper number IMECE 2002-32032.
Djurdjanovic, D., Lee, J., Ni, J., (2003), Watchdog agent – an infotronics-based prognostics approach for product performance degradation assessment and prediction, special issue on intelligent maintenance systems, Engineering Informatics Journal, 17(3–4), pp. 107–189.
During, W., Oakey, R., et al. (ed.) (2004). New technology-based firms in the new millennium. Elsevier.
Ellingssen, H.P., Liyanage, J.P., Rus, R., (2006), Smart integrated operations and maintenance solutions to manage offshore assets in North Sea, Proceedings of the 18th EuroMaintenance, MM Support GmbH, pp. 319–324.
Emmanouilidis, C., MacIntyre, J., Cox, C., (1998), An integrated, soft computing approach for machine condition diagnosis, Proceedings of the Sixth European Congress on Intelligent Techniques & Soft Computing (EUFIT'98), vol. 2, Aachen, Germany, pp. 1221–1225.
Emmanouilidis, C., Jantunen, E., MacIntyre, J., (2006), Flexible software for condition monitoring, Computers in Industry, 57(6), pp. 516–527.
García, M.C., Sanz-Bobi, M.A., (2002), Dynamic scheduling of industrial maintenance using genetic algorithms, Proceedings of EuroMaintenance 2002, Helsinki, Finland.
García, M.C., Sanz-Bobi, M.A., Pico, J., (2006), SIMAP: Intelligent system for predictive maintenance: Application to the health condition monitoring of a wind-turbine gearbox, Computers in Industry, 57(6), pp. 552–568.
Han, T., Yang, B.S., (2006), Development of an e-maintenance system integrating advanced techniques, Computers in Industry, 57(6), pp. 569–580.
Hansen, R., Hall, D., Kurtz, S., (1994), New approach to the challenge of machinery prognostics, Proceedings of the International Gas Turbine and Aeroengine Congress and Exposition, American Society of Mechanical Engineers, pp. 1–8.
Health and Safety Executive (HSE). (1997). Human and organizational factors in offshore safety. HSE, UK.
Hosni, Y.A., Khalil, T.M. (ed.) (2004). Management of technology. Elsevier.
Iung, B., (2003), From remote maintenance to MAS-based e-maintenance of an industrial process, International Journal of Intelligent Manufacturing, 14(1), pp. 59–82.
Jardine, A.K.S., Banjevic, D., Makis, V., (1997), Optimal replacement policy and the structure of software for condition-based maintenance, Journal of Quality in Maintenance Engineering, 3, pp. 109–119.
Jardine, A.K.S., Makis, V., Banjevic, D., et al., (1998), Decision optimization model for condition-based maintenance, Journal of Quality in Maintenance Engineering, 4(2), pp. 115–121.
Jardine, A.K.S., Lin, D., Banjevic, D., (2006), A review on machinery diagnostics and prognostics implementing condition based maintenance, Mechanical Systems and Signal Processing, 20(7), pp. 1483–1510.
Jantunen, E., Jokinen, H., Milne, R., (1996), Flexible expert system for automated on-line diagnostics of tool condition, Integrated Monitoring & Diagnostics & Failure Prevention, Technology Showcase, 50th MFPT, Mobile, Alabama.
Khatib, A.R., Dong, Z., Qiu, B., et al., (2000), Thoughts on future Internet based power system information network architecture, in: Proceedings of the 2000 Power Engineering Society Summer Meeting, vol. 1, Seattle, USA.
Koc, M., Lee, J., (2001), A system framework for next-generation e-maintenance system, Proceedings of the Second International Symposium on Environmentally Conscious Design and Inverse Manufacturing, Tokyo, Japan.
Lee, J., (1996), Measurement of machine performance degradation using a neural network model, Computers in Industry, 30, pp. 193–209.
Lee, J., (2004), Infotronics based intelligent maintenance system and its impacts to closed loop product life cycle systems, Proceedings of the IMS2004 International Conference on Intelligent Maintenance Systems, Arles, France.
Liao, H.T., Lin, D.M., Qiu, H., et al., (2005), A predictive tool for remaining useful life estimation of rotating machinery components, ASME International 20th Biennial Conference on Mechanical Vibration and Noise, Long Beach, CA.
Liyanage, J.P., (2003), Operations and maintenance performance in oil and gas production assets: Theoretical architecture and capital value theory in perspective, PhD Thesis, Norwegian University of Science and Technology (NTNU), Norway.
Liyanage, J.P., Herbert, M., Harestad, J., (2006), Smart integrated e-operations for high-risk and technologically complex assets: Operational networks and collaborative partnerships in the digital environment, in Wang, Y.C., et al., (ed.), Supply chain management: Issues in the new era of collaboration and competition, Idea Group, USA, pp. 387–414.
Liyanage, J.P., Langeland, T., (2007), Smart assets through digital capabilities, in Mehdi Khosrow-Pour (ed.), Encyclopaedia of Information Science and Technology, Idea Group, USA.
Liang, E., Rodriguez, R., Husseiny, A., (1988), Prognostics/diagnostics of mechanical equipment by neural network, Neural Networks, 1(1), p. 33.
Marseguerra, M., Zio, E., Podofilini, L., (2002), Condition-based optimisation by means of genetic algorithms and Monte Carlo simulation, Reliability Engineering and System Safety, 77, pp. 151–166.
Mezgar, I., (2006), Integration of ICT in smart organizations, Hershey, PA: Idea Group Publishing.
Moore, W.J., Starr, A.G., (2006), An intelligent maintenance system for continuous cost-based prioritisation of maintenance activities, Computers in Industry, 57(6), pp. 595–606.
OLF (Oljeindustriens landsforening / Norwegian Oil Industry Association), (2003). eDrift for norsk sokkel: det tredje effektiviseringsspranget (eOperations on the Norwegian continental shelf: The third efficiency leap), OLF (www.olf.no). (in Norwegian)
Palluat, N., Racoceanu, D., Zerhouni, N., (2006), A neuro-fuzzy monitoring system: Application to flexible production systems, Computers in Industry, 57(6), pp. 528–538.
Perrow, C. (1999). Normal accidents: Living with high-risk technologies, Princeton University Press.
Roemer, M., Kacprzynski, G., Orsagh, R., (2001), Assessment of data and knowledge fusion strategies for prognostics and health management, IEEE Aerospace Conference Proceedings, vol. 6, pp. 62979–62988.
Russell, R.S., Taylor, B.W., (2006), Operations management: Quality and competitiveness in a global environment, Hoboken, N.J.: Wiley.
Sanz-Bobi, M.A., Toribio, M.A.D., (1999), Diagnosis of electrical motors using artificial neural networks, IEEE International Symposium on Diagnostics for Electrical Machines, Power Electronics and Drives (SDEMPED), Gijón, Spain, pp. 369–374.
Sanz-Bobi, M.A., Palacios, R., Munoz, A., et al., (2002), ISPMAT: Intelligent system for predictive maintenance applied to trains, Proceedings of EuroMaintenance 2002, Helsinki, Finland.
Swanson, L., (2001), Linking maintenance strategies to performances, International Journal of Production Economics, 70, pp. 237–244.
van Oostendorp, H., Breure, L., Dillon, A., (2005), Creation, use, and deployment of digital information, Mahwah, N.J.: Lawrence Erlbaum Associates.
Wang, W., (2002), A stochastic control model for on-line condition based maintenance decision support, Proceedings of the Sixth World Multiconference on Systemics, Cybernetics and Informatics, Part 6, vol. 6, pp. 370–374.
Wang, W.Y.C., Heng, M.S.H., Chau, P.Y.K., (2006), Supply chain management: Issues in the new era of collaboration and competition, Idea Group Publishing.
Yager, R., Zadeh, L., (1992), An introduction to fuzzy logic applications in intelligent systems, Kluwer Academic Publishers.
Yang, B.S., Lim, D.S., Lee, C.M., (2000), Development of a case-based reasoning system for abnormal vibration diagnosis of rotating machinery, Proceedings of the International Symposium on Machine Condition Monitoring and Diagnosis, Japan, pp. 42–48.
Yen, G.G., (2003), Online multiple-model-based fault diagnosis and accommodation, IEEE Transactions on Industrial Electronics, 50(2).
Yu, R., Iung, B., Panetto, H., (2003), A multi-agent based e-maintenance system with case-based reasoning decision support, Engineering Applications of Artificial Intelligence, 16, pp. 321–333.
25
Fault Detection and Identification for Longwall
Machinery Using SCADA Data
Daniel R. Bongers and Hal Gurgenci
25.1 Introduction
Despite the most refined maintenance strategies, equipment failures do occur. The degree to which an industrial process or system is affected by these depends on the severity of the faults/failures, the time required to identify the faults and the time required to rectify them. Real-time fault detection and identification (FDI) offers maintenance personnel the ability to minimise, and potentially eliminate, one or more of these factors, thereby facilitating greater equipment utilisation and increased system availability.
This case study describes, in some detail, the application of data-driven fault
detection to an underground mining operation. However specific this application
may be, the concept can be employed on any system of machines, with or without
complex machine-machine or machine-environment interactions, or to individual
plant.
In addition to detailing the implementation of an FDI system in real-time, we
propose a semi-autonomous approach to dealing with inaccurate and incomplete
records of equipment malfunction. Since past equipment performance is often the
principal information source for maintenance planning and evaluation, it is of
utmost importance that this information be as accurate as possible. The method
described allows for varying levels of confidence in the record keeping.
Section 25.2 introduces the longwall mining system, the most common form of underground coal mining at present. The availability of longwall equipment systems is low compared to surface systems of similar complexity. The present approaches towards reducing equipment downtime in longwall mining are summarised in Section 25.3. Common FDI approaches are summarised in Section 25.4. Two data-driven techniques are used in this study, namely artificial neural networks and multivariate statistics. The availability of quality training data is of critical importance for either one. This issue is addressed in Sections 25.5–25.7.
Once the training data set is constructed, the application of the selected FDI techniques is reasonably straightforward. The application and the results are summarised in Section 25.8, with concluding remarks in Section 25.9.
Availability = [ Operating time / (Operating time + Maintenance delays) ] × 100%
This KPI looks at the ability to have the machines operate for the time that they
are planned to operate. It is simply the percentage of the available planned time
that they do actually operate. The maintenance delays refer to scheduled and
breakdown maintenance. Some sites include only the breakdown maintenance in
this statistic, which leads to an inflated value of the equipment availability. Such
confusion in terms makes it difficult to benchmark practices between sites. Typical
values for this KPI average between 40% and 60%.
Mean time between failure (MTBF) = Operating time / Number of stoppages
This KPI looks at the ability to sustain the operation of machines over periods
of time. It is a measure of how long, on average, before machines stop due to a
maintenance problem. Typical values average around 1 h.
Mean time to repair (MTTR) = Total maintenance delay time / Number of stoppages
This KPI looks at the ability to diagnose and remedy maintenance delays once
they have occurred. It is a measure of how long, on average, before machines that
have faulted are returned to operation. Typical values average around 20 min.
KPIs are typically reviewed on a weekly basis.
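The three KPIs above can be computed directly from a delay log. The sketch below uses invented figures for a single 8-h shift (the chapter gives only typical ranges, not a worked example):

```python
# Hypothetical figures for a single 8-h shift (480 min); the delay values
# are invented, not taken from the chapter.
planned_time = 480                     # minutes the longwall was planned to run
delays = [25, 40, 10, 20, 15]          # scheduled + breakdown maintenance, min

maintenance_time = sum(delays)         # 110 min of maintenance delays
operating_time = planned_time - maintenance_time   # 370 min of operation

# The three KPIs defined in the text
availability = operating_time / (operating_time + maintenance_time) * 100
mtbf = operating_time / len(delays)    # average run time between stoppages
mttr = maintenance_time / len(delays)  # average time to restore operation

print(round(availability, 1), mtbf, mttr)   # 77.1 74.0 22.0
```

Note that counting only breakdown delays in `delays`, as some sites do, would inflate the availability figure, which is exactly the benchmarking problem the text describes.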
25.2.2 Longwall Monitoring
Equipment monitoring is playing an increasingly significant role in the modern
longwall. By monitoring is meant both the determination of the overall state of
plant as well as measurement of individual properties; for example, the running
temperature of a gearbox. This section discusses all forms of condition monitoring
undertaken in Australian longwalls.
Condition monitoring of longwall equipment takes two main forms: on-line and off-line. On-line monitoring includes all sensor measurements recorded and transmitted to the surface using a PLC-driven SCADA network. All other measurement and monitoring, including regular maintenance inspections, is classed as off-line.
or stop the occurrence of the fault. Typically, machine design is the responsibility
of the manufacturer, and is a very time consuming and costly operation.
25.3.1.3 Redundancy
The concept of redundancy is widely used at the component level to make machines more reliable. The principle is rather simple: machine components such as relays, hydraulic valves or electronic capacitors are deliberately duplicated in the design in such a way that if one fails, another may take on its role. It may be possible in some situations to extend this concept of redundancy to entire machines, keeping complete units on standby in case one should fail.
Although this method in no way improves the inherent reliability of the individual plant, it effectively makes the plant more available. The use of this sort of redundancy is therefore suitable in those situations where the time to commission replacement machinery is relatively small in comparison to the fault repair time (or maintenance time).
Other factors that must be taken into account when considering the use of
redundant machinery are cost and storage. Many industries use machinery that is
far too expensive to purchase spares, or is too large to economically store.
25.3.1.4 Diagnosis and Repair Time
When a fault occurs causing a machine to cease operation, the process time lost is that required to diagnose and resolve/repair the fault. If this time can be reduced, machinery is made more available. This may be achieved by improved operator training, or by specialization of operator tasks.
Another way to reduce the time required to determine the nature of the fault is to employ a diagnostic system that uses sensor and/or operational information to detect and isolate the fault. Such a system would typically produce an instantaneous diagnosis; however, it would not help to reduce the time required to repair the fault once determined.
The maintainability of the system is improved by reducing the diagnosis and repair time. A reliable FDI tool assists this by providing the operating and maintenance staff with a clear indicator of the failing component and sometimes of the mode of failure.
25.3.1.5 Failure Prediction
Similar to weather forecasting, knowledge of an imminent failure of a piece of machinery would allow machine operators to cease or change operation to minimize, and potentially avoid, any subsequent downtime. Such a predictive system would rely on a continual stream of quality data.
There is an inherent risk in such a concept: false or misleading predictions could cause additional downtime and quickly erode operator confidence. For a predictive system to be effective it must produce not only a prediction of an imminent fault, but also a sensible reason for the prediction, so that operators may make an informed decision whether to act or to ignore the suggestion.
The inference to be drawn from the apparent difference between the model and system outputs, referred to as the residual, often uses simple statistical limits. Assumptions enforced for model validity, including the random distribution of sensor noise, allow chi-squared confidence limits, for example, to be determined for each element of the residual vector. Expert knowledge is then employed to establish which faults will be evident in each of these elements.
In contrast to model-based approaches, where a priori knowledge of the system is required, process history based or data-driven methods require only the availability of a large amount of historical process data. These techniques attempt to capture the relationship between system measurements and system behaviour, with the goal of detecting and identifying fault-affected behaviour from future measurements.
By definition, a data-driven approach to fault detection and isolation is one in which the decision criteria are based primarily or wholly on example data. Essentially, a sufficiently large example dataset representative of each fault of interest is used to generate an algorithm that maps a single observation input to a single fault classification output. As new or unseen observations of the system are presented, they are classified using these mappings, which allows both the detection and isolation of faults.
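As a minimal illustration of this observation-to-classification mapping (not the authors' method), a nearest-centroid rule trained on a toy labelled dataset behaves as follows; all data and state names are invented:

```python
import numpy as np

# Toy example dataset: each row is an observation vector, each label the
# identified system state. Data and state names are invented.
X = np.array([[1.0, 0.1], [1.1, 0.1], [3.0, 0.9], [3.2, 1.0],
              [0.2, 2.0], [0.3, 2.1]])
y = ["normal", "normal", "overload", "overload",
     "sensor fault", "sensor fault"]

# "Training": summarise each class by its centroid, a deliberately minimal
# stand-in for the statistical and neural mappings discussed in the text.
centroids = {c: X[[i for i, lab in enumerate(y) if lab == c]].mean(axis=0)
             for c in set(y)}

def classify(obs):
    """Map a single observation to a single fault classification by nearest
    centroid; detection and isolation happen in the same step."""
    return min(centroids,
               key=lambda c: float(np.linalg.norm(obs - centroids[c])))

state = classify(np.array([3.1, 0.95]))  # an unseen observation
```

Here a drift away from the "normal" centroid is simultaneously a detection (the state is no longer normal) and an isolation (the nearest fault class names the cause).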
Data-driven methods are typically applied to systems for which the development of accurate state-space or other dynamical equations is not possible or practical. Difficulty in the determination of accurate dynamical equations is common in engineering problems for one or more of the following reasons:
1. A lack of understanding of the dynamic interactions between system components
2. Unpredictable environmental conditions which significantly affect the system
3. Non-linearities within the system for which suitable approximations have not been determined
Whether applied to fault detection or other classification problems, data-driven methods are often referred to as black box solutions, because little or no understanding of the particular system is required for their implementation. Although they are often based on strict optimality criteria, little interpretation can be drawn from the resulting equations that map the input data to a system classification. Regardless, these methods have proven to be useful tools for the detection and isolation of faults in a wide variety of engineering problems.
25.4.2 Data Driven Techniques for FDI
Numerous journal and conference papers have been published describing the application of data-driven techniques to fault detection problems. Their popularity is largely due to the fact that the established algorithms, namely principal components analysis (PCA), partial least squares (PLS), linear discriminant analysis, fuzzy logic discriminant analysis and neural networks, are simple and fast to apply with little system knowledge. Venkatasubramanian et al. (2003) provide a comprehensive review of process history based methods applied to FDI, referencing over 140 such papers.
This section provides just a handful of brief descriptions of data driven FDI
applications, for the sole purpose of illustrating the methods by which the example
data classifications are typically determined.
McKay et al. (1996) described the use of an artificial neural network, or ANN
(see Section 25.4.4) to determine the acceptability of a polymer coating used to
coat copper wire. It was determined that the viscosity of the polymer as it exited
the extrusion process (during manufacture) was the most reliable indicator of
quality, short of destructive testing. A neural network was employed to estimate
this viscosity based on sensor measurements on the extrusion equipment and data
from an attached rheometer.
Network training data was developed over a period of time whereby laboratory
experiments were performed to accurately determine the viscosity of a number of
extruded polymer samples. This form of training data is manually generated, and
relies on a number of supervised sets of measurements.
Also described in McKay et al. (1996) is the use of a neural network integrated
as part of a model based predictive control scheme. In this case, a detailed model
of the process of mixing air and fuel in a combustion engine was developed, and
the model interrogated with a number of initial condition scenarios to generate a
predicted set of measurements. This set of conditions/artificial measurements
formed the training dataset for the neural network.
Chow (2000) describes the use of an ANN to detect and isolate simple faults in
a DC motor. In contrast to the two prior examples, the training process involved
expert diagnosis to classify faults/failures as they occurred. With each occurrence,
the network weights were updated. To expedite the process, faults were induced by
damaging components or changing the resistance of internal components.
The supervised approach to generating example data is typical of data-driven
FDI examples in the open literature. Such research focuses on new detection and
isolation regimes, and assumes that training data is both available and accurate.
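As one concrete instance of the algorithm families named above (a generic sketch, not one of the reviewed applications), normal-operation data can be used to fit principal components, and a Hotelling T² statistic then flags observations that break the learned correlation structure; the synthetic data and threshold-free comparison below are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal-operation history: two channels that are strongly correlated
# under healthy behaviour (illustrative synthetic data)
t = rng.normal(size=(200, 1))
X = np.hstack([t + 0.05 * rng.normal(size=(200, 1)),
               2 * t + 0.05 * rng.normal(size=(200, 1))])

# Fit PCA on the mean-centred normal data
mu = X.mean(axis=0)
_, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
var = s ** 2 / (len(X) - 1)   # variance captured by each component

def t2(obs, k=2):
    """Hotelling T^2 of one observation in the k leading components;
    large values flag a departure from the learned normal behaviour."""
    scores = (obs - mu) @ Vt[:k].T
    return float(np.sum(scores ** 2 / var[:k]))

normal_obs = np.array([1.0, 2.0])    # respects the learned correlation
faulty_obs = np.array([1.0, -2.0])   # breaks it, so T^2 is far larger
assert t2(faulty_obs) > t2(normal_obs)
```

In practice a chi-squared or F-distribution limit on T² would supply the detection threshold, in the same spirit as the residual limits discussed for model-based FDI.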
25.4.3 Training Data Set
All data-driven FDI systems need to be trained on known data before they are applied to unknown data. The availability of quality training or example data is an essential requirement, whether one uses statistical FDI or artificial neural networks. Example data are a sufficiently large dataset with the state of the system identified for each observation. The identification process maps every observation to a discrete state. Below is an augmented matrix illustrating the form in which such a training set with associated classifications, Y, would be assembled.
        | y11  y12  ...  y1p  C1 |
    Y = | y21  y22  ...  y2p  C2 |
        | ...  ...       ...  .. |
        | yn1  yn2  ...  ynp  Cn |
The last column in the above matrix holds the state descriptors assigned to each observation vector (each row). Based on the assumption that the classifications accurately and discretely describe the state of the system, various algorithms may be applied to generate rules (or equations) that map a single observation vector input to a single classification output. Once generated from the training set, these rules can be used to classify new observations of the system. As the state of the system changes from normal operation to a state indicative of the presence of a particular fault, this may be recognized, the fault being both detected and isolated (identified).
Various data-driven techniques for FDI were discussed in the previous section. The most common of these are multivariate statistical analysis (linear and non-linear) and artificial neural networks. Both approaches have proven to be valuable data-driven tools for the classification of multivariate observations.
The performance of an FDI system generated from example data is a function
of both the observability of each fault within the monitored variables and the
quality of the example data collected. Since these techniques are typically applied
where mathematical modeling is not feasible, a rigorous study of the observability
of each fault in observation space is not possible. The successful detection of faults
implies observability, but failure to detect certain faults does not imply non-observability. Observable faults will not be detected if the FDI function is not
sensitive to the specific changes exhibited by a fault, or if the training data set is
not of good quality.
It is paramount that one endeavours to apply a complete, unbiased and representative training dataset in order to achieve a robust and accurate fault detection
and isolation system.
25.4.4 Neural Networks for FDI
Inspired by the way the biological nervous system processes information, artificial
neural networks (ANNs) are a mathematical paradigm, composed of a large
number of interconnected elements operating in parallel. The function of the network, influenced by a number of factors including its architecture, is however
largely determined by the connections between elements. Analogous to the ability
of the biological system to learn by example, particular functions can be developed
by adjusting the value of these connections, which are known as weights.
Essentially, neural networks are adjusted, or trained, so that a particular input
produces a specific target output. Based on a comparison of the output and the
target, network parameters are adjusted in an iterative process until the output
adequately matches the target. This process is known as supervised learning, which
typically involves a large number of input/target pairs.
During training, each output is set to be a binary indicator for each data classification. Unlike linear discriminant FDI, however, the output of the network on unseen data is not open to interpretation as the likelihood that the observation belongs to a particular class.
Figure 25.2 shows the mathematical workings of the most basic neural network
element, often termed a neuron. Each element of the vector input x is multiplied by
a weight. These products are summed, together with the neuron bias b, to form the
net input, n. This net input is then applied to a transfer function to produce the neuron output, z. The projection of the neuron element can be viewed as a discriminant function g(x) given by

    g(x) = z = f( x1w1 + x2w2 + ... + xpwp + b )
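The neuron computation above can be sketched in a few lines; the tanh transfer function and the numerical values are arbitrary choices for illustration:

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    """Single neuron: the net input n is the weighted sum of the inputs
    plus the bias b; the output z is the transfer function f applied to n,
    i.e. g(x) = z = f(sum_i x_i * w_i + b)."""
    n = np.dot(x, w) + b
    return f(n)

x = np.array([0.5, -1.0, 2.0])   # observation vector (invented values)
w = np.array([0.8, 0.2, -0.5])   # weights, adjusted during training
b = 0.1                          # neuron bias
z = neuron(x, w, b)              # net input n = -0.7, so z = tanh(-0.7)
```

Training adjusts w and b iteratively until z adequately matches the target for each input, as described in the supervised learning paragraph above.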
the assumption that the state of the longwall system can be classified into a finite
number of categories. The only record of the activity of the longwall is the maintenance log, which details all unscheduled downtime at the longwall face.
Table 25.1 is an excerpt from the maintenance log corresponding to the
condition monitoring data discussed earlier. It records the time that the delay began
and the duration of downtime experienced. The plant responsible is also recorded,
as well as a description of the delay cause.
Figure 25.4 illustrates the inaccuracy of the maintenance records. It shows
traces of motor currents and the shearer position, which are centred on a time
corresponding to a documented delay. In this case, the maintenance records show
that a delay began at observation 9059, and that the longwall was inactive for 50
observations (25 min).
Table 25.1. Excerpt from the maintenance log

Date       DST    Dur.  Major delay        Minor delay         Detail delay    Remark
06-May-01  21:35        Support services   Pumps
06-May-01  21:45  25    Support services   Power supply
06-May-01  23:20  40    Support services   Power supply
06-May-01  0:30   80    Support services   Pumps
06-May-01  2:00   10    Maingate drive     Drive assembly
06-May-01  3:40         Maingate drive     Drive assembly
06-May-01  4:45   104   Panel              Supplies
07-May-01  6:30   20    Labour             Travel              Panel prepr.
07-May-01  6:50   30    Maingate drive     Drive assembly
07-May-01  8:25   20    Shearer
07-May-01  12:40  10    Shearer
07-May-01  13:15  10    Shearer            Electrical Control  Display Screen
07-May-01  13:58  45    Shearer
07-May-01  14:58        Maingate drive     Drive assembly
07-May-01  15:08        Maingate drive     Drive assembly
07-May-01  15:15  10    Mining conditions  Fall/clean up
07-May-01  16:05        Maingate drive     Drive assembly
Where possible, the challenges are approached in a generic manner. This illustrates the applicability of this research to a large number of engineering problems where system modeling is highly complex and discrete states of the system are not immediately apparent.
All faults considered lead to a complete longwall shutdown. That is, one or more parameters (examples include gearbox temperatures, AFC chain tension and earth leakage current) measure outside preset safety limits, causing all major longwall machinery to shut down. As such, all longwall stoppages represent candidates for each documented maintenance event. This section describes the process by which the start time and duration of all longwall stoppages were determined, as well as the selection criteria for candidates for each maintenance event of interest.
Step 2 is required since the removal of sparse observations (a procedure carried out during data preprocessing) disturbed the sequential nature of observations in the data matrix Y.
Step 3: Beginning with the first entry in L, determine which successive observation numbers have a difference greater than 2. Place the latter observation number of each such pair in a new list L2.
Step 4: Using lists L and L2, create a two-column matrix S which lists the start time and duration of each stoppage.
The following steps are repeated for each maintenance event of interest:
Step 5: Using the maintenance log (stored electronically), determine the observation numbers spanned by the appropriate shift.
Step 6: Determine which stoppages listed in S have a start time within this shift. The stoppages listed in S that have a start time within the same shift as a particular delay are the candidate stoppages for that delay.
25.6.1.2 Results
When this procedure was applied to data representing five months of longwall operations, 2452 stoppages were identified. The average duration of each stoppage is 69 observations, or 34.5 min (the sampling rate is two observations per min). On average, five candidates were selected for each maintenance event of interest using the procedure described. As further testimony to the inaccuracy of the maintenance log, analysis showed that two particular shifts had fewer longwall stoppages than the number of catastrophic maintenance events documented for each shift.
25.6.2 Event Candidate Cost Function
A number of electronically-recorded stoppages in longwall production have been
identified as those associated with documented maintenance events. Furthermore,
for reasons discussed in the previous section, only a handful of these are considered candidates for each occurrence of a fault of interest. This section attempts
to discriminate further between candidates.
A two-stage process was adopted. In the first stage, a number of stoppages in
longwall production were identified as candidates for each documented delay of
interest. On average, five candidates were selected for each event.
In the second stage (presented in this and subsequent sections), these likely
candidates were compared against each other to identify the best match to the event
described in the production delay history. It was important that this was done in a
generic manner, i.e. with no consideration given to the nature of each specific
failure. Each step in the development of the training dataset had to be universally
applicable, allowing the generation of FDI systems for a variety of applications.
The research question that must now be posed is: what information is available
that can be used to determine which candidate corresponds to the documented
downtime?
To answer this, we look to the maintenance log. The only information available
is the difference between the delay start time and duration of each candidate and
those of the documented event. We define DST as the difference between the
delay start time of a candidate and the documented delay start time. DD is
similarly defined as the difference between the duration of each candidate stoppage
and that of the documented delay. Each candidate will have associated values of
DST and DD, and these will initially be used to determine which candidate
corresponds to the documented downtime.
The discriminating metric is simply a weighted sum of the available discriminatory information, in this case DST and DD. Commonly referred to as a cost function, it provides a crude way of determining which stoppage relates to the documented maintenance event. The basic form of the cost function is

Cost = |DST| + |DD|
Table 25.2 shows the maintenance log from a single shift we observed. Table
25.3 is our record of the events as they occurred at the longwall face. Clearly, there
are discrepancies in both the DST and DD. Analysis of these errors shows the
average discrepancy to be 8 min and 31 min for DD and DST respectively.
In line with the previous arguments, and given the limited comparative data, it is assumed that, on average, |DST| is four times larger than |DD|. The cost function for initial candidate selection is therefore

Cost = |DST| + 4|DD|
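The candidate ranking described above might be sketched as follows (hypothetical function and argument names; times and durations assumed to be in minutes):

```python
def best_candidate(candidates, logged_start, logged_duration):
    """Rank candidate stoppages against a documented delay using
    Cost = |DST| + 4|DD|, and return the cheapest one.

    candidates      : list of (start_min, duration_min) tuples.
    logged_start    : documented delay start time, in minutes.
    logged_duration : documented delay duration, in minutes.
    """
    def cost(c):
        dst = abs(c[0] - logged_start)    # delay start time discrepancy
        dd = abs(c[1] - logged_duration)  # duration discrepancy
        return dst + 4 * dd
    return min(candidates, key=cost)
```

The weight of 4 on |DD| compensates for |DST| being, on average, four times larger, so that neither term dominates the comparison.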
Table 25.2 (documented maintenance log):

TIME    DELAY   CLASS   DETAILS
15:10   120     M       Shearer out of hydraulic oil, pressure switch faulted
17:40   45      M       Hydraulics: change stabilizer cylinder valve
19:15   20      M       Replace LW shearer picks
19:45   25      M       AFC chain overtension

Table 25.3 (observed delay start times): 14:23, 17:32, 19:37, 20:17
Similar results were seen for the majority of maintenance events, with the exception of seven. Candidate selection was not possible for these because either:
- no candidates were projected within the 95% confidence interval as defined by the T2-statistic, or
- more than one candidate was projected within this confidence interval
This particular figure shows a distinct change from what is apparently normal
operation. The four observations prior to shutdown are clearly outside the 95%
confidence limit, which suggests that these represent operation with the fault
present.
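The T2 value of a single observation can be sketched as follows (a pure-Python sketch with hypothetical names; in practice the mean and covariance come from normal-operation training data, and the 95% limit from the associated F-distribution):

```python
def hotelling_t2(x, mean, cov_inv):
    """T^2 = (x - mean)^T S^{-1} (x - mean) for one observation.

    x, mean : observation and normal-operation mean (equal-length lists).
    cov_inv : inverse of the normal-operation covariance matrix,
              given as a list of rows.
    """
    d = [xi - mi for xi, mi in zip(x, mean)]
    n = len(d)
    return sum(d[i] * cov_inv[i][j] * d[j]
               for i in range(n) for j in range(n))
```

Observations whose T2 value exceeds the 95% confidence limit are flagged as abnormal.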
Figure 25.7 shows the T2 values in the vicinity of an AFC maingate blockage
fault. Once again, significant abnormal activity is observed prior to shutdown.
3. Fault x present: current measurements are commensurate with those associated with fault-affected operation, which suggests that a shutdown of the longwall as a result of an x-type fault is imminent
4. Longwall shutdown: current measurements suggest that the longwall is in full shutdown; that is, none of the major face equipment is operational
25.7.3 Compilation of Training Set
To this point, a large number of observations have been classified as representing normal operation, longwall shutdown, fault-affected behaviour, or other (non-normal) operation. This section describes the engineering judgment that must be employed in the assembly of the training set that will be used for FDI function development. Such judgment is required to obtain an unbiased training dataset that best characterizes the various states of the longwall system.
25.7.3.1 Removing Bias from Training Set
An unbiased training set is one that contains an equal number of observations for each class to be discriminated. Research has shown that an unequal quantity of each class of data typically results in a discrimination function that is biased towards classifying new observations as the more frequently presented class. As such, some fault-affected observations were duplicated to ensure a large training set with an equal proportion of each data class.
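This duplication is a form of random oversampling, which might be sketched as follows (hypothetical names, and random duplication assumed; the chapter does not specify how observations were chosen for duplication):

```python
import random

def balance_classes(observations, labels, seed=0):
    """Duplicate randomly chosen observations of under-represented
    classes until every class has as many observations as the largest.

    Returns a list of (observation, label) pairs with equal class counts.
    """
    rng = random.Random(seed)
    by_class = {}
    for obs, lab in zip(observations, labels):
        by_class.setdefault(lab, []).append(obs)
    target = max(len(group) for group in by_class.values())
    out = []
    for lab, group in by_class.items():
        extras = [rng.choice(group) for _ in range(target - len(group))]
        out.extend((obs, lab) for obs in group + extras)
    return out
```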
25.7.3.2 Transitional Observations
Ultimately the goal of the training data development is to have a set of observations and associated classifications that, when applied to an FDI development
algorithm, can produce an FDI function with the greatest distinguishing power.
Erroneous classifications in the training set may alter the decision criteria in a way that either:
- reduces the mutual exclusivity of the decision space, or
- incorrectly loosens or tightens the decision criteria for one or more classes
Observations immediately preceding IMMINENT or FAULT PRESENT classifications that were classified as NORMAL OPERATION may represent operation with a fault present. As mentioned previously, the dynamics of the longwall are slow, requiring time for the presence of a fault to become evident in the sensor measurements. They may also correspond to observations taken when the severity of the impending fault was low. These so-called normal observations are therefore considered transitional, and may not adequately characterize fault-free operation. In order to remain consistent with the stated goal of training data development, these transitional observations were removed from the training dataset.
recall(i) = |output(i) ∩ correct(i)| / |correct(i)|

where output(i) refers to the set of all observations that the system classifies as fault type i, and correct(i) is the set of all observations in the input set that are actually in fault class i. The recall is then the fraction of observations of fault class i that the system correctly identifies. It is of course possible that correct(i) = 0 (when the system is presented with an input set for
precision(i) = |output(i) ∩ correct(i)| / |output(i)|
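The two measures can be sketched as follows (hypothetical names; labels are per-observation class assignments):

```python
def recall_precision(predicted, actual, cls):
    """Per-class recall and precision, following the set definitions above.

    predicted, actual : equal-length sequences of class labels.
    Returns (recall, precision); a measure is None when its
    denominator (correct(i) or output(i)) is empty.
    """
    output = {k for k, p in enumerate(predicted) if p == cls}
    correct = {k for k, a in enumerate(actual) if a == cls}
    hits = len(output & correct)
    recall = hits / len(correct) if correct else None
    precision = hits / len(output) if output else None
    return recall, precision
```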
       Recall   Precision
14     0.929    0.813
21     0.952    0.833
16     0.934    0.789
       0.750    0.857
The results presented in this section show the successful detection and isolation
of faults using both the linear discriminant algorithm and the two-layer neural
network. The improvements in FDI performance offered by the NN suggest that
there exists some non-linearity in the relationship between sensor measurements
and the determined classifications. This is typical of most mechanical systems,
largely due to the non-linear effect of damping.
25.10 References
Bongers, D., (2004) Development of a Classification System for Fault Detection in Longwall Systems, PhD Thesis, The University of Queensland
Chow, M.Y., (2000) Guest Editorial: Special Section on Motor Fault Detection and Diagnosis. IEEE Transactions on Industrial Electronics, 47(5):982–983
Frank, P.M., (1990) Fault diagnosis in dynamic systems using analytical and knowledge-based redundancy: a survey and some new results, Automatica, 26(3):459–474
Gelb, A., (1974) Applied Optimal Estimation, MIT Press, Cambridge, Massachusetts
Grewal, M.S., Andrews, A.P., (2001) Kalman Filtering: Theory and Practice Using MATLAB, John Wiley and Sons, New York
Hotelling, H., (1931) The generalization of Student's ratio. Annals of Mathematical Statistics, 2:360–378
McKay, B., Lennox, B., Willis, M., Barton, G., Montague, G., (1996) Extruder Modelling: A Comparison of two Paradigms. UKACC International Conference on Control'96, 2:734–739, Exeter, UK. Conference publication No. 427
Reid, A., (2007) Longwall Shearer Cutting Force Estimation, PhD Thesis, The University of Queensland
Sorenson, H.W., (1985) Kalman Filtering: Theory and Application, IEEE Press, New York
Todeschini, R., (1990) Weighted k-nearest neighbor method for the calculation of missing values, Chemometrics and Intelligent Laboratory Systems, 9:201–205
Venkatasubramanian, V., Rengaswamy, R., Yin, K., Kavuri, S., (2003) Review of Process Fault Diagnosis Parts I, II, III. Computers and Chemical Engineering, 27(3):293–346
Willsky, A.S., (1976) A survey of design methods for failure detection in dynamic systems, Automatica, 12:601–611
Contributor Biographies
Chapter 1
Khairy Kobbacy is the Professor of Management Science and Associate Head
(Research) of Salford Business School, Salford University, UK. He is also the
Director of the Management and Management Sciences Research Institute. Prof
Kobbacy has a B.Sc. from Cairo, an M.Sc. from Strathclyde and a Ph.D. from Bath
University. He has sustained research interests in mathematical modelling in
maintenance, intelligent management systems in operations, and supply chain
management. He has over 40 refereed publications and edited 9 volumes including
conference proceedings, special issues of international journals and ORS 46
Keynote papers. He chaired the European Conference on Intelligent Management
Systems in Operations in 1997, 2001 and 2005 and the IBC Middle East Conference: Superstrategies for Maintenance in 1998. He was elected Vice President of
the Operational Research Society (UK) for 2001–2003.
Prabhakar Murthy obtained B.E. and M.E. degrees from Jabalpur University and
the Indian Institute of Science in India and M.S. and Ph.D. degrees from Harvard
University. He is currently Research Professor in the Division of Mechanical
Engineering at the University of Queensland. He has held visiting appointments at
several universities in the USA, Europe and Asia. His research interests include
various aspects of new product development, operations management (lot sizing,
quality, reliability, maintenance), and post-sale support (warranties, service contracts). He has authored or coauthored 20 book chapters, 150 journal papers and
140 conference papers. He is a coauthor of five books and co-editor of two books.
He is on the editorial boards of eight international journals.
Chapter 2
Liliane Pintelon holds degrees in Chemical Engineering (1983) and Industrial
Management (1984) from the KULeuven (Catholic University of Leuven, Belgium). In 1988–1989 she worked as a visiting research associate at the W. Simon Graduate Business School (University of Rochester, USA). She obtained her doctoral
degree in industrial management (maintenance management) from the KULeuven
in 1990. Currently, she is professor at the Centre for Industrial Management
(KULeuven); she is also Board Member of BEMAS (Belgian Maintenance
Society) and of IFRIM (International Foundation for Research in Maintenance).
Her research and teaching area is industrial engineering and logistics, with a special interest in maintenance. The majority of her academic publications lie in this area, and she also has considerable experience as an industrial consultant in the field.
Alejandro Parodi-Herz received his M.Sc. degree in Mechanical Engineering from Simon Bolivar University, Venezuela (2002), a Master of Industrial Management (2003) from the Katholieke Universiteit Leuven, and a Master in Operations and Technology Management (2004) from the Universiteit Gent. He currently works at the Centre of Industrial Management at the Katholieke Universiteit Leuven as a research associate pursuing his Ph.D. degree. His research interest is mainly focused on maintenance, spare parts demand categorisation and inventory control.
Chapter 3
Jay Lee is Ohio Eminent Scholar and L.W. Scott Alter Chair Professor in
Advanced Manufacturing at the University of Cincinnati and is founding director
of National Science Foundation (NSF) Industry/University Cooperative Research
Centre (I/UCRC) on Intelligent Maintenance Systems. His current research focuses
on autonomic computing and smart prognostics technologies for predictive maintenance and self-maintenance systems, as well as closed-loop product life cycle service model studies. He has authored/co-authored over 100 technical publications, edited 2 books, contributed numerous book chapters, and holds 3 U.S. patents and 2 trademarks. He received his B.S. degree from Taiwan, an M.S. in Mechanical Engineering from the University of Wisconsin-Madison, an M.S. in Industrial Management from the State University of New York at Stony Brook, and a D.Sc. in
Mechanical Engineering from the George Washington University. He is a Fellow
of ASME and SME.
Haixia Wang is a postdoctoral researcher in the NSF Industry/University Cooperative Research Centre (I/UCRC) on Intelligent Maintenance Systems (IMS)
Center headquartered at the University of Cincinnati. Her current research interest
focuses on data streamlining for machinery prognostics and health management,
manufacturing process performance and quality improvement, and design for product reliability and serviceability. Haixia Wang received her B.S. degree in
Mechanical Engineering from Shandong University in China, a Ph.D. in Mechanical Engineering from Southeast University in China, and an M.S. and a Ph.D. in Industrial and Systems Engineering from the University of Wisconsin-Madison.
Chapter 4
Marvin Rausand is Professor of Reliability Engineering at the Norwegian University of Science and Technology (NTNU). He worked for the research institute
SINTEF for ten years, mostly on offshore oil and gas activities. For the last four years of this period he was Director of the SINTEF Department of Safety and
Reliability. In 1989 he joined NTNU as a full time professor. He was head of
NTNU's Department of Machine Design for five years and vice-dean of the Faculty of Mechanical Engineering for six years. In 1985–1986 he was visiting professor at Heriot-Watt University in Scotland, and in 2002–2003 he was visiting
professor at Ecole des Mines de Nantes. Professor Rausand is a member of the
Norwegian Academy of Technical Sciences, and of the Royal Norwegian Society
of Letters and Science.
Jørn Vatn is Professor of Maintenance Optimisation at the Norwegian University
of Science and Technology (NTNU). He worked for the research institute SINTEF
for 15 years, mostly related to transportation, critical infrastructure, and offshore
oil and gas activities. He has developed several computerized tools for decision
support in safety, reliability and maintainability. For the last five years he has been
involved in implementing a new maintenance strategy in the Norwegian National
Railway Administration.
Chapter 5
Wenbin Wang is Chair of Operational Research at the Centre for OR and Applied
Statistics, Salford Business School, University of Salford, UK. Prof. Wang
received his B.Sc. (Harbin, China) in Mechanical Engineering in 1981, M.Sc.
(Xi'an, China) in Operations Management in 1984 and Ph.D. in OR and Applied
Statistics from Salford University (UK) in 1992. He has over 20 years experience
in OR modelling in general and maintenance and reliability modelling in particular.
He has held three EPSRC research grants and has authored or co-authored over 80 research papers. Professor Wang is a fellow of the Royal Statistical Society, the Operational Research Society and the Institute of Mathematics and its Applications, and a chartered mathematician. He is also a member of the International Foundation for Research
in Maintenance. Professor Wang holds a guest professorship at Harbin Institute of
Technology, China.
Chapter 6
David Percy gained a B.Sc. degree with first class honours in mathematics from
Loughborough University in 1985 and a Ph.D. degree in statistics from Liverpool
University in 1990. He is a reader in mathematics at the University of Salford and
his research into Bayesian inference, stochastic processes and multivariate analysis
has produced 40 refereed publications and many conference presentations. He is
actively involved in collaborative research for industrial applications, particularly
concerning maintenance scheduling problems for complex systems. Dave is a
chartered scientist, chartered mathematician and member of the governing Council
for the Institute of Mathematics and its Applications.
Chapter 7
Elsayed Elsayed is Professor in the Department of Industrial Engineering, Rutgers
University. He is also the Director of the NSF/ Industry/ University Co-operative
Research Centre for Quality and Reliability Engineering, Rutgers-Arizona State
University. His research interests are in the areas of quality and reliability
engineering and Production Planning and Control. He is a co-author of Quality
Engineering in Production Systems, McGraw Hill Book Company, 1989. He is also
the author of Reliability Engineering, Addison-Wesley, 1996. These two books
received the 1990 and 1997 IIE Joint Publishers Book-of-the-Year Awards, respectively. He is a co-recipient of the 2005 Golomski Award for an outstanding paper.
Chapter 8
David Percy: See Chapter 6
Chapter 9
Khairy Kobbacy: See Chapter 1
Chapter 10
Bo Lindqvist is Professor in Statistics at the Department of Mathematical Sciences,
Norwegian University of Science and Technology, Trondheim (associate professor
since 1979, professor since 1988). He obtained the degree of Dr.Philos. in statistics
at the University of Oslo in 1982. Lindqvist's main research interest is in stochastic modeling and statistical analysis related to reliability and survival analysis.
Lindqvist is Editor of the Scandinavian Journal of Statistics (2007). He is an elected member of the Royal Norwegian Society of Sciences and Letters and of the International Statistical Institute.
Chapter 11
Robin Nicolai is a Ph.D. student at Tinbergen Institute Rotterdam. He is also
affiliated with the Econometric Institute at Erasmus University Rotterdam. His
research interests are maintenance optimization, in particular degradation modelling, discrete-event systems and simulation optimization. One of his papers has
been accepted for publication in Reliability Engineering and System Safety. Other
papers have appeared in proceedings of different international conferences.
Rommert Dekker is a full-time professor in Operations Research and Quantitative
Logistics at Erasmus University Rotterdam. His research interests are maintenance
optimization, inventory control, service and reverse logistics. He has published
over 100 papers in scientific journals and he has been involved in the development
of several decision support systems for maintenance planning.
Chapter 12
Philip Scarf is a lecturer at the University of Salford. He obtained his Ph.D. in
1989 from the University of Manchester. Among his research interests are capital
replacement, reliability and maintenance modelling, and extreme value theory. He
has worked on capital replacement problems with the UK NHS, Mass Transit Rail
Corporation of Hong Kong, Express National Berhad Malaysia, and Malaysia
Truck and Bus Berhad. He currently serves as co-editor of the IMA Journal of
Management Mathematics.
Joseph Hartman is an Associate Professor of Industrial and Systems Engineering
at Lehigh University in Bethlehem, PA, USA. He also serves as Department Chair
and holds the Kledaras Endowed Chair. He received his Ph.D. in 1996 from the
Georgia Institute of Technology and currently serves as Editor of The Engineering
Economist, a journal devoted to the problems of capital investment. His research
and teaching interests are in economic decision analysis, including equipment
replacement analysis and transportation logistics.
Chapter 13
Gabriella Budai is a Ph.D. student at Tinbergen Institute Rotterdam. She is also
affiliated with the Econometric Institute at Erasmus University Rotterdam. Her
research topic is railway maintenance optimization, in particular scheduling
preventive railway maintenance activities and rescheduling of the rolling stock
during track possession. Her papers have been published in Journal of the Operational Research Society (JORS) and in proceedings of different international
conferences.
Rommert Dekker: See Chapter 11
Robin Nicolai: See Chapter 11
Chapter 14
Wenbin Wang: See Chapter 5
Chapter 15
Prabhakar Murthy: See Chapter 1
Nat Jack is a Lecturer in Operational Research and Statistics at the University of
Abertay Dundee and has more than 30 publications in refereed journals, books, and
conference proceedings. The present focus of his research deals with product
warranty, in collaboration with Professor D.N.P. Murthy from the University of
Queensland, and this research has resulted in a series of papers examining optimal
maintenance strategies for items sold with one- and two-dimensional warranties.
His latest project involves a study of extended warranty decision-making using a
game theoretic approach.
Chapter 16
Prabhakar Murthy: See Chapter 1
Jarumon Pongpech received her B.E. (IE) from Chiang Mai University, Thailand, in 1993. She received a scholarship from the Faculty of Engineering, Chiang Mai University, to pursue her master's degree, and graduated with an M.S. (EM) from The George Washington University, USA, in 1996. For her doctoral degree she received a grant from Thailand's Commission on Higher Education in 2000 to study at the Department of Industrial Engineering, Chulalongkorn University, Thailand, and to conduct her research at the Division of Mechanical Engineering, The University of Queensland in Brisbane, Australia. She was a lecturer at Chiang Mai University until 1999 before moving to Thammasat University. Her research interests are in the areas of system maintenance policy, service contracts, engineering management, and industrial engineering.
Chapter 17
Ashraf Labib is Chair of Operations and Decision Analysis at Strategy and
Business Systems Department, Portsmouth Business School, University of Portsmouth. He holds a B.Sc. in Production Engineering, a M.B.A., a M.Sc. in integrated
manufacturing systems and a Ph.D. in maintenance systems. His research work
focuses on asset management, manufacturing maintenance systems, best practice
and decision-making. In particular, he is concerned with the analysis of data related to machine failures and with the design and development of computerised maintenance management systems (CMMSs). He is a Fellow of the Operational Research Society
(ORS), a Fellow of the IEE and a Chartered Engineer. He has published over 80
refereed papers in professional journals and international conferences proceedings.
He is currently the Associate Editor of IEEE Transactions SMC (Systems, Man, and
Cybernetics).
Chapter 18
Terje Aven is Professor of Risk Analysis and Risk Management at University of
Stavanger, Norway. He is also a Principal researcher at International Research
Institute of Stavanger (IRIS). He was Professor II (adjunct professor) in reliability and safety at the University of Trondheim (Norwegian Institute of Technology) in 1990–1995, and Professor II in reliability and risk analysis at the University of Oslo in 1990–2000.
He was the Dean of the Faculty of Technology and Science, Stavanger University College, in 1994–1996. Dr. Aven has many years of experience from the petroleum
industry (The Norwegian State Oil Company, Statoil). He is the author of several
reliability and risk related books and he is an associate editor/area editor/member of
the editorial board of several international journals. He received his master's degree
(cand.real) and Ph.D. in mathematical statistics (reliability) at the University of Oslo
in 1980 and 1984, respectively.
Chapter 19
Uday Kumar is a Professor of Operation and Maintenance Engineering at Luleå University of Technology, Sweden. He obtained his B.Tech. from India and a Ph.D. degree in the field of reliability and maintenance from Luleå University of Technology, Luleå, Sweden, in 1990. He worked for six years in the Indian mining industry prior to joining the postgraduate program. His research interests are equipment maintenance, reliability and maintainability analysis, product support, life
cycle costing, risk analysis, system analysis, etc. He is also a member of the editorial
boards and reviewer for many international journals. He has published more than
100 papers in international journals and conference proceedings.
Aditya Parida obtained his Ph.D. in the area of maintenance performance measurement and has taught operation and maintenance engineering at Luleå University of Technology, Sweden, since 2002. Prior to this, he taught the same subject at a couple of institutes in India and was joint director of the NIILM Centre for Management Studies, New Delhi. He has a bachelor's degree in mechanical engineering and a postgraduate qualification in industrial engineering from IIT Kharagpur, India, and has more than two decades of experience in the area of operation and maintenance engineering from the Indian Army, amongst others. He is actively involved in research in the area of maintenance performance measurement and other related issues. He has published a number of papers in this subject area and was the co-editor of the proceedings of COMADEM 2006.
Chapter 20
John Boylan is Professor of Management Science at Buckinghamshire Chilterns
University College. He holds degrees from Oxford and Warwick Universities and
has published papers on short-term forecasting in a variety of academic and
practitioner-oriented journals. In addition to his academic work, Professor Boylan
advises commercial organisations on forecasting processes and software. He also
leads a large project, funded by the European Union and the Learning and Skills
Council, facilitating the education and training of managers in small and medium
enterprises. His current research interests relate to demand forecasting in the
supply chain, with a particular emphasis on intermittent demand.
Aris Syntetos is a reader working with the Centre for Operational Research and
Applied Statistics (CORAS) at the University of Salford, UK. He holds a B.A.
degree from the University of Athens, an M.Sc. degree from Stirling University
and in 2001 he completed a Ph.D. at Brunel University (Buckinghamshire Business School). His research interests relate primarily to intermittent demand forecasting and the interface between forecasting and stock control. Aris's work has
appeared in the International Journal of Forecasting, International Journal of
Production Economics and Journal of the Operational Research Society. He is
currently holding three research grants: two from the Engineering and Physical
Sciences Research Council (EPSRC, UK) and one from the Department of Trade
and Industry (DTI, UK).
Chapter 21
Jørn Vatn: See Chapter 4
Chapter 22
Renyan Jiang is a Professor and Director of the Quality, Reliability and
Maintenance Laboratory at Changsha University of Science and Technology, China.
He obtained his undergraduate and graduate degrees from Wuhan University of
Technology, China, and his Ph.D. from the University of Queensland, Australia. He
held visiting appointments at City University of Hong Kong, University of
Saskatchewan, The Hong Kong Polytechnic University, and University of Toronto.
His research interests are in various aspects of quality, reliability and maintenance.
He is the author or co-author of three reliability related books, including Weibull
Models, Wiley, 2003. He has published 28 papers in international journals and a
number of other papers.
Xinping Yan is a Professor and Director of Reliability Engineering Institute at
Wuhan University of Technology, China. He obtained his undergraduate and
graduate degrees from Wuhan University of Technology, China, and his Ph.D.
from Xi'an Jiaotong University, China. He is a member of the ISO/TC108/SC5 Committee and a member of the Council Committee of the Tribology Institute of the Chinese
Mechanical Engineering Society (CMES). He is an editorial member of Journal of
COMADEM (U.K.) and the Journal of Maritime Environment (U.K.). His research
interests include condition monitoring and fault diagnosis, tribology and its
industrial application, and intelligent transport system.
Chapter 23
Uday Kumar: See Chapter 19
Ulla Espling is deputy director at the Luleå Railway Research Centre (JVTC) and a researcher within the Framework for Maintenance Strategies for Railway Infrastructure, dealing with a regulated administration, outsourced maintenance, high demands on safety and yearly funding. She has an M.Sc. degree in mechanical engineering and a Licentiate in operation and maintenance engineering. She also has a background in the railway sector going back to 1984, within which she has worked in traffic operation and planning, as a track engineer and design leader, and as head of a track area, giving her a broad and rich experience.
Chapter 24
Jayantha Liyanage is an Associate Professor of Asset Operations, Maintenance Technology, and Asset Management at the University of Stavanger (UiS), Norway. He is also the Chair and a project advisor of the Center for Industrial Asset Management (CIAM), and a member of the R&D group of the Center for Risk Management and Societal Safety (SEROS) at UiS. In addition, Dr. Liyanage serves as the Co-Organiser and Coordinator of the European Research Network for Strategic Engineering Asset Management (EURENSEAM). He has been appointed to the Board of Directors of the Society of Petroleum Engineers (SPE) Stavanger section, where he also serves as Chairman of the Scholarship committee. Dr. Liyanage is actively involved in numerous joint industry projects in advisory and managerial capacities. He has received a number of awards for his academic and research performance. He serves on the editorial boards of a number of international journals and on the international steering committees of many international conferences.
Chapter 25
Daniel Bongers received his B.E. (1999) and Ph.D. (2004) from the University of
Queensland, Australia. He is currently a research fellow at the Australian Cooperative Research Centre for Mining, and is responsible for managing two late-stage
technology development projects. His current research interests include physiological signal processing, fault detection and isolation, physiological fatigue detection
and signal measurement.
Hal Gurgenci received his B.Sc. (1976) and M.Sc. (1979) from the Middle East Technical University, Turkey, and his Ph.D. (1982) from the University of Miami. He is currently a professor with the School of Engineering, The University of Queensland, in Brisbane. Previously, he was a Vice President of the Australian Cooperative Research Centre for Mining, responsible for the research and education activities of the Centre. He was the principal investigator of several large projects in mining equipment design, automation, reliability, and maintenance. His current research interests include energy generation and conservation.
Index
A
ABC classification 484
Accelerated
Degradation testing 157
Failure time testing 156
Life testing plans 156
Adverse selection 388
Agency Theory 387
Issues 388
Aging parameter 517
AHP 427
Artificial intelligence 209
Asset 4
B
Bayesian
Approach 135
Decision Theory 146
Inference 136
Benchmarking
Methodology 562
Need 563
Overview 561
C
Candidate group 515
Case
Based reasoning 209, 212
Studies 69, 124, 150, 445
CBA See cost benefit analysis
CBM 52, 54
Applications 538
CMMS 43, 417
Composite scale 542
Condition monitoring techniques 112
Contract 402
Cost benefit analysis 509, 521, 529
Cost benefit ratio 525, 526, 529
Costs
Down time 324
Punctuality 526
Safety 525
Criticality index 93
D
Data
Acquisition 535
Fusion 537
Processing 536
Decision
Charts 35
Model 116
Support 42
Delay time
Bayesian approach 362
Modelling 345
Objective data method 364
Subjective estimation 359
Demand
Distribution 487
Estimators 500
Mean 489
Variance 492
Dependence
Economic 265
Stochastic 266
Structural 266
Diagnostics 536
Module 66
Technologies 598
Diesel Engine 538
Discount rate 521, 525, 528
Distributions
Event time 627
Posterior 139
Predictive 142
Prior 139
DMG 422
Dynamic grouping 512, 514, 516, 519, 527
E
Economy of scale 511
Economy of scope 511
Effective failure rate 512
E-maintenance 586
EMQ 333
E-operations 586
Equipment leasing 397
ERP 418
Extrusion press 366
F
Failure
Information 94
Interaction 275
Interaction Type I 276
Interaction Type II 278
Fault detection 611
FMECA 90, 517
Forecasting
Non-parametric 493
Parametric 487
FTA 442
Functional
Block diagrams 85
Failure analysis 84
Failures 85
Fuzzy logic 212, 428
G
Game
Nash 385
Stackelberg 385
Genetic algorithm 212
Government 400
H
HAZOP 441
HIMOS 217
HSE 471
I
Industry
Nuclear 473
Oil and gas 474
Process and utility 475
Railway 475, 565
Information fusion 128
Infrastructure 376
Inspections
Imperfect 349
Perfect 348
Intensity function
General proportional 193
Reduction 405
Interval optimization 105
Inventory decision 482
K
Knowledge based systems 212
KPI 461
L
Laplace trend test 197
Lease
Definition 397
Finance 398
New equipment 408
Operating 397
Sale and leaseback 399
Used equipment 409
Lessee 402
Lessor 401
Life cycle cost 34, 509, 510, 525
Calculations 525
M
Maintainability 8
Maintenance
Actions 27
Actions Selection 94
Benchmarking 563
Concepts 32
Concepts customized 40
Condition based 49, 424
Context 22
Contract 569
Corrective 27, 379
Design-out 30
Failure-based 30
Framework 4
Grouping activities 511
Intelligent Systems 56
Intervals 97
Longwall 613
Management 9, 22
Manager 41
Maturity levels 45
Measurement and control 225
Offshore asset 589
Opportunity-based 30, 519
Optimization 509, 511
Outsourcing 24
Outsourcing advantages 375
Outsourcing disadvantages 375
Passive 29
Performance 6
Performance Measurement 459
Policies 30
Predictive 29
Preventive 28, 79, 199, 379, 510–513, 519
Preventive comparison analysis 97
Preventive optimal schedule 170, 173
Proactive 29, 53
Reactive 51
Reliability centered 37
Scheduling 199, 271
Self 53
Service contract 6
Technologies 50
Time based 30
Total productive 37
Usage based 30
Metrics 461, 570
Misjudgment 549
Model
Age based 306
Basic risk 446
Capital replacement 303
Competing risk 245
Cumulative usage based 310
Dynamic programming 306
Economic life 290
Finite horizon 294
Intensity reduction 191
Linear regression 160
Markov 252
Non-homogeneous Poisson process 187
Period based 308
Proportional hazards 190
Proportional intensities 192
Renewal process 187
Repair alert 248
Risk influence 448
Selection 197, 553
State discriminant 533
Statistic based nonparametric 160
Statistic based parametric 159
Two-cycle 291
Virtual age 190
Monitoring
Off-line 615
Oil based 114
On-line 616
Vibration based 113
Moral hazard 388
MPM system 469
MTTF 517, 518
Multivariate control chart 549
N
Net present value 521, 525, 526, 528
Neural network 212, 622
NPV See Net present value
O
Oil degradation 541
Opportunity maintenance 325
Outsourcing
Operational 24
Strategic 24
Tactical 24
P
Parameter estimation 194
Delay time model 359
Penalties 407
Performance
Assessment 65
Assessment Multi-sensor 63
Indicators 461, 613
Measurement 494
Peristaltic pump 357
Planned maintenance cost 512
Point process 236
PriFo 522
Priors
Reference 140
Specification 145
Subjective 141
Prognostics 537
Approach 54
Approach Technologies 598
Project costs 524, 528
Punctuality costs 528
R
RCM
Analysis process 79
Data collection 80, 99
Implementation 80, 99
Regulator 400
Reliability
Inherent 6
Measures 254
Theory 8
Renewal Process
Alternating 189
Trend 235
Repair
Maximal 187
Minimal 187
Replacement model
Age based 306
Cumulative-usage based 310
S
Safety costs 527
Scheduled
Function test 95
Overhaul 96
Service
Agent 377
Contract 381
Part classification 480
Set-up costs 509, 511, 512, 513, 515, 517, 519
Shape parameter 517
Signal Processing 64
Spares forecasting
Approach 482
Method 482
Stakeholders 5
Static grouping 512, 513
Systems approach 4
T
Technologies
Diagnostic 598
Prognostic 598
Training set 637
Trend renewal process 237
Heterogeneous 242
U
Unplanned costs 512
V
Variable costs 525
Virtual age 406
W
Warranty
Extended 377
Servicing 384
Wear particles 541
Weibull 517