Series Editor
Professor Hoang Pham
Department of Industrial Engineering
Rutgers
The State University of New Jersey
96 Frelinghuysen Road
Piscataway, NJ 08854-8018
USA
Complex System
Maintenance Handbook
Khairy A.H. Kobbacy, PhD
Management and Management Sciences Research Institute
University of Salford
Salford, Greater Manchester M5 4WT
UK

D.N. Prabhakar Murthy, PhD
Division of Mechanical Engineering
The University of Queensland
Brisbane 4072
Australia
and process large amounts of relevant data, and in the tools and techniques needed
to build models to determine the optimal maintenance strategies.
The aim of this book is to integrate this vast literature with different chapters
focusing on different aspects of maintenance and written by active researchers
and/or experienced practitioners with international reputations. Each chapter re-
views the literature dealing with a particular aspect of maintenance (for example,
methodology, approaches, technology, management, modelling analysis and opti-
misation), reports on the developments and trends in a particular industry sector or,
deals with a case study. It is hoped that the book will help to narrow the gap
between theory and practice and trigger new research in maintenance.
The book is written for a wide audience. This includes practitioners from indus-
try (maintenance engineers and managers) and researchers investigating various
aspects of maintenance. Also, it is suitable for use as a textbook for postgraduate
programs in maintenance, industrial engineering and applied mathematics.
We would like to thank the authors of the chapters for their collaboration and
prompt responses to our enquiries which enabled completion of this handbook on
time. We also wish to acknowledge the support of the University of Salford and the
award of CAMPUS Fellowship in 2006 to one of us (PM). We gratefully acknowl-
edge the help and encouragement of the editors of Springer, Anthony Doyle and
Simon Rees. Also, our thanks to Sorina Moosdorf and the staff involved with the
production of the book.
Contents
Part A An Overview
Chapter 1: An Overview
K. Kobbacy and D. Murthy ...................................................................................... 3
Part E Management
1 An Overview
1.1 Introduction
The efficient functioning of modern society depends on the smooth operation of
many complex systems comprising several pieces of equipment that provide a
variety of products and services. These include transport systems (trains, buses,
ferries, ships and aeroplanes), communication systems (television, telephone and
computer networks), utilities (water, gas and electricity networks), manufacturing
plants (to produce industrial products and consumer durables), processing plants
(to extract and process minerals and oil), hospitals (to provide services) and banks
(for financial transactions) to name a few. All equipment is unreliable in the sense
that it degrades with age and/or usage and fails when it is no longer capable of
delivering the products and services. When a complex system fails, the
consequences can be dramatic: serious economic losses, harm to humans and
serious damage to the environment, as with, for example, the crash of an aircraft
in flight, the failure of a sewage processing plant or the collapse of a bridge.
Through proper corrective maintenance, one can restore a failed system to an
operational state by actions such as repair or replacement of the components that
failed and in turn caused the failure of the system. The occurrence of failures can
be controlled through maintenance actions, including preventive maintenance,
inspection, condition monitoring and design-out maintenance. With good design
and effective preventive maintenance actions, the likelihood of failures and their
consequences can be reduced but failures can never be totally eliminated.
The approach to maintenance has changed significantly over the last one
hundred years. A century ago, the focus was primarily on corrective
maintenance, delegated to the maintenance section of the business to restore failed
systems to an operational state. Maintenance was carried out by trained technicians
and was viewed as an operational issue and did not play a role in the design and
operation of the system. The importance of preventive maintenance was fully
appreciated during the Second World War. Preventive maintenance involves
additional costs and is worthwhile only if the benefits exceed the costs. Deciding
the optimum level of maintenance requires building appropriate models and use of
sophisticated optimisation techniques. Also, around this time, maintenance issues
started getting addressed at the design stage and this led to the concept of main-
tainability. Reliability and maintainability (R&M) became major issues in the
design and operation of systems.
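One classic way to formalise this cost trade-off is the age-replacement model: replace a unit preventively at age T at cost cp, or on failure at cost cf, and choose T to minimise the long-run cost per unit time. The sketch below is not from this handbook; the Weibull lifetime assumption and all numbers are purely illustrative.

```python
import math

def weibull_reliability(t, beta, eta):
    """P(unit survives to age t) under a Weibull lifetime."""
    return math.exp(-((t / eta) ** beta))

def cost_rate(T, cp, cf, beta, eta, steps=2000):
    """Long-run cost per unit time of age replacement at age T:
    (cp * P(survive to T) + cf * P(fail before T)) / E[cycle length]."""
    R_T = weibull_reliability(T, beta, eta)
    # E[cycle length] = integral of R(t) over [0, T] (trapezoidal rule)
    dt = T / steps
    expected_cycle = sum(
        0.5 * dt * (weibull_reliability(i * dt, beta, eta)
                    + weibull_reliability((i + 1) * dt, beta, eta))
        for i in range(steps)
    )
    return (cp * R_T + cf * (1.0 - R_T)) / expected_cycle

# Illustrative numbers only: PM costs 1 unit, a failure costs 10 units,
# wear-out failure behaviour (beta > 1), characteristic life 100 h.
cp, cf, beta, eta = 1.0, 10.0, 2.5, 100.0
candidates = range(10, 201, 5)
best_T = min(candidates, key=lambda T: cost_rate(T, cp, cf, beta, eta))
print("preventive replacement age:", best_T, "h")
```

Under wear-out behaviour (shape parameter above 1) the search settles on a finite replacement age; with a constant failure rate it would push T towards run-to-failure.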
Degradation and failure depend on the stresses on the various components of the
system. These depend on the operating conditions that are dictated by commercial
considerations. As a result, maintenance moved from a purely technical issue to a
strategic management issue with options such as outsourcing of maintenance, leasing
equipment as opposed to buying, etc. Also, advances in technologies (new materials,
new sensors for monitoring, data collection and analysis) added new dimensions
(science, technology) to maintenance. These advances will continue at an ever-
increasing pace in the twenty-first century.
This handbook tries to address the various issues associated with the main-
tenance of complex systems. The aim is to give a snapshot of the current status and
highlight future trends. Each chapter deals with a particular aspect of maintenance
(for example, methodology, approaches, technology, management, modelling
analysis and optimisation) and reports on developments and trends in a particular
industry sector or deals with a case study. In this chapter we give an overview of
the handbook. The outline of the chapter is as follows. Section 1.2 deals with the
framework that is needed to study the maintenance of complex systems and we
discuss some of the salient issues. Section 1.3 presents the structure of the book
and gives a brief outline of the different chapters in the handbook. We conclude
with a discussion of the target audience for the handbook.
1.2.1 Stakeholders
The number of parties involved depends on the asset under consideration.
For example, in the case of a rail network (used to provide a service to transport people
and goods) the customers can include the rail operators (operating the rolling
stock) and the public. The owner can be a business entity, a financial institution or
a government agency. The operator is the agency that operates the track and is
responsible for the flow of traffic. The service provider refers to the agency
carrying out the maintenance (preventive and corrective). It can be the operator (in
which case maintenance is done in-house) or some external agent (if maintenance
is outsourced) or both (when only some of the maintenance activities are out-
sourced). The regulator is the independent agency which deals with safety and risk
issues. They define the minimum standards for safety and can impose fines on the
owner, operator and possibly the service provider should the safety levels be
compromised. Government plays a critical role in providing the subsidy and
assuming certain risks. In this case all the parties involved are affected by the
maintenance carried out on the asset. If the line is shut down frequently and/or for
long durations, this can affect customer satisfaction and patronage, the returns to the
operators and owners, and the costs to the government.
We focus our attention on the case where maintenance of the asset is outsourced.
In this case, we have two parties: (i) the owner (of the asset) and (ii) the service
agent (providing the maintenance). Figure 1.2 is a very
simplified system characterisation of the maintenance process where the
maintenance activities are defined through a maintenance service contract. The problem
is to determine the terms of the service contract.
Each of the elements of Figure 1.2 involves several variables. For example, the
maintenance service contract involves the following: (i) duration of contract, (ii)
price of contract, (iii) maintenance performance requirements, (iv) incentives and
penalties, (v) dispute resolution, etc. The maintenance performance requirements
can include measures such as availability, mean time between failures and so on.
The characterisation of the owner's decision-making process can involve costs,
asset state at the end of the contract, risks (the service agent not providing the level
and quality of service) and so on. The interests and goals of the owner differ from
those of the service agent.
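Contractual performance requirements of this kind are typically verified from field data. As a minimal illustration (the failure log and the 97% availability target below are invented, not taken from the text), mean time between failures, mean time to repair and steady-state availability can be estimated as follows:

```python
# Hypothetical failure log for one asset: (uptime before failure, repair time) in hours.
failure_log = [(420.0, 6.0), (610.0, 9.5), (380.0, 4.0), (520.0, 12.5)]

mtbf = sum(up for up, _ in failure_log) / len(failure_log)    # mean time between failures
mttr = sum(rep for _, rep in failure_log) / len(failure_log)  # mean time to repair
availability = mtbf / (mtbf + mttr)                           # steady-state availability

required_availability = 0.97  # illustrative contract target, not from the text
print(f"MTBF = {mtbf:.1f} h, MTTR = {mttr:.1f} h, A = {availability:.4f}")
print("requirement met" if availability >= required_availability
      else "penalty clause applies")
```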
The study of maintenance is complicated by unknown and uncontrollable
factors. These include the rate of degradation (which depends on several factors
such as material properties and operating environment) and commercial factors
(for example, high demand for power from a power plant during very hot weather).
The key issues in the maintenance of an asset are shown in Figure 1.3. The asset
acquisition is influenced by business considerations and its inherent reliability is
determined by the decisions made during design. The field reliability and
degradation are affected by operations (usage intensity, operating environment,
operating load, etc.). Through the use of technologies, one can assess the state of
the asset. Analysis of the data, combined with models, allows the maintenance
decisions to be optimised (either for a given operating condition or by jointly
optimising maintenance and operations). Once the maintenance actions have been
formulated, they need to be implemented.
The linking of the technical and commercial issues is indicated in Figure 1.4
and this requires an inter-disciplinary approach.
1.2.3.1 Engineering
The degradation of an asset depends to some extent on the design and building (or
production) of the asset. Poor design leads to poor reliability, which in turn results in
a high level of corrective maintenance. On the other hand, a well-designed system is
more reliable and hence less prone to failures. Maintainability deals with main-
tenance issues at the design and development stage of the asset.
1.2.3.2 Science
This is very important in the understanding of the physical mechanisms that are at
play and have a significant influence on the degradation and failure. Choosing the
wrong material can have a serious consequence and impact on the subsequent
maintenance actions needed.
1.2.3.3 Economic
Maintenance costs can be a significant fraction of the total operating budget for a
business, depending on the industry sector. There are two types of costs: annual
cost and cost over the life cycle of the asset. The costs can be divided into direct
(labour, material etc.) and indirect (consequence of failure).
1.2.3.4 Legal
This is important in the context of maintenance out-sourcing and maintenance of
leased equipment. In both cases, the central issue is the contract between the
parties involved. Of particular importance is dispute resolution when there is a
disagreement between the parties in terms of the violation of some terms of the
contract.
1.2.3.5 Statistics
The degradation and failures occur in an uncertain manner. As such, the analysis of
such data requires the use of statistical techniques. Statistics provides the concepts
and tools to extract information from data and to plan efficient data collection
systems.
[Figure: maintenance decision levels. Strategic level: business perspective;
technical and commercial maintenance strategy; in-house vs. out-sourcing;
replacement/design changes]
The strategic level deals with maintenance strategy. This needs to be formu-
lated so that it is consistent and coherent with other (production, marketing,
finance, etc.) business strategies. The tactical level deals with the planning and
scheduling of maintenance. The operational level deals with the execution of the
maintenance tasks and collection of relevant data.
Part A: An Overview
Part E: Management
that has been developed by the authors, is used to illustrate the new approach.
Several examples from railway applications are provided.
difficulties are addressed and practical illustrations are presented, based on sub-
systems of oil platforms and
Chapter 25: Fault Detection and Identification for Longwall Machinery Using
SCADA Data
In an attempt to improve equipment availability and facilitate informed, preventive
maintenance, engineers may choose to implement one or more fault detection
and identification (FDI) technologies. For complex systems (systems for which
component interactions are not understood and model uncertainties are significant),
data-driven methods of FDI are often the only practicable solution. The develop-
ment of a data-driven FDI system for longwall mining equipment using SCADA
data is described here.
Significant data preprocessing was required to generate a quality example set.
Missing value estimation (MVE) techniques were required to complete the high-
dimensional stream of condition monitoring data from existing sensors. A cost
function, in combination with a linear discriminant analysis, was used to align the
inaccurate, categorical delay records with those delays inferred by the SCADA
data. A neural network was developed to determine the state of the system as a
function of the real-time SCADA data input. Validation of this algorithm with
unseen condition monitoring data showed misclassification rates of machine faults
as low as 14.3%.
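The pipeline described above (missing value estimation, record alignment, neural-network classification) is specific to that study. As a rough, hypothetical stand-in, the sketch below shows only the general shape of a data-driven FDI step: impute missing sensor values, then classify against per-class centroids. The data, feature layout and the nearest-centroid classifier are all illustrative substitutes for the SCADA data and neural network of the chapter.

```python
# Toy SCADA-like feature vectors; None marks a missing sensor reading.
train = [
    ([10.1, 0.52, None], "normal"),
    ([ 9.8, 0.49, 3.1 ], "normal"),
    ([14.9, 0.90, 7.2 ], "fault"),
    ([15.3, None, 7.0 ], "fault"),
]
n_features = 3

# 1) Missing value estimation: simple per-feature mean imputation.
col_means = []
for j in range(n_features):
    vals = [x[j] for x, _ in train if x[j] is not None]
    col_means.append(sum(vals) / len(vals))

def impute(x):
    return [col_means[j] if x[j] is None else x[j] for j in range(n_features)]

# 2) Nearest-centroid classifier as a stand-in for the neural network.
centroids = {}
for label in ("normal", "fault"):
    rows = [impute(x) for x, lab in train if lab == label]
    centroids[label] = [sum(col) / len(col) for col in zip(*rows)]

def classify(x):
    x = impute(x)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

print(classify([15.0, 0.88, None]))   # expected: "fault"
```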
2.1 Introduction
Over recent decades industrial maintenance has evolved from a non-issue into a
strategic concern. Perhaps few other management disciplines have undergone so
many changes over the last half-century. During this period, the role of
maintenance within the organization has been drastically transformed. At first
maintenance was nothing more than an inevitable part of production; now it is
an essential strategic element in accomplishing business objectives. Without a doubt,
the maintenance function is better perceived and valued in organizations.
Maintenance management is no longer viewed as an underdog function; it is now
considered an internal or external partner for success.
In view of fierce competition, many organizations seek to survive by
producing more, with fewer resources, in shorter periods of time. To meet these
pressing needs, physical assets take a central role. However, installations have
become highly automated and technologically very complex and, consequently,
maintenance management has had to become more sophisticated, coping with
higher technical and business expectations. Now the maintenance manager is
confronted with very complicated and diverse technical installations operating in
an extremely demanding business context.
This chapter, while considering the fundamental elements of maintenance and
its environment, describes the evolution path of maintenance management and the
driving forces of such changes. In Section 2.2 the maintenance context is described
and its dynamic elements are briefly discussed. Section 2.3 explains how
maintenance practice has evolved over time, distinguishing different epochs.
Further, this section devotes special attention to describing a common lexicon for
maintenance actions and policies before focusing on the evolution of maintenance
concepts. Section 2.4 underlines how the role of the maintenance manager has been
reshaped as a consequence of the changes of the maintenance function. Finally, the
chapter concludes with Section 2.5 identifying the new challenges for maintenance.
22 L. Pintelon and A. Parodi-Herz
[Figure: the maintenance management environment. People, society, legislation,
technological evolution, market competition, the outsourcing market, e-business
and information technology surround the management of technology, operations
and logistics support, aimed at total asset life cycle optimization]
To cope with and to coordinate the complex and changing characteristics that
constitute maintenance in the first place, a management layer is imperative.
Management is about what to decide and how to decide. In the maintenance
arena, a manager juggles with technology, operations and logistics elements that
mainly need to harmonize with production. Technology refers to the physical
assets which maintenance has to support with adequate equipment and tools.
Operations indicate the combination of service maintenance interventions with
Maintenance: An Evolutionary Perspective 23
core production activities. Finally, the logistics element supports the maintenance
activities in planning, coordinating and ultimately delivering, resources like spare
parts, personnel, tools and so forth. In one way or another, all these elements are
always present, but their intensity and interrelationships will vary from one
situation to another. For example, elevator maintenance in a hospital vs. plant
maintenance in the chemical process industries calls for a different maintenance
recipe tailored to the specific needs. Clearly, the choice of the structural elements
of maintenance is not independent of the environment. Besides, other factors,
such as the business context, society, legislation, technological evolution and the
outsourcing market, will be important. Furthermore, relatively new trends, such as
the e-business context, will influence current and future maintenance management
enormously. A whole new era for maintenance is expected as communication
barriers are bridged and coordination opportunities for maintenance services
become more intense.
Neither maintenance management nor its environment is stationary. The
constant changes in the field of maintenance are acknowledged to have enabled
new and innovative developments in maintenance science.
The technological evolution in production equipment, an ongoing evolution
that started in the twentieth century, has been tremendous. At the start of the
twentieth century, installations were barely or not mechanized, had simple design,
worked in stand-alone configurations and often had a considerable overcapacity.
Not surprisingly, nowadays installations are highly automated and technologically
very complex. Often these installations are integrated with production lines that are
right-sized in capacity.
Installations not only became more complex, they also became more critical in
terms of reliability and availability. Redundancy is only considered for very critical
components. For example, a pump in a chemical process installation can be con-
sidered very critical in terms of safety hazards. Furthermore, equipment built-in
characteristics such as modular design and standardization are considered in order
to reduce downtime during corrective or preventive maintenance. However, these
principles are commonly applied predominantly to some newer, very expensive
installations, such as flexible manufacturing systems (FMS). Fortunately, a move
towards higher levels of standardization and modularization is beginning to be
witnessed at all levels of installations. As life cycle optimization concepts are
commendable, it becomes mandatory that supportability and maintainability
requirements are well thought out at the early design stages.
Parallel to the technological evolution, the ever-increasing customer focus
causes even higher pressure, especially on critical installations. As customer
service in terms of time, quality and choice becomes central to production
decisions, more flexibility is required to cope with these varying needs. This calls
for well-maintained and reliable installations capable of fulfilling shorter and more
reliable lead times. Physical assets are ever more important for business
success.
[Figure: levels of maintenance outsourcing. Strategic/transformational level: full
service, "to think with" (e.g. outsourcing of all maintenance, BOT, ...); tactical
level: partnership and service packages, "to manage" (e.g. MRO, utilities,
facilities, ...); projects (e.g. renovation, shutdown, ...)]
The fact that maintenance has become more critical implies that a thorough
insight into the impact of maintenance interventions, or the omission of these, is
indispensable. In essence, good maintenance stands for the right allocation of
resources (personnel, spares and tools) to guarantee, by deciding on a suitable
combination of maintenance actions, higher reliability and availability of the
installations.
Furthermore, good maintenance foresees and avoids the consequences of the
failures, which are far more important than the failures as such. Bad or no
maintenance can appear to yield some savings in the short run, but sooner or later
it will prove more costly due to additional unexpected failures, longer repair times,
accelerated wear, etc. Moreover, bad or no maintenance may well have a
significant impact on customer service as delivery promises may become difficult to
fulfil. Hence, a well-conceived maintenance program is mandatory to attain busi-
ness, environmental and safety requirements.
Whatever the particular circumstances, if one intends to compile or judge any
maintenance programme, some elementary maintenance terms need to be
unambiguous and handled with consistency. Yet, both in practice and in the literature, a
lot of confusion exists. For example, what for some is a maintenance policy others
refer to as a maintenance action; what some consider preventive maintenance
others will refer to as predetermined or scheduled maintenance. Furthermore, some
argue that some concepts can almost be considered strategies or philosophies, and
so on. Certainly there is a lot of confusion, which is perhaps one of the inherent
characteristics of such a dynamic and young management science. Debates about
the precise meaning of some maintenance terms can almost be taken as
philosophical arguments. However, the adoption of a rather simple, but truly
germane, classification is essential. Without intending to disregard preceding
terminologies, or to impose or dictate a norm, we draw attention, in particular, to
three of those confusing terms: maintenance action, maintenance policy and
maintenance concept. In the remainder of this chapter the following terminology is
adopted.
Maintenance Action. A basic maintenance intervention, an elementary task carried
out by a technician. (What to do?)

Maintenance Policy. A rule or set of rules describing the triggering mechanism for
the different maintenance actions. (How is it triggered?)

Maintenance Concept. A set of maintenance policies and actions of various types
and the general decision structure in which these are planned and supported.
(Which logic and maintenance recipe are used?)
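One way to keep the three levels distinct is to encode them as nested data structures. The sketch below is our own illustration of the action/policy/concept hierarchy, not a formalism from the chapter; all names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MaintenanceAction:
    """Elementary task carried out by a technician (what to do)."""
    name: str

@dataclass
class MaintenancePolicy:
    """Rule describing what triggers the actions (how is it triggered)."""
    name: str
    trigger: str
    actions: list = field(default_factory=list)

@dataclass
class MaintenanceConcept:
    """Set of policies plus the decision structure around them."""
    name: str
    policies: list = field(default_factory=list)

# Hypothetical example: a CBM policy inside a plant-wide concept.
replace_bearing = MaintenanceAction("replace pump bearing")
cbm = MaintenancePolicy("CBM", trigger="vibration level exceeds preset limit",
                        actions=[replace_bearing])
concept = MaintenanceConcept("plant maintenance concept", policies=[cbm])
print(concept.name, "->", [p.name for p in concept.policies])
```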
[Figure: the maintenance hierarchy. Concepts (e.g. TPM, RCM, BCM, Q&D,
CIBOCOF)1 are built on policies (corrective, preventive, predictive,
precautionary), which in turn trigger maintenance actions]

1 See abbreviations list at the end of this chapter
that preventive actions could avoid some of the breakdowns and would lead to cost
savings in the long run. The main concern was how to determine, based on
historical data, the adequate period at which to perform preventive maintenance.
Certainly, not enough was known about failure patterns, which, among other
reasons, led to a whole separate branch of engineering and statistics: reliability
engineering.
In the late 1970s and early 1980s, equipment in general became more complex.
As a result, the superposition of the failure patterns of individual components
started to alter the failure characteristics seen in simpler equipment. Hence, if
there is no dominant age-related failure mode, preventive maintenance actions are
of limited use in improving the reliability of complex items. At this point, the
effectiveness of applying preventive maintenance actions started to be questioned
and was considered more carefully. A common concern about over-maintaining
grew rapidly. Moreover, as the entrenched belief in the benefits of preventive
maintenance was called into question, new precautionary (predictive) maintenance
techniques emerged.
This meant a gradual, though not complete, switch to predictive (inspection and
condition-based) maintenance actions. Naturally, predictive maintenance was, and
still is, limited to those applications where it was both technically feasible and
economically interesting. Supportive to this trend was the fact that condition-
monitoring equipment became more accessible and cheaper. Prior to that time,
these techniques were reserved only for high-risk applications such as airplanes or
nuclear power plants.
In the late 1980s and early 1990s a different imprint on maintenance history
was made with the emergence of concurrent engineering, or life cycle engineering.
Here maintenance requirements were already under consideration at earlier product
stages such as design or commissioning. As a result, instead of having to deal with
built-in characteristics, maintenance became active in setting design
requirements for installations and became partly involved in equipment selection
and development. All this led to a different type of precautionary (proactive) main-
tenance, the underlying principle of which was to be proactive at earlier product
stages in order to avoid later consequences. Furthermore, as the maintenance
function was better appreciated within the organization, more attention was paid to
additional proactive maintenance actions. For example, as operators are in direct
and regular contact with the installations, they can intuitively identify right or
wrong working conditions of the equipment. Conditions such as noise, smell, rattle
and vibration, which at a given point are not actually measured, represent tacit
knowledge of the organization that can be used to foresee, prevent or avoid failures
and their consequences in a proactive manner. Yet these actions are indeed typically not
performed by maintenance people themselves, but are certainly part of the
structural evolution of maintenance as a formal or informal partner within the
organization.
The last type of precautionary (passive) maintenance action is driven by the
opportunity created when other maintenance actions are planned. These maintenance
actions are precautionary since they occur prior to a failure, but are passive as they
wait to be scheduled alongside other, probably more critical, actions. Passive
actions are in principle low priority for the maintenance staff as, at a given moment
in time, they may not really pose a threat of functional or safety failure. However,
these actions can save significant maintenance resources as they may reduce the
Policy    Description

FBM       Maintenance (CM) is carried out only after a breakdown. In the case of
          CFR behaviour and/or low breakdown costs this may be a good policy.

TBM/UBM   PM is carried out after a specified amount of time (e.g. 1 month, 1000
          working hours, etc.). CM is applied when necessary. UBM assumes that
          the failure behaviour is predictable and of the IFR type. PM is assumed
          to be cheaper than CM.

CBM       PM is carried out each time the value of a given system parameter
          (condition) exceeds a predetermined value. PM is assumed to be cheaper
          than CM. CBM is gaining popularity because the underlying techniques
          (e.g. vibration analysis, oil spectrometry, ...) are becoming more widely
          available and at better prices. The traditional plant inspection rounds
          with a checklist are in fact a primitive type of CBM.

OBM       For some components, one often waits to maintain them until the
          opportunity arises while repairing some other, more critical component.
          Whether OBM is suited to a given component depends on the expectation
          of its residual life, which in turn depends on utilization.

DOM       The focus of DOM is to improve the design in order to make maintenance
          easier (or even to eliminate it). Ergonomic and technical (reliability)
          aspects are important here.
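The triggering mechanisms in the table above can be read as simple predicates over the current machine state. The sketch below is illustrative only; the state fields and threshold values (such as the vibration limit and PM interval) are invented, and DOM is omitted since it acts at design time rather than at run time.

```python
# Hypothetical snapshot of one machine's state.
state = {
    "failed": False,
    "hours_since_pm": 1150.0,    # running hours since the last PM
    "vibration_mm_s": 7.8,       # monitored condition parameter
    "other_job_planned": True,   # an OBM opportunity on a critical component?
}

def fbm_trigger(s):
    """FBM: act only after a breakdown."""
    return s["failed"]

def tbm_trigger(s, interval_h=1000.0):
    """TBM/UBM: act after a specified amount of calendar or usage time."""
    return s["hours_since_pm"] >= interval_h

def cbm_trigger(s, limit=7.1):
    """CBM: act when a monitored condition exceeds a predetermined value."""
    return s["vibration_mm_s"] > limit

def obm_trigger(s):
    """OBM: act when another, more critical job opens an opportunity."""
    return s["other_job_planned"]

for name, trig in [("FBM", fbm_trigger), ("TBM/UBM", tbm_trigger),
                   ("CBM", cbm_trigger), ("OBM", obm_trigger)]:
    print(name, "triggered" if trig(state) else "waiting")
```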
For the more common maintenance policies, many models have been developed
to support tuning and optimization of the policy settings. It is not our intention to
explain the fundamental differences between these models, but rather to provide an
overview of the types of policies available and why they have been developed. Much
has to do with the discussion in the previous section regarding the appropriateness
of maintenance actions. Therefore, it is clear that policy setting, and the understanding
of its efficiency and effectiveness, continues to be fine-tuned like any other
management science. We encourage the reader particularly interested in the underlying
principles and types of models to review McCall (1965), Geraerds (1972), Valdez-Flores
and Feldman (1989), Cho and Parlar (1991), Pintelon and Gelders (1992), Dekker (1996),
Dekker and Scarf (1998) and Wang (2002) for a full overview of the state-of-the-art
literature.
The whole evolution of maintenance was based not solely on technical but
rather on techno-economic considerations. FBM is still applied provided the cost
of PM is equal to or higher than the cost of CM. FBM is also typically suitable in
the case of random failure behaviour with a constant failure rate, as TBM or UBM
are then not able to reduce the failure probability. In some cases, if there exists a
measurable condition which can signal the probability of a failure, CBM can also
be feasible. Finally, an FBM policy is also applied for installations where frequent
PM is impracticable and expensive, as can be the case for the maintenance of glass ovens.
Either TBM or UBM is applied if the CM cost is higher than the PM cost, or if it is
necessary because of criticality due to a bottleneck installation or safety hazards.
Also, in the case of increasing failure rate behaviour, as for example with
wear-out phenomena, TBM and UBM policies are appropriate.
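The reasoning behind these rules can be checked numerically: with exponentially distributed lifetimes (constant failure rate), the probability of failing in the next interval does not depend on age, so replacing an aged unit preventively buys nothing, whereas under IFR (wear-out) behaviour an aged unit really is riskier. The parameters below are illustrative.

```python
import math

def exp_cond_fail_prob(t_survived, horizon, rate):
    """P(fail within `horizon` | survived to `t_survived`), exponential life.
    Equals 1 - S(t + h) / S(t) with S(t) = exp(-rate * t)."""
    return 1 - math.exp(-rate * (t_survived + horizon)) / math.exp(-rate * t_survived)

def weibull_cond_fail_prob(t_survived, horizon, beta, eta):
    """Same conditional probability for a Weibull lifetime."""
    S = lambda t: math.exp(-((t / eta) ** beta))
    return 1 - S(t_survived + horizon) / S(t_survived)

rate = 1 / 500.0  # constant failure rate (CFR), illustrative
fresh = exp_cond_fail_prob(0.0, 100.0, rate)
aged = exp_cond_fail_prob(1000.0, 100.0, rate)
# Memoryless: the two probabilities coincide, so TBM/UBM cannot help.
print(round(fresh, 6), round(aged, 6))

# With wear-out (IFR, beta > 1) the aged unit is riskier, so TBM/UBM pays off.
fresh_w = weibull_cond_fail_prob(0.0, 100.0, 2.5, 500.0)
aged_w = weibull_cond_fail_prob(1000.0, 100.0, 2.5, 500.0)
print(fresh_w < aged_w)
```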
Typically, CBM was mainly applied in situations where the investment in
condition monitoring equipment was justified by high risks, as in aviation
or nuclear power generation. Currently, CBM is beginning to be generally
accepted for maintaining all types of installations. Increasingly this is becoming
common practice in the process industries. In some cases, however, technical
feasibility is still a hurdle to overcome. Another reason CBM catches the attention
of practitioners is the potential savings in spare parts replacement thanks to
accurate and timely forecasts of demand. In turn, this may enable better spare
parts management through coordinated logistics support.
Finding and applying a suitable CBM technique is not always easy. For example,
the analysis of the output of some measurement equipment, such as advanced
vibration monitoring equipment, requires a lot of experience and is often work for
experts. There are also simpler techniques, such as infrared measurement and oil
analysis, suitable in other contexts. At the other extreme, predictive techniques can
be rather simple, as is the case with checklists. Although a fairly low-level activity,
these checklists, together with the human senses (visual inspections, detection of
strange noises in rotating equipment, etc.), can detect a lot of potential problems
and initiate PM actions before the situation deteriorates to a breakdown.
At present, FBM, TBM, UBM and CBM take the physical assets which they are
intended to maintain as given. In contrast, there are more proactive maintenance
actions and policies which, instead of considering the system as a given, look at
the possible changes or safety measures needed to avoid maintenance in the first
place. This proactive policy is referred to as DOM. It implies that maintenance is
proactively involved at earlier stages of the product life cycle to solve potential
maintenance-related problems. Ideally, DOM policies intend to avoid maintenance
completely throughout the operating life of installations, though this may not be
realistic. This leads one to consider a diverse set of maintenance requirements at the
32 L. Pintelon and A. Parodi-Herz
Generation | Concept | Description | Main strengths | Main weaknesses
1st | Ad hoc | Implementing FBM and UBM policies; rarely CBM, DOM, OBM | Simple | Ad hoc decisions
2nd | LCC | Detailed cost breakdown over the equipment's lifetime, helping to plan the maintenance logistics | Sound basic philosophy | Resource and data intensive
All these concepts, like many others, have several advantages and suffer from
specific shortcomings. Correspondingly, new maintenance concepts are developed,
old ones are updated, and methodologies to design customized maintenance
concepts are created. These concepts enjoy a lot of interest in their original form
and also give rise to many derived concepts; for example, streamlined RCM was
derived from RCM. One may consider customized maintenance concepts to
constitute the third generation of this evolution. They have emerged mainly because
it is very difficult to claim a one-size-fits-all concept in the complex and still
constantly changing world of maintenance. They are inspired by the former concepts
while trying to avoid previously experienced drawbacks. One way or another,
customized maintenance concepts mainly consist of cherry picking useful
techniques and ideas applied in other maintenance concepts. This important but
relatively new approach is expected to grow in importance both in practice and
in academia. Concepts that belong to this generation are, for example, value
driven maintenance (VDM) and CIBOCOF, which was developed at the Centre of
Maintenance: An Evolutionary Perspective 35
[Figure: breakdown of total time into operating time and losses due to planning delays and failures.]
at large. Well known are the books by Nowlan and Heap (1978), Anderson and
Neri (1990) and Moubray (1997), which contributed to the adoption of RCM by
industry.
Note that today many versions of RCM are around, streamlined RCM being
one of the more popular ones. However, the Society of Automotive Engineers
(SAE) holds the generally accepted RCM definition. SAE puts forward the
following basic questions to be answered by any RCM implementation; if any of
these is omitted, the method cannot correctly be referred to as RCM. To answer
these seven questions a clear step-by-step procedure exists, and decision charts and
forms are available:
1. What are the functions and associated performance standards of the asset in its
present operating context?
2. How can it fail to fulfil its functions? (functional failures)
3. What causes each failure? (failure modes)
4. What happens when each failure occurs? (failure effects)
5. In what way does each failure matter? (failure consequences)
6. What should be done to predict or prevent each failure? (proactive tasks and
task intervals)
7. What should be done if a suitable proactive task cannot be found? (default
actions)
RCM is undeniably a valuable maintenance concept. It takes into account
system functionality, and not just the equipment itself. The focus is on reliability.
Safety and environmental integrity are considered to be more important than cost.
Applying RCM helps to increase the asset's lifetime and to establish more efficient
and effective maintenance. Its structured approach fits the knowledge management
philosophy: reduced human error, more and better historical data and analysis,
exploitation of expert knowledge, and so forth.
RCM is popular and many RCM implementations have started during the last
decade. Although RCM offers many benefits, there are also drawbacks. From the
conceptual point of view there are some weak points. For instance, the original
RCM does not offer a task packaging feature and thus does not automatically
yield a workable maintenance plan; likewise, the standard decision charts and
forms, while helpful, are far from perfect. A serious criticism, mainly from the
academic side, concerns the scientific basis of RCM: the FMEA analysis, which is
at the heart of RCM, is often done on a rather ad hoc basis. The available
statistical data are often insufficient or inaccurate, there is a lack of insight into
the equipment degradation process (failure mechanisms), and the physical
environment (e.g. corrosive or dusty conditions) is ignored. The balance between
valuable experience and equally valuable, objective statistical evidence is often
absent. Many companies call in the (expensive) help of consultants to implement
RCM; some of these consultants however are not capable of offering the help
wanted and this in combination with the lack of in-house experience with RCM
discredits this methodology. RCM is in fact an on-going process, which often
causes reluctance to engage in an RCM project. RCM is undoubtedly a very
resource consuming process, which also makes it difficult to apply RCM to all
equipment.
achieve this objective the traditional RCM should be enhanced. Coetzee proposes a
new RCM concept that blends techniques from different RCM authors. He
also puts forward some innovations, like the funnelling approach, to ensure that
RCM efforts are concentrated on the most important failure modes in the
organization.
Finally, there is a vast range of so-called streamlined RCM concepts. These
concepts claim to be derivations of RCM. It is mainly consultants who promote
streamlined RCM as the solution to the resource-consuming character of RCM.
Although streamlining sounds attractive it should be carefully applied, in order to
keep the RCM benefits. Different streamlining approaches exist; however, very
few are acceptable as formal RCM methodologies. Based on Pintelon and Van
Puyvelde (2006), Table 2.3 provides a picture of popular streamlined RCM ap-
proaches.
Approach | Description | Weakness
Generic approach | Uses generic lists of failure modes, or even generic analyses of technical systems | Ignores the operational context of the technical systems and the current maintenance practices; assumes a standard level of analysis detail for all systems
Skipping approach | Omits one or more steps; typically the first step (functions) is skipped and the analysis starts with listing the failure modes | Omits the first and essential step of RCM, i.e. the functional analysis, and as such does not allow for sound performance standard setting
Troublemaker approach | Carries out a full RCM analysis for critical equipment only; critical equipment is defined here as bottleneck equipment, equipment with a lot of maintenance problems in the past, or equipment critical in terms of safety hazards | As above, although here all RCM steps are followed, which guarantees a complete picture
best maintenance practices and concepts such as TPM, RCM and RBI. It shows
where the added-value of maintenance lies and how an organisation can be best
structured to realise this value. One of the main contributions of VDM is that it
offers a common language to management and maintenance to discuss maintenance
matters. VDM identifies four value drivers in maintenance and provides concepts to
manage by those drivers. For all four value drivers, maintenance can help to
increase a company's economic value. VDM makes a link between value drivers and
core competences. For each of the core competences, some managerial concepts are
provided.
Most recently, Waeyenbergh (2005) presents CIBOCOF as a framework for
developing customised maintenance concepts. CIBOCOF starts out from the idea
that although all maintenance concepts available from the literature contain
interesting ideas, none of them is suitable for implementation without further
customization. Companies have their own priorities in implementing a maintenance
concept and are likely to go for cherry picking from existing concepts. CIBOCOF
offers a framework to do this in an integrated and structured way. Figure 2.6
illustrates the steps that this concept structurally goes through. A particularly
interesting step is step 5, maintenance policy optimization, where a decision chart is
offered to determine which mathematical decision model can be used to optimize
the chosen policy (step 4). This decision chart guides the user through the vast
literature on the topic.
[Figure 2.6: the CIBOCOF framework: M1 start-up, M2 technical analysis, M3 maintenance policy decision making resulting in the maintenance plan, M4 implementation and evaluation, M5 continuous improvement.]
Nowadays, the decisions expected from the maintenance manager are complex and
can sometimes have far-reaching consequences. He or she is (partly) responsible for
the operational, tactical and strategic aspects of the company's maintenance
management. This involves final responsibility for operational decisions, like the
planning of maintenance jobs, and for tactical decisions concerning the long-term
maintenance policy to be adopted. More recently, maintenance managers are also
consulted in strategic decisions, e.g. purchases of new installations, design choices,
personnel policy, etc.
The career path of today's maintenance manager starts out with rather technical
content, but evolves over time towards more financial and strategic responsibilities. This
career path can be horizontal or vertical. It is also important that the maintenance
manager is a good communicator and people manager, as maintenance remains a
labor-intensive function. The maintenance manager needs to be able to attract and
retain highly skilled technicians. On-going training for technicians is needed to keep
track of the rapidly evolving technology. The motivation of maintenance technicians
often requires special attention: job autonomy is greater in maintenance than in
production, instructions may be vague, immediate assessment of the quality of work
is mostly not possible, complaints are heard more often than compliments, etc.
Aspects like safety and ergonomics are an indispensable element of current
maintenance management. Besides people, materials are another important resource for
maintenance work. Maintenance material logistics mainly concerns spare parts
management and finding the optimum trade-off between high spare parts
availability and the corresponding stock investment.
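This trade-off can be made concrete with a simple base-stock calculation. The sketch below is illustrative only: the Poisson demand assumption, the demand rate and the lead time are hypothetical examples, not figures from this chapter. It finds the smallest stock level whose probability of covering lead-time demand meets a target service level:

```python
from math import exp

def min_stock_for_service_level(demand_rate, lead_time, target):
    """Smallest base-stock level s such that Poisson lead-time demand is
    covered with probability P(D <= s) >= target (target must be < 1)."""
    mean = demand_rate * lead_time
    prob = exp(-mean)          # P(D = 0)
    cdf = prob
    s = 0
    while cdf < target:
        s += 1
        prob *= mean / s       # Poisson pmf recursion: P(D = s)
        cdf += prob
    return s

# hypothetical part: 2 demands per month, 1.5 month resupply lead time
s = min_stock_for_service_level(2.0, 1.5, target=0.95)
```

Raising the target service level increases the required stock, and thus the stock investment, which is precisely the trade-off described above.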
The evolution in maintenance management described above creates a sharp need
for decision support techniques of various kinds: statistical analysis tools for
predicting the failure behaviour of equipment, decision schemes for determining
the right maintenance concept, mathematical models to optimize the maintenance
policy parameters (e.g. PM frequency), decision criteria concerning e-maintenance,
decision aids for outsourcing decisions, etc. Table 2.4 illustrates the use of some
decision support techniques for maintenance management. These techniques are
available and have proven their usefulness for maintenance, but they are not yet
widely adopted.
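As an illustration of such a mathematical model for optimizing a PM frequency, the following sketch uses a standard age-replacement formulation under an assumed Weibull lifetime distribution; all cost and shape parameters are hypothetical, chosen only to make the example run:

```python
import math

def age_replacement_cost_rate(T, cp, cf, beta, eta):
    """Long-run cost per unit time of an age-replacement policy: preventive
    replacement at age T (cost cp) or on failure (cost cf > cp), with
    Weibull(beta, eta) lifetimes."""
    R = lambda t: math.exp(-(t / eta) ** beta)     # survival function
    # expected cycle length = integral of R(t) from 0 to T (Riemann sum)
    n = 1000
    dt = T / n
    mean_cycle = sum(R(i * dt) for i in range(n)) * dt
    expected_cost = cp * R(T) + cf * (1.0 - R(T))
    return expected_cost / mean_cycle

def optimal_interval(cp, cf, beta, eta):
    """Crude grid search for the cost-minimizing PM interval."""
    grid = [eta * k / 100 for k in range(1, 301)]
    return min(grid, key=lambda T: age_replacement_cost_rate(T, cp, cf, beta, eta))

T_star = optimal_interval(cp=1.0, cf=10.0, beta=2.5, eta=100.0)
```

With an increasing failure rate (beta > 1) and failures much costlier than planned replacements, the search returns a finite optimal interval; with beta <= 1, preventive replacement brings no benefit, which mirrors the earlier remark that TBM suits wear-out behaviour.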
In the 1960s most maintenance publications were very mathematically oriented
and mainly focussed on reliability. The 1970s and early 1980s publications were
more focused on maintenance policy optimization such as determination of opti-
mum preventive maintenance interval, planning of group replacements and inspec-
tion modelling. This was a step forward, although these models still often were too
focussed on mathematical tractability rather than on realistic assumptions and
hypotheses. This caused an unfortunate gap between academics and practitioners.
The former had the impression that industry and the service sector were not ready
for their work, while the latter felt frustrated because the models were too
theoretical. Fortunately, this is changing. Academics pay more attention to the real-
life background of their subject and practitioners discover the usefulness of the
academic work. Moreover, academic work is getting broader and offers a more diverse
range of models and concepts, such as maintenance strategy design models,
e-maintenance concepts, service parts supply policies, and the like besides the
more traditional maintenance optimization models. With the introduction of
maintenance software, the data required for these models can be collected more
easily. There is still a big gap between practitioners and academics, but it is
slowly closing.
The help from information technology (IT) is of special interest when dis-
cussing decision support for maintenance managers. Computerized maintenance
management systems (CMMS), also called computer aided maintenance manage-
ment (CAMM), maintenance management information systems (MMIS) or even
enterprise asset management systems (EAM), nowadays offer substantial support
for the maintenance manager. These systems too have evolved over time (Table
2.5). IT of course also supports the e-maintenance applications and offers splendid
opportunities for knowledge management implementations. At the beginning of the
knowledge management hype, knowledge management was mainly aimed at fields
like R&D, innovation management, etc. Later on the potential benefits of
knowledge management were also recognized for most business functions. For
maintenance management, a knowledge management programme helps to capture
the implicit knowledge and expertise of maintenance workers and secure this
information in information systems, so making it accessible for other technicians.
The benefits of this in terms of consistency in problem solving approach and
knowledge retention are obvious. Other knowledge management applications can
be, for example, expert systems, assisting in the diagnosis of complex equipment
Generation | CMMS characteristics | Business IT systems
1st generation (1970s) | Mainly registration and data administration (EDP) |
3rd generation (1990s ...) | Broader scope, e.g. also asset utilization and an EHS module; external communication possible, e.g. e-MRO |
function and an area of intensive academic research. Efforts are aimed at
advancing towards world-class maintenance and at providing methodologies to do so.
Pintelon et al. (2006) describe several maintenance maturity levels required to
achieve world-class maintenance; these are illustrated in Figure 2.7.
still not enough research on the link between maintenance and business
strategy. The main focus of maintenance management research is still on
tactical and operational planning. Links between these two streams of research,
however, are still very rare. Closing this gap by linking maintenance and
business throughout all decision levels is one of the major challenges for the
future; every step taken brings us closer to real world-class maintenance.
2.7 References
Anderson, R.T., Neri, L., (1990), Reliability Centred Maintenance: Management and
Engineering Methods, Elsevier Applied Science, London
Blanchard, B.S., (1992), Logistics Engineering and Management, Prentice Hall, Englewood
Cliffs, New Jersey
Cho, D.I., Parlar, M., (1991), A survey of maintenance models for multi-unit systems.
European Journal of Operational Research, 51(1):1-23
Coetzee, J.L., (2002), An Optimized Instrument for Designing a Maintenance Plan: A Sequel
to RCM. PhD thesis, University of Pretoria, South Africa
Dekker, R., (1996), Applications of maintenance optimization models: a review and
analysis. Reliability Engineering and System Safety, 52(3):229-240
Dekker, R., Scarf, P.A., (1998), On the impact of optimisation models in maintenance
decision making: the state of the art. Reliability Engineering and System Safety,
60:111-119
Geraerds, W.M.J., (1972), Towards a Theory of Maintenance. The English Universities Press,
London
Gits, C.W., (1984), On the Maintenance Concept for a Technical System: A Framework for
Design. PhD thesis, TU Eindhoven, The Netherlands
Haarman, M., Delahay, G., (2004), Value Driven Maintenance: New Faith in Maintenance,
Mainnovation, Dordrecht, The Netherlands
Jones, R.B., (1995), Risk-Based Maintenance, Gulf Professional Publishing (Elsevier),
Oxford
Kelly, A., (1997), Maintenance Organizations & Systems: Business-Centred Maintenance,
Butterworth-Heinemann, Oxford
McCall, J.J., (1965), Maintenance policies for stochastically failing equipment: a survey.
Management Science, 11(5):493-524
Moubray, J., (1997), Reliability-Centred Maintenance, 2nd edition, Butterworth-Heinemann,
Oxford
Nowlan, F.S., Heap, H.F., (1978), Reliability Centered Maintenance, United Airlines
Publications, San Francisco
Parkes, D., (1970), in Jardine, A.K.S. (ed.), Operational Research in Maintenance,
Manchester University Press, Manchester
Pintelon, L., Gelders, L., Van Puyvelde, F., (2000), Maintenance Management, Acco,
Leuven/Amersfoort
Pintelon, L., Gelders, L., (1992), Maintenance management decision making. European
Journal of Operational Research, 58:301-317
Pintelon, L., Pinjala, K., Vereecke, A., (2006), Evaluating the effectiveness of maintenance
strategies. Journal of Quality in Maintenance Engineering, 12(1):214-229
Pintelon, L., Van Puyvelde, F., (2006), Maintenance Decision Making, Acco, Leuven,
Belgium
Takahashi, Y., Osada, T., (1990), TPM: Total Productive Maintenance. Asian Productivity
Organization, Tokyo
Valdez-Flores, C., Feldman, R.M., (1989), A survey of preventive maintenance models for
stochastically deteriorating single-unit systems. Naval Research Logistics, 36:419-446
Waeyenbergh, G., (2005), CIBOCOF: A Framework for Industrial Maintenance Concept
Development. PhD thesis, Centre for Industrial Management, K.U.Leuven, Leuven,
Belgium
Waeyenbergh, G., Pintelon, L., (2002), A framework for maintenance concept development.
International Journal of Production Economics, 77:299-313
Wang, H., (2002), A survey of maintenance policies of deteriorating systems. European
Journal of Operational Research, 139:469-489
3
New Technologies for Maintenance
3.1 Introduction
For years, maintenance has been treated as a dirty, boring and ad hoc job. It is seen as
critical for maintaining productivity, but has yet to be recognized as a key component
of revenue generation. The question most often asked is "Why do we need to
maintain things regularly?" The answer is "To keep things as reliable as possible."
However, the question that should be asked is "How much change or degradation has
occurred since the last round of maintenance?" The answer to this question is "I
don't know." Today, most machine field services depend on sensor-driven management
systems that provide alerts, alarms and indicators. By the moment the alarm
sounds, it is already too late to prevent the failure. Therefore, most machine
maintenance today is either purely reactive (fixing or replacing equipment after it fails) or
blindly proactive (assuming a certain level of performance degradation, with no input
from the machinery itself, and servicing equipment on a routine schedule whether
service is actually needed or not). Both scenarios are extremely wasteful.
Rather than reactive, fail-and-fix maintenance, world-class companies are
moving towards predict-and-prevent maintenance. A maintenance
scheme referred to as condition based maintenance (CBM) was developed by
considering current degradation and its evolution. CBM methods and practices
have been continuously improved over the last decades; however, CBM is conducted
at the equipment level, one piece of equipment at a time, and the prognostics
approaches developed are application- or equipment-specific.
Holistic approaches, real-time prognostics devices, and rapid implementation
environments are potential future research topics in product and system health
assessment and prognostics. Given the level of integrated network systems
development in today's global business environment, machines and factories are
networked, and information and decisions are synchronized in order to maximize a
company's asset investments. This generates a critical need for a real-time remote
machinery prognostics and health management (R2M-PHM) system. The unmet
needs in maintenance can be categorized into the following:
50 J. Lee and H. Wang
3.2.1.1 No Maintenance
There are two kinds of situation in which no maintenance will occur:
- No way to fix it: the maintenance technique is not available for a special
application, or the maintenance technique is at too early a stage of development.
- Not worth fixing: some machines are designed to be used only once;
compared to the maintenance cost, it may be more cost-effective simply to
discard them.
Neither of these scenarios is within the scope of the discussion here.
New Technologies for Maintenance 51
[Figure: evolution of maintenance schemes versus machine performance and uptime: no maintenance, reactive maintenance (fire fighting), preventive maintenance (scheduled maintenance), predictive maintenance, proactive maintenance (failure root cause analysis), and self-maintenance or maintenance-free machines.]
from the historical databases of equipment behavior over time. These two indices
provide a rough estimate of the time between two adjacent breakdowns and the
mean time needed to restore a system when such breakdowns happen. Although
equipment degradation processes vary from case to case, and the causes of failure
can be different as well, the information contained in MTBF and MTTR can still
be informative. Other indices can also be extracted and used, including the mean
lifetime, mean time to first failure, and mean operational life, as discussed by Pham
et al. (1997). With the introduction of minimal repair and imperfect maintenance,
various extensions and modifications to the age-dependent PM policy have been
proposed (Bruns 2002; Chen et al. 2003). Another preventive maintenance policy
that received much attention is the periodic PM policy, in which degraded
machines are repaired or replaced at fixed time intervals independent of the
equipment failures. Various modifications and enhancements to this maintenance
policy have also been proposed recently (Cavory et al. 2001).
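The MTBF and MTTR indices discussed above can be estimated directly from a breakdown log. A minimal sketch follows; the log format and the timestamps are hypothetical, invented for illustration:

```python
from datetime import datetime

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")

def mtbf_mttr(log):
    """log: chronological list of (failure_start, back_in_service) timestamp
    strings.  Returns (MTBF, MTTR) in hours: MTTR is the mean repair
    duration, MTBF the mean interval between successive failure starts."""
    events = [(parse(f), parse(r)) for f, r in log]
    mttr = sum((r - f).total_seconds() for f, r in events) / len(events) / 3600
    gaps = [(events[i + 1][0] - events[i][0]).total_seconds()
            for i in range(len(events) - 1)]
    mtbf = sum(gaps) / len(gaps) / 3600
    return mtbf, mttr

# hypothetical breakdown log for one machine
log = [("2024-01-01 08:00", "2024-01-01 10:00"),
       ("2024-01-11 08:00", "2024-01-11 12:00"),
       ("2024-01-21 08:00", "2024-01-21 09:00")]
mtbf, mttr = mtbf_mttr(log)   # mtbf = 240.0 h, mttr = 7/3 h
```

As the text notes, such indices give only a rough, population-level estimate; they say nothing about the current health state of an individual asset.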
These preventive maintenance schemes are time-based and do not consider the
current health state of the product; they are thus inefficient and less valuable for a
customer whose individual asset is of primary concern. In the case of helicopter
gearboxes, it was found that almost half of the units were removed for overhaul
even though they were in a satisfactory operating condition. Techniques for more
economical and reliable maintenance are therefore needed.
3.2.1.6 Self-maintenance
Self-maintenance is a new design and system methodology. Self-maintenance
machines are expected to be able to monitor, diagnose, and repair themselves in
order to increase their uptime.
One system approach to enabling self-maintenance is based on the concept of
functional maintenance (Umeda et al. 1995). Functional maintenance aims to
recover the required function of a degrading machine by trading off functions,
whereas traditional repair (physical maintenance) aims to recover the initial
physical state by replacing faulty components, cleaning, etc. The way to fulfil the
self-maintenance function is by adding intelligence to the machine, making it
clever enough for functional maintenance, so that the machine can monitor and
diagnose itself, and it can still maintain its functionality for a while if any kind of
failure or degradation occurs. In other words, self-maintainability would be
appended to an existing machine as an additional embedded reasoning system. The
required capabilities of a self-maintenance machine (SMM) are defined as follows
(Labib 2006):
- Monitoring capability: the SMM must be able to perform on-line condition
monitoring using sensor fusion; the sensors send raw data on the machine
condition to a processing unit.
- Fault judging capability: from the sensory data, the SMM can judge
whether the machine is in a normal or an abnormal state; from this judgement,
the current condition and the remaining time to failure of the machine can be
estimated.
prognostics were built into the network functionality. Vachtsevanos and Wang
(2001) gave an overview of different CBM algorithms and suggested a method to
compare their performance for a specific application.
Prognostic information, obtained through intelligence embedded into the
manufacturing process or equipment, can also be used to improve manufacturing
and maintenance operations in order to increase process reliability and improve
product quality. For instance, the ability to increase reliability of manufacturing
facilities using the awareness of the deterioration levels of manufacturing equipment
has been demonstrated through an example of improving robot reliability (Yamada
and Takata 2002). Moreover, a life cycle unit (LCU) (Seliger et al. 2002) was
proposed to collect usage information about key product components, enabling one
to assess product reusability and facilitating the reuse of products that have
significant remaining useful life.
In spite of the progress in CBM, many fundamental issues still remain. For
example:
form. More often, no infrastructure exists for delivering the data over a network, or
for managing and analyzing the data, even if the devices were networked.
A Watchdog Agent-based real-time remote machinery prognostics and health
management (R2M-PHM) system has recently been developed by the IMS Center.
It focuses on developing innovative prognostics algorithms and tools, as well as
remote and embedded predictive maintenance technologies, to predict and prevent
machine failures, as illustrated in Figure 3.2.
Figure 3.2. Key focus and elements of the Intelligent Maintenance Systems
The rest of the section is organized as follows. Section 3.1 deals with the
platform of Watchdog Agent-based real-time remote machinery prognostics and
health management (R2M-PHM) system. Section 3.2 presents a generic and
scalable prognostic methodology or toolbox, i.e., the Watchdog Agent toolbox;
and Section 3.3 illustrates the effectiveness and potential of this new development
using several real industry case studies.
Figure 3.3. Illustration of IMS real-time remote machinery diagnosis and prognosis system
[Figure 3.3: sensor signals (vibration, temperature, pressure, current, voltage, on/off) are acquired through I/O cards by an embedded computer running the Watchdog Agent toolbox, a database and decision support tools on an embedded operating system; a web server connects the embedded computer to client software on a remote computer.]
of memory, since all of the tools are embedded into the hardware. It has 16 high-speed
analog input channels to deal with highly dynamic signals. It also has
various peripheral interfaces, such as RS-232/485/422, parallel and USB, that can
acquire non-analog sensor signals. The prototype uses a compact flash card for
storage, so it can be placed on top of machine tools and is suitable for withstanding
vibrations in a working environment. Once a certain set of tools/algorithms is
determined for a certain industry application, commercially available hardware,
such as Advantech and National Instruments (NI) as illustrated in Figure 3.6b and
c, respectively, will be further evaluated for customized Watchdog Agent applica-
tions.
The Watchdog Agent toolbox enables one to assess and predict quantitatively
performance degradation levels of key product components, and to determine the
root causes of failure (Casoetto et al. 2003; Djurdjanovic et al. 2000; Lee 1995,
1996), thus making it possible to realize physically closed-loop product life cycle
monitoring and management. The Watchdog Agent consists of embedded
computational prognostic algorithms and a software toolbox for predicting de-
gradation of devices and systems. Degradation assessment is conducted after the
critical properties of a process or machine are identified and measured by sensors. It
is expected that the degradation process will alter the sensor readings that are being
fed into the Watchdog Agent, and thus enable it to assess and quantify the
degradation by quantitatively describing the corresponding change in sensor
signatures. In addition, a model of the process or piece of equipment that is being
considered, or available application specific knowledge can be used to aid the
degradation process description, provided that such a model and/or such knowledge
exist. The prognostic function is realized through trending and statistical modeling
of the observed process performance signatures and/or model parameters.
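The trending element of this prognostic function can be illustrated with a deliberately simple sketch: a linear trend is fitted to a health signature and extrapolated to a failure threshold. The data, the threshold and the linear-degradation assumption are all hypothetical; real signatures rarely degrade linearly:

```python
import numpy as np

def remaining_useful_life(times, health_index, threshold):
    """Fit a linear trend to an observed health signature and extrapolate
    to the failure threshold -- the simplest form of trend-based
    prognostics."""
    slope, intercept = np.polyfit(times, health_index, 1)
    if slope >= 0:
        return float("inf")       # no degradation trend detected
    t_fail = (threshold - intercept) / slope
    return max(t_fail - times[-1], 0.0)

# hypothetical degradation: health drops from 1.0 towards a 0.2 threshold
t = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
h = np.array([1.0, 0.93, 0.85, 0.78, 0.70])
rul = remaining_useful_life(t, h, threshold=0.2)
```

Statistical modeling (e.g. confidence bounds on the fitted trend) would then quantify the uncertainty of such a remaining-useful-life estimate.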
In order to facilitate the use of Watchdog Agent in a wide variety of applications
(with various requirements and limitations regarding the character of signals,
available processing power, memory and storage capabilities, limited space, power
consumption, the user's preference, etc.), the performance assessment module of the
Watchdog Agent has been realized in the form of a modular, open architecture
toolbox. The toolbox consists of different prognostics tools, including neural
network-based, time-series based, wavelet-based and hybrid joint time-frequency
methods, etc., for predicting the degradation or performance loss on devices, process,
and systems. The open architecture of the toolbox allows one easily to add new
solutions to the performance assessment modules as well as to easily interchange
different tools, depending on the application needs. To enable rapid deployment, a
quality function deployment (QFD) based selection method has been developed to
provide a general suggestion to aid in tool selection; this is especially critical for
those industry users who have little knowledge about these algorithms. The current
tools employed in the signal processing and feature extraction, performance assess-
ment, diagnostics and prognostics modules of Watchdog Agent functionality are
summarized in Figure 3.10.
Each of these modules is realized in several different ways to facilitate the use
of the Watchdog Agent in a wide variety of products and applications.
signals, which place a strong emphasis on the need for development and utilization
of non-stationary signal analysis techniques, such as wavelets, or joint time-
frequency analysis. The feature extraction module extracts features most relevant
to describing a products performance. Those features are extracted from the time
domain into which the sensory processing module transforms sensory signals,
using expert knowledge about the application, or automatic feature selection
methods such as roots of the autoregressive time-series model, or time-frequency
moments and singular value decomposition.
Currently the following signal processing and feature extraction tools are used
in the Watchdog Agent toolbox:
The Fourier transformation method has been widely used in de-noising and
feature extraction. The noise component in a signal can be distinguished after
the signal is transformed, and feature components can be identified after the
removal of noise. However, the Fourier transformation is applicable only to
stationary signals, whose frequency-band energies are characterized by
time-invariant frequency content.
The autoregressive modeling method calculates frequency peak locations
and intensities using autoregressive oscillation modes of sensor readings,
and bears significant information about the process (mechanical systems
are usually well described by their modes of oscillation).
The wavelet/wavelet packet decomposition method enables the rapid
calculation of non-stationary signal energy distributions, at the expense of
losing some desirable mathematical properties.
The time-frequency analysis method provides both temporal and spectral
information with good resolution, and is applicable to highly non-stationary
signals (e.g. impacts or transient behaviors). However, it is not applicable if
a large amount of data has to be considered and calculation speed is a
concern.
The application specific features extraction method is applicable in cases
when one can directly extract performance-relevant features out of the
time-series of sensor readings.
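As a minimal example of the first tool in the list above, the following sketch computes frequency-band energies as a feature vector via the FFT. It assumes NumPy is available; the synthetic vibration signal and the band edges are made up for illustration:

```python
import numpy as np

def band_energies(signal, fs, bands):
    """Energy of `signal` (sampled at `fs` Hz) in each (lo, hi) frequency
    band -- a common FFT-based feature vector for stationary signals."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return [spectrum[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands]

# hypothetical vibration signal: 50 Hz component plus weak noise
fs = 1000.0
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 50 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
features = band_energies(x, fs, bands=[(0, 100), (100, 300), (300, 500)])
```

For this stationary test signal almost all the energy falls in the lowest band; for the non-stationary signals discussed above, the wavelet or time-frequency tools would be used instead.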
expert knowledge exists, simple but rapid performance assessment based on the
feature-level fused multi-sensor information can be made using the relative number
of activated cells in the neural network, or by using the logistic regression
approach. For products with open-control architecture, the match between the
current and nominal control inputs and the performance criteria can also be utilized
to assess the product's performance. For more sophisticated applications with
intricate and complicated signals and performance signatures, statistical pattern
recognition methods, or the feature map based approach can be employed.
The following performance assessment tools are currently being used in the
Watchdog Agent toolbox:
The logistic regression method allows one to predict a discrete outcome,
such as group membership, from a set of variables that may be continuous,
discrete, dichotomous, or a mix of any of these. It can quantitatively
represent the proximity of current operating conditions to the region of
desirable or undesirable behavior. However, it is applicable only when a good
feature-domain description of unacceptable behavior is available.
The feature map method assesses the overlap between the normal and most
recent process behavior, and is applicable in cases when the Gaussianness
of extracted features cannot be guaranteed.
The statistical pattern recognition method calculates the overlap of feature
distributions under the assumption that the features are Gaussian, and is
applicable to repeatable and stable processes. However, it is not applicable
to highly dynamic systems in which the feature distribution cannot be
approximated as Gaussian.
The hidden Markov model method is applicable to highly dynamic
phenomena, where a sequence of process observations, rather than a single
observation, is needed to adequately describe the behavior of process
signatures.
The particle filter performance assessment method can quantitatively describe
process performance, and is applicable to complex systems that
display multiple regimes of operation (both normal and faulty). In such cases a
hybrid description of the system is needed, incorporating both discrete and
continuous states.
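As a hedged sketch of the logistic-regression tool above: features with known acceptable/unacceptable labels train a model whose output serves as a confidence value for the current condition. The toy data, names, and training scheme are illustrative only.

```python
import numpy as np

def train_logistic(X, y, lr=0.5, epochs=5000):
    """Fit w, b of p(acceptable | x) = sigmoid(w.x + b) by gradient descent
    on the cross-entropy loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y                        # derivative of the loss w.r.t. w.x + b
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def confidence_value(x, w, b):
    """Proximity of the current feature vector to the acceptable region, in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# Toy data: one feature (say, vibration RMS); low values labeled acceptable (1)
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(1.0, 0.2, 50), rng.normal(3.0, 0.2, 50)])[:, None]
y = np.concatenate([np.ones(50), np.zeros(50)])
w, b = train_logistic(X, y)
healthy = confidence_value(np.array([1.0]), w, b)
degraded = confidence_value(np.array([3.0]), w, b)
```

A healthy reading maps to a confidence near 1 and a degraded reading near 0, giving the quantitative proximity measure described above.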
failure modes occur, performance signatures related to each specific failure mode
can be collected and used to teach the Watchdog Agent to recognize and diagnose
those failure modes in the future. Thus, the Watchdog Agent is envisioned as an
intelligent device that utilizes its experience and human supervisory inputs over
time to build its own expandable and adjustable world model.
Performance assessment, prediction and prognostics can be enhanced through
feature-level or decision-level sensor fusion, as defined by Hall and Llinas (2000)
(Chapter 2). Feature-level sensor fusion is accomplished through concatenation of
features extracted from different sensors, and the joint consideration of the
concatenated feature vector in the performance assessment and prediction modules.
Decision-level sensor fusion is based on separately assessing and predicting
process performance from individual sensor readings and then merging these
individual sensor inferences into a multi-sensor assessment and prediction through
some averaging technique.
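The two fusion schemes can be sketched in a few lines; the function names and example weights are illustrative only.

```python
import numpy as np

def feature_level_fusion(feature_vectors):
    """Concatenate per-sensor feature vectors into one joint vector that a
    single performance-assessment module then evaluates."""
    return np.concatenate(feature_vectors)

def decision_level_fusion(confidences, weights=None):
    """Merge per-sensor confidence values in [0, 1] by a weighted average;
    weights could reflect per-sensor reliability."""
    return float(np.average(confidences, weights=weights))

vib_features = np.array([0.8, 1.2])   # e.g. band energies from a vibration sensor
temp_features = np.array([55.0])      # e.g. a temperature reading
fused = feature_level_fusion([vib_features, temp_features])

# Vibration-based assessment trusted twice as much as the temperature-based one
overall = decision_level_fusion([0.9, 0.7], weights=[2, 1])
print(round(overall, 3))  # 0.833
```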
In summary, the following performance forecasting tools are currently used in
the Watchdog Agent:
The autoregressive moving average (ARMA) method is applicable to linear
time-invariant systems whose performance features display stationary
behavior. ARMA utilizes a small amount of historical data and can provide
good short-term predictions.
The compound match matrix/ARMA prediction method is applicable to
cases where abundant records of multiple maintenance cycles exist for non-
linear processes. It excels at dealing with high-dimensional data and can
provide good long-term predictions by converting vector-based feature
prediction to scalar-based prediction.
The fuzzy logic prediction method is applicable to complex systems whose
behavior is unknown and for which no model, function or numerical technique
to describe the system is readily available. It utilizes linguistic vagueness
and allows some imprecision in formulating approximations.
Fuzzy logic can give fast approximate solutions.
The Elman recurrent neural network (ERNN) prediction method is
applicable to non-linear systems and can give long-term predictions when given
a large amount of training data. However, no standard methodology exists
for determining the ERNN structure, and trial and error is usually used in the
modeling process.
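As a minimal illustration of the ARMA idea (moving-average terms omitted), an AR(p) model can be fitted by least squares and iterated for short-term forecasts. The function names and the synthetic degradation trend are illustrative.

```python
import numpy as np

def fit_ar(history, p=3):
    """Least-squares fit of AR(p) coefficients on a short history."""
    X = np.column_stack([history[p - k - 1 : len(history) - k - 1]
                         for k in range(p)])
    y = history[p:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def forecast(history, coeffs, steps):
    """Iterate one-step-ahead predictions to produce a multi-step forecast."""
    buf = list(history)
    for _ in range(steps):
        lags = buf[-1 : -len(coeffs) - 1 : -1]   # most recent value first
        buf.append(float(np.dot(coeffs, lags)))
    return buf[len(history):]

# Example: a slowly degrading performance feature (linear trend)
t = np.arange(60)
feature = 1.0 - 0.01 * t
coeffs = fit_ar(feature, p=3)
pred = forecast(feature, coeffs, steps=5)   # ≈ [0.40, 0.39, 0.38, 0.37, 0.36]
```

On trend-dominated features such as this one, the fitted recursion extrapolates the trend, which is why the tool is described as good for short-term prediction.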
New tools will be continuously developed and added to the modular, open-
architecture Watchdog Agent toolbox, based on the development procedure
shown in Figure 3.12.
New Technologies for Maintenance 69
[Figure 3.12: Watchdog Agent tool development procedure — program development, prototyping and testing, then evaluation and acceptance, with rejected tools looping back, followed by deployment.]
Several Watchdog Agent tools for on-line performance assessment and prediction
have already been implemented as stand-alone applications in a number of
industrial and service facilities. Listed below are several examples to illustrate the
developed tools.
Figure 3.15. The bearing test rig sponsored by Rexnord Technical Service
Figure 3.16 presents the vibration waveform collected from bearing 4 at the last
stage of the bearing test. The signal exhibits strong impulse periodicity because of
the impacts generated by a mature outer-race defect. However, in the
historical data recorded three days before the bearing failed, there is no sign of
periodic impulses, as shown in Figure 3.17a. The periodic impulse feature is
completely masked by the noise.
An adaptive wavelet filter is designed to de-noise the raw signal and enhance
degradation detection. The filter is obtained in two steps. First, the
optimal wavelet shape factor is found by the minimal entropy method. Then an
optimal scale is identified by maximizing the signal periodicity. By applying the
designed wavelet filter to the noisy raw signal, the de-noised signal shown in
Figure 3.17b is obtained. The periodic impulse feature can then be clearly
observed, which serves as strong evidence of bearing outer-race degradation. The
wavelet filter-based de-noising method successfully enhanced the signal feature
and provided potent evidence for prognostic decision-making.
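The cited adaptive filter design is not reproduced here, but the scale-selection idea can be sketched roughly: band-pass the signal with a Morlet-type wavelet at several candidate scales and keep the scale whose output is most periodic. The entropy-based shape-factor search is omitted, and all names are illustrative.

```python
import numpy as np

def morlet(scale, width=4.0):
    """A real Morlet-type wavelet kernel at the given scale."""
    n = int(10 * scale)
    t = np.arange(-n // 2, n // 2) / scale
    return np.exp(-t ** 2 / 2) * np.cos(width * t)

def periodicity(x):
    """Largest normalized autocorrelation value at positive lags."""
    ac = np.correlate(x, x, mode="full")[len(x):]
    return ac.max() / (np.dot(x, x) + 1e-12)

def adaptive_wavelet_denoise(signal, scales):
    """Keep the wavelet scale whose band-passed output is most periodic."""
    best = max(scales,
               key=lambda s: periodicity(np.convolve(signal, morlet(s), "same")))
    return np.convolve(signal, morlet(best), "same"), best

# Synthetic test signal: periodic impulses (every 100 samples) buried in noise
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 2000)
x[::100] += 5.0
denoised, scale = adaptive_wavelet_denoise(x, scales=[2, 4, 8, 16])
```

The periodicity criterion plays the role of the scale-selection step in the text; a production filter would tune the wavelet shape factor as well.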
72 J. Lee and H. Wang
3.3.3.3 Example 3: Bearing Risk of Failure and Remaining Useful Life Prediction
An important issue in prognostic technology is the estimation of the risk of failure,
and of the remaining useful life of a component, given the component's age and its
past and current operating condition. In numerous cases, failures were attributed to
many correlated degradation processes, which could be reflected by multiple
degradation features extracted from sensor signals. These features carry the major
information regarding the health of the component under monitoring; however, the
failure boundary is hard to define using these features. In reality, the same feature
vector could be attributed to totally different combinations of the underlying
degradation processes and their severity levels. There is only a probabilistic
relationship between the component failure and the certain level of degradation
features. A typical example can be found in bearing operation. Two bearings
of the same type could fail at different levels of RMS and kurtosis of the vibration
signal. To capture the probabilistic relationship between the multiple degradation
features and the component failure as well as to predict the risk of failure and the
remaining useful life, IMS has developed a Proportional Hazards (PH) approach
(Liao et al. 2005) based on the PH model proposed by Cox (1972). The PH model
involving multiple degradation features is given as

λ(t; Z) = λ₀(t) exp(γ′Z)  (3.1)

where λ(t; Z) is the hazard rate of the component given the current age t and the
degradation feature vector Z; λ₀(t) is the baseline hazard rate function; and γ
is the model parameter vector. This formulation relates the working age and
multiple degradation features to the hazard rate of the component. To estimate the
parameters, the maximum likelihood approach could be utilized using offline data,
including the degradation features over time of many components and their failure
times. Afterwards, the established model can be used for predicting the risk of
failure for the component by plugging in the working age and the degradation
features extracted from the on-line sensor signals. In addition, the remaining useful
life L(t_current), given the current working age and the history of degradation features,
can be estimated as

L(t_current) = ∫_{t_current}^{∞} exp( −∫_{t_current}^{τ} λ(v; z(v)) dv ) dτ  (3.2)
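The remaining-useful-life integral of Equation 3.2 can be evaluated numerically once the model is fitted. The sketch below assumes, purely for illustration, a Weibull baseline hazard and a constant degradation feature vector; in practice the parameters would come from maximum likelihood estimation on run-to-failure data.

```python
import numpy as np

def hazard(t, z, beta=2.0, eta=10.0, gamma=np.array([0.5])):
    """Cox PH hazard with an assumed Weibull baseline (illustrative parameters)."""
    baseline = (beta / eta) * (t / eta) ** (beta - 1.0)
    return baseline * np.exp(gamma @ z)

def remaining_useful_life(t_current, z, horizon=200.0, dt=0.01):
    """Evaluate the double integral of Equation 3.2 by Riemann sums."""
    tau = np.arange(t_current, t_current + horizon, dt)
    lam = hazard(tau, z)
    cum = np.cumsum(lam) * dt           # inner integral from t_current to tau
    survival = np.exp(-cum)             # conditional survival given age t_current
    return float(survival.sum() * dt)   # outer integral

z = np.array([1.0])                     # current degradation feature (held constant)
rul = remaining_useful_life(5.0, z)
print(round(rul, 2))                    # ≈ 3.8 under these assumed parameters
```

The finite horizon stands in for the infinite upper limit; it only needs to be long enough that the conditional survival has decayed to essentially zero.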
Table 3.1. Estimates of expected remaining useful life, Test 1, Bearing 3 (unit: day)

Time                                       26        29        31
Estimated expected remaining useful life   3.5549    3.3965    1.5295
True remaining useful life                 6.5278    3.5278    1.5278
Error                                      2.9729    0.1313    0.0017
has been developed, which serves as a baseline system for researchers and
companies to develop next-generation e-maintenance systems. It enables machine
makers and users to predict machine health degradation conditions, diagnose fault
sources, and suggest maintenance decisions before a fault actually occurs. The
Watchdog Agent-based R2M-PHM platform expands the OSA-CBM architecture
topology by including real-time remote machinery diagnosis and prognosis
systems and embedded Watchdog Agent technology. The Watchdog Agent is an
embedded algorithm toolbox which converts multi-sensory data to machine health
information. Innovative sensory processing and autonomous feature extraction
methods are developed to facilitate the plug-and-play approach in which the
Watchdog Agent can be set up and run without any need for expert knowledge or
intervention.
Future work will focus on the further development of the Watchdog Agent-based
IMS platform. Smart software and NetWare will be further developed for proactive
maintenance capabilities such as performance degradation measurement, fault
recovery, self-maintenance and remote diagnostics. For the embedded Watchdog
Agent application, we need to harvest the developed technologies and tools and to
accelerate their deployment in real-world applications through close collaboration
between industrial and academic researchers. Specifically, future work will include
the following aspects: (i) evaluate the existing Watchdog Agent tools and identify
the application needs from the smart machine testbed; (ii) develop a configurable
prognostics tools platform for rotary machinery elements such as bearings, motors,
and gears, so that several of the most frequently used prognostics tools can be pre-
tested and deposited into a ready-to-use tool library; (iii) develop a user interface
system for tool selection, which allows users to apply the right tools effectively to
the right applications and achieve "first tool correct" accuracy; (iv) validate the
reconfiguration of these tools to a variety of similar applications (to be defined by
the company participants); and (v) explore research in a peer-to-peer (P2P)
paradigm in which Watchdog Agents embedded on identical products operating
under similar conditions could exchange information and thus assist each other in
machine health diagnosis and prognosis.
To predict, prioritize, and plan precision maintenance actions to achieve an
"every action correct" objective, the IMS Center is creating advanced maintenance
simulation software for maintenance schedule planning and service logistics cost
optimization for transparent decision making. At the same time, the Center is
exploring the integration of decision support tools and optimization techniques for
proactive maintenance; this integration will facilitate the functionalities of the
Watchdog Agent-based R2M-PHM, in which an intelligent maintenance system
can operate as a near-zero-downtime, self-sustaining and self-aware artificially
intelligent system that learns from its own operation and experience.
Embedding is crucial for creating an enabling technology that can facilitate
proactive maintenance and life cycle assessment for mobile systems, transportation
devices and other products for which cost-effective realization of predictive
performance assessment capabilities cannot be implemented on general-purpose
personal computers. The main research challenge will be to accomplish sophisticated
performance evaluation and prediction capabilities under the severe power
consumption, processing power and data storage limitations imposed by embedding.
The Center
will develop a wireless sensor network made of self-powered wireless motes for
machine health monitoring and embedded prognostics. These networked smart motes
can be easily installed in products and machines with ad hoc communications. In
addition, the Center is investigating the feasibility of harvesting energy from
vibration in environments equipped with wireless motes for remote monitoring of
equipment and machinery. In conjunction with that investigation, the Center is
looking at ways of developing communication protocols that require less energy for
communication. Power converter circuitry has been designed by using vibration
signals in order to convert vibration energy into useful electric energy. These
technologies are critical for monitoring equipment or systems in complex
environments where the availability of power is the major constraint.
In the area of collaborative product life cycle design and management, the
Watchdog Agent can serve as an infotronics agent to store product usage and end-
of-life (EOL) service data and to send feedback to designers and life cycle
management systems. Currently, an international intelligent manufacturing systems
consortium on product embedded information systems for service and EOL has
been proposed. The goal is to integrate Watchdog Agent capabilities into products
and systems for closed-loop design and life cycle management, as illustrated in
Figure 3.19.
The Center will continue advancing its research to develop technologies and tools
for closed-loop life cycle design for product reliability and serviceability, as well as
explore research in new frontier areas such as embedded and networked agents for
self-maintenance and self-healing, and self-recovery of products and systems. These
new frontier efforts will lead to a fundamental understanding of reconfigurability and
allow the closed-loop design of autonomously reconfigurable engineered systems
that integrate physical, information, and knowledge domains. These autonomously
reconfigurable engineered systems will be able to sense, perform self-prognosis, self-
[Figure 3.19: Closed-loop life cycle design for near-zero downtime — design for reliability and serviceability feeds the product or system in use; health-monitoring sensors and embedded intelligence (the Watchdog Agent) track product degradation; web-enabled monitoring and prognostics support service and feed information back to the Center for product redesign.]
3.5 References
Badia, F.G., Berrade, M.D. and Campos, C.A., (2002) Optimal Inspection and Preventive
Maintenance of Units with Revealed and Unrevealed Failures. Reliability Engineering
and System Safety 78: 157–163.
Barbera, F., Schneider, H. and Kelle, P., (1996) A Condition Based Maintenance Model
with Exponential Failures and Fixed Inspection Interval. Journal of the Operational
Research Society 47(8): 1037–1045.
Bonissone, G., (1995) Soft computing applications in equipment maintenance and service,
in: ISIE '95, Proceedings of the IEEE International Symposium, 2: 10–14.
Brotherton, T., Jahns, G., Jacobs, J. and Wroblewski, D., (2000) Prognosis of faults in gas
turbine engines, in: Aerospace Conference Proceedings, 2000, IEEE, 6: 18–25.
Bruns, P., (2002) Optimal Maintenance Strategies for Systems with Partial Repair Options
and without Assuming Bounded Costs. European Journal of Operational Research 139:
146–165.
Bunday, B.D., (1991) Statistical Methods in Reliability Theory and Practice, Ellis Horwood.
Burrus, C., Gopinath, R. and Haitao, G., (1998) Introduction to Wavelets and Wavelet
Transforms: A Primer. NJ: Prentice Hall.
Casoetto, N., Djurdjanovic, D., Mayor, R., Lee, J. and Ni, J., (2003) Multisensor process
performance assessment through the use of autoregressive modeling and feature maps.
Trans. of SME/NAMRI, 31: 483–490.
Cavory, G., Dupas, R. and Goncalves, R., (2001) A Genetic Approach to the Scheduling of
Preventive Maintenance Tasks on a Single Product Manufacturing Production Line.
International Journal of Production Economics, 74: 135–146.
Chen, C.T., Chen, Y.W. and Yuan, J., (2003) On a Dynamic Preventive Maintenance Policy
for a System under Inspection. Reliability Engineering and System Safety 80: 41–47.
Chen, D. and Trivedi, K., (2002) Closed-Form Analytical Results for Condition-Based
Maintenance. Reliability Engineering and System Safety 76: 43–51.
Cohen, L., (1995) Time-Frequency Analysis. NJ: Prentice Hall.
Cox, D., (1972) Regression models and life tables (with discussion). Journal of the Royal
Statistical Society, Series B 34: 187–220.
Djurdjanovic, D., Widmalm, S.E., William, W.J., et al., (2000) Computerized classification
of temporomandibular joint sounds. IEEE Transactions on Biomedical Engineering
47: 977–984.
Djurdjanovic, D., Ni, J. and Lee, J., (2002) Time-frequency based sensor fusion in the
assessment and monitoring of machine performance degradation. Proceedings of 2002
ASME Int. Mechanical Eng. Congress and Exposition, paper number IMECE2002-32032.
Garga, A., McClintic, K.T., Campbell, R.L., et al., (2001) Hybrid reasoning for prognostic
learning in CBM systems, in: Aerospace Conference, 10–17 March 2001, IEEE
Proceedings, 6: 2957–2969.
Goodenow, T., Hardman, W., Karchnak, M., (2000) Acoustic emissions in broadband
vibration as an indicator of bearing stress. Proceedings of IEEE Aerospace Conference,
2000; 6: 95–122.
Hall, L.D. and Llinas, J., (Eds.), (2000) Handbook of Sensor Fusion, CRC Press.
Hall, L.D., (1992) Mathematical Techniques in Multi-Sensor Data Fusion, Artech House Inc.
Hansen, R., Hall, D., Kurtz, S., (1994) New approach to the challenge of machinery
prognostics. Proceedings of the International Gas Turbine and Aeroengine Congress and
Exposition, American Society of Mechanical Engineers, June 13–16, 1994: 1–8.
IMS, NSF I/UCRC Center for Intelligent Maintenance Systems, www.imscenter.net; 2004.
Kemerait, R., (1987) New cepstral approach for prognostic maintenance of cyclic
machinery. IEEE SOUTHEASTCON, 1987: 256–262.
Kleinbaum, D., (1994) Logistic Regression. New York: Springer-Verlag.
Labib, A.W., (2006) Next generation maintenance systems: Towards the design of a self-
maintenance machine. 2006 IEEE International Conference on Industrial Informatics,
Integrating Manufacturing and Services Systems, 16–18 August, Singapore.
Lee, J., (1995) Machine performance monitoring and proactive maintenance in computer-
integrated manufacturing: review and perspective. International Journal of Computer
Integrated Manufacturing 8: 370–380.
Lee, J., (1996) Measurement of machine performance degradation using a neural network
model. Computers in Industry 30: 193–209.
Lee, J., Ni, J., (2002) Infotronics agent for tether-free prognostics. Proceeding of AAAI
Spring Symposium on Information Refinement and Revision for Decision Making:
Modeling for Diagnostics, Prognostics, and Prediction. Stanford Univ., Palo Alto, CA,
March 25–27.
Liang, E., Rodriguez, R., Husseiny, A., (1988) Prognostics/diagnostics of mechanical
equipment by neural network. Neural Networks 1(1): 33.
Liao, H., Lin, D., Qiu, H., Banjevic, D., Jardine, A., Lee, J., (2005) A predictive tool for
remaining useful life estimation of rotating machinery components. ASME International
20th Biennial Conference on Mechanical Vibration and Noise, Long Beach, CA, 24–28
September, 2005.
Liu, J., Djurdjanovic, D., Ni, J., Lee, J., (2004) Performance similarity based method for
enhanced prediction of manufacturing process performance. Proceedings of the 2004
ASME International Mechanical Engineering Congress and Exposition (IMECE), 2004.
4.1 Introduction
Reliability centred maintenance (RCM) is a method for maintenance planning that
was developed within the aircraft industry and later adapted to several other
industries and military branches. A large number of standards and guidelines have
been issued in which the RCM methodology is tailored to different application areas,
e.g., IEC 60300-3-11, MIL-STD-2173, NAVAIR 00-25-403 (NAVAIR 2005), SAE
JA 1012 (SAE 2002), USACERL TR 99/41 (USACERL 1999), ABS (2003, 2004),
NASA (2000) and DEF-STD 02-45 (DEF 2000). On a generic level, IEC 60300-3-11
(IEC 1999) defines RCM as a systematic approach for identifying effective and
efficient preventive maintenance tasks for items in accordance with a specific set of
procedures and for establishing intervals between maintenance tasks. A major
advantage of the RCM analysis process is its structured and traceable approach to
determining the optimal type of preventive maintenance (PM). This is achieved through a
detailed analysis of failure modes and failure causes. Although the main objective of
RCM is to determine preventive maintenance tasks, the results from the analysis may
also be used in relation to corrective maintenance strategies, spare part optimization,
and logistics considerations. In addition, RCM has an important role in overall
system safety management.
An RCM analysis process, when properly conducted, should answer the
following seven questions:
1. What are the system functions and the associated performance standards?
2. How can the system fail to fulfil these functions?
3. What can cause a functional failure?
4. What happens when a failure occurs?
5. What might the consequence be when the failure occurs?
6. What can be done to detect and prevent the failure?
7. What should be done when a suitable preventive task cannot be found?
80 M. Rausand and J. Vatn
1. Study preparation
2. System selection and definition
3. Functional failure analysis (FFA)
4. Critical item selection
5. Data collection and analysis
6. Failure modes, effects, and criticality analysis (FMECA)
7. Selection of maintenance actions
8. Determination of maintenance intervals
9. Preventive maintenance comparison analysis
10. Treatment of non-critical items
11. Implementation
12. In-service data collection and updating
The rest of the chapter is structured as follows: In Section 4.2 we describe and
discuss the 12 steps of the RCM process. The concepts of generic and local RCM
analysis are introduced in Section 4.3. These concepts have been used in a novel
RCM approach to improve and speed up the analyses in a railway application.
Models and methods for optimization of maintenance intervals are discussed in
Section 4.4. Some main features of a new computer tool, OptiRCM, are briefly
introduced. Concluding remarks are given in Section 4.5. The RCM analysis
approach that is described in this chapter is mainly in accordance with accepted
standards, but also contains some novel issues, especially related to steps 6 and 8
and the approach chosen in OptiRCM. The RCM approach is illustrated with
examples from railway applications. Simple examples from the offshore oil and
gas industry are also mentioned.
Before the actual RCM analysis process is initiated, an RCM project group must be
established. The group should include at least one person from the maintenance
function and one from the operations function, in addition to an RCM specialist.
In Step 1 the RCM project group should define and clarify the objectives and
the scope of the analysis. Requirements, policies, and acceptance criteria with
Reliability Centred Maintenance 81
The system level is usually recommended as the starting point for the RCM
process. This is further discussed and justified, e.g., by Smith (1993) and in MIL-
STD 2173 (MIL-STD 1986). This means that on an offshore oil/gas platform the
starting point of the analysis should be the compression system, the water injection
system or the fire water system, and not the whole platform. In the railway
application, the systems were defined above as the next-highest level in the plant hierarchy.
The systems may be further broken down into subsystems, and sub-subsystems,
and so on. For the purpose of the RCM analysis process the lowest level of the
hierarchy should be what we will call an RCM analysis item.
RCM analysis item: A grouping or collection of components, which together
form some identifiable package that will perform at least one significant function
as a stand-alone item (e.g., pumps, valves, and electric motors). For brevity, an
RCM analysis item will in the following be called an analysis item. By this
definition, a shutdown valve, e.g., is classified as an analysis item, while the valve
actuator is not. The actuator is supporting equipment to the shutdown valve, and
only has a function as a part of the valve. The importance of distinguishing the
analysis items from their supporting equipment is clearly seen in the FMECA in
Step 6. If an analysis item is found to have no significant failure modes, then none
of the failure modes or causes of the supporting equipment are important, and
therefore do not need to be addressed. Similarly, if an analysis item has only one
significant failure mode, then the supporting equipment only needs to be analyzed
to determine if there are failure causes that can affect that particular failure mode
(Paglia et al. 1991). Therefore, only the failure modes and effects of the analysis
items need to be analyzed in the FMECA in Step 6. An analysis item is usually
repairable, meaning that it can be repaired without replacing the whole item. In the
offshore reliability database OREDA (2002) the analysis item is called an
equipment unit. The various analysis items of a system may be at different levels
of assembly. On an offshore platform, for example, a huge pump may be defined
as an analysis item in the same way as a small gas detector. If we have redundant
items, e.g., two parallel pumps; each of them should be classified as analysis items.
When in Step 6 we identify causes of analysis item failures, we often find it
suitable to attribute these failure causes to failures of items at an even lower level of
indenture. The lowest level is usually referred to as components.
Component: The lowest level at which equipment can be disassembled without
damage or destruction to the items involved. Smith (2005) refers to this lowest level
as least replaceable assembly, while OREDA (2002) uses the term maintainable
item.
It is very important that the analysis items are selected and defined in a clear
and unambiguous way in this initial phase of the RCM analysis process, since the
following analysis will be based on these analysis items. If the OREDA database is
to be used in later phases of the RCM process, it is recommended as far as possible
to define the analysis items in compliance with the equipment units in OREDA.
1. Essential functions: These are the functions required to fulfil the intended
purpose of the item. The essential functions are simply the reasons for
installing the item. Often an essential function is reflected in the name of the
item. An essential function of a pump is, e.g., to pump a fluid.
2. Auxiliary functions: These are the functions that are required to support the
essential functions. The auxiliary functions are usually less obvious than the
essential functions, but may in many cases be as important as the essential
functions. Failure of an auxiliary function may in many cases be more
critical than a failure of an essential function. An auxiliary function of a
pump is, e.g., to contain fluid.
3. Protective functions: The functions intended to protect people, equipment,
and the environment from damage and injury. The protective functions may
be classified according to what they protect, as: (i) safety functions, (ii)
environment functions, and (iii) hygiene functions. An example of a pro-
tective function is the protection provided by a rupture disk on a pressure
vessel.
4. Information functions: These functions comprise condition monitoring,
various gauges and alarms, and so on.
5. Interface functions: These functions apply to the interfaces between the item
in question and other items. The interfaces may be active or passive. A passive
interface is, e.g., present when an item is a support or a base for another item.
6. Superfluous functions: According to Moubray (1997), "items or components
are sometimes encountered which are completely superfluous. This usually
happens when equipment has been modified frequently over a period of years,
or when new equipment has been over-specified." Superfluous functions are
sometimes present when the item has been designed for an operational context
that is different from the actual operational context. In some cases failures of a
superfluous function may cause failure of other functions.
For analysis purposes the various functions of an item may also be classified as:
On-line functions: These are functions operated either continuously or so
often that the user has current knowledge about their state. The termination
of an on-line function is called an evident (or detectable) failure. In relation
to safety instrumented systems, on-line functions correspond to high
demand systems; see IEC 61508 (IEC 1997).
Off-line functions: These are functions that are used intermittently or so
infrequently that their availability is not known by the user without some
special check or test. The protective functions are very often off-line
functions. An example of an off-line function is the essential function of an
emergency shutdown (ESD) system on an oil platform. The termination of
an off-line function is called a hidden (or undetectable) failure. In the IEC
61508 setting, off-line functions correspond to low demand systems.
Note that this classification of functions should only be used as a checklist to
ensure that all relevant functions are revealed. Discussions about whether to
classify a function as, e.g., essential or auxiliary should be avoided.
The item may in general have several operational modes (e.g., running and
standby), and several functions related to each operational mode.
The term functional failure is mainly used in the RCM literature, and has the
same meaning as the more common term failure mode. In RCM we talk about
functional failures on equipment level, and use the term failure mode related to the
parts of the equipment. The failure modes will therefore be causes of a functional
failure. It is important to realize that a functional failure (and a failure mode) is a
manifestation of the failure as seen from the outside, i.e., a deviation from
performance standards.
Functional failures and failure modes may be classified in three main groups
related to the function of the item:
Total loss of function: In this case the function is not achieved at all, or the
quality of the function is far beyond what is considered as acceptable.
Partial loss of function: This group may be very wide, and may range from
the nuisance category almost to the total loss of function.
Erroneous function: This means that the item performs an action that was
not intended, often the opposite of the intended function.
A variety of classification schemes for functional failures (failure modes) have
been published. Some of these schemes, e.g., Blache and Shrivastava (1994), may
be used in combination with the function classification scheme in Step 3(ii) to
ensure that all relevant functional failures are identified.
The system functional failures may be recorded on a specially designed FFA-
worksheet that is rather similar to a standard FMECA worksheet. An example of an
FFA-worksheet is presented in Figure 4.2.
In the first column of Figure 4.2 the various operational modes of the system
are recorded. For each operational mode, all the relevant functions of the system
are recorded in column 2.
Reliability Centred Maintenance 87
The performance requirements for the functions, like target values and acceptable
deviations, are listed in column 3. For each function (in column 2) all the relevant
functional failures are listed in column 4. In column 5 the frequency/probability of
the functional failure is listed. A criticality ranking of each functional failure in that
particular operational mode is given in column 6. The reason for including
the criticality ranking is to be able to limit the extent of the further analysis by
disregarding insignificant functional failures. For complex systems such a screening
is often very important in order not to waste time and money.
The criticality ranking depends on both the frequency/probability of the
occurrence of the functional failure, and the severity of the failure. The severity must
be judged at plant level.
The severity ranking should be given in the four consequence classes: (S) safety
of personnel, (E) environmental impact, (A) production availability, and (C) eco-
nomic losses. For each of these consequence classes the severity should be ranked as
for example (H) high, (M) medium, or (L) low. How we should define the border-
lines between these classes will depend on the specific application.
If at least one of the four entries is (M) medium or (H) high, the severity of the
functional failure should be classified as significant, and the functional failure
should be subject to further analysis.
The frequency of the functional failure may also be classified in the same three
classes. (H) high may, e.g., be defined as more than once per 5 years, and (L) low
less than once per 50 years. As above, the specific borderlines will depend on the
application.
The frequency classes may be used to prioritize between the significant system
failure modes.
If all the four severity entries of a system failure mode are (L) low, and the
frequency is also (L) low, the criticality is classified as insignificant, and the
functional failure is disregarded in the further analysis. If, however, the frequency is
(M) medium or (H) high the functional failure should be included in the further
analysis even if all the severity ranks are (L) low, but with a lower priority than the
significant functional failures.
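The screening rule above can be sketched as a small function. This is a sketch only; the function name and return labels are ours, and the borderlines between the classes remain application-specific, as the text notes.

```python
def screen_functional_failure(severity, frequency):
    """Classify a functional failure for further RCM analysis.

    severity  -- dict with rank 'H', 'M', or 'L' for each of the four
                 consequence classes (S)afety, (E)nvironment,
                 (A)vailability, and e(C)onomic losses
    frequency -- 'H', 'M', or 'L'
    Returns 'significant', 'low-priority', or 'insignificant'.
    """
    if any(rank in ("H", "M") for rank in severity.values()):
        return "significant"       # at least one severity entry is M or H
    if frequency in ("H", "M"):
        return "low-priority"      # all severities L, but failure not rare
    return "insignificant"        # all L: disregard in the further analysis

# All severities low and frequency low: screened out
print(screen_functional_failure(
    {"S": "L", "E": "L", "A": "L", "C": "L"}, "L"))  # -> insignificant
```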
The FFA may be rather time-consuming because, for all functional failures, we
have to list all the maintenance significant items (MSIs) (see Step 4). The MSI lists
will hence have to be repeated several times. To reduce the workload we often
conduct a simpler FFA where for each main function we list all functional failures in
one column, and all the related MSIs in another column. This is illustrated in Figure
4.3 for a railway application.
The function name reflects the functions to be carried out on a relatively high
level in the system. In principle, we should explicitly formulate the function(s) to
be carried out. Instead we often specify the equipment class performing the
function. For example, departure light signal is specified rather than the more
correct formulation ensure correct departure light signal. We observe that the last
functional failure in Figure 4.3 is not a failure mode for the correct functional
description (Ensure correct departure light signal), but is related to another function
of the departure light signal. Thus, if we use an equipment class description
rather than an explicit functional statement, the list of failure modes should cover
all (implicit) functions of the equipment class.
At the functional failure level, it is also convenient to specify whether the
failure mode is evident or hidden; see Figure 4.3 where we have introduced an
EF/HF column.
For each function we also list the relevant items that are required to perform the
function. These items will form rows in the FMECA worksheets; see Step 5.
The objective of this step is to identify the analysis items that are potentially
critical with respect to the functional failures identified in Step 3(iii). These
analysis items are denoted functional significant items (FSI). For simple systems
the FSIs may be identified without any formal analysis. In many cases it is obvious
which analysis items have an influence on the functional failures. For complex
systems with an ample degree of redundancy or with buffers, we may need a
formal approach to identify the FSIs.
If failure rates and other necessary input data are available for the various
analysis items, it is usually a straightforward task to calculate the relative importance
of the various analysis items based on a fault tree model or a reliability block
diagram. A number of importance measures are discussed by Rausand and Høyland
(2004).
In addition to the FSIs, we should also identify items with high failure rate,
high repair costs, low maintainability, long lead-time for spare parts, or items
requiring external maintenance personnel. These analysis items are denoted
maintenance cost significant items (MCSI).
The functional significant items and the maintenance cost significant items are
jointly denoted maintenance significant items (MSI).
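The selection of MSIs as the union of the FSIs and the MCSIs might be sketched as follows. The cost-significance thresholds and the item data are illustrative assumptions, not values from the chapter.

```python
def select_msis(items):
    """Return the maintenance significant items (MSIs): the union of the
    functional significant items (FSIs) and the maintenance cost
    significant items (MCSIs)."""
    fsi = {it["name"] for it in items if it["is_fsi"]}
    mcsi = {it["name"] for it in items
            if it["failures_per_year"] > 0.5      # high failure rate (assumed threshold)
            or it["repair_cost"] > 10_000         # high repair cost (assumed threshold)
            or it["spare_lead_time_weeks"] > 12   # long lead-time for spare parts
            or it["external_personnel"]}          # requires external personnel
    return fsi | mcsi

# Illustrative items; all names and figures are made up
items = [
    {"name": "point machine", "is_fsi": True, "failures_per_year": 0.1,
     "repair_cost": 2_000, "spare_lead_time_weeks": 2, "external_personnel": False},
    {"name": "signal lamp", "is_fsi": False, "failures_per_year": 1.2,
     "repair_cost": 50, "spare_lead_time_weeks": 1, "external_personnel": False},
    {"name": "cable duct", "is_fsi": False, "failures_per_year": 0.01,
     "repair_cost": 100, "spare_lead_time_weeks": 1, "external_personnel": False},
]
print(sorted(select_msis(items)))  # -> ['point machine', 'signal lamp']
```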
In an RCM project for the Norwegian Railway Administration the use of
generic RCM analyses (see Section 4.3) made it possible to analyze all identified
MSIs. In this case this step could be omitted.
The purpose of this step is to establish a basis for both the qualitative analysis
(relevant failure modes and failure causes), and the quantitative analysis (reliability
parameters such as MTTF, PF-intervals, and so on). The data necessary for the
RCM analysis may be categorized into the following three groups:
Figure 4.3. Simplified FFA worksheet for a railway application, covering functions
such as the home signal and the departure light signal (five lamp signals, with three
main signals and two pre-signals)
During the initial phase of the RCM analysis process it often becomes evident that
the format and quality of the operational data are not sufficient to estimate the
relevant reliability parameters. Some of the main problems encountered are:
The failure data are reported at too high a level in the assembly hierarchy, i.e.,
data are not reported on the RCM analysis item level (MSI).
Failure modes and failure causes are not reported, or the recorded infor-
mation does not correspond to the definitions and code lists used in the
FMECA of Step 6.
The objective of this step is to identify the dominant failure modes of the MSIs
identified in Step 4. The information entered into the FMECA worksheet should be
sufficient both with respect to maintenance task selection in Step 7, and interval
optimization in Step 8. Our FMECA worksheet has more fields than the FMECAs
found in most RCM standards. The reason for this is that we use the FMECA as
the main database for the RCM analysis. Other RCM approaches often use a rather
simple FMECA worksheet, but then have to add an additional FMECA-like
worksheet with the data required for optimization of maintenance intervals.
TOP Events
Experience has shown that we can significantly reduce the workload of the
FMECA by introducing so-called TOP events as a basis for the analysis. The idea is
that for each failure mode in the FMECA, a so-called TOP event is specified as
consequence of the failure mode. A number of failure modes will typically lead to
the same TOP event. A consequence analysis is then carried out for each TOP event
to identify the end consequences of that particular TOP event, covering all con-
sequence classes (e.g., safety, availability/punctuality, environmental aspects). For
many plants, risk analyses (or safety cases) have been carried out as part of the
design process. These may sometimes be used as a basis for the consequence
analysis.
Figure 4.4 shows a conceptual model of this approach for a railway application
where the part to the left of the TOP event is treated in the FMECA, and the
part to the right is treated as generic, i.e., only once for each TOP event.
Figure 4.4. Conceptual risk model: an initiating event (e.g., red bulb failure) may,
if the barriers fail, develop into the TOP event (e.g., train collision), with end
consequences in the classes C1 to C6
In the rectangle (dashed line) on the left-hand side of Figure 4.4 an initiating
event and a barrier are illustrated. To analyze this rectangle we need reliabil-
ity parameters, such as MTTF, aging parameter, and PF interval, that are included
in the FMECA worksheet (e.g., see Rausand and Høyland 2004). Three situations
are considered:
Other barriers in Figure 4.4 can prevent the component failure from
developing into a critical TOP event. Track circuit detection may be a barrier
against rail breakage, because the track circuit can detect a broken rail. Typical
examples of TOP events in railway application are:
Train derailment
Collision train-train
Collision train-object
Fire
Persons injured or killed in or at the track
Persons injured or killed at level crossings
Passengers injured or killed at platforms
Several consequence-reducing barriers may also be available. Guide rails may,
e.g., be installed to mitigate the consequences in case of derailment.
In Figure 4.4 we have indicated that the outcome of the TOP event may be one
out of six (end) consequence classes:
C1: Minor injury
C2: Medical treatment
C3: Permanent injury
C4: 1 fatality
C5: 2–10 fatalities
C6: >10 fatalities
Note that the consequence reducing barriers and the end consequences are not
analyzed explicitly during the FMECA, but treated as generic for each TOP event.
In the railway situation this means only six analyses of the safety consequences
related to human injuries/fatalities.
In the following, a list of fields (columns) for the FMECA worksheets is
proposed. The structure of the FMECA is hierarchical, but the information is
usually presented in a tabular worksheet. The starting point in the FMECA is the
functional failures from the FFA in Step 3. Each maintainable item is analyzed
with respect to any impact on the various functional failures. In the following we
describe the various columns:
Failure mode (equipment class level). The first column in the FMECA
worksheet is the failure mode at the equipment class level identified in the
FFA in Step 3.
Maintenance significant item (MSI). The relevant MSIs were identified in
the FFA.
MSI function. For each MSI, the functions of the MSI related to the current
equipment class failure mode are identified.
Failure mode (MSI level). For the MSI functions we also identify the failure
modes at the MSI level.
Detection method. The detection method column describes how the MSI
failure mode may be detected, e.g., by visual inspection, condition monitor-
ing, or by the central train control system (for railway applications).
Hidden or evident. Specify whether the MSI function is hidden or evident.
Demand rate for hidden function, fD. For MSI functions that are hidden, the
rate of demand of this function should be specified.
Failure cause. For each failure mode there are one or more failure
causes. A failure mode will typically be caused by one or more component
failures at a lower level. Note that supporting equipment to the component
is considered for the first time at this step. In this context a failure cause
may therefore be a failure mode of supporting equipment.
Failure mechanism. For each failure cause, there are one or several failure
mechanisms. Examples of failure mechanisms are fatigue, corrosion, and
wear. To simplify the analysis, the columns for failure cause and failure
mechanism are often merged into one column.
Mean time to failure (MTTF). The MTTF when no maintenance is per-
formed should be specified. The MTTF is specified for one component if it
is a point object, and for a standardized distance if it is a line object
such as rails, sleepers, and so on.
TOP event safety. The TOP event in this context is the accidental event that
might be the result of the failure mode. The TOP event is chosen from a
predefined list established in the generic analysis.
Barrier against TOP event safety. This field is used to list barriers that are
designed to prevent a failure mode from resulting in the safety TOP event.
For example, brands on the signalling pole would help the locomotive
driver to recognize the signal in case of a dark lamp.
PTE-S. This field is used to assess the probability that the other barriers
against the TOP event all fail; see Figure 4.4. PTE-S should account for all the
barriers listed under Barrier against TOP event safety.
TOP event availability/punctuality. Also for this dimension a predefined list
of TOP events may be established in the generic analysis.
Barrier against TOP event availability/punctuality. This field is used to list
barriers that are designed to prevent a failure mode from resulting in an
availability/punctuality TOP event. Since the fail safe principle is fundamental
in railway operation, there are usually no barriers against the punctuality TOP
event when a component fails. An example of a barrier is a two out of three
voting system on some critical components within the system.
PTE-P. This field is used to assess the probability that the other barriers
against an availability/punctuality TOP event all fail. PTE-P should account for
all the barriers listed under Barrier against TOP event availability/
punctuality. Due to the fail safe principle, PTE-P will often be equal to one.
Other consequences. Other consequences may also be listed. Some of these
are non-quantitative like noise effects, passenger comfort, and aesthetics.
Material damage to rolling stock or components in the infrastructure may
also be listed. Material damage may be categorized in terms of monetary
value, but this is not pursued here.
Mean downtime (MDT). The MDT is the time from the occurrence of a
failure until the failure has been corrected and any traffic restrictions have
been removed.
Criticality indexes. Based on already entered information, different criticality
indexes can be calculated. These indexes are used to screen out non-
significant MSIs.
If a failure mode is considered significant with respect to safety or availability/
punctuality (or other dimensions) a preventive maintenance task should be
assigned. In order to do such an assignment, further information has to be
specified. This additional information will be completed during Steps 7 and 8. The
following fields are recommended:
Failure progression. For each failure cause the failure progression should
be described in terms of one of the following categories: (i) gradual
observable failure progression, (ii) failure progression that is non-observable
until a potential failure becomes observable (PF model), (iii) non-observable
failure progression but with aging effects, and (iv) shock type failures.
Gradual failure information. If there is a gradual failure progression,
information about what values of the measurable quantity represent a fault
state should be recorded, together with the expected time, and its standard
deviation, to reach this state.
PF-interval information. In case of observable failure progression the PF
model is often applied (e.g., see Rausand and Høyland 2004, p. 394). The
PF concept assumes that a potential failure (P) can be observed some time
before the failure (F) occurs. This time interval is denoted the PF interval
(e.g., see Rausand and Høyland 2004). We need information both on the
expected value and the standard deviation of the PF interval.
Aging parameter. For non-observable failure progression aging effects
should be described. Relevant categories are strong, moderate or low aging
effects. The aging parameter can alternatively be described by a numeric
value, i.e., the shape parameter in the Weibull distribution.
Maintenance task. The maintenance task is determined by the RCM logic
discussed in Step 7.
Maintenance interval. Often we start by describing the existing maintenance
interval, but after the formalized process of interval optimization in Step
8 we enter the optimized interval.
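Since the aging parameter field above admits either a verbal category or a numeric Weibull shape parameter, a mapping between the two could be sketched as follows. The borderline values are our assumptions; the chapter only names the categories.

```python
def aging_category(beta):
    """Map a Weibull shape parameter (aging parameter) to the verbal
    aging categories used in the FMECA worksheet.
    The borderlines below are illustrative assumptions."""
    if beta <= 1.0:
        return "no aging"   # constant or decreasing failure rate
    if beta < 1.5:
        return "low"
    if beta < 3.0:
        return "moderate"
    return "strong"
```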
An example of an FMECA worksheet is shown in Table 4.1 for a departure
light signal.
This step is the most novel compared to other maintenance planning techniques. A
decision logic is used to guide the analyst through a question-and-answer process.
The input to the RCM decision logic is the dominant failure modes from the
FMECA in Step 6. The main idea is, for each dominant failure mode, to decide
whether a preventive maintenance task is suitable, or whether it is best to let the item
deliberately run to failure and afterwards carry out a corrective maintenance task.
There are generally three reasons for doing a preventive maintenance task:
Prevent a failure
Detect the onset of a failure
Reveal a hidden failure
Only the dominant failure modes are subjected to preventive maintenance. To
obtain appropriate maintenance tasks, the failure causes or failure mechanisms
should be considered.
Table 4.1. Extract of an FMECA worksheet for a departure light signal

MSI  | Function           | Failure mode              | Failure cause      | TOP event (safety)    | Safety barriers                        | PTE-S     | TOP event (punctuality)
Lamp | Give light         | No light                  | Burnt-out filament | Train-train collision | Directional block, ATP, TCC, Black=red | 3 x 10^-4 | Manual train operation
Lens | Protect lamp       | Broken lens               | Rock fall          | Train-train collision | Directional block, ATP, TCC, Black=red | 2 x 10^-5 | None
Lens | Slip-through light | No light slipping through | Fouling            | Train-train collision | Directional block, ATP, TCC, Black=red | 2 x 10^-4 | None
The failure mechanisms behind each of the dominant failure modes should be
entered into the RCM decision logic to decide which of the following basic
maintenance tasks is most applicable:
Scheduled overhaul (SOH) is scheduled rework of an item at or before some
specified age limit. A scheduled overhaul task is applicable only under the
following circumstances:
1. There must be an identifiable age at which the item shows a rapid increase
in the item's failure rate function.
2. A large proportion of the units must survive to that age.
3. It must be possible to restore the original failure resistance of the item by
reworking it.
Scheduled replacement (SRP) is scheduled discard of an item (or one of its parts)
at or before some specified age limit. A scheduled replacement task is applicable
only under the following circumstances:
1. The item must be subject to a functional failure that is not evident to the
operating crew during the performance of normal duties.
2. The item must be one for which no other type of task is applicable and
effective.
Run to failure (RTF) is a deliberate decision to run to failure because the other
tasks are not possible or the economics are less favourable.
The RCM decision logic (figure rendered as text):
Does a failure-alerting measurable indicator exist?
  Yes: Is continuous monitoring feasible?
    Yes: Continuous on-condition task (CCT)
    No: Scheduled on-condition task (SCT)
  No: Is the aging parameter > 1?
    Yes: Is overhaul feasible?
      Yes: Scheduled overhaul (SOH)
      No: Scheduled replacement (SRP)
    No: No PM activity found (RTF)
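The decision logic above can be rendered as a small function. This is a simplified sketch; the parameter names are ours.

```python
def select_pm_task(indicator_exists, continuous_monitoring_feasible,
                   aging_parameter, overhaul_feasible):
    """Traverse the RCM decision logic and return the task acronym."""
    if indicator_exists:
        # On-condition maintenance is possible
        return "CCT" if continuous_monitoring_feasible else "SCT"
    if aging_parameter > 1:
        # Aging failures: renew the item before it gets too old
        return "SOH" if overhaul_feasible else "SRP"
    # No applicable preventive task found: deliberate run to failure
    return "RTF"

print(select_pm_task(indicator_exists=False,
                     continuous_monitoring_feasible=False,
                     aging_parameter=2.5,
                     overhaul_feasible=False))  # -> SRP
```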
Two overriding criteria for selecting maintenance tasks are used in RCM. Each
task selected must meet two requirements:
It must be applicable
It must be effective
Applicability: the task must be applicable in relation to our reliability
knowledge and in relation to the consequences of failure. If a task is found based
on the preceding analysis, it should satisfy the applicability criterion.
A PM task is applicable if it can eliminate a failure, or at least reduce the
probability of occurrence to an acceptable level, or reduce the impact of
failures (Hoch 1990).
Cost-effectiveness: the task must not cost more than the failure(s)
it is going to prevent.
The PM task's effectiveness is a measure of how well it accomplishes that
purpose and whether it is worth doing. Clearly, when evaluating the effectiveness of a
task, we are balancing the cost of performing the maintenance with the cost of not
performing it. In this context, we may refer to the cost as follows (Hoch 1990):
The cost of a PM task may include:
The risk of maintenance personnel error, e.g., maintenance introduced
failures
The risk of increasing the effect of a failure of another component while
one is out of service
The use and cost of physical resources
The unavailability of physical resources elsewhere while in use on this task
Production unavailability during maintenance
Unavailability of protective functions during maintenance of these
The more maintenance you do the more risk you expose your maintenance
personnel to
On the other hand, the cost of a failure may include:
The consequences of the failure should it occur (i.e., loss of production,
possible violation of laws or regulations, reduction in plant or personnel
safety, or damage to other equipment)
The consequences of not performing the PM task even if a failure does not
occur (i.e., loss of warranty)
Increased costs for emergency
In Step 4 critical items (MSIs) were selected for further analysis. A remaining
question is what to do with the items that are not analyzed. For plants already
having a maintenance program it is reasonable to continue this program for the non-
MSIs. If a maintenance program is not in effect, maintenance should be carried out
according to vendor specifications if they exist, else no maintenance should be per-
formed. See Paglia et al. (1991) for further discussion.
A necessary basis for implementing the result of the RCM analysis is that the
organizational and technical maintenance support functions are available. A major
issue is therefore to ensure the availability of the maintenance support functions.
The maintenance actions are typically grouped into maintenance packages, each
package describing what to do, and when to do it.
Many accidents are related to maintenance work. When implementing a
maintenance program it is therefore of vital importance to consider the risk asso-
ciated with the execution of the maintenance work. Checklists may be used to
identify potential risk involved with maintenance work:
Can maintenance people be injured during the maintenance work?
Is a work permit required for execution of the maintenance work?
Are means taken to avoid problems related to re-routing, by-passing, etc.?
Can failures be introduced during maintenance work?
Task analysis, e.g., see Kirwan and Ainsworth (1992), may be used to reveal
the risk involved with each maintenance job. See Hoch (1990) for further discus-
sion on implementing the RCM analysis results.
The reliability data we have access to at the outset of the analysis may be scarce, or
even almost non-existent. In our opinion, one of the most significant advantages of RCM
is that we systematically analyze and document the basis for our initial decisions
and, hence, can better utilize operating experience to adjust these decisions as
operating experience data are collected. The full benefit of RCM is therefore only
achieved when operation and maintenance experience is fed back into the analysis
process.
The updating process should be concentrated on three major time perspectives:
1. Short term interval adjustments
2. Medium term task evaluation
3. Long term revision of the initial strategy
For each significant failure that occurs in the system, the failure characteristics
should be compared with the FMECA. If the failure was not covered adequately in
the FMECA, the relevant part of the RCM analysis should, if necessary, be revised.
The short-term update can be considered as a revision of previous analysis
results. The input to such an analysis is updated reliability figures either due to more
data, or updated data because of reliability trends. This analysis should not require
excessive resources, since the framework for the analysis is already established. Only
Steps 5 and 8 in the RCM process will be affected by short-term updates.
The medium term update will also review the basis for the selection of
maintenance actions in Step 7. Analysis of maintenance experience may identify
significant failure causes not considered in the initial analysis, requiring an updated
FMECA in Step 6.
The long-term revision will consider all steps in the analysis. It is not sufficient
to consider only the system being analyzed; it is required to consider the entire
plant with its relations to the outside world, e.g., contractual considerations, new
laws regulating environmental protection, and so on.
1. A railway point is a railway switch that allows a train to go from one track to
another; a railway point is called a turnout in American English.
consider all parameters that are involved in the optimization model (see
Section 4.4).
6. Re-run the optimization procedure. Based on the new local parameters
we next re-run the optimization procedure to adjust maintenance intervals
taking local differences into account. To carry out this process we need a
computerized tool to streamline the work.
7. Document the results. The results from the local analysis are stored in a
local RCM database. This is a database where only the adjustment factors
are documented, for example, for railway points A, B, C, and D on line Y
the MTTF is 30 % higher than the average. The maintenance interval
is then adjusted accordingly.
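The idea of a local RCM database holding only adjustment factors can be sketched as follows. The task name, the base interval, and the linear rescaling rule are illustrative assumptions; whether a higher local MTTF lengthens or shortens the interval is, in practice, decided by the optimization model.

```python
# Generic analysis result: one interval per task (months; assumed value)
generic_intervals = {"railway point lubrication": 6.0}

# Local RCM database: only adjustment factors are stored, here the
# local MTTF relative to the network average (assumed value)
local_adjustments = {
    ("line Y", "railway point lubrication"): 1.30,  # 30 % higher MTTF
}

def local_interval(line, task):
    """Rescale the generic interval by the local adjustment factor
    (linear rescaling is an illustrative assumption)."""
    factor = local_adjustments.get((line, task), 1.0)
    return generic_intervals[task] * factor

print(round(local_interval("line Y", "railway point lubrication"), 2))  # -> 7.8
```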
The aim of the component model is to establish the effective failure rate with
respect to a specific failure mode, λE(τ), as a function of the maintenance interval
τ. The effective failure rate is the unconditional expected number of failures per
time unit for a given maintenance level. Typically, the effective failure rate is an
increasing function of τ. A large number of models for determining the effective
failure rate as a function of the maintenance strategies, the degradation models, and
so on, have been proposed in the literature.
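As one concrete instance of such a model, assume a Weibull-distributed time to failure and renewal of the item at each maintenance interval τ. A crude first-order approximation of λE(τ), which is our simplification and not the chapter's model, divides the probability of failing within a cycle by the cycle length:

```python
import math

def effective_failure_rate(tau, theta, beta):
    """Approximate effective failure rate lambda_E(tau) for an item
    renewed every tau time units, with Weibull(scale=theta, shape=beta)
    time to failure. Valid only when failures within a cycle are rare
    (F(tau) << 1), so the cycle length is close to tau."""
    f_tau = 1.0 - math.exp(-((tau / theta) ** beta))  # P(failure before tau)
    return f_tau / tau

# For beta > 1 the rate increases with tau, as stated in the text:
# shortening the interval reduces the effective failure rate.
```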
The interpretation of the effective failure rate is not straightforward for hidden
functions. For such functions we also need to specify the rate at which the hidden
function is demanded. In this situation we may approximate the effective failure
rate by the product of the demand rate and the probability of failure on demand
(PFD) for the hidden function.
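This approximation can be sketched with the common first-order formula PFD ≈ λτ/2 for a single component tested every τ time units, an assumption on our part that is consistent with the low-demand setting of IEC 61508:

```python
def hidden_effective_failure_rate(demand_rate, failure_rate, test_interval):
    """Effective failure rate of a hidden function, approximated as
    (demand rate) * (probability of failure on demand), with
    PFD ~= failure_rate * test_interval / 2 for a single component
    tested every test_interval time units.
    Requires failure_rate * test_interval << 1."""
    pfd = failure_rate * test_interval / 2.0
    return demand_rate * pfd
```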
In the following we indicate models that may be used for modelling the
effective failure rate, and we refer to the literature for details. The aim of OptiRCM
has been:
To cover the standard situations, both with respect to evident/hidden
failures and with respect to the type of failure progression.
Provide formulae that do not require too many reliability parameters to be
specified.
Limit the number of probabilistic models as a basis for the optimization.
Only the Weibull distribution is used to model aging failures in OptiRCM. There
may, of course, be situations where another distribution would be more realistic,
but our experience is that the user of such a tool rarely has data or insight that helps
him to do better than applying the Weibull model.
uses the renewal equation to establish an iterative scheme for the effective failure
rate based on an initial approximation.
Figure 4.4 shows a simplified model of the risk picture related to the component
failure being analyzed. In order to quantify the risk related to safety, we need the
following input data:
The effective failure rate, λE(τ)
The probability that the other barriers against the TOP event with respect to
safety all fail, PTE-S
The probability, PCj, that the TOP event results in consequence Cj, for j
running through the number of consequence classes
Table 4.2. PLL and cost contribution for each consequence class
Fj = λE(τ) · PTE-S · PCj    (4.1)
where PCj is the probability that the TOP event results in consequence class Cj.
We will later indicate how we can model Equation 4.1 as a function of the
maintenance interval, τ.
In some situations we also assign a cost, and/or a PLL (potential loss of life)
contribution to the various cost elements. PLL denotes the annual, statistically
expected number of fatalities in a specified population. Proposed values adopted by
the Norwegian National Rail Administration are given in Table 4.2. Please see
discussion by Vatn (1998) regarding what it means to assign monetary values to
safety.
The total PLL contribution related to the component failure being analyzed is
then
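The omitted expression is presumably the sum of the consequence-class frequencies Fj of Equation 4.1 weighted by the PLL value assigned to each class. A sketch, with placeholder numbers rather than the Table 4.2 values:

```python
def pll_contribution(lambda_e, p_te_s, p_c, pll_weights):
    """Total PLL contribution of one failure mode:
        PLL = sum_j F_j * PLL_j, with F_j = lambda_e * p_te_s * p_c[j]
    p_c lists the generic probabilities P_Cj of consequence classes
    C1..C6; pll_weights lists the statistically expected number of
    fatalities assigned to each class."""
    return sum(lambda_e * p_te_s * pcj * pllj
               for pcj, pllj in zip(p_c, pll_weights))

# Example with made-up numbers (not the Table 4.2 values)
p_c = [0.4, 0.3, 0.15, 0.1, 0.04, 0.01]        # P_Cj for C1..C6
pll_weights = [0.0, 0.0, 0.0, 1.0, 5.0, 20.0]  # fatalities per class
```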
This procedure may, if required, be repeated for other dimensions like environ-
ment, material damage, and so on.
The approach to interval optimization is based on minimizing the total cost related
to safety, punctuality, availability, material damage, etc. Within an ALARP regime
(e.g., see Vatn 1998) this requires that the risk is not unacceptable. Assuming that
risk is acceptable, we proceed by calculating the total cost per time unit:
where CS(τ) and CP(τ) are given by Equations 4.3 and 4.4, respectively. Further,
where PM Cost is the cost per preventive maintenance activity. Note that for
condition-based tasks we distinguish between the cost of monitoring the item, and
the cost of physically improving the item by some restoration or renewal activity.
This complicates Equation 4.6 slightly because we have to calculate the average
number of renewals.
Table 4.3. Generic probabilities, PCj, of consequence class Cj for the different TOP events
The total cost C(τ) in Equation 4.5 can now be found as a function of τ; see
Figure 4.7 for a graphical illustration. The optimum interval is found to be 7.5
million km. The maintenance action is scheduled replacement of the pump; see
Figure 4.5.
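The search for the minimum of such a cost curve can be sketched as a simple grid search, which is roughly what a tool like OptiRCM must do numerically. The cost model below is an arbitrary convex stand-in, not Equation 4.5:

```python
def optimal_interval(cost_fn, taus):
    """Return the candidate interval tau that minimizes the total cost
    per time unit, C(tau), over a grid of candidate intervals."""
    return min(taus, key=cost_fn)

def total_cost(tau, pm_cost=100.0, risk_rate=2.0):
    # Risk-related cost grows with tau; PM cost per time unit falls
    # with tau, so the sum is convex with an interior minimum.
    return risk_rate * tau + pm_cost / tau

taus = [0.5 * k for k in range(1, 41)]     # candidate intervals
print(optimal_interval(total_cost, taus))  # -> 7.0
```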
4.5 Conclusions
The main parts of the RCM approach that we have described in this chapter are
compatible with common practice and with most of the RCM standards. We are,
however, using a more complex FMECA where we also record data that are
necessary during maintenance interval optimization. The novel parts of our approach
are related to the use of so-called generic RCM analysis and to maintenance interval
optimization. The use of generic RCM analysis will significantly reduce the
workload of a complete RCM analysis. Maintenance optimization is, generally, a
very complex task, and only a brief introduction is presented in this chapter. For
maintenance personnel to be able to use the proposed methods, they need to have
access to simple computerized tools where the mathematically complex methods are
hidden. This was our objective in developing the OptiRCM tool. Maintenance
optimization modules are, more or less, non-existent in the standard RCM tools.
OptiRCM is not a replacement for these tools, but rather a supplement. OptiRCM is
still in the development stage, and we are currently trying to implement several new
features into OptiRCM. Among these are additional methods related to maintenance
strategies, and grouping of maintenance tasks.
4.6 References
ABS, (2003) Guide for Survey Based on Reliability-Centered Maintenance. American
Bureau of Shipping, Houston.
ABS, (2004) Guidance Notes on Reliability-Centered Maintenance. American Bureau of
Shipping, Houston.
Blanchard BS, Fabrycky WJ, (1998) Systems Engineering and Analysis, 3rd ed. Prentice
Hall, Englewood Cliffs, NJ.
Blache KM, Shrivastava AB, (1994) Defining failure of manufacturing machinery and
equipment. Proceedings from the Annual Reliability and Maintainability Symposium,
pp. 69–75.
Castanier B, Rausand M, (2006) Maintenance optimization for subsea oil pipelines.
Pressure Vessels and Piping 83:236–243.
Chang KP, (2005) Reliability-centered maintenance for LNG ships. ROSS report 200506,
NTNU, Trondheim, Norway.
Chang KP, Rausand M, Vatn J, (2006) Reliability Assessment of Reliquefaction Systems on
LNG Carriers. Submitted for publication in Reliability Engineering and System Safety.
Cho DI, Parlar M, (1991) A survey of maintenance models for multi-unit systems. European
Journal of Operational Research 51:1–23.
Christer AH, Waller WM, (1984) Delay time models of industrial inspection maintenance
problems. Journal of the Operational Research Society 35:401–406.
DEF-STD 02-45 (NES 45), (2000) Requirements for the application of reliability-centred
maintenance technique to HM ships, submarines, Royal fleet auxiliaries and other naval
aixiliary vessels. Defense Standard, U.K. Ministry of Defence, Bath, England.
Gertsbakh I, (2000) Reliability Theory with Applications to Preventive Maintenance.
Springer, New York.
Hoch R, (1990) A practical application of reliability centered maintenance. the American
Society of Mechanical Engineers, 90-JPGC/Pwr-51, Joint ASME/IEEE Power
Generation Conference, Boston, MA, 2125 October.
108 M. Rausand and J. Vatn
5

Condition-based Maintenance Modelling

Wenbin Wang
5.1 Introduction
The use of condition monitoring techniques in industry to direct maintenance
actions has increased rapidly over recent years to the extent that it has marked the
beginning of what is likely to prove a new generation in production and main-
tenance management practice. There are both economic and technological reasons
for this development driven by tight profit margins, high outage costs and an
increase in plant complexity and automation. Technical advances in condition
monitoring techniques have provided a means to achieve high availability and to
reduce scheduled and unscheduled production shutdowns. In all cases, the
measured condition information does, in addition to potentially improving decision
making, have a value added role for a manager in that there is now a more ob-
jective means of explaining actions if challenged.
In November 1979 the consultants Michael Neal & Associates Ltd published A
Guide to Condition Monitoring of Machinery for the UK Department of Trade and
Industry (Neal et al. 1979). This groundbreaking report illustrated the differences between
maintenance strategies (e.g., breakdown, planned, etc.) and suggested that condition
based maintenance, using a range of techniques, would offer significant benefits to
industry. By the late 1990s condition based maintenance had become widely
accepted as one of the drivers to reduce maintenance costs and increase plant
availability. With the advent of e-procurement, business to business (B2B), customer
to business (C2B), business to customer (B2C) etc., industry is fast moving towards
enterprise wide information systems associated with the internet. Today, plant asset
management is the integration of computerised maintenance management systems
and condition monitoring in order to fulfil the business objectives. This enables
significant production benefits through objective maintenance prediction and
scheduling. This positions the manufacturer to remain competitive in a dynamic
market.
Today there exists a large and growing variety of condition monitoring tech-
niques for machine condition monitoring and fault diagnosis. A particularly popular
one for rotating and reciprocating machinery is vibration analysis. However, irrespective
of the particular condition monitoring technique used, the working principle of
condition monitoring is the same, namely condition data become available which
need to be interpreted and appropriate actions taken accordingly. There are generally
two stages in condition based maintenance. The first stage is related to condition
monitoring data acquisition and their technical interpretations. There have been
numerous papers contributing to this stage, as evidenced by the proceedings of
COMADEM over recent years. This stage is characterised by engineering skill,
knowledge and experience. Much of the research effort at this stage has gone into
determining the appropriate variables to monitor, Chen et al. (1994), the design of
systems for condition monitoring data acquisition, Drake et al. (1995), signal
processing, Wong et al. (2006), Samanta et al. (2006), Harrison (1995), Li and Li
(1995), and how to implement computerised condition monitoring, Meher-Homji et
al. (1994). These are just a few examples, and in such studies no explicit modelling of
the maintenance decision process based upon the results of condition monitoring is attempted. For
detailed technical aspects of condition monitoring and fault diagnosis, see Collacott
(1977). The second stage is maintenance decision making, namely what to do now
given that condition information data and their interpretations are available. The
decision at this stage can be complicated and entails consideration of cost, downtime,
production demand, preventive maintenance shutdown windows, and most im-
portantly, the likely survival time of the item monitored. Compared with the exten-
sive literature on condition monitoring techniques and their applications, relatively
little attention has been paid to the important problem of modelling appropriate
decision making in condition based maintenance.
This chapter focuses on the second stage of condition monitoring, namely
condition based maintenance modelling as an aid to effective decision making. In
particular, we will highlight a modelling technique used recently in condition based
maintenance, namely residual life modelling via stochastic filtering (Wang and
Christer 2000). This is a key element in modelling the decision making aspect of
condition based maintenance. The chapter is organised as follows. Section 5.2
gives a brief introduction to condition monitoring techniques. Section 5.3 focuses
on condition based maintenance modelling and discusses various modelling
techniques used. Section 5.4 presents the modelling of the residual life conditional on
observed monitoring information using stochastic filtering. Section 5.5 concludes
the chapter with a discussion of topics for future research.
5.2 Condition Monitoring Techniques
industrial equipment some measurements can be taken and the likely condition of
the plant assessed.
Today there exists a large and growing variety of condition monitoring
techniques for machine condition monitoring and fault diagnosis. Understanding the
nature of each monitoring technique and the type of information measured will certainly
help us when establishing a decision model. Here we briefly introduce five main
techniques; among them, vibration and oil analysis techniques are the two most
popular.
$$V_{\mathrm{rms}} = \sqrt{\frac{1}{T}\int_0^T V(t)^2\, dt}\,,$$
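As a minimal illustration, the rms level can be computed from a sampled signal (the 50 Hz sine below is synthetic, chosen only to exercise the formula):

```python
import math

def rms(samples):
    # Discrete analogue of V_rms = sqrt((1/T) * integral of V(t)^2 dt)
    return math.sqrt(sum(v * v for v in samples) / len(samples))

# Synthetic vibration signal: a 50 Hz sine sampled at 1 kHz for one second.
signal = [math.sin(2 * math.pi * 50 * k / 1000.0) for k in range(1000)]
level = rms(signal)   # a unit-amplitude sine has rms 1/sqrt(2)
```

For a pure sine of amplitude 1 the result is 1/sqrt(2), which is why overall rms level is a convenient single-number summary of vibration severity.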
which causes the abnormal signals, but not vice versa (Wang 2002). This factor
plays an important role when selecting an appropriate model for describing such a
relationship.
is only carried out when it becomes necessary, utilizing available condition information. But in reality, all too often we see effort and money spent on monitoring
equipment for faults which rarely occur, and we also see planned maintenance
being carried out when the equipment is perfectly healthy even though the monitored
information indicates something is wrong. A study of oil based condition monitoring of the gear boxes of locomotives used by Canadian Pacific Railway (Aghjagan
1989) indicated that, since condition monitoring was commissioned (entailing 3–4
samples per locomotive per week, 52 weeks per year), the incidence of gear box failures
while in use fell by 90%. This is a significant achievement. However, when the gear boxes were
subsequently stripped down for reconditioning/overhaul, there was nothing evidently wrong in 50% of cases. Clearly, condition monitoring can be highly effective, but may also be very inefficient at the same time. Modelling is necessary to
improve the cost effectiveness and efficiency of condition monitoring.
The long term expected cost per unit time, $C(t)$, given that a preventive replacement is scheduled at time $t > t_i$, is given by (Wang 2003)

$$C(t) = \frac{(c_f - c_p)\,P(t - t_i \mid \mathcal{F}_i) + c_p + i\,c_m}{t_i + (t - t_i)\bigl(1 - P(t - t_i \mid \mathcal{F}_i)\bigr) + \int_0^{t - t_i} x_i\, p_i(x_i \mid \mathcal{F}_i)\, dx_i} \qquad (5.1)$$

where $P(t - t_i \mid \mathcal{F}_i) = P(X_i < t - t_i \mid \mathcal{F}_i) = \int_0^{t - t_i} p_i(x_i \mid \mathcal{F}_i)\, dx_i$ is the probability
of a failure before $t$ conditional on the monitoring history $\mathcal{F}_i$. The right hand side of Equation 5.1 is the
expected cost per unit time formulated as a renewal reward function, though the
lifetimes are independent but not identical.
Condition-based Maintenance Modelling 117
The time point t is usually bounded within the time period from the current to
the next monitoring since a new decision shall be made once a new monitoring
reading becomes available at time ti +1 .
In general, if a minimum of C (t ) is found within the interval to the next
monitoring in terms of t , then this t should be the optimal replacement time. If no
minimum is found, then the recommendation would be to continue to use the plant
and evaluate Equation 5.1 at the next monitoring point when new information
becomes available. For a graphical illustration of the above principle see Figure 5.1.
Figure 5.1. Expected cost per unit time $C(t)$ against planned replacement time; the annotated curve illustrates the case where no replacement is recommended
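The decision rule above can be tried numerically. The sketch below evaluates Equation 5.1 by midpoint integration and scans candidate replacement times up to the next monitoring point; the residual-life density is a hypothetical Weibull and the cost values are the illustrative ones used later in the chapter ($c_f = 6000$, $c_p = 2000$, $c_m = 30$), not outputs of the author's model:

```python
import math

def residual_life_pdf(x, shape=2.0, scale=40.0):
    # Hypothetical current residual-life density p_i(x | F_i): Weibull(2, 40).
    return (shape / scale) * (x / scale) ** (shape - 1) * math.exp(-(x / scale) ** shape)

def expected_cost_rate(t, t_i, i, pdf, cf=6000.0, cp=2000.0, cm=30.0, n=2000):
    # Numerical evaluation of Equation 5.1 for a replacement planned at t > t_i.
    h = (t - t_i) / n
    p_fail = mean_trunc = 0.0
    for k in range(n):
        x = (k + 0.5) * h                 # midpoint rule over [0, t - t_i]
        p_fail += pdf(x) * h              # P(failure before t | F_i)
        mean_trunc += x * pdf(x) * h      # E[X_i ; X_i < t - t_i]
    num = (cf - cp) * p_fail + cp + i * cm
    den = t_i + (t - t_i) * (1.0 - p_fail) + mean_trunc
    return num / den

# Scan candidate replacement times up to the next monitoring point (30 h ahead):
t_i, i = 80.5, 5
candidates = [t_i + 0.5 * k for k in range(1, 61)]
costs = [expected_cost_rate(t, t_i, i, residual_life_pdf) for t in candidates]
best = min(zip(costs, candidates))[1]     # candidate with the lowest cost rate
```

If the lowest value sits at the end of the scanned interval, the practical reading is "no minimum found: carry on and re-evaluate at the next monitoring point", exactly as described above.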
5.3.2 Modelling $p_i(x_i \mid \mathcal{F}_i)$

$$h(t) = \alpha\beta\,(\beta t)^{\alpha - 1}\,.$$
There are two problems with proportional hazards modeling or accelerated life
models in condition based maintenance. The first is that the current hazard is
determined partially by the current monitoring measurements and the full
monitoring history is not used. The second is the assumption that the hazard or the
life is a function of the observed monitoring data which acts directly on the hazard
via a covariate function. Both problems relate to the modeling assumption rather
than the technique. The first can be overcome if some sort of transformation of the
observed data is used. The second problem remains unless the nature of monitoring
indicates so. It is noted however that, for most condition monitoring techniques,
the observed monitoring measurements are concomitant types of information
which are a function of the underlying plant state. A typical example is in vibration
monitoring where a high level of vibration is usually caused by a hidden defect but
not vice versa as we have discussed earlier. In this case the observed vibration
signals may be regarded as concomitant variables which are caused by the plant
state. Note that in oil based monitoring things are different as the metal particles
and other contaminants observed in the oil can be regarded both as concomitant
Figure: the two stage failure process, in which a defect implies a short residual life and a higher than normal signal may be observed
If the severity of the defect is represented by the length of the residual life, the
relationship between the residual life and observed condition related variables
follows.
$$X_i = \begin{cases} X_{i-1} - (t_i - t_{i-1}) & \text{if } X_{i-1} > t_i - t_{i-1} \\ \text{not defined} & \text{otherwise.} \end{cases} \qquad (5.2)$$
Figure: observed condition monitoring readings $y_1, y_2, y_3$ and the corresponding residual lives $x_1, x_2, x_3$ at monitoring times $t_1, t_2, t_3$, shown against the threshold level up to failure
$$p_i(x_i \mid \mathcal{F}_i) = p(x_i \mid y_i, \mathcal{F}_{i-1}) = \frac{p(x_i, y_i \mid \mathcal{F}_{i-1})}{p(y_i \mid \mathcal{F}_{i-1})} \qquad (5.3)$$

$$p(x_i, y_i \mid \mathcal{F}_{i-1}) = p(y_i \mid x_i, \mathcal{F}_{i-1})\, p(x_i \mid \mathcal{F}_{i-1}) \qquad (5.4)$$

$$p(x_i, y_i \mid \mathcal{F}_{i-1}) = p(y_i \mid x_i, \mathcal{F}_{i-1})\, p(x_i \mid \mathcal{F}_{i-1}) = p(y_i \mid x_i)\, p(x_i \mid \mathcal{F}_{i-1}) \qquad (5.5)$$

$$p(y_i \mid \mathcal{F}_{i-1}) = \int_0^{\infty} p(x_i, y_i \mid \mathcal{F}_{i-1})\, dx_i = \int_0^{\infty} p(y_i \mid x_i)\, p(x_i \mid \mathcal{F}_{i-1})\, dx_i \qquad (5.6)$$
$$p(x_i \mid \mathcal{F}_{i-1}) = p_{i-1}\bigl(g(x_i) \mid \mathcal{F}_{i-1},\, X_{i-1} > t_i - t_{i-1}\bigr)\, \frac{dg(x_i)}{dx_i} \qquad (5.7)$$

where $g(x_i) = x_i + t_i - t_{i-1}$. Since $\dfrac{dg(x_i)}{dx_i} = 1$ and

$$p_{i-1}\bigl(g(x_i) \mid \mathcal{F}_{i-1},\, X_{i-1} > t_i - t_{i-1}\bigr) = \frac{p_{i-1}(g(x_i) \mid \mathcal{F}_{i-1})}{\int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \mathcal{F}_{i-1})\, dx_{i-1}} \qquad (5.8)$$

we finally have

$$p(x_i \mid \mathcal{F}_{i-1}) = \frac{p_{i-1}(x_i + t_i - t_{i-1} \mid \mathcal{F}_{i-1})}{\int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \mathcal{F}_{i-1})\, dx_{i-1}} \qquad (5.9)$$

$$p_i(x_i \mid \mathcal{F}_i) = \frac{p(y_i \mid x_i)\, p_{i-1}(x_i + t_i - t_{i-1} \mid \mathcal{F}_{i-1})}{\int_0^{\infty} p(y_i \mid x_i)\, p_{i-1}(x_i + t_i - t_{i-1} \mid \mathcal{F}_{i-1})\, dx_i} \qquad (5.10)$$

$$p_1(x_1 \mid \mathcal{F}_1) = \frac{p(y_1 \mid x_1)\, p_0(x_1 + t_1 - t_0 \mid \mathcal{F}_0)}{\int_0^{\infty} p(y_1 \mid x_1)\, p_0(x_1 + t_1 - t_0 \mid \mathcal{F}_0)\, dx_1} \qquad (5.11)$$
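The recursion of Equations 5.9 and 5.10 lends itself to a simple grid implementation. The sketch below assumes Weibull forms for $p_0$ and for the observation density, with invented parameter values throughout; it illustrates the filtering idea only and is not the author's implementation:

```python
import math

DX, XMAX = 0.5, 400.0
N = int(XMAX / DX)
GRID = [k * DX for k in range(N)]

def weibull_pdf(x, shape, rate):
    # Weibull density with a rate parameter: f(x) = a*r*(r*x)^(a-1)*exp(-(r*x)^a)
    if x <= 0.0:
        return 0.0
    return shape * rate * (rate * x) ** (shape - 1) * math.exp(-((rate * x) ** shape))

def p_y_given_x(y, x):
    # Assumed observation model in the spirit of Equation 5.12: the rate of the
    # reading grows as the residual life x shrinks (parameter values invented).
    lam = 0.05 + 1.0 * math.exp(-0.05 * x)
    return weibull_pdf(y, 2.0, lam)

def filter_step(dens, dt, y):
    # Survival shift of the residual-life density over the interval dt (Eq. 5.9),
    # followed by the Bayes update with the new reading y (Eq. 5.10).
    steps = int(round(dt / DX))
    surv = sum(d for k, d in enumerate(dens) if k > steps) * DX
    shifted = [dens[k + steps] / surv if k + steps < N else 0.0 for k in range(N)]
    post = [p_y_given_x(y, x) * s for x, s in zip(GRID, shifted)]
    norm = sum(post) * DX
    return [p / norm for p in post]

dens = [weibull_pdf(x, 2.0, 0.01) for x in GRID]        # assumed p0
for dt, y in [(10.0, 0.8), (10.0, 1.4), (10.0, 2.3)]:    # three monitoring intervals
    dens = filter_step(dens, dt, y)
```

Each pass through `filter_step` carries forward the whole monitoring history, which is exactly what distinguishes this filtering approach from hazard models driven only by the current reading.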
$p_0(x_0)$ is just the delay time distribution over the second stage of the plant life.
Here we use the Weibull distribution as an example in this context. In practice,
the density function $p_0(x_0)$ should be chosen as the one which best fits the data,
or which follows from some known theory.
The set-up of the $p(y_i \mid x_i)$ term requires more attention. Here we follow the
one used in Wang (2002), where $y_i \mid x_i$ is assumed to follow a Weibull distribution
with the scale parameter being equal to the inverse of $A + Be^{cx_i}$. In this way we have

$$p(y_i \mid x_i) = \gamma\,(A + Be^{cx_i})\,\bigl(y_i (A + Be^{cx_i})\bigr)^{\gamma - 1}\, e^{-\left(y_i (A + Be^{cx_i})\right)^{\gamma}}\,. \qquad (5.12)$$
This is a concept called the floating scale parameter, which is particularly useful in this
case (Wang 2002). There are other choices to model the relationship between $y_i$
and $x_i$, but these will not be discussed here; they can be found in Wang (2006a).
To calculate the actual $p_i(x_i \mid \mathcal{F}_i)$ we need to know the values of the model
parameters, namely the parameters of $p_0(x_0)$ and $p(y_i \mid x_i)$. The most popular
way to estimate them is the method of maximum likelihood.
At each monitoring point $t_i$, two pieces of information are available, namely $y_i$
and the event $X_{i-1} > t_i - t_{i-1}$, both conditional on $\mathcal{F}_{i-1}$. The pdf of $y_i \mid \mathcal{F}_{i-1}$ is given by
Equation 5.6 and the probability of $X_{i-1} > t_i - t_{i-1} \mid \mathcal{F}_{i-1}$ is given by

$$P(X_{i-1} > t_i - t_{i-1} \mid \mathcal{F}_{i-1}) = \int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \mathcal{F}_{i-1})\, dx_{i-1}\,. \qquad (5.13)$$
If the item monitored failed at time $t_f$ after the last monitoring at time $t_n$, the
complete likelihood function is then given by

$$L(\theta) = \left(\prod_{i=1}^{n} p(y_i \mid \mathcal{F}_{i-1}) \int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \mathcal{F}_{i-1})\, dx_{i-1}\right) p_n(t_f - t_n \mid \mathcal{F}_n) \qquad (5.14)$$
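Under assumed density forms, Equation 5.14 can be evaluated by running the same grid recursion and accumulating the log of each bracketed factor as the filter steps through the monitoring history; a numerical optimiser would then vary the parameters to maximise this quantity. All monitoring times, readings and parameter values below are invented for illustration:

```python
import math

DX, XMAX = 0.5, 400.0
N = int(XMAX / DX)
GRID = [k * DX for k in range(N)]

def wpdf(x, shape, rate):
    if x <= 0.0:
        return 0.0
    return shape * rate * (rate * x) ** (shape - 1) * math.exp(-((rate * x) ** shape))

def log_likelihood(times, readings, t_fail,
                   alpha=2.0, beta=0.01, A=0.05, B=1.0, c=-0.05, gamma=2.0):
    # Log of Equation 5.14 for a single item monitored at `times`, failing at t_fail.
    p_y = lambda y, x: wpdf(y, gamma, A + B * math.exp(c * x))   # assumed p(y | x)
    dens = [wpdf(x, alpha, beta) for x in GRID]                   # assumed p0
    ll, t_prev = 0.0, 0.0
    for t, y in zip(times, readings):
        steps = int(round((t - t_prev) / DX))
        surv = sum(d for k, d in enumerate(dens) if k > steps) * DX
        dens = [dens[k + steps] / surv if k + steps < N else 0.0 for k in range(N)]
        p_y_marg = sum(p_y(y, x) * d for x, d in zip(GRID, dens)) * DX
        ll += math.log(surv) + math.log(p_y_marg)   # the bracketed factors in Eq. 5.14
        dens = [p_y(y, x) * d / p_y_marg for x, d in zip(GRID, dens)]
        t_prev = t
    # Failure term p_n(t_f - t_n | F_n), read off the final filtered density:
    return ll + math.log(dens[int(round((t_fail - t_prev) / DX))])

ll = log_likelihood([10.0, 20.0, 30.0], [0.8, 1.4, 2.3], t_fail=36.0)
```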
Figure 5.3 shows the overall vibration levels, in rms, of six bearings from a fatigue
experiment (Wang 2002). It can be seen from Figure 5.3 that the
bearing lives vary from around 100 h to over 1000 h, which shows the typically stochastic nature of the life distribution. The monitored vibration signals also show
an increasing trend with bearing age in all cases, but with different paths. An
important observation is that the pattern of vibration signals stays relatively flat
in the early stage of the bearing life and then increases rapidly (a defect may have
been initiated). This indicates the existence of the two stage failure process as
defined earlier.
The initial point of the second stage in these bearings is identified using a
control chart called the Shewhart average level chart and the threshold levels of the
bearings are shown in Table 5.1 (Zhang 2004).
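The idea of the Shewhart average level chart can be sketched as follows: readings from the assumed normal stage fix a centre line and a control limit, and the first reading beyond the limit flags the initial point of the second stage. The readings, baseline length and three-sigma limit below are synthetic choices, not those of Table 5.1:

```python
import math

def shewhart_initial_point(readings, baseline_n=10, k=3.0):
    """Return the index of the first reading above mean + k*sigma of the
    baseline (normal-stage) readings, or None if the chart never signals."""
    base = readings[:baseline_n]
    mean = sum(base) / len(base)
    sigma = math.sqrt(sum((r - mean) ** 2 for r in base) / (len(base) - 1))
    upper = mean + k * sigma
    for idx, r in enumerate(readings[baseline_n:], start=baseline_n):
        if r > upper:
            return idx
    return None

# Synthetic overall-vibration readings: flat early stage, then a rising defect stage.
flat = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.1, 0.9, 1.0]
rising = [1.2, 1.5, 2.0, 2.8, 4.0]
start = shewhart_initial_point(flat + rising)
```

Everything before `start` is treated as the normal working stage; the filtering model of Section 5.4 is then applied from that point onwards.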
$$p_0(x_0) = \alpha\beta\,(\beta x_0)^{\alpha - 1}\, e^{-(\beta x_0)^{\alpha}}$$

and

$$p(y_i \mid x_i) = \gamma\,(A + Be^{cx_i})\,\bigl(y_i (A + Be^{cx_i})\bigr)^{\gamma - 1}\, e^{-\left(y_i (A + Be^{cx_i})\right)^{\gamma}}\,,$$

so that Equation 5.10 gives

$$p_i(x_i \mid \mathcal{F}_i) = \frac{\alpha\beta\,(\beta(x_i + t_i))^{\alpha - 1}\, e^{-(\beta(x_i + t_i))^{\alpha}}\, \prod_{k=1}^{i} \kappa_k(x_i, t_i)}{\int_0^{\infty} \alpha\beta\,(\beta(z + t_i))^{\alpha - 1}\, e^{-(\beta(z + t_i))^{\alpha}}\, \prod_{k=1}^{i} \kappa_k(z, t_i)\, dz} \qquad (5.15)$$

where

$$\kappa_k(z, t_i) = \gamma\,(A + Be^{c(z + t_i - t_k)})\,\bigl(y_k (A + Be^{c(z + t_i - t_k)})\bigr)^{\gamma - 1}\, e^{-\left(y_k (A + Be^{c(z + t_i - t_k)})\right)^{\gamma}}\,;$$

that is, $\kappa_k(z, t_i) = p(y_k \mid x_k = z + t_i - t_k)$.
Table 5.2. Estimated values of the model parameters ($\alpha$, $\beta$, $\gamma$, $A$, $B$, $C$):
0.011, 1.873, 7.069, 27.089, 0.053, 4.559
Based on the estimated parameter values in Table 5.2 and Equation 5.15, the
predicted residual life at selected monitoring points, given the history information of
bearing 6 in Figure 5.3, is plotted in Figure 5.4.
In Figure 5.4 the actual residual lives at those checking points are also plotted
with the symbol *. It can be seen that the actual residual lives are well within the predicted
residual life distribution, as expected.
Given the estimated values for the parameters and associated costs such as
$c_f = 6000$, $c_p = 2000$ and $c_m = 30$ (Wang and Jia 2001), we have the expected
cost per unit time for one of the bearings at various checking times $t$, shown in
Figure 5.5.
Figure 5.5. Expected cost per unit time vs. planned replacement time in hours from the
current time t (curves shown for t = 80.5, 92.5, 104, 116.5 and 129 h)
It can be seen from Figure 5.5 that at t = 116.5 and 129 h planned replacements
are recommended within the next 30 h.
To illustrate an alternative decision chart in terms of the actual condition
monitoring reading, we transformed the cost related decision into actual reading in
Figure 5.6 where the dark grey area indicates that if the reading falls within this area
a preventive replacement is required within the planning period of consideration.
The advantage of Figure 5.6 is that it can not only tell us whether a preventive
replacement is needed but also show us how far the reading is from the area of pre-
ventive replacement so that appropriate preparation can be done before the actual
replacement.
Figure 5.6. Observed CM reading against the time (age in hours) at which the CM reading
was taken, showing the preventive replacement area (dark grey) and the no preventive
replacement area
With the delay time concept (see Chapter 14), system life is assumed to be
classified into two stages. The first is the normal working stage where no abnormal
condition parameters are to be expected. The second starts when a hidden defect is
first initiated with possible abnormal signals. The identification of the initial point
in the evolution of such a defect is important and has a direct impact on the
The definition of the underlying state and the relationship between the observed
monitoring parameters and the state of the system are issues which still need
attention. In the model presented in this chapter, the state of the system is defined
as the residual life, which is assumed to influence the observed signal parameters.
Whilst the modelling output appears to make sense, there are a few potential
problems with the approach. The first is the issue that the life of the plant is fixed
at birth (installation) but unknown; this is sometimes termed "playing God". Second, the
residual life is not the direct cause of the observed abnormal signals; these are
more likely caused by some hidden defects, which are linked to the residual life in
this chapter. To correct the first problem we can introduce another equation
describing the relationship between $X_i$ and $X_{i-1}$, deterministically or randomly.
This will allow $X_i$ to change during use, which is more appropriate. If the
relationship is deterministic, then a closed form of Equation 5.3 is still available,
but if it is random, a hidden Markov model (HMM) must be used and no closed form of Equation 5.3 exists
unless the associated noise terms are normally distributed. The second problem can be
overcome if we adopt a discrete or continuous state hidden Markov chain to describe the system deterioration process, where the state space of the chain represents the system state in question.
system state. A model which can handle both types of information is ideal, but very
few attempts have been made (Hussin and Wang 2006).
5.6 Conclusions
This chapter has introduced the concept of condition monitoring, key condition monitoring techniques, condition based maintenance, and the associated modelling support
in aid of condition based maintenance. Particular attention has been paid to residual
life prediction based on the condition information available to date. An important
development made here is the establishment of the relationship between the observed information and the underlying condition, which is the residual life in this case.
This is achieved by letting the mean of the observed information at $t_i$ be a function
of the residual life at that point conditional on $X_i = x_i$. The mathematical development is based on a recursive algorithm called filtering, where all past information is
included. The example illustrated is based on real data from a fatigue
experiment. Data from industry have also shown the robustness of the approach,
and the residual life predictions conducted so far are satisfactory.
5.7 References
Aghjagan, H.N., (1989) Lubeoil analysis expert system, Canadian Maintenance Engineering
Conference, Toronto.
Aven, T., (1996) Condition based replacement policies: a counting process approach, Rel.
Eng. & Sys. Safety, 51(3), 275–281.
Banjevic, D., Jardine, A.K.S., Makis, V. and Ennis, M., (2001) A control-limit policy and
software for condition based maintenance optimization, INFOR, 39(1), 32–50.
Baruah, P. and Chinnam, R.B., (2005) HMM for diagnostics and prognostics in machining
processes, I. J. Prod. Res., 43(6), 1275–1293.
Black, M., Brint, A.T. and Brailsford, J.R., (2005) A semi-Markov approach for modelling
asset deterioration, J. Opl. Res. Soc., 56(11), 1241–1249.
Bunks, C., McCarthy, D. and Al-Ani, T., (2000) Condition based maintenance of machines
using hidden Markov models, Mech. Sys. & Sig. Proc., 14(4), 597–612.
Chen, D. and Trivedi, K.S., (2005) Optimization for condition based maintenance with
semi-Markov decision process, Rel. Eng. & Sys. Safety, 90(1), 25–29.
Chen, W., Meher-Homji, C.B. and Mistree, F., (1994) COMPROMISE: an effective
approach for condition-based maintenance management of gas turbines. Engineering
Optimization, 22, 185–201.
Christer, A.H., Wang, W. and Sharp, J.M., (1997) A state space condition monitoring
model for furnace erosion prediction and replacement, Euro. J. Opl. Res., 101, 1–14.
Christer, A.H. and Wang, W., (1992) A model of condition monitoring inspection of
production plant, I. J. Prod. Res., 30, 2199–2211.
Christer, A.H. and Wang, W., (1995) A simple condition monitoring model for a direct
monitoring process, Euro. J. Opl. Res., 82, 258–269.
Collacott, R.A., (1977) Mechanical Fault Diagnosis and Condition Monitoring, Chapman and
Hall Ltd., London.
Dong, M. and He, D., (2004) Hidden semi-Markov models for machinery health diagnosis
and prognosis, Trans. North Amer. Manu. Res. Ins. of SME, 32, 199–206.
Drake, P.R., Jennings, A.D., Grosvenor, R.I. and Whittleton, D., (1995) Acquisition system
for machine tool condition monitoring. Quality and Reliability Engineering International, 11, 15–26.
Freund, J.E., (2004) Mathematical Statistics with Applications, Pearson Prentice Hall,
London.
Harrison, N., (1995) Oil condition monitoring for the railway business. Insight, 37, 278–283.
Hontelez, J.A.M., Burger, H.H. and Wijnmalen, D.J.D., (1996) Optimum condition based
maintenance policies for deteriorating systems with partial information, Rel. Eng. & Sys.
Safety, 51(3), 267–274.
Hussin, B. and Wang, W., (2006) Conditional residual time modelling using oil analysis: a
mixed condition information using accumulated metal concentration and lubricant
measurements, to appear in Proc. 1st Main. Eng. Conf., Chengdu, China.
Jardine, A.K.S., Makis, V., Banjevic, D., Braticevic, D. and Ennis, M., (1998) A decision
optimization model for condition based maintenance, J. Qua. Main. Eng., 4(2), 115–121.
Jensen, U., (1992) Optimal replacement rules based on different information level, Naval
Res. Log., 39, 937–955.
Kalbfleisch, J.D. and Prentice, R.L., (1980) The Statistical Analysis of Failure Time Data.
Wiley, New York.
Kumar, D. and Westberg, U., (1997) Maintenance scheduling under age replacement policy
using proportional hazard modelling and total-time-on-test plotting, Euro. J. Opl. Res.,
99, 507–515.
Li, C.J. and Li, S.Y., (1995) Acoustic emission analysis for bearing condition monitoring.
Wear, 185, 67–74.
Lin, D. and Makis, V., (2003) Recursive filters for a partially observable system subject to
random failures, Adv. Appl. Prob., 35(1), 207–227.
Lin, D. and Makis, V., (2004) Filters and parameter estimation for a partially observable
system subject to random failures with continuous-range observations, Adv. Appl. Prob.,
36(4), 1212–1230.
Love, C.E., Zhang, Z.G., Zitron, M.A. and Guo, R., (2000) A discrete semi-Markov decision
model to determine the optimal repair/replacement policy under general repairs, Euro. J.
Opl. Res., 125(2), 398–409.
Love, C.E. and Guo, R., (1991) Using proportional hazard modelling in plant maintenance.
Quality and Reliability Engineering International, 7, 7–17.
Makis, V. and Jardine, A.K.S., (1991) Computation of optimal policies in replacement
models, IMA J. Maths. Appl. Business & Industry, 3, 169–176.
Matthew, C. and Wang, W., (2006) A comparison study of proportional hazard and
stochastic filtering when applied to vibration based condition monitoring, submitted to
Int. Trans. OR.
Meher-Homji, C.B., Mistree, F. and Karandikar, S., (1994) An approach for the integration
of condition monitoring and multi-objective optimization for gas turbine maintenance
management. International Journal of Turbo and Jet Engines, 11, 43–51.
Neal, M. and Associates, (1979) Guide to the Condition Monitoring of Machinery, DTI,
London.
Reeves, C.W., (1998) The Vibration Monitoring Handbook, Coxmoor Publishing Company,
Oxford.
Samanta, B., Al-Balushi, K.R. and Al-Araimi, S.A., (2006) Artificial neural networks and
genetic algorithm for bearing fault detection. Soft Computing, 10(3), 264–271.
Wang, W., (2002) A model to predict the residual life of rolling element bearings given
monitored condition monitoring information to date, IMA J. Management Mathematics,
13, 3–16.
Wang, W., (2003) Modelling condition monitoring intervals: a hybrid of simulation and
analytical approaches, J. Opl. Res. Soc., 54, 273–282.
Wang, W., (2006a) A prognosis model for wear prediction based on oil based monitoring, to
appear in J. Opl. Res. Soc.
Wang, W., (2006b) Modelling the probability assessment of the system state using available
condition information, to appear in IMA J. Management Mathematics.
Wang, W. and Christer, A.H., (2000) Towards a general condition based maintenance model
for a stochastic dynamic system, J. Opl. Res. Soc., 51, 145–155.
Wang, W. and Jia, Y., (2001) A multiple condition information sources based maintenance
model and associated prototype software development, Proceedings of COMADEM
2001, Eds. A. Starr and Raj B.K.N. Rao, Elsevier, 889–898.
Wang, W. and Zhang, W., (2005) A model to predict the residual life of aircraft engines
based on oil analysis data, Naval Research Logistics, 52, 276–284.
Wong, M.L.D., Jack, L.B. and Nandi, A.K., (2006) Modified self-organising map for automated
novelty detection applied to vibration signal monitoring. Mech. Sys. & Sig. Proc., 20(3),
593–610.
Zhang, W., (2004) Stochastic Modeling and Applications in Condition Based Maintenance,
PhD thesis, University of Salford, UK.
6

Maintenance Based on Limited Data

David F. Percy
6.1 Introduction
Reliability applications often suffer from a lack of data with which to make informed maintenance decisions. Indeed, the very purpose of maintenance is to prevent
failures, and hence observed failure data, from arising!
This effect is particularly noticeable for high reliability systems such as aircraft
engines and emergency vehicles, and when new production lines are established or
warranty schemes are planned. The evaluation of such systems is a learning pro-
cess and knowledge is continually updated as more information becomes available.
Such issues are of great importance when selecting and fitting mathematical
models to improve the accuracy and utility of these decisions.
This chapter investigates why reliability data are so limited, identifies the
problems that this causes and proposes statistical methods for dealing with these
difficulties. In particular, it considers graphical and numerical summaries, appropri-
ate methods for model development and validation, and the powerful approach of
subjective Bayesian analysis for including expert knowledge about the application
area, such as information pertaining to a particular manufacturing process and ex-
perience of similar operational systems.
Many reliability problems involve making strategic decisions under risk or un-
certainty. Stochastic models involving unknown parameters are often adopted for
this purpose and our concern is how to make inference about, and arising from,
these unknown parameters. The easiest approach involves skilfully guessing the
parameter values by subjective means, which is fine so long as there is sufficient
expert knowledge to perform this task well. More commonly, the parameters are
estimated from observed data and decisions are then made by assuming that the
parameters equal their estimates. This frequentist approach to inference is very
good if there are sufficient data to estimate the parameters well.
However, few data are available in many areas of maintenance and replace-
ment; see Percy et al. (1997) and Kobbacy et al. (1997) for example. There are
several reasons why data are scarce in these situations. New systems and processes
134 D. Percy
naturally offer scant historical data about their performance and reliability. Poor
and incomplete maintenance records are often kept, as the engineers and managers
do not always appreciate the potential benefits that can be achieved through
quantitative modelling and analysis. Of equal importance, many observations of
failure times tend to be censored due to maintenance interventions.
Typical applications take the form of reliability analysis, such as modelling a
critical system's time to failure, and scheduling problems, such as determining
efficient policies for scheduling capital replacement and preventive maintenance,
all of which are considered elsewhere in this book. Other applications include
determining appropriate thresholds for condition monitoring and specifying
warranty schemes for new products. Under these circumstances, it is important to
allow for the uncertainty about the unknown model parameters. This is readily
achieved by adopting the Bayesian approach to inference, as described by
Bernardo and Smith (2000) and O'Hagan (2004).
The structure for the remainder of this chapter is as follows. Section 6.2
explains the need for Bayesian analysis and Section 6.3 introduces the concepts
beginning with Bayes' theorem, which is of great importance in its own right.
Section 6.4 discusses the construction of prior and posterior distributions, whilst
Section 6.5 considers the role of predictive distributions and Section 6.6 considers
techniques for setting the hyperparameters of prior distributions. One of the great
strengths of the Bayesian approach, particularly in relation to practical problems in
reliability and maintenance, is its ability to improve the quality of decision
analysis, as described in Section 6.7. Section 6.8 presents a review of the Bayesian
approach to maintenance and Section 6.9 includes specific case studies that demon-
strate these methods. Finally, Section 6.10 suggests topics for future research and
possible new applications.
For convenience, there follows a list of symbols and acronyms that are used
throughout this chapter.
$P(\cdot)$ : Probability
$E(\cdot)$ : Expected value
$p(\cdot)$ : Probability mass function
$f(\cdot)$ : Probability density function
$R(\cdot)$ : Reliability function
$L(\cdot)$ : Likelihood function
$g(\cdot)$ : Prior or posterior probability density function
$\mathrm{Be}(\theta)$ : Bernoulli distribution
$\mathrm{Po}(\lambda)$ : Poisson distribution
$\mathrm{Ge}(\theta)$ : Geometric distribution
$\mathrm{Ex}(\lambda)$ : Exponential distribution
$\mathrm{No}(\mu, \sigma^2)$ : Normal distribution
$\mathrm{Ga}(\alpha, \beta)$ : Gamma distribution
$\mathrm{We}(\alpha, \beta)$ : Weibull distribution
Maintenance Based on Limited Data 135
Figure 6.1. The link between fundamental aspects of maintenance modelling and analysis
$$R(x; \alpha, \beta) = \exp\left(-\alpha x^{\beta}\right) \qquad (6.1)$$

so that, for a sample $D$ of $n$ censored observation times $x_1, \ldots, x_n$, the likelihood is

$$L(\alpha, \beta; D) \propto \prod_{i=1}^{n} R(x_i; \alpha, \beta) = \exp\left(-\alpha \sum_{i=1}^{n} x_i^{\beta}\right). \qquad (6.2)$$

Setting the partial derivatives of the log-likelihood with respect to $\alpha$ and $\beta$ to zero gives

$$\sum_{i=1}^{n} x_i^{\beta} = 0\,; \qquad (6.3)$$

$$\sum_{i=1}^{n} x_i^{\beta} \log x_i = 0\,. \qquad (6.4)$$

These have no finite solutions for $\alpha$ and $\beta$, so our analysis has been thwarted by
the lack of uncensored data.
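This breakdown is easy to check numerically: with every observation censored, the likelihood of Equation 6.2 keeps increasing as $\alpha$ shrinks toward zero, so its supremum sits on the boundary of the parameter space and no finite maximum likelihood estimate exists. A minimal sketch with invented censoring times:

```python
def log_likelihood(alpha, beta, censored_times):
    # Log of Equation 6.2: L = exp(-alpha * sum(x_i ** beta))
    return -alpha * sum(x ** beta for x in censored_times)

data = [120.0, 340.0, 95.0, 410.0]   # all censored at maintenance interventions
lls = [log_likelihood(a, 1.5, data) for a in (1e-2, 1e-4, 1e-6, 1e-8)]
# the log-likelihood keeps growing as alpha decreases toward zero
```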
$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)} \qquad (6.5)$$

where it is sometimes useful to evaluate the probability of $A$ using the law of total
probability

$$P(A) = P(A \mid B)P(B) + P(A \mid \bar{B})P(\bar{B}) \qquad (6.6)$$

where the event $\bar{B}$ is the complement of the event $B$; that is, the event that $B$
does not occur. Bayes' theorem can be interpreted as a way of transposing the
conditionality from $P(A \mid B)$ to $P(B \mid A)$, or as a way of updating the prior probability $P(B)$ to give the posterior probability $P(B \mid A)$.
Example 6.2 An aircraft warning light comes on if the landing gear on either
side is faulty. Suppose we know that faults only occur 0.4% of the time, that they
are detected with 99.9% reliability and that false alarms only occur 0.5% of the
time when the landing gear is operational. Defining the events W = "warning light
comes on" and L = "landing gear faulty", this information can be summarized as
$P(L) = 0.004$, $P(W \mid L) = 0.999$ and $P(W \mid \bar{L}) = 0.005$, so that

$$P(W) = P(W \mid L)P(L) + P(W \mid \bar{L})P(\bar{L}) = 0.999 \times 0.004 + 0.005 \times 0.996 = 0.008976\,, \qquad (6.7)$$

$$P(L \mid W) = \frac{P(W \mid L)\,P(L)}{P(W)} = \frac{0.999 \times 0.004}{0.008976} = 0.45 \qquad (6.8)$$
to two decimal places. This result implies that most (55%) of these warning lights are
false alarms, despite the apparent accuracy of the alarm system! The reason for this
paradoxical outcome is that the landing gear is operational for the vast majority of the
time. If we were to specify $P(L) = 0.04$ instead, we would obtain $P(L \mid W) = 0.89$,
which is far more acceptable. Similar patterns of behaviour apply to medical
screening procedures: in order to reduce the incidence of misdiagnoses, only
patients deemed to be at risk of an illness are routinely screened for it.
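Example 6.2 can be reproduced in a few lines by combining Equations 6.5 and 6.6 (the function and variable names here are ours, not from the text):

```python
def posterior(prior, sensitivity, false_alarm):
    # P(L | W) via Bayes' theorem, with P(W) from the law of total probability.
    p_w = sensitivity * prior + false_alarm * (1.0 - prior)   # Equation 6.6
    return sensitivity * prior / p_w                          # Equation 6.5

p1 = posterior(0.004, 0.999, 0.005)   # ~0.45: most warnings are false alarms
p2 = posterior(0.04, 0.999, 0.005)    # ~0.89 with the higher fault rate
```

Varying `prior` makes the base-rate effect explicit: the quality of the alarm hardly matters when the monitored fault is very rare.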
Only in the mid-twentieth century were the real benefits of Bayes theorem
appreciated though. Not only does it apply to probabilities, but also to random
variables. For example, suppose X is a discrete random variable and Y is a con-
tinuous random variable. Then the conditional probability density function of Y
given X can be determined using Bayes theorem, if we know the marginal distri-
butions of X and Y , and the conditional distribution of X given Y :
f(y | x) = p(x | y) f(y) / p(x). (6.9)
This rule for transposing the conditionals has proven to be crucial in a variety of
important applications, including quality control, fault diagnosis, image processing,
medical screening and criminal trials.
Even more importantly, we can apply Bayes theorem to unknown model
parameters. This is the foundation of the Bayesian approach to statistical inference
and has had an enormous and profound impact on the subject over the last few
decades. Suppose that a continuous random variable X has a probability distribution that depends on an unknown parameter θ. For example, X might represent the fire-breach time of a door in minutes and it might have an exponential distribution with unknown mean μ = 1/λ.
A naïve approach to statistical inference would simply replace θ by a good
guess based on expert opinions. However, this is inherently inaccurate and can lead
to poor decisions. A better method is the frequentist approach to inference, where-
138 D. Percy
g(θ | D) = f(D | θ) g(θ) / f(D). (6.10)
This enables us to make any inference we wish about θ. We can also use our posterior beliefs about θ for any subsequent inference involving X. The price that we pay for obtaining exact answers and avoiding approximations in this way comes in two parts: the need to assume a prior distribution for θ, and the increase in algebraic complexity. This chapter shows how to resolve these issues.
Example 6.3 Suppose the unknown parameter θ represents the proportion of car batteries that fail within two years and our prior beliefs about θ can be expressed in terms of the probability density function

g(θ) = 2(1 − θ); 0 < θ < 1. (6.11)

Suppose also that we observe three car batteries, one of which fails within two years and two of which do not. Then we can express the likelihood of these data using the binomial probability mass function

p(D | θ) = 3θ(1 − θ)², (6.12)

whence Bayes' theorem gives

g(θ | D) ∝ p(D | θ) g(θ) ∝ θ(1 − θ)³ (6.13)

for 0 < θ < 1, so our posterior beliefs about the unknown parameter can be expressed as a beta distribution θ | D ~ Be(2, 4). We elaborate on this process
further in Section 6.4.
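The conjugate updating in this example can be sketched in a few lines; the Be(1, 2) prior used here is inferred from the stated Be(2, 4) posterior rather than quoted from the text:

```python
# Beta-binomial conjugate update for Example 6.3 (a sketch).
a, b = 1.0, 2.0              # prior Be(a, b), inferred from the posterior
failures, survivors = 1, 2   # one battery fails within two years, two do not

# Conjugate update: the posterior is Be(a + failures, b + survivors)
a_post, b_post = a + failures, b + survivors
posterior_mean = a_post / (a_post + b_post)

print(a_post, b_post)            # 2.0 4.0 -> Be(2, 4), as in the text
print(round(posterior_mean, 3))  # 0.333
```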
In general, for a random sample D = {x_1, x_2, …, x_n} the likelihood is

L(θ; D) = ∏_{i=1}^n f(x_i | θ), (6.14)

and Bayes' theorem takes the form

g(θ | D) ∝ L(θ; D) g(θ) (6.15)

or, in words, posterior ∝ likelihood × prior.

Example 6.4 Reconsidering the car batteries of Example 6.3, the likelihood is

L(θ; D) ∝ ∏_{i=1}^3 p(x_i | θ) = θ(1 − θ)² (6.16)

and so

g(θ | D) ∝ L(θ; D) g(θ) ∝ θ(1 − θ)³ (6.17)

for 0 < θ < 1, which agrees with the result we obtained previously. The corresponding prior and posterior probability density functions are graphed for comparison in Figure 6.2.
Figure 6.2. Prior and posterior probability density functions for Example 6.4
Having evaluated a posterior distribution using this rule, we can evaluate the posterior mode θ̂ such that

g(θ̂ | D) ≥ g(θ | D) for all θ, (6.18)

typically by solving

(d/dθ) { L(θ; D) g(θ) } = 0. (6.19)
However, to find the median or mean, and to use this posterior density to make any further inference, we need to determine the constant of proportionality in the fundamental rule above. In standard situations, we can recognise the functional form of L(θ; D) g(θ) and hence quote published work on probability distributions to determine this constant of proportionality and so derive g(θ | D) explicitly. In non-standard situations, we determine this constant of proportionality using numerical quadrature or simulation, both of which we discuss later.
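As an illustration of the numerical quadrature route, the sketch below normalises the unnormalised posterior θ(1 − θ)³ of Example 6.3 with composite Simpson's rule; the exact constant is B(2, 4) = 1/20:

```python
# Normalising an unnormalised posterior by numerical quadrature (a sketch).
# Here g(theta | D) is proportional to theta * (1 - theta)**3 on (0, 1).
def unnorm(theta):
    return theta * (1.0 - theta) ** 3

# Composite Simpson's rule with n (even) subintervals on (0, 1)
n = 1000
h = 1.0 / n
total = unnorm(0.0) + unnorm(1.0)
for i in range(1, n):
    total += (4 if i % 2 else 2) * unnorm(i * h)
const = total * h / 3.0

# the normalised posterior is then unnorm(theta) / const
print(round(const, 6))  # 0.05, i.e. B(2, 4) = 1/20
```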
There are two main types of prior distribution, which loosely correspond with objective priors and subjective priors. As strictly objective priors do not exist, the first category is generally known as reference priors; these are used when little prior information is available, and as a benchmark against which to compare the output from using subjective priors. This offers a default Bayesian analysis that is not dependent upon any personal prior knowledge. The simplest reference prior is proposed by the Bayes-Laplace postulate, which simply recommends the use of a uniform or locally-uniform prior g(θ) ∝ 1 for all θ in the region of support R.
An alternative reference prior for a scalar parameter θ is Jeffreys' prior

g(θ) ∝ √I(θ); θ ∈ R (6.20)

where

I(θ) = −E_X[ d² log f(x | θ) / dθ² ] (6.21)

is the Fisher information.
For subjective priors, it is especially convenient if the prior and posterior densities belong to the same family C of distributions,

g(θ) ∈ C ⟹ g(θ | D) ∈ C, (6.22)

so that the posterior density has the same functional form as the prior density. This property is particularly appealing, as our prior knowledge can be regarded as posterior to some previous information. Again, we tend to suppose that components in multi-parameter problems are independent, so that their joint prior density is the product of the corresponding univariate marginal priors.
Such closed priors exist for sampling distributions f(x | θ) that belong to the exponential family, and are called natural conjugate priors. This family includes the Bernoulli, binomial, geometric, negative binomial, Poisson, exponential, gamma, normal and lognormal models. For a model in the exponential family with scalar parameter θ, we can express the probability density or mass function in the form
f(x | θ) = exp{ a(x) b(θ) + c(x) + d(θ) } (6.23)

for suitable constants k_1 and k_2. However, any conjugate prior of the form
g(θ | D) ∝ L(θ; D) g(θ) (6.26)
The Bayesian approach to inference makes statements about the parameters given the data, which are precisely what is required. O'Hagan (1994) commented that the Bayesian approach is fundamentally sound, very flexible, produces clear and direct inferences and makes use of all the available information. In contrast, he noted that the classical approach suffers from some philosophical flaws, has a restrictive range of inferences with rather indirect meanings and ignores prior information.
One of the most important and useful features of the Bayesian approach arises when we wish to make predictions about future values of the random variable X, where f(x | θ) is specified. If θ is unknown, the prior predictive probability density function of X is

f(x) = ∫ f(x | θ) g(θ) dθ. (6.27)

Similarly, the posterior predictive probability density function is

f(x | D) = ∫ f(x | θ) g(θ | D) dθ ∝ ∫ f(x | θ) L(θ; D) g(θ) dθ. (6.28)
Example 6.5 Consider the time to breakdown X of an industrial pulper, modelled as X | λ ~ Ex(λ) with reference prior g(λ) ∝ 1/λ. The prior predictive density is then

f(x) ∝ ∫_0^∞ λ exp(−λx) (1/λ) dλ = 1/x (6.29)

for x > 0, which is improper. However, this does provide information about the relative likelihoods for different values of X. For example, the ratio of probabilities that X lies in the intervals (5,10) and (10,20) is given by
P(5 < X < 10) / P(10 < X < 20) = [ ∫_5^10 (1/x) dx ] / [ ∫_10^20 (1/x) dx ] = (log 10 − log 5) / (log 20 − log 10) = 1 (6.30)
so the time to breakdown of this pulper is equally likely to lie in these two intervals, without taking account of any subjective or empirical information that might be available. Even if we subsequently observe a random sample of lifetimes D = {x_1, x_2, …, x_n}, the posterior predictive density
f(x | D) ∝ ∫_0^∞ λ exp(−λx) λ^{n−1} exp( −λ Σ_{i=1}^n x_i ) dλ = n! / ( x + Σ_{i=1}^n x_i )^{n+1}; x > 0 (6.31)
is still improper, though we can evaluate relative likelihoods as we did for the prior predictive density. In contrast, a frequentist approach would merely generate the approximation X | D ~ Ex(1/x̄) and could do no better than guess a value for X before observing any data.
Example 6.6 Reconsidering the time to breakdown of the pulper in Example 6.5, suppose we instead use a gamma prior λ ~ Ga(a, b) to reflect the knowledge of experts on site. The prior predictive density is now given by

f(x) = ∫_0^∞ λ exp(−λx) [ b^a / Γ(a) ] λ^{a−1} exp(−bλ) dλ = a b^a / (x + b)^{a+1}; x > 0 (6.32)
and the posterior predictive density is

f(x | D) ∝ ∫_0^∞ λ exp(−λx) λ^{a+n−1} exp( −λ ( b + Σ_{i=1}^n x_i ) ) dλ = Γ(a + n + 1) / ( x + b + Σ_{i=1}^n x_i )^{a+n+1}; x > 0. (6.33)
The prior predictive density can be used to elicit a and b from the experts. Recall that

f(x) = a b^a / (x + b)^{a+1}; x > 0, (6.34)

with corresponding cumulative distribution function

F(x) = ∫_0^x a b^a / (u + b)^{a+1} du = 1 − ( b / (x + b) )^a; x > 0. (6.35)

Suppose the experts judge the tertiles of the time to breakdown to be 2,500 and 7,500, so that

1/3 = 1 − ( b / (2,500 + b) )^a; (6.36)

2/3 = 1 − ( b / (7,500 + b) )^a. (6.37)
There are many algorithms for solving simultaneous nonlinear equations and
several computer packages that contain these algorithms. Mathcad gives the values
a = 3.5240 and b = 20,502, so the prior distribution for the exponential parameter is specified completely as λ ~ Ga(3.5240, 20,502).
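Equations 6.36 and 6.37 are also easy to solve without a commercial package. The sketch below eliminates a between the two equations and bisects on b; this is our own solver, not the Mathcad routine the text used:

```python
# Solving Equations 6.36-6.37 for the gamma prior parameters a and b (sketch).
# F(x) = 1 - (b/(x+b))**a with F(2500) = 1/3 and F(7500) = 2/3.
import math

r = math.log(2.0 / 3.0) / math.log(1.0 / 3.0)  # target ratio after eliminating a

def h(b):
    # ratio of the two log-quantile equations, as a function of b alone
    return math.log(b / (2500.0 + b)) / math.log(b / (7500.0 + b)) - r

# bisection: h is decreasing with a sign change on (1e3, 1e6)
lo, hi = 1e3, 1e6
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if h(lo) * h(mid) <= 0:
        hi = mid
    else:
        lo = mid
b = 0.5 * (lo + hi)
a = math.log(2.0 / 3.0) / math.log(b / (2500.0 + b))

print(round(a, 4), round(b))  # approximately 3.524 and 20502, as in the text
```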
analysis to allow for the uncertainty attached to these parameters. This effect is
particularly important when dealing with limited amounts of data, a common
problem in the area of reliability and maintenance and the subject of this chapter.
For example, the author recently acquired a set of data relating to the performance of an industrial valve subject to corrective and preventive maintenance. Only 12 uncensored lifetime observations were available, despite the fact that this represents six years of data collection. From a frequentist point of view, it would be unwise to fit any model involving more than three parameters to these data. However, the Bayesian is not constrained in this manner, as prior knowledge gleaned from experience of similar systems can be incorporated in the analysis. Of course, parsimony still dictates that models with fewer parameters are more robust for predictive purposes, even if larger models provide better fits to the observed data. We can resolve such issues using model comparison methods based on prior odds, Bayes factors and posterior odds, which we do not discuss here.
Consider a set of possible decisions d with associated utility function u(d, θ), which depends on an unknown parameter θ. The best decision is that which maximizes the prior expected utility

E{u(d, θ)} = ∫ u(d, θ) g(θ) dθ (6.38)

or, equivalently, minimizes the prior expected loss

E{l(d, θ)} = ∫ l(d, θ) g(θ) dθ. (6.39)

If data D are available, we instead maximize the posterior expected utility

E{u(d, θ) | D} = ∫ u(d, θ) g(θ | D) dθ (6.40)

where

g(θ | D) ∝ L(θ; D) g(θ). (6.41)
Suppose, for example, that we must choose one of several systems i whose lifetimes are exponentially distributed,

X_i | λ_i ~ Ex(λ_i), (6.42)

with loss function

l(i, λ_i) = c_i λ_i. (6.43)

Then

E{l(i, λ_i)} = c_i E(λ_i) (6.44)

and we choose the system i which minimizes this expected loss, where E(λ_i) is the prior mean.
Sampling model                       Parameter     Prior             Posterior
BERNOULLI  Be(θ)                     probability   beta Be(a, b)     Be(a + nx̄, b + n(1 − x̄))
POISSON  Po(θ)                       mean          gamma Ga(a, b)    Ga(a + nx̄, b + n)
GEOMETRIC  Ge(θ)                     probability   beta Be(a, b)     Be(a + n, b + n(x̄ − 1))
EXPONENTIAL  Ex(θ)                   hazard        gamma Ga(a, b)    Ga(a + n, b + nx̄)
NORMAL  No(θ, λ), known precision λ  mean          normal No(a, b)   No( (ab + λnx̄)/(b + λn), b + λn )
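Any row of the table can be exercised mechanically. A sketch for the exponential-hazard row, with made-up lifetimes:

```python
# Conjugate updating for the exponential model (a sketch): prior Ga(a, b) on
# the hazard theta updates to Ga(a + n, b + n*xbar). The data are hypothetical.
a, b = 2.0, 100.0                  # prior Ga(a, b) for the hazard theta
data = [30.0, 55.0, 20.0, 45.0]    # hypothetical observed lifetimes

n = len(data)
a_post = a + n
b_post = b + sum(data)             # n * xbar is just the sum of the lifetimes

prior_mean = a / b
posterior_mean = a_post / b_post
print(a_post, b_post)                                  # 6.0 250.0
print(round(prior_mean, 3), round(posterior_mean, 3))  # 0.02 0.024
```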
(1969), Bury (1972) and Canavos and Tsokos (1973), who are concerned particularly with analysis of the Weibull distribution. Singpurwalla (1988) and Percy (2004) are concerned with prior elicitation for reliability analysis, and O'Hagan (1998) presents an accessible, general discussion of Bayesian methods.
There are many other academic publications dealing with Bayesian approaches in maintenance, and a representative sample of recent articles includes those by van Noortwijk et al. (1992), Mazzuchi and Soyer (1996), Chen and Popova (2000), Apeland and Aven (2000), Kallen and van Noortwijk (2005) and Celeux et al. (2006). The general aim is to determine optimal policies for maintenance scheduling and operation, by combining subjective prior knowledge with observed data using Bayes' theorem, and employing belief networks for larger systems.
The proportion θ of defective test versions of digital set top boxes in a large shipment is unknown, but a beta prior probability density function of the form

g(θ) = θ^{a−1} (1 − θ)^{b−1} / B(a, b); 0 < θ < 1 (6.45)
As a final exercise, suppose we select a further box at random from the shipment and consider the random variable X which takes the value 0 if the box is functional, or 1 if it is defective. Then Equation 6.28 can be used to determine the posterior predictive probability mass function for X given the data above as
p(x | D) = 0.967 for x = 0 and 0.033 for x = 1, (6.46)

so the posterior probability that a randomly chosen box from the shipment is defective is P(X = 1 | D) = 0.033, or 1 in 30.
Figure 6.3. Prior and posterior probability density functions for digital set top boxes
g(λ) = (40^10 λ^9 / 9!) exp(−40λ); λ > 0, (6.47)

which is the Ga(10, 40) density.
She runs an experiment for one day, replacing each flat battery by an identical fully charged battery after failure, so that the total number of failures X has a Poisson distribution with probability mass function

p(x | λ) = (24λ)^x exp(−24λ) / x!; x = 0, 1, 2, …
Figure 6.4. Prior and posterior probability density functions for rechargeable tool batteries
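The corresponding gamma-Poisson update is a one-liner. The prior Ga(10, 40) comes from Equation 6.47; the observed failure count below is hypothetical, since the actual observation lies outside this excerpt:

```python
# Gamma-Poisson conjugate update for the battery example (a sketch).
a, b = 10, 40      # prior Ga(a, b) for the hourly failure rate lam
hours = 24         # one day of testing
x = 8              # hypothetical number of battery failures observed

# with X | lam ~ Po(hours * lam), the posterior is Ga(a + x, b + hours)
a_post, b_post = a + x, b + hours

print(a_post, b_post)   # 18 64
print(a_post / b_post)  # posterior mean of lam, 0.28125
```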
6.10 Conclusions
Bayesian inference represents a methodology for mathematical modelling and
statistical analysis of random variables and unknown parameters. It provides an
excellent alternative to the frequentist approach which gained immense popularity
throughout the twentieth century. Whereas the frequentist approach is based upon
the restrictive inference of point estimates, confidence intervals, significance tests,
p-values and asymptotic approximations, the Bayesian approach is based upon
probability theory and provides complete solutions to practical problems.
Advocates of the Bayesian approach regard it as superior to the frequentist
approach in most circumstances and infinitely superior in some. However, it does
depend upon the existence and specification of subjective probability to represent
individual beliefs, whereas the frequentist approach is almost completely objective.
Partial resolution of these difficulties was addressed in Section 6.6 and continues to be improved upon, particularly with regard to eliciting subjective prior knowledge for multiparameter models. The approach advocated here also involves more analytical and computational complexity, though this is not much of a hindrance with modern computing power.
In particular, this approach often involves intractable integrals of the forms

g(θ | D) = L(θ; D) g(θ) / ∫ L(θ; D) g(θ) dθ for posterior densities; (6.49)

f(x) = ∫ f(x | θ) g(θ) dθ for predictive densities; (6.50)

E{u(d, θ)} = ∫ u(d, θ) g(θ) dθ for expected utilities. (6.51)

Monte Carlo simulation can be used to approximate any integral of this form by generating many pseudo-random numbers θ_1, θ_2, …, θ_n from the prior or posterior density in the integrand and evaluating the unbiased estimator

∫ s(θ) g(θ) dθ ≈ (1/n) Σ_{i=1}^n s(θ_i). (6.52)
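A sketch of the Monte Carlo estimator (6.52) applied to a prior predictive density of the form (6.50): for the exponential model with a Ga(a, b) prior there is the closed form (6.32) to check against. The parameter values are illustrative:

```python
# Monte Carlo approximation of f(x) = integral of f(x|lam) g(lam) dlam,
# with lam ~ Ga(a, b); checked against the closed form a*b**a/(x+b)**(a+1).
import math
import random

random.seed(1)
a, b, x = 2.0, 100.0, 50.0

# draw lam_i from the gamma prior; random.gammavariate takes (shape, scale)
n = 200_000
draws = (random.gammavariate(a, 1.0 / b) for _ in range(n))
estimate = sum(lam * math.exp(-lam * x) for lam in draws) / n

exact = a * b ** a / (x + b) ** (a + 1)
print(round(exact, 6))                       # 0.005926
print(abs(estimate - exact) / exact < 0.05)  # within 5% of the exact value
```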
6.11 References

Apeland S, Aven T, (2000) Risk based maintenance optimization: foundational issues. Reliability Engineering & System Safety 67:285–292
Bernardo JM, Smith AFM, (2000) Bayesian Theory. Chichester: Wiley
Bury KV, (1972) Bayesian decision analysis of the hazard rate for a two-parameter Weibull process. IEEE Transactions on Reliability 21:159–169
Canavos GC, Tsokos CP, (1973) Bayesian estimation of life parameters in the Weibull distribution. Operations Research 21:755–763
Celeux G, Corset F, Lannoy A, Ricard B, (2006) Designing a Bayesian network for preventive maintenance from expert opinions in a rapid and reliable way. Reliability Engineering & System Safety 91:849–856
Chen TM, Popova E, (2000) Bayesian maintenance policies during a warranty period. Communications in Statistics 16:121–142
Jeffreys H, (1998) Theory of Probability. Oxford: Oxford University Press
Kallen MJ, van Noortwijk JM, (2005) Optimal maintenance decisions under imperfect maintenance. Reliability Engineering & System Safety 90:177–185
Kobbacy KAH, Percy DF, Fawzi BB, (1995) Sensitivity analyses for preventive-maintenance models. IMA Journal of Mathematics Applied in Business and Industry 6:53–66
Kobbacy KAH, Percy DF, Fawzi BB, (1997) Small data sets and preventive maintenance modelling. Journal of Quality in Maintenance Engineering 3:136–142
Lee PM, (2004) Bayesian Statistics: an Introduction. London: Arnold
Martz HF, Waller RA, (1982) Bayesian Reliability Analysis. New York: Wiley
Mazzuchi TA, Soyer R, (1996) A Bayesian perspective on some replacement strategies. Reliability Engineering & System Safety 51:295–303
O'Hagan A, (1994) Kendall's Advanced Theory of Statistics Volume 2B: Bayesian Inference. London: Arnold
O'Hagan A, (1998) Eliciting expert beliefs in substantial practical applications. The Statistician 47:21–35
Percy DF, (2002) Bayesian enhanced strategic decision making for reliability. European Journal of Operational Research 139:133–145
Percy DF, (2004) Subjective priors for maintenance models. Journal of Quality in Maintenance Engineering 10:221–227
Percy DF, Kobbacy KAH, Fawzi BB, (1997) Setting preventive maintenance schedules when data are sparse. International Journal of Production Economics 51:223–234
Singpurwalla ND, (1988) An interactive PC-based procedure for reliability assessment incorporating expert opinion and survival data. Journal of the American Statistical Association 83:43–51
Soland RM, (1969) Bayesian analysis of the Weibull process with unknown scale and shape parameters. IEEE Transactions on Reliability 18:181–184
van Noortwijk JM, Dekker A, Cooke RM, Mazzuchi TA, (1992) Expert judgement in maintenance optimization. IEEE Transactions on Reliability 41:427–432
7
Reliability Prediction and Accelerated Testing
E. A. Elsayed
7.1 Introduction
Reliability is one of the key quality characteristics of components, products and
systems. It cannot be directly measured and assessed like other quality characteris-
tics but can only be predicted for given times and conditions. Its value depends on
the use conditions of the product as well as the time at which it is to be predicted.
Reliability prediction has a major impact on critical decisions such as the optimum release time of the product, the type and duration of the warranty policy and its associated cost, and the determination of the optimum maintenance and replacement schedules. Therefore, it is important to provide accurate reliability predictions over time in order to determine accurately the repair, inspection and replacement strategies of products and systems.
Reliability predictions are based on testing a small number of samples or prototypes of the product. The difficulty in predicting reliability is further complicated by many limitations such as the available time to conduct the test and budget constraints, among others. Testing products at design conditions requires extensive time, a large number of units and high cost. Clearly some kind of reliability testing, other than testing at normal design conditions, is needed. One of the most commonly used approaches for testing products within the above stated constraints is accelerated life testing (ALT), where units or products are subjected to more severe stress conditions than normal operating conditions to accelerate their failure times; the test results are then used to predict (extrapolate) the reliability at design conditions. This chapter will address the determination of the optimum maintenance schedule at normal operating conditions while utilizing the results from accelerated testing.
We classify ALT into two types: accelerated failure time testing (AFTT) and accelerated degradation testing (ADT). AFTT is conducted when accelerated conditions cause test units to fail without experiencing failure mechanisms different from those occurring at normal operating conditions, and when there are enough units to be tested at different conditions. Moreover, the economics of conducting AFTT need to be justified, as the test is destructive and its duration is
directly related to the reliability of test units and the applied stresses. Finally, testing at stresses far from normal makes it difficult to predict reliability accurately at normal conditions, as in some cases few or no failures are observed even under accelerated conditions, making reliability inference via failure time analysis highly inaccurate, if not impossible. On the other hand, ADT is a viable alternative to AFTT when the product's physical characteristics or performance indices leading to failure (e.g. drift in the resistance value of a resistor, change in light intensity of light emitting diodes (LED) and loss of strength of a bridge structure) experience degradation over time. Moreover, significant degradation data can be obtained by observing the degradation of a small number of units over time. Degradation testing may also be conducted either at normal or accelerated conditions, and no actual failure is required for reliability inference (Liao 2004).
In this chapter, we address the issues associated with conducting accelerated life
testing and describe how the reliability models obtained from ALT are used in the
determination of the optimum maintenance schedules at normal operating conditions.
This chapter is organized as follows. Section 7.1 provides an overview of the role of
reliability prediction and the importance of accelerated life testing. In Section 7.2 we
present the two most commonly used accelerated life testing types in reliability
engineering. The approaches and models for predicting reliability using accelerated
life testing are described in Section 7.3 while Section 7.4 focuses on mathematical
formulation and solution of the design of accelerated life testing plans. Section 7.5
shows how accelerated life testing is related to maintenance decisions at normal
operating conditions. Models to determine the optimum preventive maintenance
schedules for both failure time models and degradation models are presented in
Section 7.6. A summary of the chapter is presented in Section 7.7. We begin by
describing the ALT types.
It is known that the more reliable the device, the more difficult it is to measure its
reliability. In fact, many devices last so long that life testing at normal operating
conditions is impractical. Furthermore, testing devices or components at normal
operating conditions requires an extensive amount of time and a large number of
devices in order to obtain accurate measures of their reliabilities.
ALT is commonly used to obtain reliability and failure rate estimates of devices
and components in a much shorter time.
A simple way to accelerate the life of many components or products that are
used on a continuous time basis such as tires and light bulbs is to accelerate time
(i.e. run the product at a higher usage rate). It is typically assumed that the number
of cycles, hours, etc., to failure during testing is the same as would be observed at
the normal usage rate. For example, in evaluating the failure time distribution of
light bulbs which are used on the average about 6 h per day, one year of operating
experience can be compressed into three months by using the light bulb for 24 h
every day. The advantage of this type of testing is that no assumptions need to be
Reliability Prediction and Accelerated Testing 157
made about the relationship of the failure time distributions at both the accelerated
and the normal conditions. However, it is not always true that the number of cycles
to failure at high usage rate is the same as that of the normal usage rate. Moreover,
the effect of aging is ignored. Therefore, this type of testing must be run with
special care to assure that product operation and stress remain normal in all regards
except usage rate and the effect of aging is taken into account, if possible.
An alternative to the above accelerated failure time testing is to accelerate
stress (apply stresses more severe than that of the normal conditions) to shorten
product or component life. Typical accelerating stresses are temperature, voltage,
humidity, pressure, vibration, and fatigue cycling. It is important to recognize the
type of stress which indeed accelerates product or component life. Suitable
accelerating stresses need to be determined. One may also wish to know how
product life depends on several stresses operating simultaneously. In accelerated
life testing, the test stress levels should also be controlled. They cannot be so high as to produce failure modes that rarely occur or are unlikely to occur at normal conditions. Yet the levels should be high enough to yield enough failures similar to
those that exist at the design (operating) stress. The limited range of the stress
levels needs to be specified in the test plans to avoid invalid or biased estimates of
reliability. The stress application loading can be constant, increase (or decrease)
continuously or in steps, vary cyclically, or vary randomly or combinations of
these loadings. The choice of such stress loading depends on how the product is
loaded in service and on practical and theoretical limitations (Shyur 1996).
In some cases, applying high stresses might not induce failures or result in sufficient data, and reliability inference via failure time analysis becomes highly inaccurate, if not impossible. However, if a product's physical characteristics or performance indices leading to failure experience degradation over time, then degradation analysis could be a viable alternative to traditional failure time analysis. The advantages of degradation modeling over time-to-failure modeling are
significant. Indeed, degradation data may provide more reliability information than
would otherwise be available from time-to-failure data with censoring. Moreover,
degradation testing may be conducted either at normal or accelerated conditions,
and no actual failure is required for reliability inference.
Degradation data needed for reliability inference may be obtained from two sources: the first is field application and the second is degradation testing experiments. The first source requires an extensive data collection system over a long time. Since the collected data are often subject to a highly random stress environment and human errors, the data may exhibit significant volatility and their accuracy is sometimes questionable, limiting their use for reliability inference and prediction. Data from the second source support prognostics, a process of predicting the future state of a product (or component): degradation data analysis might be used in this process to minimize field failures and reduce life-cycle expenses by recommending condition-based maintenance of observed components or systems. Moreover, degradation testing is usually conducted to demonstrate a product's reliability and helps in revealing the main failure mechanisms and the major failure-causing stress factors.
The failure times at each stress level are used to determine the most appropriate
failure time distribution along with its parameters. We refer to these models as
AFT (accelerated failure time). Parametric models assume that the failure times at
different stress levels are related to each other by a common failure time distribu-
tion with different parameters. Usually, the shape parameter of the failure time
distribution remains unchanged for all stress levels, but the scale parameter may
present a multiplicative relationship with the stress levels. For practical purposes,
we assume that the time scale transformation (also referred to as acceleration
factor, AF > 1 ) is constant, which implies that we have a true linear acceleration.
Thus the relationships between the accelerated and normal conditions are summarized as follows (Tobias and Trindade 1986; Elsayed 1996). Let the subscripts o and s refer to the operating conditions and stress conditions, respectively. The relationship between the time to failure at operating conditions and stress conditions is

t_o = AF t_s. (7.1)

The corresponding cumulative distribution functions are related by

F_o(t) = F_s(t / AF), (7.2)

the probability density functions by

f_o(t) = (1 / AF) f_s(t / AF), (7.3)

and the hazard functions by

h_o(t) = (1 / AF) h_s(t / AF). (7.4)
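The transformations in Equations 7.1–7.4 are easy to verify numerically. A sketch for an exponential life model with an assumed acceleration factor:

```python
# Checking the time-transformation relations (7.1)-(7.2) for an exponential
# life model (a sketch): if t_s ~ Ex(lam_s) at stress, then t_o = AF * t_s has
# cdf F_o(t) = F_s(t / AF), i.e. an exponential with rate lam_s / AF.
import math

AF = 4.0        # assumed acceleration factor
lam_s = 0.01    # assumed failure rate (per hour) at stress conditions

def F_s(t):
    return 1.0 - math.exp(-lam_s * t)

def F_o(t):
    return F_s(t / AF)     # Equation 7.2

# the operating-condition cdf is indeed exponential with rate lam_s / AF
for t in (100.0, 500.0, 2000.0):
    direct = 1.0 - math.exp(-(lam_s / AF) * t)
    assert abs(F_o(t) - direct) < 1e-12

print(round(F_s(500.0), 4), round(F_o(500.0), 4))  # 0.9933 0.7135
```

The same transformation applied to the density and hazard yields the 1/AF scale factor in Equations 7.3 and 7.4.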
Nonparametric models relax the requirement of the common failure time distribution,
i.e., no common failure time distribution is required among all stress levels. Several
nonparametric models have been developed and validated in recent years. We
describe these models below.
The proportional hazards (PH) model specifies

λ(t; z) = λ₀(t) exp(βz). (7.5)

The extended hazard regression (EHR) model (Etezadi-Amoli 1985; Etezadi-Amoli and Ciampi 1987; Shyur et al. 1999) is proposed to combine the PH and AFT models into one form:

λ(t; z) = λ₀( t e^{αz} ) exp(βz). (7.6)

The unknowns of this model are the regression coefficients α and β and the unspecified baseline hazard function λ₀(t). The model reflects that the covariate z has both a time scale changing effect and a hazard multiplicative effect. It becomes the PH model when α = 0 and the AFT model when α = β.
Elsayed et al. (2006) propose a new model called the Extended Linear Hazard Regression (ELHR) model. The ELHR model (e.g., with one covariate) assumes those coefficients to be changing linearly with time:

λ(t; z) = λ₀( t e^{(α₀ + α₁t)z} ) exp( (β₀ + β₁t)z ). (7.7)

The model considers the proportional hazards effect and the time scale changing effect, as well as a time-varying coefficients effect. It encompasses all previously developed models as special cases. It may provide a refined model fit to failure time data and a better representation of complex failure processes.
Since the covariate coefficients and the unspecified baseline hazard cannot be expressed separately, the partial likelihood method is not suitable for estimating the unknown parameters. Elsayed et al. (2006) propose the maximum likelihood method, which requires the baseline hazard function to be specified in a parametric form. In the EHR model, the baseline hazard function has two specific forms: one is a quadratic function and the other is a quadratic spline. In the proposed ELHR model, we assume the baseline hazard function λ₀(t) to be a quadratic function:

λ₀(t) = γ₀ + γ₁t + γ₂t². (7.8)
Writing θ₀ = β₀ + α₀, θ₁ = β₁ + α₁, φ₀ = β₀ + 2α₀ and φ₁ = β₁ + 2α₁, the cumulative hazard function becomes

Λ(t; z) = ∫_0^t λ(u; z) du
= ∫_0^t γ₀ e^{β₀z + β₁zu} du + ∫_0^t γ₁ u e^{θ₀z + θ₁zu} du + ∫_0^t γ₂ u² e^{φ₀z + φ₁zu} du
= (γ₀ / β₁z) ( e^{β₀z + β₁zt} − e^{β₀z} )
+ (γ₁t / θ₁z) e^{θ₀z + θ₁zt} − (γ₁ / (θ₁z)²) ( e^{θ₀z + θ₁zt} − e^{θ₀z} )
+ (γ₂t² / φ₁z) e^{φ₀z + φ₁zt} − (2γ₂t / (φ₁z)²) e^{φ₀z + φ₁zt} + (2γ₂ / (φ₁z)³) ( e^{φ₀z + φ₁zt} − e^{φ₀z} ) (7.9)

and the reliability and density functions follow as

R(t; z) = exp( −Λ(t; z) ),
f(t; z) = λ(t; z) exp( −Λ(t; z) ).
e(x | z) = e₀(x) exp(βᵀz). (7.10)

We refer to this model as the proportional mean residual life regression model, which is used to model accelerated life testing. Clearly e₀(x) serves as the MRL corresponding to a baseline reliability function R₀(t) and is called the baseline mean residual function; e(t | z) is the conditional mean residual life function of T − t given T > t and Z = z. Here zᵀ = (z₁, z₂, …, z_p) is the vector of covariates, βᵀ = (β₁, β₂, …, β_p) is the vector of coefficients associated with the covariates, and p is the number of covariates. Typically, we can experimentally obtain {(t_i, z_i), i = 1, 2, …, n}, the set of failure times and vectors of covariates for each unit (Zhao and Elsayed 2005). The main assumption of this model is the proportionality of mean residual lives under the applied stresses. In other words, the mean residual life of a unit subjected to high stress is proportional to the mean residual life of a unit subjected to low stress.
The proportional odds (PO) model specifies

F(t; z) / (1 − F(t; z)) = exp(βz) F₀(t) / (1 − F₀(t)) (7.12)

or, equivalently, in terms of the odds function ϑ,

ϑ(t; z) = exp(βz) ϑ₀(t). (7.13)

It follows that

log[ϑ(t; z₁)] − log[ϑ(t; z₂)] = (z₁ − z₂)β,

which is independent of the baseline odds function ϑ₀(t) and the time t. Hence, the odds functions are constantly proportional to each other. The baseline odds function could be any monotone increasing function of time t with the property ϑ₀(0) = 0. When ϑ₀(t) = t, the PO model presented by Equation 7.13 becomes the log-logistic accelerated failure time model (Bennett 1983), which is a special case of the general PO models.
In order to utilize the PO model in predicting reliability at normal operating conditions, it is important that both the baseline odds function and the covariate parameter β be estimated accurately. Since the baseline odds function of the general PO models could be any monotone increasing function, it is important to define a viable baseline odds function structure to approximate most, if not all, of the possible odds functions. In order to find such a universal baseline odds function, we investigate the properties of the odds function and its relation to the hazard rate function.
The odds function \theta(t) is defined by

\theta(t) = \frac{F(t)}{1-F(t)} = \frac{1-R(t)}{R(t)} = \frac{1}{R(t)} - 1     (7.14)

From the properties of the reliability function and its relation to the odds function
shown in Equation 7.14, we can readily derive the following properties of the odds
function \theta(t), where \Lambda(t) denotes the cumulative hazard function and \lambda(t) the hazard rate:

1. \theta(0) = 0, \; \theta(\infty) = \infty
2. \theta(t) is a monotonically increasing function of time
3. \theta(t) = \frac{1 - \exp[-\Lambda(t)]}{\exp[-\Lambda(t)]} = \exp[\Lambda(t)] - 1, and \Lambda(t) = \ln[\theta(t) + 1]
4. \lambda(t) = \frac{\theta'(t)}{\theta(t) + 1}
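These properties are easy to verify numerically. The sketch below (in Python, assuming a Weibull reliability function purely for illustration — the shape and scale values are not from the text) checks properties 3 and 4:

```python
import math

def reliability(t, lam=1.0, beta=1.5):
    """Weibull reliability R(t) = exp(-(lam*t)**beta) -- an assumed example."""
    return math.exp(-((lam * t) ** beta))

def odds(t, lam=1.0, beta=1.5):
    """Odds function theta(t) = 1/R(t) - 1 (Equation 7.14)."""
    return 1.0 / reliability(t, lam, beta) - 1.0

def cum_hazard(t, lam=1.0, beta=1.5):
    """Cumulative hazard Lambda(t) = -ln R(t)."""
    return (lam * t) ** beta

# Property 3: theta(t) = exp(Lambda(t)) - 1 and Lambda(t) = ln(theta(t) + 1)
for t in (0.5, 1.0, 2.0):
    assert abs(odds(t) - (math.exp(cum_hazard(t)) - 1.0)) < 1e-9
    assert abs(cum_hazard(t) - math.log(odds(t) + 1.0)) < 1e-9

# Property 4: hazard rate lambda(t) = theta'(t) / (theta(t) + 1),
# checked against a central-difference derivative of the odds function
def hazard(t, lam=1.0, beta=1.5):
    return beta * lam * (lam * t) ** (beta - 1)

eps = 1e-6
t = 1.3
theta_prime = (odds(t + eps) - odds(t - eps)) / (2 * eps)
assert abs(hazard(t) - theta_prime / (odds(t) + 1.0)) < 1e-4
```

The same checks go through for any lifetime distribution with a continuous hazard, since they rely only on the identity R(t) = 1/(1 + θ(t)).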
An ALT plan requires the determination of the type of stress, method of applying
stress, stress levels, the number of units to be tested at each stress level and an
applicable accelerated life testing model that relates the failure times at accelerated
conditions to those at normal conditions.
When designing an ALT, we need to address the following issues: (a) select the
stress types to use in the experiment, (b) determine the stress levels for each stress
type selected, (c) determine the proportion of devices to be allocated to each stress
level (Elsayed and Jiao 2002). We refer the reader to Meeker and Escobar (1998)
and Nelson (2004) for other approaches for the design of ALT plans.
We consider the selection of the stress level zi and the proportion of devices pi to
allocate for each zi such that the most accurate reliability estimate at use conditions
zD can be obtained. We consider two types of censoring: type I censoring involves
running each test unit until a prespecified time. The censoring times are fixed and
the number of failures is random. Type II censoring involves simultaneously testing
units until a prespecified number of them fails. The censoring time is random while
the number of failures is fixed. We use the following notation:
ln Natural logarithm
ML Maximum likelihood
n Total number of test units
zH, zM, zL High, medium, low stress levels respectively
zD Specified design stress
p_1, p_2, p_3 Proportion of test units allocated to zL, zM and zH, respectively
T Pre-specified period of time over which the reliability estimate is of
interest
R(t; z) Reliability at time t, for given z
f(t; z) Pdf at time t, for given z
F(t; z) Cdf at time t, for given z
With a linear baseline odds function, the model and its cumulative form are

\theta_0(t) = \gamma_0 + \gamma_1 t

\theta(t; z) = (\gamma_0 + \gamma_1 t)\exp(\beta z)

\int_0^t \theta(u; z)\,du = \left(\gamma_0 t + \frac{\gamma_1 t^2}{2}\right) e^{\beta z}

The variance of the estimated odds at the design stress z_D is

\mathrm{Var}\!\left[(\gamma_0 + \gamma_1 t)e^{\beta z_D}\right] = \left(\mathrm{Var}[\gamma_0] + \mathrm{Var}[\gamma_1]\,t^2\right)e^{2(\beta z_D + \mathrm{Var}[\beta]\, z_D^2)} + e^{2\beta z_D + \mathrm{Var}[\beta]\, z_D^2}\left(e^{\mathrm{Var}[\beta]\, z_D^2} - 1\right)(\gamma_0 + \gamma_1 t)^2

The test plan minimises this variance accumulated over the period of interest:

\text{Min} \int_0^T \mathrm{Var}\!\left[(\gamma_0 + \gamma_1 t)e^{\beta z_D}\right]dt

subject to

\Sigma = F^{-1}, \qquad 0 < p_i < 1, \; i = 1, 2, 3, \qquad \sum_{i=1}^{3} p_i = 1

where MNF is the minimum number of failures and \Sigma is the inverse of the Fisher
information matrix.
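The variance expression above can be checked by simulation. The following sketch assumes, for illustration only, that the estimators of γ0, γ1 and β are independent and normally distributed, with made-up means and variances, and compares the closed form with a Monte Carlo estimate:

```python
import math, random

random.seed(1)

# Assumed point estimates and estimator variances (illustrative values only)
g0, g1, beta = 0.1, 0.5, -2.0
v0, v1, vb = 0.01, 0.02, 0.04
t, z = 1.5, 1.2

def var_formula():
    """Variance of (g0 + g1*t)*exp(beta*z) for independent normal
    estimators, matching the expression above term by term."""
    mu_a = g0 + g1 * t
    var_a = v0 + v1 * t ** 2
    term1 = var_a * math.exp(2 * (beta * z + vb * z ** 2))
    term2 = (math.exp(2 * beta * z + vb * z ** 2)
             * (math.exp(vb * z ** 2) - 1.0) * mu_a ** 2)
    return term1 + term2

def var_monte_carlo(n=200_000):
    vals = []
    for _ in range(n):
        a = random.gauss(g0, math.sqrt(v0)) + random.gauss(g1, math.sqrt(v1)) * t
        vals.append(a * math.exp(random.gauss(beta, math.sqrt(vb)) * z))
    m = sum(vals) / n
    return sum((v - m) ** 2 for v in vals) / (n - 1)

exact, mc = var_formula(), var_monte_carlo()
assert abs(mc - exact) / exact < 0.05
```

The closed form is exact under the stated independence and normality assumptions, because E[e^{2βz}] and E[e^{βz}] have the usual lognormal moments.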
Other objective functions can be formulated, resulting in different designs of the
test plans. These functions include the D-optimal design, which provides efficient
estimates of the parameters of the distribution. It allows relatively efficient
determination of all quantiles of the population, but the estimates are distribution
dependent.
Figure 7.1. Distributions of the time to failure at stress and normal conditions

Figure 7.2. Distributions of degradation paths with time at different stress levels (paths at 40°C, 60°C and 80°C crossing a failure threshold over time)
The first step has been discussed in Section 7.3 and the second step will be
discussed in Section 7.6.
The first step is to relate the accelerated testing results to stress conditions and
obtain a reliability expression which is a function of the applied stresses. We then
substitute the normal operating conditions in the expression to obtain a reliability
function at normal conditions. We illustrate this by designing an optimum test plan
and then using its results to obtain the reliability expression.
Suppose we develop an accelerated life test plan for a certain type of electronic
devices using two stresses: temperature and electric voltage. The reliability estimate
at the design condition over a 10-year period of time is of interest. The design
condition is characterized by 50°C and 5 V. From engineering judgment, the highest
levels (upper bounds) of temperature and voltage are pre-specified as 250°C and
10 V, respectively. The allowed test duration is 200 h, and the total number of
devices placed under test is 200. The minimum number of failures at any test
combination is specified as 10. The test plan is determined through the following
steps:
\theta(t; z) = \theta_0(t)\exp(\beta_1 z_1 + \beta_2 z_2), \quad \text{where } \theta_0(t) = \gamma_0 + \gamma_1 t + \gamma_2 t^2

3. A baseline experiment is conducted to obtain initial estimates for the model
parameters. These values are: \gamma_0 = 0.0001, \gamma_1 = 0.5, \gamma_2 = 0, \beta_1 = 3800,
and \beta_2 = 10.

\theta(t; T, V) = 0.5\,t\,e^{-\left(\frac{3800}{T} + \frac{10}{V}\right)}     (7.15)
The reliability and the probability density function (pdf) expressions are respectively
given as

R(t; T, V) = \exp\!\left[-\left(e^{-0.25\left((3800/T) + 10/V\right)}\,t\right)^2\right]     (7.16)

f(t; T, V) = 0.5\,t\,\exp\!\left[-\left(e^{-0.25\left((3800/T) + 10/V\right)}\,t\right)^2\right]     (7.17)

Substituting the normal operating conditions (30°C, 5 V) gives the pdf at normal
conditions:

f_n(t) = f(t; 30°\mathrm{C}, 5\,\mathrm{V}) = 0.5\,t\,\exp\!\left[-\left(e^{-3.6336}\,t\right)^2\right]     (7.19)
Figure 7.3. One cycle (0, t_p]: failure replacements of a new item, followed by a preventive replacement at t_p
Let c(t_p) be the total replacement cost per unit time as a function of t_p.
The total expected cost in the interval (0, t_p] is the sum of the expected cost of
failure replacements and the cost of the preventive replacement. During the interval
(0, t_p], one preventive replacement is performed at a cost of c_p, together with an
expected M(t_p) failure replacements at a cost of c_f each. Then

c(t_p) = \frac{c_p + c_f M(t_p)}{t_p}.     (7.21)

With c_p = 10 and c_f = 1200, this becomes

c(t_p) = \frac{10 + 1200 \int_0^{t_p} t f_n(t)\,dt}{t_p}     (7.22)
Calculated values of the cost per unit time are shown in Table 7.1 and plotted in
Figure 7.4. The optimum preventive maintenance schedule at normal operating
conditions is 0.18 unit times.
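The optimum schedule can be found by evaluating Equation 7.21 on a grid of candidate t_p values. The failure-time pdf, cost values and grid below are illustrative assumptions rather than the quantities of the worked example:

```python
import math

def optimal_pm_interval(pdf, c_p, c_f, grid, n_steps=2000):
    """Grid search for the t_p minimising the Equation 7.21-style cost
    c(t_p) = (c_p + c_f * integral_0^{t_p} t*pdf(t) dt) / t_p,
    with the integral evaluated by the trapezoidal rule."""
    best = None
    for tp in grid:
        h = tp / n_steps
        integral = sum(
            0.5 * h * ((k * h) * pdf(k * h) + ((k + 1) * h) * pdf((k + 1) * h))
            for k in range(n_steps)
        )
        cost = (c_p + c_f * integral) / tp
        if best is None or cost < best[1]:
            best = (tp, cost)
    return best

# Illustrative failure-time pdf (exponential with rate 2) -- not the pdf of
# Equation 7.19, whose constants belong to the worked example.
pdf = lambda t: 2.0 * math.exp(-2.0 * t)
grid = [0.05 * k for k in range(1, 41)]   # candidate t_p in (0, 2]
tp_opt, c_opt = optimal_pm_interval(pdf, c_p=10, c_f=1200, grid=grid)
```

With these assumed inputs the cost curve has the same bathtub shape as Figure 7.4: a small t_p wastes preventive replacements, a large t_p incurs more failure replacements.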
Table 7.1. Time vs. cost per unit time values (bold numbers indicate optimum values)
Cost/unit time 918 885 862 847 839 836 840 848
Figure 7.4. Optimum preventive maintenance schedule
where D_i = 1.41 in. is the initial reinforcing bar diameter and t is the elapsed time.
Note that t \geq T_1 and D(t) \geq 0. For more details of Equation 7.23 the reader is
referred to Enright and Frangopol (1998).
The time-variant reinforced concrete strength, M_p(t), can now be evaluated
using the conventional design equations in Enright and Frangopol (1998):

M_p = n A_s f_y \left(d - \frac{a}{2}\right)     (7.24)

a = \frac{n A_s f_y}{0.85 f_c' b}     (7.25)

Note that A_s = \pi D(t)^2 / 4. The reinforcing steel and the concrete strengths are
f_y and f_c', respectively. The number of reinforcing bars is n. The effective depth
and the width of the beam are d and b, respectively. For the current example, the
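As a numerical sketch of Equations 7.24 and 7.25, the following uses hypothetical values of n, f_y, f_c', d and b (only the bar diameter of 1.41 in. is taken from the text):

```python
import math

# Hypothetical inputs: bar count, yield strength f_y (psi), concrete strength
# f_c (psi), effective depth d and width b (in.); D_t is the bar diameter.
n_bars, f_y, f_c = 4, 60000.0, 4000.0
d, b = 20.0, 12.0
D_t = 1.41                                     # reinforcing bar diameter (in.)

A_s = math.pi * D_t ** 2 / 4.0                 # steel area per bar
a = (n_bars * A_s * f_y) / (0.85 * f_c * b)    # Equation 7.25
M_p = n_bars * A_s * f_y * (d - a / 2.0)       # Equation 7.24, in inch-pounds
```

As corrosion reduces D(t), the areas A_s shrink and M_p(t) decays with time, which is what drives the degradation analysis below.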
R_x(t) = P(X > x; t) = \exp\!\left[-\frac{x^{\beta}}{b \exp(at)}\right]     (7.26)

L(\beta, a, b; t) = \prod_{i=1}^{m} \left(\frac{\beta}{b \exp(a t_i)}\right)^{n_i} \prod_{i=1}^{m} \prod_{j=1}^{n_i} x_{ij}^{\beta - 1} \exp\!\left(-\frac{x_{ij}^{\beta}}{b \exp(a t_i)}\right)     (7.27)

\ln L = \sum_{i=1}^{m} n_i \ln \beta - \sum_{i=1}^{m} n_i \ln b - a \sum_{i=1}^{m} n_i t_i + (\beta - 1) \sum_{i=1}^{m} \sum_{j=1}^{n_i} \ln x_{ij} - \sum_{i=1}^{m} \sum_{j=1}^{n_i} \frac{x_{ij}^{\beta}}{b \exp(a t_i)}     (7.28)
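A quick consistency check of Equations 7.27 and 7.28 — evaluating the logarithm of the product form directly and comparing it with the expanded sum — can be written as follows, with toy failure data and assumed parameter values:

```python
import math

# Toy complete (uncensored) data: strengths x[i][j] observed at m time points
# t_i; all values below are assumptions for illustration only.
x = [[120.0, 95.0, 140.0], [60.0, 75.0], [30.0, 42.0, 51.0, 38.0]]
times = [1.0, 2.0, 3.0]
beta, a, b = 1.49, -0.12, 1.0e5

def log_lik_direct():
    """ln of the likelihood product in Equation 7.27."""
    ll = 0.0
    for i, xs in enumerate(x):
        theta = b * math.exp(a * times[i])
        for xij in xs:
            ll += math.log((beta / theta) * xij ** (beta - 1.0)
                           * math.exp(-(xij ** beta) / theta))
    return ll

def log_lik_expanded():
    """Expanded sum of Equation 7.28."""
    ll = 0.0
    for i, xs in enumerate(x):
        ni = len(xs)
        ll += ni * math.log(beta) - ni * math.log(b) - ni * a * times[i]
        ll += (beta - 1.0) * sum(math.log(v) for v in xs)
        ll -= sum(v ** beta for v in xs) / (b * math.exp(a * times[i]))
    return ll

assert abs(log_lik_direct() - log_lik_expanded()) < 1e-9
```

Either form can then be handed to a numerical optimiser to obtain the estimates of β, a and b.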
R_x(t) = P(X > x; t) = \exp\!\left[-\frac{x^{\beta}}{b \exp(at)}\right]

or, with the estimated parameters,

R_x(t) = \exp\!\left[-\frac{x^{1.49}}{1.1346598 \times 10^{7} \exp(-0.12t)}\right].
The reliability for different threshold values of the strength is shown in Figure
7.5. The times to failure for threshold values of 4800, 4000, 3500, 3000, and 2500
are 25.04, 27.25, 28.88, 30.76, and 33.0 years, respectively.
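The behaviour of R_x(t) can be explored numerically with the fitted parameters. The sketch below checks only the qualitative behaviour (reliability falling with time and with higher strength thresholds), not the quoted failure times, whose defining criterion is not restated here:

```python
import math

def strength_reliability(x, t, beta=1.49, b=1.1346598e7, a=-0.12):
    """R_x(t) = exp[-x**beta / (b*exp(a*t))], using the fitted parameters."""
    return math.exp(-(x ** beta) / (b * math.exp(a * t)))

thresholds = [2500, 3000, 3500, 4000, 4800]
for s in thresholds:
    # reliability decreases with time for a fixed threshold (since a < 0)
    assert strength_reliability(s, 10) > strength_reliability(s, 30)

# at a fixed time, a higher strength threshold gives lower reliability
r = [strength_reliability(s, 25) for s in thresholds]
assert r == sorted(r, reverse=True)
```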
Figure 7.5. Reliability for different threshold levels
The next step is to determine the optimum preventive maintenance schedule for
every threshold level and select the schedule corresponding to the smallest cost
among all optimum cost values. This will represent both the optimum threshold
level and the corresponding optimum preventive maintenance schedule.
We demonstrate this for two threshold levels (S = 4800 and S = 2500), assuming
c_p = 10 and c_f = 1200; we utilize Equation 7.21 as follows:

c(t_p) = \frac{10 + 1200 \int_0^{t_p} t f(x; t)\,dt}{t_p}     (7.29)

where

f(x; t) = \frac{\beta}{\theta(t)}\, x^{\beta - 1} \exp\!\left(-\frac{x^{\beta}}{\theta(t)}\right), \quad t > 0, \quad \theta(t) = b e^{at}     (7.30)
As shown in Figure 7.6, the optimum t_p values for S = 4800 and S = 2500 are 17
and 16 years, respectively. The minimum of the two costs is the one corresponding to
S = 2500. Therefore, the optimum threshold is 2500 and the corresponding optimum
maintenance schedule is 16 years.
Figure 7.6. Cost per unit time against time for S = 2500 and S = 4800
7.7 Summary

In this chapter we present the common approaches for predicting reliability using
accelerated life testing. The models are classified as accelerated life testing (ALT)
models and accelerated degradation testing (ADT) models. The ALT models are
further classified into accelerated failure time models, with assumed failure time
distributions, and distribution-free models. We also modify the proportional odds
model for reliability prediction with multiple stresses. Most of the research in
the literature does not extend the use of accelerated life testing beyond reliability
prediction at normal conditions. This is the first work that links ALT to
maintenance theory and maintenance scheduling. We develop optimum preventive
maintenance schedules for both ALT models and degradation models. We
demonstrate how the reliability prediction models obtained from ALT can be used
to obtain optimum maintenance schedules. We also demonstrate the link
between the optimum degradation threshold level and the optimum maintenance
schedule. This work can be further extended to include other maintenance costs or
to ensure a minimum availability level of a system. Further work is needed to
investigate the relationship between threshold levels at accelerated conditions and
those at normal conditions. Moreover, the models need to include the repair rate as
well as spares availability.
7.8 References
Agresti, A. and Lang, J.B. (1993) Proportional odds model with subject-specific effects for
repeated ordered categorical responses, Biometrika, 80, pp. 527–534
Bennett, S. (1983) Log-logistic regression models for survival data, Applied Statistics, 32,
165–171
Brass, W. (1971) On the scale of mortality, In: Brass, W., editor, Biological aspects of
mortality, Symposia of the Society for the Study of Human Biology, Volume X, London:
Taylor & Francis Ltd., 69–110
Brass, W. (1974) Mortality models and their uses in demography, Transactions of the
Faculty of Actuaries, Vol. 33, 122–133
Ciampi, A. and Etezadi-Amoli, J. (1985) A general model for testing the proportional
hazards and the accelerated failure time hypotheses in the analysis of censored survival
data with covariates, Commun. Statist. – Theor. Meth., Vol. 14, pp. 651–667
Cox, D.R. (1972) Regression models and life tables (with discussion), Journal of the Royal
Statistical Society B, Vol. 34, pp. 187–208
Cox, D.R. (1975) Partial likelihood, Biometrika, Vol. 62, pp. 269–276
Eghbali, G. and Elsayed, E.A. (2001) Reliability estimate using degradation data, in
Advances in Systems Science: Measurement, Circuits and Control, Mastorakis, N.E.
and Pecorelli-Peres, L.A. (Editors), Electrical and Computer Engineering Series, WSES
Press, pp. 425–430
Elsayed, E.A. (1996) Reliability engineering, Addison-Wesley Longman, Inc., New York
Elsayed, E.A. and Jiao, L. (2002) Optimal design of proportional hazards based accelerated
life testing plans, International Journal of Materials & Product Technology, Vol. 17,
Nos. 5/6, 411–424
Elsayed, E.A. and Zhang, H. (2006) Design of PH-based accelerated life testing plans under
multiple-stress-type, to appear in Reliability Engineering and System Safety
Elsayed, E.A., Liao, H., and Wang, X. (2006) An extended linear hazard regression model
with application to time-dependent-dielectric-breakdown of thermal oxides, IIE
Transactions on Quality and Reliability Engineering, Vol. 38, No. 4, 329–340
Elsayed, E.A. and Zhang, H. (2005) Design of optimum simple step-stress accelerated life
testing plans, Proceedings of the 2005 International Workshop on Recent Advances in
Stochastic Operations Research, Canmore, Canada
Enright, M.P. and Frangopol, D.M. (1998) Probabilistic analysis of resistance degradation
of reinforced concrete bridge beams under corrosion, Engineering Structures, Vol. 20,
No. 11, pp. 960–971
Etezadi-Amoli, J. and Ciampi, A. (1987) Extended hazard regression for censored survival
data with covariates: a spline approximation for the baseline hazard function,
Biometrics, Vol. 43, pp. 181–192
Ettouney, M. and Elsayed, E.A. (1999) Reliability estimation of degraded structural
components subject to corrosion, Fifth ISSAT International Conference, Las Vegas,
Nevada, August 11–13
Hannerz, H. (2001) An extension of relational methods in mortality estimation,
Demographic Research, Vol. 4, pp. 337–368
Kalbfleisch, J.D. and Prentice, R.L. (2002) The statistical analysis of failure time data, John
Wiley & Sons, New York
Liao, H., Elsayed, E.A., and Chan, L.-Y. (2005) Maintenance of continuously monitored
degrading systems, European Journal of Operational Research, Vol. 175, No. 2, 821–835
Liao, H. (2004) Degradation models and design of accelerated degradation testing plans,
Ph.D. Dissertation, Department of Industrial and Systems Engineering, Rutgers
University
McCullagh, P. (1980) Regression models for ordinal data, Journal of the Royal Statistical
Society, Series B, Vol. 42, No. 2, 109–142
Meeker, W.Q. and Escobar, L.A. (1998) Statistical methods for reliability data, John Wiley
& Sons, New York
Nelson, W. (2004) Accelerated testing: statistical models, test plans, and data analyses,
John Wiley & Sons, New York
Oakes, D. and Dasu, T. (1990) A note on residual life, Biometrika, 77, pp. 409–410
Pascual, F.G. (2006) Accelerated life test plans robust to misspecification of the stress-life
relation, Technometrics, Vol. 48, No. 1, 11–25
Shyur, H.-J. (1996) A general nonparametric model for accelerated life testing with time-
dependent covariates, Ph.D. Dissertation, Department of Industrial and Systems
Engineering, Rutgers University
Shyur, H.-J., Elsayed, E.A. and Luxhoj, J.T. (1999) A general model for accelerated life
testing with time-dependent covariates, Naval Research Logistics, Vol. 46, 303–321
Tobias, P. and Trindade, D. (1986) Applied reliability, Van Nostrand Reinhold Company,
New York
Zhang, H. and Elsayed, E.A. (2005) Nonparametric accelerated life testing based on
proportional odds model, Proceedings of the 11th ISSAT International Conference on
Reliability and Quality in Design, St. Louis, Missouri, USA, August 4–6
Zhao, W. and Elsayed, E.A. (2005) Optimum accelerated life testing plans based on
proportional mean residual life, Quality and Reliability Engineering International
8

Preventive Maintenance Models for Complex Systems

David F. Percy
8.1 Introduction
Preventive maintenance (PM) of repairable systems can be very beneficial in
reducing repair and replacement costs, and in improving system availability, by
reducing the need for corrective maintenance (CM). Strategies for scheduling PM
are often based on intuition and experience, though considerable improvements in
performance can be achieved by fitting mathematical models to observed data; see
Handlarski (1980), Dagpunar and Jack (1993) and Percy and Kobbacy (2000) for
example.
For systems comprising few components, and systems comprising many iden-
tical components, modelling and analysis using compound renewal processes
might be possible. Such situations are considered by Dekker et al. (1996) and Van
der Duyn Schouten (1996). However, many systems comprise a large variety of
different components and are too complicated for applying this methodology. We
refer to these as complex repairable systems.
This chapter reviews basic models for complex repairable systems, explaining
their use for determining optimal PM intervals. Then it describes advanced
methods, concentrating on generalized proportional intensities models, which have
proven to be particularly useful for scheduling PM. Computational difficulties are
addressed and practical illustrations are presented, based on sub-systems of oil
platforms and refineries.
The motivation is that for complex systems, one needs to build models for
failures based on the history of maintenance (PM and CM) available. Once a model
is built, one can evaluate different PM strategies to determine the best one. The
focus is to look at different models and how to determine the best model based on
historical data.
Section 8.2 presents some real examples of complex systems with historical
data sets. In each case, it discusses current maintenance policies and any problems
with collection or accuracy of the data. Section 8.3 considers the effects of PM and
CM actions upon system reliability and availability, so justifying the need for
180 D. Percy
Table 8.1. Hypothetical reliability data from Ascher and Feingold (1984)
Example 8.2 Percy et al. (1998) published a set of data relating to the reliability
and maintenance history of a valve in a petroleum refinery, as displayed in Table 8.2.
The two columns successively represent the times in days between maintenance
actions and the types of actions, where 0 indicates no failure (PM) and 1 indicates
failure (CM).
At first glance, this would appear to be a noncommittal system. However, on
further inspection, there appear to be fewer failures later on and more preventive
actions. Whether the PM is proving to be effective or the system is generally happy
is not easy to determine. Modelling can provide these answers though. Based on
these data, our ultimate goal is to decide how often to perform PM in future or on
similar systems.
When collecting such data, it is very important to record all PM and CM events
accurately, as errors of omission or commission can result in wrong decisions. For
example, if the first failure were not recorded, the average time until system failure
over the first 94 days would appear to be twice its actual value, perhaps suggesting
that PM is not required.
Example 8.3 Kobbacy et al. (1997) published a set of historical reliability and
maintenance data collected from a main pump at an oil refinery over a period of
nearly seven years. These data are reproduced in Table 8.3, with consecutive
observations reading down the columns successively from left to right.
Table 8.3. Reliability and maintenance history of a main oil refinery pump
Many mathematical models have been proposed for statistical analysis of complex
repairable systems. Table 8.4 presents a summary of the main types. In order to
discuss the strengths and weaknesses of each model in more depth, we first intro-
duce some standard notation. Suppose that each time a system fails, we repair it
and thereby return it to operational condition. For a preliminary analysis, we also
assume that repair times are negligible. Let T_1, T_2, T_3, … be the times to successive
failures of the system and let X_i = T_i - T_{i-1} be the time between failure i - 1 and
failure i, where T_0 = 0. The T_i and X_i are random variables and we define t_i and
x_i to be their corresponding realized values. Figure 8.1 illustrates this situation.
We also define N(t) as the number of failures in the interval (0, t].
We generally model the time to first failure using a familiar lifetime probability
distribution or hazard function. However, this approach is inadequate for modelling
other times to failure, as the inter-failure times are neither independent nor
identically distributed in general (Ascher and Feingold 1984). Stochastic processes
form the appropriate basis for models to use under these circumstances. We are
interested in the probability that a system fails in the interval (t, t + \Delta t] given the
history of the process up to time t. We describe the behaviour of the failure
process by the intensity function (identified here by the Greek letter iota):

Preventive Maintenance Models for Complex Systems 185

\iota(t) = \lim_{\Delta t \to 0} \frac{P\{N(t + \Delta t) - N(t) \geq 1 \mid H(t)\}}{\Delta t}.     (8.1)
For an orderly process, where simultaneous failures are impossible, the intensity
function is equal to the derivative of the conditional expected number of failures:

\iota(t) = \frac{d}{dt} E\{N(t) \mid H(t)\},     (8.2)
(iii) N(t) - N(s) \sim \mathrm{Po}\!\left(\int_s^t \iota(u)\,du\right) [Poisson-distributed failure counts]

(ii) f(x; \alpha, \lambda) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} \exp(-\lambda x); \; x > 0 [gamma]

(iii) f(x; \beta, \lambda) = \beta \lambda (\lambda x)^{\beta - 1} \exp\{-(\lambda x)^{\beta}\}; \; x > 0 [Weibull]
The form of the hazard function is precisely the same as the form of the intensity
function if we were to use a stochastic process to model the complex system. For a
nonhomogeneous Poisson process, this intensity function applies beyond the first
failure. However, successive hazard functions for inter-failure times have different
forms, which correspond to shifted and truncated versions of the distribution for
time to first failure.
Imperfect maintenance models must allow for the dynamic evolution of a system
and take account of hypothesized and observed knowledge about the effectiveness
of repairs. As mentioned above, this section reviews a variety of existing models for
repairable systems and describes suitable adaptations for systems that are subject to
preventive maintenance. In passing, we remark that time is used as the only scale of
measurement here. Some applications use running time instead, or both, such as the
flight time of an aircraft or the mileage and age of a car. Further details of such
variations are described by Baik et al. (2004) and Jiang and Jardine (2006).
This model assumes that repairs renew a system to its condition as new. A renewal
process is a counting process that registers the successive occurrence of events
during a given time interval ( 0,t ] where the time durations between consecutive
events X 1 , X 2 , X 3 , form a sequence of independent and identically distributed
non-negative random variables. The special case where their distribution is
exponential corresponds to the homogeneous Poisson process. We can characterize
the intensity function of a renewal process by
\iota(t) = \iota_0\!\left(t - t_{N(t)}\right)     (8.3)

where \iota_0(t) is the baseline intensity function, which would prevail if there were no
system failures. As this is a renewal process, the baseline intensity function is
equal to the hazard function for the inter-failure times: \iota_0(x) = h(x). The baseline
intensity function can take many forms, including:

(i) \iota_0(t) = \lambda [constant]
(ii) \iota_0(t) = \exp(\alpha + \beta t) [loglinear]
(iii) \iota_0(t) = \lambda t^{\beta} [power-law]
The renewal process is a plausible first order model for components or parts
when the repair time is negligible, since complete replacement of a component
after failure implies renewal instead of repair. Conversely, the renewal process is a
poor model for complex systems, where repairs involve replacing or restoring just
a fraction of the system's components. If a large portion of a system needs to be
restored, it is often more economical to replace the entire system. Even if a repair
restores the system's performance to its original specification, the presence of
predominantly aged components implies that system reliability is not renewed.
The assumptions underlying this model imply that, when a repair is carried out, a
system assumes the same condition that it was in immediately before failure. The
nonhomogeneous Poisson process (NHPP) differs from the homogeneous Poisson
process only in that the rate of occurrence of failures varies with time rather than
being constant. As mentioned early in this section, it is the fundamental model for
repairable systems. The NHPP is also the most appropriate model for the reliability
of a complex system comprising infinitely many components. However, for a finite
number of components, this model can only serve as an approximation, often poor,
as the intensity function changes following each repair. In this model, the
inter-arrival times X_1, X_2, X_3, … are neither independent nor identically distributed.
\iota(t) = \iota_0(t)     (8.4)
As shown in Figure 8.2, define the random variables U and V to be the lifetimes
after PM and CM respectively. Their probability density functions, conditional
upon known parameters, are fU ( u ) and fV ( v ) respectively. These distributions
might take the exponential, gamma or Weibull forms defined earlier, to achieve the
required flexibility. Note that the exponential distribution is a limiting case of the
gamma as \alpha \to 1 and of the Weibull as \beta \to 1. The DRP assumes that downtimes are
negligible and that their costs are dominated by the costs of parts and labour. We
now consider the effects of non-ignorable downtimes.
The delayed renewal process described above assumes that the downtimes for
preventive and corrective maintenance are negligible when compared with the
lifetimes. It also assumes that the costs associated with these downtimes are
dominated by the costs of parts and labour. The model and analysis are further
complicated when we allow for periods of downtime, when maintenance actions
take place. In many applications involving continuous-process industries, the
principal costs are not due to parts and labour, but are due to lost production whilst
the system is down. Consequently, we must consider downtime costs and durations
when determining cost-effective strategies for scheduling PM.
This extension results in the delayed alternating renewal process (DARP), for
which analytical solution is not even feasible in practice. The downtimes following
preventive and corrective maintenance can be fixed or random. Since analytical
solution of the optimisation problems is not possible and we are adopting a
simulation approach here, either of these can be included in the calculations with
ease. In the following work, we consider them fixed to avoid confusion. Another
benefit of simulation over numerical solution of the renewal equations is that
anomalies are readily catered for, such as switching from CM to PM if the system
is in the failed state when PM is due. The DARP is illustrated in Figure 8.3.
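A minimal simulation of the DARP, with fixed PM and CM downtimes and — purely as an illustrative assumption — exponential lifetimes, might look as follows:

```python
import random

random.seed(0)

def simulate_darp(pm_interval, horizon, mean_life=100.0,
                  pm_down=2.0, cm_down=10.0):
    """Crude simulation of a delayed alternating renewal process with fixed
    downtimes: PM after pm_interval of operation, CM on failure. The
    exponential lifetime is an assumption for illustration only."""
    clock, downtime = 0.0, 0.0
    while clock < horizon:
        life = random.expovariate(1.0 / mean_life)
        if life >= pm_interval:          # survives to scheduled PM
            clock += pm_interval + pm_down
            downtime += pm_down
        else:                            # fails first: corrective maintenance
            clock += life + cm_down
            downtime += cm_down
    return 1.0 - downtime / clock        # long-run availability estimate

avail = simulate_darp(pm_interval=50.0, horizon=1_000_000.0)
assert 0.0 < avail < 1.0
```

Replacing the exponential lifetime with an ageing distribution (gamma or Weibull) is what makes the choice of pm_interval consequential, and the anomaly mentioned above (switching a due PM into a CM) is a one-line change inside the loop.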
The delayed alternating renewal process is appropriate when the time to replace
(or repair back to new) a failed item is non-zero. In this case, we have working and
failed states and these alternate. So far, we have only allowed for systems that
display no long-term trends, corresponding to improvement or deterioration. We
now discuss age-based models that allow for such trends. These models can also be
used for stationary and non-stationary systems when concomitant information is
available. We discuss these benefits later, as the need for including such extra
sources of information is described.
The virtual age model (VAM) modifies the hazard function for a system's
inter-failure times at each corrective maintenance action. For these repairs, the
system's virtual age at any given time is determined by a variety of additive or
multiplicative age-reduction factors. This resets the system to a younger state, which is
only an approximation for reasons mentioned earlier. The intensity function of a
point process under the age reduction model may be additive
\iota(t) = \iota_0\!\left(t - \sum_{i=1}^{N(t)} s_i\right)     (8.5)

or multiplicative

\iota(t) = \iota_0\!\left(t \prod_{i=1}^{N(t)} s_i\right)     (8.6)

where the s_i are constants representing the age reduction factors, and \iota_0(t) is
the baseline intensity function again.
In order to evaluate the intensity function for a sequence of failures under age
reduction, the renewal function governs the system failure pattern. The additive
model can generate negative intensities but the multiplicative model is suitable if
replacement components are infallible. The age-reduction model has been applied
to systems under a block replacement policy. A critical defect of the age-reduction
model and its many variants is that they do not provide a realistic description of the
failure processes. For example, replacing a corroded exhaust pipe does not reduce a
car's age, as very many other components are no less likely to fail.
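The additive and multiplicative virtual-age intensities of Equations 8.5 and 8.6 can be evaluated directly from a failure history; the baseline intensity and the age-reduction factors below are assumptions for illustration:

```python
def virtual_age_intensity(t, failure_times, s, baseline, additive=True):
    """Intensity under the age-reduction model: Equation 8.5 (additive,
    iota0(t - sum of s_i)) or Equation 8.6 (multiplicative, iota0(t * prod
    of s_i)), taking a common factor s for every repair up to time t."""
    past = [ti for ti in failure_times if ti <= t]
    if additive:
        return baseline(t - s * len(past))
    return baseline(t * s ** len(past))

baseline = lambda u: 0.01 * u        # assumed increasing baseline intensity
failures = [30.0, 55.0, 90.0]

# each repair with s = 5 knocks 5 units off the effective age (additive form)
assert virtual_age_intensity(100.0, failures, 5.0, baseline) == 0.01 * 85.0
# multiplicative form with s = 0.9 rescales the age instead
assert abs(virtual_age_intensity(100.0, failures, 0.9, baseline,
                                 additive=False) - 0.01 * 100.0 * 0.9 ** 3) < 1e-12
```

The additive form illustrates the defect noted above: a large enough s or a long enough failure history drives the argument of the baseline negative.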
The proportional hazards model (PHM) is more flexible than the renewal process,
DRP and DARP, as it allows for non-stationarity. It is also more flexible than the
virtual age model because it allows for concomitant information. In principle, this
model appears to be inappropriate for representing a complex system, because
hazards naturally relate to lifetimes of components rather than inter-failure times of
processes. We cannot physically justify this model as readily as the proportional
intensities model described later. However, this does not invalidate its use in this
context as a statistical model rather than a mathematical model and considerable
\iota(u) = \iota_0(u)\exp\!\left(\boldsymbol{\beta}^{T}\mathbf{y}\right)     (8.7)

and after CM

\iota(v) = \iota_0(v)\exp\!\left(\boldsymbol{\beta}^{T}\mathbf{z}\right)     (8.8)
\iota(t) = \iota_0(t) \prod_{i=1}^{N(t)} s_i     (8.9)

where the s_i are constants representing the intensity reduction factors and \iota_0(t) is
the baseline intensity function again. We later generalize this model by supposing
si are simple functions of i , or are random variables that are independent of the
failure and repair process. Having concluded that this model is ideally suited to
modelling complex repairable systems, this chapter later considers how to extend it
to allow for preventive maintenance and concomitant information.
\iota(t) = \lim_{\Delta t \to 0} \frac{P\{N(t + \Delta t) - N(t) \geq 1 \mid H(t)\}}{\Delta t}     (8.10)
at system age t units, where H ( t ) is the history of the process. However, the NHPP
corresponds with minimal repair as in Section 8.4.2 and makes no allowances for
system improvement, or even deterioration, arising from maintenance actions.
Hence, we modify the intensity function by introducing a multiplicative factor,
so that we can express the intensity function as

\iota(t) = \iota_0(t)\exp\!\left(\boldsymbol{\beta}^{T}\mathbf{x}_t\right)     (8.11)

where the baseline intensity \iota_0(t) has a standard form such as constant, loglinear
and power-law. Furthermore, the parameter vector \boldsymbol{\beta} represents the regression
coefficients and the observation vector \mathbf{x}_t contains factors and covariates relating
to the system, such as the cumulative observations and concomitant information
mentioned in Section 8.4.6.
An alternative option arises when using the PIM to model a complex repairable
system subject to PM. Rather than adopting a global time scale for the baseline
intensity function as implied above, we could reset the time scale of the baseline
intensity function to zero upon each PM action. This introduces an element of
\iota(t) = \iota_0(t) \prod_{i=1}^{M(t)} r_i \prod_{j=1}^{N(t)} s_j \exp\!\left(\boldsymbol{\beta}^{T}\mathbf{x}_t\right).     (8.12)

Here, \iota_0(t) is the baseline intensity function, whilst r_i > 0 and s_j > 0 are the
intensity scaling factors for preventive maintenance (PM) and corrective
maintenance (CM) actions respectively. Furthermore, M(t) and N(t) are the total
numbers of PM and CM actions, whilst \mathbf{x}_t is a vector of predictor variables and
\boldsymbol{\beta} is an unknown parameter vector of regression coefficients. One might expect the
r_i and s_j to be less than one for a deteriorating system and greater than one for an
improving system, though replacing failed components with used parts and
accidentally introducing faults during maintenance can produce the opposite effects.
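Evaluating the GPIM intensity of Equation 8.12 for a given maintenance history is straightforward. The sketch below takes constant scaling factors, a scalar covariate, and an assumed linear baseline intensity:

```python
import math

def gpim_intensity(t, pm_times, cm_times, rho, sigma, beta, x_t,
                   baseline=lambda u: 0.02 * u):
    """Generalized proportional intensities model (Equation 8.12) with the
    common scaling factors r_i = rho and s_j = sigma; the baseline, rho,
    sigma, beta and covariate value are illustrative assumptions."""
    M_t = sum(1 for u in pm_times if u <= t)     # PM actions up to time t
    N_t = sum(1 for v in cm_times if v <= t)     # CM actions up to time t
    return baseline(t) * rho ** M_t * sigma ** N_t * math.exp(beta * x_t)

# a deteriorating system: each PM scales the intensity by 0.8, each CM by 0.95
iota = gpim_intensity(120.0, pm_times=[30, 60, 90], cm_times=[45, 70],
                      rho=0.8, sigma=0.95, beta=0.1, x_t=1.0)
```

Choosing rho < 1 and sigma close to 1, as here, encodes the common situation where PM rejuvenates the system substantially while repairs are closer to minimal.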
System copies can have different forms of baseline intensity function. For
reduction of intensity, the scaling factors can take the forms of positive constants,
random variables, deterministic functions of time ( t ) and events ( i and j ) or
stochastic functions of time and events. As for the intensity reduction model
described in Section 8.4.7, a reasonable assumption for initial analysis is that the r_i
take a common value for i = 1, 2, \ldots, M(t) and the s_j take a common value for
j = 1, 2, \ldots, N(t), in which case the GPIM
corresponds with the PIM of Section 8.4.8. The vector of predictor variables might
include:
D = \left\{ (u_i, v_{ij}); \; i = 1, \ldots, n; \; j = 1, \ldots, n_i \right\}     (8.13)

c_i = \begin{cases} 0; & u_i \text{ right censored} \\ 1; & u_i \text{ observed lifetime} \end{cases}     (8.14)

and similarly

d_{ij} = \begin{cases} 0; & v_{ij} \text{ right censored} \\ 1; & v_{ij} \text{ observed lifetime} \end{cases}     (8.15)
The full likelihood, in terms of the parameters \theta and \phi of the lifetime
distributions after PM and CM respectively, is

L(\theta, \phi; D) \propto \prod_{i=1}^{n} \{f(u_i)\}^{c_i} \{R(u_i)\}^{1 - c_i} \prod_{j=1}^{n_i} \{f(v_{ij})\}^{d_{ij}} \{R(v_{ij})\}^{1 - d_{ij}}     (8.16)

which factorises as

L(\theta, \phi; D) = L(\theta; D)\, L(\phi; D)     (8.17)

where

L(\theta; D) \propto \prod_{i=1}^{n} \{f(u_i)\}^{c_i} \{R(u_i)\}^{1 - c_i}     (8.18)

and

L(\phi; D) \propto \prod_{i=1}^{n} \prod_{j=1}^{n_i} \{f(v_{ij})\}^{d_{ij}} \{R(v_{ij})\}^{1 - d_{ij}}.     (8.19)
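The factorisation in Equation 8.17 can be confirmed numerically. The sketch below assumes exponential lifetime models for U and V, with made-up rates and toy censored data:

```python
import math

# Assumed exponential rates for the PM-lifetime (theta) and CM-lifetime (phi)
theta, phi = 0.05, 0.2
u = [(12.0, 1), (30.0, 0), (7.5, 1)]            # (u_i, c_i): c_i = 0 censored
v = [[(3.0, 1), (9.0, 0)], [(4.2, 1)], []]      # (v_ij, d_ij) per system

def f(x, rate): return rate * math.exp(-rate * x)
def R(x, rate): return math.exp(-rate * x)

L_theta = math.prod(f(ui, theta) ** ci * R(ui, theta) ** (1 - ci)
                    for ui, ci in u)
L_phi = math.prod(f(vij, phi) ** dij * R(vij, phi) ** (1 - dij)
                  for vs in v for vij, dij in vs)
L_full = math.prod(
    f(ui, theta) ** ci * R(ui, theta) ** (1 - ci) for ui, ci in u
) * math.prod(
    f(vij, phi) ** dij * R(vij, phi) ** (1 - dij) for vs in v for vij, dij in vs
)

# Equation 8.17: the likelihood factorises into PM and CM components
assert abs(L_full - L_theta * L_phi) <= 1e-12 * L_full
```

The practical benefit is that the PM-lifetime and CM-lifetime parameters can be estimated by two separate, smaller maximisations.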
L\{\iota; H(T)\} \propto \left[\prod_{i=1}^{N(T)} \iota(t_i)\right] \exp\!\left(-\int_0^T \iota(t)\,dt\right)     (8.20)
196 D. Percy
N (T ) T
l {; H ( t )} = const. + log ( t ) ( t ) dt .
i =1
i
(8.21)
0
Therefore, once we specify the formulation of λ(t), we can obtain estimates for its
unknown parameters via likelihood-based methods.
Example 8.4 Assuming T = t_{N(T)} so that observation ceases at a failure, the
maximum likelihood estimates (MLEs) can be determined analytically for the
power-law process (NHPP with power-law intensity). With λ(t) = αt^δ and
n = N(T), the MLEs are

δ̂ = n / [ ∑_{i=1}^{n} log(T/t_i) ] − 1    (8.22)

and

α̂ = n(δ̂ + 1) / T^{δ̂+1} .    (8.23)

For a particular system, successive arrival times (not inter-arrival times) were
observed to be 15, 42, 74, 117, 168, 233 and 410 days. With n = 7, T = 410 and
t₁ = 15, …, t₇ = 410, we have δ̂ ≈ −0.3007 and then α̂ ≈ 0.07288. As δ̂ < 0, the
intensity is a strictly decreasing function of time; this is a happy system that seems
to improve with age.
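The calculations in Example 8.4 are easy to reproduce. The following sketch implements Equations 8.22 and 8.23 under the same assumption that observation ceases at the final failure; variable names are ours.

```python
import math

def power_law_mles(arrival_times):
    """MLEs for the NHPP with power-law intensity lambda(t) = alpha * t**delta,
    assuming observation is truncated at the final failure, T = t_n
    (Equations 8.22 and 8.23)."""
    T = arrival_times[-1]
    n = len(arrival_times)
    delta = n / sum(math.log(T / t) for t in arrival_times) - 1
    alpha = n * (delta + 1) / T ** (delta + 1)
    return delta, alpha

delta, alpha = power_law_mles([15, 42, 74, 117, 168, 233, 410])
# delta ≈ -0.3007 and alpha ≈ 0.0729: the fitted intensity decreases with age.
```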
Analysis of the intensity-based models follows by extending this likelihood
function for the NHPP. Consider the generalized proportional inten-
sities model of Section 8.5. The choice of which predictor variables to include
depends upon the sample size (history of failures) and the results of standard
selection procedures based on analyses of deviance for nested models. Only im-
portant predictors should be included in order to produce a robust model. We can
estimate the parameters in the model by maximum likelihood, by extending the
NHPP likelihood presented above, whereby the log-likelihood is given by
l{θ; H(T)} = const. + ∑_{i=1}^{n} log λ(t_i) − ∑_{k=0}^{M(T)+N(T)} ρ^{M(t_k)} σ^{N(t_k)} ∫_{t_k}^{t_{k+1}} λ₀(t) exp(βᵀx_t) dt ,    (8.24)

where the t_k are the ordered event times, with t₀ = 0 and t_{M(T)+N(T)+1} = T.
This corresponds to the simple case where the scaling factors are constant: minor
changes are needed for the more general cases.
To test for trend, we can compare the Laplace test statistic

U = [ ∑_{i=1}^{n} t_i − nt/2 ] / [ t √(n/12) ]    (8.25)
with standard normal critical values, rejecting the null hypothesis of no trend if
U ∉ (−z_{p/2}, z_{p/2}) for a hypothesis test at the 100p% level of significance, where
the proportion p represents the size of the test. For a 5% significance test, the
critical values are given by ±z_{p/2} = ±1.960.
If we decide that a system is nonstationary, we could use the VAM or PHM,
which are easier to fit to data than the stochastic processes considered next, but are
less robust because of their statistical rather than mathematical derivation.
However, all of these models require numerical computation to some extent. The
VAM and PHM might provide a better fit to the observed data on occasions.
If model M₂ is nested within model M₁, with p₂ < p₁ parameters respectively, then under the simpler model the likelihood ratio statistic satisfies

2 log(L₁/L₂) ∼ χ²(p₁ − p₂)    (8.26)
and so we can test whether the extra parameters are significant. This is particularly
beneficial when choosing which elements to include in a linear predictor.
If the models M₁ and M₂ are not nested, we cannot use this formal test and
instead simply compare the log-likelihood functions log L₁ and log L₂, choosing the model
with the larger log-likelihood. This is appropriate for choosing between gamma and
Weibull baseline hazard functions, for example. However, it is only valid if p1 = p2 ,
as a model with more parameters often fits better than a model with fewer para-
meters, by definition. To compare non-nested models with different numbers of para-
meters, we usually apply a correction factor to the log-likelihood functions.
Two common modified forms are the Akaike information criterion (AIC),
which suggests that we compare log L₁ − p₁ with log L₂ − p₂, and the Schwarz
criterion, or Bayes information criterion (BIC), which suggests that we compare
log L₁ − (p₁ log n)/2 with log L₂ − (p₂ log n)/2, where n is the number of obser-
vations in the data set. The latter arises as the limiting case of the posterior odds
resulting from a Bayesian analysis with reference priors. In each case, the best
model to choose is the one that maximizes the information criterion.
Example 8.5 Suppose we fit two non-nested models to a set of lifetime data,
based on n = 31 observed failures. The first model contains three parameters and
has a likelihood of L₁ = 8.742 × 10⁻¹⁸. The second model contains five parameters
and has a likelihood of L₂ = 3.110 × 10⁻¹⁷. The Bayes information criterion for the
first model is log L₁ − (p₁ log n)/2 ≈ −44.43 and for the second model it is
log L₂ − (p₂ log n)/2 ≈ −46.59, so we prefer the first, simpler model here.
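The comparison in Example 8.5 can be scripted directly; the sketch below transcribes the BIC formula above, with the likelihood values of the example plugged in.

```python
import math

def bic(log_likelihood, p, n):
    """Bayes information criterion in the form used here: log L - (p log n) / 2.
    Larger values indicate the preferred model."""
    return log_likelihood - p * math.log(n) / 2

n_obs = 31
bic1 = bic(math.log(8.742e-18), p=3, n=n_obs)   # first model, three parameters
bic2 = bic(math.log(3.110e-17), p=5, n=n_obs)   # second model, five parameters
# bic1 ≈ -44.43 exceeds bic2 ≈ -46.59, so the simpler first model is preferred.
```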
The sample mean of the simulated costs K_i over m repetitions,

K̄ = (1/m) ∑_{i=1}^{m} K_i ,    (8.27)

represents an unbiased estimator for the total cost per PM interval. This enables us
to estimate the expected cost per unit time as K̄/t.
Now we must repeat the whole simulation for different values of t , using an
efficient search algorithm, to determine the value of t that minimises this expected
cost per unit time. This is the recommended PM interval duration. We advocate
direct search algorithms for practical implementation, such as golden-section search.
For practical purposes, t is unlikely to vary continuously and discrete values will
dominate. Convenient multiples of days, weeks or months provide suitable units of
measurement for practical implementation.
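A minimal sketch of this simulation-and-search procedure follows. It assumes, purely for illustration, that each PM renews the system and that failures between PMs follow an NHPP with power-law cumulative intensity aL^b over an interval of length L; all costs, parameters and function names are hypothetical, and an exhaustive search over a handful of convenient discrete intervals stands in for a more refined direct search such as golden-section.

```python
import random

def poisson_count(rng, mean):
    """Draw a Poisson variate by counting unit-rate exponential arrivals."""
    count, s = 0, rng.expovariate(1.0)
    while s < mean:
        count += 1
        s += rng.expovariate(1.0)
    return count

def expected_cost_rate(interval, runs=400, horizon=3650.0,
                       pm_cost=1.0, cm_cost=5.0, a=0.0002, b=2.0, seed=1):
    """Monte Carlo estimate of the expected cost per unit time for a given PM
    interval, averaging simulated total costs as in Equation 8.27. Each PM is
    assumed to renew the system, so failures per interval are NHPP counts with
    mean a * interval**b; all numbers here are purely illustrative."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        cost, t = 0.0, 0.0
        while t < horizon:
            cost += pm_cost + cm_cost * poisson_count(rng, a * interval**b)
            t += interval
        total += cost / horizon
    return total / runs

# Convenient discrete candidate intervals (days), since continuous variation of
# the PM interval is rarely practical.
candidates = [7, 30, 61, 91, 182, 365]
best = min(candidates, key=expected_cost_rate)
```

Under these made-up parameters the trade-off between frequent PM cost and accumulating failures has an interior optimum among the candidates.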
The number of events in an interval of duration τ beyond the current time t satisfies

P{N(t+τ) − N(t) = n | H(t)} = [ {Λ_t(τ)}ⁿ / n! ] exp{−Λ_t(τ)} ,    (8.28)

where

Λ_t(τ) = ∫_t^{t+τ} λ(u) du .    (8.29)

The corresponding reliability and density functions for the time to the next event are

R_t(τ) = P{N(t+τ) − N(t) = 0 | H(t)} = exp{−Λ_t(τ)} ,    (8.30)

f_t(τ) = −R_t′(τ) = λ(t+τ) exp{−Λ_t(τ)} .    (8.31)
This allows us to simulate the process as before, evaluate expected costs over a
finite horizon, and so deduce the most economical time for the next preventive
maintenance. This decision can be made at any specific event, such as during PM
or CM, or even between events, so long as the intensity function is known.
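When the intensity is known, one standard way to simulate the time of the next event is thinning (Lewis–Shedler): generate candidate points from a bounding rate and accept each with probability λ(t)/λ_max. The intensity function and all numbers below are hypothetical.

```python
import random

def next_event_time(intensity, t_now, lam_max, rng, t_limit=10_000.0):
    """Simulate the next event time after t_now for a point process with known
    intensity, by Lewis-Shedler thinning. lam_max must bound intensity(t) up to
    t_limit for the acceptance step to be valid."""
    t = t_now
    while t < t_limit:
        t += rng.expovariate(lam_max)              # candidate point at rate lam_max
        if t < t_limit and rng.random() < intensity(t) / lam_max:
            return t                               # accepted with prob intensity(t)/lam_max
    return float("inf")                            # no event before t_limit

# Hypothetical increasing intensity since the last maintenance action;
# 0.001 * t**0.5 is bounded by 0.1 for t <= 10,000.
intensity = lambda t: 0.001 * t**0.5
rng = random.Random(42)
t_next = next_event_time(intensity, t_now=100.0, lam_max=0.1, rng=rng)
```

Repeating such draws and accumulating PM and CM costs gives the finite-horizon expected-cost evaluation described above.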
Next we consider the proportional hazards model. To avoid referring separately
to the hazard functions of the lifetimes u and v, consider a general hazard function h(x).
For the purposes of simulation in order to schedule PM in the future, the reliability
function can be determined as
R(x) = exp( −∫₀ˣ h(w) dw ) ,    (8.32)

and the corresponding density as

f(x) = −R′(x) = h(x) exp( −∫₀ˣ h(w) dw ) .    (8.33)
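For a general hazard function, Equations 8.32 and 8.33 can be evaluated numerically. The trapezoidal-rule sketch below is our own, checked against a case with a known closed form (h(w) = 2w, for which R(x) = exp(−x²)).

```python
import math

def reliability(h, x, steps=1000):
    """Numerically evaluate R(x) = exp(-integral_0^x h(w) dw), Equation 8.32,
    by the trapezoidal rule, for any hazard function h."""
    grid = [x * k / steps for k in range(steps + 1)]
    integral = sum((h(grid[k]) + h(grid[k + 1])) * (x / steps) / 2
                   for k in range(steps))
    return math.exp(-integral)

def density(h, x, steps=1000):
    """f(x) = h(x) R(x), as in Equation 8.33."""
    return h(x) * reliability(h, x, steps)

# Check against the case h(w) = 2w, where R(x) = exp(-x**2) exactly.
r = reliability(lambda w: 2 * w, 1.5)
d = density(lambda w: 2 * w, 1.5)
```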
8.9 Applications
We now apply some of these models to the data sets in Section 8.2.
Example 8.6 For each system, we fitted the intensity reduction model using
constant, loglinear and power-law baseline intensities with constant reduction
factors. Its goodness of fit is measured by the log-likelihoods in Table 8.5, obtained
using Mathcad software. For comparison, we also display the log-likelihoods for
the extremes of the renewal process (maximal repairs) and the nonhomogeneous
Poisson process (minimal repairs).
As expected, the intensity reduction model provides a good fit to all three
systems, preferring the power-law baseline intensity for the happy system and the
loglinear baseline intensity for the sad and noncommittal systems. Figure 8.4
shows that these baseline intensities are all increasing functions and any apparent
happiness is due to the high quality of repairs rather than a self-improving system.
[Three plots of the fitted intensity function λ(t, a, b, s) over 0 ≤ t ≤ 410, each on a vertical scale from 0 to 0.18.]
Figure 8.4. Best fitting models for happy, sad and noncommittal systems, respectively
Example 8.7 For a system with n = 22 failures observed over a period of
t = 2,128 days, with failure times summing to ∑ t_i = 21,901, the Laplace trend
test statistic of Equation 8.25 is

U = [ 21,901 − 22(2,128)/2 ] / [ 2,128 √(22/12) ] ≈ −0.5230 .    (8.34)
As −1.960 < U < 1.960, the test is not significant at the 5% level and we conclude
that this test provides no evidence of non-stationarity for these data. Consequently,
the delayed renewal process might provide an adequate fit to these data, without
the need for a more complicated model. However, we might consider using the
DARP if downtime is important, or one of the later models if concomitant
information is also available.
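The Laplace trend test is straightforward to code. The function below is our sketch, checked against the figures in Equation 8.34 (n = 22 failures summing to 21,901 over 2,128 days); since the statistic depends only on n and the sum of the failure times, any times with that sum reproduce it.

```python
import math

def laplace_trend_statistic(failure_times, t):
    """Laplace trend test statistic of Equation 8.25 for failure times observed
    over a period of length t; compare with +/-1.960 for a two-sided 5% test."""
    n = len(failure_times)
    return (sum(failure_times) - n * t / 2) / (t * math.sqrt(n / 12))

# Reproduce Equation 8.34 from n and the sum of the failure times alone.
times = [21_901 / 22] * 22
u = laplace_trend_statistic(times, 2_128)
# u ≈ -0.523, inside (-1.960, 1.960): no evidence of trend.
```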
Example 8.8 Here the data comprise 65 event observations collected over seven
years. In the first half of this period, there were 15 CM and 11 PM actions. In the
second half of this period, there were 29 CM actions and 10 PM actions. Hence,
this is a sad system, which might benefit from preventive maintenance. We fit the
generalized proportional intensities model to these data with explanatory variables
representing quality of last maintenance and time since last maintenance. A
loglinear baseline with constant reduction factors generates the results in Table 8.6.
Table 8.6. Log-likelihoods and parameter estimates for GPIM analyses of oil pump data
The best model includes both quality of last maintenance action and time
since last maintenance action as predictor variables. This is not surprising, as it
contains six parameters whereas the model with no predictor variables has only
four. As the associated PM reduction factor is about two-thirds, preventive
maintenance reduces the intensity of critical failures for this system and so
improves its reliability. Although slightly less impressive, corrective maintenance
reduces the intensity function too. Hence, the maintenance workforce appears to be
very effective for this application! A graph of the intensity function for the GPIM
with both covariates follows in Figure 8.5, based on the corresponding parameter
estimates in the last row of Table 8.6.
[Plot of the fitted intensity function λ(t, a, b, r, s, c1, c2) over 0 ≤ t ≤ 2487, on a vertical scale from 0 to 0.1.]
Figure 8.5. Intensity function for GPIM analysis of oil pump data with two covariates
We now perform a simulation analysis for this last model based on the methods
described in Section 8.8, in order to determine an optimal strategy for scheduling
preventive maintenance. Several convenient PM intervals are considered for our
calculations, including weekly, monthly, two-monthly, quarterly, biannually, annu-
ally and biennially. The minimum cost per unit time over a ten-year fixed horizon
is achieved with monthly PM and generates a projected 80% saving over annual
PM, though this estimated reduction in costs is sensitive to the choice of model.
The previously implemented policy averages about three PM actions per year, which
our simulation estimates would cost about four times as much in preventive main-
tenance when compared with the optimal policy of monthly PM.
8.10 Conclusions
This chapter discussed the ideas of modelling complex repairable systems, with the
intention of scheduling preventive maintenance to improve operational efficiency
and reduce running costs. It started by emphasising the importance of improved,
accurate and complete data collection in practice. It then presented the renewal
process, delayed renewal process and delayed alternating renewal process as
reasonable models for systems that exhibit stationary failure patterns.
The virtual age model and proportional hazards model were described as
suitable for systems that do not exhibit stationarity and for systems where predictor
variables such as condition monitoring observations are also measured. The non-
homogeneous Poisson process, intensity reduction model and proportional inten-
sities model, with a promising generalization, were described next. We claim that
these models offer natural interpretations of the underlying physical reliability and
maintenance processes.
Finally, this chapter demonstrated some applications of these ideas using
reliability and maintenance data taken from the oil industry and reviewed several
methods for model selection and goodness-of-fit testing, including graphs, Laplace
trend test, likelihood ratios and the Akaike and Bayes information criteria. The use
of mathematical modelling and statistical analysis in this fashion can improve, and
has improved, the quality of PM scheduling. This can then result in considerable
cost savings and help to improve system availability.
9 Artificial Intelligence in Maintenance

Khairy A. H. Kobbacy
9.1 Introduction
Over the past two decades there has been substantial research and development in
operations management, including maintenance. Kobbacy et al. (2007) argue that the
continuous research in these areas implies that solutions have not been found to many
problems. This was attributed to the fact that many of the solutions proposed were
for well-defined problems, that the solutions assumed accurate data were available
and that the solutions were too computationally expensive to be practical. Artificial
intelligence (AI) was recognised by many researchers as a potentially powerful tool
especially when combined with OR techniques to tackle such problems. Indeed,
there has been vast interest in the applications of AI in the maintenance area as
witnessed by the large number of publications in the area. This chapter reviews the
application of AI in maintenance management and planning and introduces the
concept of developing an intelligent maintenance optimisation system.
The outline of the chapter is as follows. Section 9.2 deals with various main-
tenance issues including maintenance management, planning and scheduling. Sec-
tion 9.3 introduces a brief definition of AI, some of its techniques that have appli-
cations in maintenance and Decision Support Systems. A review of the literature is
then presented in Section 9.4 covering the applications of AI in maintenance. We
have focused on five AI techniques, namely knowledge based systems, case based
reasoning, genetic algorithms, neural networks and fuzzy logic. This review also
covers hybrid systems where two or more of the above mentioned AI techniques
are used in an application. Other AI techniques seem to have very few applications
in maintenance to date. A discussion of the development of the prototype hybrid
intelligent maintenance optimisation system (HIMOS) which was developed to
evaluate and enhance preventive maintenance (PM) routines of complex engineer-
ing systems follows in Section 9.5. HIMOS uses knowledge based system to
identify suitable models to schedule PM activities and case base reasoning to add
capability to utilise past experience in model selection. Future developments and
210 K. Kobbacy
great promise and indeed being investigated for application in more complex PM
situations, e.g., multiple PM routines.
9.3 AI Techniques
AI is a branch of computer science that develops programmes to allow machines to
perform functions normally requiring human intelligence (Microsoft ENCARTA
College Dictionary 2001). The goal of AI is to teach machines to think to a
certain extent under special conditions (Firebaugh 1988). There are many AI
techniques; those most used in maintenance decision support are as follows.
Knowledge based systems (KBS): use of domain specific rules of thumb or
heuristics (production rules) to identify a potential outcome or suitable course of
action.
Case based reasoning (CBR): utilises past experiences to solve new problems. It
uses case index schemes, similarity functions and adaptation. It provides machine
learning through updating of the case base.
Genetic algorithms (GAs): these are based on the principle that solutions can
evolve. Potential promising solutions evolve through mutation and weaker solutions
become extinct.
Neural networks (NNs): use the back-propagation algorithm to emulate the behaviour
of the human brain. Both NNs and GAs are capable of learning how to classify,
cluster and optimise.
Fuzzy logic (FL): allows the representation of information of uncertain nature.
It provides a framework in which membership of a category is graded and hence
quantifies such information for mathematical modelling, etc.
There are several other AI techniques and these include Data Mining, Robotics
and Intelligent Agents. However, to date very few publications are available about
their applications in maintenance.
9.4 AI in Maintenance
AI techniques have been used successfully in the past two decades to model and
optimise maintenance problems. Since the resurgence of AI in the mid-1980s,
researchers have considered the applications of AI in this field. The article by
Dhaliwal (1986) is one of the early ones that argued for the appropriateness of
using AI techniques for addressing the issues of operating and maintaining large
and complex engineering systems. Kobbacy (1992) discusses the useful role of
knowledge based systems in the enhancement of maintenance routines. Over the
years the applications of AI in maintenance grew to cover a very wide area of appli-
cations using a variety of AI techniques. This can be explained by the individual
nature of each technique. For example, GAs and NNs have the advantage of being
useful in optimising complex and nonlinear problems, and overcome the limitations
of the classic black box approaches, where an attempt is made to identify the system
by relating system outputs to inputs without understanding and modelling the un-
derlying process. Hence the widespread applications in the scheduling area and
also in fault diagnosis.
In this section, an up-to-date survey is presented covering the area of appli-
cation of AI techniques in maintenance, including fault diagnosis. This chapter will
only refer to some of the references in the vast applications of AI in fault diagnosis.
Interested readers can refer to the recent comprehensive review by Kobbacy et al.
(2007) on applications of AI in Operations.
GAs are popular in maintenance applications because of their robust search capabili-
ties that help reduce the computational complexity of large optimisation problems
(Morcous and Lounis 2005), such as large scale maintenance scheduling models.
GAs have applications in infrastructure networks including programming the main-
tenance of concrete bridge decks (Morcous and Lounis 2005; Lee and Kim 2007),
pavement maintenance programme (Chootinan et al. 2006), and optimising highway
life-cycle by considering maintenance of roadside appurtenances (Jha and Abdullah
2006). GAs also have applications in maintenance activities in nuclear power plants
including optimising the technical specification of a nuclear safety system by
coupling GAs and Monte Carlo simulation in an attempt to minimise the expected value
of system unavailability and its associated variance (Marseguerra et al. 2004).
Another important area of application is in manufacturing. Ruiz et al. (2006)
present an approach for scheduling of PM in a flowshop problem with the aim of
maximising availability. Sortrakul et al. (2005) present a heuristic based on genetic
algorithms to solve an integrated optimisation model for production scheduling and
preventive maintenance planning. Chan et al. (2006) propose a GA approach to
deal with distributed flexible manufacturing system scheduling problem subject to
machine maintenance constraint. Other popular application areas for GAs include
preventive maintenance scheduling optimisation. Application areas in PM include
chemical process operations (Tan and Kramer 1997), power systems (Huang
1998), single product manufacturing production line (Cavory et al. 2001) and me-
chanical components (Tsai et al. 2001). GAs are also used in deciding on oppor-
tunistic maintenance policies (Saranga 2004; Dragan et al. 1995).
GAs have had some moderate but constant interest over the past decade in the
area of fault diagnosis. Applications range from manufacturing systems (Khoo et
al. 2000), nuclear power plants (Yangping et al. 2000), electrical distribution net-
works (Wen and Chang 1998) to a new area of application in automotive fuel cell
power generators (Hissel et al. 2004).
NNs are a popular AI technique applied in the areas of maintenance and in particular
in fault diagnosis. NNs are the primary information processing structure used in
neurocomputing, i.e. systems that learn the relationship between data through a
process of training (Dendronic Decisions Ltd 2003).
NNs have many applications in the areas of predictive maintenance and
condition monitoring. Gilabert and Arnaiz (2006) present a case study for non-
critical machinery, where an NN is used for elevator monitoring and diagnosis as no
previous experience existed. Al-Garni et al. (2006) also use an NN for predicting the
failure rate of airplane tyres. Gromann de Araujo Goes et al. (2005) have
developed a computerised online reliability monitoring system for nuclear power
plant applications. An interesting application, developed by Garcia et al. (2004),
uses NNs to aid tele-maintenance, where staff can carry out the work remotely and
in collaboration with other experts. Other applications of NNs in condition
monitoring include the work of Bansal et al. (2004) on machine systems and Booth
and McDonald (1998) on electrical power transformers.
FL has been used in various applications in the maintenance area to deal with
uncertainty. Oke and Charles-Owaba (2006) apply an FL control model to Gantt
charting of preventive maintenance scheduling. Al-Najjar and Alsyouf (2003) use
fuzzy multiple criteria decision making to select in advance the most informative
(efficient) maintenance approach, i.e. strategies, policies or philosophies. Braglia et
al. (2003) adopt FL in an approach that allows analysts to formulate efficient
assessments of possible causes of failure in failure mode, effects and criticality analysis.
Sudiarso and Labib (2002) investigated an FL approach to an integrated maintenance/
production scheduling algorithm. Jeffries et al. (2001) develop an efficient hybrid
method for capturing machine information in a packaging plant using FL, fuzzy
condition monitoring, in order to reduce wastage and maintenance overheads.
Examples of FL hybrid applications include the use of a KBS for bridge
damage diagnosis which aims at providing information about the impact of design
factors on bridge deterioration with FL used to handle uncertainties (Zhao and
Chen 2001). Sinha and Fieguth (2006) propose a neuro-fuzzy classifier that com-
bines FL and NNs for the classification of defects by extracting features in seg-
mented buried pipe images.
Applications for FL in fault diagnosis include fault diagnosis of railway wheels
(Skarlatos et al. 2004), thrusters for an open-frame underwater vehicle (Omerdic
and Roberts 2004), chemical processes (Dash et al. 2003) and rolling element
bearings in machinery (Mechefske 1998).
The main functional features which would be expected of such a system, to cope
with the above situation are (Kobbacy 2004):
HIMOS aims at deciding the optimal PM cycle interval for a repairable system by
selecting and applying the most appropriate optimisation model automatically and
without the need for expert interference (Kobbacy and Jeon 2001). HIMOS is the
result of developing its predecessor IMOS (Kobbacy et al. 1995b), the intelligent
maintenance optimisation system, which used rule based reasoning to select an
appropriate model for analysis. HIMOS employs hybrid reasoning by combining
rule-based reasoning (RBR) and case-based reasoning (CBR) to choose a model
from a model base for a given data set.
Analysis of a typical large data file by IMOS showed that about two thirds of
components cannot be modelled, mostly because of insufficient history data
needed for model selection (Kobbacy 2004). However, some of the cases which
could not be modelled may have parameters with values close to those of a model's
acceptance level as stated in the rule base. By introducing case based reasoning, the
system can model cases which are not identified by the rule base, provided it has
analysed similar cases in the past. Thus, such a hybrid (KBS and CBR) system is
expected to increase the previously low percentage of cases where the system
is able to identify a suitable model.
Figure 9.1 illustrates the conceptual structure of HIMOS which is divided into
two areas. The DSS contribution area contains a database to store maintenance
historical data, a model base for data analysis models and optimisation models, and
a user interface to communicate with the user. In the AI contribution area, there are
two bases which contain experts' knowledge: a knowledge base and a case base.
Data Formatting and Analysis After reading data from the input data file, the
system formats and checks the data to create a suitable data set for the next step of
analysis. Suspect or missing items of data are flagged in order to be sorted out by
the system or investigated by the user.
The analysis consists of five steps: recognition of PM and CO patterns, calcu-
lation of current availability, Weibull distribution fitting to failure times, trend test
of frequency and severity to establish data stationarity with respect to frequency
and severity or otherwise, and if applicable analysis of Multi-PM cases. In the first
step a basic analysis is carried out to identify the features of the data set such as the
numbers of PM and CO events and the mean lives to failure, so that the data set
can be compared with characteristic data patterns in the model selection process.
The data produced in this process are referred to as metadata.
Model Base The model base contains two sets of models: the data analysis models
and the PM scheduling optimisation models. The data analysis models identify a
data pattern which together with the RBR/CBR help to select an optimisation
Model Selection Using the Rule-Base In HIMOS the rule base (or knowledge base)
consists of a list of rules capturing some of the knowledge of experts in maintenance
modelling concerning mathematical modelling techniques and their applicability to
various situations. The rules match data sets to the models by searching for patterns
in the data set for each component such as relative numbers of CO and PM events,
component life distribution, range of PM intervals, etc. The approach used to develop
the rule base is described in Kobbacy (2004). The knowledge base implemented in
HIMOS consists of a set of 15 rules, an example of which is shown below. If the
rule base fails to identify a suitable model, the CBR is invoked.
Model Selection Using CBR CBR is an approach to problem solving that utilises
past experiences to solve new problems. The first step in the operation of a CBR
system is the retrieval in which the inputs are analysed to determine the critical
features to use in retrieving past cases from the case database. Among the well
known methods for case retrieval is the nearest neighbour method, which is used in
HIMOS. To find the nearest neighbour matching the case being considered, the
case with the largest weighted average of similarity functions for selected features
is selected. In HIMOS four features were selected and all given equal weights.
These features are: number of PM actions, number of CO actions, trend value and
variability of PM cycle length. The reason for selecting these features is that they
were found to be the main causes of failure to select a suitable model using the rule
based system. The similarity function was based on the difference between the
values of a feature in the current and retrieved cases, divided by the standard
deviation of that feature.
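A minimal sketch of this retrieval step follows. The particular similarity transform (1/(1 + scaled distance)), the feature names and the metadata records below are all our own illustrative choices, not HIMOS internals.

```python
import statistics

def retrieve_nearest(case, case_base, features, weights=None):
    """Nearest-neighbour retrieval in the HIMOS style: score each past case by
    a weighted average of per-feature similarities, where each similarity is
    derived from the feature difference scaled by that feature's standard
    deviation across the case base."""
    weights = weights or {f: 1.0 for f in features}
    sds = {f: statistics.pstdev([c[f] for c in case_base]) or 1.0 for f in features}

    def score(past):
        sims = [1.0 / (1.0 + abs(case[f] - past[f]) / sds[f]) for f in features]
        return sum(weights[f] * s for f, s in zip(features, sims)) / sum(weights.values())

    return max(case_base, key=score)

# Hypothetical metadata records: PM count, CO count, trend value, PM-cycle variability.
base = [
    {"n_pm": 10, "n_co": 4, "trend": 0.2, "pm_var": 0.1, "model": "Geometric I"},
    {"n_pm": 3,  "n_co": 9, "trend": 1.1, "pm_var": 0.6, "model": "NHPP"},
]
features = ["n_pm", "n_co", "trend", "pm_var"]
match = retrieve_nearest({"n_pm": 9, "n_co": 5, "trend": 0.3, "pm_var": 0.2},
                         base, features)
```

With equal weights, as in HIMOS, the query retrieves the past case closest on all four metadata features.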
Once the best matching case has been retrieved, adaptation is carried out to
reduce any prominent difference between the retrieved case and the current case
through the derivational replay method. Thus in the CBR phase, the system uses
rules similar to those used in the KBS phase to find a solution. However some
critical values in the adaptation rules are more relaxed compared with the original
rules.
In the evaluation step the system displays multiple candidate models (possible
solutions) with their critical features for the current case (adaptation results). The
user can then evaluate these alternatives and select one using their expertise.
For the non-expert user, the system itself provides the Recommended Model
as a result of evaluation. Here the system compares the results of adaptation with
the results of retrieval. If there is no matching model then no recommendation is
made, otherwise the system recommends the matching model. If there is more than
one matching model, the system merely recommends the first ranked (nearest
neighbour) model.
Table 9.1. Percentage use of maintenance models for HIMOS when applied to large
systems, 1633 components in three data files (Kobbacy 2004)
Model               HIMOS (RBR)   HIMOS (RBR+CBR)
Stochastic RP            6.6            12.8
NHPP                     1.6             1.6
NRP                      2.3             2.3
Total stochastic        10.5            16.4
Geometric I             15.7            23.5
Geometric II             1.7             1.8
Weibull                  1.7             3.7
Deterministic            1.8             1.9
No model suitable       68.6            52.7
HIMOS was validated using test cases by comparing the results of analysis of
selected cases by HIMOS with the recommendations of an expert panel.
For the validation of HIMOS, eight data sets were used and a panel of five experts
was involved. In general there was agreement between HIMOS and the experts.
The experts had a measure of disagreement in their advice as a result of making
different assumptions in their analysis. Experts also made useful suggestions for the
operation of the system. Table 9.2 is a typical example of HIMOS and the experts'
recommendations.
Figure 9.3. Outline design of AMMCM (Adaptive Maintenance Measurement and Control
Model)
9.8 Acknowledgments
The author wishes to acknowledge the contributions of those who collaborated at
the various stages of the development of IMOS and HIMOS. In particular I wish to
acknowledge the significant contribution of A.L. Labib in developing the proposal
for the AMMCM presented in Section 9.6.
9.9 References
Acosta, G.G., Verucchi, C.J. and Gelso, E.R. (2006), A current monitoring system for diagnosing electrical failures in induction motors, Mechanical Systems and Signal Processing, 20, 953–965.
Ahmed, K., Langdon, A. and Frieze, P.A. (1991), An expert system for offshore structure inspection and maintenance, Computers and Structures, 40, 143–159.
Al-Garni, A.Z., Jamal, A., Ahmad, A.M., Al-Garni, A.M. and Tozan, M. (2006), Neural network-based failure rate prediction for De Havilland Dash-8 tires, Engineering Applications of Artificial Intelligence, 19, 681–691.
Al-Najjar, B. and Alsyouf, I. (2003), Selecting the most efficient maintenance approach using fuzzy multiple criteria decision making, International Journal of Production Economics, 84, 85–100.
Ascher, H.E. and Kobbacy, K.A.H. (1995), Modelling preventive maintenance for deteriorating repairable systems, IMA Journal of Mathematics Applied in Business & Industry, 6, 85–99.
Bansal, D., Evans, D.J. and Jones, B. (2004), A real-time predictive maintenance system for machine systems, International Journal of Machine Tools and Manufacture, 44, 759–766.
Baroni, P., Canzi, U. and Guida, G. (1997), Fault diagnosis through history reconstruction: an application to power transmission networks, Expert Systems with Applications, 12, 37–52.
Batanov, D., Nagarur, N. and Nitikhunkasem, P. (1993), EXPERT-MM: A knowledge-based system for maintenance management, Artificial Intelligence in Engineering, 8, 283–291.
Beaulah, S.A. and Chalabi, Z.C. (1997), Intelligent real-time fault diagnosis of greenhouse sensors, Control Engineering Practice, 5, 1573–1580.
Booth, C. and McDonald, J.R. (1998), The use of artificial neural networks for condition monitoring of electrical power transformers, Neurocomputing, 23, 97–109.
Braglia, M., Frosolini, M. and Montanari, R. (2003), Fuzzy criticality assessment model for failure modes and effects analysis, International Journal of Quality & Reliability Management, 20, 503–524.
Cavory, G., Dupas, R. and Goncalves, G. (2001), A genetic approach to the scheduling of preventive maintenance tasks on a single product manufacturing production line, International Journal of Production Economics, 74, 135–146.
Chan, F.T.S., Chung, S.H., Chan, L.Y., Finke, G. and Tiwari, M.K. (2006), Solving distributed FMS scheduling problems subject to maintenance: Genetic algorithms approaches, Robotics and Computer-Integrated Manufacturing, 22, 493–504.
Chen, Q., Chan, Y.W. and Worden, K. (2003), Structural fault diagnosis and isolation using neural networks based on response-only data, Computers & Structures, 81, 2165–2172.
Artificial Intelligence in Maintenance 227
Chootinan, P., Chen, A., Horrocks, M.R. and Bolling, D. (2006), A multi-year pavement maintenance program using a stochastic simulation-based genetic algorithm approach, Transportation Research Part A: Policy and Practice, 40, 725–743.
Cunningham, P., Smyth, B. and Bonzano, A. (1998), An incremental retrieval mechanism for case-based electronic fault diagnosis, Knowledge-Based Systems, 11, 239–248.
Dash, S., Rengaswamy, R. and Venkatasubramanian, V. (2003), Fuzzy-logic based trend classification for fault diagnosis of chemical processes, Computers & Chemical Engineering, 27, 347–362.
de Brito, J., Branco, F.A., Thoft-Christensen, P. and Sorensen, J.D. (1997), An expert system for concrete bridge management, Engineering Structures, 19, 519–526.
Dendronic Decisions Ltd (2003), www.dendronic.com/articles.htm.
Dhaliwal, D.S. (1986), The use of AI in maintaining and operating complex engineering systems, in Expert Systems and Optimisation in Process Control, A. Mamdani and J. Efstathiou, eds, 28–33, Gower Technical Press, Aldershot.
Dragan, A.S., Walters, G.A. and Knezevic, J. (1995), Optimal opportunistic maintenance policy using genetic algorithms, 1: formulation, Journal of Quality in Maintenance Engineering, 1, 34–49.
Drury, C.G. and Prabhu, P. (1996), Information requirements of aircraft inspection: framework and analysis, International Journal of Human-Computer Studies, 45, 679–695.
Eldin, N.N. and Senouci, A.B. (1995), Use of neural networks for condition rating of joint concrete pavements, Advances in Engineering Software, 23, 133–141.
Feldman, R.M., William, M.L., Slade, T., McKee, L.G. and Talbert, A. (1992), The development of an integrated mathematical and knowledge-based maintenance delivery system, Computers & Operations Research, 19, 425–434.
Firebaugh, M.W. (1988), Artificial Intelligence: A Knowledge-based Approach, Boyd & Fraser Publishing Co., Danvers, MA, USA.
Frank, P.M. and Ding, X. (1997), Survey of robust residual generation and evaluation methods in observer-based fault detection systems, Journal of Process Control, 7, 403–427.
Frank, P.M. and Koppen-Seliger, B. (1997), New developments using AI in fault diagnosis, Engineering Applications of Artificial Intelligence, 10, 3–14.
Garcia, E., Guyennet, H., Lapayre, J.C. and Zerhouni, N. (2004), A new industrial cooperative tele-maintenance platform, Computers & Industrial Engineering, 46, 851–864.
Gilabert, E. and Arnaiz, A. (2006), Intelligent automation systems for predictive maintenance: A case study, Robotics and Computer-Integrated Manufacturing, 22, 543–549.
Gits, C.W. (1984), On the maintenance concept for a technical system, PhD Thesis, Eindhoven Technische Hogeschool, Eindhoven.
Gromann de Araujo Goes, A., Alvarenga, M.A.B. and Frutuoso e Melo, P.F. (2005), NAROAS: a neural network-based advanced operator support system for the assessment of systems reliability, Reliability Engineering & System Safety, 87, 149–161.
Hissel, D., Pera, M.C. and Kauffmann, J.M. (2004), Diagnosis of automotive fuel cell power generators, Journal of Power Sources, 128, 239–246.
Huang, S.J. (1998), Hydroelectric generation scheduling: an application of genetic-embedded fuzzy system approach, Electric Power Systems Research, 48, 65–72.
Hui, S.C., Fong, A.C.M. and Jha, G. (2001), A web-based intelligent fault diagnosis system for customer service support, Engineering Applications of Artificial Intelligence, 14, 537–548.
Jeffries, M., Lai, E., Plantenberg, D.H. and Hull, J.B. (2001), A fuzzy approach to the condition monitoring of a packaging plant, Journal of Materials Processing Technology, 109, 83–89.
228 K. Kobbacy
Jha, M.K. and Abdullah, J. (2006), A Markovian approach for optimising highway life-cycle with genetic algorithms by considering maintenance of roadside appurtenances, Journal of the Franklin Institute, 343, 404–419.
Jota, P.R.S., Islam, S.M., Wu, T. and Ledwich, G. (1998), A class of hybrid intelligent system for fault diagnosis in electric power systems, Neurocomputing, 23, 207–224.
Khoo, L.P., Ang, C.L. and Zhang, J. (2000), A fuzzy-based genetic approach to the diagnosis of manufacturing systems, Engineering Applications of Artificial Intelligence, 13, 303–310.
Kobbacy, K.A.H. (1992), The use of knowledge-based systems in evaluation and enhancement of maintenance routines, International Journal of Production Economics, 24, 243–248.
Kobbacy, K.A.H. (2004), On the evolution of an intelligent maintenance optimisation system, Journal of the Operational Research Society, 55, 139–146.
Kobbacy, K.A.H. and Jeon, J. (2001), The development of a hybrid intelligent maintenance optimisation system (HIMOS), Journal of the Operational Research Society, 52, 762–778.
Kobbacy, K.A.H., Percy, D.F. and Fawzi, B.B. (1995a), Sensitivity analysis for preventive maintenance models, IMA Journal of Mathematics Applied in Business & Industry, 6, 53–66.
Kobbacy, K.A.H., Proudlove, N.L. and Harper, M.A. (1995b), Towards an intelligent maintenance optimisation system, Journal of the Operational Research Society, 46, 229–240.
Kobbacy, K.A.H., Fawzi, B.B., Percy, D.F. and Ascher, H.E. (1997), A full history proportional hazards model for preventive maintenance modelling, Quality and Reliability Engineering International, 13, 187–198.
Kobbacy, K.A.H., Percy, D.F. and Sharp, J.M. (2005), Results of preventive maintenance survey, unpublished report, University of Salford.
Kobbacy, K.A.H., Vadera, S. and Rasmy, M.H. (2007), AI and OR in management of operations: history and trends, Journal of the Operational Research Society, 58, 10–28.
Kohno, T., Hamada, S., Araki, D., Kojima, S. and Tanaka, T. (1997), Error repair and knowledge acquisition via case-based reasoning, Artificial Intelligence, 91, 85–101.
Kuo, H-C. and Chang, H-K. (2004), A new symbiotic evolution-based fuzzy-neural approach to fault diagnosis of marine propulsion systems, Engineering Applications of Artificial Intelligence, 17, 919–930.
Labib, A.W. (1998), World class maintenance using a computerised maintenance management system, Journal of Quality in Maintenance Engineering, 4, 66–75.
Lee, C-K. and Kim, S-K. (2007), GA-based algorithm for selecting optimal repair and rehabilitation methods for reinforced concrete (RC) bridge decks, Automation in Construction, 16, 153–164.
Leung, D. and Romagnoli, J. (2002), An integration mechanism for multivariate knowledge-based fault diagnosis, Journal of Process Control, 12, 15–26.
Lin, C-C. and Wang, H-P. (1996), Performance analysis of rotating machinery using enhanced cerebellar model articulation controller (E-CMAC) neural networks, Computers and Industrial Engineering, 30, 227–242.
Luxhoj, J.T. and Williams, T.P. (1996), Integrated decision support for aviation safety inspectors, Finite Elements in Analysis and Design, 23, 381–403.
Marseguerra, M., Zio, E. and Podofillini, L. (2004), A multiobjective genetic algorithm approach to optimisation of the technical specifications of a nuclear safety system, Reliability Engineering & System Safety, 84, 87–99.
Martland, C.D., McNeil, S., Acharya, D., Mishalani, R. and Eshelby, J. (1990), Applications of expert systems in railroad maintenance: scheduling rail relays, Transportation Research Part A: General, 24, 39–52.
Mechefske, C.K. (1998), Objective machinery fault diagnosis using fuzzy logic, Mechanical Systems and Signal Processing, 12, 855–862.
Microsoft ENCARTA College Dictionary (2001), St. Martin's Press, New York.
Miller, D., Mellichamp, J.M. and Wang, J. (1990), An image enhanced knowledge based expert system for maintenance trouble shooting, Computers in Industry, 15, 187–202.
Milne, R., Nicole, C. and Trave-Massuyes, L. (2001), TIGER with model based diagnosis: initial deployment, Knowledge-Based Systems, 14, 213–222.
Morcous, G. and Lounis, Z. (2005), Maintenance optimisation of infrastructure networks using genetic algorithms, Automation in Construction, 14, 129–142.
Nam, D.S., Jeong, C.W., Choe, Y.J. and Yoon, E.S. (1996), Operation-aided system for fault diagnosis of continuous and semi-continuous processes, Computers & Chemical Engineering, 20, 793–803.
Oke, S.A. and Charles-Owaba, O.E. (2006), Application of fuzzy logic control model to Gantt charting preventive maintenance scheduling, International Journal of Quality & Reliability Management, 23, 441–459.
Omerdic, E. and Roberts, G. (2004), Thruster fault diagnosis and accommodation for open-frame underwater vehicles, Control Engineering Practice, 12, 1575–1598.
Patel, S.A., Kamrani, A.K. and Orady, E. (1995), A knowledge-based system for fault diagnosis and maintenance of advanced automated systems, Computers & Industrial Engineering, 29, 147–151.
Percy, D.F., Kobbacy, K.A.H. and Ascher, H.E. (1998), Using proportional intensities models to schedule preventive maintenance intervals, IMA Journal of Mathematics Applied in Business & Industry, 9, 289–302.
Rao, M., Yang, H. and Yang, H. (1998), Integrated distributed intelligent system architecture for incidents monitoring and diagnosis, Computers in Industry, 37, 143–151.
Ruiz, D., Canton, J., Nougues, J.M., Espuna, A. and Puigjaner, L. (2001), On-line fault diagnosis system support for reactive scheduling in multipurpose batch chemical plants, Computers & Chemical Engineering, 25, 829–837.
Ruiz, R., Garcia-Diaz, C. and Maroto, C. (2006), Considering scheduling and preventive maintenance in the flowshop sequencing problem, Computers & Operations Research, 34, 3314–3330.
Saranga, H. (2004), Opportunistic maintenance using genetic algorithms, Journal of Quality in Maintenance Engineering, 10, 66–74.
Scenna, N.J. (2000), Some aspects of fault diagnosis in batch processes, Reliability Engineering & System Safety, 70, 95–110.
Sergaki, A. and Kalaitzakis, K. (2002), Reliability Engineering & System Safety, 77, 19–30.
Sharma, R., Singh, K., Singhal, D. and Ghosh, R. (2004), Neural network applications for detecting process faults in packed towers, Chemical Engineering and Processing, 43, 841–847.
Shayler, P.J., Goodman, M. and Ma, T. (2000), The exploitation of neural networks in automotive engine management systems, Engineering Applications of Artificial Intelligence, 13, 147–157.
Shyur, H.J., Luxhoj, J.T. and Williams, T.P. (1996), Using neural networks to predict component inspection requirements for aging aircraft, Computers & Industrial Engineering, 30, 257–267.
Simani, S. and Fantuzzi, C. (2000), Fault diagnosis in power plant using neural networks, Information Sciences, 127, 125–136.
Sinha, S.K. and Fieguth, P.W. (2006), Neuro-fuzzy network for the classification of buried pipe defects, Automation in Construction, 15, 73–83.
Skarlatos, D., Karakasis, K. and Trochidis, A. (2004), Railway wheel fault diagnosis using a fuzzy-logic method, Applied Acoustics, 65, 951–966.
Sortrakul, N., Nachtmann, H.L. and Cassady, C.R. (2005), Genetic algorithms for integrated preventive maintenance planning and production scheduling for a single machine, Computers in Industry, 56, 161–168.
Spoerre, J.K. (1997), Application of the cascade correlation algorithm (CCA) to bearing fault classification problems, Computers in Industry, 32, 295–304.
Sprague, R.H. and Watson, H.J. (1986), Decision Support Systems: Putting Theory into Practice, Prentice Hall, Englewood Cliffs, New Jersey.
Srinivasan, D., Liew, A.C., Chen, J.S.P. and Chang, C.S. (1993), Intelligent maintenance scheduling of distributed system components with operating constraints, Electric Power Systems Research, 26, 203–209.
Sudiarso, A. and Labib, A.W. (2002), A fuzzy logic approach to an integrated maintenance/production scheduling algorithm, International Journal of Production Research, 40, 3121–3138.
Tan, J.S. and Kramer, M.A. (1997), A general framework for preventive maintenance optimization in chemical process operations, Computers & Chemical Engineering, 21, 1451–1469.
Tang, B-S., Jeong, S.K., Oh, Y-M. and Tan, A.C.C. (2004), Case-based reasoning system with Petri nets for induction motor fault diagnosis, Expert Systems with Applications, 27, 301–311.
Tarifa, E.E., Humana, D., Franco, S., Martinez, S.L., Nunez, A.F. and Scenna, N.J. (2003), Fault diagnosis for MSF using neural networks, Desalination, 152, 215–222.
Tsai, Y-T., Wang, K-S. and Teng, H-Y. (2001), Optimizing preventive maintenance for mechanical components using genetic algorithms, Reliability Engineering & System Safety, 74, 89–97.
Varde, P.V., Sankar, S. and Verma, A.K. (1998), An operator support system for research reactor operations and fault diagnosis through a connectionist framework and PSA based knowledge based system, Reliability Engineering and System Safety, 60, 53–69.
Varma, A. and Roddy, N. (1999), ICARUS: design and deployment of a case-based reasoning system for locomotive diagnostics, Engineering Applications of Artificial Intelligence, 12, 681–690.
Villanueva, H. and Lamba, H. (1997), Operator guidance system for industrial plant supervision, Expert Systems with Applications, 12, 441–454.
Wen, F. and Chang, C.S. (1998), A new approach to fault diagnosis in electrical distribution networks using a genetic algorithm, Artificial Intelligence in Engineering, 12, 69–80.
Wu, H., Liu, Y., Ding, Y. and Qiu, Y. (2004), Fault diagnosis expert system for modern commercial aircraft, Aircraft Engineering and Aerospace Technology, 76, 398–403.
Xia, Q. and Rao, M. (1999), Dynamic case-based reasoning for process operation support systems, Engineering Applications of Artificial Intelligence, 12, 343–361.
Yang, B-S. and Kim, K.J. (2006), Applications of Dempster-Shafer theory in fault diagnosis of induction motors, Mechanical Systems and Signal Processing, 20, 403–420.
Yang, B-S., Han, T. and Kim, Y-S. (2004), Integration of ART-Kohonen neural network and case-based reasoning for intelligent fault diagnosis, Expert Systems with Applications, 26, 387–395.
Yang, B-S., Lim, D-S. and Tan, A.C.C. (2005), VIBEX: an expert system for vibration fault diagnosis of rotating machinery using decision tree and decision table, Expert Systems with Applications, 28, 735–742.
Yangping, Z., Bingquan, Z. and DongXin, W. (2000), Application of genetic algorithms to fault diagnosis in nuclear power plants, Reliability Engineering & System Safety, 67, 153–160.
Yu, R., Iung, B. and Panetto, H. (2003), A multi-agents based e-maintenance system with case-based reasoning decision support, Engineering Applications of Artificial Intelligence, 16, 321–333.
Zhang, H.Y., Chan, C.W., Cheung, K.C. and Ye, Y.J. (2001), Fuzzy ARTMAP neural network and its application to fault diagnosis of navigation systems, Automatica, 37, 1065–1070.
Zhao, Z. and Chen, C. (2001), Concrete bridge deterioration diagnosis using fuzzy inference system, Advances in Engineering Software, 32, 317–325.
Part D

Maintenance of Repairable Systems

Bo Henry Lindqvist
10.1 Introduction
A commonly used definition of a repairable system (Ascher and Feingold 1984)
states that this is a system which, after failing to perform one or more of its
functions satisfactorily, can be restored to fully satisfactory performance by any
method other than replacement of the entire system. In order to encompass more realistic applications, and to cover much of the recent literature on the subject, we need to extend this definition to include the possibility of additional maintenance actions which
aim at servicing the system for better performance. This is referred to as preventive
maintenance (PM), where one may further distinguish between condition based
PM and planned PM. The former type of maintenance is due when the system ex-
hibits inferior performance while the latter is performed at predetermined points in
time.
Traditionally, the literature on repairable systems is concerned with modeling of
the failure times only, using point process theory. A classical reference here is
Ascher and Feingold (1984). The most commonly used models for the failure process of a repairable system are renewal processes (RP), including the homogeneous
Poisson processes (HPP), and nonhomogeneous Poisson processes (NHPP). While
such models are often sufficient for simple reliability studies, the need for more
complex models is clear. In this chapter we consider some generalizations and extensions of the basic models, with the aim of arriving at more realistic models which give a better fit to data. First we consider the trend renewal process (TRP) introduced
and studied in Lindqvist et al. (2003). The TRP includes NHPP and RP as special
cases, and the main new feature is to allow a trend in processes of non-Poisson
(renewal) type.
As exemplified by some real data, in the case where several systems of the
same kind are considered, there may be unobserved heterogeneity between the
systems which, if overlooked, may lead to non-optimal or possibly completely
wrong decisions. We will consider this in the framework of the TRP, which in Lindqvist et al. (2003) is extended to the so-called HTRP model that accounts for such heterogeneity.
The last extension of the basic models to be considered in the present chapter
consists of using Markov models to model the behavior of periodically inspected
systems in between inspections, with the use of separate Markov models for the
maintenance tasks at inspections.
Recent review articles concerning repairable systems and maintenance include Peña (2006) and Lindqvist (2006). A review of methods for the analysis of recurrent events, with an emphasis on medical applications, is given by Cook and Lawless (2002). General books on statistical models and methods in reliability, covering many of the topics considered here, are Meeker and Escobar (1998) and Rausand and Høyland (2004).
Consider a repairable system where time usually runs from \(t = 0\) and events occur at ordered times \(T_1, T_2, \ldots\). Here time is not necessarily calendar time, but can be for example operation time, number of cycles, number of kilometers run, length of a crack, etc. In the present treatment we shall disregard the time durations of repair and maintenance, and assume that the system is always restarted immediately after a failure or maintenance action. The inter-event, or inter-failure, times will be denoted \(X_1, X_2, \ldots\), where \(X_i = T_i - T_{i-1}\), \(i = 1, 2, \ldots\), and for convenience we define \(T_0 \equiv 0\). Figure 10.1 illustrates the notation. We also make use of the counting process representation \(N(t)\) = number of events in \((0, t]\).
In order to describe probability models for repairable systems we use some nota-
tion from the theory of point processes. A key reference is Andersen et al. (1993).
\(\mathcal{H}_t\) denotes the history of the failure process up to, but not including, time \(t\). The conditional intensity function of the process at time \(t\) is then defined as
Maintenance of Repairable Systems 237
\[
\gamma(t) = \lim_{\Delta t \downarrow 0} \frac{\Pr(\text{event in } [t, t + \Delta t) \mid \mathcal{H}_t)}{\Delta t}. \tag{10.1}
\]
From this we obtain an expression for the likelihood function, which is needed for statistical inference. Suppose that a single system as described above is observed from time 0 to time \(\tau\), resulting in observations \(T_1, T_2, \ldots, T_{N(\tau)}\). The likelihood function is then given by (Andersen et al. 1993, Section II.7)
\[
L = \left[ \prod_{i=1}^{N(\tau)} \gamma(T_i) \right] \exp\left\{ -\int_0^\tau \gamma(u)\, du \right\}. \tag{10.2}
\]
Consider a system with failure rate z (t ) . Suppose first that after each failure, the
system is repaired to a condition as good as new, called a perfect repair. In this
case the failure process can be modeled by a renewal process (RP) with inter-event
time distribution F , denoted RP( F ) . Clearly, the conditional intensity defined in
Equation 10.1 is given by
\[
\gamma(t) = z(t - T_{N(t-)}),
\]
where \(t - T_{N(t-)}\) is the time since the last failure strictly before time \(t\).
Suppose instead that after a failure, the system is repaired only to the state it
had immediately before the failure, called a minimal repair. This means that the
conditional intensity of the failure process immediately after the failure is the same
as it was immediately before the failure, and hence is exactly as it would be if no
failure had ever occurred. Thus we must have \(\gamma(t) = z(t)\), so that minimal repairs yield an NHPP with intensity function \(z(\cdot)\).
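The contrast between perfect and minimal repair can be illustrated by simulation. The following sketch is not from the chapter; it assumes, for illustration, a Weibull hazard \(z(t) = \beta t^{\beta-1}\), and counts failures on \([0, \tau]\) under the two repair regimes: perfect repair restarts the hazard clock (an RP), while minimal repair leaves it running (an NHPP with \(\Lambda(t) = t^{\beta}\)).

```python
import random

def count_failures(beta, tau, minimal, rng):
    """Count failures on [0, tau] for a unit with Weibull hazard
    z(t) = beta * t**(beta - 1).  Perfect repair restarts the clock
    (a renewal process); minimal repair leaves it running (an NHPP
    with Lambda(t) = t**beta)."""
    n, t = 0, 0.0
    while True:
        if minimal:
            # For an NHPP, Lambda(T_{i+1}) - Lambda(T_i) ~ Exp(1),
            # so invert Lambda(t) = t**beta for the next event time.
            t = (t ** beta + rng.expovariate(1.0)) ** (1.0 / beta)
        else:
            # Perfect repair: a fresh Weibull inter-event time.
            t += rng.weibullvariate(1.0, beta)
        if t > tau:
            return n
        n += 1
```

For an increasing hazard (\(\beta > 1\)), minimal repair produces many more failures than perfect repair, since the system is never rejuvenated.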
The idea behind the trend-renewal process is to generalize the following well known
property of the NHPP. First let the cumulative intensity function corresponding to an intensity function \(\lambda(\cdot)\) be defined by \(\Lambda(t) = \int_0^t \lambda(u)\, du\). Then if \(T_1, T_2, \ldots\) is an NHPP(\(\lambda(\cdot)\)), the time-transformed stochastic process \(\Lambda(T_1), \Lambda(T_2), \ldots\) is HPP(1).
The trend-renewal process (TRP) is defined simply by allowing the above HPP(1) to be any renewal process RP(\(F\)). Thus, in addition to the intensity function \(\lambda(t)\), for a TRP we need to specify a distribution function \(F\) for the inter-arrival times of this renewal process. Formally we can define the process TRP(\(F, \lambda(\cdot)\)) as follows.

Let \(\lambda(t)\) be a nonnegative function defined for \(t \geq 0\), and let \(\Lambda(t) = \int_0^t \lambda(u)\, du\). The process \(T_1, T_2, \ldots\) is called TRP(\(F, \lambda(\cdot)\)) if the process \(\Lambda(T_1), \Lambda(T_2), \ldots\) is RP(\(F\)), that is, if the \(\Lambda(T_i) - \Lambda(T_{i-1})\), \(i = 1, 2, \ldots\), are i.i.d. with distribution function \(F\). The function \(\lambda(\cdot)\) is called the trend function, while \(F\) is called the renewal distribution. In order to have uniqueness of the model it is usually assumed that \(F\) has expected value 1.
Figure 10.2 illustrates the definition. For the cited property of the NHPP, the
lower axis would be an HPP with unit intensity, HPP(1). For the TRP, this process
is instead taken to be any renewal process, RP(F), where F has expectation 1. This
shows that the TRP includes the NHPP as a special case. Further, if \(\lambda(t) \equiv 1\), then \(\Lambda(T_i) = T_i\), and so \(T_1, T_2, \ldots\) is RP(\(F\)). For an NHPP(\(\lambda(\cdot)\)), the RP(\(F\)) would be HPP(1). Thus TRP(\(1 - e^{-x}, \lambda(\cdot)\)) = NHPP(\(\lambda(\cdot)\)). Also, TRP(\(F, 1\)) = RP(\(F\)), which shows that the TRP class includes both the RP and NHPP classes.
It can be shown (Lindqvist et al. 2003) that the conditional intensity function, given the history \(\mathcal{H}_t\), for the TRP(\(F, \lambda(\cdot)\)) is
\[
\gamma(t) = z(\Lambda(t) - \Lambda(T_{N(t-)}))\, \lambda(t), \tag{10.3}
\]
where \(z\) denotes the hazard rate of the renewal distribution \(F\). The likelihood function for a single system observed on \([0, \tau]\) is then
\[
L = \prod_{i=1}^{N(\tau)} \left\{ z[\Lambda(T_i) - \Lambda(T_{i-1})]\, \lambda(T_i) \right\} \exp\left\{ -\int_0^\tau z[\Lambda(u) - \Lambda(T_{N(u-)})]\, \lambda(u)\, du \right\}. \tag{10.4}
\]
For the NHPP(\(\lambda(\cdot)\)) we have \(z(t) \equiv 1\), so the likelihood simplifies to the well known expression (Crowder et al. 1991, p 166)
\[
L = \left\{ \prod_{i=1}^{N(\tau)} \lambda(T_i) \right\} \exp\left\{ -\int_0^\tau \lambda(u)\, du \right\}.
\]
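For the power-law special case \(\lambda(t) = cbt^{b-1}\) (the Crow-AMSAA process), maximizing this NHPP likelihood gives well known closed-form estimates under time-truncated observation on \((0, \tau]\): \(\hat{b} = N(\tau)/\sum_i \ln(\tau/T_i)\) and \(\hat{c} = N(\tau)/\tau^{\hat{b}}\). The function below is a sketch (name and data illustrative, not from the chapter):

```python
import math

def powerlaw_nhpp_mle(times, tau):
    """Closed-form MLEs for an NHPP with intensity lambda(t) = c*b*t**(b-1)
    observed on (0, tau] (the power-law / Crow-AMSAA process)."""
    n = len(times)
    b = n / sum(math.log(tau / t) for t in times)  # shape estimate
    c = n / tau ** b                               # scale estimate
    return c, b
```

A quick check on the estimates is the identity \(\hat{c}\,\tau^{\hat{b}} = N(\tau)\): the fitted cumulative intensity at \(\tau\) reproduces the observed number of failures.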
Alternatively, expressing the likelihood in terms of the renewal density \(f\) and distribution \(F\),
\[
L = \left\{ \prod_{i=1}^{N(\tau)} f[\Lambda(T_i) - \Lambda(T_{i-1})]\, \lambda(T_i) \right\} \left\{ 1 - F[\Lambda(\tau) - \Lambda(T_{N(\tau)})] \right\}. \tag{10.5}
\]
This latter form of the likelihood of the TRP follows directly from the definition, since the conditional density of \(T_i\) given \(T_1 = t_1, \ldots, T_{i-1} = t_{i-1}\) is \(f[\Lambda(t_i) - \Lambda(t_{i-1})]\, \lambda(t_i)\), and the probability of no failures in the time interval \((T_{N(\tau)}, \tau]\), given \(T_1, \ldots, T_{N(\tau)}\), is \(1 - F[\Lambda(\tau) - \Lambda(T_{N(\tau)})]\).
This again simplifies if \(\lambda(t) \equiv 1\), in which case it gives the likelihood of an RP(\(F\)) observed on \([0, \tau]\).
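Equation 10.5 is straightforward to evaluate numerically. The sketch below is illustrative (not from the chapter) and assumes a Weibull renewal distribution with shape \(s\), scaled to mean 1, and a power-law trend \(\Lambda(t) = ct^b\); with \(s = 1\) it reduces to the NHPP log-likelihood above.

```python
import math

def trp_loglik(times, tau, c, b, s):
    """Log of the TRP likelihood (Equation 10.5) with Lambda(t) = c*t**b
    and F Weibull with shape s, scaled so that E[F] = 1."""
    alpha = 1.0 / math.gamma(1.0 + 1.0 / s)
    Lam = lambda t: c * t ** b
    lam = lambda t: c * b * t ** (b - 1.0)
    def log_f(x):
        # log of the Weibull(alpha, s) density on the transformed axis
        return (math.log(s / alpha) + (s - 1.0) * math.log(x / alpha)
                - (x / alpha) ** s)
    ll, prev = 0.0, 0.0
    for t in times:
        gap = Lam(t) - prev          # Lambda(T_i) - Lambda(T_{i-1})
        ll += log_f(gap) + math.log(lam(t))
        prev = Lam(t)
    ll -= ((Lam(tau) - prev) / alpha) ** s   # log survivor of the last gap
    return ll
```

In practice such a function would be handed to a numerical optimizer to fit \((c, b, s)\) by maximum likelihood.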
Suppose that \(m\) systems of the same kind are observed, where the \(j\)-th system (\(j = 1, 2, \ldots, m\)) is observed in the time interval \([0, \tau_j]\). For the \(j\)-th system, let \(N_j\) denote the number of failures that occur during the observation period, and let the specific failure times be denoted \(T_{1j} < T_{2j} < \cdots < T_{N_j j}\). Figure 10.3 illustrates the notation and explains the information given in a so-called event plot, which is provided by computer packages for the analysis of this kind of data (see the examples below).
Example 2 Bhattacharjee et al. (2003) presented failure data for motor operated
closing valves in safety systems at two boiling water reactor plants in Finland.
Failures of the type external leakage were considered for 104 valves with a
follow-up time of nine years. An event plot for the 16 valves which experienced at
least on failure, is given in Figure 10.5. The remaining 88 valves had no failures.
Figure 10.3. Observation of failure times of \(m\) systems. The \(j\)-th system is observed over the time interval \([0, \tau_j]\), with \(N_j \geq 0\) observed failures
Figure 10.4. Event plot for times of valve seat replacements for 41 diesel engines, taken
from Nelson (1995)
When data are available for m systems as described above, one will typically
assume that the systems behave independently but with the same probability laws (i.i.d.). The total likelihood for the data will then be the product of the
likelihoods at Equations 10.4 or 10.5, one factor for each of the m systems.
Figure 10.5. Event plot for times of external leakage from nuclear plant valves, taken from
Bhattacharjee et al. (2003). In addition, 88 valves had no failures in 3286 days (9 years)
However, even if the \(m\) systems are considered to be of the same type, they may well exhibit different probabilistic failure mechanisms. For example, systems may be used under varying environmental or operational conditions.
To cover such cases we shall assume that failures of the \(j\)-th system follow the process TRP(\(F, \lambda_j(\cdot)\)), \(j = 1, \ldots, m\), where the renewal distribution \(F\) is fixed and differences between systems are modeled by letting the trend functions \(\lambda_j(t)\) vary from system to system. The assumption of a fixed \(F\) parallels the NHPP case, where \(F\) is the unit exponential distribution.

Assuming that the systems work independently of each other, we obtain from Equation 10.5 the full likelihood \(L = \prod_{j=1}^{m} L_j\), where \(L_j\) is the likelihood contribution of the \(j\)-th system. If observed covariates \(x_j\) are available, differences between systems may be modeled through a regression function \(g\),
\[
\lambda_j(t) = g(x_j)\, \lambda(t), \quad j = 1, \ldots, m. \tag{10.7}
\]
Alternatively, unobserved heterogeneity between systems may be modeled by unobservable i.i.d. positive random variables \(a_j\) with distribution \(H\),
\[
\lambda_j(t) = a_j\, \lambda(t). \tag{10.8}
\]
Conditional on \(a_j\), the likelihood contribution of the \(j\)-th system is \(L_j(a_j)\), computed from Equation 10.5 with trend function \(a_j \lambda(\cdot)\). However, since the \(a_j\) are unobservable, we need to take the expectation with respect to the \(a_j\), giving
\[
L_j = E[L_j(a_j)] = \int L_j(a_j)\, dH(a_j)
\]
as the contribution to the likelihood from the j-th system. The total likelihood is
then the product
\[
L = \prod_{j=1}^{m} L_j. \tag{10.9}
\]
We shall use the notation HTRP(\(F, \lambda(\cdot), H\)) for the model with the likelihood given at Equation 10.9. Here the renewal distribution \(F\) and the heterogeneity distribution \(H\) are distributions corresponding to positive random variables with expected value 1, while the basic trend function \(\lambda(t)\) is a positive function defined for \(t \geq 0\).
A useful feature of the HTRP model is that several important models for repair-
able systems are easily represented as submodels. With the notation HPP, NHPP,
RP and TRP used as before, we define corresponding models with heterogeneity as
at Equation 10.8 by putting an H in front of the abbreviations. Specifically, from
a full model, HTRP ( F , (), H ) , we can identify the seven submodels described in
Table 10.1.
Table 10.1. The seven submodels of HTRP(\(F, \lambda(\cdot), H\)). Here \(\exp\) means the unit exponential distribution, and \(1\) means the distribution degenerate at 1. The third column contains references to work on the corresponding models or special cases of them.

Submodel | HTRP formulation | References
The HTRP and the seven submodels may also be represented in a cube, as
illustrated in Figures 10.6 and 10.7. Each vertex of the cube represents a model,
and the lines connecting them correspond to changing one of the three coordi-
nates in the HTRP notation. Going to the right corresponds to introducing a time
trend, going upwards corresponds to entering a non-Poisson case, and going back-
wards (inwards) corresponds to introducing heterogeneity. In analyzing data by
parametric HTRP models we shall see below how we use the cube to facilitate the
presentation of maximum log-likelihood values for the different models in a convenient, visual manner. The log-likelihood cube was introduced in Lindqvist et al.
(2003).
Example 1 (continued) Figure 10.6 shows the log-likelihood cube of the valve-seat
data. It should be noted that each arrow points in a direction where exactly one
parameter is added (see the caption of Figure 10.6 for definitions of the parameters). Using
standard asymptotic likelihood theory we know that if this parameter has no
influence in the model, then twice the difference in log likelihood is approximately
chi-square distributed with one degree of freedom. For example, if twice the
difference is larger than 3.84, then the p-value of no significant difference is less
than 5% and we have an indication that the extra parameter in fact has some
relevance. Note that adding an extra parameter will always lead to a larger value of
the maximum log likelihood, but from what we just argued, the difference needs to
be more than, say, 3.84 / 2 = 1.92 to be of real interest.
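This comparison amounts to a likelihood-ratio test between nested models differing by one parameter. As a sketch (the function is illustrative, not from the chapter), the approximate p-value follows from the chi-square tail with one degree of freedom, which has the closed form \(\operatorname{erfc}(\sqrt{x/2})\):

```python
import math

def lrt_pvalue(loglik_small, loglik_big):
    """Approximate p-value for adding one parameter to a nested model:
    twice the gain in log-likelihood is referred to a chi-square(1)
    distribution."""
    stat = 2.0 * (loglik_big - loglik_small)
    if stat <= 0:
        return 1.0
    # chi-square(1) upper tail: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(stat / 2.0))
```

With the NHPP and TRP log-likelihoods quoted later in this example for the valve-seat data, `lrt_pvalue(-346.49, -343.66)` gives roughly 0.017; a gain of exactly 1.92 gives a p-value of about 0.05, which is where the 3.84 threshold comes from.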
Figure 10.6. The log-likelihood cube for the valve seat data of Nelson (1995), fitted with a parametric HTRP(\(F, \lambda(\cdot), H\)) model and its sub-models. Here \(F\) is a Weibull distribution with expected value 1 and shape parameter \(s\), \(\lambda(t) = cbt^{b-1}\) is a power function of \(t\), and \(H\) is a gamma distribution with expected value 1 and variance \(v\). The maximum value of the log-likelihood is denoted \(l\)
Looking at the valve-seat data cube in Figure 10.6 we note first that in going from a vertex of the front face to the corresponding vertex of the back face (adding "H" in front of the model acronym) there is never much to gain (1.17 at most, from HPP to HHPP). This indicates no apparent heterogeneity between the various engines. By comparing the left and right faces we conclude, however, that there seems to be a gain in including a time trend. Having already excluded heterogeneity we are thus faced with the possibilities of either NHPP or TRP. Here the latter model wins, since the difference in log-likelihood is as large as \((-343.66) - (-346.49) = 2.83\), and twice the difference equals 5.66, corresponding to an approximate p-value of 0.017.
The resulting estimated TRP is seen to have a renewal distribution which is
Weibull with shape parameter 0.6806 which implies a decreasing failure rate. This
means that the conditional intensity function will jump upward at each failure,
which may be explained by burn-in problems at each valve-seat replacement.
Further, there is an estimated time trend of the form \(\hat{\lambda}(t) = 3.26 \times 10^{-6} \times 1.929\, t^{0.929} = 6.29 \times 10^{-6}\, t^{0.929}\), which increases with \(t\), so that replacements become more and more frequent.
The expected numbers of failures in 3286 days are hence 0.125 and 4.99, respectively, for the good and the bad valves.
Consider again the situation illustrated in Figure 10.1, where the sojourns \(X_1, X_2, \ldots\) are times to failure of a system which is repaired immediately before the start of each sojourn. In the present section we consider the case when the failure which we expect at the end of the sojourn \(X_i\) may be avoided by a preventive maintenance (PM) after a time \(Z_i\) in the sojourn. The experienced sojourn time will in this case be \(Y_i = \min(X_i, Z_i)\), and it will result in either a failure or a PM according to whether \(Y_i = X_i\) or \(Y_i = Z_i\). We thus have a competing risks situation with two risks, corresponding to failure and PM.
246 B. Lindqvist
Figure 10.7. The log-likelihood cube for the data of Bhattacharjee et al. (2003) concerning failures of motor operated closing valves in nuclear reactor plants in Finland, fitted with a parametric HTRP(F, λ(·), H) model and its sub-models. Here F is a Weibull distribution with expected value 1 and shape parameter s, λ(t) = cbt^(b−1) is a power function of t, and H is a two-point distribution with unit expectation, giving probability p for the value "low" and 1 − p for the value "high". The maximum value of the log-likelihood is denoted by l̂.
Doyen and Gaudoin (2006) recently presented a point process approach for modeling such competing risks situations between failure and PM. A general setup for this kind of process is furthermore suggested in the review paper by Lindqvist (2006).
For simplicity we shall in this chapter consider only the case where the
component or system is perfectly repaired or maintained at the end of each sojourn.
This will lead to the observation of independent copies of the competing risks
situation in the same way as for a renewal process. We will therefore in the following consider only a single sojourn and hence suppress the subscripts of the observed times. Thus we let X and Z be, respectively, the potential time to failure and the potential time to PM of a single sojourn. Then Y = min(X, Z) is the observed sojourn, and in addition we observe the indicator variable δ, which we define to be 1 if there is a PM (Y = Z) and 0 if there is a failure (Y = X). This situation has been extensively studied by Cooke (1993, 1996), Bedford and Cooke (2001), Langseth and Lindqvist (2003, 2006), Lindqvist et al. (2006) and Lindqvist and Langseth (2005).
Thus note that the observable result is the pair (Y, δ), rather than the underlying times X and Z, which may often be the times of interest. For example, knowing
Maintenance of Repairable Systems 247
Cooke (1993, 1996) suggested that the competing risks situation between failure and PM will often satisfy what he called the random signs censoring property. The important features of random signs censoring are that the marginal distribution of X is always identifiable, and that an indication of the validity of this type of censoring can be found by plotting the data.
A lifetime Z is said to be a random signs censoring of X if the event {Z < X }
is stochastically independent of X , i.e. if the event of having a PM before failure is
not influenced by the time X at which the system fails or would have failed
without PM. The idea is that the system emits some kind of signal before failure,
and that this signal is discovered with a probability which does not depend on the
age of the system.
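This mechanism is easy to simulate. The sketch below is our own illustrative construction, not taken from the chapter: X is Weibull, the pre-failure signal is detected with a fixed probability q independently of X, and a detected signal triggers a PM at a uniformly chosen time Z before the failure; if no signal is detected, no PM occurs before the failure. The failure times on the event {Z < X} then have the same distribution as X itself:

```python
# Monte Carlo illustration of random signs censoring (hypothetical example).
import random

random.seed(1)
q = 0.6        # probability that the pre-failure signal is detected
n = 200_000

x_all, x_given_pm = [], []
for _ in range(n):
    x = random.weibullvariate(1.0, 1.5)  # potential failure time X
    if random.random() < q:              # signal detected, independent of X
        z = random.random() * x          # PM strictly before the failure
        x_given_pm.append(x)             # record X on the event {Z < X}
    x_all.append(x)

mean_all = sum(x_all) / len(x_all)
mean_pm = sum(x_given_pm) / len(x_given_pm)
print(f"E(X) = {mean_all:.3f},  E(X | Z < X) = {mean_pm:.3f}")
```

The two sample means agree up to Monte Carlo error, reflecting that {Z < X} carries no information about X.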
We now introduce some notation. Below we assume without further mention that X, Z are positive and continuous random variables with P(X = Z) = 0. We let F_X(t) = P(X ≤ t) and F_Z(t) = P(Z ≤ t) be the cumulative distribution functions of X and Z, respectively. The subdistribution functions of X and Z are defined as, respectively, F*_X(t) = P(X ≤ t, X < Z) and F*_Z(t) = P(Z ≤ t, Z < X). Note that the functions F*_X and F*_Z are nondecreasing with F*_X(0) = 0 and F*_Z(0) = 0. Moreover, we have F*_X(∞) + F*_Z(∞) = 1. The corresponding conditional distribution functions are denoted F̃_X(t) = P(X ≤ t | X < Z) and F̃_Z(t) = P(Z ≤ t | Z < X).

Under random signs censoring, the independence of X and the event {Z < X} immediately gives

F̃_X(t) = P(X ≤ t | X < Z) = P(X ≤ t) = F_X(t) (10.10)

Further, since Z < X on the event of a PM, the conditional distribution functions under random signs censoring satisfy

F̃_X(t) < F̃_Z(t) for all t > 0. (10.11)
Moreover, he showed a converse statement: whenever Equation 10.11 holds, there exists a joint distribution of (X, Z) satisfying the requirements of random signs censoring and giving the same subdistribution functions. On the other hand, if F̃_X(t) ≥ F̃_Z(t) for some t, then there is no joint distribution of (X, Z) for which the random signs requirement holds. For more discussion on random signs censoring and its applications we refer to Cooke (1993, 1996) and Bedford and Cooke (2001, Chapter 9). One idea is to estimate the functions F̃_X(t) and F̃_Z(t) from data to check whether Equation 10.11 may possibly hold, and when this is the case to suggest a model that satisfies the random signs property.
Lindqvist et al. (2006) introduced the so-called repair alert model, which extends the idea of random signs censoring by defining an additional repair alert function which describes the alertness of the maintenance crew as a function of time. The definition can be given as follows:

The pair (X, Z) of life variables satisfies the requirements of the repair alert model provided the following two conditions both hold:

(i) the event {Z < X} is stochastically independent of X (random signs censoring);
(ii) there exists an increasing function G with G(0) = 0 such that for all x > 0,

P(Z ≤ z | Z < X, X = x) = G(z)/G(x), 0 < z ≤ x.

The function G is called the cumulative repair alert function. Its derivative g (when it exists) is called the repair alert function. The repair alert model is hence a specialization of random signs censoring, obtained by introducing the repair alert function G.
Part (ii) of the above definition means that, given that there would be a failure at time X = x, and given that the maintenance crew will perform a PM before that time (i.e. given that Z < X), the conditional density of the time Z of this PM is proportional to the repair alert function g.
Lindqvist et al. (2006) showed that whenever Equation 10.11 holds there is a unique repair alert model giving the same subdistribution functions. Thus, restricting to repair alert models, we are able to strengthen the corresponding result for random signs censoring, which does not guarantee uniqueness.

The repair alert function is meant to reflect the reaction of the maintenance crew. More precisely, g(t) ought to be high at times t for which failures are expected and the alert therefore should be high. Langseth and Lindqvist (2003) simply put g(t) = λ(t), where λ(t) is the failure rate of the marginal distribution of X. This choice of g(t) of course simplifies analyses since it reduces the number of parameters, but at the same time it seems fairly reasonable given a competent maintenance crew. In a subsequent paper, Langseth and Lindqvist (2006) present ways to test whether g(t) can be assumed equal to the hazard function λ(t).
It follows from the construction in Lindqvist et al. (2006) that the repair alert model is completely determined by the marginal distribution function F_X of X, the cumulative repair alert function G, the probability q = P(Z < X), and the assumption that X is independent of the event {Z < X} (i.e. random signs censoring). Thus, given statistical data, the inference problem consists of estimating F_X(t) (possibly in parametric form), the repair alert function g (or G), and the probability q of PM. We refer to Lindqvist et al. (2006) and Lindqvist and Langseth (2005) for details on such statistical inference.
The following is a simple example of a repair alert model. Suppose that, given X = x, a PM occurs before the failure with probability q, in which case the PM time Z is uniformly distributed on (0, x); thus P(Z ≤ z | X = x) = (q/x)z for 0 < z ≤ x. Since P(Z < X | X = x) = q does not depend on x, the random signs requirement holds, and for 0 < z ≤ x,

P(Z ≤ z | Z < X, X = x) = P(Z ≤ z, Z < X | X = x) / P(Z < X | X = x)
  = P(Z ≤ z | X = x) / q
  = ((q/x)z) / q = z/x.

This corresponds to the cumulative repair alert function G(t) = t, i.e. a constant repair alert function.
The following formula (taken from Lindqvist et al. 2006) shows in particular why Equation 10.11 holds under the repair alert model:

F̃_Z(t) = F_X(t) + G(t) ∫_t^∞ [f_X(y)/G(y)] dy. (10.12)

Note that for random signs, and hence for the repair alert model, we have F̃_X(t) = F_X(t), so the positive integral term in Equation 10.12 gives F̃_Z(t) > F̃_X(t).
We next discuss some implications of the repair alert model, in particular how the parameters q and G influence the observed pattern of PMs and failures. In order to help intuition, we sometimes consider the power version G(t) = t^β, where β > 0 is a parameter. Then g(t) = βt^(β−1), so β = 1 means a constant repair alert function, while β < 1 and β > 1 correspond to, respectively, a decreasing and an increasing repair alert function.
Under the random signs assumption, the parameter q = P(Z < X) is connected to the ability to discover signals regarding a possibly approaching failure. More precisely, q is understood as the probability that a failure is avoided by a preceding PM.

Given that there will be a PM, one should ideally have the time of PM immediately before the failure. It is seen that this issue is connected to the function G. For example, large values of β will correspond to distributions with most of their mass near x.
Moreover, it follows from Equation 10.12 that

E(Z | Z < X) = ∫_0^∞ (1 − F̃_Z(z)) dz = E(X) − E[M(X)/G(X)],

where M(x) = ∫_0^x G(t) dt. For the special case when G(t) = t^β, we obtain the simple result

E(Z | Z < X) = (β/(β+1)) E(X). (10.13)
Similar calculations give the distribution and expectation of the observed sojourn time Y = min(X, Z):

P(Y ≤ t) = F_X(t) + qG(t) ∫_t^∞ [f_X(y)/G(y)] dy

and

E(Y) = E(X) − qE[M(X)/G(X)], where M(x) = ∫_0^x G(t) dt.

Furthermore, if G(t) = t^β, then

E(Y) = E(X)(1 − q/(β+1)). (10.14)
Suppose now that each PM has cost C_PM and each failure has cost C_F, where C_PM < C_F. With perfect repair or maintenance at the end of each sojourn, the long-run expected cost per unit time is, by the renewal reward theorem,

[qC_PM + (1 − q)C_F] / [E(X)(1 − q/(β+1))],

where we used Equation 10.14. This is a decreasing function of β, which seems reasonable. On the other hand, it is a decreasing function of q provided β > C_PM/(C_F − C_PM). This last inequality is likely to hold in many practical cases since the right-hand side will usually be much less than 1, while β should for a competent maintenance crew be larger than 1. Thus a high value of q is usually preferable.
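The dependence of this cost rate on q and β can be sketched numerically; the cost values below are illustrative assumptions, and the threshold C_PM/(C_F − C_PM) is the one stated above:

```python
# Cost rate from Equation 10.14:
# rate(q, beta) = (q*C_PM + (1-q)*C_F) / (E(X) * (1 - q/(beta+1))).
def cost_rate(q, beta, c_pm=1.0, c_f=10.0, mean_x=1.0):
    return (q * c_pm + (1 - q) * c_f) / (mean_x * (1 - q / (beta + 1)))

c_pm, c_f = 1.0, 10.0
threshold = c_pm / (c_f - c_pm)  # = 1/9 with these illustrative costs

# beta above the threshold: raising q (more PMs) lowers the cost rate
print(cost_rate(0.2, 2.0), ">", cost_rate(0.8, 2.0))
# beta below the threshold: raising q now increases the cost rate
print(cost_rate(0.2, 0.05), "<", cost_rate(0.8, 0.05))
```

This reproduces the qualitative conclusion above: a high PM probability q pays off exactly when the repair alert is sharp enough (β above the threshold).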
Suppose that the system is maintained preventively at the times τ, 2τ, 3τ, …, called PM epochs. Here τ > 0 is the length of what we shall call the PM interval.
Let X(t) ∈ S denote the state of the system at time t, where the set S of possible states is finite. It is assumed that X(t) behaves like a time-homogeneous Markov chain as long as time runs inside PM intervals, i.e. inside the time intervals nτ ≤ t < (n+1)τ for n = 0, 1, …. This Markov chain is governed by an infinitesimal intensity matrix A, where the entry a_jk of A for j ≠ k is the transition intensity from state j to state k; see for example Taylor and Karlin (1984, p. 254). An example of an intensity matrix A is given by Equation 10.15, an illustration of which is provided by the state diagram in Figure 10.9. Let

P(t) = (P_jk(t); j, k ∈ S)

denote the transition probabilities for the Markov chain governed by A, and let

Y_n = X(nτ−) = lim_{t↑nτ} X(t),
which is the state of the system immediately before the nth PM epoch. The effect of PM at time nτ is to change the state of the system from Y_n to Z_n according to a transition matrix R = (R_jk), where

P(Z_n = k | Y_n = j) = R_jk; j, k ∈ S.

The model description is completed by defining the initial state of the Markov chain X(t) running inside the PM interval [nτ, (n+1)τ) to be X(nτ) = Z_n (n = 0, 1, …), where Z_0 is the initial state of the system, usually the perfect state in S. It is furthermore assumed that the Markov chain X(t) on [nτ, (n+1)τ), given its initial state Z_n, is independent of all transitions occurring before time nτ.
Let the distribution of Z_0 = X(0) be denoted ν = (ν_j; j ∈ S), where ν_j = P(Z_0 = j). Then for any k ∈ S,

P(Y_1 = k) = P(X(τ−) = k)
  = Σ_{j∈S} P(X(τ−) = k | X(0) = j) P(X(0) = j)
  = Σ_{j∈S} ν_j P_jk(τ) = [νP(τ)]_k.

More generally, by conditioning on Z_n,

P(Y_{n+1} = k | Y_n = j) = Σ_{ℓ∈S} P(Y_{n+1} = k | Z_n = ℓ, Y_n = j) P(Z_n = ℓ | Y_n = j)
  = Σ_{ℓ∈S} P_ℓk(τ) R_jℓ = [RP(τ)]_jk.

Thus {Y_n} is a Markov chain with transition matrix

Q = RP(τ).
Similarly,

P(Z_{n+1} = k | Z_n = j) = Σ_{ℓ∈S} P(Z_{n+1} = k | Y_{n+1} = ℓ, Z_n = j) P(Y_{n+1} = ℓ | Z_n = j)
  = Σ_{ℓ∈S} P_jℓ(τ) R_ℓk = [P(τ)R]_jk,

so that {Z_n} is a Markov chain with transition matrix

T = P(τ)R.
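The construction of the two embedded chains can be sketched numerically: P(τ) is a matrix exponential, and Q = RP(τ) and T = P(τ)R follow by matrix multiplication. The intensity matrix below is the four-state example of Equation 10.15; the repair matrix corresponds to repairing every failure at PM epochs, and the numerical rates and τ are illustrative assumptions:

```python
# Transition matrices of the embedded chains {Y_n} and {Z_n}.
import numpy as np
from scipy.linalg import expm

lam_d, lam_k, lam_dk = 0.5, 0.05, 0.3  # illustrative transition rates
tau = 1.0                              # assumed PM interval length

A = np.array([[-(lam_d + lam_k), lam_d, lam_k, 0.0],
              [0.0, -(lam_k + lam_dk), lam_k, lam_dk],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])

R = np.array([[1.0, 0.0, 0.0, 0.0],    # every PM returns the system to state O
              [1.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0]])

P_tau = expm(A * tau)                  # P(tau) = exp(A * tau)
Q = R @ P_tau                          # transition matrix of Y_n
T = P_tau @ R                          # transition matrix of Z_n

print(np.round(Q, 4))
print(np.round(T, 4))
```

Both Q and T are stochastic matrices (rows sum to one); with this perfect-repair R, every row of Q equals the first row of P(τ), and T moves all mass back to state O.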
Let π = (π_j; j ∈ S) denote the stationary distribution of the chain {Y_n}, determined by

πQ = πRP(τ) = π.

Let G ⊂ S denote the set of critical failure states and let π_G = Σ_{j∈G} π_j. The mean time between critical failures and the corresponding critical failure rate are then

MTBF_crit = τ/π_G,
λ_crit = 1/MTBF_crit = π_G/τ.

Further, for the nth PM interval define

U_n = (1/τ) ∫_{nτ}^{(n+1)τ} P(X(t) ∈ G) dt.
By conditioning on the state at the beginning of the interval this can be written

U_n = (1/τ) Σ_{j∈S} ∫_0^τ P_jG(t) P(Z_n = j) dt,

where P_jG(t) = Σ_{k∈G} P_jk(t). Let π̃ = (π̃_j; j ∈ S) denote the stationary distribution of the chain {Z_n}, determined by

π̃T = π̃P(τ)R = π̃.
Following Hokstad and Frøvig (1996) we shall define the critical safety unavailability (CSU) of the system by

CSU = lim_{n→∞} U_n = (1/τ) Σ_{j∈S} ∫_0^τ P_jG(t) π̃_j dt = Σ_{j∈S} π̃_j Q_j,

where

Q_j = (1/τ) ∫_0^τ P_jG(t) dt

is the critical safety unavailability given that the system state is j at the beginning of the PM interval.
As an example, consider the failure mechanism model of Hokstad and Frøvig (1996), with state space

S = {O, D, K_I, K_II},

where O = the system is as good as new, D = the system has a failure classified as degraded (noncritical), K_I = the system has a failure classified as critical, caused by a sudden shock, and K_II = the system has a failure classified as critical, caused by the degradation process.
It is assumed that the Markov chain X(t) is defined by the state diagram of Figure 10.9, and thus has infinitesimal transition matrix

      [ −(λ_d + λ_k)   λ_d             λ_k   0    ]
A =   [ 0              −(λ_k + λ_dk)   λ_k   λ_dk ]   (10.15)
      [ 0              0               0     0    ]
      [ 0              0               0     0    ]
Figure 10.9. State diagram for the failure mechanism of Hokstad and Frøvig (1996)
The model assumes that no repairs are done in the time intervals between PM epochs. Moreover, since A is upper triangular, we can obtain P(t) = e^(tA) rather easily. It is clear that P(t) can be written

       [ P_OO(t)   P_OD(t)   P_OK_I(t)   P_OK_II(t) ]
P(t) = [ 0         P_DD(t)   P_DK_I(t)   P_DK_II(t) ]
       [ 0         0         1           0          ]
       [ 0         0         0           1          ]

where expressions for the entries are found in Lindqvist and Amundrustad (1998).
In practice it is of interest to quantify the effect of various forms of preventive
maintenance. This can be done in the presented framework by means of the repair
matrix R . Some examples are given below.
If all failures are repaired at PM epochs, then the PM always returns the system to state O, and we have

    [ 1  0  0  0 ]
R = [ 1  0  0  0 ]
    [ 1  0  0  0 ]
    [ 1  0  0  0 ]
Next, if only critical failures are repaired at PM epochs, then the appropriate R matrix is

    [ 1  0  0  0 ]
R = [ 0  1  0  0 ]
    [ 1  0  0  0 ]
    [ 1  0  0  0 ]
More generally one may consider an extension of this by assuming that all critical failures are repaired, while degraded failures are repaired with probability 1 − r and remain unrepaired with probability r, 0 ≤ r ≤ 1. The repair strategy is thus determined by the parameter r. This clearly leads to the matrix

    [ 1      0  0  0 ]
R = [ 1−r    r  0  0 ]
    [ 1      0  0  0 ]
    [ 1      0  0  0 ]
Finally, repairs of critical failures may themselves be unsuccessful. If a K_I failure remains unrepaired with probability r_k1 and a K_II failure remains unrepaired with probability r_k2, the repair matrix becomes

    [ 1       0  0     0    ]
R = [ 1−r     r  0     0    ]
    [ 1−r_k1  0  r_k1  0    ]
    [ 1−r_k2  0  0     r_k2 ]

Here r has the same meaning as before, while 1 − r_k1 is the probability of successful repair of a K_I failure and 1 − r_k2 is the similar probability for K_II.
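Putting the pieces together, the CSU can be computed numerically: P(τ) by a matrix exponential, the stationary distribution of T = P(τ)R, and Q_j by numerical integration of P_jG(t) over the PM interval with G = {K_I, K_II}. All rates, τ and the repair probabilities below are illustrative assumptions:

```python
# End-to-end sketch of the critical safety unavailability (CSU) for the
# Hokstad and Frøvig type model with possibly unsuccessful critical repairs.
import numpy as np
from scipy.linalg import expm

lam_d, lam_k, lam_dk = 0.5, 0.05, 0.3    # illustrative rates
tau, r, r_k1, r_k2 = 1.0, 0.2, 0.1, 0.1  # illustrative PM and repair parameters

A = np.array([[-(lam_d + lam_k), lam_d, lam_k, 0.0],
              [0.0, -(lam_k + lam_dk), lam_k, lam_dk],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
R = np.array([[1.0, 0.0, 0.0, 0.0],
              [1 - r, r, 0.0, 0.0],
              [1 - r_k1, 0.0, r_k1, 0.0],
              [1 - r_k2, 0.0, 0.0, r_k2]])

T = expm(A * tau) @ R                    # transition matrix of {Z_n}

# stationary distribution of {Z_n}: left eigenvector of T for eigenvalue 1
w, v = np.linalg.eig(T.T)
pi_t = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi_t = pi_t / pi_t.sum()

# Q_j = (1/tau) * integral_0^tau P_jG(t) dt, G = {K_I, K_II} (columns 2, 3),
# computed by the trapezoidal rule on a fine grid
grid = np.linspace(0.0, tau, 201)
p_jg = np.array([expm(A * t)[:, 2:].sum(axis=1) for t in grid])
dt = grid[1] - grid[0]
q_j = (0.5 * p_jg[0] + p_jg[1:-1].sum(axis=0) + 0.5 * p_jg[-1]) * dt / tau

csu = float(pi_t @ q_j)                  # CSU = sum_j pi~_j * Q_j
print(f"CSU = {csu:.4f}")
```

Varying r, r_k1, r_k2 and τ in this sketch shows how the repair strategy and the PM interval length trade off against the resulting safety unavailability.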
checking (see for example the consideration of maximum log likelihoods in the
examples of Section 10.2.5). Another way of extending the NHPP processes is via
the large class of imperfect repair models. The classical model is here the one
suggested by Brown and Proschan (1983) (see the review paper by Lindqvist 2006 for an introduction to the subsequent literature). Imperfect repair models combine two basic ingredients: a hazard rate z(t) of a new system together with a particular repair strategy which governs a so-called virtual age process. The idea is that the
virtual age of the system is reduced at repairs by a certain amount which depends
on the repair strategy. The extreme cases are the perfect repair (renewal) models
where the virtual age is set to 0 after each repair, and the minimal repair (NHPP)
models where the virtual age is not reduced at repairs and hence always equals the
actual age.
Second, we have put some emphasis on the consideration of possible hetero-
geneity between systems of the same kind. Recall our Example 2 based on data
from Bhattacharjee et al. (2003). The authors write in their conclusion: "The heterogeneity of failure behaviour of safety related components, such as valves in our case study, may have important implications for reliability analysis of safety systems. If such heterogeneity is not identified and taken into account, the decisions made to maintain or to enhance safety can be non-optimal or even erroneous. This non-optimality is more serious if the safety related decisions are made on the basis of failure histories of the components." Still it is believed that heterogeneity
has been neglected in many reliability applications. In fact, analyses of reliability
data will often lead to an apparent decreasing failure rate which is counterintuitive
in view of wear and ageing effects. Proschan (1963) pointed out that such observed
decreasing rates could be caused by unobserved heterogeneity. Proschan presented
failure data from 17 air conditioner systems on Boeing 720 airplanes, concluding
that an HPP model was appropriate for each plane, but that the rates differed from
plane to plane. This is a classical example of heterogeneity in reliability. If times
between failures had been treated as independent and identically distributed across
planes, the conclusion would have been that these times between failures had a
decreasing failure rate.
It has long been known in biostatistics that neglecting individual heterogeneity
may lead to severe bias in estimates of lifetime distributions. The idea is that
individuals have different frailties, and that those who are most frail will die or
fail earlier than the others. This in turn leads to a decreasing population hazard,
which has often been misinterpreted in the same manner as mentioned for the
reliability applications. Important references on heterogeneity in the biostatistics
literature are Vaupel et al. (1979), Hougaard (1984) and Aalen (1988). It should be
noted that heterogeneity is in general unidentifiable if being considered an indi-
vidual quantity. For identifiability it is necessary that frailty is common to several
individuals, for example in family studies in biostatistics, or if several events are
observed for each individual, such as for the repairable systems considered in this chapter. The presence of heterogeneity is often apparent for data from repairable
systems if there is a large variation in the number of events per system. However, it
is not really possible to distinguish between heterogeneity and dependence of the
intensity on past events for a single process.
The third point to be mentioned regards the use, or lack of use, of methods for
competing risks in reliability applications. The following is a citation from Crowder (2004), appearing in the article on competing risks in the Encyclopedia of Actuarial Science: "If something can fail, it can often fail in one of several ways and sometimes in more than one way at a time. In the real world, the cause, mode, or type of failure is usually just as important as the time to failure. It is therefore remarkable that in most of the published work to date in reliability and survival analysis there is no mention of competing risks. The situation hitherto might be referred to as a lost case." Fortunately, some work has been done recently in order to include competing risks in the study of repaired and maintained systems. Much of this work, partly reviewed in Section 10.3, has been motivated by the work of Cooke (1996) and his collaborators. His point of departure was formulated in the conclusion of Cooke (1996): "The main themes of Parts I and II of this article are that current RDB (Reliability Data Bank) designs: 1. are not giving RDB users what they need; 2. are not doing a good job of analyzing competing risk data; 3. are not doing a good job in handling uncertainty. Improvements in all these areas are possible. However, it must be acknowledged that the models and methods presented here merely scratch the surface. It is therefore appropriate to conclude with a summary of open issues..."
The final section of the present chapter considers an example of an approach
which in some sense generalizes the competing risks issue, namely using Markov
chains to model failure mechanisms of various equipment.
The chapter has mostly considered the modeling of repairable systems, with
less mention of statistical methods. It is believed that much of future research on
maintenance of repairable systems will still be centered around modeling, possibly
with an increased emphasis on point process models including multiple types of
events (see for example Doyen and Gaudoin 2006). More detailed models of the
underlying failure and maintenance mechanisms may indeed be of great value for
planning and optimization of maintenance actions. On the other hand, the new
advances in modeling certainly lead to considerable statistical challenges. This
point was touched on by Cooke (1996) as cited above, and it is clear that the in-
formation in reliability databases could and should be handled by more sophisti-
cated methods than the ones that are traditionally used. Here there is much to learn
from the biostatistics literature where there has for a long time been an emphasis
on nonparametric methods and on regression methods using covariate information.
10.6 References

Aalen OO (1988) Heterogeneity in survival analysis. Statistics in Medicine 7:1121–1137.
Andersen PK, Borgan Ø, Gill RD, Keiding N (1993) Statistical Models Based on Counting Processes. Springer, New York.
Ascher H, Feingold H (1984) Repairable Systems Reliability: Modeling, Inference, Misconceptions and Their Causes. Marcel Dekker, New York.
Bedford T, Cooke RM (2001) Probabilistic Risk Analysis: Foundations and Methods. Cambridge University Press, Cambridge.
11.1 Introduction
Over the last few decades the maintenance of systems has become more and more
complex. One reason for this is that systems consist of many components which
depend on each other. On the one hand, interactions between components compli-
cate the modelling and optimization of maintenance. On the other hand, interactions also offer the opportunity to group maintenance activities, which may save costs. It follows
that planning maintenance actions is a big challenge and it is not surprising that
many scholars have studied maintenance optimization problems for multi-compo-
nent systems. In some articles new solution methods for existing problems are
proposed, in other articles new maintenance policies for multi-component systems
are studied. Moreover, the number of papers with practical applications of optimal
maintenance of multi-component systems is still growing.
Cho and Parlar (1991) give the following definition of multi-component maintenance models: "Multi-component maintenance models are concerned with optimal maintenance policies for a system consisting of several units of machines or many pieces of equipment, which may or may not depend on each other (economically/stochastically/structurally)." So, in these models it is all about making an optimal maintenance plan for systems consisting of components that interact with each other. We will come back later to the concepts of optimality and interaction. For now it is important to remember that the condition of the system depends on (the state of) the components, which will only function if adequate maintenance actions are performed.
In this chapter we will give an up-to-date review of the literature on multi-
component maintenance optimization. Let us start with a brief summary of the
overview articles that have appeared in the past. Cho and Parlar (1991) review
articles from 1976 to 1991. The authors divide the literature into five topical
categories: machine-interference/repair models, group/block/cannibalization/oppor-
tunistic models, inventory/maintenance models, other maintenance/replacement
models and inspection/maintenance models. Dekker et al. (1996) deal exclusively
264 R. Nicolai and R. Dekker
Presenting a scientific review on a certain topic implies that one tries to discuss all
relevant articles. Finding these articles, however, is very difficult. It depends on the
search engines and databases used, electronic availability of articles and the search
strategy. We used Google Scholar, Scirus and Scopus as search engines, and ScienceDirect, JSTOR and MathSciNet as online databases. We primarily searched on key words, abstracts and titles, but we also searched within the papers for relevant references. Note that papers published in books or proceedings that are not electronically available are likely not to have been identified.
Terminology is another important issue, as the use of other terms can hide a
very interesting paper. The field has been delineated by maintenance, replacement
or inspection on one hand and optimization on the other. This combination, how-
ever, provides almost 5000 hits in Google Scholar.
Next, the term multi-component has been used in conjunction with related terms such as opportunistic maintenance (policies), piggyback(ing), joint replacement, joint overhaul, combining maintenance, grouping maintenance, economies of scale and economic dependence. With respect to the term stochastic dependence, we have also searched for synonyms and related terms such as failure interaction, probabilistic dependence and shock damage interaction. This yields approximately 500 hits. Relevant articles have been selected from this set by scanning the articles.
The vast literature on maintenance of multi-component systems has been
reviewed earlier by others. Therefore, we have also consulted existing reviews and
overview articles in this field. Moreover, we have applied a citation search
(looking both backwards and forwards in time for citations) to all articles found.
This citation search is an indirect search method, whereas the above methods are
direct methods. The advantage of this method is that one can easily distinguish
clusters of related articles.
Positive economic dependence implies that costs can be saved when several
components are jointly instead of separately maintained. Compared with the review
of Dekker et al. (1996) we refine the concept of (positive) economic dependence
and distinguish the following forms:
- Economies of scale
  - General
  - Single set-up
  - Multiple set-ups
    - Hierarchy of set-ups
- Downtime opportunity
The term economies of scale is often used to indicate that combining mainten-
ance activities is cheaper than performing maintenance on components separately.
The term economies of scale is very general and it seems to be similar to positive
economic dependence. In this chapter we will speak of economies of scale when the
maintenance cost per component decreases with the number of maintained com-
ponents. Economies of scale can result from preparatory or set-up activities that can
be shared when several components are maintained simultaneously. The cost of this
set-up work is often called the set-up cost. Set-up costs can be saved when main-
tenance activities on different components are executed simultaneously, since exe-
cution of a group of activities requires only one set-up.
In this overview we distinguish between single set-ups and multiple set-ups. In
the latter case there usually is a hierarchy of set-ups. For instance, consider a
system consisting of two components, which both consist of two subcomponents.
Maintenance of the subcomponents of the components may require a set-up at
system level and component level. First, this means that the set-up cost at com-
ponent level is paid only once when the maintenance of two subcomponents of a
component is combined. Second, the set-up cost at system level is paid only once
when all subcomponents are maintained at the same time. Set-up costs usually
appear in the objective function of the maintenance problem. If economies of scale are not explicitly modelled by including set-up costs in the objective function, then we classify the model in the category general.
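The effect of such a set-up hierarchy is easy to illustrate. The sketch below is our own toy cost calculation, not a model from any particular paper: a system-level set-up is paid once per maintenance occasion, and a component-level set-up is paid once per component that has at least one subcomponent maintained. All cost values are illustrative assumptions:

```python
# Toy illustration of hierarchical set-up costs in grouped maintenance.
def maintenance_cost(jobs, system_setup=100.0, component_setup=30.0,
                     job_cost=10.0):
    """jobs: set of (component, subcomponent) pairs maintained together."""
    if not jobs:
        return 0.0
    components = {comp for comp, _ in jobs}       # components touched
    return (system_setup                          # one system set-up
            + component_setup * len(components)   # one set-up per component
            + job_cost * len(jobs))               # work on each subcomponent

# Maintaining four subcomponents separately pays four full set-ups...
separate = sum(maintenance_cost({job}) for job in
               [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])
# ...while grouping them shares one system set-up and two component set-ups.
grouped = maintenance_cost({(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')})
print(separate, grouped)  # 560.0 200.0
```

The saving from grouping (here 560 versus 200) is exactly the repeated system- and component-level set-ups, which is the economic dependence exploited by the models in this category.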
Another form of positive dependence is the downtime opportunity. Component
failures can often be regarded as opportunities for preventive maintenance of non-
failed components. In a series system a component failure results in a non-operating
General
In comparison with Dekker et al. (1996) the category general economies of scale is
new. The papers in this category deal with multi-component systems for which joint
maintenance of components is cheaper than individual maintenance of components.
This form of economies of scale cannot be modelled by introducing a single set-up
cost. The cost associated with the maintenance of components is often concave in the
number of components that are maintained simultaneously.
Dekker et al. (1998a) evaluate a new maintenance concept for the preservation
of highways. In road maintenance cost savings can be realized by maintaining
larger sections instead of small patches. The road is divided into sectors of 100-m
length. Set-up costs are present in the form of the direct costs associated with the
maintenance of different parts of the road. The set-up cost is a function of the
number of these parts in a maintenance group. A heuristic search procedure is pro-
posed to find the optimal maintenance planning.
Papadakis and Kleindorfer (2005) introduce the concept of network topology
dependencies (NTD) for infrastructure networks. In these networks two types of
NTD can be distinguished: contiguity and set-up discounts. Both types define
positive economic dependence between components. In the former case savings are
realized when costs are paid once when contiguous sections are maintained at the
same time. In the latter case savings are realized when costs are paid once for a
neighbourhood of the infrastructure network, independently of how much work is
carried out on it. For both types of dependencies a non-linear discount function is
defined. The authors consider the problem of maintaining an infrastructure network.
It is modelled as an undirected network. Risk measures or failure probabilities for
the segments of this network are assumed to be known. A maximum flow minimum
cut formulation of the problem is developed. This formulation makes it easier to
solve the problem exactly and efficiently.
Single Set-up
Nearly all articles reviewed by Dekker et al. (1996) fall into this category. The
objective function of the maintenance optimization model usually consists of a
Optimal Maintenance of Multi-component Systems: A Review 269
fixed cost (the set-up cost) and variable costs. In the articles discussed below, this
will not be different.
Castanier et al. (2005) consider a two-component series system. Economic
dependence between the two components is present in the following way. The set-
up cost for inspecting or replacing a component is charged only once if the actions
on both components are combined. That is, joint maintenance of components saves
costs. In this article the condition of the components is modelled by a stochastic
process and it is monitored by non-periodic inspections. In the opportunistic
maintenance policy several thresholds are defined for doing inspections, corrective
and preventive replacements, and opportunistic maintenance. These thresholds are
decision variables. Many articles on this type of models have appeared, but most of
these articles only consider single component models.
The articles of Scarf and Deara (1998, 2003) consider both economic and
stochastic dependence between components in a series system. This combination is
scarce in the literature. Positive economic dependence is modelled on the basis that
the cost of replacement of one or more components includes a one-off set-up cost
whose magnitude does not depend on the number of components replaced. We will
discuss these articles in more detail in Section 11.4.
In one of the few case studies found in the literature, Van der Duyn Schouten et
al. (1998) investigate the problem of replacing light bulbs in traffic control signals.
Each installation consists of three compartments for the green, red, and yellow
lights. Maintenance of light bulbs means replacement, either correctively or
preventively. First, positive economic dependence is present in the form of set-up
cost, because each replacement action requires a fixed cost in the form of
transportation of manpower and equipment. Second, the failure of individual bulbs
is an opportunity for doing preventive maintenance on other bulbs. The authors
propose two types of maintenance policies. In the first policy, also known as the
standard indirect-grouping strategy (introduced in maintenance by Goyal and Kusy
1985; for a review of this strategy we refer to Dekker et al. 1996), corrective and
preventive replacements are strictly separated. Economies of scale can thus only be
achieved by combining preventive replacements of the bulbs. The authors also
propose the following opportunistic age-based grouping policy. Upon failure of a
light bulb, the failed bulbs and all other bulbs older than a certain age are replaced.
Budai et al. (2006) consider a preventive maintenance scheduling problem
(PMSP) for a railway system. In this problem (short) routine activities and (long)
unique projects for one track have to be scheduled in a certain period. To reduce
costs and inconvenience for the travellers and operators, these activities should be
scheduled together as much as possible. With respect to the latter, maintenance of
different components of one track simultaneously requires only one track possession.
Time is discretized and the PMSP is written as a mixed-integer linear programming
model. Positive dependence is taken into account by the objective function, which is
the sum of the total track possession cost and the maintenance cost over a finite
horizon. To reduce possible end-of-horizon effects an end-of-horizon valuation is
also incorporated in the objective function. Note that the possession cost can be seen
as a downtime cost. Since this cost is modelled as a fixed set-up cost, the article is
classified in this category. Besides this positive dependence there also exists
negative dependence between components, since some activities exclude each other.
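The grouping effect behind the PMSP can be illustrated on a toy scale (a deliberate simplification of ours, not Budai et al.'s mixed-integer model): each routine activity recurs at a fixed interval, and choosing the activities' start periods so that their work coincides minimises the number of costly track possessions.

```python
from itertools import product

def best_offsets(intervals, horizon, possession_cost=1.0):
    """Toy preventive maintenance scheduling problem: activity j recurs
    every intervals[j] periods; choosing the first period (offset) of
    each activity, minimise the number of track possessions, i.e. the
    periods in which at least one activity is scheduled (each
    possession carries a fixed cost)."""
    best = (float("inf"), None)
    for offs in product(*(range(f) for f in intervals)):
        busy = set()                     # periods in which work takes place
        for off, f in zip(offs, intervals):
            busy.update(range(off, horizon, f))
        best = min(best, (possession_cost * len(busy), offs))
    return best

# e.g. activities with cycles 2 and 4: aligning them (offsets 0 and 0)
# confines all work to periods {0, 2, 4, ...}, halving the possessions
# of the slow activity.
```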
270 R. Nicolai and R. Dekker
Multiple Set-ups
This is also a new category. The maintenance of different components may require
different set-up activities. These set-up activities may be combined when several
components are maintained at the same time. We have found one article in this
category; it assumes a complex hierarchical set-up structure.
Van Dijkhuizen (2000) studies the problem of clustering preventive main-
tenance jobs in a multiple set-up multi-component production system. As far as the
authors know, this is the first attempt to model a maintenance problem with a
hierarchical (tree-like) set-up structure. Different set-up activities have to be done
at different levels in the production system before maintenance can be done. Each
component is maintained preventively at an integer multiple of a certain basic
interval, which is the same for all components, and corrective maintenance is
carried out in between whenever necessary. So, every component has its own
maintenance frequency; the frequencies are based on the optimal maintenance
planning for single components. Obviously, set-up activities may be combined
when several components are maintained at the same time. The problem is to find
the maintenance frequencies that minimize the average cost per unit of time. This
problem is an extension of the standard indirect-grouping problem (for an
overview of this problem see Dekker et al. 1996).
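A stylised indirect-grouping model can be optimised by brute force over the frequencies, with the basic interval then following in closed form. The cost-rate shape below (a shared set-up S per basic interval, individual set-ups s_i, and a deterioration cost rate growing linearly with the interval) is an illustrative assumption of ours; see Dekker et al. (1996) for the model variants actually used in the literature.

```python
from itertools import product
from math import sqrt

def indirect_grouping(S, s, a, kmax=6):
    """Stylised indirect-grouping model: component i is maintained every
    k_i*T time units, and the average cost rate is assumed to be
        phi(T, k) = (S + sum_i s_i/k_i) / T + T * sum_i a_i*k_i / 2,
    a shared set-up S per basic interval plus individual set-ups s_i
    and linearly growing deterioration cost rates a_i.  For fixed k the
    optimal T has the EOQ-like closed form sqrt(A/B)."""
    best = (float("inf"), None, None)
    for k in product(range(1, kmax + 1), repeat=len(s)):
        A = S + sum(si / ki for si, ki in zip(s, k))   # set-up cost per T
        B = sum(ai * ki for ai, ki in zip(a, k)) / 2.0 # deterioration slope
        T = sqrt(A / B)                  # minimiser of A/T + B*T
        best = min(best, (A / T + B * T, T, k))
    return best  # (cost rate, basic interval T, multipliers k)
```

Components with cheap deterioration but expensive individual set-ups receive large multipliers k_i, which is exactly the economy of scale the indirect-grouping strategy exploits.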
these policies the system is replaced at the time of the m-th failure, every T time
units, and at the minimum time of these events, respectively. These policies were
first introduced by Assaf and Shanthikumar (1987), Okumoto and Elsayed (1983)
and Ritchken and Wilson (1990), respectively. Popova and Wilson (1999) assume
that downtime costs are incurred when failed components are not repaired or
replaced. So, when the system operates there is also negative dependence between
the components. After all, when the components are left in a failed condition, with
the intention to group corrective maintenance, then downtime costs are incurred. In
the maintenance policies a trade-off between the downtime costs and the advan-
tages of grouping (corrective) maintenance is made.
Sheu and Jhang (1996) propose a new two-phase opportunistic maintenance
policy for a group of independent identical repairable units. Their model takes into
account downtime costs and the maintenance policy includes minimal repair,
overhaul, and replacement. In the first phase, (0,T], minor failures are removed by
minimal repairs and catastrophic failures by replacements. In the second phase,
(T,T+W], minor failures are also removed by minimal repairs, but catastrophic
failures are left unrepaired, leaving the unit idle. The generalized group
maintenance policy requires inspection at either the fixed time T+W or the time
when exactly k units are left idle, whichever comes first. At an inspection, all
idle components are replaced with
new ones and all operating components are overhauled so that they become as
good as new.
Higgins (1998) studies the problem of scheduling railway track maintenance
activities and crews. In this problem positive economic dependence is present in the
following way. The occupancy of track segments due to maintenance prevents all
train movements on those segments. The costs associated with this can be regarded
as downtime costs. The maintenance scheduling problem is modelled as a large scale
0-1 programming problem with many (non-linear) restrictions. The objective is to
minimize expected interference delay with the train schedule and prioritized finishing
time. The downtime costs are modelled by including downtime probabilities in the
objective function. The author proposes tabu search to solve the problem. The
neighbourhood, which plays a prominent role in local search techniques, is easily de-
fined by swapping the order of activities or maintenance crews.
The article of Sriskandarajah et al. (1998) discusses the maintenance scheduling
of rolling stock. Multiple train units have to be overhauled before a certain due date.
The aim is to find a suitable common due date for each train so that the due dates of
individual units do not deviate too much from the common due date. Maintenance
carried out too early or too late is costly since this may cause loss of use of a train.
A genetic algorithm is proposed to solve this scheduling problem.
Manpower restrictions
Safety requirements
Redundancy/production-loss
First, grouping maintenance results in a peak in manpower needs. Manpower
restrictions may even be violated and additional labour needs to be hired, which is
costly. The problem here is to find the balance between workload fluctuation and
grouping maintenance.
Second, there are often restrictions on the use of equipment, when executing
maintenance activities simultaneously. For instance, use of equipment may hamper
use of other equipment and cause unsafe operations. Legal and/or safety require-
ments often prohibit joint operation.
Third, joint (corrective) maintenance of components in systems in which some
kind of redundancy is available may not be beneficial. Although there may exist
economies of scale through simultaneous repair of a number of (identical) com-
ponents, leaving components in a failed condition for some time increases the risk
of costly production losses. We will come back to this in Section 11.3.3. Produc-
tion loss may increase more than linearly with the number of components out of
operation. For an example of this type of economic dependence we refer to Stengos
and Thomas (1980). The authors give an example of the maintenance of blast
furnaces. The disturbance due to maintenance grows substantially with the number
of furnaces out of operation; that is, the cost of overhauling the furnaces
increases more than linearly with the number of furnaces out of action.
It appears that maintenance of systems with negative dependence is often
modelled in discrete time. The models can be regarded as scheduling problems with
many restrictions. These restrictions can easily be incorporated in discrete time
models such as (mixed) integer programming models. With respect to these models,
there is always the question whether the exact solution can be found efficiently. In
other words, the question arises whether the problem is NP-hard. An example of
discrete time modelling is given by the article of Grigoriev et al. (2006). In this
article the so-called periodic maintenance problem (PMP) is studied. In this problem
machines have to be serviced regularly to prevent costly production losses. The
failures causing these production losses are not modelled. Time is discretized into
unit-length periods. In each period at most one machine can be serviced. Apparently
negative economic dependence in the form of manpower restrictions or safety
measures plays a role in the maintenance of the machines. The problem is to find a
cyclic maintenance schedule of a given length T that minimizes total service and
operating costs. The operating costs of a machine increase linearly with the number
of periods elapsed since the machine was last serviced. PMP appears to be an NP-hard
problem and the authors propose a number of solution methods. This leads to the
first exact solutions for larger sized problems.
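On a toy scale the PMP itself can simply be enumerated, which makes the objective explicit. The cyclic-gap cost expression below follows the linear operating-cost description above; everything else is an illustrative sketch rather than Grigoriev et al.'s formulation.

```python
from itertools import product

def pmp_bruteforce(a, b, T):
    """Brute-force a toy periodic maintenance problem: find a cyclic
    schedule of length T, servicing at most one machine per period (one
    entry per period enforces this by construction), that minimises
    service costs b_i plus operating costs growing linearly (rate a_i)
    with the number of periods since the machine was last serviced."""
    n = len(a)
    best = (float("inf"), None)
    for sched in product(range(-1, n), repeat=T):   # -1 = idle period
        if any(i not in sched for i in range(n)):
            continue                                # every machine serviced
        cost = 0.0
        for i in range(n):
            pos = [t for t, m in enumerate(sched) if m == i]
            cost += b[i] * len(pos)                 # service costs
            gaps = [(q - p) % T or T                # cyclic gaps, summing to T
                    for p, q in zip(pos, pos[1:] + pos[:1])]
            cost += a[i] * sum(g * (g - 1) / 2 for g in gaps)
        best = min(best, (cost, sched))
    return best
```

The exponential enumeration over (n+1)^T schedules is exactly why the NP-hardness of the problem matters and why column generation or heuristics are needed for realistic sizes.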
In Stengos and Thomas (1980) time is also discretized but the maintenance
problem, scheduling the overhaul of two pieces of equipment, is set up as a
Markov decision process. The pieces can be in different states and the probability
of failure increases with the time since the last overhaul. So in comparison with the
problem of Grigoriev et al. (2006), pieces can fail during operation. Negative
economic dependence is modelled as follows. The cost of overhauling the pieces
Optimal Maintenance of Multi-component Systems: A Review 273
increases more than linearly with the number of pieces out of action. The objective
is to minimize the loss of production cost, which is incurred when a piece is
overhauled. The optimal policy is found by a relative value successive approxima-
tion algorithm.
In Langdon and Treleaven (1997) the problem of scheduling maintenance for
electrical power transmission networks is studied. There is negative economic
dependence in the network due to redundancy/production-loss. Grouping certain
maintenance activities in the network may prevent a cheap electricity generator
from running, so requiring a more expensive generator to be run in its place. That
is, some parts of the network should not be maintained simultaneously. These
exclusions are modelled by adding restrictions to the MIP formulation of the prob-
lem. The authors propose several genetic algorithms and other heuristics to solve
the problem.
to assess the availability and the downtime costs of a k-out-of-n system. In their
article, Smith and Dekker (1997) optimize the following age-replacement policy. A
component is taken out for preventive maintenance and replaced by a stand-by one,
if its age has reached a certain value Tpm. Moreover, they determine the number of
redundant components needed in the system.
In the maintenance policies considered in the articles below, an attempt is made
to balance the negative aspects of downtime costs and the positive aspects of
grouping (corrective) maintenance. The opportunistic maintenance policies proposed
in these articles are age-based and also contain a threshold for the number of failures
(except for the policy introduced by Sheu and Kuo 1994).
In Dekker et al. (1998b) the maintenance of light-standards is studied. A light
standard consists of n independent and identical lamps screwed on a lamp assem-
bly. To guarantee a minimum luminance, the lamps are replaced if the number of
failed lamps reaches a pre-specified number m. In order to replace the lamps the
assembly has to be lowered. This set-up activity is an opportunity to combine
corrective and preventive maintenance. Several opportunistic age-based variants of
the m-failure group replacement policy (in its original form only corrective main-
tenance is grouped) are considered in this paper. Simulation optimization is used to
determine the optimal opportunistic age threshold.
Pham and Wang (2000) introduce imperfect PM and partial failure in a k-out-
of-n system. They propose a two-stage opportunistic maintenance policy for the
system. In the first stage failures are removed by minimal repair; in the second
stage failed components are jointly replaced with operating components when m
components have failed, or the entire system is replaced at time T, whichever
occurs first. Positive economic dependence is of an opportunistic nature. Joint
maintenance requires less time than individual maintenance.
Sheu and Kuo (1994) introduce a general age replacement policy for a k-out-of-
n system. Their model includes minimal repair, planned and unplanned replace-
ments, and general random repair costs. The system is replaced when it reaches age
T. The long-run expected cost rate is obtained. The aim of the paper is to find the
optimal age replacement time T that minimizes the long-run expected cost per unit
time of the policy.
The article of Sheu and Liou (1992) will be discussed in Section 11.4, because
they assume stochastic dependence between the components of a k-out-of-n system.
Instead, we want to give insight into the different ways of modelling failure
interaction between components and explain the implications of certain approaches
and assumptions with respect to practical applicability.
Stochastic dependence, also referred to as failure interaction or probabilistic
dependence, implies that the state of components can influence the state of the
other components. Here, the state can be given by the age, the failure rate, state of
failure or any other condition measure. In their seminal work on stochastic de-
pendence, Murthy and Nguyen (1985b) introduce three different types of failure
interaction in a two-component system.
Type I failure interaction implies that the failure of a component can induce a
failure of the other component with probability p (q), and has no effect on the other
component with probability 1 − p (1 − q). It follows that there are two types of
failures: natural and induced. The natural failures are modelled by random
variables and the induced failures are characterized by the probabilities p and q. In
Murthy and Nguyen (1985a) the authors extend type I failure interaction to systems
with multiple components. It is assumed that whenever a component fails it
induces a total failure of the system with probability p and has no effect on the
other components with probability (1 − p). In this chapter we will consider this to
be the definition of type I failure interaction.
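Under the extra assumption of exponential component lifetimes, the mechanics of this multi-component type I interaction are easy to simulate: component failures then form a Poisson stream of rate sum(rates), so the mean time to an induced total failure is 1/(p·sum(rates)), which the sketch below reproduces.

```python
import random

def mean_time_to_system_failure(rates, p, runs=20000, seed=7):
    """Type I failure interaction in the sense of Murthy and Nguyen
    (1985a), sketched under an assumed exponential lifetime law: each
    component failure is rectified immediately and, independently,
    induces a total system failure with probability p.  Superposed
    component failures form a Poisson stream of rate sum(rates), so
    the analytic answer is 1 / (p * sum(rates))."""
    rng = random.Random(seed)
    lam = sum(rates)
    total = 0.0
    for _ in range(runs):
        t = 0.0
        while True:
            t += rng.expovariate(lam)       # next component failure
            if rng.random() < p:            # induced total system failure
                break
        total += t
    return total / runs
```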
Type II failure interaction in a two-component system is defined as follows.
The failure of component 2 can induce a failure of component 1 with probability q,
whereas every failure of component 1 acts as a shock to component 2, without
inducing an instantaneous failure, but affecting its failure rate.
Type III failure interaction implies that the failure of each component affects
the failure rate of the other component. That is, every failure of one of the compo-
nents acts as a shock to the other component.
A potential problem of the failure rate interaction defined by the last two types,
is determining the size of the shock. In practice it is very difficult to assess the
effect of a failure of one component on the failure rate of another component.
Usually there is not much data on the course of the failure rate of a component
after the occurrence of a shock. Shocks can also be modelled by adding a (random)
amount of damage to the state of another component. Natural failures then occur if
the state of a component (measured by the cumulative damage) exceeds a certain
level. In this chapter we will bring this modelling of type II and III failure
interaction together in one definition. That is, we revise the definition of type II failure
interaction for multi-component systems. It reads as follows. The system consists of
several components, and the failure of a component either affects the failure rate
of, or causes a (random) amount of damage to the state of, one or more of the
remaining components. It follows that we regard a mixture of induced failures and
shock damage as type II failure interaction. Models with type II failure interaction
will also be called shock damage models.
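The shock damage mechanism can be isolated in a few lines. The sketch below abstracts away lifetimes and intensities and only simulates how many shocks component 2 survives before its cumulative damage exceeds a threshold (the damage law is a caller-supplied assumption):

```python
import random

def shocks_until_failure(K, damage_draw, runs=10000, seed=3):
    """Shock damage (type II) mechanics in isolation: each failure of
    component 1 adds a random amount of damage to component 2, which
    fails once its cumulative damage exceeds the threshold K.  Returns
    the average number of shocks component 2 survives; an illustrative
    sketch, with the per-shock damage law supplied by the caller."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        dmg, n = 0.0, 0
        while dmg <= K:
            dmg += damage_draw(rng)   # random damage added by one shock
            n += 1
        total += n
    return total / runs
```

With deterministic unit damage and K = 2.5, for instance, exactly three shocks are always needed; random damage laws reproduce the first-passage quantities of renewal theory.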
In general, the maintenance policies considered in the literature on stochastic
dependence are mainly of an opportunistic nature, since the failure of one
component is potentially harmful for the other component(s). Modelling failure interaction
appears to be quite elaborate. Therefore, most articles only consider two-compo-
nent systems. Below we review the articles on failure interaction in the following
order. First, we will discuss the type I interaction models. For this type of inter-
action different opportunistic versions of the well known age and block replace-
ment policies have been proposed. Second, the articles on type II interaction will
be reviewed. We will see that in most of these articles the occurrence of shocks is
modelled as a non-homogeneous Poisson process (NHPP) or that the failure rate of
components is adjusted upon failure of other components. Third, we pay attention
to articles that consider both types of failure interaction. Finally, we discuss other
forms of modelling failure interaction.
Lai and Chen (2006) consider a two-component system with failure rate
interaction. The lifetimes of the components are modelled by random variables
with increasing failure rates. Component 1 is repairable and it undergoes minimal
repair at failures. That is, component 1 failures occur according to a NHPP. Upon
failure of component 1 the failure rate of component 2 is modified (increased).
Failures of component 2 induce the failure of component 1 and consequently the
failure of the system. The authors propose the following maintenance policy. The
system is completely replaced upon failure, or preventively replaced at age T,
whichever occurs first. The expected average cost per unit time is derived and the
policy is optimized with respect to parameter T. The optimum turns out to be
unique.
Barros et al. (2006) introduce imperfect monitoring in a two-component
parallel system. It is assumed that the failure of component i is detected with
probability 1 − p_i and is not detected with probability p_i. The components have
exponential lifetimes, and when a component fails extra stress is placed on the
surviving one, whose failure rate is increased. Moreover, independent shocks
occur according to a Poisson process. These shocks correspond to common cause
failures and induce a system failure. The following maintenance policy is proposed.
Replace the system upon failure (either due to a shock or failure of the components
separately), or preventively at time T, whichever occurs first. Assuming that
preventive replacement is cheaper, the total expected discounted cost over an
unbounded horizon is minimized. Numerical examples show the relevance of taking
into account monitoring problems in the maintenance model. The model is applied
to a parallel system of electronic components. When one fails, the surviving one is
overworked so as to keep the delivery rate unaffected.
Murthy and Nguyen (1985b) derive the expected cost of operating a two-com-
ponent system with type I or type II failure interaction for both a finite and an
infinite time period. They consider a simple, non-opportunistic maintenance
policy: always replace failed components immediately. This means that the system
is only renewed if a natural failure induces a failure of the other component.
Nakagawa and Murthy (1993) elaborate on the ideas of Murthy and Nguyen
(1985b). They consider two types of failure interaction between two components.
In the first case the failure of component 1 induces a failure of component 2 with a
certain probability. In the second case the failure of component 1 causes a random
amount of damage to the other component. In the latter case the damage
accumulates and the system fails when the total damage exceeds a specified level.
Failures of component 1 are modelled as an NHPP with increasing intensity
function. The following maintenance policy is examined. The system is replaced at
failure of component 2 or at the N-th failure of component 1, whichever occurs
first. For both models the optimal number of failures before replacing the system,
so as to minimize the expected cost per unit time over an infinite horizon, is derived. The
maintenance policy for the shock damage model is extended as follows: the system
is also replaced at time T. This results in a two-parameter maintenance policy,
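For the simpler replace-at-the-N-th-failure ingredient of such policies, a closed-form cost rate exists under an assumed power-law NHPP, because the time of the N-th event of a process with cumulative intensity Λ(t) = t^β has mean Γ(N + 1/β)/Γ(N). The single-process sketch below is in the spirit of the policy, not Nakagawa and Murthy's two-component model:

```python
from math import gamma

def best_N(c_m, c_r, beta, Nmax=50):
    """Replace-at-the-N-th-failure policy: failures follow an NHPP with
    cumulative intensity L(t) = t**beta (wear-out for beta > 1); the
    first N-1 failures receive minimal repair at cost c_m, the N-th
    triggers a replacement at cost c_r.  By the renewal reward theorem
    the long-run cost rate is (c_m*(N-1) + c_r) / E[S_N], with
    E[S_N] = gamma(N + 1/beta) / gamma(N) for this intensity."""
    def rate(N):
        return (c_m * (N - 1) + c_r) / (gamma(N + 1.0 / beta) / gamma(N))
    return min(range(1, Nmax + 1), key=rate)
```

As intuition suggests, a more expensive replacement (larger c_r) postpones it, i.e. raises the optimal N.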
which is also optimized. The authors give an application of their models to the
The optimization methods applied to finite horizon models are either exact
methods or heuristics¹. Exact methods always find the global optimum solution of
a problem. If the complexity of an optimization problem is high and the computing
time of the exact method increases exponentially with the size of the problem, then
heuristics can be used to find a near-optimal solution in reasonable time.
The scheduling problem studied by Grigoriev et al. (2006) appears to be NP-
hard. Instead of defining heuristics, the authors choose to work on a relatively fast
exact method. Column-generation and a branch-and-price technique are utilized to
find the exact solution of larger-sized problems. The problem considered by
Papadakis and Kleindorfer (2005) is first modelled as a mixed integer linear pro-
gramming problem, but it appears that it can also be formulated as a max-flow
min-cut problem in an undirected network. For this problem efficient algorithms
exist and thus, an exact method is applicable.
Langdon and Treleaven (1997), Sriskandarajah et al. (1998), Higgins (1998)
and Budai et al. (2006) propose heuristics to solve complex scheduling problems.
The first two articles utilize genetic algorithms. Higgins (1998) applies tabu search
and Budai et al. (2006) define different heuristics that are based on intuitive
arguments. In all four articles the heuristics perform well; a good solution is found
within reasonable time.
11.7.1 Trends
In the last few years several articles have appeared on optimal maintenance of
systems with stochastic dependence. In particular, the shock-damage models have
received much attention. One explanation for this is that type II failure interaction
can be modelled in several ways, whereas there is not much room for extensions in
the type I failure model. Another reason is that since the field of stochastic de-
pendence is not very broad yet, it is easy to add a new feature such as minimal
repair or imperfect monitoring to an existing model. Third, many existing oppor-
tunistic maintenance policies for systems with economic dependence have not yet
been applied to systems with (type II) failure interaction.
Another upcoming field in multi-component maintenance modelling is the class
of finite horizon maintenance scheduling problems. Finite horizon models can be
¹ Actually, if the maintenance policy is relatively easy, it is sometimes possible to determine
the expected maintenance costs over a finite period of time. For instance, Murthy and
Nguyen (1985a,b) consider failure-based policies in a system with stochastic dependence
and derive an expression for the expected cost of operating the system for a finite time.
11.7.2.1 Case-studies
This review shows that case-studies are not represented very well in the field. This
is surprising, since maintenance is an applied topic. In our opinion many models
are just (mathematical) extensions of existing models, and most of the time models
are not validated empirically. Case-studies can lead to new models, both in the
context of cost structures and dependencies between components.
11.8 Conclusions
In this chapter we have reviewed the literature on optimal maintenance of
multi-component systems. We first classified articles on the basis of the type of
dependence between components: economic, stochastic and structural dependence.
Subsequently, we subdivided these classes into new categories. For example, we
have introduced the categories positive and negative economic dependence. We
have paid attention to articles with both forms of interaction. Moreover, we have
defined several subcategories in the class of models with positive economic de-
pendence. With respect to articles in the class of stochastic dependence, we are the
first to review these articles systematically.
Another classification has been made on the basis of the planning horizon of the
models and the optimization methods used. We have focussed our attention on the use of
heuristics and exact methods in finite horizon models. We have concluded that this
is a promising open research area.
We have discussed the trends and the open areas of research reported in the
literature on multi-component maintenance. We have observed a shift from infinite
horizon models to finite horizon models and from economic to stochastic depen-
dence. This immediately defines the open research areas, which also include topics
such as case studies, modelling combinations of dependencies between compo-
nents and modelling multiple set-up activities.
11.9 References
Assaf D, Shanthikumar J (1987) Optimal group maintenance policies with continuous and
periodic inspections. Management Science 33:1440–1452
Barros A, Bérenguer C, Grall A (2006) A maintenance policy for two-unit parallel systems
based on imperfect monitoring information. Reliability Engineering and System Safety
91:131–136
Budai G, Huisman D, Dekker R (2006) Scheduling preventive railway maintenance
activities. Journal of the Operational Research Society 57:1035–1044
Castanier B, Grall A, Bérenguer C (2005) A condition-based maintenance policy with non-
periodic inspections for a two-unit series system. Reliability Engineering & System Safety
87:109–120
Cho D, Parlar M (1991) A survey of maintenance models for multi-unit systems. European
Journal of Operational Research 51:1–23
Dekker R, Plasmeijer R, Swart J (1998a) Evaluation of a new maintenance concept for the
preservation of highways. IMA Journal of Mathematics Applied in Business and Industry
9:109–156
Dekker R, van der Duyn Schouten F, Wildeman R (1996) A review of multi-component
maintenance models with economic dependence. Mathematical Methods of Operations
Research 45:411–435
Dekker R, van der Meer J, Plasmeijer R, Wildeman R (1998b) Maintenance of light-
standards: a case-study. Journal of the Operational Research Society 49:132–143
Goyal S, Kusy M (1985) Determining economic maintenance frequency for a family of
machines. Journal of the Operational Research Society 36:1125–1128
Grigoriev A, van de Klundert J, Spieksma F (2006) Modeling and solving the periodic
maintenance problem. European Journal of Operational Research 172:783–797
Gürler Ü, Kaya A (2002) A maintenance policy for a system with multi-state components: an
approximate solution. Reliability Engineering & System Safety 76:117–127
Higgins A (1998) Scheduling of railway track maintenance activities and crews. Journal of
the Operational Research Society 49:1026–1033
Jhang J, Sheu S (2000) Optimal age and block replacement policies for a multi-component
system with failure interaction. International Journal of Systems Science 31:593–603
Lai M, Chen Y (2006) Optimal periodic replacement policy for a two-unit system with
failure rate interaction. The International Journal of Advanced Manufacturing Technology
29:367–371
Langdon W, Treleaven P (1997) Scheduling maintenance of electrical power transmission
networks using genetic programming. In Warwick K, Ekwue A, Aggarwal R (eds.)
Artificial intelligence techniques in power systems, Institution of Electrical Engineers,
Stevenage, UK, 220–237
Murthy D, Nguyen D (1985a) Study of a multi-component system with failure interaction.
European Journal of Operational Research 21:330–338
Murthy D, Nguyen D (1985b) Study of two-component system with failure interaction.
Naval Research Logistics Quarterly 32:239–247
Nakagawa T, Murthy D (1993) Optimal replacement policies for a two-unit system with
failure interactions. RAIRO Recherche Opérationnelle / Operations Research 27:427–438
Okumoto K, Elsayed E (1983) An optimum group maintenance policy. Naval Research
Logistics Quarterly 30:667–674
Özekici S (1988) Optimal periodic replacement of multicomponent reliability systems.
Operations Research 36:542–552
Papadakis I, Kleindorfer P (2005) Optimizing infrastructure network maintenance when
benefits are interdependent. OR Spectrum 27:63–84
12.1 Introduction
Businesses require equipment in order to function and deliver their outputs. In the
global, competitive environment, this equipment is critical to success. However,
equipment generally degrades with age and usage, and investment is required to
maintain the functional performance of equipment. For example, in mass urban
transportation, annual expenditure on equipment replacement for the Hong Kong
underground is of the order of $50 million, and further, the Hong Kong underground
network is a fraction of the size of that in London, Paris or New York. Where
equipment replacement impacts significantly on the bottom line of a corporation and
decision-making about such expenditure is under the control of the company
executive, the modelling of such decision making is within the scope of this chapter.
Capital equipment investment projects are typically driven by operating cost
control, technical obsolescence, requirements for performance and functionality
improvements, and safety. That is, rational decision-making about capital equip-
ment replacement will take account of engineering, economic, and safety require-
ments. In this chapter we will assume that the engineering requirements concerning
replacement will define certain choices for equipment replacement. For example,
engineers would normally propose a number of options for providing the continuity
of equipment function: retain the current equipment as is, refurbish the equipment in
order to improve operation and functionality, or replace the equipment with new im-
proved technology. We will further assume that safety requirements are addressed
when these options are analysed by engineers. Consequently, we argue that rational
choice between the defined replacement options is an economic question. Thus, a
logistics corporation may be considering replacement of certain assets in its road
transportation fleet. The organisation may have to raise capital to fund such
replacement. There is the expectation that engineers for the corporation will offer a
number of choices for replacement (e.g. buy tractors from company X or Y, buy
tractors now or in N years time, or scrap or retain existing tractors as spares) that
meet future functional and safety requirements. In this way, decision making about
288 P. Scarf and J. Hartman
replacement then necessarily considers the costs of the replacement options over
some suitable planning horizon. As capital equipment replacement potentially in-
curs significant costs, the cost of capital is a factor in the decision problem and
models to support decision making typically take account of the time value of
capital through discounting.
Capital equipment is a significant asset of a business. It consists of necessarily
complex systems and a business would typically own or operate a fleet of equipment:
the Mass Transit Railway Corporation Limited of Hong Kong operates hundreds of
escalators; FedEx Express, the cargo airline corporation, operates more than 600
aircraft; electricity distribution systems comprise thousands of kilometres of cable
and hundreds of thousands of items such as transformers and switches; water supply
networks are on a similar scale. We can appeal to the law of large numbers and
assume with some justification that the economic costs that enter capital equipment
replacement decisions are deterministic. Consequently, we consider deterministic
models in this chapter and model rational decision making throughout using net
present value techniques (e.g. see Arnold 2006; Northcott 1985).
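The net-present-value reasoning behind these deterministic models can be made concrete in a few lines. The cost shapes, figures and names below are illustrative assumptions: an asset bought at a fixed price, yearly operating costs paid in arrears, a resale value at replacement, and an infinite chain of identical replacement cycles.

```python
def chain_npv(L, purchase, op_cost, resale, d=0.08):
    """NPV of replacing the asset every L years for ever: op_cost(t) is
    the operating cost in year t of an asset's life (paid at year end),
    resale(L) the value recovered when it is replaced, d the discount
    rate.  One cycle's NPV is repeated geometrically over the horizon."""
    v = 1.0 / (1.0 + d)                     # one-year discount factor
    cycle = (purchase
             + sum(op_cost(t) * v**t for t in range(1, L + 1))
             - resale(L) * v**L)
    return cycle / (1.0 - v**L)             # sum of cycle * v**(k*L), k >= 0

def economic_life(purchase, op_cost, resale, d=0.08, Lmax=30):
    """Replacement age minimising the infinite-horizon NPV."""
    return min(range(1, Lmax + 1),
               key=lambda L: chain_npv(L, purchase, op_cost, resale, d))
```

With a flat operating cost and no resale value the asset is kept as long as allowed, while steeply rising operating costs push the economic life down; this is the trade-off that the economic life models of Section 12.3 formalise.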
When considering optimal equipment replacement in an uncertain environment,
authors have argued the case for using real options (Dixit and Pindyck 1994; Bowe
and Lee 2004). Whenever replacement decisions may be exercised continuously, it
is argued that the choice to replace an existing asset with a new asset at a specified
time is characteristic of an American call option; this approach seeks to value the
opportunity to replace the asset. Such a modelling approach would be valuable
when considering expansion of assets, for example, through the building of a new
transportation link for which the likely return on investment would be highly
uncertain. However, we do not consider this approach in this chapter.
We do not consider problems of component replacement in which the func-
tionality of repairable systems is optimized either on a cost basis or a required
reliability basis. Such maintenance does not typically involve capital expenditure,
and the models used are often stochastic in nature (times to failure are considered to be random). For a recent review of such models, see Wang (2002).
The outline of the chapter is as follows. In Section 12.2 we describe the frame-
work for the classification of models that are discussed in this chapter. This frame-
work considers the nature of capital equipment replacement problems in general
and presents further detail regarding the nature of cost factors that contribute to
replacement decisions. Section 12.3 looks at economic life models and discusses
several models and an application of one of the models. Section 12.4 deals with
replacement of a network system. Dynamic programming models are discussed in
Section 12.5 and the chapter concludes with a discussion of topics for future
research in Section 12.6.
ment, or entire fleet replacement (Scarf and Christer 1997). The capital replace-
ment models that are considered in this chapter may be classified as economic life
models or dynamic programming models. The former are concerned with deter-
mining the optimal lifetime of an item of equipment, taking account of costs over
some planning horizon. The latter considers replacement decisions dynamically,
determining whether plant should be retained or replaced after each period. Eco-
nomic life models may be further classified according to the length of the planning
horizon: infinite, variable finite (with length of the horizon a function of decision
variables), or fixed (with a variable number of replacement cycles). Dynamic
programming models generally require a finite horizon, but may be used to identify
the optimal time zero decision for an infinite horizon.
Early models (e.g. Eilon et al. 1966) were formulated in continuous time with the optimum policy obtained using calculus. More complex models are simpler to
implement under a discrete time formulation. In the case of economic life models,
optimization may be performed using a crude search when there exists a small
number of decision variables. For fleets with many items, the discrete time
formulation naturally gives rise to mathematical programming problems. Dynamic
programming models necessarily require a discrete time formulation. Real options
models are formulated in continuous time.
We begin by looking at simple economic life models. These are applied in a
case study on escalator replacement. Economic life models are then extended to
consider first an inhomogeneous fleet and second a network system viewed as an
inhomogeneous fleet with interacting items. A number of different dynamic programming models are introduced for individual systems and then extended to homogeneous and inhomogeneous fleets and networks of assets.
It is assumed that data relating to maintenance are available and sufficient for
modelling purposes. Data on other age related operating costs, such as fuel costs
and failures (breakdowns), would also ideally be available. Where usage of plant is
non-uniform, particularly if decreasing with age, usage data are also required for
replacement policy to be meaningful. This is because, for example, maintenance
costs for older plant may be artificially low due to under-utilization or neglect of
good maintenance practice for plant near the end of their useful life. Some plant
may even be retired as occasional spares. Under-reporting, and thus bias, of maintenance cost data may also be significant (Scarf 1994). Replacement models have
also been considered when cost information is obtained subjectively (Apeland and
Scarf 2003).
Penalty costs play a role in all replacement decisions (Christer and Scarf 1994).
It is only the extent to which penalty cost is quantified in the modelling process
that varies. Rather than attempt to estimate the values of difficult-to-quantify parameters such as penalty cost and then determine the optimal policy, the influence of these parameters on the decision should be quantified. In this latter approach, threshold values that lead to a step-change in the optimum policy can be investigated and presented, and the decision-makers can then consider whether they believe that such values are realistic within the context of the problem. Thus, the penalty cost
can be used to measure in part the subjective component of a replacement choice.
All costs considered in the modelling will be discounted to net present value
through the use of a constant discount factor. We refer the reader to Kobbacy and
290 P. Scarf and J. Hartman
Nicol (1994) for a detailed discussion of the role of discounting in capital replace-
ment. Appropriate functions describing resale values are assumed to be known, as
are purchase costs. Tax considerations in particular contexts should be taken into
account and modelled.
Early economic life models such as Eilon et al. (1966) considered an idealised item of equipment replaced at age T, that is, replacement every T time units, in perpetuity.
In this idealised framework, for T small, frequent replacement leads to high
replacement or capital costs. Infrequent replacement (large T), on the other hand,
results in high operating or revenue costs (assuming that operating costs increase
with the age of equipment). Trading-off capital costs against revenue costs leads to
an optimum age at replacement, T*, the so-called economic life. The decision
criterion is typically the total cost per unit time or the annuity; this latter term has
been called the rent by Christer (1984). In the case without discounting, the total
cost per unit time, c(T), and the annuity are equivalent and
c(T) = \{ \int_0^T m_0(t)\,dt + R \} / T ,   (12.1)

where m_0(t) is the operating cost rate and R is the replacement cost, assuming no residual value. From Equation 12.1, it follows that T^* is the solution of

\int_0^{T^*} m_0(t)\,dt + R = T^* m_0(T^*) ,
provided it exists. In its discrete time form the total cost per unit time is
c(T) = \{ \sum_{i=1}^{T} m_{0i} + R \} / T , where m_{0i} is the operating cost in time period i. With a
discount factor \nu, discounting to year end, and a residual value function S(T), the net present value (NPV) of all future costs in perpetuity is

c_{NPV}(T) = (1 + \nu^T + \nu^{2T} + \cdots) \{ \sum_{i=1}^{T} m_{0i} \nu^i + \nu^T [R - S(T)] \}
           = (1 - \nu^T)^{-1} \{ \sum_{i=1}^{T} m_{0i} \nu^i + \nu^T [R - S(T)] \} .

The annuity, c_{rent}(T), satisfies

(1 + \nu + \nu^2 + \cdots) c_{rent}(T) = (1 + \nu^T + \nu^{2T} + \cdots) \{ \sum_{i=1}^{T} m_{0i} \nu^i + \nu^T [R - S(T)] \} ,
Replacement of Capital Equipment 291
whence

c_{rent}(T) = \frac{1 - \nu}{1 - \nu^T} \{ \sum_{i=1}^{T} m_{0i} \nu^i + \nu^T [R - S(T)] \} .
Notice that as \nu \to 1, c_{rent}(T) \to c(T), the total cost per unit time. The economic life can be obtained by minimising c_{rent}(T), typically using a spreadsheet, by considering a range of values of T.
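The spreadsheet-style search described above is straightforward to sketch in code. The following is a minimal illustration, not the chapter's calculation: the discount factor, operating cost function and resale function below are invented for the example.

```python
def rent(T, nu, m, R, S):
    # Discounted annuity (rent) of replacing every T periods:
    # c_rent(T) = (1 - nu)/(1 - nu**T) * (sum_{i=1}^{T} m(i) nu**i + nu**T (R - S(T)))
    total = sum(m(i) * nu**i for i in range(1, T + 1)) + nu**T * (R - S(T))
    return (1 - nu) / (1 - nu**T) * total

# Hypothetical inputs: operating cost rising linearly with age, resale value
# decaying geometrically; none of these figures come from the chapter.
nu = 0.9                         # one-period discount factor
m = lambda i: 2.0 + 0.8 * i      # operating cost in period i of a cycle
R = 50.0                         # replacement (capital) cost
S = lambda T: 0.5 * R * 0.8**T   # residual value at age T

# Search over a range of T, exactly as one would do in a spreadsheet.
T_star = min(range(1, 41), key=lambda T: rent(T, nu, m, R, S))
```

The economic life T_star is simply the minimiser of the rent over the candidate ages considered.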
Here K and L are decision variables, with K modelling the time (from now) to
replacement of the existing asset; K+L is the time to second replacement. The
advantage of this model is that one need only estimate the operating costs of the existing and new assets (as functions of age), the capital cost of the new asset, R_1, and the age-related resale or residual values of the existing and new assets, S_0 and S_1.
In the financial appraisal of projects, a standard approach fixes the time horizon
and determines the NPV of future costs over this horizon (e.g. Northcott 1985).
This fixed horizon model has been studied by Scarf and Hashem (2003) and its
simplicity lends itself to application in complex contexts (e.g. Scarf and Martin
2001). The annuity for this model can be derived from Equation 12.2 above simply
by setting X = K and K + L = h , the length of the planning horizon, and then
considering h as fixed. There is then only one decision variable, X, the time to replacement. Allowing the possibility that X = h, that is, no replacement over the planning horizon (the current asset is retained), the annuity function has a discontinuity at X = h, and X^* = h implies that it is not optimal to undertake the
(replacement) project. Furthermore, since the replacement at the end of the horizon
has a fixed cost (with respect to the decision variable X) its inclusion or exclusion
has no effect on the optimal time to replacement. It is natural not to include the
replacement cost at the horizon-end since a standard financial appraisal approach
would only account for revenue costs up to project execution, capital costs at
project execution, subsequent revenue costs up to the horizon-end, and residual
values. Including the replacement at h on the other hand allows cost comparisons
with the two-cycle model and the associated rent, Equation 12.2. We take the
former approach here, however, and the annuity is

c_{rent}^h(X) = \{ \sum_{i=1}^{X} m_0(i + \tau) \nu^i + \nu^X [R_1 - S_0(X + \tau)] + \sum_{i=X+1}^{h} m_1(i - X) \nu^i - \nu^h S_1(h - X) \} / \sum_{i=1}^{h} \nu^i ,   X < h ,   (12.3)

c_{rent}^h(X) = \{ \sum_{i=1}^{h} m_0(i + \tau) \nu^i - \nu^h S_0(h + \tau) \} / \sum_{i=1}^{h} \nu^i ,   X = h ,

where \tau denotes the current age of the existing asset.
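Equation 12.3 can be evaluated numerically in the same spreadsheet-like fashion. The sketch below implements the two branches of the fixed-horizon annuity; all inputs (cost functions, resale functions, the age tau of the existing asset) are hypothetical placeholders, not the case-study data.

```python
def annuity_fixed_h(X, h, nu, tau, m0, m1, R1, S0, S1):
    # Fixed-horizon annuity (Equation 12.3): replace the existing asset (age tau)
    # at time X; if X = h, no replacement occurs within the horizon.
    denom = sum(nu**i for i in range(1, h + 1))
    if X < h:
        num = (sum(m0(i + tau) * nu**i for i in range(1, X + 1))
               + nu**X * (R1 - S0(X + tau))
               + sum(m1(i - X) * nu**i for i in range(X + 1, h + 1))
               - nu**h * S1(h - X))
    else:  # X = h: retain the current asset throughout
        num = (sum(m0(i + tau) * nu**i for i in range(1, h + 1))
               - nu**h * S0(h + tau))
    return num / denom

# Hypothetical inputs, for illustration only.
nu, tau, h = 0.95, 5, 22
m0 = lambda a: 8.8 + 0.3 * a        # operating cost of the existing asset at age a
m1 = lambda a: 6.9 + 0.3 * a        # operating cost of a new asset at age a
R1 = 62.9                           # capital cost of the new asset
S0 = lambda a: max(0.0, 20.0 - a)   # resale value, existing asset
S1 = lambda a: max(0.0, 40.0 - a)   # resale value, new asset

best_X = min(range(1, h + 1),
             key=lambda X: annuity_fixed_h(X, h, nu, tau, m0, m1, R1, S0, S1))
```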
For the case of constant operating costs, m_0 per period for the existing asset and m_1 per period for the new, with no discounting and no resale values, the two-cycle rent reduces to

c_{rent}^2(K, L) = (K m_0 + R + L m_1 + R) / (K + L) ,   (12.4)

and K^* = 0 (with L^* = l_{max}) only if

l_{max}(m_0 - m_1) > 2R .   (12.5)
We can consider a similar argument for the fixed horizon model. Thus, dc_{rent}^h(X)/dX = (m_0 - m_1)/h for X < h, and so dc_{rent}^h(X)/dX > 0 if m_0 > m_1. However, since c_{rent}^h(X) has a discontinuity at X = h, X^* = 0 is optimal only if m_0 > m_1 and c_{rent}^h(0) < c_{rent}^h(h); that is, if (R + h m_1)/h < m_0, that is, if

h(m_0 - m_1) > R .   (12.6)
Comparison of the inequalities at Equations 12.5 and 12.6 shows that the two models have different properties in terms of the behaviour of optimal policy as a function of the cost parameters. Thus the two-cycle model is inconsistent with standard
financial models. However, a simple modification to the model will correct this
inconsistency. Scarf et al. (2006) suggest simply to omit the replacement at the end
of the second cycle. For the constant revenue case above, the rent becomes c_{rent}^2(K, L) = (K m_0 + R + L m_1) / (K + L), and the optimal policy would be K^* = 0 (L^* = l_{max}) if l_{max}(m_0 - m_1) > R, which is consistent with the fixed horizon model
and hence with standard financial appraisal models.
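The consistency argument can be checked numerically. In the sketch below the cost figures are invented; it verifies that, for constant operating costs and no discounting, the unmodified two-cycle rent imposes the hurdle l_max(m_0 - m_1) > 2R while the modified rent imposes the weaker hurdle l_max(m_0 - m_1) > R, consistent with the fixed horizon model.

```python
# Hypothetical constant operating costs m0 (existing asset) and m1 (new asset),
# replacement cost R, and horizon / maximum second-cycle length h = l_max.
m0, m1, R, h = 10.0, 8.0, 25.0, 20

two_cycle = lambda K, L: (K * m0 + R + L * m1 + R) / (K + L)   # Equation 12.4
modified  = lambda K, L: (K * m0 + R + L * m1) / (K + L)       # one replacement only

saving = h * (m0 - m1)                      # 40 here: exceeds R = 25 but not 2R = 50
replace_now_unmodified = saving > 2 * R     # hurdle of the unmodified model
replace_now_modified   = saving > R         # hurdle of the modified / fixed horizon model
```

With these numbers the unmodified model rejects immediate replacement while the modified model accepts it: the cost hurdle of the unmodified model is set artificially high.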
However, it would appear that the two-cycle model, with its two replacements (at t = K and at t = K + L), is applicable for the case of increasing operating costs, and that the modified two-cycle model, with one replacement (at t = K only), is applicable for operating costs that are constant or increasing only slowly. However, this issue can be
resolved. When operating costs are increasing only slowly, typically L* does not
exist, and, in practice L must be constrained such that L lmax (as pointed out
above) since numerically we can only search for L* over a finite space. In
constraining L lmax under the two replacements formulation, we impose a
replacement at lmax when in fact there should not be a second replacement since
L* does not exist. This then suggests that the two-cycle replacement model should
be modified in the following subtle way: if there does not exist an L such that c_{rent}^2(K, L) has a minimum strictly within the search space, that is, within \{(K, L): 0 < K < K_{max}, 0 < L < l_{max}\}, then, when determining the K which minimises c_{rent}^2(K, l_{max}), no replacement cost should be incurred at t = K + l_{max}.
Thus the model should be modified so that there is only one replacement.
Otherwise the cost hurdle for replacement of the current asset will be set
artificially high (inequality at Equation 12.5). Thus, in all practical situations for
which operating costs are increasing only slowly, one should use this modified
two-cycle model or the fixed horizon model as a special case.
Using the fixed horizon model or equivalently using the modified two-cycle model
with a finite search space may lead to significant end-of-horizon effects (since
costs beyond the horizon-end are ignored). Thus time to first replacement will
depend on h (or equivalently l_{max}). Choice of h (or l_{max}) will need to be considered carefully; in practice the horizon may be specified by company policy on accounting methods, and discounting may in any case reduce those costs incurred in the distant future.
Example 12.1
Decision making regarding the replacement of escalators on a mass transit rail
system in a particular city has been considered over a number of years by the
corporation that owns and operates the system (Scarf et al. 2006). Maintenance of
escalators is generally outsourced to equipment suppliers due to the difficulty that
alternative contractors have in obtaining proprietary spares. The original manu-
facturers can keep costs down as a result of the economy of scale that is achievable
through maintaining equipment over a large number of client organisations. Cur-
rently, the corporation operates of the order of 600 escalators and the annual
maintenance contract price is over $10 million. Escalator replacement is therefore a
significant issue within the organisation.
Studies by the corporation suggest that the economic life of escalators is of the
order of 25 years but that, based on overseas experience, escalator life can be
extended to up to 40 years. However, given the size of the fleet, a strategy has to be
set to manage escalator maintenance and to deal with the replacement or refurbish-
ment of older escalator assets. A key factor in this strategy is the approach of the
organisation to the re-negotiation of maintenance contracts and in particular to
determine the scale of refurbishment of older assets and the level of major parts
replacement and supply within the negotiated contract.
For the presentation of the modelling work in this example, it is necessary to
consider the asset management options open to the corporation in a simple manner,
and a homogeneous sub-fleet of the escalators is considered, with modelling carried out for a typical escalator; this is a reasonable simplification since all escalators in the sub-fleet were installed at approximately the same time. For this group, replacement, although crudely costed, was not really a viable option: economic costs were too high and disruption unacceptable given the duration of replacement
work. Refurbishment by the original manufacturer, replacing worn parts, upgrading
the control system and maintenance access was being carefully considered by the
corporation as a viable strategy for managing the asset life. Cost savings could be
achieved through a reduction in the annual maintenance contract price subsequent
to refurbishment. Thus, put simply, for the escalator group, the corporation was
faced with the decision: continue with the current relatively higher-price main-
tenance contract or refurbish and benefit from a new relatively lower-priced
maintenance contract. Other benefits would also accrue from refurbishment for
both contractor and the corporation. For the contractor, improved access and safety
for maintenance was part of the refurbishment package. For the corporation, up-
grade of the control system would result in fewer unplanned escalator stoppages.
We consider four asset management options: do nothing (continue with the high-price maintenance contract); refurbish (renew worn parts, retro-fit a new control system and proceed with the lower-price maintenance contract); delay refurbishment (delay refurbishment for up to n years); and replace (a full replacement option with nominal costs included for comparison purposes). The costs of refurbishment (per escalator) in the present study were obtained from initial quotations
from the respective manufacturers: these are $63K for refurbishment. On-going
annual maintenance contract costs (per escalator) are: $9K pre-refurbishment; $7K
post-refurbishment. Prior to refurbishment the cost of replacement of major parts is
in addition to the annual maintenance contract and major parts are replaced on the
basis of condition. Post-refurbishment, the annual maintenance contract includes
replacement of major parts at no extra cost. Given that we might expect major parts
to be replaced somewhat less frequently than dictated by their recommended lives,
we introduce a cost parameter to model such life-extension; this is called the effective life factor, \alpha. \alpha = 1 implies that major parts are replaced at a frequency corresponding to their recommended life (for example, once every 25 years for the steps at a cost of $48K), with the replacement frequency scaled by 1/\alpha (\alpha = 2 implies replacement of steps once every 50 years). The cost of a replacement ($170K) is a
nominal figure and used mainly for crude comparison with refurbishment. In
practice, replacement may cost significantly more than this.
The corporation recommend a discount rate of r = 0.11 and a projected inflation rate of i = 0.05. This corresponds to an effective discount factor, \rho, of 0.057, where 1/(1 + \rho) = (1 + i)/(1 + r).
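The arithmetic behind the effective discounting can be verified directly; this short sketch uses the rates quoted in the text.

```python
r, i = 0.11, 0.05             # corporation's discount rate and projected inflation rate
factor = (1 + i) / (1 + r)    # one-year effective discount multiplier, about 0.946
rho = 1 / factor - 1          # effective discount rate: 1/(1 + rho) = (1 + i)/(1 + r)
# rho = (r - i)/(1 + i), approximately 0.057, matching the value in the text
```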
the escalator control system to allow power-dip ride-through; this facility prevents unnecessary emergency stops, caused by momentary power loss, that can cause injuries to passengers. However, the effectiveness of the ride-through facility is uncertain; hence we introduce another cost parameter, the control system retro-fit effectiveness.
Table 12.1. Annuities ($000s per escalator per year) for the modified two-cycle model
with refurbishment at K years from now and again after a further L years. Annuities for
fixed horizon model with h = 22 years highlighted, except for X* = 22 (no replacement) for
which annuity = $139.4K. Cost parameters as follows: refurbishment cost, $62.9K; effective
discount factor, 0.06 (equivalent to inflation rate of approximately 0.05 and discount rate of
0.11); penalty cost of failure, $5K; effective life parameter, 1.5; control system retro-fit
effectiveness, 75%; cost of refurbishment delay, $10K; annual maintenance contract pre-
refurbishment, $8.8K (per escalator); annual maintenance contract post-refurbishment,
$6.9K (per escalator).
                          L, length of the second cycle, years
                    1     3     5     7     9    11    13    15    17    19    21
K, length     1  491.8 298.0 233.0 200.3 180.7 167.8 158.6 151.8 146.6 142.5 139.2
of the        3  302.4 236.1 202.7 182.8 169.6 160.2 153.3 148.0 143.8 140.5 137.8
first         5  241.1 206.8 186.2 172.5 162.9 155.8 150.3 146.0 142.5 139.7 137.4
cycle,        7  211.6 190.2 176.1 166.1 158.7 153.1 148.6 145.0 142.1 139.7 137.7
years         9  193.9 179.3 169.0 161.4 155.5 150.9 147.2 144.2 141.7 139.6 137.9
             11  182.2 171.6 163.7 157.7 153.0 149.2 146.1 143.6 141.4 139.6 138.1
             13  173.9 165.9 159.7 154.9 151.0 147.8 145.2 143.0 141.2 139.6 138.3
             15  167.8 161.5 156.5 152.6 149.3 146.7 144.4 142.5 140.9 139.6 138.4
             17  163.1 158.0 154.0 150.7 148.0 145.7 143.8 142.1 140.8 139.6 138.5
             19  159.4 155.3 151.9 149.1 146.8 144.9 143.2 141.8 140.6 139.5 138.6
             21  156.4 153.0 150.2 147.8 145.9 144.2 142.8 141.5 140.4 139.5 138.7
The cost parameters in Table 12.1 and Figure 12.1 are held at intermediate
values. In Figure 12.2, we present annuities for a number of replacement options
as a function of each of the cost parameters. These replacement options correspond
to those considered by the corporation, with refurb referring to immediate refur-
bishment (in year 1), and delay refurb referring to refurbishment in year 10 (from
time of study). Given the size of the fleet, a constraint on the number of escalators
that can be refurbished at any one time and the duration of refurbishment, we
would expect the refurbishment programme to last some 15 years and therefore a
significant proportion of the fleet would experience this kind of delay prior to
refurbishment. Therefore we include it as a particular policy for indicative purposes. We use the fixed horizon model here in order to make comparisons between annuities; this is because one would wish to compare the cost of different options over the same horizon. Equivalently, we could use the modified two-cycle model with the additional constraint K + L = h = 22 (years), say.
Figure 12.1a,b. Annuities ($000s per escalator per year) for modified two-cycle model with
refurbishment at K years from now and operation for a further L years. Annuities for fixed
horizon model with h = 22 years also shown (X < 22: bold, solid curve; X = 22 shown separately). Cost
parameters as Table 12.1 except: a effective discount factor equals 0.06 (equivalent to
inflation rate of approximately 0.05 and discount rate of 0.11); b no discounting.
From Figure 12.2 we can see that optimum policy is certainly sensitive to these
cost factors with the influence of cost parameters as expected. Threshold values that
lead to a step-change in the optimum policy (option) can be observed from these
figures. Thus while estimation of the penalty cost of failure, for example, may be
difficult and contentious, the importance of its effect can be observed. This may
then provide an incentive for further investigation of this parameter or discussion
about whether its true value is above or below the threshold of policy change.
As a final note for the escalator replacement problem in particular, one could
argue that the cost of differing options or policies will reflect the maintenance contractor's profit requirement, whatever the details of the arrangement, and therefore the total costs of the options would be expected to vary very little. What can differ,
however, is that some options may lead to lower risk (for example where the
contractor bears the cost of major parts wear-out which may be subject to signifi-
cant uncertainty) and lower risk is certainly desirable from the point of view of the
operator.
Consider now a fleet consisting of sub-fleets classified on the basis of class (e.g.
vehicle-type) and age (or condition) so that the operator of the fleet is concerned with
the replacement of sub-fleets, and not with replacement of individual equipment or
with replacement of the entire fleet. For this fleet, it is natural to focus on the replace-
ment of particular sub-fleet(s). The economic life models of the previous section
must be extended given that the replacement of particular sub-fleets has cost implica-
tions for the rest of the fleet.
Figure 12.2a-d. Annuities (per escalator) as a function of cost parameters for fixed horizon
model, with h = 22 years for various refurbishment/replacement options: a annuity vs.
effective life parameter; b annuity vs. penalty cost of failure; c annuity vs. nominal discount
rate; d annuity vs. horizon length h. Cost parameter values when not varying set at: effective
life, 1.5; penalty cost of failure, $5K; nominal discount rate, 0.11; control system retro-fit
effectiveness, 75%; refurbishment delay cost, $10K.
c_{tdc}(N, L; h) = \sum_{i=1}^{N} \nu^{t_i} \{ \sum_{s=t_{i-1}+1}^{t_i} m_i(s) \nu^{s - t_i} + n_{r+i} R_{r+i} - S_i(t_i) \} ,   (12.7)
where t_i = \sum_{j=0}^{i} L_j with L_0 = 0. Here m_i(\cdot) is the age-related operating cost of the whole fleet in cycle i; S_i(\cdot) is the age-related resale value of plant in sub-fleet i; R_{r+i} is the cost of each replacement plant in sub-fleet r+i; and \nu is the discount factor. The costs m_i(\cdot) and S_i(\cdot) may be expressed as

m_i(s) = \sum_{k=i}^{r+i-1} \sum_{j=1}^{n_k} M_k(\theta_{kj} + s) ,   (i = 1, ..., N),

S_i(L_i) = \sum_{j=1}^{n_i} S_i^1(\theta_{ij} + L_i) ,   (i = 1, ..., N),
where M_k(\cdot) is the age-related operating cost per unit time for an individual plant in sub-fleet k (k = 1, ..., r+N), and S_i^1(\cdot) is the age-related resale value for an individual plant in sub-fleet i. (Also, \theta_{kj} = 0 for k > r.) Appropriate penalty costs, associated with failures, may be incorporated into the operating costs.
The annuity, c_{tdc}(N, L; h) / \sum_{i=1}^{h} \nu^i, or other suitable objective function may then be minimized subject to the constraint \sum_{i=1}^{N} L_i = h. Technological change is
allowed for in that costs relating to proposed plant for cycles 2,..,N may be
assigned as appropriate. The optimum replacement schedule may be obtained by
minimizing the objective function over all possible schedules. In practice the range
of possible schedules would be narrowed greatly by the experience of the operator.
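As an illustration of schedule search for a mixed fleet, the sketch below enumerates replacement orders and times for a toy two sub-fleet problem. It is a simplified version of Equation 12.7 (resale values omitted, one replacement per sub-fleet), and every cost figure is hypothetical.

```python
from itertools import permutations

# Toy problem: each sub-fleet has n items, a current age, a per-item age-related
# operating cost M(age) and a per-item replacement cost R. All values invented.
nu = 0.95
h = 12
subfleets = {
    "A": dict(n=4, age=8, M=lambda a: 1.0 + 0.25 * a, R=10.0),
    "B": dict(n=6, age=5, M=lambda a: 1.2 + 0.20 * a, R=12.0),
}

def schedule_cost(order, times):
    # Discounted operating + replacement cost over h years when the sub-fleets
    # in `order` are replaced once, at the corresponding entries of `times`.
    cost = 0.0
    for name, sf in subfleets.items():
        t_rep = times[order.index(name)]
        for s in range(1, h + 1):
            age = sf["age"] + s if s <= t_rep else s - t_rep  # new kit after replacement
            cost += sf["n"] * sf["M"](age) * nu**s
        cost += sf["n"] * sf["R"] * nu**t_rep
    return cost

# Crude enumeration of orders and replacement times; in practice the search
# space would be narrowed by the operator's experience.
candidates = [
    (order, (t1, t2))
    for order in permutations(subfleets)
    for t1 in range(1, h)
    for t2 in range(t1 + 1, h + 1)
]
best_schedule = min(candidates, key=lambda ot: schedule_cost(*ot))
```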
However, as the decision-maker will not have a firm value for the horizon length,
the optimum policy must be robust to variation in h. Furthermore, because the fleet
is mixed, both different replacement schedules and different planning horizon
lengths will give rise to different age compositions of the fleet at the end of the
horizon. Thus replacement policies may need to be compared not just on the basis
of cost but also on the basis of the age composition of the fleet at the end of the
planning horizon. This final age composition can be considered as quantifying the
end-of-horizon effect.
Non-uniform usage, particularly between sub-fleets, may be allowed for by
varying the fleet size at replacements. For example, if older plant are under-
utilized, a smaller number of new plant would be required to meet the demand
currently placed on an older sub-fleet. This effectively reduces the replacement cost for that sub-fleet by a factor which is the ratio of the utilization of the old to the
new sub-fleet. Of course, other more complex methods of accounting for differing
usage may be considered. Given sufficient data, operating costs could be quantified
in terms of usage and optimum policy may be obtained given forecasts for usage of
sub-fleets over the planning horizon.
The models may be extended to the case in which sub-fleets are retired as
spares. The number of sub-fleets would simply increase by one at each replacement, with the costs associated with the retired sub-fleet added. Predicting operating
costs for a retired sub-fleet would be difficult however, as it is likely that no data
would be available for this. Also it is assumed that equipment is bought new: in
principle it is a simple matter to extend Equation 12.7 to the case in which used
equipment may be purchased.
Note that the formulation as presented allows for the possibility for a sub-fleet
to be composed of a single unit of equipment. This may be appropriate if the fleet
is small. The complexity of the computational problem increases rapidly as the
number of sub-fleets increases. However we do not consider efficient algorithms
for determining optimum policy here.
Example 12.2
Scarf and Hashem (1997) consider the inter-city coach fleet operated by Express
National Berhad in Malaysia. The fleet comprised 160 vehicles of 5 vehicle-types of varying ages, with maintenance cost modelled as M(\tau) = a\tau^b and resale
values S(\tau) = 0.6 R (0.81)^\tau, for replacement cost R (Table 12.2). The data available
were not sufficient for obtaining the maintenance cost model for all vehicle-types; for example, for the MAN, only data relating to their first year of operation
were available. Furthermore, for older vehicles the costs appeared to be decreasing.
This could perhaps be put down to under-utilization (partial retirement) and also
neglect of vehicles reaching the end of their useful life. It was therefore necessary
to pool the data to obtain reasonable cost models. The fitted maintenance cost
models for the Cummins, Isuzu CJR and MAN were obtained by first fitting an
overall cost model to data on vehicles up to eight years old, and then scaling this
model to the costs of the individual vehicle-types in the manner described in
Christer (1988). The costs for the older sub-fleets, the Mitsubishi and Isuzu CSA,
were taken as constant. Penalty costs for breakdowns on the road were also
modelled; see Scarf and Hashem (1997) for a full discussion of this. It was known
that the Mitsubishi and Isuzu sub-fleets were in partial retirement and candidates
for immediate replacement, capital expenditure permitting. The usage of sub-fleets
was unknown, although with a daily requirement for 125 vehicles, it was reason-
able to suppose that the usage level for the Mitsubishi and Isuzu sub-fleets was
about half that of the other newer sub-fleets. This assumption led to the null
optimal policyreplace the Mitsubishi and Isuzu CSA sub-fleets as soon as
possiblewhich is uninteresting from a model validation point of view. Therefore
in order to illustrate the replacement model, we consider the following sub-
problem in detail: investigate replacement policy for the fleet comprising of the
Cummins, Isuzu CJR and MAN, assuming a fixed fleet size (93 vehicles) and
uniform usage.
Table 12.2. Fleet composition by vehicle type, showing purchase cost R (M$000s), maintenance cost model M(\tau) = a\tau^b parameters a and b, and the age distribution (number of vehicles in each two-year age group: <2, 2-3, 4-5, 6-7, 8-9, 10-11, >12 years) at the time of the replacement study.
For the three sub-fleets problem, optimal policy is presented for each of the six
replacement schedules in Table 12.3. Horizon lengths, h, of 120, 150 and 180
months (15 years) are considered. For the fleet as a whole it is difficult to deter-
mine optimal replacement policy as two sub-fleets are partially retired, and usage
levels are unknown. The problem is made more difficult because it is likely that
the maintenance of these sub-fleets is less thorough than that for the newer sub-
fleets. Under simple usage assumptions, the optimum policy is to replace the
Mitsubishi and Isuzu CSA sub-fleets immediately. For the particular sub-problem
relating to the Cummins, Isuzu CJR and MAN, it appears that the optimum
replacement schedule depends on the length of the horizon. The end-of-horizon effect, as represented by the mean age of the fleet, also varies with the replacement schedule. The choice of optimal policy is therefore not straightforward. Over a fifteen-year planning horizon, there is little to choose between the
three schedules Cummins-IsuzuCJR-MAN, Cummins-MAN-IsuzuCJR and MAN-
Cummins-IsuzuCJR, both in terms of cost and age. Sensitivity to model para-
meters is considered more fully in Scarf and Hashem (1997).
Table 12.3. Optimum policy for each schedule for various horizon lengths, h = 120, 150, 180 months; penalty cost, M$2000; annual discount factor, 0.97. Cost of equivalent rent (M$000s per month for the whole fleet), average age of fleet at end of horizon, and optimum cycle lengths. Replacement schedules: CIM Cummins-IsuzuCJR-MAN, etc.
in year t. For network expansion projects f_{t0} = 0. Let C (> 0) be the capital cost of project P. Assume income cashflows are negative and expenditure cashflows are positive, and that all cashflows are incurred at the year end and discounted at rate \nu. If project P is released in year x from now then the total cashflow over h years from now will be
\sum_{t=1}^{x-1} f_{t0} \nu^t + \nu^x ( \sum_{t=0}^{h-x} f_{t1} \nu^t + C ) .   (12.8)
If project P is not released then the cashflow over the horizon will be \sum_{t=1}^{h} f_{t0} \nu^t.
Define the gain from releasing project P in year x to be the difference between
these cashflows:
g_P(x; h) = \sum_{t=1}^{h} f_{t0} \nu^t - [ \sum_{t=1}^{x-1} f_{t0} \nu^t + \nu^x \{ \sum_{t=0}^{h-x} f_{t1} \nu^t + C \} ]
          = \sum_{t=x}^{h} ( f_{t0} - f_{t-x,1} ) \nu^t - \nu^x C .
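The gain from releasing a project in year x can be sketched directly from the expression above; the cashflow functions, discount factor and capital cost below are hypothetical.

```python
def gain(x, h, nu, f0, f1, C):
    # g_P(x; h) = sum_{t=x}^{h} (f0(t) - f1(t - x)) nu**t  -  nu**x * C
    return sum((f0(t) - f1(t - x)) * nu**t for t in range(x, h + 1)) - nu**x * C

# Hypothetical project: releasing it cuts net annual expenditure from 12 to 7.
nu, h, C = 0.93, 15, 30.0
f0 = lambda t: 12.0   # net cashflow in year t without the project
f1 = lambda t: 7.0    # net cashflow t years after release, with the project
best_x = max(range(1, h + 1), key=lambda x: gain(x, h, nu, f0, f1, C))
```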
Projects are then released subject to the constraints that the capital investment budget is not exceeded in each year. That is, maximize
\sum_{i=1}^{n} \sum_{j=1}^{k} x_{ij} g_{ij}(h)

subject to

\sum_{i=1}^{n} x_{ij} C_i \le B_j   for all j = 1, ..., k;   (12.9)

\sum_{j=1}^{k} x_{ij} \le 1   for all i = 1, ..., n;   (12.10)

x_{ij} = 0, 1.
The constraint set at Equation 12.9 ensures that the budget for year j is not exceeded. The constraint set at Equation 12.10 ensures that project i is released at most once over
the planning horizon. Note that if an individual project has negative gain whatever
its execution time, then the contribution to the objective function from this project
will be greatest when this project is not released over (0, k). Typically such plan-
ning may be informative over the planning horizon, but only decisions relating to
the immediate future (one to two years) would be acted on. Therefore policy would
be continually updated, implying a rolling horizon approach. Where a network
consists of many identical components, the modelling of project planning may be
extended to the case in which a proportion of similar projects are released in a
given year. This could be done by formulating the capital rationing model (CRM)
as a mixed programming problem.
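As a concrete illustration, the capital rationing model of Equations 12.9 and 12.10 can be solved by brute-force enumeration for a small instance. The gains, capital costs and budgets below are hypothetical numbers invented for this sketch, not data from the chapter:

```python
# Brute-force sketch of the capital rationing model (Eqs. 12.9-12.10):
# choose release years x_ij in {0,1} to maximize total gain subject to
# annual capital budgets. All data below are hypothetical illustrations.
from itertools import product

n, k = 3, 3                      # projects, years in the planning horizon
g = [[40, 35, 28],               # g[i][j]: gain if project i is released in year j+1
     [25, 30, 22],
     [50, 45, 38]]
cap = [60, 45, 70]               # C_i: capital cost of project i
B = [100, 80, 80]                # B_j: capital budget in year j+1

best, best_plan = 0, None
# each project is either not released (None) or released in exactly one year,
# which enforces constraint set (12.10) by construction
for plan in product([None, 0, 1, 2], repeat=n):
    spend = [0] * k
    for i, j in enumerate(plan):
        if j is not None:
            spend[j] += cap[i]
    if all(spend[j] <= B[j] for j in range(k)):     # budget constraints (12.9)
        total = sum(g[i][j] for i, j in enumerate(plan) if j is not None)
        if total > best:
            best, best_plan = total, plan

print(best, best_plan)           # → 108 (0, 1, 2)
```

For realistic numbers of projects this enumeration grows as (k+1)^n, which is why the text proposes an integer (or mixed) programming formulation instead.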
Consider now dependence between projects. For example, a major expansion
project, while not replacing existing assets, may have significant operating cost or
performance implications for particular assets: the building of a large ring-main in
a water supply network is one such example. Essentially, if two projects P1 and P2
interact in this way, then new projects P1′ = (P1, not P2), P2′ = (not P1, P2), and
P12′ = (P1, P2) would have to be introduced, along with a constraint to ensure that
at most one of P1′, P2′, and P12′ is released over the planning horizon. While this
approach may lead to a significant increase in the number of projects in the
model, in principle the solution procedure would remain unchanged. The existence
of future-cost dependencies between projects would have to be identified by the
network owner. This may be extremely difficult in practice. However such depend-
ency would very much characterize the network replacement problem, and there-
fore the approach described is an advance over current methods. A similar ap-
proach has been taken by Santhanam and Kyparisis (1996) in modelling depend-
ency in the project release of information systems. Capital costs may be considered
simply using the concept of shared set-up. It is possible that it may be optimal to
release both P1 and P2 during the planning horizon, but not simultaneously. This
presents a more difficult modelling task, at least without introducing many pseudo-
projects. For example, we could consider: release P1 at time s and P2 at time
t; however, for k = 10, say, this would mean the introduction of 25 variables,
x(P1P2)(s,t), for the (P1, P2) decision alone!
Replacement of Capital Equipment 305
$$\Delta_P(x^*, x; h) = \sum_{t=x^*}^{x-1} f_t^0 \nu^t + \sum_{t=0}^{h-x} f_t^1 \nu^{t+x} - \sum_{t=0}^{h-x^*} f_t^1 \nu^{t+x^*},$$
where x is the execution time for the project under capital rationing. The marginal
increase in revenue expenditure would be found by summing over all projects. In a
similar manner, the marginal increase in revenue expenditure due to projects delayed
in year j could be found by summing over all projects with x = j , and this measure
indicates how much more capital investment would be required to reduce revenue
expenditure to the optimum level.
Uncertainty in the cashflow/performance model parameter estimates, reflecting
the extent of currently available information about particular components and
potential projects, and the extent of technological developments (new materials and
techniques), may be propagated through into uncertainty in the gain function, g (.) .
This would be most easily done using the delta method; see Baker and Scarf (1995)
for an example of this in maintenance. The variance of the gain, as well as the
expected gain, may then be used to produce the project priority list and those
projects for which the expected gain is high and the uncertainty in the gain
(variance of the gain) is low are candidates for release; these projects would be
viewed as sound investments. Markovitz (1952) is the classic reference here; for a
more recent discussion see Booth and King (1998). Also, a real options approach
might be taken (e.g. Bowe and Lee 2004). Where there are no data regarding a
potential project, there will be no objective basis for determining if and where the
project lies on the project priority list. One possible approach to this problem
would be to use data relating to other projects that are similar in design. Also
subjective data may be collected, and used to update component data for the whole
network in the manner described in O'Hagan (1994) and Goldstein and O'Hagan
(1996) in the context of sewer networks. These methods are particularly useful for
multi-component systems in which there are only limited data for a limited number
of individual components. On the other hand, it may be that the income cashflow
may be deterministic in some situations. For example, expansion of the network
may be initiated by legislation, and the compensation for the investment costs is
fixed and predetermined per customer connection.
Bellman (1955) introduced the first dynamic programming model to analyze the
equipment replacement problem. In this model, the state of the system is defined as
the age of the asset and the decision to be evaluated at each stage is whether to
keep or replace the asset. Thus, a solution consists of keep and replace decisions in
each period of the horizon.
The dynamic program can be described by the network in Figure 12.3. Each
node in the network represents the age of the asset, which is the state of the system,
at the end of the period. The states are labelled according to the age of the asset
along the y-axis, increasing from 1 to N, the maximum allowable age of the asset
(N = 5 in the figure), at the end of the time period which is labelled on the x-axis
from 0 to T, the horizon time (T = 4 in the figure). The arcs connecting the nodes
represent keep and replace decisions. An arc representing a keep decision (K)
connects a state (age) of n to n+1 in consecutive stages (periods), as the asset ages
one period. A replace decision (R) connects a state of n to a state of 1, as the n-
period old asset is salvaged and a new asset is purchased and used for one period.
The initial decision is made at time zero, with n = 4 in the figure, and the asset is
salvaged at the end of the horizon.
Define ft(n) as the minimum net present value cost of making optimal keep and
replace decisions for an asset of age n at time period t through time period T.
Mathematically, we evaluate ft(n) with the following recursion:
$$f_t(n) = \min \left\{ \begin{array}{ll} K: & \alpha \left( C_{t+1}(n+1) + f_{t+1}(n+1) \right) \\ R: & P_t - S_t(n) + \alpha \left( C_{t+1}(1) + f_{t+1}(1) \right) \end{array} \right\}, \quad n < N, \; t \le T-1, \qquad (12.11)$$

where α = 1/(1 + i) is the one-period discount factor for interest rate i.
Figure 12.3. Dynamic programming network for an age-based model (states: asset ages 1 to N = 5; stages: periods 0 to T = 4; arcs K and R denote keep and replace decisions)
If the n-period old asset is kept (K), the operating and maintenance (O&M) cost
Ct+1(n+1) is incurred for the asset in the following period. As the asset is age n+1
at the end of the period, ft+1(n+1) defines the costs going forward. (This is why ft
is often referred to as the cost-to-go function in dynamic programming.) If the
asset is replaced (R), then a salvage value St(n) is received and a purchase price Pt
is paid for a new asset. The new asset is utilized for the period as the state
transitions to an age of 1, defined by costs ft+1(1) going forward. If the asset
reaches the maximum age of N, then only the replace decision is feasible.
When the horizon time T is reached, the asset is salvaged, such that fT(n) = -ST(n).
Example 12.3
Assume a four-period old asset is owned at time zero, its maximum age is 5, and an
asset is required to be in service in each of the next four periods (such that the
decisions are represented in Figure 12.3). The purchase price is $50,000 with first
year O&M costs $10,000, increasing 20% per period of use. The salvage value is
expected to decline 30% (from the purchase price) after the first year of use and an
additional 10% each year thereafter. For simplicity, we assume no technological
change and the interest rate is 12% per period.
Table 12.4. Dynamic programming state values ft(n) for the example problem
t\n      1          2          3          4          5
0        --         --         --         $47,996    --
1        $16,332    --         --         --         $35,602
2        -$407      $6,292     --         --         --
3        -$17,411   -$12,455   -$7,353    --         --
4        -$35,000   -$31,500   -$28,350   -$25,515   --
Table 12.4 shows the results of solving the dynamic programming algorithm.
The values in the final row (f4 (n)) are the negative salvage values received for a
given asset of age n at that time. To illustrate a calculation at t = 3, consider n = 1.
Substituting into Equation 12.11:

f3(1) = min{ K: (1/1.12)($12,000 - $31,500) = -$17,411; R: $50,000 - $35,000 + (1/1.12)($10,000 - $35,000) = -$7,321 } = -$17,411,

so it is optimal to keep the one-period old asset at t = 3.
The recursion continues in this fashion until f0 (4) is evaluated, with the decision
to replace the asset immediately with a new asset. This new asset is retained through
the horizon. The net present value cost of this sequence of decisions is $47,996.
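The backward recursion is straightforward to mechanise. The sketch below recomputes Example 12.3; the cost functions encode the stated purchase price, O&M growth and salvage decline, while the exact placement of the one-period discount factor is our reading of the example:

```python
# Backward DP for the age-based replacement model (Eq. 12.11),
# using the Example 12.3 data. A sketch: the discounting convention
# is inferred from the tabulated values, not stated explicitly.
P = 50_000                      # purchase price
i = 0.12                        # interest rate per period
alpha = 1 / (1 + i)             # one-period discount factor
N, T = 5, 4                     # maximum allowable age, horizon

def C(n):                       # O&M cost in the n-th year of use
    return 10_000 * 1.2 ** (n - 1)

def S(n):                       # salvage value after n years of use
    s = 0.7 * P                 # 30% decline after the first year
    for _ in range(n - 1):
        s *= 0.9                # 10% per year thereafter
    return s

f = {}
for n in range(1, N + 1):       # terminal condition: salvage the asset
    f[T, n] = -S(n)
for t in range(T - 1, -1, -1):
    for n in range(1, N + 1):
        replace = P - S(n) + alpha * (C(1) + f[t + 1, 1])
        if n < N:
            keep = alpha * (C(n + 1) + f[t + 1, n + 1])
            f[t, n] = min(keep, replace)
        else:                   # at the maximum age only replacement is feasible
            f[t, n] = replace

print(round(f[0, 4]))           # → 47996
```

The state values f[t, n] reproduce Table 12.4, and f[0, 4] returns the $47,996 net present value cost of the optimal decision sequence.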
The benefit of using this model, in addition to allowing for replacements after
each period, is that periodic costs are explicitly modeled on each arc in the network.
This allows for detailed cost modelling of technological change, as in Regnier et al.
(2004), or of costs associated with after-tax analysis, as in Hartman and Hartman
(2001).
A similar line of models has also been developed in which the condition of
the asset, not its age, is tracked (e.g. Derman 1963). As opposed to moving from
state to state by increasing the age of the asset, there is some probability that the
asset will degrade to a lower condition during a period. The work assuming
stochastic deterioration has been extended to include technological change (Hopp and
Nair 1994) or to consider probabilistic utilization (Hartman 2001).
Figure 12.4. Dynamic programming network for the service life model (nodes at times 0 to T = 4; arcs represent the possible service lives of new assets)
The objective is to find the sequence of service lives that minimizes costs from
time 0 through time T. (As previously, T = 4 in the figure.) Assuming costs along
an arc connecting node t to node t+n are defined as net present value costs at time
t, the optimal sequence of decisions can be determined by solving the following
recursion:

$$f(t) = \min_{n = 1, \ldots, \min(N, T-t)} \left\{ c_{tn} + \alpha^n f(t+n) \right\}, \qquad (12.13)$$

where ctn represents the cost of retaining the asset for n periods from period t.
Using our previous notation, ctn is defined as

$$c_{tn} = P_t + \sum_{j=1}^{n} \alpha^j C_{t+j}(j) - \alpha^n S_{t+n}(n). \qquad (12.14)$$
This model can be solved similarly to the age-based model, assuming that
f(T) = 0 is substituted into Equation 12.13. Note that the network in Figure 12.4
assumes that a new asset is purchased at time 0. To include the option to keep or
replace an asset owned at time zero, another set of arcs must be drawn, emanating
from node 0, representing the length of time to retain the owned asset with its
associated costs. As these arcs parallel those illustrated in Figure 12.4, the higher
cost parallel arcs can be deleted, as they will not reside on the optimal path. This
can be completed in a pre-processing step, with the recursion ensuing as defined.
Example 12.4
Utilizing the same data from Example 12.3, the network in Figure 12.4 represents
the options associated with purchasing a new asset in each period. We would add
an arc from node 0 to node 1 to represent the decision to retain the four-period old
asset for one additional period (to its maximum feasible age of 5).
Table 12.5 provides the net present value costs (at time t) on the arcs from node
t to node t+n. The arc from node 0 to 1 represents the cost of retaining the four-
period old asset for one period, as this is cheaper than salvaging the used asset and
purchasing a new asset for one period of use. The values of c02, c03, and c04 include
the revenue received for salvaging the four-period old asset at time zero. With the
values in Table 12.5, the dynamic programming recursion in Equation 12.13 can be
solved.
Table 12.5. Arc costs for Figure 12.4 using the example data
t \ t+n   1         2         3         4
0         -$1,989   $17,868   $33,051   $47,996
1                   $27,679   $43,383   $58,566
2                             $27,679   $43,383
3                                       $27,679
For example,

f(2) = min{ $43,383 + $0; $27,679 + 0.893($27,679) } = $43,383,
indicating that it is cheaper to keep the asset for two periods (from the end of period
2 to the end of the horizon) than to replace it after one period of use. Continuing
in this manner, it is found that f(0) = $47,996, signaling that the four-period
old asset should be sold and the new asset should be retained through the horizon.
This is the same solution found with Bellman's model.
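The service-life recursion is equally easy to implement. The sketch below reruns Example 12.4; costs are stationary, so c_tn depends only on n here, and the owned four-period old asset is handled by the pre-processing step described in the text:

```python
# Service-life recursion (Eqs. 12.13-12.14) on the Example 12.4 data.
# A sketch: costs are stationary, so c_{tn} reduces to c(n).
alpha = 1 / 1.12                # one-period discount factor at 12%
P, N, T = 50_000, 5, 4          # purchase price, maximum age, horizon

def C(j):                       # O&M cost in the j-th year of use
    return 10_000 * 1.2 ** (j - 1)

def S(n):                       # salvage value after n years of use
    s = 0.7 * P
    for _ in range(n - 1):
        s *= 0.9
    return s

def c(n):                       # NPV of buying new and keeping n periods (Eq. 12.14)
    return P + sum(alpha**j * C(j) for j in range(1, n + 1)) - alpha**n * S(n)

f = {T: 0.0}                    # boundary condition f(T) = 0
for t in range(T - 1, -1, -1):  # Eq. 12.13: f(t) = min_n { c(n) + alpha^n f(t+n) }
    f[t] = min(c(n) + alpha**n * f[t + n] for n in range(1, min(N, T - t) + 1))

# time-zero options for the owned four-period old asset (pre-processing step):
keep_old = alpha * (C(5) - S(5)) + alpha * f[1]   # retain one more period
sell_now = -S(4) + f[0]                           # salvage immediately
print(round(min(keep_old, sell_now)))             # → 47996
```

The intermediate value f(2) reproduces the $43,383 computed above, and the time-zero minimum reproduces the $47,996 found with both formulations.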
While this model can be shown to be more computationally efficient than the
age-based model, it is the ease with which multiple challengers (as parallel arcs) or
technological change is modelled that has led to numerous extensions in the
literature. See Oakford et al. (1984), Bean et al. (1985, 1994), and Hartman and
Rogers (2006).
Figure 12.5. Dynamic programming network for a cumulative usage-based model (node labels give cumulative periods of service, 0 to T = 4)
A node in the network represents the cumulative service that has been accrued
through a given stage. For example, after the first stage in Figure 12.5, either 0 or 4
periods of service have been reached. As the horizon is 4, a solution must ultimately
result in 4 periods of service. As with the other dynamic programming models, the
goal is to find the minimum cost path from the initial node, representing no service at
time zero, to the final node, representing an entire horizon's worth of service after the
final stage.
To determine an optimal solution it is assumed that the costs are stationary and
the stages (lengths of service) are ordered according to increasing annualized costs.
Thus, before the recursion can be solved, the annualized costs of keeping an asset
for each possible service life must be computed such that the stages can be ordered
accordingly.
Example 12.5
We revisit the previous examples. From the given costs, the annual equivalent
costs are computed as given in Table 12.6. For example, retaining the asset for
two years costs an equivalent of $25,670 per year, assuming a 12 percent interest rate.
The net present value (NPV) costs are also given. We restrict the set of decisions to
those of a new asset, namely how many to purchase and how long to retain them
over the finite horizon.
Table 12.6. Annual equivalent costs of keeping the asset for up to five years

Years of use (n)         1         2         3         4         5
NPV cost                 $27,679   $43,383   $58,566   $73,511   $88,462
Annual equivalent cost   $31,000   $25,670   $24,384   $24,202   $24,540
Given the information in Table 12.6, the stages are ordered according to ages 4,
3, 5, 2, and 1, as the annual equivalent costs increase accordingly. As an asset is
only required for four periods, the age 5 cost can be ignored.
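This ordering can be checked directly: the capital recovery factor i/(1 - (1+i)^(-n)) converts the NPV of keeping a new asset for n years into an equivalent cost per year. A sketch using the Example 12.3 cost data:

```python
# Sketch reproducing the stage ordering behind Table 12.6 from the
# Example 12.3 data, via annual equivalent (AE) costs.
i = 0.12
alpha = 1 / (1 + i)
P = 50_000

def C(j):                       # O&M cost in the j-th year of use
    return 10_000 * 1.2 ** (j - 1)

def S(n):                       # salvage value after n years of use
    s = 0.7 * P
    for _ in range(n - 1):
        s *= 0.9
    return s

def npv(n):                     # NPV of buying new and keeping n years
    return P + sum(alpha**j * C(j) for j in range(1, n + 1)) - alpha**n * S(n)

def ae(n):                      # annual equivalent cost (capital recovery factor)
    return npv(n) * i / (1 - alpha**n)

order = sorted(range(1, 6), key=ae)
print(order, round(ae(2)))      # → [4, 3, 5, 2, 1] 25670
```

The sorted ages reproduce the ordering 4, 3, 5, 2, 1 stated in the text, with age 4 the economic life.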
According to Figure 12.5, an asset can be retained a maximum of one time for
four years, at a cost of $73,511. Thus, the states in the first stage and their values
are
f1 (0) = 0,
f1 (4) = $73,511.
Similar reasoning defines f2(0)=0, f2(3)=$58,566, and f2(4)=$73,511. For the third
stage, the decisions are more interesting because an asset can be retained for two
years twice in the sequence. Thus
f3 (0) = 0,
f3 (2) = $43,383,
f3 (3) = $58,566,
f3(4) = min{ $73,511, $43,383 + (0.893)²($43,383) } = $73,511.
The final stage evaluates using assets for a single period with previous com-
binations (three-period and two-period aged assets). It can be shown that the opti-
mal decision is to retain the asset for all four periods at a net present value cost of
$73,511. Note that this is the same decision found with the two previous formula-
tions, as $73,511 less the salvage value of the four-period old asset ($25,515) is
$47,996.
This recursion was not developed in order to provide another computational
approach to the equipment replacement problem. Rather, it was developed to illus-
trate the relationship between the infinite and finite horizon solutions under station-
ary costs. Specifically, as the optimal solution to the infinite horizon problem is to
repeatedly replace an asset at its economic life (age which minimizes equivalent
annualized costs), the question being investigated was whether the solution (re-
placing at the economic life) translates to the finite horizon case.
It was shown that using the infinite horizon solution provides a good answer
when O&M costs increase over the life of an asset more drastically than salvage
values decline. In the case when the salvage value declines are more drastic than
the O&M cost increases, it is generally better to retain the final asset in the se-
quence for a period longer than the economic life of the asset. For the cases when
O&M cost increases and salvage value declines are similar, then it is beneficial to
solve a dynamic programming recursion to find the optimal policy.
ing an infinite horizon. Unfortunately, this does not guarantee the existence of an
optimal time zero decision.
For the age or period based dynamic programming recursions, the models must
be solved over horizons T, T+1, T+2, ..., T+N. If the time zero decision does not
change for these problems, then the optimal time-zero decision is found. If this is
not the case, the progression must continue until N consecutive time zero decisions
are identified. This may be more easily facilitated using a forward recursion. In the
period-based model, this requires defining f(t) as a function of f(t-1), f(t-2), etc.,
with f(0) = 0. We illustrate by revisiting Example 12.4.
Example 12.6
We illustrate the first few stages of the forward recursion, as its implementation is
better suited for infinite horizon analysis. As noted earlier, the recursion is
initialized with f(0) = 0. Stepping forward in time, it is assumed that T = 1. Using
the values from Table 12.5, it is clear that the only feasible decision is to retain the
four-period old asset for one period such that f(1) = -$1,989. For the second stage,
there are two feasible decisions to evaluate, such that

f(2) = min{ f(1) + 0.893($27,679); $17,868 } = min{ $22,728; $17,868 } = $17,868.
The first decision evaluates using the new asset for one period, assuming (from
stage 1) that the four-period old asset is retained for one period. The second
decision assumes the four-period old asset is retired immediately and a new asset is
used for two periods. This process moves forward in time, increasing the value of T
in each step. The process stops when, in this case, five consecutive solutions (with
increasing T) result in the same time zero decision.
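The first few stages of the forward recursion can be sketched as follows, reusing the Example 12.3 cost data; the decision coding (a None-free dictionary built stage by stage) is our own:

```python
# Forward-recursion sketch for Example 12.6: f(t) is built up from f(0) = 0,
# with the owned four-period old asset handled in the first stages.
alpha = 1 / 1.12
P, N = 50_000, 5

def C(j):                        # O&M cost in the j-th year of use
    return 10_000 * 1.2 ** (j - 1)

def S(n):                        # salvage value after n years of use
    s = 0.7 * P
    for _ in range(n - 1):
        s *= 0.9
    return s

def c(n):                        # stationary NPV of a new asset kept n periods
    return P + sum(alpha**j * C(j) for j in range(1, n + 1)) - alpha**n * S(n)

f = {0: 0.0}
f[1] = alpha * (C(5) - S(5))     # T = 1: only option is to retain the old asset
for t in range(2, 5):            # grow the horizon one period at a time
    options = [-S(4) + c(t)]     # sell the old asset now, one new asset for t periods
    options += [f[t - n] + alpha ** (t - n) * c(n) for n in range(1, t)]
    f[t] = min(options)

print(round(f[1]), round(f[2]))  # → -1989 17868
```

In an infinite horizon implementation the loop would simply continue, stopping once N consecutive horizon lengths yield the same time zero decision.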
The presented dynamic programming algorithms are designed for single asset
systems. More complex systems are defined by multiple assets which are not
independent; otherwise the presented models would be sufficient. The most
straightforward case is where all assets operate in parallel, such as in a fleet.
Jones et al. (1991) offered the first dynamic programming recursion for the
parallel machine replacement model, which can be used to analyze fleet replace-
ment decisions. Machines are assumed to operate in parallel and thus the capacity
of the system is equal to the sum of the individual asset capacities. In addition to
defining the capacity of the system, the assets are often linked economically. Jones
et al. focused on the assumption that a fixed cost would be charged in any period in
which a replacement occurs (in addition to the typical per unit charges for each
asset replaced). This provides an incentive to replace multiple assets together over
time so as to reduce the number of times the fixed charge is incurred over some
horizon.
To model replacement decisions for this system, the state of the system is
defined as the number of assets aged 1 through N, represented as a vector, [m1, m2,
Figure 12.6. Dynamic programming network for the parallel replacement problem (states are vectors of asset counts by age, e.g. [3,2,1] at time 0, evolving through states such as [3,3,0], [0,3,3] and [6,0,0] over stages 0, 1, 2, T = 3)
Define n as the decision of the minimum age at which assets are to be replaced for a
given state at time t. That is, all assets of age n and older are replaced while the
remaining assets are retained. We can write the recursion in general as follows:

$$f_t(m_1, m_2, \ldots, m_N) = \min_n \Bigg\{ K_t \mathbf{1}_{\{n>1\}} + \sum_{j=n}^{N} m_j P_t - \sum_{j=n}^{N} m_j S_t(j) + \alpha \bigg[ \sum_{j=n}^{N} m_j C_{t+1}(1) + \sum_{j=1}^{n-1} m_j C_{t+1}(j+1) + f_{t+1}\Big( \sum_{j=n}^{N} m_j, \, m_1, m_2, \ldots, m_{n-1}, 0, 0, \ldots, 0 \Big) \bigg] \Bigg\} \qquad (12.15)$$
Examining the recursion, a purchase price is paid and a salvage value is received
for all assets that are replaced. All of the newly purchased assets (the total number
of which is mn + mn+1 + ... + mN) incur the O&M cost of a new asset while
the O&M costs of the retained assets are incurred according to their age. A fixed
charge Kt is paid if at least one group of assets is replaced (n>1), captured by the
indicator function. The resulting state is a group of new assets (age 1) with all other
assets incrementing one period in age.
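The one-stage cost and state transition just described can be sketched as follows. The asset cost shapes reuse the Example 12.3 data, the fixed charge K is a hypothetical figure, and (following the prose rather than the indicator notation) the charge is applied whenever at least one asset is replaced:

```python
# One-stage cost and state transition of the parallel replacement
# recursion (Eq. 12.15 sketch). K is hypothetical; cost shapes reuse
# the Example 12.3 data.
alpha = 1 / 1.12
K, P = 5_000, 50_000

def S(j):                        # salvage value of an age-j asset
    return 35_000 * 0.9 ** (j - 1)

def C(j):                        # O&M cost in the j-th year of use
    return 10_000 * 1.2 ** (j - 1)

def stage_cost_and_next(m, n):
    """m = (m1, ..., mN): asset counts by age; replace all assets of age >= n."""
    N = len(m)
    replaced = sum(m[j - 1] for j in range(n, N + 1))
    cost = K if replaced > 0 else 0.0                        # fixed charge
    cost += sum(m[j - 1] * (P - S(j)) for j in range(n, N + 1))
    cost += alpha * replaced * C(1)                          # new assets' O&M
    cost += alpha * sum(m[j - 1] * C(j + 1) for j in range(1, n))  # kept assets
    nxt = (replaced,) + tuple(m[: n - 1]) + (0,) * (N - n)   # ages shift by one
    return cost, nxt

cost, nxt = stage_cost_and_next((3, 2, 1), n=2)  # replace all assets aged 2+
print(nxt)                                       # → (3, 3, 0)
```

Starting from the state [3,2,1] and replacing all assets aged two and older yields the successor state [3,3,0], one of the transitions shown in the network figure above.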
A number of extensions to this model have been published in the literature,
although many utilize integer programming modeling techniques to deal with the
large state space. Chand et al. (2000) focus on the use of dynamic programming and
include capacity expansion decisions with the replacement decisions. Unfortunately,
capital budgeting constraints greatly complicate the problem as it cannot be assumed
that groups of assets must be kept or replaced together.
While the theorems presented in Jones et al. (1991) greatly reduce the compu-
tational difficulties of solving the dynamic program for the parallel replacement
problem, it should be clear that using dynamic programming to address replacement
decisions for more complex systems may be difficult due to computational com-
plexities that arise due to the number of combinations of replacement alternatives.
(See Hartman and Ban (2002) and the references therein for a discussion of these
issues.)
Consider a more complex system in which a number of machines are used in
series (such as a production line) and there are a number of lines in parallel, such
as the one given in Figure 12.7. The lines are labeled 1, 2 and 3 while the machines
are labeled a, b, c, and d.
Figure 12.7. System with assets (machines a, b, c, d) in series and lines in parallel
The capacity of a line is now defined by the machine in the line with the minimum
capacity. The capacity of the system, raised by the parallel design, is the sum of the
capacities of the lines; that is, the sum over lines of the capacity of the
minimum-capacity asset in each line.
Reliability is measured similarly to capacity, in that it is reduced by the series
structure but increased by the parallel (redundant) structure. For a given line, the
reliability of the line is equal to the product of the reliabilities of its individual
assets, because if one asset is down, the line is down. The reliability of the
system, assuming only one line must be up and running, is increased by the
redundancy, as the system operates even if two of the three lines are down.
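These capacity and reliability definitions are easy to express directly. The per-machine figures below are hypothetical illustrations for the three-line, four-machine layout of Figure 12.7:

```python
# Capacity and reliability of a series-parallel system as in Figure 12.7.
# The per-machine capacities and reliabilities are hypothetical figures.
from math import prod

# three lines, four machines (a, b, c, d) per line
capacity = [[10, 8, 12, 9],      # machine capacities on each line (assumed units)
            [11, 9, 10, 10],
            [12, 12, 11, 8]]
reliability = [[0.95, 0.90, 0.98, 0.92],
               [0.96, 0.93, 0.95, 0.94],
               [0.97, 0.91, 0.96, 0.93]]

# line capacity = bottleneck machine; system capacity = sum over lines
system_capacity = sum(min(line) for line in capacity)

# line reliability = product over machines (series); the system is up
# if at least one line is up (parallel redundancy)
line_rel = [prod(line) for line in reliability]
system_reliability = 1 - prod(1 - r for r in line_rel)

print(system_capacity)           # → 25
```

Note how the redundancy shows up numerically: the system reliability exceeds that of every individual line, while the system capacity is limited by each line's weakest machine.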
If one defines minimum system capacity or reliability constraints, these can be
incorporated into a dynamic programming recursion that evaluates the possibility of
replacing any combination of assets in each period over some horizon. Presumably,
newer assets would have higher capacity or reliability, either due to technological
change or due to the fact that they are new (and have not deteriorated), and thus
would increase the respective capacity or reliability of the system (in order to meet
the defined constraints).
The difficulty with using a dynamic programming recursion to evaluate these
decisions is not in capturing the capacity or reliability constraints. Rather, the
difficulty is in the exponential growth in the number of possible combinations of
replacements in each period. Consider the 12 assets shown in Figure 12.7. In the
most general problem, each asset and each combination of assets can be replaced in
each period, totaling 2^12 = 4096 combinations each period for each state of the system.
This system could easily become more complicated, merely by defining a, b, c, and
d as processes, each of which may have a number of assets in parallel (or in series).
In the parallel machine replacement problem, a similar problem was encoun-
tered, but the number of possible decisions was reduced to N (the maximum allow-
able age for an asset) for each possible state in each period with the two theorems
introduced by Jones et al., without sacrificing optimality. Unfortunately, the inter-
action of the assets may prohibit the application of these theorems to other systems.
In fact, even how to define the state of the system is not entirely clear.
For the system described in Figure 12.7, we could define the state of the system as a
matrix of asset ages. Each row would be defined by the age of each machine in a given
line, with a row defined for each line. If an asset is replaced, then its age would
reset to 1 in the next stage, while it would merely increment by one period if the
machine is retained. This modeling approach could be expanded to the case of
multiple machines in a given process by expanding the size of the matrix.
Again, the difficulty would be in restricting the number of decisions to evaluate
for each state in a given period. Following the approach of Jones et al. (1991),
older assets would be replaced first (and consideration could be further restricted to
assets above a certain age), and similarly aged assets of the same type would be
replaced in the same time period. Another approach would be to only consider
replacing assets that increase the system capacity or reliability. Thus, replacements
could be examined in the order of either increasing capacity or increasing reli-
ability. Whether these heuristic approaches provide a good solution for a given
problem instance would require extensive numerical testing.
Replacement of Capital Equipment 317
12.7 References
Apeland, S. and Scarf, P.A. (2003) A fully subjective approach to capital equipment
replacement. Journal of the Operational Research Society 54, 371-378.
Arnold, G. (2006) Essentials of Corporate Financial Management. Pearson, London.
Baker, R.D. and Scarf, P.A. (1995) Can models fitted to small data samples lead to
maintenance policies with near-optimal cost? IMA Journal of Mathematics Applied in
Business and Industry 6, 3-12.
Bean, J.C., Lohmann, J.R. and Smith, R.L. (1985) A dynamic infinite horizon replacement
economy decision model. The Engineering Economist 30, 99-120.
Bean, J.C., Lohmann, J.R. and Smith, R.L. (1994) Equipment replacement under
technological change. Naval Research Logistics 41, 117-128.
Bellman, R.E. (1955) Equipment replacement policy. Journal of the Society for the
Industrial Applications of Mathematics 3, 133-136.
Booth, P. and King, P. (1998) The relationship between finance and actuarial science. In
Hand, D.J. and Jacka, S.D. (Eds), Statistics in Finance, Arnold, London, pp. 7-40.
Bowe, M. and Lee, D.L. (2004) Project evaluation in the presence of multiple embedded
real options: evidence from the Taiwan High-Speed Rail Project. Journal of Asian
Economics 15, 71-98.
Brint, A.T., Hodgkins, W.R., Rigler, D.M. and Smith, S.A. (1998) Evaluating strategies for
reliable distribution. IEEE Computer Applications in Power 11, 43-47.
Chand, S., McClurg, T. and Ward, J. (2000) A model for parallel machine replacement with
capacity expansion. European Journal of Operational Research 121, 519-531.
Christer, A.H. (1984) Operational research applied to industrial maintenance and
replacement. In Eglese, R.W. and Rand, G.K. (Eds) Developments in Operational
Research (pp. 31-58). Pergamon Press, Oxford.
Christer, A.H. (1988) Determining economic replacement ages of equipment incorporating
technological developments. In Rand, G.K. (Ed) Operational Research '87 (pp. 343-354).
Elsevier, Amsterdam.
Christer, A.H. and Scarf, P.A. (1994) A robust replacement model with applications to
medical equipment. Journal of the Operational Research Society 45, 261-275.
Derman, C. (1963) Inspection-maintenance-replacement schedules under Markovian
deterioration. In Mathematical Optimization Techniques, University of California Press,
Berkeley, CA, pp. 201-210.
Dixit, A.K. and Pindyck, R.S. (1994) Investment Under Uncertainty. Princeton University
Press, New Jersey.
Eilon, S., King, J.R. and Hutchinson, D.E. (1966) A study in equipment replacement.
Operational Research Quarterly 17, 59-71.
Elton, D.J. and Gruber, M.J. (1976) On the optimality of an equal life policy for equipment
subject to technological change. Operational Research Quarterly 22, 93-99.
Goldstein, M. and O'Hagan, A. (1996) Bayes linear sufficiency and systems of expert
posterior assessments. Journal of the Royal Statistical Society Series B 58, 301-316.
Hartman, J.C. (1999) A general procedure for incorporating asset utilization decisions
into replacement analysis. The Engineering Economist 44, 217-238.
Hartman, J.C. (2001) An economic replacement model with probabilistic asset utilization.
IIE Transactions 33, 717-729.
Hartman, J.C. (2004) Multiple asset replacement analysis under variable utilization and
stochastic demand. European Journal of Operational Research 159, 145-165.
Hartman, J.C. and Ban, J. (2002) The series-parallel replacement problem. Robotics and
Computer Integrated Manufacturing 18, 215-221.
Hartman, J.C. and Hartman, R.V. (2001) After-tax replacement analysis. The Engineering
Economist 46, 181-204.
Hartman, J.C. and Murphy, A. (2006) Finite horizon equipment replacement analysis. IIE
Transactions 38, 409-419.
Hartman, J.C. and Rogers, J.L. (2006) Dynamic programming approaches for equipment
replacement problems with continuous and discontinuous technological change. IMA
Journal of Management Mathematics 17, 143-158.
Hopp, W.J. and Nair, S.K. (1991) Timing replacement decisions under discontinuous
technological change. Naval Research Logistics 38, 203-220.
Hopp, W.J. and Nair, S.K. (1994) Markovian deterioration and technological change. IIE
Transactions 26, 74-82.
Jones, P.C., Zydiak, J.L. and Hopp, W.J. (1991) Parallel machine replacement. Naval
Research Logistics 38, 351-365.
Karabakal, N., Lohmann, J.R. and Bean, J.C. (1994) Parallel replacement under capital
rationing constraints. Management Science 40, 305-319.
Kobbacy, K. and Nicol, D. (1994) Sensitivity of rent replacement models. International
Journal of Production Economics 36, 267-279.
Markovitz, H.M. (1952) Portfolio selection. Journal of Finance 7, 77-91.
Northcott, D. (1985) Capital Investment Decision Making. Dryden Press, London.
Oakford, R.V., Lohmann, J.R. and Salazar, A. (1984) A dynamic replacement economy
decision model. IIE Transactions 16, 65-72.
13.1 Introduction
Maintenance is the set of activities carried out to keep a system in a condition
in which it can perform its function. Quite often these systems are production systems
whose outputs are products and/or services. Some maintenance can be done
during production and some can be done during regular production stops in
evenings, at weekends and on holidays. However, in many cases production units
need to be shut down for maintenance. This may lead to tension between the
production and maintenance departments of a company. On the one hand the production
department needs maintenance for the long-term well-being of its equipment;
on the other hand maintenance requires shutting down operations, with a consequent
loss of production. It will be clear that both departments can benefit from decision
support based on mathematical models.
In this chapter we give an overview of mathematical models that consider the
relation between maintenance and production. The relation exists in several ways.
First of all, when planning maintenance one needs to take production into account.
Second, maintenance can also be seen as a production process which needs to be
planned and finally one can develop integrated models for maintenance and pro-
duction. Apart from giving a general overview of models we will also discuss some
sectors in which the interactions between maintenance and production have been
studied.
Many review articles have been written on maintenance, e.g. Cho and Parlar
(1991), but to our knowledge only one on the combination between maintenance
and production, Ben-Daya and Rahim (2001). This review differs from that one in
several respects. First of all, we also consider models which take production
restrictions into account, rather than only integrated models. Second, we discuss some
specific sectors. Finally, we discuss the more recent articles published since that review.
Maintenance is related to production in several ways. First of all, maintenance is
intended to allow production, yet to execute maintenance production often has to be
stopped. This negative effect therefore has to be considered in maintenance planning
and optimization. It comes forward specifically in the costing of downtime and
in opportunity maintenance. All articles that explicitly take the effect of production
on maintenance into account fall into this category.
Second, maintenance can also be seen as a production process which needs to be
planned. Planning in this respect implies determining appropriate levels of capacity
(e.g. manpower) in relation to the demand.
Third, we are concerned with production planning in which one needs to take
maintenance jobs into account. The point is that the maintenance jobs take pro-
duction capacity away and hence they need to be planned together with production.
Maintenance has to be done either because of a failure or because the quality of the
produced items is not high enough. In this third category we also consider the
integrated planning of production and maintenance.
The relation between maintenance and production is also determined by the
business sector. We consider the following sectors: railways, roads, airlines and
electrical power systems.
The outline of the rest of this chapter is now as follows. In Section 13.2 we
present an overview of the main elements of maintenance planning as these are
essential to understand the rest of this chapter. Following our classification scheme,
in Section 13.3 we review articles in which maintenance is modelled explicitly and
where the needs of production are taken into account. Since these needs differ
between business sectors, we discuss in Section 13.4 the relation between pro-
duction and maintenance for some specific business sectors. In Section 13.5 we
consider the second category in our classification scheme: maintenance as a pro-
duction process which needs to be planned. In Section 13.6 we are concerned with
production planning in which one needs to take maintenance jobs into account
(integrated production and maintenance planning). Trends and open research areas
are discussed in Section 13.7 and, finally, conclusions are drawn in Section 13.8.
the subsystems, what information is available and what elements can be easily
replaced. These are typical maintainability aspects, but they have little to do with
production.
In the tactical phase, usually between a month and a year ahead, one plans for the major
maintenance/upgrade of major units and this has to be done in cooperation with the
production department. Accordingly, specific decision support is needed in this
respect. Another tactical problem concerns the capacity of the maintenance crew.
Is there enough manpower to carry out the preventive maintenance program?
These questions can be addressed by use of models as will be indicated later on.
In the short term scheduling phase one determines the moment and order of
execution, given an amount of outstanding corrective or preventive work. This is
typically the domain of work scheduling where extensive model-based support can
be given.
We will next consider another important aspect in maintenance, which is the
type of maintenance. A typical distinction is made between corrective and preven-
tive maintenance work. The first is carried out after a failure, which is defined as
the event by which a system stops functioning in a prescribed way. Preventive
work, however, is carried out to prevent failures. Although this distinction is often
made, we remark that the difference is not as clear as it may seem. This is
due to the definition of failure: an item may be in a bad state while still func-
tioning, and one may or may not consider this a failure. Nevertheless, an important
distinction between the two is that corrective maintenance is usually not plannable,
but preventive maintenance typically is.
The execution of maintenance can also be triggered by condition measurements
and then we speak of condition-based maintenance. This has often been advocated
as more effective and efficient than time-based preventive maintenance. Yet it is
very hard to predict failures well in advance, and hence condition-based mainten-
ance is often unplannable. Instead of time-based maintenance one can also base
preventive maintenance on utilisation (run hours, mileage), which may be a more
appropriate indicator of wear-out.
Finally, one may also have inspections, which can be done by sight or with instruments
and often do not affect operation. They do not, however, improve the state of a system,
but only the information about it. This can be important when machines start
producing items of bad quality. There are inspection-quality problems in which in-
spection optimization is connected to quality control.
Another distinction concerns the amount of work. Often there are small jobs,
grouped into maintenance packages. These may start with inspection and cleaning,
followed by improvement actions such as lubricating and/or replacing some parts. These
are typically part of the preventive maintenance program attached to a system and
have to be done on a repetitive basis (monthly, quarterly, yearly or two-yearly).
Next, one has replacements of parts or subsystems and overhauls or refurbishments
where a substantial system is improved. The latter are planned well in advance and
carried out as projects with individual (or separate) budgets.
A traditional optimization problem has been the choice and trade-off between
preventive and corrective maintenance. The typical motivation is that preventive
maintenance is cheaper than corrective maintenance. Maintenance costs are usually due to man-
hours, materials and indirect costs. The difference between corrective and preventive
maintenance costs lies especially in the latter category: indirect costs represent loss of
production and environmental damage or safety consequences. Costing these
consequences can be a difficult problem and is tackled in Section 13.3.1. It will
also be clear that preventive maintenance should be done when production is least
affected. This can be done using opportunities, which has given rise to a specific
class of models dealt with in a separate section (Section 13.3.2).
In Dekker and Van Rijn (1996) a decision-support system (PROMPT) for op-
portunity-based preventive maintenance is discussed. PROMPT was developed to
take care of the random occurrence of opportunities of restricted duration. Here,
opportunities are not only failures of other components, but also preventive main-
tenance on (essential) components. Many of the techniques developed in the
articles of Dekker and Smeitink (1991), Dekker and Dijkstra (1992) and Dekker
and Smeitink (1994) are implemented in the decision-support system. In PROMPT
preventive maintenance is split up into packages. For each package an optimum
policy is determined, which indicates when it should be carried out at an opportu-
nity. From the separate policies a priority measure is derived that indicates which
maintenance package should be executed at a given opportunity.
In Dekker et al. (1998b) the maintenance of light-standards is studied. A light-
standard consists of n independent and identical lamps screwed on a lamp assembly.
To guarantee a minimum luminance, the lamps are replaced if the number of failed
lamps reaches a prespecified number m. In order to replace the lamps the assembly
has to be lowered. As a consequence, each failure is an opportunity to combine
corrective and preventive maintenance. Several opportunistic age-based variants of
the m-failure group replacement policy (in its original form only corrective main-
tenance is grouped) are considered. Simulation optimization is used to determine the
optimal opportunistic age threshold.
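As a rough illustration of how such policies can be evaluated by simulation, the following sketch estimates the long-run cost rate of the basic m-failure group replacement policy. The exponential lamp lifetimes and all numerical values are our own illustrative assumptions, and the opportunistic age threshold of the article is left out for brevity:

```python
import random

def group_replacement_cost_rate(n=10, m=3, mean_life=100.0,
                                c_group=500.0, cycles=20000, seed=1):
    """Estimate the long-run cost rate of an m-failure group replacement
    policy: all n lamps are renewed as soon as m of them have failed.
    Exponential lamp lifetimes are an illustrative assumption."""
    rng = random.Random(seed)
    total_time, total_cost = 0.0, 0.0
    for _ in range(cycles):
        # With i lamps already failed, the time to the next failure is
        # exponential with rate (n - i) / mean_life.
        for failed in range(m):
            total_time += rng.expovariate((n - failed) / mean_life)
        total_cost += c_group  # lower the assembly and renew all lamps
    return total_cost / total_time
```

With the parameters above, the simulated rate converges to c_group divided by the expected cycle length, mean_life · (1/n + 1/(n−1) + … + 1/(n−m+1)); richer variants such as the opportunistic age threshold are then compared on the same footing.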
Dagpunar (1996) introduces a maintenance model where replacement of a com-
ponent within a system is possible when some other part of the system fails, at a
cost of c2. The opportunity process is Poisson. A component is replaced at an
opportunity if its age exceeds a specified control limit t. Upon failure a component
is replaced at cost c4 if its age exceeds a specified control limit x, otherwise it is
minimally repaired at cost c1. In case of a minimal repair the age and failure rate of
the component after the repair is as it was immediately before failure. There is also
a possibility of a preventive or interrupt replacement at cost c3 if the component
is still functioning at a specified age T. A procedure to optimise the control limits t
and T is given in Dekker and Plasmeijer (2001).
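To make the cost structure of this control-limit policy concrete, here is a rough Monte Carlo sketch. Dagpunar's own analysis is analytical; the Weibull failure intensity, the opportunity rate mu and all numerical values below are our own illustrative assumptions:

```python
import math
import random

def control_limit_cost_rate(beta=2.0, eta=100.0, mu=0.05, t=40.0, x=60.0,
                            T=80.0, c1=1.0, c2=3.0, c3=4.0, c4=10.0,
                            cycles=20000, seed=1):
    """Simulate one component under the control-limit policy sketched above:
    minimal repair (cost c1) at failures before age x, failure replacement
    (cost c4) after age x, replacement at a Poisson(mu) opportunity (cost c2)
    once the age exceeds t, and a planned replacement (cost c3) at age T."""
    rng = random.Random(seed)
    total_time, total_cost = 0.0, 0.0
    for _ in range(cycles):
        # Opportunities before age t are ignored, so by memorylessness the
        # first usable one arrives at t + Exp(mu), capped by the age limit T.
        end = min(t + rng.expovariate(mu), T)
        end_cost = c2 if end < T else c3
        age = 0.0
        while True:
            # Next failure age of the NHPP with cumulative intensity (a/eta)^beta.
            inc = -math.log(1.0 - rng.random())
            age = eta * ((age / eta) ** beta + inc) ** (1.0 / beta)
            if age >= end:            # replaced (opportunity or planned) first
                total_cost += end_cost
                total_time += end
                break
            if age > x:               # failure replacement
                total_cost += c4
                total_time += age
                break
            total_cost += c1          # minimal repair: age/intensity unchanged
    return total_cost / total_time
```

Optimising the thresholds t and T then amounts to searching over this cost-rate surface, which is what the analytical procedure referenced above does exactly.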
maintenance, in the form of reduced failure rates, must be weighed against the
costs. The approach in this study first attempts to estimate the effect of the failure
rate of a piece of equipment on the overall performance/profitability of the plant.
An integrated production and maintenance planning problem is also solved to
determine the effects of PM on production. Finally, the results of these two
procedures are then utilized in a final optimization problem that uses the relation-
ship between profitability and failure rate as well as the costs of different main-
tenance policies to select the appropriate maintenance policy.
Vatn et al. (1996) present an approach for identifying the optimal maintenance
schedule for the components of a production system. Safety, health and environ-
ment objectives, maintenance costs and costs of lost production are all taken into
consideration, and maintenance is thus optimized with respect to multiple ob-
jectives. The approach is flexible as it can be carried out at various levels of detail,
e.g. adapted to available resources and to the management's willingness to give
detailed priorities with respect to objectives on safety vs. production loss.
Frost and Dechter (1998) define the scheduling of preventive maintenance of
power generating units within a power plant as a constraint satisfaction problem.
The purpose of maintenance scheduling is to determine the duration
and sequence of outages of power generating units over a given time period,
while minimizing operating and maintenance costs over the planning period.
Vaurio (1999) develops unavailability and cost rate functions for components
whose failures can occur randomly. Failures can only be detected through periodic
testing or inspections. If a failure occurs between consecutive inspections, the unit
remains failed until the next inspection. Components are renewed by preventive
maintenance periodically, or by repair or replacement after a failure, whichever
occurs first (age-replacement). The model takes into account finite repair and
maintenance durations as well as costs due to testing, repair, maintenance and lost
production or accidents. For normally operating units the time-related penalty is
loss of production. For standby safety equipment it is the expected cost of an
accident that can happen when the component is down due to a dormant failure,
repair or maintenance. The objective is to minimize the total cost rate with respect
to the inspection and the replacement interval. General conditions and techniques
are developed for solving optimal test and maintenance intervals, with and without
constraints on the production loss or accident rate. Insights are gained into how the
optimal intervals depend on various cost parameters and reliability characteristics.
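A drastically simplified version of this cost-rate trade-off, for a single periodically tested standby unit, can be written down directly. The approximation that mean fractional dead time is about λT/2 (valid for λT « 1) and the cost parameters are our own simplification, not Vaurio's full model:

```python
import math

def cost_rate(T, lam, c_insp, c_down):
    """Approximate cost rate for a periodically tested standby unit:
    each inspection costs c_insp, an undetected dormant failure costs
    c_down per unit of time, and the mean fraction of time spent failed
    between inspections is roughly lam * T / 2 when lam * T << 1."""
    return c_insp / T + c_down * lam * T / 2.0

def optimal_interval(lam, c_insp, c_down):
    # Setting d/dT cost_rate = -c_insp/T**2 + c_down*lam/2 to zero gives
    # the cost-minimizing inspection interval.
    return math.sqrt(2.0 * c_insp / (c_down * lam))
```

For example, a dormant failure rate of 10^-3 per hour, an inspection cost of 200 and a downtime cost of 50 per hour give an optimal interval of about 89 hours, at which point the two cost terms are exactly balanced.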
Van Dijkhuizen (2000) studies the problem of clustering preventive main-
tenance jobs in a multiple set-up multi-component production system. This article
has been reviewed in Chapter 11, which gives an overview of multi-component
maintenance models.
Cassady et al. (2001) introduce the concept of selective maintenance. Often
production systems are required to perform a sequence of operations with finite
breaks between each operation. The authors establish a mathematical programming
framework for assisting decision-makers in determining the optimal subset of main-
tenance activities to perform prior to beginning the next operation. This decision
making process is referred to as selective maintenance.
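The essence of this selection problem can be sketched as a small combinatorial optimization. The series-system structure, the reliability numbers and the time budget below are our own illustrative assumptions, not the authors' mathematical programming formulation:

```python
from itertools import chain, combinations

def selective_maintenance(components, budget):
    """Brute-force sketch of selective maintenance: choose the subset of
    repair actions that fits in the break's time budget and maximizes the
    reliability of a series system for the next operation.
    components: list of (r_unmaintained, r_maintained, repair_time)."""
    n = len(components)
    best_subset, best_rel = (), 0.0
    for subset in chain.from_iterable(combinations(range(n), k)
                                      for k in range(n + 1)):
        if sum(components[i][2] for i in subset) > budget:
            continue  # does not fit in the break between operations
        rel = 1.0
        for i, (r0, r1, _) in enumerate(components):
            rel *= r1 if i in subset else r0
        if rel > best_rel:
            best_rel, best_subset = rel, subset
    return best_subset, best_rel
```

Enumeration works only for a handful of components; the article's mathematical programming framework is what makes realistic instances tractable.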
The article of Haghani and Shafahi (2002) deals with the problem of scheduling
bus maintenance activities. A mathematical programming approach to the problem
is proposed. This approach takes as input a given daily operating schedule for all
buses assigned to a depot along with available maintenance resources. Then a daily
inspection and maintenance schedule is designed for the buses that require
inspection so as to minimize the interruptions to the daily bus-operating schedule,
maximize the reliability of the system and make efficient use of the maintenance
facilities.
Charles et al. (2003) examine the interaction effects of maintenance policies on
batch plant scheduling in a semiconductor wafer fabrication facility. The purpose
of the work is the improvement of the quality of maintenance department activities
by the implementation of optimized preventive maintenance (PM) strategies, and it
comes within the scope of a total productive maintenance (TPM) strategy. The
production of semiconductor devices is carried out in a wafer fab. In this produc-
tion environment equipment breakdowns or process drifts usually induce un-
scheduled production interruptions.
Cheung et al. (2004) consider a plant with several units of different types.
There are several shutdown periods for maintenance. The problem is to allocate
units to these periods in such a way that production is least affected. Maintenance
is not modelled in detail, but incorporated through frequency or period restrictions.
downtime required for maintenance. The main question is when to carry out
maintenance such that the inconvenience for the train operators, the disruption to
the scheduled train services and the infrastructure possession time for maintenance
are minimized, and the maintenance cost is as low as possible. For a more detailed
overview of techniques used in planning railway infrastructure maintenance we
refer to Dekker and Budai (2002) and Improverail (2002). In some articles (see,
e.g. Higgins 1998, Cheung et al. 1999 and Budai et al. 2006) the track possession
is modelled in between operations. This can be done for occasionally used tracks,
which is the case in Australia and some European countries. If tracks are used
frequently, one has to perform maintenance during nights, when the train traffic is
almost absent, or during weekends (with possible interruption of the train services),
when there are fewer disturbances for the passengers. In the first case one can either
make a cyclic static schedule, which is done by Den Hertog et al. (2005) and Van
Zante-de Fokkert (2001) for the Dutch situation, or a dynamic schedule with a
rolling horizon, which is done in Cheung et al. (1999). The latter schedule has to
be made regularly.
Some other articles deal with grouping railway maintenance activities to reduce
costs, downtime and inconvenience for the travellers and operators. Here we
mention the study of Budai et al. (2006) in which the preventive maintenance
scheduling problem is introduced. This problem arises in other public/private
sectors as well, since preventive maintenance of other technical systems (machine,
road, airplanes, etc.) also contains small routine works and large projects.
4 km) are overhauled in one stretch. In the latter case the traffic is diverted to other
lanes or the side of the road. It is shown that the latter is both better for the
traffic and cheaper, provided the volume of traffic on the road is not too
high. Another interesting contribution is from Rose and Bennett (1992) who pro-
vide a model to locate and decide on the size (or capacity) of road maintenance
depots, for corrective maintenance.
Maintenance costs are a substantial factor in an airline's costs. Estimates are that
20% of the cost is due to maintenance. Maintenance is crucial because of safety
reasons and because of high downtime costs. Apart from a crash, the worst event
for an airline is an aircraft on ground (AOG) because of failures. Accordingly a lot
of technology has been developed to facilitate maintenance. We mention in-
flight diagnosis, which allows quick action to be taken on the ground, and a very high
level of modularity, so that failed components can easily be replaced. Yet in an
aircraft there is still a high level of time-based preventive maintenance rather than
condition-based maintenance. A plane has to undergo several checks, ranging from
an A check taking about an hour after each flight, to a monthly B check, a yearly C
check and a five-yearly D check, where it is completely overhauled and which can
take a month. The presence of the monthly check implies that planes cannot always
fly the same route, but need to be rotated on a regular basis. It also implies that
airlines need multiple units of a type in order to provide a consistent service.
Several studies have addressed the issue of fleet allocation and maintenance
scheduling. In the fleet allocation one decides which planes fly which route and at
which time. One would preferably make an allocation which remains fixed for a
whole year, but due to the regular maintenance checks this is not possible. Gopalan
and Talluri (1998) give an overview of mathematical models on this problem.
Moudani and Mora-Camino (2000) present a method to do both flight assignment
and maintenance scheduling of planes. It uses dynamic programming and heuris-
tics. A case of a charter airline is considered. Sriram and Haghani (2003) also
consider the same problem. They solve it in two phases. Finally, Feo and Bard
(1989) consider the problem of maintenance base planning in relation to an airline's
fleet rotation, while Cohn and Barnhart (2003) consider the relation between crew
scheduling and key maintenance routing decisions.
In another line of research, Dijkstra et al. (1994) develop a model to assess
maintenance manpower scheduling and requirements in order to perform inspec-
tion checks (A type) between flight turnarounds. It appears that their workload is
quite peaked because of many flights arriving more or less at the same time (so-
called banks) in order to allow fast passenger transfers.
The same problem is also tackled by Yan et al. (2004). The articles in this line
of research consider in effect the production planning of maintenance, a topic also
addressed in Section 13.5.
As the last article in this category we would like to mention Cobb (1995) who
presents a simulation model to evaluate current maintenance system performance
or the positive effect of ad hoc operating decisions on maintenance turn times (i.e.
the time maintenance takes to carry out a check or to do a repair).
Maintenance and Production: A Review 331
Kralj and Petrovic (1988) have presented an overview article on optimal main-
tenance of thermal generating units in power systems. They primarily focused on
articles published in IEEE Transactions on Power Apparatus and Systems. Here we
will briefly discuss the typical problems of the maintenance of power systems and
review two articles dealing with these problems.
First of all, note that maintenance of power systems is costly, because it is im-
possible to store generated electrical energy. Moreover, the continuity of supply is
very important for its customers.
A second problem in scheduling the maintenance of power systems is that joint
maintenance of units is often impossible or very expensive, since it would affect
production too much.
Frost and Dechter (1998) consider the problem of scheduling preventive
maintenance of power generating units within a power plant. The purpose of the
maintenance scheduling is to determine the duration and sequence of outages of
power generating units over a given time period, while minimizing operating and
maintenance costs over the planning period, subject to various constraints. A
subset of the constraints contains the pairs of components that cannot be main-
tained simultaneously. In this article the maintenance problem is cast as a constraint
satisfaction problem (CSP). The optimal solution is found by solving a series of
CSPs with successively tighter cost-bound constraints.
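The flavour of the CSP formulation can be illustrated with a toy backtracking search. The unit durations, horizon and exclusion pairs below are invented for illustration; the real problem also carries the cost bounds and many further operating constraints described above:

```python
def schedule_outages(durations, horizon, exclusions):
    """Toy constraint-satisfaction sketch of outage scheduling: assign each
    generating unit a start period for its maintenance outage so that
    excluded pairs of units are never out simultaneously.  Returns a list
    of start periods, or None if no feasible schedule exists."""
    n = len(durations)
    starts = [None] * n

    def overlap(i, si, j, sj):
        # Outage i occupies periods [si, si + durations[i]); likewise for j.
        return si < sj + durations[j] and sj < si + durations[i]

    def backtrack(i):
        if i == n:
            return True
        for s in range(horizon - durations[i] + 1):
            if all(not ({(i, j), (j, i)} & exclusions
                        and overlap(i, s, j, starts[j]))
                   for j in range(i)):
                starts[i] = s
                if backtrack(i + 1):
                    return True
                starts[i] = None       # undo and try the next start period
        return False

    return starts if backtrack(0) else None
```

Solving a sequence of such feasibility problems with successively tighter cost bounds, as in the article, turns this satisfaction search into an optimization procedure.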
Langdon and Treleaven (1997) study the problem of scheduling maintenance
for electrical power transmission networks. Grouping maintenance in the network
may prevent the use of a cheap electricity generator, so requiring a more expensive
generator to be run in its place. That is, some parts of the network should not be
maintained simultaneously. These exclusions are modelled by adding restrictions
to the MIP formulation of the problem.
Al-Zubaidi and Christer (1997) consider the problem of manpower planning for hospital
building maintenance.
Another typical production planning problem is with respect to layout planning.
A case study for a maintenance tool room is described in Rosa and Feiring (1995).
The study by Rose and Bennett (1992), which was discussed in Section 13.4, also
falls into this category.
In a number of articles conceptual models are developed that integrate the pre-
ventive and corrective aspects of maintenance planning with aspects of the
production system such as quality, service level, priorities and capacity.
For instance, Finch and Gilbert (1986) present an integrated conceptual framework
for maintenance and production in which they focus especially on manpower
issues in corrective and preventive work. Weinstein and Chung (1999) test the
hypothesis that integrating the maintenance policy with the aggregate production
planning will significantly influence total cost reduction. It appears that this is the
case in the experimental setting investigated in this study. Lee (2005) considers
production inventory planning, where high level decisions on maintenance (viz.
their effects) are made.
Another group of articles deal with integrating process design, production and
maintenance planning. Already at the design stage decisions on the process system
and initial reliabilities of the equipment are made. Pistikopoulos et al. (2000)
describe an optimization framework for general multipurpose process models,
which determines both the optimal design and the production and main-
tenance plans simultaneously. In this framework, the basic process and system
reliability-maintainability characteristics are determined in the design phase with
the selection of system structure, components, etc. The remaining characteristics
are determined in the operation phase with the selection of appropriate operating
and maintenance policies. Therefore, the optimization of process system effective-
ness depends on the simultaneous identification of optimal design, operation and
maintenance policies having properly accounted for their interactions. In Goel et
al. (2003) a reliability allocation model is coupled with the existing design, produc-
tion, and maintenance optimization framework. The aim is to identify the optimal
size and initial reliability for each unit of equipment at the design stage. They
balance the additional design and maintenance costs with the benefits obtained due
to increased process availability.
In the classical economic manufacturing quantity (EMQ) model items are produced
at a constant rate p and the demand rate for the items is equal to d < p. The aim of
the model is to find the production uptime that minimizes the sum of the inventory
holding cost and the average fixed ordering (set-up) cost. This model is an extension of
the well-known economic order quantity (EOQ) model, the difference being that in
the EOQ model the ordered quantity arrives at once, whereas in the EMQ model
inventory builds up gradually at rate p - d during production. Note that the EMQ
model is also referred to as the economic production quantity (EPQ) model.
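The optimal lot size, and hence the production uptime, has a standard textbook closed form. We introduce the symbols K (fixed set-up cost per run) and h (holding cost per item per unit time) here, since the text above does not name them:

```python
import math

def emq_lot_size(K, d, p, h):
    """Classical EMQ/EPQ lot size: the Q that minimizes average set-up
    cost K*d/Q plus average holding cost h*Q*(1 - d/p)/2, valid for
    demand rate d strictly below production rate p."""
    if not d < p:
        raise ValueError("demand rate d must be smaller than production rate p")
    return math.sqrt(2.0 * K * d / (h * (1.0 - d / p)))

def emq_uptime(K, d, p, h):
    # Production runs at rate p, so the uptime per cycle is Q* / p.
    return emq_lot_size(K, d, p, h) / p
```

For instance, K = 100, d = 50, p = 200 and h = 2 give a lot size of about 81.6 units; letting p grow without bound recovers the familiar EOQ formula sqrt(2*K*d/h).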
In the extensive literature on production and inventory problems, it is often
assumed that the production process does not fail, that it is not interrupted and that
it only produces items of acceptable quality. Unfortunately, in practice this is not
always the case. A production process can be interrupted due to a machine break-
down or because the quality of the produced items is not acceptable anymore. The
EMQ model has been extended to deal with these aspects and we thus divide the
literature on EMQ models into two categories. First, we consider EMQ problems
that take into account the quality aspects of the items produced. The second
category of EMQ models analyzes the effects of (stochastic machine) breakdowns
on the lot sizing decision.
mally repaired and put back into commission. Okamura et al. (2001) generalize the
model of Srinivasan and Lee (1996) by assuming that both the demand and the
production process are continuous-time renewal counting processes. Further-
more, they suppose that machine breakdown occurs according to a non-homo-
geneous Poisson process. In Lee and Srinivasan (2001) the demand and production
rates are considered constant and a production run begins as soon as the inventory
drops to zero. If the facility fails during operation, it is repaired, but the repair
restores the facility only to the condition it was in just before the failure. Lee and
Srinivasan (2001) consider an (S, N) policy, where the control variable N specifies
the number of production cycles the machine should go through before it is set
aside for preventive maintenance overhaul, which restores the facility to its original
condition.
Recently, Lin and Gong (2006) determined the effect of breakdowns on the
decision of optimal production uptime for items subject to exponential deteriora-
tion under a no-resumption policy. Under this policy, a production run is executed
for a predetermined period of time provided that no machine breakdown has
occurred in this period. Otherwise, the production run is immediately aborted. The
inventories are built up gradually during the production uptime and a new pro-
duction run starts only when all on-hand inventories are depleted. If a breakdown
occurs then corrective maintenance is carried out and this takes a fixed amount of
time. If the inventory build-up during the production uptime is not enough to meet
the demand during the entire period of the corrective maintenance, shortages (lost
sales) will occur. Maintenance restores the production system to the same initial
working conditions.
A recent article of Kenne et al. (2006) considers the effects of both preventive
maintenance policies and machine age on optimal safety stock levels. Larger
stock levels hedge against the more frequent random failures that occur as the
machine ages. The objective of the study is to determine when to perform preventive
maintenance on the machine and to find the level of the safety stock to be main-
tained.
13.6.5 Miscellaneous
Finally, we list some articles that deal with integrated maintenance and production
planning but whose modelling approaches or problem settings differ from those of
the articles in the categories discussed earlier. For instance, the
model presented in Ashayeri et al. (1996) deals with the scheduling of production
and preventive maintenance jobs on multiple production lines, where each line has
one bottleneck machine. The model indicates whether or not to produce a certain
item in a certain period on a certain production line.
In Kianfar (2005) the manufacturing system is composed of one machine that
produces a single product. The failure rate of the machine is a function of its age
and the demand for the manufactured product is time-dependent; its rate depends
on the level of advertising of the product. The objective is to maximize the
expected discounted total profit of the firm over an infinite time horizon.
Sarper (1993) considers the following problem. Given a fixed repair/main-
tenance capacity, how many of each of the low demand large items (LDLIs) should
be started so that there are no incomplete jobs at the end of the production period?
The goal is to ensure that the portion of the total demand started will be completed
regardless of the amount by which some machines may stay idle due to insufficient
work. A mixed-integer model is presented to determine what portion of the
demand for each LDLI type should be rejected as lost sales so that the remaining
portion can be finished completely.
been published with the majority dating from the 1990s and the new millennium.
The most popular area in this review is also the oldest one, i.e. integrated
models for maintenance and production. Many papers still appear in that
area, however, and the models become more and more complex, with more decision
parameters and more aspects.
The topics on opportunity maintenance and scheduling maintenance in line
with production have also been popular, but maybe more in the past than today.
We expected to find more studies on specific business sectors, but could find
many only for the airline sector. That sector seems to be the most popular, as it has both
a lot of interaction between maintenance and production and high costs
involved. In the other sectors we do see the interaction, but perhaps more papers
will be published in the future. These sectors are interesting but small in terms
of papers published.
In general, the demands on maintenance are becoming higher, as the public and com-
panies are less willing to accept failures, bad-quality products or non-performance.
Yet at the same time society's inventory of capital goods is increasing as well as
ageing in Western societies. This is very much the case for roads, railways,
electric power generation, transport, and aircraft. As there are continuous pres-
sures on maintenance budgets we do foresee the need for research supporting
maintenance and production decisions, also because decision support software is
gaining in popularity and more data become available electronically. A theory is
therefore needed for such decision support systems. As several case studies have
taught us that practical problems have many complex aspects, there is a strong need
for more theory that can help us to understand and improve complex maintenance
decision-making.
13.8 Conclusions
In this chapter we have given an overview of planning models for production and
maintenance. These models are classified on the basis of the interactions between
maintenance and production. First, although maintenance is intended to allow
production, production is often stopped during maintenance. The question arises
when to do maintenance such that production is least affected. In order to answer
this question planning models should take into account the needs of production.
These needs are business sector specific and thus applications of planning models
in different areas have been considered. In comparison with other specific sectors,
much work has been done on modelling maintenance for the airline sector. Second,
maintenance itself can also be seen as a production process which needs to be
planned. Models for maintenance production planning mainly address allocation
and manpower determination problems. Finally, maintenance also affects the pro-
duction process since it takes capacity away. In production processes maintenance
is mostly initiated by machine failures or low quality items. Maintenance and
production should therefore be planned in an integrated way to deal with these
aspects. Indeed, integrated maintenance and production planning models determine
optimal lot sizes while taking into account failure and quality aspects. We observe
continuing attention to such models, which take more and more real-world as-
pects into account.
Although many articles have been written on the interaction between production and maintenance, a careful reader will detect several open issues in this review. The theory developed thus far is far from complete, and any real application is likely to reveal many more open issues.
13.9 Acknowledgements
The authors would like to thank Georgios Nenes, Sophia Panagiotidou, and the
editors for their helpful suggestions and comments.
14
Delay Time Modelling
Wenbin Wang
14.1 Introduction
In this chapter we present a modelling tool that was created to model the problems
of inspection maintenance and planned maintenance interventions, namely delay
time modelling (DTM). This concept provides a modelling framework readily
applicable to a wide class of actual industrial maintenance problems of assets in
general, and inspection problems in particular.
The concept of the delay time was first mentioned by Christer (1976) in the context of building maintenance. It was not until 1984 that the concept was first applied to an industrial maintenance problem (Christer and Waller 1984). Since then, a series of research papers has appeared on the theory and applications of delay time modelling of industrial asset inspection problems; see Christer (1999) for a detailed review. The delay time concept itself is simple: it defines the failure process of an asset as a two-stage process. The first stage is the normal operating stage, from new until the point at which a hidden defect can first be identified. The second stage, the failure delay time, runs from the point at which the defect can first be identified to failure. It is the existence of this failure delay time that provides the opportunity for preventive maintenance to remove or rectify identified defects before they lead to failures. With appropriate modelling of the durations of these two stages, optimal inspection intervals can be identified to optimise a criterion function of interest.
The delay time concept is similar in definition to the well-known potential failure (PF) interval in reliability centred maintenance (Moubray 1997). Two differences between these definitions, however, mark a fundamental difference in the modelling of maintenance inspection of assets. First, the delay time is random in Christer's definition, while the PF interval is assumed to be constant. Second, the initial point of defect identification is very important to setting up an appropriate inspection interval, but is ignored by Moubray. Moreover, Moubray did not provide any means of modelling the inspection practice, while DTM
346 W. Wang
example, the case for a single component will be considered in Section 14.3. The
interaction between inspection and equipment performance may be captured using
the delay time concept presented below.
Let the item of an asset be maintained on a breakdown basis. The time history of breakdown or failure events is then a random series of points; see Figure 14.1. For any one of these failures, the likelihood is that, had the item been inspected at some point just prior to failure, the inspection could have revealed a defect which, though the item was still working, would ultimately have led to the failure. Such signals include excessive vibration, unusual noise, excessive heat, surface staining, smell, reduced output, increased quality variability, etc. The first instance at which the presence of a defect might reasonably be expected to be recognised by an inspection, had it taken place, is called the initial point u of the defect, and the time h to failure from u is called the delay time of the defect; see Figure 14.2. Had an inspection taken place in (u, u + h), the presence of a defect could have been noted and corrective actions taken prior to failure. Given that a defect arises, its delay time represents a window of opportunity for preventing a failure. Clearly, the delay time h is a characteristic of the item concerned, the type of defect, the nature of any inspection, and perhaps the person inspecting. For example, if the item was a vehicle, and the maintenance practice was to respond when the driver reported a problem, then there is in effect a form of continuous monitoring inspection of cab-related aspects of the vehicle, with a reasonably long delay time consistent with the rate of deterioration of the defect. However, should the exhaust collapse because a support bracket corroded through, the likely warning period for the driver, the delay time, would be virtually zero, since he would not normally be expected to look under the vehicle. At the same time, had an inspection been undertaken by a service mechanic, the delay time might have been measured in weeks or months. Had the exhaust collapsed because securing bolts became loose before falling out, then the driver could have had a warning period of excessive vibration, and perhaps noise, and the defect would have had a driver-related delay time measured in days or weeks.
Figure 14.1. Failure points over time
Figure 14.2. The delay time of a defect, from initial point u to failure
To see why the delay time concept is of use, consider Figure 14.3, which incorporates the same failure point pattern as Figure 14.1 along with the initial points associated with each failure arising under a breakdown system. Had an inspection taken place at point (A), one defect could have been identified and the seven failures reduced to six. Likewise, had inspections taken place at both points (B) and point (A), four defects could have been identified and the seven failures reduced to three. Figure 14.3 demonstrates that, provided it is possible to model the way defects arise, that is, the rate of arrival of defects λ(u) and their associated delay time h, the delay time concept can capture the relationship between the inspection frequency and the number of plant failures.
We are assuming for now that inspections are perfect, that is, a defect is recognised if, and only if, it is there, and is removed by corrective action. Delay time modelling is still possible if these assumptions are not valid, but this more complex case is discussed in Section 14.3.1.
Figure 14.3. Initial points and failure points, with inspections at points A and B
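The relationship between inspection frequency and the number of failures that Figure 14.3 illustrates can be checked with a small Monte Carlo sketch. The code below is illustrative only: it assumes an HPP for defect arrivals, an exponential delay time and perfect periodic inspections, with made-up parameter values, and compares the simulated fraction of defects ending in failure with the steady-state value (1/T) ∫_0^T F(t) dt.

```python
import math, random

random.seed(42)

# Illustrative parameters (not from the chapter): defects arrive as an HPP at
# rate lam per day, each with an exponentially distributed delay time, and
# perfect inspections occur every T days.
lam, alpha, T = 2.0, 0.1, 7.0
horizon = 100_000.0

n_defects = n_failures = 0
t = 0.0
while True:
    t += random.expovariate(lam)        # initial point u of the next defect
    if t >= horizon:
        break
    n_defects += 1
    h = random.expovariate(alpha)       # delay time of this defect
    next_insp = math.ceil(t / T) * T    # first inspection after u
    if next_insp >= t + h:              # no inspection falls inside (u, u + h)
        n_failures += 1

sim_fraction = n_failures / n_defects
# Steady state: a defect becomes a failure with probability (1/T) int_0^T F(t) dt
theory = (T - (1.0 - math.exp(-alpha * T)) / alpha) / T
print(sim_fraction, theory)
```

With roughly 200,000 simulated defects the two fractions agree closely, which is the point of the exercise: once the defect arrival rate and the delay time distribution are modelled, the effect of any inspection interval on the failure count follows.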
1. An inspection takes place every T time units, costs c_s units and requires d_s time units, where d_s << T.
2. Inspections are perfect in that all (and only) defects present are identified.
3. Defects identified are repaired during the inspection period.
4. Defects arise according to a homogeneous Poisson process (HPP) with rate of occurrence of defects λ per unit time.
5. The delay time H of a random defect is described by a pdf f(h) and cdf F(h), and is independent of the initial point U.
6. Failures are repaired immediately at an average cost c_f and downtime d_f.
7. The plant has operated sufficiently long since new to be considered effectively in a steady state.
8. Defects and failures only arise whilst the plant is operating.
Delay Time Modelling 349
E[N_f(T)] = λ ∫_0^T F(t) dt   (14.2)

v(t) = λ F(t)   (14.3)
The original model developed in Christer and Waller (1984) for Equation 14.2
uses a different approach, but leads to the same result.
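As a quick numerical sanity check of Equation 14.2, the sketch below evaluates λ ∫_0^T F(t) dt by a midpoint rule and compares it with the closed form available when the delay time is exponential; the parameter values are hypothetical.

```python
import math

lam, alpha = 3.0, 0.05   # hypothetical defect rate and exponential delay-time rate

def F(t):
    # exponential delay time cdf
    return 1.0 - math.exp(-alpha * t)

def expected_failures(T, steps=10_000):
    # Equation 14.2: E[N_f(T)] = lam * int_0^T F(t) dt, by the midpoint rule
    dt = T / steps
    return lam * sum(F((k + 0.5) * dt) for k in range(steps)) * dt

T = 10.0
approx = expected_failures(T)
# Closed form for exponential F: lam * (T - (1 - exp(-alpha*T)) / alpha)
closed = lam * (T - (1.0 - math.exp(-alpha * T)) / alpha)
print(approx, closed)
```

The quadrature route is worth having because, for delay time distributions without a convenient closed form (the Weibull case later in the chapter, for instance), it is the only practical option.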
Section 14.3.1 outlined a basic delay time model under perfect inspections. It was established under a set of assumptions, some of which may not be valid in practical situations. These assumptions greatly simplify the mathematics involved, but also restrict wider use of the models developed. Perhaps the most restrictive assumption is that of perfect inspections: in almost all the case studies conducted using the delay time concept, we found none that supported the perfect inspection assumption. The other assumption of concern is the HPP for defect arrivals in the case of a complex system; one would naturally expect more defect arrivals as the system ages than in a younger system. In this section, we introduce a delay time model that relaxes the perfect inspection assumption. Delay time models using an NHPP are presented in Christer and Wang (1995) and Wang and Christer (2003). These models are mainly developed for complex systems, but a non-perfect inspection single-component delay time model can also be developed along similar lines (Baker and Wang 1991).
All the assumptions proposed in Section 14.3.1 will hold except the perfect inspection one. Assume for now that if a defect is present at an inspection, then there is a probability r that the defect is identified; this implies a probability 1 - r that the defect goes unnoticed. Figure 14.4 depicts such a process.
Figure 14.4. Failure process of a multi-component system subject to three non-perfect inspections at points A, B, and C; two potential failures were removed and two missed
It has been proved that the failure process over each inspection interval is still an NHPP (Christer and Wang 1995), but not identical over the earlier inspection intervals of the system. It can be shown that, as the number of inspections increases, the number of failures over each inspection interval becomes stable and identical, so we study the asymptotic behaviour of the failure process assuming the number of previous inspections is very large.
Let
i --- the i-th inspection
U --- random variable of the initial time u
r --- probability of a perfect inspection
v_i(t) --- ROCOF at time t, t ∈ [(i-1)T, iT)
E[N_f((i-1)T, iT)] --- expected number of failures over [(i-1)T, iT)
E[N_s(iT)] --- expected number of defects identified at iT
It can be shown (Christer et al. 1995; Christer and Wang 1995) that E[N_f((i-1)T, iT)] is given by

E[N_f((i-1)T, iT)] = ∫_{(i-1)T}^{iT} v_i(t) dt   (14.5)
                   = λ ∫_{(i-1)T}^{iT} { Σ_{n=1}^{i-1} (1-r)^{i-n} [F(t-(n-1)T) - F(t-nT)] + F(t-(i-1)T) } dt
E[N_s(iT)] = λ r Σ_{n=1}^{i-1} (1-r)^{i-n} ∫_{(n-1)T}^{nT} [1 - F(iT-u)] du + λ r ∫_{(i-1)T}^{iT} [1 - F(iT-u)] du   (14.6)
The expected downtime is given by Equation 14.1 with the expected number of failures given by Equation 14.5, so that

D(T) = (d_f E[N_f((i-1)T, iT)] + d_s) / (T + d_s)   (14.7)
The use of Equation 14.7 assumes that the system is already in a steady state with i → ∞. For computation purposes we can select a large i, and then start the summation from the first k for which (1-r)^{i-k} ≥ ε, where ε is a very small number.
Equation 14.7 is established assuming that the defects identified at an inspection are always removed without any extra downtime or cost. This assumption can be relaxed. Let d_r be the mean downtime per defect repaired. Then, using the same approach as before, the expected downtime is given by
If the objective function is the expected cost per unit time, we obtain it by simply substituting the downtime parameters in Equation 14.7 or 14.8 with the corresponding cost parameters.
Example 14.1 Assume that the rate of occurrence of defects is two per day, and the delay time distribution is exponential with scale parameter 0.03 measured in days. The downtime measures are d_f = 30 and d_s = 30 min respectively, and the probability of a perfect inspection is assumed to be 0.7. Using Equations 14.5 and 14.7, we obtain the expected downtime as a function of the inspection interval, shown in Figure 14.5, from which it can be seen that a weekly inspection interval is the best.
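A numerical sketch of Example 14.1 under stated assumptions: the summation form of the ROCOF in Equation 14.5, truncated at a large i to stand in for the steady state, is integrated numerically and substituted into Equation 14.7. Interpreting 0.03 as the exponential rate and both downtimes as 30 minutes expressed in days are assumptions on our part.

```python
import math

# Example 14.1's inputs: two defects/day, exponential delay time with rate
# 0.03/day (treating the quoted scale parameter as a rate), r = 0.7, and both
# downtimes 30 min expressed in days -- the unit conversion is our assumption.
lam, alpha, r = 2.0, 0.03, 0.7
d_f = d_s = 30.0 / (24 * 60)

def F(h):
    return 1.0 - math.exp(-alpha * h) if h > 0 else 0.0

def rocof(t, T, i):
    # Integrand of Equation 14.5: defects arisen in the current interval plus
    # defects missed (with probability 1 - r each time) at earlier inspections.
    s = F(t - (i - 1) * T)
    for n in range(1, i):
        s += (1.0 - r) ** (i - n) * (F(t - (n - 1) * T) - F(t - n * T))
    return lam * s

def downtime(T, i=60, steps=400):
    # Equation 14.7 with a large i standing in for the steady state.
    dt = T / steps
    a = (i - 1) * T
    e_nf = sum(rocof(a + (k + 0.5) * dt, T, i) for k in range(steps)) * dt
    return (d_f * e_nf + d_s) / (T + d_s)

vals = {T: downtime(float(T)) for T in range(1, 23)}
best = min(vals, key=vals.get)
print(best, vals[best] * 24 * 60)   # best interval (days), downtime in min/day
```

The downtime curve is U-shaped, as in Figure 14.5: very frequent inspection is dominated by inspection downtime, while long intervals let failures accumulate.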
Figure 14.5. Expected downtime per unit time vs. inspection interval (in days)
Figure 14.6. Failure process of a multi-component system, showing initial points and failure points
Figure 14.7. Failure process of a single component system
For the system in Figure 14.6, the system may be renewed at inspection points if these inspections are perfect and the rate of arrival of defects is constant. However, for the system in Figure 14.7, the system can be renewed either at a failure or at an inspection. We present the case with a perfect inspection assumption; the case of an imperfect inspection delay time model for a single component can be found in Baker and Wang (1991, 1993).
We need the following additional assumptions and notation:
We first consider a simple case in which an inspection renews the system regardless of whether a defect was identified or not. This effectively assumes an exponential distribution for the initial time U.
Since each failure or inspection renews the system with associated downtimes or costs, the process is a renewal reward process, and the long-term expected cost per unit time, C(T), is given by (Ross 1983):

C(T) = E(CC) / E(CL)

where CC is the renewal cycle cost and CL is the renewal cycle length, which is the interval between two consecutive renewals. There can be two different renewal cycles: one is the failure renewal and the other is the inspection renewal.
Taking the expected cost per renewal cycle as an example: a failure costs c_f and occurs with probability P(X < T), where X = U + H is the time to failure, so the expected cost due to a failure renewal within T is

c_f P(X < T) = c_f ∫_0^T g(u) F(T-u) du   (14.9)
The expected cost due to an inspection renewal with a defect identified at T is

(c_r + c_s) P(U < T, X ≥ T) = (c_r + c_s) ∫_0^T g(u) [1 - F(T-u)] du   (14.10)

and finally the expected cost due to an inspection renewal without a defect being identified at T is given by

c_s P(U ≥ T) = c_s ∫_T^∞ g(u) du   (14.11)
The expected cycle cost is then

E(CC) = c_f ∫_0^T g(u) F(T-u) du + (c_r + c_s) ∫_0^T g(u) [1 - F(T-u)] du + c_s ∫_T^∞ g(u) du   (14.12)
As to the expected cycle length, we model two possibilities. The first is that the cycle ends at a failure before T. Define p(t) as the density function of the time to failure, which is readily given by

p(t) = (d/dt) P(X ≤ t) = ∫_0^t g(u) f(t-u) du

The expected cycle length is then

E(CL) = ∫_0^T t ∫_0^t g(u) f(t-u) du dt + T (1 - ∫_0^T g(u) F(T-u) du)   (14.13)
For the detailed derivation of Equations 14.9–14.13 see Baker and Wang (1991, 1993).
Finally, the expected cost per unit time is given by

C(T) = [c_f ∫_0^T g(u) F(T-u) du + (c_r + c_s) ∫_0^T g(u) [1 - F(T-u)] du + c_s ∫_T^∞ g(u) du]
       / [∫_0^T t ∫_0^t g(u) f(t-u) du dt + T (1 - ∫_0^T g(u) F(T-u) du)]   (14.14)
Example 14.2 Assume both the initial time and delay time distributions are exponential, with scale parameters 0.6 and 0.75 respectively. The time unit is 100 days and the cost parameter values are c_f = 1000, c_r = 150 and c_s = 15 respectively. Using Equation 14.14, the calculated expected cost per unit time as a function of T is shown in Figure 14.8.
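Example 14.2 can be reproduced numerically from Equation 14.14. The sketch below treats the quoted scale parameters 0.6 and 0.75 as exponential rates (an interpretation on our part) and searches a grid of inspection intervals, all integrals evaluated by a midpoint rule.

```python
import math

# Example 14.2's setup: exponential initial-time and delay-time distributions,
# with the quoted scale parameters 0.6 and 0.75 treated as rates per 100 days
# (an assumption); c_f = 1000, c_r = 150, c_s = 15; T is in units of 100 days.
mu, eta = 0.6, 0.75
c_f, c_r, c_s = 1000.0, 150.0, 15.0

def g(u):  # initial time pdf
    return mu * math.exp(-mu * u)

def F(h):  # delay time cdf
    return 1.0 - math.exp(-eta * h)

def f(h):  # delay time pdf
    return eta * math.exp(-eta * h)

def C(T, steps=200):
    # Equation 14.14 evaluated by midpoint-rule integration.
    du = T / steps
    us = [(k + 0.5) * du for k in range(steps)]
    p_fail = sum(g(u) * F(T - u) for u in us) * du
    p_insp_defect = sum(g(u) * (1.0 - F(T - u)) for u in us) * du
    p_no_defect = math.exp(-mu * T)              # int_T^infinity g(u) du
    num = c_f * p_fail + (c_r + c_s) * p_insp_defect + c_s * p_no_defect
    e_t_fail = 0.0                               # int_0^T t p(t) dt
    for k in range(steps):
        t = (k + 0.5) * du
        p_t = sum(g(u) * f(t - u) for u in us if u < t) * du
        e_t_fail += t * p_t * du
    den = e_t_fail + T * (1.0 - p_fail)
    return num / den

grid = [0.1 * k for k in range(1, 24)]           # T = 0.1, ..., 2.3
vals = [C(T) for T in grid]
best = grid[vals.index(min(vals))]
print(best, min(vals))
```

The resulting cost curve falls steeply from very short intervals, bottoms out, and climbs again as failures start to dominate, the shape shown in Figure 14.8.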
Figure 14.8. Expected cost per unit time vs. inspection interval
[(i-1) d_s + d_f] P((i-1)T < X < iT) = [(i-1) d_s + d_f] ∫_{(i-1)T}^{iT} g(u) F(iT-u) du   (14.15)
This is because inspections are perfect, so that if a failure occurs at time X, then the initial time U must lie within [(i-1)T, X), X < iT. There are (i-1) inspections with no defect identified before the failure, so (i-1) times the inspection downtime is added.
Equation 14.15 models only one of the possibilities; a failure can occur in any of the inspection intervals, so summing over all possible intervals i from 1 to infinity gives the expected downtime due to a failure:
Σ_{i=1}^∞ [(i-1) d_s + d_f] ∫_{(i-1)T}^{iT} g(u) F(iT-u) du   (14.16)
Equation 14.16 is always finite, since all the probability terms for large i tend to zero because g(u) tends to zero for u > (i-1)T when i is large.
Similarly, the expected downtime due to an inspection renewal with a defect identified is

Σ_{i=1}^∞ [(i-1) d_s + d_r] ∫_{(i-1)T}^{iT} g(u) [1 - F(iT-u)] du   (14.17)
Summing Equations 14.16 and 14.17 gives the complete expected downtime per renewal cycle:

E(CD) = Σ_{i=1}^∞ { [(i-1) d_s + d_r] ∫_{(i-1)T}^{iT} g(u) du + (d_f - d_r) ∫_{(i-1)T}^{iT} g(u) F(iT-u) du }   (14.18)
The expected cycle length is

E(CL) = Σ_{i=1}^∞ { ∫_{(i-1)T}^{iT} t ∫_{(i-1)T}^{t} g(u) f(t-u) du dt + iT ∫_{(i-1)T}^{iT} g(u) [1 - F(iT-u)] du }   (14.19)
C(T) = Σ_{i=1}^∞ { [(i-1) d_s + d_r] ∫_{(i-1)T}^{iT} g(u) du + (d_f - d_r) ∫_{(i-1)T}^{iT} g(u) F(iT-u) du }
       / Σ_{i=1}^∞ { ∫_{(i-1)T}^{iT} t ∫_{(i-1)T}^{t} g(u) f(t-u) du dt + iT ∫_{(i-1)T}^{iT} g(u) [1 - F(iT-u)] du }   (14.20)
Following a discussion with the chief technician, it seemed best to focus on the following items, to ensure a sample of similar machine types under heavy and constant use, with a usefully long history of failures and reasonably well-defined failure modes. Two types of pump were chosen, namely volumetric infusion pumps and peristaltic pumps, all from the intensive-care, neurosurgery and heart-care units. There were 105 volumetric pumps, whose most frequent failure mode was failure of the pressure transducer, and 35 peristaltic pumps, whose most frequent failure mode was battery failure. For a detailed description of the case, data and model fitting see Baker and Wang (1991). Several distributions were tried for the initial and delay time distributions of both pumps, and it turned out that in both cases a Weibull distribution was best for the initial time and an exponential distribution for the delay time. The estimated parameter values, obtained from the historical data by the maximum likelihood method, are shown in Table 14.1 for both pumps.
Although the cost data were not recorded, it was relatively easy to estimate the cost of an inspection (called preventive maintenance in the hospital) and the cost of an inspection repair if a defect was identified. However, it was extremely difficult to estimate the failure cost, since if the pump failed to work while needed, the penalty cost could be very high compared with the cost of the pump itself. Nevertheless, some estimates were provided, which are shown in Table 14.2.
This time we cannot derive an analytical formula for the expected cost because of the use of the Weibull distribution; numerical integration has to be used to calculate Equation 14.20. We did this using the maths software package MathCad, and the results are shown in Figures 14.9 and 14.10.
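A sketch of the same computation: Equation 14.20 evaluated by numerical integration with the infinite sums truncated. Since Tables 14.1 and 14.2 are not reproduced in this excerpt, the Weibull/exponential parameters and costs below are hypothetical stand-ins, not the fitted values for either pump.

```python
import math

# Hypothetical stand-ins for Tables 14.1 and 14.2 (not the fitted pump values):
# Weibull initial time (shape 2, scale 60 days), exponential delay time
# (mean 20 days), and costs c_f = 2000, c_r = 100, c_s = 20.
beta, theta, eta = 2.0, 60.0, 1.0 / 20.0
c_f, c_r, c_s = 2000.0, 100.0, 20.0

def g(u):  # Weibull pdf for the initial time
    return (beta / theta) * (u / theta) ** (beta - 1) * math.exp(-(u / theta) ** beta)

def F(h):  # exponential delay time cdf
    return 1.0 - math.exp(-eta * h)

def f(h):  # exponential delay time pdf
    return eta * math.exp(-eta * h)

def C(T, n_intervals=40, steps=40):
    # Equation 14.20 with cost parameters substituted for the downtimes and
    # the infinite sums truncated after n_intervals terms.
    num = den = 0.0
    for i in range(1, n_intervals + 1):
        a = (i - 1) * T
        du = T / steps
        us = [a + (k + 0.5) * du for k in range(steps)]
        int_g = sum(g(u) for u in us) * du
        int_gF = sum(g(u) * F(i * T - u) for u in us) * du
        num += ((i - 1) * c_s + c_r) * int_g + (c_f - c_r) * int_gF
        e_t = 0.0   # cycle-length contribution from failures in this interval
        for k in range(steps):
            t = a + (k + 0.5) * du
            e_t += t * sum(g(u) * f(t - u) for u in us if u < t) * du * du
        den += e_t + i * T * (int_g - int_gF)
    return num / den

c20, c60 = C(20.0), C(60.0)
print(c20, c60)
```

With these stand-in parameters a short interval is cheaper per unit time than a long one; the real recommendation, of course, depends on the fitted parameters and especially on the failure-cost estimate, as the chapter notes.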
Figure 14.9. Expected cost per unit time vs. inspection interval for the volumetric infusion pump
Figure 14.10. Expected cost per unit time vs. inspection interval for the peristaltic pump
Time is given in days in Figures 14.9 and 14.10, so the optimal inspection interval is about 30 days for the volumetric infusion pump and around 70 days for the peristaltic pump. The hospital at the time checked the pumps at six-month intervals, so for both pumps the inspection interval should clearly be shortened. It has to be pointed out, however, that the model is sensitive to the failure cost, and had a different estimate been provided, the recommendation would have been different.
In previous sections, delay time models for both a complex system and a single component have been introduced. In a practical situation, however, before the construction of expected cost or downtime models, it is necessary to estimate the values of the parameters that characterise the defect arrival and failure processes. In this section we discuss various methods developed to estimate the parameters from either subjective data (experts' opinions) or objective data collected at failures and inspections.
Naturally, the parameter estimation process is not the same for the different types of delay time model: in single component models a single potential failure mode is modelled and only one defect may (or may not) be present at any one time, whereas in complex system models many defects can exist simultaneously and many failures can occur in the interval between inspections. This is particularly important for the method using objective data. In this section, we mainly focus on the estimation methods for complex systems, since these systems are the most applicable asset items for DTM. The details of the approaches developed for parameter estimation for a single component DTM can be found in Baker and Wang (1991, 1993).
14.5.2.1 Subjective Estimation of the Delay Times Through an On-site and On-the-spot Survey
This method needs to be applied over a period of time, collecting detailed information and assessments at every maintenance intervention or failure (Christer and Waller 1984). At every failure repair, the maintenance technician repairing the plant would be asked to estimate:
HLA: how long ago the defect causing the failure might first have been expected to be recognised at an inspection.
If a defect was identified at an inspection, then in addition to HLA, the technician would be asked to estimate:
HML: how much longer the defect could have been left unattended before a repair was essential.
The estimates are given by h = HLA for a failure, and h = HLA + HML for an inspection repair; see Figure 14.11a,b. f(h) is then estimated from the data {h}.
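A minimal sketch of how such survey responses translate into delay time estimates and a fitted f(h): the HLA/HML records below are invented for illustration, and the exponential maximum likelihood fit (rate = 1/sample mean) is one simple choice of distribution, not the chapter's prescription.

```python
# Invented survey records for illustration (not data from the chapter).
failures_HLA = [5.0, 2.0, 12.0, 1.0, 8.0]                # h = HLA at failures
repairs_HLA_HML = [(4.0, 10.0), (2.0, 6.0), (7.0, 9.0)]  # h = HLA + HML

h_samples = failures_HLA + [hla + hml for hla, hml in repairs_HLA_HML]
mean_h = sum(h_samples) / len(h_samples)
rate = 1.0 / mean_h     # exponential MLE for f(h): rate = 1 / sample mean
print(mean_h, rate)
```

In practice one would compare several candidate distributions for {h}, as the hospital case study did for the pumps.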
14.5.2.2 Subjective Estimation of the Delay Times Based on Identified Failure Modes
The method introduced earlier is a questionnaire-survey-based approach in which the subjective opinions of maintenance engineers are sought. It has the advantage of directly facing the defect or failure when the information regarding the delay time is requested. However, it also has the following problems: (a) conducting such a survey is a time-consuming process, particularly when the frequency of failures or defects is low, which implies a longer time to obtain sufficient data; and (b) the estimation process is not easy to control, since all the forms are left in the hands of the maintenance engineers involved without an analyst present, which may result in confusion and mistakes, as experienced in the studies of Christer and Waller (1984) and Christer et al. (1998b).
Wang (1997) recommended a new approach to estimate the delay time distribution directly, based on pre-defined major failure modes or types. The idea is as follows:
The following phases for estimating the delay time were suggested by Wang (1997).
The problem identification phase This phase identifies all major failure types and the possible causes of the failures. It is normally done via a failure mode and criticality analysis, so that a list of dominant failures can be obtained. The process entails a series of discussions with the maintenance engineers to clarify any hidden issues. If some failure data exist, they should be used to validate the list; otherwise a questionnaire should be designed and forwarded to the person concerned to obtain a list of dominant failure types.
Expert identification and choice phase The term expert is not defined by any quantitative measure of resident knowledge; however, it is clear in this case that a person regarded by others as among the most knowledgeable about the machine should be chosen as the expert. The shop floor fitters or any maintenance technicians or engineers who maintain the machine would be the desired experts (Christer and Waller 1984). After the set of experts is identified, a choice is made of which experts to use in the study. Full discussion with management is necessary in order to select the persons who know the machine best. Psychologically, five or fewer experts are expected to take part in the exercise, but no fewer than three.
The question formulation phase The quantities we want to elicit in this case are the rate of occurrence of defects (assuming we are modelling a complex plant) and the delay time distribution. For the rate of arrival of a defect type, we can simply ask for a point estimate, since it is not a random variable. Without maintenance interventions, this would, in the long term, equal the average number of failures of the same type per unit time. For example, we may ask "how many failures of this type will occur per year, month, week or day?". It is noted that this quantity is usually observable. In fact, our focus is mainly on the delay time estimates.
Given the amount of uncertainty inherent in predicting the delay time, the experts may feel uncomfortable about giving a point estimate, and may prefer to communicate something about the range of their uncertainty. Accepting these points, perhaps the best the experts could do in this case is to give their subjective probability mass function for the quantity in question. In other words, they could provide an estimate over intervals such that the mass above each interval is proportional to their subjective probability measure. Alternatively, three point estimates can be requested, such as the most likely, the minimum and the maximum durations of the delay time for a particular type of failure.
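One simple way to use such three-point estimates, sketched below, is to map (minimum, most likely, maximum) onto a triangular distribution; the chapter does not prescribe this particular distribution, and the numbers are illustrative.

```python
import random

random.seed(1)

# Illustrative three-point estimate from one expert (minimum, most likely,
# maximum delay time in days); the triangular mapping is one simple choice.
h_min, h_mode, h_max = 1.0, 5.0, 20.0
mean_tri = (h_min + h_mode + h_max) / 3.0   # mean of a triangular distribution
samples = [random.triangular(h_min, h_max, h_mode) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(mean_tri, sample_mean)
```

Sampling from the elicited distribution in this way lets the analyst propagate the experts' uncertainty about h through the downtime or cost models rather than committing to a single point estimate.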
The term delay time was not used in the question, since it would take some effort to explain what the delay time is. Instead, we asked a question similar to HLA. Even so, this question was still difficult for the experts to understand, based on our case experience; the lesson learned is to demonstrate an example for them before starting the session.
362 W. Wang
The elicitation phase Elicitation should be performed with each expert individu-
ally. If possible, the analyst should be present; this proved to be vital in our case
studies. The above-mentioned histogram was used to elicit the answers from the
experts, so that the experts could have a visual overview of their estimates, and a
smoother histogram could be achieved if the experts were advised to do so. The
maximum number of histogram intervals is set to five, as advised by psychological
experiments.
The updating phase This phase applies mainly after some failures have occurred
and recorded findings become available. In a sense, it is a way of calibrating the
earlier subjective estimates.
A case study using the above method is detailed in Akbarov et al. (2006).
The Bayesian mechanism enters the process when objective data become available,
which requires a repeated evaluation of the likelihood function to be introduced
later. In the framework of Bayesian statistics, and assuming no objective data are
available at the beginning, we first assume a prior on the parameters which
characterise the underlying defect and failure arrival processes. When objective
data become available, we calculate the joint posterior distribution of the
parameters, and then we may use this posterior distribution to evaluate the
expected cost or downtime per unit time conditional on the observed data.
Assuming for now that we are interested in the rate of arrival of defects, $\lambda$, and
the delay time pdf $f(h)$, which is characterised by a two-parameter distribution
$f(h\,|\,\alpha,\beta)$. Unlike the methods proposed in Christer and Waller (1984) and Wang
(1997), here we treat the parameter $\lambda$ and the $\alpha$ and $\beta$ in $f(h\,|\,\alpha,\beta)$ as random
variables. The classical Bayesian approach is used here to define the prior dis-
tributions for the model parameters $\lambda$, $\alpha$ and $\beta$ as $f(\lambda\,|\,\theta_\lambda)$, $f(\alpha\,|\,\theta_\alpha)$ and
$f(\beta\,|\,\theta_\beta)$, where $\theta_\lambda$ is the set of hyper-parameters within $f(\lambda\,|\,\theta_\lambda)$, and
similarly for $\theta_\alpha$ and $\theta_\beta$.
Once those are available, the point estimates of $\lambda$, $\alpha$ and $\beta$ are their expected
values, given by

$$\hat{\lambda}=\int_0^\infty \lambda f(\lambda\,|\,\theta_\lambda)\,d\lambda,\quad \hat{\alpha}=\int_0^\infty \alpha f(\alpha\,|\,\theta_\alpha)\,d\alpha \quad\text{and}\quad \hat{\beta}=\int_0^\infty \beta f(\beta\,|\,\theta_\beta)\,d\beta .$$

For a general function $g(\lambda,\alpha,\beta)$,

$$E[g(\lambda,\alpha,\beta)]=\int_0^\infty\!\!\int_0^\infty\!\!\int_0^\infty g(\lambda,\alpha,\beta)\,f(\lambda\,|\,\theta_\lambda)f(\alpha\,|\,\theta_\alpha)f(\beta\,|\,\theta_\beta)\,d\lambda\,d\alpha\,d\beta . \tag{14.21}$$

Equating a subjective estimate $g_s$ to this expectation gives

$$g_s=\int_0^\infty\!\!\int_0^\infty\!\!\int_0^\infty g(\lambda,\alpha,\beta)\,f(\lambda\,|\,\theta_\lambda)f(\alpha\,|\,\theta_\alpha)f(\beta\,|\,\theta_\beta)\,d\lambda\,d\alpha\,d\beta . \tag{14.22}$$
Equation 14.22 is only one of such equations; if several different subjective
estimates were provided, we would have a set of equations like Equation 14.22.
The hyper-parameters may then be estimated by solving these equations, provided
the number of such equations is at least equal to the number of hyper-parameters.
We now demonstrate this in our case.
Suppose that the experts can provide us with the following subjective statistics
for estimating $\theta_\lambda$:
Taking $g(\lambda,\alpha,\beta)=\lambda T$, the expected number of defects arising over a period of
length $T$, Equation 14.21 gives

$$E[g(\lambda,\alpha,\beta)]=\int_0^\infty\!\!\int_0^\infty\!\!\int_0^\infty \lambda T\,f(\lambda\,|\,\theta_\lambda)f(\alpha\,|\,\theta_\alpha)f(\beta\,|\,\theta_\beta)\,d\lambda\,d\alpha\,d\beta=\int_0^\infty \lambda T f(\lambda\,|\,\theta_\lambda)\,d\lambda ,$$

so that

$$n_f+n_d=\int_0^\infty \lambda T f(\lambda\,|\,\theta_\lambda)\,d\lambda . \tag{14.23}$$

Similarly, from the property of the HPP, that is, $P(N_d(0,T)=n\,|\,\lambda)=e^{-\lambda T}(\lambda T)^n/n!$, we have

$$p_{n_d}=\int_0^\infty \Pr\bigl(N_d(0,T)=0\,|\,\lambda\bigr)f(\lambda\,|\,\theta_\lambda)\,d\lambda=\int_0^\infty e^{-\lambda T}f(\lambda\,|\,\theta_\lambda)\,d\lambda , \tag{14.24}$$
where $N_d(0,T)$ is the number of defects in $[0,T)$. If we have only two hyper-para-
meters in $\theta_\lambda$, then solving Equations 14.23 and 14.24 simultaneously will give the
estimated values of the hyper-parameters in $\theta_\lambda$. Note that $\lambda$ is independent of
$\alpha$ and $\beta$, so the integrals over $f(\alpha\,|\,\theta_\alpha)$ and $f(\beta\,|\,\theta_\beta)$ drop out of Equation
14.21. Similarly, if more subjective estimates were provided, the hyper-parameters
in $\theta_\alpha$ and $\theta_\beta$ can be obtained. For a detailed description of such an approach to
estimating delay time model parameters, see Wang and Jia (2007).
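As a minimal numerical sketch of this step, suppose (purely for illustration; the chapter does not prescribe this prior) that $f(\lambda\,|\,\theta_\lambda)$ is a gamma density with shape $a$ and rate $b$. Then Equation 14.23 becomes $Ta/b = n_f + n_d$ and Equation 14.24 becomes $(b/(b+T))^a = p_{n_d}$, and the two hyper-parameters can be solved for directly:

```python
import math

# Sketch: estimating two hyper-parameters from the subjective statistics of
# Equations 14.23 and 14.24, under the illustrative assumption that the
# prior on lambda is gamma(shape a, rate b), for which
#   integral lambda*T f(lambda) dlambda      = T*a/b        (Equation 14.23)
#   integral exp(-lambda*T) f(lambda) dlambda = (b/(b+T))**a (Equation 14.24)

def fit_gamma_hyperparams(mean_count, p_zero, T):
    """Solve T*a/b = mean_count and (b/(b+T))**a = p_zero for (a, b)."""
    def gap(b):
        a = mean_count * b / T          # enforce Equation 14.23 exactly
        return (b / (b + T)) ** a - p_zero
    # gap(b) decreases from 1 - p_zero towards exp(-mean_count) - p_zero,
    # so bisection brackets the root when one exists.
    lo, hi = 1e-9, 1e9
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    return mean_count * b / T, b

# Experts say: about 5 defects/failures per year, and a 1% chance of a
# defect-free year (illustrative numbers).
a, b = fit_gamma_hyperparams(mean_count=5.0, p_zero=0.01, T=1.0)
```

The same two-equation structure carries over to other two-parameter priors; only the closed forms inside `gap` change.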
This approach improves on the previously developed subjective methods both in
the way the data are obtained and in the accuracy of the estimated parameters. It
also links naturally, via Bayes' theorem, to the objective method of estimating
DTM parameters presented in the next section, should such objective data become
available (Wang and Jia 2007).
Objective data for complex systems under regular inspections should consist of the
failures (and associated times) in each interval of operation between inspections,
and the number of defects found in the system at each inspection. From this data,
we estimate the parameters for the chosen form of the delay time model.
Delay Time Modelling 365
Initially, we consider a simple case of the estimation problem for the basic
delay time model where only the number of failures, $m_i$, occurring in each cycle
$[(i-1)T, iT)$ and the number of defects found and repaired, $j_i$, at each inspection
(at time $iT$) are required. We do not know the actual failure times within the
cycles. The probability of observing $m_i$ failures in $[(i-1)T, iT)$ is

$$P\bigl(N_f((i-1)T,iT)=m_i\bigr)=\frac{e^{-E[N_f((i-1)T,iT)]}\,E[N_f((i-1)T,iT)]^{m_i}}{m_i!}\,, \tag{14.25}$$

and the probability of observing $j_i$ defects at the inspection at time $iT$ is

$$P\bigl(N_s(iT)=j_i\bigr)=\frac{e^{-E[N_s(iT)]}\,E[N_s(iT)]^{j_i}}{j_i!}\,. \tag{14.26}$$
As the observations are independent, the likelihood of observing the given data
set is just the product of the Poisson probabilities of observing each cycle of data,
$m_i$ and $j_i$. As such, the likelihood function for $K$ intervals of data is

$$L(\Theta)=\prod_{i=1}^{K}\frac{e^{-E[N_f((i-1)T,iT)]}\,E[N_f((i-1)T,iT)]^{m_i}}{m_i!}\cdot\frac{e^{-E[N_s(iT)]}\,E[N_s(iT)]^{j_i}}{j_i!}\,, \tag{14.27}$$
where $\Theta$ is the set of parameters within the delay time model. The likelihood
function is optimised with respect to the parameters to obtain the estimated values.
This process can be simplified by taking natural logarithms. The log-likelihood
function is

$$\ell(\Theta)=\sum_{i=1}^{K}\Bigl(m_i\log E[N_f((i-1)T,iT)]+j_i\log E[N_s(iT)]-E[N_f((i-1)T,iT)]-E[N_s(iT)]\Bigr)-\sum_{i=1}^{K}\bigl(\log(m_i!)+\log(j_i!)\bigr)\,, \tag{14.28}$$

where the final summation term is irrelevant when maximising the log-likelihood,
as it is constant and therefore not a function of any of the parameters under
investigation.
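The maximisation above can be sketched numerically. Assuming, for illustration only, an HPP defect arrival rate $\lambda$, a purely exponential delay time $F(h)=1-e^{-\beta h}$ and perfect inspection, we have per cycle $E[N_f]=\lambda(T-(1-e^{-\beta T})/\beta)$ and $E[N_s]=\lambda(1-e^{-\beta T})/\beta$, so $E[N_f]+E[N_s]=\lambda T$ and the MLEs separate into a closed form for $\lambda$ and a one-dimensional search for $\beta$:

```python
import math

# Sketch: ML estimation for a basic (perfect-inspection) delay time model
# with HPP defect rate lam and exponential delay time F(h) = 1 - exp(-beta*h).
# Per cycle of length T:
#   E[N_f] = lam*(T - (1 - exp(-beta*T))/beta)  (defects failing in-cycle)
#   E[N_s] = lam*(1 - exp(-beta*T))/beta        (defects found at inspection)
# Since E[N_f] + E[N_s] = lam*T, lam follows from the total count and beta
# from the observed fraction of defects caught at inspections.

def fit_basic_dtm(m, j, T):
    """m[i], j[i]: failures and inspection-found defects in cycle i."""
    K = len(m)
    total = sum(m) + sum(j)
    lam = total / (K * T)
    q = sum(j) / total            # fraction of defects caught at inspection
    # Solve (1 - exp(-beta*T)) / (beta*T) = q; the left side decreases in beta.
    lo, hi = 1e-9, 1e6
    for _ in range(200):
        beta = 0.5 * (lo + hi)
        if (1.0 - math.exp(-beta * T)) / (beta * T) > q:
            lo = beta
        else:
            hi = beta
    return lam, 0.5 * (lo + hi)

# Synthetic check: 10 cycles of length T = 1, each with 3 failures and
# 2 defects found at the inspection (illustrative data, not case-study data).
lam_hat, beta_hat = fit_basic_dtm([3] * 10, [2] * 10, T=1.0)
```

For richer models (non-perfect inspection, Weibull delay times) the separation no longer holds, and Equation 14.28 would be maximised by a general-purpose optimiser instead.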
When the times of failures are available, it is often necessary to refine the
likelihood function of Equation 14.27 by considering the detailed pattern of be-
haviour within each interval in terms of the number of failures and their associated
times. Defining $t_{ij}$ as the time of the $j$-th failure in the $i$-th inspection interval,
the likelihood is given by (Christer et al. 1998a)

$$L(\Theta)=\prod_{i=1}^{K}\Biggl(\prod_{j=1}^{m_i} v_i(t_{ij})\Biggr)e^{-E[N_f((i-1)T,iT)]}\,\frac{e^{-E[N_s(iT)]}\,E[N_s(iT)]^{j_i}}{j_i!}\,, \tag{14.29}$$

where $v_i(t)$ denotes the intensity of the failure process within the $i$-th interval.
In the case study of Christer et al. (1995), only the daily numbers of failures are
available. They formulated a different likelihood taking account of this pattern of
data. It was done essentially by formulating the probability of a particular number
of failures for each day over each inspection interval, and then the likelihood for a
particular inspection interval is just the product of these probabilities and the
probability of observing some number of defects at the inspection; see Christer et
al. (1995) for details.
A copper works in the north-west of England has used the same extrusion press for
over 30 years, and the plant is a key item in the works since 70% of its products go
through this press at some stage of their production. The machine comprises a
1700-ton oil-hydraulic extrusion press with one 1700 kW induction heater and
completely mechanized gear for the supply of billets to the press and for the
removal of the extruded products. The machine was operated 15–18 h a day (two
shifts), five days a week, excluding holidays and maintenance downtime. Preven-
tive maintenance (PM) has been carried out on this machine since 1993; it consists
of a thorough inspection of the machinery, along with any subsequent adjustments
or repairs if the defects found can be rectified within the PM period. Any major
defects which cannot be rectified during the PM time are supposed to be dealt with
during non-production hours. PM lasts about 2 h and is performed once a week, at
the beginning of each week.
Questions of concern are: (i) whether PM is or could be effective for this
machine; (ii) whether the current PM period is the right choice, particularly the
one-week PM interval, which was based upon the maintenance engineers'
subjective judgement; and (iii) whether PM is efficient, i.e. whether it can identify
most defects present and reduce the number of failures caused by those defects.
In this case study, the delay time model introduced earlier was used to address
the above questions. The first question can also be answered in part by comparing
the total downtime per week under PM with the total downtime per week
of the previous years without PM. A parallel study carried out by the company
revealed that PM has lowered the total downtime. The proportion of downtime was
reduced from 7.8% to 5.8%.
To establish the relationship between the downtime measure and the PM
activities using the delay time concept, the first task is to estimate the parameters
of the underlying delay time distribution from available data, and hence build a
model to describe the failure and PM processes. The type of delay time model used
in the study is the non-perfect inspection model.
In the original study, Christer et al. (1995), a number of different candidate
delay time distributions were considered including exponential and Weibull distri-
butions. The chosen form for the delay time distribution is a mixed distribution
consisting of an exponential distribution (scale parameter $\beta$) with a proportion $P$
of defects having a delay time of 0. The cdf is given by

$$F(h)=1-(1-P)e^{-\beta h}.$$
Inserting the optimal parameter estimates into the log-likelihood function gives
an ML value of 101.86. See Christer et al. (1995) for the analysis and the fit of the
model to the data.
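To connect the fitted distribution to the inspection-interval question, the sketch below scans candidate PM intervals under a simplified downtime-per-unit-time criterion. It uses the perfect-inspection variant for brevity (the case study actually used the non-perfect model), and all numbers are illustrative, not the case-study estimates:

```python
import math

# Sketch: using a mixed delay time distribution F(h) = 1 - (1-P)*exp(-beta*h)
# to scan candidate PM intervals T for the one minimising expected downtime
# per unit time, under perfect inspection. Expected failures per cycle:
#   lam * integral_0^T F(h) dh = lam*(T - (1-P)*(1 - exp(-beta*T))/beta)

def downtime_rate(T, lam, P, beta, d_f, d_pm):
    """Expected downtime per unit time for PM interval T (perfect inspection)."""
    exp_failures = lam * (T - (1.0 - P) * (1.0 - math.exp(-beta * T)) / beta)
    return (exp_failures * d_f + d_pm) / (T + d_pm)

# Illustrative numbers: 0.6 defects/shift, 25% of defects fail immediately,
# mean sojourn 1/beta = 4 shifts, 1.5 shifts lost per failure, 0.25 per PM.
lam, P, beta, d_f, d_pm = 0.6, 0.25, 0.25, 1.5, 0.25
grid = [t / 2 for t in range(1, 61)]            # T = 0.5 .. 30 shifts
best_T = min(grid, key=lambda T: downtime_rate(T, lam, P, beta, d_f, d_pm))
```

Very short intervals are dominated by PM downtime and very long ones by failure downtime, so the scan picks out an interior optimum.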
14.7 Conclusion
There is considerable scope for advances in maintenance modelling that could
improve the productivity of current maintenance practice. This chapter reports
upon one methodology for modelling inspection practice. The power of
mathematics and statistics is used to exploit an elementary mathematical construct
of the failure process to build operational models of maintenance interactions. The
delay time concept is
a natural one within the maintenance engineering context. More importantly, it can
be used to build quantitative models of the inspection practice of asset items, which
have proved to be valid in practice. The theory is still developing, but so far there
has been no technical barrier to developing DTM for any plant items studied.
This chapter has introduced the delay time concept and has shown how it can
be applied to various production equipment to optimise inspection intervals. To
provide substance to this statement, the processes of model parameter estimation
and case examples outlining the use of delay time modelling in practice are
introduced. We have presented only some fundamental DTMs and the associated
parameter estimation procedures; interested readers can refer to the references
listed at the end of the chapter for further reading.
14.8 Dedications
This chapter is dedicated to Professor Tony Christer who recently passed away.
Tony was a world class researcher with an international reputation. He was the
originator of the delay time concept and had produced in conjunction with others a
considerable number of papers in delay time modelling theory and applications. He
was a great man who enthused, mentored and guided many of us to strive for
higher quality research. He will be sadly missed by all who knew him.
14.9 References
Abdel-Hameed, M., (1995), Inspection, maintenance and replacement models, Computers and Operations Research, 22(4), 435–441.
Akbarov, A., Wang, W. and Christer, A.H., (2006), Problem identification in the frame of maintenance modelling: a case study, to appear in Int. J. Prod. Res.
Baker, R.D. and Wang, W., (1991), Estimating the delay time distribution of faults in repairable machinery from failure data, IMA J. Maths. Applied in Business and Industry, 4, 259–282.
Baker, R. and Wang, W., (1993), Developing and testing the delay time model, Journal of the Operational Research Society, 44(4), 361–374.
Barlow, R.E. and Proschan, F., (1965), Mathematical Theory of Reliability, Wiley, New York.
Carr, M.J. and Christer, A.H., (2003), Incorporating the potential for human error in maintenance models, J. Opl. Res. Soc., 54(12), 1249–1253.
Christer, A.H., (1976), Innovative decision making, Proceedings of the NATO Conference on the Role and Effectiveness of Theory of Decision in Practice (eds. Bowen, K.C. and White, D.J.), Hodder and Stoughton, 368–377.
Christer, A.H., (1999), Developments in delay time analysis for modelling plant maintenance, J. Opl. Res. Soc., 50, 1120–1137.
Christer, A.H. and Redmond, D.F., (1990), A recent mathematical development in maintenance theory, Int. J. Prod. Econ., 24, 227–234.
Christer, A.H. and Waller, W.M., (1984), Delay time models of industrial inspection maintenance problems, J. Opl. Res. Soc., 35, 401–406.
Christer, A.H. and Wang, W., (1995), A delay time based maintenance model of a multi-component system, IMA J. Maths. Applied in Business and Industry, 6, 205–222.
Christer, A.H. and Whitelaw, J., (1983), An Operational Research approach to breakdown maintenance: problem recognition, J. Opl. Res. Soc., 34, 1041–1052.
Christer, A.H., Wang, W., Baker, R.D. and Sharp, J.M., (1995), Modelling maintenance practice of production plant using the delay time concept, IMA J. Maths. Applied in Business and Industry, 6, 67–83.
Christer, A.H., Wang, W., Sharp, J.M. and Baker, R.D., (1997), A stochastic modelling problem of high-tech steel production plant, in Stochastic Modelling in Innovative Manufacturing, Lecture Notes in Economics and Mathematical Systems (eds. Christer, A.H., Osaki, S. and Thomas, L.C.), Springer, Berlin, 196–214.
Christer, A.H., Wang, W., Choi, K. and Sharp, J.M., (1998a), The delay-time modelling of preventive maintenance of plant given limited PM data and selective repair at PM, IMA J. Maths. Applied in Business and Industry, 9, 355–379.
Christer, A.H., Wang, W., Sharp, J.M. and Baker, R.D., (1998b), A case study of modelling preventive maintenance of production plant using subjective data, J. Opl. Res. Soc., 49, 210–219.
Christer, A.H., Wang, W. and Lee, C., (2000), A data deficiency based parameter estimating problem and case study in delay time PM modelling, Int. J. Prod. Econ., 67(1), 63–76.
Christer, A.H., Wang, W., Choi, K. and Schouten, F.A., (2001), The robustness of the semi-Markov and delay time maintenance models to the Markov assumption, IMA J. Management Mathematics, 12, 75–88.
Kaio, N. and Osaki, S., (1989), Comparison of inspection policies, Journal of the Operational Research Society, 40(5), 499–503.
Luss, H., (1983), An inspection policy model for production facilities, Management Science, 29(9), 1102–1109.
McCall, J., (1965), Maintenance policies for stochastically failing equipment: a survey, Management Science, 11(5), 493–524.
Moubray, J., (1997), Reliability Centred Maintenance, Butterworth-Heinemann, Oxford.
Ross, S.M., (1983), Stochastic Processes, Wiley, New York.
Taylor, H.M. and Karlin, S., (1998), An Introduction to Stochastic Modeling, 3rd edn., Academic Press, San Diego.
Thomas, L.C., Gaver, D.P. and Jacobs, P.A., (1991), Inspection models and their application, IMA Journal of Management Mathematics, 3(4), 283–303.
Wang, W., (1997), Subjective estimation of the delay time distribution in maintenance modelling, European Journal of Operational Research, 99, 516–529.
Wang, W., (2000), A model of multiple nested inspections at different intervals, Computers and Operations Research, 27, 539–558.
Wang, W., (2006), Modelling the probability assessment of the system state using available condition information, to appear in IMA J. Management Mathematics.
Wang, W. and Christer, A.H., (1997), A modelling procedure to optimise component safety inspection over a finite time horizon, Quality and Reliability Engineering International, 13(4), 217–224.
Wang, W. and Christer, A.H., (2003), Solution algorithms for a multi-component system inspection model, Computers and Operations Research, 30, 19–34.
Wang, W. and Jia, X., (2007), A Bayesian approach in delay time maintenance model parameters estimation using both subjective and objective data, Quality and Reliability Engineering International, 23, 95–105.
Part E
Management
15
Maintenance Outsourcing
15.1 Introduction
Every business (mining, processing, manufacturing and service-oriented businesses
such as transport, health, utilities, communication) needs a variety of equipment to
deliver its outputs. Equipment is an asset that is critical for business success in the
fiercely competitive global economy. However, equipment degrades with age and
usage and ultimately becomes non-operational, and businesses incur heavy losses
when their equipment is not in full operational mode. For example, in open cut
mining, the loss in revenue resulting from a typical dragline being out of action is
around one million dollars per day and the loss in revenue from a 747 plane being
out of action is roughly half a million dollars per day. Non-operational equipment
leads to delays in delivery of goods and services and this in turn causes customer
dissatisfaction and loss of goodwill.
Rapid changes in technology have resulted in equipment becoming more com-
plex and expensive. Maintenance action can reduce the likelihood of such equip-
ment becoming non-operational (referred to as preventive maintenance) and also
restore a non-operational unit to an operational state (referred to as corrective main-
tenance). For most businesses it is no longer economical to carry out maintenance in
house. There are a variety of reasons for this including the need for a specialist work
force and diagnostic tools that often require constant upgrading. In these situations it
is more economical to outsource the maintenance (in part or total) to an external
agent through a service contract. Campbell (1995) gives details of a survey where it
was reported that 35% of North American companies had considered outsourcing
some of their maintenance.
Consumer durables (products such as kitchen appliances, televisions, auto-
mobiles, computers, etc.) that are bought by individuals are certainly getting more
complex. A 1990 automobile is immensely more complex than its 1950 counter-
part. Customers need assurance that a new product will perform satisfactorily over
its lifetime. In the case of consumer durables, manufacturers have used warranties
to provide this assurance during the early part of a product's useful life. Under
374 D. Murthy and N. Jack
warranty the manufacturer repairs all failures that occur within the warranty period
and this is often done at no cost to the customer. The warranty period for most
consumer durables has been increasing and the warranty terms have been
becoming more favourable to the customer. For example, the typical warranty pe-
riod for an automobile in 1930 was 90 days, in 1970 it was 1 year, and in 1990 it
was 3 years. A warranty is tied to the sale of a product and the cost of servicing the
warranty is factored into the sale price. For customers who need assurance beyond
the warranty period, manufacturers and/or third parties (such as financial institu-
tions, insurance companies and independent operators) offer extended warranties
(or service contracts) at an additional cost to the customer. Extended warranties for
automobiles of 5–7 years are now fairly common.
Governments (local, state or national) own infrastructure (roads, rail and com-
munication networks, public buildings, dams, etc.) that was traditionally main-
tained by in-house maintenance departments. Here there is a growing trend towards
outsourcing these maintenance activities to external agents so that the governments
can focus on their core activities.
In all the above cases, we have an asset (complex equipment, consumer durable
or an element of public infrastructure) that is owned by the first party (the owner)
and the asset maintenance is outsourced to the second party (the service agent who
is also referred to as the contractor in many technical papers) under a service
contract. This chapter deals with maintenance outsourcing from the perspectives of
both the owner (the customer for the maintenance service) and the service agent
(the service provider). We focus on the first case (where the customer is a
business) and we develop a framework to indicate the different issues involved,
carry out a review of the literature, and indicate topics that need further investiga-
tion and research.
The outline of the chapter is as follows. Section 15.2 deals with the customer
and the agent perspectives. In Section 15.3, we propose a framework to study main-
tenance outsourcing. Section 15.4 reviews the relevant literature on maintenance
outsourcing and on extended warranties. Section 15.5 deals with a game theoretic
approach to maintenance outsourcing and extended warranties. In Section 15.6 we
briefly discuss agency theory and its relevance to maintenance outsourcing and, in
Section 15.7 we conclude with a brief discussion of future research in maintenance
outsourcing.
15.2.1.1 Businesses
Businesses (producing products and/or services) need to come up with new
solutions and strategies to develop and increase their competitive advantage.
Outsourcing is one of these strategies that can lead to greater competitiveness
(Embleton and Wright 1998). It can be defined as a managed process of trans-
ferring activities performed in-house to some external agent. The conceptual basis
for outsourcing (see Campbell 1995) is as follows:
1. Domestic (in-house) resources should be used mainly for the core com-
petencies of the company.
2. All other (support) activities that are not considered strategic necessities,
and/or for which the company does not possess adequate competences
and skills, should be outsourced (provided there is an external agent
who can carry out these activities in a more efficient manner).
Most businesses tend not to view maintenance as a core activity and have
moved towards outsourcing it. The advantages of outsourcing maintenance are as
follows:
For very specialised (and custom built) products, the knowledge to carry out the
maintenance and the spares needed for replacement need to be obtained from the
original equipment manufacturer (OEM). In this case, the customer is forced into
having a maintenance service contract with the OEM and this can result in a non-
competitive market. In the USA, Section II of the Sherman Act (Khosrowpour
1995) deals with this problem by making it illegal for OEMs to act in this manner.
When the maintenance service is provided by an agent other than the original
equipment manufacturer (OEM) often the cost of switching prevents customers
from changing their service agent. In other words, customers get locked in and
are unable to do anything about it without a major financial consequence.
Figure 15.1. Different parties that need to be considered in the maintenance of infrastructures (regulator, government, owner and operator)
[Figure: the owner (the customer) linked to the service agent through a service contract]
environment. The stress can be thermal, mechanical, electrical, etc., and the
reliability decreases as the stress increases and/or the environment gets harsher.
When a failure occurs, the asset can be restored to an operational state through
corrective maintenance (CM). In the case of equipment, this involves repairing or
replacing the failed components. In the case of the road example, the CM involves
filling the potholes and resealing a section of the road. The degradation in the asset
state can be controlled through use of preventive maintenance (PM) and, in the
case of equipment, this involves regular monitoring and replacing of components
before failure.
The asset state at any given time (subsequent to it being put into operation) is a
function of its inherent reliability and past history of usage and maintenance. This
information is important in the context of maintenance service contracts for used
assets. The information that the service agent (and the customer) has can vary from
very little to a lot (if detailed records of past usage and maintenance have been kept).
Finally, for some assets, the delivery of maintenance requires the service agent
to visit the site where the asset is located (for example, lifts in buildings and roads)
and for others (most consumer durables and some industrial equipment) the failed
asset can be brought to a service centre to carry out the maintenance actions.
15.3.2 Maintenance
System level modeling If only CM and no PM is used, and the time to repair is very
much smaller than the time between failures, then one can model failures over time
as a stochastic point process with an intensity function $\lambda(t)$ that is increasing with
$t$ (time or age) to capture the degradation with time (see Rigdon and Basu 2000).
The effect of operating stress and operating environment can be modeled through a
Cox regression model, where the intensity function is modified to $g(z)\lambda(t)$ and
$z$ is the vector of covariates representing the stress and environmental variables
(see Cox and Oakes 1984).
The effect of PM actions can be modeled through a reduction in the intensity
function, as shown in Figure 15.3. The level of PM (indicated by $\delta$ in the figure)
determines the reduction in the intensity function, and the cost of a PM action
increases with the level of PM.
Figure 15.3. Effect of PM actions on the intensity function (intensity plotted against time, with PM actions at times $T_1$ and $T_2$)
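One concrete way to realise the reduction sketched in Figure 15.3 is an age-reduction model, in which a PM at level delta resets the effective age $x$ to $(1-\delta)x$, so the intensity drops at every PM epoch. This particular form, and all numbers below, are illustrative assumptions rather than something the chapter prescribes:

```python
# Sketch: effect of PM on a failure intensity, using an age-reduction model
# as one illustrative realisation of Figure 15.3. The intensity is
# v(x) = c*x in effective age x, and each PM at level delta resets the
# effective age from x to (1 - delta)*x.

def expected_failures(horizon, pm_interval, delta, c=0.1):
    """Expected failures over [0, horizon] with PM every pm_interval units.
    A spell from effective age x0 to x0 + s contributes the integral of
    v(x) = c*x over that range, i.e. c*((x0 + s)**2 - x0**2)/2."""
    total, age, t = 0.0, 0.0, 0.0
    while t < horizon:
        s = min(pm_interval, horizon - t)
        total += c * ((age + s) ** 2 - age ** 2) / 2.0
        age = (1.0 - delta) * (age + s)   # PM reduces the effective age
        t += s
    return total

no_pm = expected_failures(100.0, 100.0, 0.0)    # no PM within the horizon
with_pm = expected_failures(100.0, 10.0, 0.8)   # PM every 10 units, delta = 0.8
```

Comparing the two quantities shows how a higher PM level delta trades extra PM cost against fewer expected failures, which is the trade-off the contract scenarios below must price.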
15.3.3 Contract
The contract is a legal document that is binding on both parties (customer and
service agent) and it needs to deal with technical, management and economic
issues.
In scenario S-1, the service agent only provides the resources (workforce and
material) to execute the work. This corresponds to a minimalist approach to
outsourcing. In scenario S-2, the service agent decides how and when the work is
done, while what is to be done is decided by the customer. Finally, in scenario S-3,
the service agent makes all three decisions.
There is a growing trend towards functional guarantee contracts. Here the contract
specifies a level for the output generated from equipment, for example, the amount
of electricity produced by a power plant, or the total length of flights and number of
landings and takeoffs per year. The service agent has the freedom to decide on the
maintenance needed (subject to operational constraints), with incentives and/or
penalties depending on whether or not the target levels are met. For more on this,
see Kumar and Kumar (2004).
In the context of infrastructures, there is a trend towards giving the service
agent the responsibility for ongoing upgrades or the responsibility for the initial
design resulting in a BOOM (build, own, operate and maintain) contract.
The levels of risk to both parties vary with the contract scenario.
The literature deals with maintenance outsourcing mainly from the customer
perspective and is focussed on management issues. More specifically, attempts are
made to address one or more of the following questions in a qualitative manner:
Some of the relevant papers are Campbell (1995), Judenberg (1994), Martin
(1997), Levery (1998) and Sunny (1995).
Unfortunately, cost has often been the sole basis used by businesses for making
maintenance outsourcing decisions. Sunny (1995) looks at which activities are to
be outsourced by considering the long-term strategic dimension (core
competencies) as well as the short-term cost issues.
Bertolini et al. (2004) take a quantitative approach and use the analytic hierarchy
process (AHP) to make decisions regarding the outsourcing of maintenance.
Asgharizadeh and Murthy (2000) and Murthy and Asgharizadeh (1998, 1999)
look at maintenance outsourcing from both customer and service agent perspec-
tives and propose game-theoretic models to determine the optimal strategies for
both parties. This approach is discussed further in Section 15.5.
On the application side, Armstrong and Cook (1981) look at clustering of
highway sections for awarding maintenance contracts to minimise the cost and use
a fixed-charge goal programming model to determine the optimal strategy.
Bevilacqua and Braglia (2000) illustrate their AHP model in the context of an
Italian brick manufacturing business having to make decisions regarding main-
tenance outsourcing.
Stremersch et al. (2001) look at the industrial maintenance market.
Consider the case where the service agent is the leader and offers $n$ options
$A_i(\theta_i)$, $1\le i\le n$, to the customer, where $\theta_i$, $1\le i\le n$, are the decision variables
corresponding to the different options that the agent needs to select optimally. As
an illustrative case, let $n=2$ and let the two options that the service agent offers
for CM actions be as follows:
Option 1 [Fixed price service contract $A_1(\theta_1)$]: For a fixed price $P$, the
service agent agrees to rectify all failures occurring over a period $L$ at no
additional cost to the customer. If a failure is not rectified within a period $\tau$, the
service agent incurs a penalty. If $Y$ denotes the time for which the equipment is in
the non-operational state before it becomes operational, then the penalty incurred is
given by $\alpha\max\{0,(Y-\tau)\}$, where $\alpha$ is the penalty cost per unit time. This
ensures that the service agent does not deprive the customer of the use of the
equipment for too long. Here, $\theta_1=\{P,\tau,\alpha\}$.
Option 2 [Pay for each repair contract $A_2(\theta_2)$]: In this case, whenever a
failure occurs, the service agent charges an amount $C_s$ for each repair and does not
incur any penalty if the equipment is in the non-operational state for more than $\tau$
units of time. Here, $\theta_2=\{C_s\}$.
In the Stackelberg game formulation, given the set of options (along with the
values of the service agent's decision variables), the customer chooses the best
option to optimize his/her goal. This generates the optimal response function
$A^*(\theta_1,\theta_2,\ldots,\theta_n)$, as shown in Figure 15.5. Using this, the service agent then
optimally selects the decision variables to optimize his/her objective.
Figure 15.5. Stackelberg game formulation: the service agent offers options $A_i(\theta_i)$, $1\le i\le n$, and the customer returns the optimal response $A^*(\theta_1,\theta_2,\ldots,\theta_n)$
Murthy and Asgharizadeh (1998, 1999) and Asgharizadeh and Murthy (2000)
use a Stackelberg game formulation for a special case where the time between
equipment failures is given by an exponential distribution so that the failures over
time occur according to a Poisson process. They consider the two options dis-
cussed earlier and consider the following three cases:
In case 1 the service agent has to decide the optimal number of customers to
service and in case 3 he has to decide the optimal number of repair facilities.
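A stripped-down numerical sketch of this leader–follower interaction is given below. The penalty term, repair-time effects and queueing considerations of the cited models are omitted for brevity, and the participation threshold R (the customer's cost of the in-house alternative) is an added assumption of this illustration:

```python
# Sketch: a minimal Stackelberg interaction between the service agent
# (leader) and a customer (follower). Failures over the contract period L
# are Poisson with rate lam; each repair costs the agent c_r. The customer
# picks the cheaper of Option 1 (fixed price P) and Option 2 (C_s per
# repair), and declines any contract whose expected cost exceeds R.

def customer_cost(P, C_s, lam, L):
    """Follower's best response: expected cost of the cheaper option."""
    return min(P, C_s * lam * L)

def agent_profit(P, C_s, lam, L, c_r, R):
    cost = customer_cost(P, C_s, lam, L)
    if cost > R:                      # participation constraint: no contract
        return 0.0
    return cost - c_r * lam * L       # revenue equals the customer's outlay

# Leader's problem: grid-search the price pair (P, C_s).
lam, L, c_r, R = 2.0, 5.0, 10.0, 250.0
best_profit, best_P, best_Cs = max(
    (agent_profit(P, C_s, lam, L, c_r, R), P, C_s)
    for P in range(50, 301, 10)
    for C_s in range(5, 41, 5)
)
```

Even in this toy version, the leader extracts the customer's full willingness to pay (revenue is driven up to R), which is the qualitative behaviour the full models refine with penalties and repair-time constraints.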
Jack and Murthy (2006) consider the case where the product is complex and so the
specialist knowledge of the manufacturer is required to carry out any repairs after
the base warranty expires. The consumer must decide how long to keep the item
and how to maintain it until replacement. Two maintenance options are available:
the consumer can (i) pay the manufacturer to repair the item each time it fails, or
(ii) purchase an extended warranty (EW) from the manufacturer. These are similar
to Options 2 and 1 respectively, discussed earlier. The EW contract specifies that
the manufacturer will again rectify all failures free of charge to the consumer. The
consumer has flexibility in choosing when the EW will begin and the length of
cover. The price of the EW depends on these two variables and is set by the
manufacturer. The manufacturer also has to decide the price of each repair if the
item fails and the consumer does not have an EW. A Stackelberg game formulation
is used to determine the optimal strategies for both the consumer and the manu-
facturer.
[Figure: the principal–agent relationship under a contract, and the issues involved: costs, incentives, monitoring, informational asymmetry and risk preferences]
Moral hazard. Moral hazard refers to lack of effort (or shirking) on the part of the
agent. The agent does not put in the agreed-upon effort because the objectives of
the two parties are different and the principal cannot assess the level of effort that
the agent has actually used.
Adverse selection. Adverse selection refers to any misrepresentation of ability by
the agent and the principal is unable to completely verify this before deciding to
hire the agent.
Information. To counteract adverse selection, the principal can invest in getting information about the agent's ability. One way of getting the desired information is by contacting people for whom the agent has provided service in the past.
Monitoring. The principal can counteract the moral hazard problem by monitoring the actions of the agent. Monitoring provides information about the agent's actual actions.
Information asymmetry. There are several uncertainties that affect the overall
outcome of the relationship. The two parties, in general, will have different infor-
mation to make an assessment of these uncertainties and will also differ in terms of
other information.
Risk. This results from the different uncertainties that affect the outcome of the
relationship. The risk attitude of the two parties, in general, will differ for a variety
of reasons. A problem arises when this disagreement is over the allocation of risk
between the two parties.
Costs. There are various kinds of costs for both parties. Some of these depend on the outcome (which is influenced by uncertainties); others arise from acquiring information, monitoring, and administering the contract. At the heart of principal-agent theory is the trade-off between (i) the cost of monitoring the actions of the agent and (ii) the cost of measuring the outcomes of the relationship and transferring risk to the agent.
Maintenance Outsourcing 389
Contract. The design of the contract that takes into account the issues discussed
above is the challenge that lies at the heart of the principal-agent relationship.
warranties to meet the different needs across the customer population. Agency theory offers a framework to evaluate the costs of different policies, taking into account all the relevant issues.
15.8 References
Armstrong, R.D. and Cook, W.D. (1981), The contract formation problem in preventive pavement maintenance: A fixed-charge goal-programming model, Comp. Environ. Urban Systems, 6, 147–155
Ashgarizadeh, E. and Murthy, D.N.P. (2000), Service contracts – a stochastic model, Mathematical and Computer Modelling, 31, 11–20
Barlow, R.E. and Hunter, L.C. (1960), Optimum preventive maintenance policies, Operations Research, 8, 90–100
Bertolini, M., Bevilacqua, M., Braglia, M. and Frosolini, M. (2004), An analytical method for maintenance outsourcing service selection, International Journal of Quality & Reliability Management, 21, 772–788
Bevilacqua, M. and Braglia, M. (2000), The analytic hierarchy process applied to maintenance strategy selection, Reliability Engineering & System Safety, 70, 71–83
Biedenweg, F.M. (1981), Warranty Analysis: Consumer Value vs. Manufacturer's Cost, Unpublished Ph.D. Thesis, Stanford University, U.S.A.
Blischke, W.R. and Murthy, D.N.P. (1994), Warranty Cost Analysis, Marcel Dekker, New York
Blischke, W.R. and Murthy, D.N.P. (1996), Product Warranty Handbook, Marcel Dekker, New York
Blischke, W.R. and Murthy, D.N.P. (2000), Reliability, Wiley, New York
Campbell, J.D. (1995), Outsourcing in maintenance management: a valid alternative to self-provision, Journal of Quality in Maintenance Engineering, 1, 18–24
Cho, D. and Parlar, M. (1991), A survey of maintenance models for multi-unit systems, European Journal of Operational Research, 51, 1–23
Cox, D.R. and Oakes, D. (1984), Analysis of Survival Data, Chapman and Hall, New York
Day, E. and Fox, R.J. (1985), Extended warranties, service contracts and maintenance agreements – A marketing opportunity? Journal of Consumer Marketing, 2, 77–86
Dekker, R., Wildeman, R.E. and van der Duyn Schouten, F.A. (1997), Review of multi-component models with economic dependence, ZOR/Mathematical Methods of Operations Research, 45, 411–435
Desai, P.S. and Padmanabhan, V. (2004), Durable good, extended warranty and channel coordination, Review of Marketing Science, 2, Article 2, available at www.bepress.com/romsjournal/vol2/iss1/art2
Dunn, S. (1999), Maintenance outsourcing – Critical issues, available at: www.plant-maintenance.com/maintenance_articles_outsources.html
Eisenhardt, K.M. (1989), Agency theory: An assessment and review, The Academy of Management Review, 14, 57–74
Embleton, P.R. and Wright, P.C. (1998), A practical guide to successful outsourcing, Empowerment in Organizations, 6(3), 94–106
Eppen, G.D., Hanson, W.A. and Martin, R.K. (1991), Bundling – new products, new markets, low risks, Sloan Management Review, Summer, 7–14
Hollis, A. (1999), Extended warranties, adverse selection and aftermarkets, The Journal of Risk and Insurance, 66, 321–343
Iskandar, B.P. and Murthy, D.N.P. (2003), Repair-replace strategies for two-dimensional warranty policies, Mathematical and Computer Modelling, 38, 1233–1241
Iskandar, B.P., Murthy, D.N.P. and Jack, N. (2005), A new repair-replace strategy for items sold with a two-dimensional warranty, Computers and Operations Research, 32, 669–682
Jack, N. and Murthy, D.N.P. (2001), A servicing strategy for items sold under warranty, Jr. Oper. Res. Soc., 52, 1284–1288
Jack, N. and Murthy, D.N.P. (2006), A flexible extended warranty and related optimal strategies, Jr. Oper. Res. Soc. (accepted for publication)
Jack, N. and Van der Duyn Schouten, F. (2000), Optimal repair-replace strategies for a warranted product, Int. J. Production Economics, 67, 95–100
Jardine, A.K.S. and Buzacott, J.A. (1985), Equipment reliability and maintenance, European Journal of Operational Research, 19, 285–296
Judenberg, J. (1994), Applications maintenance outsourcing, Information Systems Management, 11, 34–38
Khosrowpour, M. (ed) (1995), Managing Information Technology Investments with Outsourcing, Idea Group Publishing, Harrisburg
Kraus, S. (1996), An overview of incentive contracting, Artificial Intelligence, 83, 297–346
Kumar, R. and Kumar, U. (2004), Service delivery strategy: Trends in mining industries, Int. J. Surface Mining, Reclamation and Environment, 18, 299–307
392 D. Murthy and N. Jack
Martin, H.H. (1997), Contracting out maintenance and a plan for future research, Journal of Quality in Maintenance Engineering, 3, 81–90
McCall, J.J. (1965), Maintenance policies for stochastically failing equipment: A survey, Management Science, 11, 493–524
Murthy, D.N.P. and Ashgarizadeh, E. (1998), A stochastic model for service contracts, Int. Jr. of Reliability Quality and Safety Engineering, 5, 29–45
Murthy, D.N.P. and Ashgarizadeh, E. (1999), Optimal decision making in a maintenance service operation, European Journal of Operational Research, 116, 259–273
Murthy, D.N.P. and Djamaludin, I. (2002), Product warranty – A review, International Journal of Production Economics, 79, 231–260
Nguyen, D.G. (1984), Studies in Warranty Policies and Product Reliability, Unpublished Ph.D. Thesis, The University of Queensland, Australia
Nguyen, D.G. and Murthy, D.N.P. (1986), An optimal policy for servicing warranty, Jr. Oper. Res. Soc., 37, 1081–1088
Nguyen, D.G. and Murthy, D.N.P. (1989), Optimal replace-repair strategy for servicing items sold with warranty, Euro. Jr. of Oper. Res., 39, 206–212
Padmanabhan, V. (1995), Usage heterogeneity and extended warranties, Journal of Economics and Management Strategy, 4, 33–53
Padmanabhan, V. (1996), Extended warranties, in Product Warranty Handbook, W.R. Blischke and D.N.P. Murthy (eds), Marcel Dekker, New York
Padmanabhan, V. and Rao, R.C. (1993), Warranty policy and extended warranties: theory and an application to automobiles, Marketing Science, 12, 230–247
Pierskalla, W.P. and Voelker, J.A. (1976), A survey of maintenance models: The control and surveillance of deteriorating systems, Naval Research Logistics Quarterly, 23, 353–388
Pintelton, L.M. and Gelders, L. (1992), Maintenance management decision making, European Journal of Operational Research, 58, 301–317
Rigdon, S.E. and Basu, A.P. (2000), Statistical Methods for the Reliability of Repairable Systems, Wiley, New York
Ross, S.M. (1980), Stochastic Processes, Wiley, New York
Sahin, I. and Polatoglu, H. (1998), Quality, Warranty and Preventive Maintenance, Kluwer, Amsterdam
Scarf, P.S. (1997), On the application of mathematical models to maintenance, European Journal of Operational Research, 63, 493–506
Sherif, Y.S. and Smith, M.L. (1986), Optimal maintenance models for systems subject to failure – A review, Naval Research Logistics Quarterly, 23, 47–74
Stremersch, S., Wuyts, S. and Frambach, R.T. (2001), The purchasing of full-service contracts: An exploratory study within the industrial maintenance market, Industrial Marketing Management, 30, 1–12
Sunny, I. (1995), Outsourcing maintenance: making the right decisions for the right reasons, Plant Engineering, 49, 156–157
Thomas, L.C. (1986), A survey of maintenance and replacement models for maintainability and reliability of multi-item systems, Reliability Engineering, 16, 297–309
UK Competition Commission (2003), A report into the supply of extended warranties on domestic electrical goods within the UK, available at: www.competition-commission.org.uk/inquiries/completed/2003/warranty/index.htm
Valdez-Flores, C. and Feldman, R.M. (1989), A survey of preventive maintenance models for stochastically deteriorating single-unit systems, Naval Research Logistics Quarterly, 36, 419–446
Van Ackere, A. (1993), The principal-agent paradigm: Its relevance to various functional fields, European Journal of Operational Research, 70, 83–103
16.1 Introduction
Businesses need equipment to produce their outputs (goods/services). Equipment
degrades with age and usage, and eventually fails (Blischke and Murthy 2000).
This impacts business performance in several ways – reduced equipment availability, lower output quality, higher operating costs, increased customer dissatisfaction, etc. The degradation can be controlled through preventive maintenance (PM)
actions whilst corrective maintenance (CM) actions restore failed equipment to its
working state.
Prior to 1970, businesses owned the equipment, and maintenance was done in
house. Since 1970, there has been a shift towards outsourcing of maintenance. This
was primarily due to a change in the management paradigm where activities in a
business were classified as either core or non-core, with the non-core activities to
be outsourced to external agents if this was deemed to be cost effective. Also, as
technology became more complex it was no longer economical to carry out in-
house maintenance due to the need for expensive maintenance equipment and
highly trained maintenance staff.
Since 1990, there has been an increasing trend towards leasing rather than
owning equipment. According to Fishbein et al. (2000) there are several reasons
for this. Some of these are as follows:
Rapid technological advances have resulted in improved equipment appearing on the market, making the earlier generation of equipment obsolete at an ever-increasing pace.
The cost of owning equipment has been increasing very rapidly.
Businesses view maintenance as a non-core activity.
It is often economical to lease equipment, rather than buy, as this involves less initial capital investment, and often there are tax benefits that make it attractive.
396 D. Murthy and J. Pongpech
Abbreviations
AFT: Accelerated failure time
PH: Proportional hazard
NHPP: Non-homogeneous Poisson process
ROCOF: Rate of occurrence of failure
CM: Corrective maintenance
PM: Preventive maintenance
Notation
F(t): Failure distribution for the time to first failure of new equipment
f(t), r(t): Failure density and hazard functions associated with F(t)
λ0(t): Intensity function with only CM actions
Maintenance of Leased Equipment 397
There are several types of leases but, unfortunately, there is no standard terminol-
ogy. The terms used in the USA often differ from those used in the UK. We briefly
discuss the three main types.
In the USA, the Internal Revenue Code defines a true lease as a transaction that allows the lessor to claim ownership and the lessee to claim rental payments as tax deductions.
The advantages and disadvantages of an operating lease from the lessee's perspective are as follows:
Advantages
The lessee can obtain new equipment (based on the latest technologies) and
thus avoid the risks associated with equipment obsolescence.
The lessee usually gets maintenance and other support from the lessor so that the business can focus on core activities.
Equipment disposal is the lessor's responsibility.
Disadvantages
If the lessee's needs change over the lease period, then premature termination of the lease agreement can incur penalties.
The risk that the lessor does not provide the level of maintenance needed.
Advantages
The lessee is able to spread the payments over the lease period (no need for
initial cash at purchase).
It offers greater flexibility as the lessee can choose from a range of lease options – especially in the consumer product market, where there are several institutions offering different types of leases.
Disadvantages
If the lessee fails to make lease payments as per schedule, the leased equip-
ment can be repossessed and sold by the lessor to recover the payments
due.
Maintenance of Leased Equipment 399
Maintenance is often not a part of the lease agreement, so the lessee has to provide for this separately.
The overall cost to the lessee is significantly higher than the purchase price of the equipment because the payments include not only the financing costs, but also other costs associated with insurance, taxes, etc.
[Figure: the parties involved – customer (user), owner, service provider, equipment (asset), outputs (products/services), operator, government and regulator]
Customer: The customer is the lessee. The lessee can be an individual (purchasing a car under a finance lease), a business (operating industrial or commercial equipment under an operating lease) or a government agency (responsible for operating an infrastructure, such as a train network, under a buyback lease).
Equipment: Equipment can be an infrastructure (for example, parts of road net-
work, railway network, sewerage and water network, electricity network, etc.);
400 D. Murthy and J. Pongpech
industrial equipment (for example, trucks, cranes, plant machinery, etc.); commer-
cial equipment (for example, office furniture, vending machines, photocopiers,
etc.) and, consumer products (for example, refrigerators, computers, etc.). The cost
of the equipment (or asset) can vary significantly. Ezzel and Vora (2001) give some interesting statistics relating to sale and leaseback, and operating leases in the USA over the period 1984–1991.
Owner: The owner is a person or agency that owns the equipment from a legal
point of view. In the case of a finance lease, the financial institution is the owner as
the equipment is mortgaged to the institution.
Service provider: In the case of an operating lease, the lessor is the service pro-
vider. However, if the lessor decides to outsource the maintenance to some external
service agent, then the agent is the service provider. In the case of a finance lease,
the lessee is responsible for the maintenance and might decide to outsource it to an
external agent.
Outputs (products/services): If the lessee is a business, then the leased equipment is used to produce its outputs – goods and/or services – as discussed in Section 16.1.
For consumer goods, the output is the utility (in the case of a kitchen appliance) or
the satisfaction (in the case of a television) derived by the lessee.
Operator: In general, the lessee is the operator of the equipment. However, the
lessee, in turn, might hire some other business to operate the equipment and
produce the desired outputs. An example of this is a business that leases a fleet of
aircraft, then outsources the flying to another business that employs the crew and
operates the planes.
Government: Government plays an important role in the context of sale and buy-
back leases of infrastructure. The lessee can be a department of the government or
an independent unit acting as a proxy for the government. Decisions relating to subsidies, tax incentives, etc., are made by the government and have a significant impact on the lease structure.
Regulator: This applies mainly to equipment used in certain industry sectors (such as health, transport and energy) where public safety is of great concern. The regulator is often an independent body that monitors and makes recommendations that can be binding on the owners and operators of equipment.
Vickerman (2004) deals with the infrastructure maintenance issues in the
context of rail and road transport in the UK and discusses the role of government
and regulators. Interested readers should consult the references cited in the paper
for more details.
There are many different scenarios depending on the number of parties involved.
Table 16.1 gives three different scenarios involving four parties. Other scenarios
can include additional parties such as the government and/or the regulator.
In the remainder of the chapter we focus our attention on industrial and
commercial equipment leased under an operating lease and this corresponds to
Scenario 1.
Maintenance of Leased Equipment 401
According to Baker and Hayes (1981), some of the pioneers in business equipment
leasing were IBM and Xerox. Since then, the number of businesses that lease
business equipment has grown significantly and many kinds of equipment are
leased. ELA (2005) gives a list of some of the businesses leasing their products
under operating leases.
We focus on the maintenance (provided by the lessor) of equipment leased
under an operating lease.1 A framework to study this involves several key elements
and these are indicated in Figure 16.2.
[Figure 16.2: key elements of the framework, including the lease contract and the equipment]
Lessor: The lessor is not only the owner of the leased equipment, but also the
maintenance service provider. The lessor is a business (either manufacturer or
some other entity) and as such has certain business objectives. At the strategic level these can include issues such as ROI, market share, profits, etc. In order to achieve these objectives, the lessor needs to have proper strategies at the strategic level (to deal with issues such as the type and number of equipment to lease, upgrade options to
1 In the case of a finance lease, the lessee has the option of either doing the maintenance in house or outsourcing it to some third party. For more on maintenance outsourcing, see Deelen et al. (2003).
The literature on equipment leasing deals with a variety of issues. For a broader
overview see Baker and Hayes (1981), Schallheim (1994) and Coyle (2000).
The bulk of the literature deals with issues from the lessee's perspective, and these can be broadly divided into two groups: (a) management oriented and (b) economics and finance oriented. The management-oriented literature is mainly qualitative and deals with the following issues:
Buy vs. lease options through proper cost and benefit analysis
Selection of the most appropriate lease option
Negotiating the terms of the lease option
Administration of lease contracts
See Deelen et al. (2003) and ELA (2005) for more details.
The economics and finance oriented literature looks at both the lessor and
lessee perspectives and the leased equipment market resulting from the interaction
between these two parties. Ezzel and Vora (2001), Sharpe and Nguyen (1995),
Desai and Purohit (1998), Stremersch et al. (2001), Handa (1991) and Kim et al.
(1978) are an illustrative sample where readers can find more details.
The literature on maintenance is vast and there are many survey papers and books on the topic. They deal with a range of issues – determining optimal maintenance strategies, planning and implementation of maintenance actions, logistics of maintenance, etc. References to these can be found in review/survey papers (McCall 1965; Pierskalla and Voelker 1976; Sherif and Smith 1986; Jardine and Buzacott 1985; Gits 1986; Thomas 1986; Valdez-Flores and Feldman 1989; Cho and Parlar 1991; Pintelton and Gelders 1992; Dekker et al. 1997; Scarf 1997).
There are very few papers dealing with the maintenance of leased equipment and
these will be discussed later in the chapter.
One needs to differentiate between first and subsequent failures. The first failure
depends on the age of the equipment (in the case of used equipment) and the sub-
sequent failures depend on the type of CM actions (to rectify failures) and the PM
actions (to avoid failures).
f(t) = dF(t)/dt   and   r(t) = f(t)/[1 − F(t)]                (16.1)
respectively. In the case of used equipment, let A denote the age at the start of the
lease. Then, the time to first failure is given by the conditional failure distribution function
F(t | A) = [F(t) − F(A)] / [1 − F(A)],   t ≥ A.               (16.2)
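The conditional distribution of Equation 16.2 can be sampled directly by inverse transformation; the Weibull lifetime and all parameter values below are assumptions for illustration only.

```python
import random, math

# Sketch of Equation 16.2: the time to first failure of used equipment of
# age A follows F(t|A) = (F(t) - F(A)) / (1 - F(A)) for t >= A.
ALPHA, BETA = 4.0, 2.0   # assumed Weibull scale and shape
A = 3.0                  # age of the equipment at the start of the lease

def F(t):
    """Unconditional Weibull failure distribution."""
    return 1.0 - math.exp(-((t / ALPHA) ** BETA))

def F_cond(t, age):
    """Conditional distribution of the time to first failure, t >= age."""
    return (F(t) - F(age)) / (1.0 - F(age))

def sample_first_failure(age, rng=random):
    """Inverse-transform sampling: solve F(t|age) = U for t."""
    u = rng.random()
    p = F(age) + u * (1.0 - F(age))      # unconditional quantile level
    return ALPHA * (-math.log(1.0 - p)) ** (1.0 / BETA)

samples = [sample_first_failure(A) for _ in range(20000)]
print(min(samples), sum(samples) / len(samples))
```

Every sampled first-failure time is at least A, as the conditioning requires; with an increasing failure rate the mean residual life shrinks as the age A grows.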
If a repair restores the equipment to the condition it was in just before it failed, then it is called minimal repair (see Barlow and Hunter 1960). This is
appropriate for complex equipment where the equipment failure is due to failure of
one or a few components. The equipment becomes operational by replacing (or re-
pairing) the failed components. This action has very little impact on the reliability
characteristics of the equipment. If the failure rate changes (in either direction)
after repair, it is called imperfect repair. Many different types of imperfect repair
models have been proposed and for a review of such models see Pham and Wang
(1996).
The time to repair is in general a random variable and needs to be modeled by a distribution function. Typically, the time to repair is very much smaller than the time between failures (in a statistical sense), so one can ignore it and treat repair as being instantaneous for determining failures over time. With this assumption, the failures over time (with only CM actions) occur according to a non-homogeneous Poisson process (NHPP) with intensity function λ0(t) = r(t), the hazard function defined earlier. The intensity function (characterizing the failures over time) is also referred to as the rate of occurrence of failure (ROCOF).
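An NHPP of this kind is easy to simulate by Lewis-Shedler thinning; the Weibull-form hazard and all parameter values below are assumptions for illustration.

```python
import random

# Sketch: with instantaneous minimal repairs, failures occur according to an
# NHPP with intensity lambda0(t) = r(t). One lease period is simulated by
# thinning a homogeneous Poisson process with a bounding rate.
ALPHA, BETA, L = 1.0, 2.0, 5.0   # assumed scale, shape and lease length (years)

def lambda0(t):
    """Assumed Weibull-form ROCOF."""
    return (BETA / ALPHA) * (t / ALPHA) ** (BETA - 1.0)

def simulate_nhpp(rng=random):
    """Failure epochs on [0, L] via Lewis-Shedler thinning."""
    lam_max = lambda0(L)                  # valid bound: intensity is increasing
    t, failures = 0.0, []
    while True:
        t += rng.expovariate(lam_max)     # candidate epoch from HPP(lam_max)
        if t > L:
            return failures
        if rng.random() < lambda0(t) / lam_max:   # accept with ratio
            failures.append(t)

# E[N(L)] = integral of lambda0 over [0, L] = (L/ALPHA)**BETA = 25 here.
runs = [len(simulate_nhpp()) for _ in range(2000)]
print(sum(runs) / len(runs))
```

The observed mean count should be close to the analytic value (L/α)^β = 25, which is how one can sanity-check the thinning step.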
The cost of repair is also a random variable and needs to be modeled by a distribution function. Let Cf denote the average cost of each minimal repair.
λ(tj+) = λ(tj−) − δj                                          (16.3)
0 ≤ δj ≤ λ(tj−) − λ0(0)                                       (16.4)
This implies that a PM action cannot make the equipment better than new.
As a result, if PM actions are carried out at time instants tj, j ≥ 1, with the reduction in the intensity function given by δj, j ≥ 1, then the intensity function is given by
λ(t) = λ0(t) − Σ_{i=0}^{j} δi,   tj < t < tj+1,               (16.5)
for j ≥ 0, with t0 = 0 and δ0 = 0. This implies that the reduction resulting from the PM action at tj lasts for all t ≥ tj, as shown in Figure 16.4.
Figure 16.4. Effect of PM action on the intensity function for new equipment
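The bookkeeping of Equation 16.5 can be sketched numerically: each PM action knocks δj off the intensity, the reduction persists, and E[N(L)] is the integral of the reduced intensity over the lease period. The intensity parameters and the PM schedule below are assumptions for illustration.

```python
# Sketch of Equation 16.5 with an assumed Weibull-form baseline intensity
# and an arbitrary PM schedule of (t_j, delta_j) pairs.
ALPHA, BETA, L = 1.0, 2.0, 5.0
PM = [(1.0, 0.5), (2.0, 0.8), (3.0, 1.0), (4.0, 1.2)]   # (t_j, delta_j)

def lambda0(t):
    return (BETA / ALPHA) * (t / ALPHA) ** (BETA - 1.0)

def intensity(t):
    """Reduced intensity: baseline minus all reductions applied so far."""
    reduction = sum(d for tj, d in PM if tj <= t)
    return lambda0(t) - reduction

def expected_failures(n_steps=100000):
    """E[N(L)] = integral of the intensity over [0, L] (midpoint rule)."""
    h = L / n_steps
    return sum(intensity((i + 0.5) * h) for i in range(n_steps)) * h

# Without PM the integral is 25; each delta_j removes delta_j * (L - t_j),
# i.e. 0.5*4 + 0.8*3 + 1.0*2 + 1.2*1 = 7.6, so E[N(L)] is about 17.4 here.
print(round(expected_failures(), 2))
```

This additive structure is what makes the optimisation below tractable: the benefit of each reduction δj is simply δj times the remaining lease duration.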
The cost of each PM action depends on the reduction in the intensity function. Let Cp(δ) denote the cost of a PM action with reduction δ; this is an increasing function of δ.
Figure 16.5. Effect of upgrade action on the intensity function for used equipment
The cost of this type of PM action depends on the reduction in the virtual age and is modeled by a function Cu(x), which is an increasing function of x.
16.4.2 Penalties
Both the lessor and the lessee can incur penalties if they violate the terms of the
contract. In the case of the lessee, it could be the usage intensity exceeding that
specified in the contract (provided the lessor can monitor this). In the case of the
lessor, the penalties are linked to equipment failures and the time to repair failed
equipment.
Two simple forms of penalty are as follows.
Penalty 1: Let N(L) denote the number of equipment failures over the lease period L. If N(L) exceeds a pre-specified limit, the lessor incurs a penalty; the amount that the lessor pays to the lessee at the end of the contract is Cn multiplied by the number of failures in excess of this limit (and zero otherwise).
Penalty 2: Let the random variable Y denote the time that the lessor takes to restore failed equipment to its working state. If Y exceeds a pre-specified value, then the lessor incurs a penalty of Ct multiplied by the excess repair time.
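The two penalty forms can be sketched directly; the penalty rates and the limits (called eta and tau here, since the original symbols do not survive in the text above) are assumed values for illustration.

```python
# Sketch of the two penalty forms with assumed parameter values: the lessor
# pays CN per failure beyond the limit ETA, and CT per unit of repair time
# beyond TAU for each repair.
CN, ETA = 50.0, 3      # penalty rate ($ per excess failure) and count limit
CT, TAU = 20.0, 0.5    # penalty rate ($ per excess day) and time limit (days)

def penalty1(n_failures):
    """Penalty 1: paid at the end of the contract if N(L) exceeds ETA."""
    return CN * max(n_failures - ETA, 0)

def penalty2(repair_times):
    """Penalty 2: accrued for each repair whose duration Y exceeds TAU."""
    return sum(CT * max(y - TAU, 0.0) for y in repair_times)

print(penalty1(5), penalty2([0.2, 1.5, 0.9]))
```

For example, five failures against a limit of three costs 2 × 50 = 100, and repair durations of 0.2, 1.5 and 0.9 days against a 0.5-day limit cost 20 × (1.0 + 0.4) = 28.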
Whenever a failure occurs, the lessor incurs a direct cost in restoring the failed
equipment to its operating state. Also, the lessor can incur indirect costs resulting
from the penalties incurred. As a result, the total CM costs are the sum of both the
direct and the indirect costs. These costs can be lowered through greater PM effort
but this implies increased PM costs. The total cost to the lessor as a function of the
PM effort is as shown in Figure 16.6 and the optimal PM effort is one that mini-
mizes the total costs.
Since the CM costs are uncertain, the optimal PM effort is based on minimizing
the expected total cost. This requires the lessor to first define the kind of PM policy
that would be employed and then to optimally select the parameters of the policy
so as to minimize the expected total cost.
[Figure 16.6: total cost, CM cost and PM cost as functions of PM effort, with the optimal PM effort at the minimum of the total cost]
One can define many different types of PM policies that the lessor can use. We
first consider new equipment lease. We define a few policies and indicate the
parameters that need to be optimally selected. Later we look at used equipment
lease.
λ0(t) = (β/α)(t/α)^(β−1)                                      (16.6)
G(y) = 1 − exp[−(y/n)^m],   0 ≤ y < ∞                         (16.7)
with shape parameter m < 1 (implying decreasing repair rate) and scale parameter n > 0. We assume the following parameter values:
Intensity function: α = 1 (year) and β > 1 (implying increasing failure rate)
Repair time: m = 0.5 and n = 0.5 (mean time to repair is one day)
Reduction in intensity function: Cp(δ) = 100 + 50δ ($)
Reduction in age: Cu(x) = wx / (1 − e^(−γ(A−x))) ($) with w = 10 and γ = 0.1
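The stated mean time to repair (one day) can be checked against the Weibull mean formula n·Γ(1 + 1/m):

```python
import math

# A Weibull distribution with shape m and scale n has mean n * Gamma(1 + 1/m).
# With m = 0.5 and n = 0.5 this gives 0.5 * Gamma(3) = 0.5 * 2 = 1 day,
# confirming the "mean time to repair is one day" statement above.
m, n = 0.5, 0.5
mean_repair_time = n * math.gamma(1.0 + 1.0 / m)
print(mean_repair_time)   # 1.0
```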
J(θ) = Cf E[N(L)] + Σ_{j=1}^{k} Cp(δj) + Ct E[N(L)] ∫_τ^∞ (y − τ) g(y) dy + Cn E[max{N(L) − η, 0}]     (16.8)
where τ and η denote the pre-specified repair-time and failure-count limits of Section 16.4.2. The first term on the right-hand side is the cost of rectifying failures, the second term is the PM cost, and the third and fourth terms represent the penalty costs associated with repair times and the number of failures over the lease period. The parameters, given by the set {k, tj, δj, 1 ≤ j ≤ k}, need to be selected optimally to minimize J(θ).
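Putting the pieces together, the expected total cost of Equation 16.8 can be evaluated numerically for one candidate PM schedule. The intensity and repair-time parameters follow the example above; the cost rates, penalty limits and the schedule itself are assumptions, and the failure-count penalty term is roughly approximated by max{E[N(L)] − limit, 0} rather than the exact expectation.

```python
import math

# Sketch of evaluating J (Equation 16.8) for one candidate PM schedule.
ALPHA, BETA, L = 1.0, 2.0, 5.0        # Weibull intensity, lease length
M, N = 0.5, 0.5                       # repair-time Weibull shape/scale
CF, CT, CN = 40.0, 20.0, 50.0         # repair cost and penalty rates (assumed)
TAU, ETA = 0.5, 3                     # penalty limits (assumed)
SCHEDULE = [(1.0, 0.5), (2.0, 0.8), (3.0, 1.0), (4.0, 1.2)]  # (t_j, delta_j)

def lambda0(t):
    return (BETA / ALPHA) * (t / ALPHA) ** (BETA - 1.0)

def Cp(delta):
    """PM cost, as in the section's example: 100 + 50*delta."""
    return 100.0 + 50.0 * delta

def expected_failures():
    """E[N(L)]: integrate the reduced intensity (Equation 16.5)."""
    h, total = L / 50000, 0.0
    for i in range(50000):
        t = (i + 0.5) * h
        total += lambda0(t) - sum(d for tj, d in SCHEDULE if tj <= t)
    return total * h

def repair_excess():
    """Integral over (TAU, inf) of (y - TAU) g(y) dy for the Weibull G."""
    h, total = 0.001, 0.0
    for i in range(40000):            # truncate the tail at y = 40 days
        y = TAU + (i + 0.5) * h
        g = (M / N) * (y / N) ** (M - 1.0) * math.exp(-((y / N) ** M))
        total += (y - TAU) * g * h
    return total

def J():
    en = expected_failures()
    pm = sum(Cp(d) for _, d in SCHEDULE)
    # Rough simplification: E[max{N(L) - ETA, 0}] ~ max{E[N(L)] - ETA, 0}.
    return CF * en + pm + CT * en * repair_excess() + CN * max(en - ETA, 0.0)

print(round(J(), 1))
```

An optimiser would wrap this evaluation in a search over {k, tj, δj} subject to the constraint of Equation 16.9; the sketch only shows how the four cost terms are assembled.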
Example 16.1 Table 16.2 (extracted from Table 3 of Jaturonnatee et al. 2005) shows k*, the optimal number of PM actions (the optimal values for the remaining parameters are omitted), and J*(θ*), the corresponding expected costs, for a range of values of Cn and one other model parameter.
The optimisation needs to take into account the following constraint:
0 < Σ_{i=0}^{j} δi < λ0(tj) − λ0(0),   j ≥ 1,                 (16.9)
with t0 = 0 and δ0 = 0.
The number of PM actions carried out over the lease period is k(T), given by the largest integer less than L/T. The expected total cost is given by Equation 16.8 with tj = jT, j ≥ 1, and the parameters, given by the set {T, δj, 1 ≤ j ≤ k(T)}, need to be selected optimally to minimize J(θ) subject to the constraint given by Equation 16.9.
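The bookkeeping for this periodic policy is simple to sketch; L is taken from the examples, and the helper names are ours.

```python
import math

# Policy 2: PM actions at t_j = j*T, with k(T) the largest integer strictly
# less than L/T (an action that would coincide with the end of the lease is
# not performed).
L = 5.0   # lease period from the examples

def k_of_T(T):
    """Largest integer strictly less than L/T."""
    return math.ceil(L / T) - 1

def pm_times(T):
    """PM epochs t_j = j*T for j = 1, ..., k(T)."""
    return [j * T for j in range(1, k_of_T(T) + 1)]

print(k_of_T(2.0), pm_times(2.0))    # 2 [2.0, 4.0]
print(k_of_T(1.0), pm_times(1.0))    # 4 [1.0, 2.0, 3.0, 4.0]
```

Note the boundary case: when L/T is an integer, ceil(L/T) − 1 equals L/T − 1, which is exactly the largest integer strictly below L/T.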
Example 16.2 Table 16.3 (extracted from Pongpech and Murthy 2006) shows T* (the optimal values for the other parameters are omitted) and the corresponding expected total cost for β = 3 and L = 5.
The age of the used equipment is A and the lessor carries out an overhaul which reduces its age by x before the equipment is leased out. The analysis is similar to Policy 1 and the expected total cost (see Pongpech et al. 2006 for details) is given by
J(θ) = Cf E[N(L)] + Σ_{j=1}^{k} Cp(δj) + Ct E[N(L)] ∫_τ^∞ (y − τ) g(y) dy + Cn E[max{N(L) − η, 0}] + Cu(x)     (16.10)
where τ and η denote the pre-specified repair-time and failure-count limits of Section 16.4.2.
Example 16.3 Table 16.4 (extracted from Pongpech et al. 2006) shows x* and k* (the optimal values for the remaining parameters are omitted) and the corresponding optimal expected total cost for a range of A, with β = 2 and L = 5.
A    x*    k*    J(θ*)
1    0.0   4     $2280.00
2    0.6   4     $3111.58
3    1.2   4     $3752.68
4    2.0   4     $4290.03
5    2.5   4     $4792.55
6    3.6   4     $5198.07
7    4.2   4     $5601.10
As can be seen, x* (the reduction in age due to the upgrade before the equipment is leased out) increases with A, as is to be expected, since the ROCOF increases with age. Note that no upgrade is needed when the equipment is fairly young (A = 1). Also, k* does not change when β = 2. However, when β > 2, we find that k* increases as A increases.
5. From the lessor's point of view, the size and variety of equipment to stock for leasing are both important issues. The optimal choice of these and the replacement decisions must take into account the needs of different lessees and the investment needed for the purchase of new stock.
16.7 References
Baker CR, Hayes RS (1981) Lease Financing – A Practical Guide, John Wiley, New York, USA
Barlow RE, Hunter LC (1960) Optimum preventive maintenance policies, Operations Research, 8:90–100
Blischke WR, Murthy DNP (2000) Reliability: Modeling, Prediction, and Optimization, John Wiley, New York, USA
Cho D, Parlar M (1991) A survey of maintenance models for multi-unit systems, European Journal of Operational Research, 51:1–23
Coyle B (2000) Leasing, Glenlake, Chicago, USA
Deelen L, Dupleich M, Othieno L, Wakelin O (2003) Leasing for small and micro enterprises – a guide for designing and managing leasing schemes in developing countries, Berold, R. (ed), Cristina Pierini, Turin, Italy
Dekker R, Wildeman RE, Van Der Duyn Schouten FA (1997) Review of multi-component models with economic dependence, Mathematical Methods of Operations Research, 45:411–435
Desai P, Purohit D (1998) Leasing and selling: optimal marketing strategies for a durable goods firm, Management Science, 44(11):19–34
http://www.leasefoundation.org/pdfs/2001StateofIndustryRpt.pdf
ELA (2002a) Equipment Leasing and Financial Foundation 2002 State of the Industry Report, PriceWaterhouseCoopers, Available on http://www.leasefoundation.org/pdfs/2002SOIRpt.pdf
ELA (2002b) Equipment Leasing Association Online Focus Groups Report, Available on http://www.chooseleasing.org/Market/2002FocusGroupsRpt.pdf
ELA (2005) The economic contribution of equipment leasing to the U.S. economy: growth, investment & jobs – update, Equipment Leasing Association, Global Insight, Advisory Services Group, Available on http://www.elaonline.org/press/
Ezzel JR, Vora PP (2001) Leasing versus purchasing: Direct evidence on a corporation's motivation for leasing and consequences of leasing, The Quarterly Review of Economics and Finance, 41:33–47
Fishbein BK, McCarry LS, Dillon PS (2000) Leasing: A step toward producer responsibility, Available on http://www.informinc.org
Gits CW (1986) On the maintenance concept for a technical system: II. Literature review, Maintenance Management International, 6:181–196
Handa P (1991) An economic analysis of leasebacks, Review of Quantitative Financing and Accounting, 1:177–189
Jardine AKS, Buzacott JA (1985) Equipment reliability and maintenance, European Journal of Operational Research, 19:285–296
Jaturonnatee J, Murthy DNP, Boondiskulchok R (2005) Optimal preventive maintenance of leased equipment with corrective minimal repair, European Journal of Operational Research, Available online 30 March 2005
Kim EH, Lewellen WG, McConnell JJ (1978) Sale-and-leaseback agreements and enterprise valuation, Journal of Financial and Quantitative Analysis, 13:871–881
Maintenance of Leased Equipment 415
Ashraf Labib
17.1 Introduction
Computerised maintenance management systems (CMMSs) are vital for the co-
ordination of all activities related to the availability, productivity and maintain-
ability of complex systems. Modern computational facilities have dramatically
widened the scope for improved effectiveness and efficiency in maintenance.
CMMSs have existed, in one form or another, for several decades.
The software has evolved from relatively simple mainframe planning of main-
tenance activity to Windows-based, multi-user systems that cover a multitude of
maintenance functions. The capacity of CMMSs to handle vast quantities of data
purposefully and rapidly has opened new opportunities for maintenance, facilitat-
ing a more deliberate and considered approach to managing assets.
Some of the benefits that can result from the application of a CMMS are:
Resource control: tighter control of resources
Cost management: better cost management and auditability
Scheduling: the ability to schedule complex, fast-moving workloads
Integration: integration with other business systems
Reduction of breakdowns: improved reliability of physical assets through
the application of an effective maintenance programme
The most important of these is the reduction of breakdowns. This is the aim of
the maintenance function; the rest are desirable objectives (or by-products).
This is a fundamental issue: some system developers and vendors, as well as
some users, lose focus and compromise the reduction of breakdowns in order to
maintain standardisation and integration objectives, thus confusing the aim with
the objectives. As a result, the majority of CMMSs on the market suffer from
serious drawbacks, as will be shown in the following section.
form of the data captured and the historical nature of certain elements of it. In
short, companies tend to spend vast amounts of capital acquiring off-the-shelf
systems for data collection, but the added value of these systems to the business
is questionable.
Few books have been published on the subject of CMMSs (Bagadia 2006;
Mather 2002; Cato and Mobley 2001; Wireman 1994), and those that exist tend
to highlight the advantages of such systems rather than their drawbacks.
All CMMSs offer data collection facilities; more expensive systems offer
formalised modules for the analysis of maintenance data, and the market leaders
allow real time data logging and networked data sharing (see Table 17.1). Yet,
despite the observations made above regarding the need for information to aid
maintenance management, a "black hole" exists in the row titled "Decision
analysis" in Table 17.1, because virtually no CMMS offers decision support.1 This is a
definite problem, because the key to systematic and effective maintenance is
managerial decision-making that is appropriate to the particular circumstances of
the machine, plant or organisation. This decision-making process is made all the
more difficult if the CMMS package can only offer an analysis of recorded data.
As an example, when a certain preventive maintenance (PM) schedule is input into
a CMMS, for example to change the oil filter every month, the system will simply
produce a monthly instruction to change the oil filter and is thus no more than a
diary.
The use of CMMSs for decision support lags significantly behind the more
traditional applications of data acquisition, scheduling and work order issuing.
While many packages offer inventory tracking and some form of stock level
monitoring, the reordering and inventory holding policies remain relatively
simplistic and inefficient. See the work of Exton and Labib (2002) and Labib and
Exton (2001). Also, there is no mechanism to support managerial decision-making
with regard to inventory policy, diagnostics or setting of adaptive and appropriate
preventive maintenance schedules.
A noticeable problem with current CMMS packages is thus the provision of
decision support; Figure 17.1 illustrates how far the use of CMMSs for decision
support lags behind the more traditional applications of data acquisition,
scheduling and work-order issuing.
[Figure 17.1: extent of CMMS module usage (e.g. maintenance budgeting, inventory control) on a 70-100% scale, with "A Black Hole" where decision analysis support should appear]
It is worrying that almost half of the companies are either dissatisfied to some
degree with, or neutral about, their CMMS, and that the responses indicated that
manufacturing plants demand more user-friendly systems.
This is further proof of the existence of a "black hole". To make matters
worse, a new breed of CMMSs has appeared that is complicated and lacks basic
user-friendliness. Although these systems emphasise integration and logistics
capabilities, they tend to ignore the fact that the fundamental reason for imple-
menting a CMMS is to reduce breakdowns. They are difficult for both production
operators and maintenance engineers to handle, being accounting- and/or
IT-orientated rather than engineering-orientated. Results of an investigation
(EPSRC GM/M35291) show that managers' lack of commitment to maintenance
models has been attributed to a number of reasons:
Managers are unaware of the various types of maintenance models.
A full understanding of the various models and the appropriateness of these
systems to companies is not available.
Managers do not have confidence in mathematical models due to their
complexity and the number of unrealistic assumptions they contain.
This correlates with surveys of existing maintenance models and optimisation
techniques. Ben-Daya et al. (2001) and Sherwin (2000) have also noticed that
models presented in their work have not been widely used in industry for several
reasons, such as:
Unavailability of data
Lack of awareness about these models
Restrictive assumptions of some of these models
Finally, here is an extract from Professor Nigel Slack's (Warwick University)
textbook on operations management, offering a critical commentary on ERP imple-
mentations (which may well apply to CMMSs, as many of them nowadays tend to
be classified as specialised ERP systems):
Far from being the magic ingredient which allows operations to fully integrate
all their information, ERP is regarded by some as one of the most expensive
ways of getting zero or even negative return on investment. For example, the
American chemicals giant Dow Chemical spent almost half a billion dollars
and seven years implementing an ERP system which became outdated almost
as it was implemented. One company, FoxMeyer Drug, claimed that the ex-
pense and problems which it encountered in implementing ERP eventually
drove it to bankruptcy. One problem is that ERP implementation is expensive.
This is partly because of the need to customise the system, understand its
implications for the organisation, and train staff to use it. Spending on what
some call the ERP "ecosystem" (consulting, hardware, networking and comple-
mentary applications) has been estimated as being twice the spending on the
software itself. But it is not only the expense which has disillusioned many
companies, it is also the returns they have had for their investment. Some
studies show that the vast majority of companies implementing ERP are
disappointed with the effect it has had on their businesses. Certainly many
companies find that they have to (sometimes fundamentally) change the way
they organise their operations in order to fit in with ERP systems. This
organisational impact of ERP (which has been described as "the corporate
equivalent of dental root canal work") can have a significantly disruptive effect
on the organisation's operations.
Hence, theory and implementation of existing maintenance models are, to a
large extent, disconnected. It is concluded that there is a need to bridge the gap
between theory and practice through intelligent optimisation systems (e.g. rule-
based systems). It is also argued that the success of this type of research should be
measured by its relevance to practical situations and its impact on the solution of
real maintenance problems. The developed theory must be made accessible to
practitioners through IT tools. Efforts need to be made in the data capturing area to
provide necessary data for such models. Obtaining useful reliability information
from collected maintenance data requires effort. In the past, this has been referred
to as "data mining", as if data could be extracted in its desired form if only it
could be found.
In the next section we introduce a decision analysis model. We then show how
such a model has been implemented for decision support in maintenance systems.
In this particular company there are 130 machines, varying from robots and
machine centres to manually operated assembly tables. Notice that, in this case
study, only two criteria are used (frequency and downtime). However, if more
criteria are included, such as spare parts cost and scrap rate, the model becomes
multi-dimensional, with low, medium, and high ranges for each identified criterion.
The methodology implemented in this case followed three steps: (i) criteria
analysis, (ii) decision mapping, and (iii) decision support.
Computerised Maintenance Management Systems 425
[Figure: breakdown trends in hours (0-1200 h), November through to the following November]
As indicated earlier, the aim of this phase is to establish a Pareto analysis of two
important criteria: downtime (the main concern of production) and frequency of
calls (the main concern of asset management). The objective of this phase is to
assess how badly the worst performing machines behave over a certain period of
time, say one month. The worst performers on both criteria are sorted and grouped
into high, medium, and low sub-groups. The ranges are selected so that machines
are distributed evenly across each criterion. This is presented in Figure 17.4. In this
particular case, the total number of machines is 120. Machines include CNCs,
robots, and machine centres.
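As an illustrative sketch of this criteria-analysis step, the even three-way banding can be coded as follows. The machine records and values are invented for illustration and are not taken from the case study:

```python
def band(values):
    """Return a classifier that places a value into 'low', 'medium' or 'high'
    so that the machines are distributed evenly across the three bands."""
    labels = ["low", "medium", "high"]
    ranked = sorted(values)
    n = len(ranked)
    # Position in the sorted list decides the band (ties share a band).
    return lambda v: labels[min(ranked.index(v) * 3 // n, 2)]

# Hypothetical monthly records: machine -> (downtime in h, breakdown calls).
records = {
    "A": (80, 4), "B": (400, 3), "C": (500, 20),
    "F": (60, 5), "G": (30, 18), "I": (20, 9),
}
dt_band = band([d for d, _ in records.values()])
fr_band = band([f for _, f in records.values()])
for m, (d, f) in sorted(records.items()):
    print(m, dt_band(d), "downtime,", fr_band(f), "frequency")
```

Splitting by rank rather than by fixed thresholds mirrors the requirement that machines be distributed evenly across the bands.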
The aim of this step is twofold: it scales the high, medium, and low groups, so that
the genuinely worst machines on both criteria can be monitored on the grid; and it
monitors the performance of different machines and suggests appropriate actions.
The next step is to place the machines in the decision making grid shown in
Figure 17.5 and, accordingly, to recommend asset management decisions to man-
agement. This grid acts as a map on which the performances of the worst machines
are plotted against multiple criteria. The objective is to implement appropriate
actions that will move machines towards the north-west section of low downtime
and low frequency. In the top-left region, the action to implement, or the rule that
applies, is OTF (operate to failure). The rule that applies in the bottom-left region
is SLU (skill level upgrade), because data collected from breakdowns attended by
maintenance engineers indicate that machine [G] has been visited many times
(high frequency) for limited periods (low downtime). In other words, maintaining
this machine is a relatively easy task that can be passed to operators after
upgrading their skill levels.
A machine located in the top-right region, such as machine [B], is problematic;
in maintenance jargon, a "killer". It does not break down frequently (low
frequency), but when it stops it is usually a big problem that lasts for a long time
(high downtime). In this case the appropriate action is to analyse the breakdown
events and closely monitor the machine's condition, i.e. condition-based
monitoring (CBM).
A machine that enters the bottom-right region is considered one of the worst
performing machines on both criteria. It is a machine that maintenance engineers
are used to seeing broken down rather than performing its normal operating duty.
A machine in this category, such as machine [C], will need to be structurally
modified, and major design-out projects need to be considered; hence the
appropriate rule to implement is design out maintenance (DOM).
If one of the antecedents is a medium downtime or a medium frequency, then
the rule to apply is to carry on with the preventive maintenance schedules. However,
not all of the medium regions are the same. Some regions near the top-left corner
represent "easy" FTM (fixed time maintenance): being close to the OTF region,
they require re-addressing who will perform the instruction, or when the instruction
will be implemented. For example, machines [I] and [J] are situated in the region
between OTF and SLU, and the question is who will carry out the instruction:
operator, maintenance engineer, or sub-contractor. Also, a machine such as machine
[F] has been shifted from the OTF region due to its relatively higher downtime, and
hence the timing of the instructions needs to be addressed.
Other preventive maintenance schedules need to be addressed in a different
manner. The difficult FTM issues are those related to the contents of the
instruction itself. It might be the case that the wrong problem is being solved, or
that the right one is not being solved adequately. In this case machines such as [A]
and [D] need to be investigated in terms of the contents of their preventive
instructions, and expert advice is needed.
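The decision mapping step then reduces to a lookup from the two bands to a rule. A minimal sketch follows; the corner assignments come from the text, while collapsing every cell that involves a medium band into FTM is the simplification the chapter itself refines into "easy" and "difficult" FTM:

```python
# Decision making grid: (frequency band, downtime band) -> recommended rule.
DMG = {
    ("low", "low"): "OTF",      # operate to failure
    ("high", "low"): "SLU",     # skill level upgrade
    ("low", "high"): "CBM",     # condition based maintenance
    ("high", "high"): "DOM",    # design out maintenance
}

def recommend(freq_band, downtime_band):
    # Any cell involving a medium band keeps the preventive schedule (FTM).
    return DMG.get((freq_band, downtime_band), "FTM")

print(recommend("high", "low"))   # machine [G]: SLU
print(recommend("low", "high"))   # machine [B]: CBM
print(recommend("high", "high"))  # machine [C]: DOM
```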
Once the worst performing machines are identified and the appropriate action is
suggested, it becomes a case of identifying a focused action to be implemented. In
other words, we need to move from the strategic systems level to the operational
component level. Using the analytic hierarchy process (AHP), one can model a
hierarchy of levels relating objectives, criteria, failure categories, failure details
and failed components. For more details on the AHP, readers can consult Saaty
(1988). This step is shown in Figure 17.6.
The AHP is a mathematical model developed by Saaty (1980) that prioritises
every element in the hierarchy relative to the other elements in the same level. The
prioritisation of each element is carried out with respect to all elements in the level
above; we thereby obtain a global priority value for every element in the lowest
level. We can then compare the prioritised fault details (level 4 in Figure 17.6)
with the PM "signatures" (keywords) related to the same machine. PMs can then
be varied accordingly, adapting them to shop-floor realities.
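As a sketch of this prioritisation step, the row geometric mean method is a common, simple approximation to Saaty's principal-eigenvector prioritisation; the pairwise comparison matrix below is invented:

```python
import math

def ahp_priorities(M):
    """Approximate the AHP priority vector of a pairwise comparison matrix
    using the row geometric mean method (a common stand-in for the
    principal eigenvector)."""
    n = len(M)
    gm = [math.prod(row) ** (1.0 / n) for row in M]
    total = sum(gm)
    return [g / total for g in gm]

# Hypothetical pairwise comparisons of three fault categories on one machine,
# on Saaty's 1-9 scale (numbers invented):
M = [
    [1,   3,   5],
    [1/3, 1,   2],
    [1/5, 1/2, 1],
]
w = ahp_priorities(M)
print([round(x, 3) for x in w])   # priorities sum to 1, first category dominates
```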
The proposed decision analysis maintenance model, as shown previously in
Figure 17.2, combines both fixed rules and flexible strategies, since machines are
compared on a relative scale. The scale itself adapts to machine performance with
respect to the identified criteria of importance. Hence the flexibility concept is
embedded in the proposed model.
[Figure 17.6: AHP hierarchy with Level 2: critical machines (System A, System B, System C), Level 3: critical faults, and Level 4: fault details (motor faults, no-power faults, panel faults, switch faults)]
[Figure: input membership functions over frequency (0-50, No. of times, e.g. 12) and downtime, with membership values such as 0.75, 0.4, 0.7 and 0.2]
The output strategies have a membership function, and we have assumed a cost
(or benefit) function that is linear and follows the relationship DOM > CBM >
SLU > FTM > OTF, as shown in Figure 17.9a.
The rules are then constructed based on the DMG, giving nine rules in total.
Examples of the rules are as follows:
If frequency is high and downtime is low then maintenance strategy is SLU
(skill level upgrade).
If frequency is low and downtime is high then maintenance strategy is
CBM (condition based maintenance).
Rules are shown in Figure 17.9b.
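A Sugeno-style simplification of this fuzzy system can be sketched as follows. The triangular membership breakpoints and cost singletons are assumptions (the chapter does not give exact values), so this sketch will not reproduce the chapter's numerical results exactly:

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Assumed input memberships (breakpoints invented):
def freq_mf(f):   # breakdown calls per month, roughly 0-80
    return {"low": tri(f, -1, 0, 10), "medium": tri(f, 5, 15, 25),
            "high": tri(f, 20, 50, 81)}

def dt_mf(d):     # downtime in hours, roughly 0-1200
    return {"low": tri(d, -1, 0, 200), "medium": tri(d, 100, 300, 500),
            "high": tri(d, 400, 800, 1201)}

# Output singletons on the assumed linear cost axis: OTF < FTM < SLU < CBM < DOM.
COST = {"OTF": 0, "FTM": 10, "SLU": 20, "CBM": 30, "DOM": 40}

# The nine DMG rules: corner cells get dedicated strategies; any cell
# involving a medium band keeps the preventive schedule (FTM).
RULES = {("low", "low"): "OTF", ("low", "medium"): "FTM", ("low", "high"): "CBM",
         ("medium", "low"): "FTM", ("medium", "medium"): "FTM",
         ("medium", "high"): "FTM", ("high", "low"): "SLU",
         ("high", "medium"): "FTM", ("high", "high"): "DOM"}

def strategy(freq, downtime):
    fm, dm = freq_mf(freq), dt_mf(downtime)
    num = den = 0.0
    for (fb, db), s in RULES.items():
        w = min(fm[fb], dm[db])          # rule firing strength (AND = min)
        num += w * COST[s]
        den += w
    crisp = num / den                    # weighted-average defuzzification
    # The nearest singleton gives the recommended strategy.
    return min(COST, key=lambda s: abs(COST[s] - crisp))

print(strategy(45, 1000))   # severe, frequent failures -> DOM
```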
Figure 17.9. a Output (strategies) membership function. b The nine rules of the DMG
The fuzzy decision surface is shown in Figure 17.10. In this figure, given any
combination of frequency (x-axis) and downtime (y-axis), one can determine the
most appropriate strategy to follow (z-axis).
It can be noticed from Figure 17.11 that the relationship DOM > CBM >
SLU > FTM > OTF is maintained. As illustrated in Figure 17.11, given a 380-h
downtime and a frequency of 12, the suggested strategy to follow is CBM.
Figure 17.11. The fuzzy decision surface showing the regions of different strategies
17.5.5 Discussion
The concept of the DMG was originally proposed by Labib (1996). It was then
implemented in a company that has achieved world-class status in maintenance
(Labib 1998a). The DMG model has also been extended to be used as a technique
to deal with crisis management in an award winning paper (Labib 1998b).
The DMG can be used as a practical continuous improvement process because,
when the machines in the top ten have been addressed, they will, if and only if
appropriate action has been taken, move down the list of the ten worst machines.
As they move down the list, other machines emerge as needing improvement, and
resources can then be directed towards the new offenders. If this practice is
applied continuously, eventually all machines will be running optimally.
If problems are chronic, i.e. regular, minor and usually neglected, some of these
could be due to the incompetence of the user, and thus skill level upgrading would
be an appropriate solution. However, if machines tend towards the RCM region,
the problems are more sporadic and, when they occur, can be catastrophic.
Maintenance schemes such as FMEA and FTA can help determine the cause and
may help predict failures, thus allowing a prevention scheme to be devised.
Figure 17.12 shows when to apply TPM and RCM. TPM is appropriate in the
SLU range, since skill level upgrade of machine tool operators is a fundamental
concept of TPM, whereas RCM is applicable to machines exhibiting severe
failures (high downtime and low frequency). CBM and FMEA will also be ideal
for this kind of machine, and hence an RCM policy will be most applicable. The
significance of this approach is that RCM and TPM sit together in one unified
model rather than standing as two competing concepts.
17.8 References
Bagadia, K. (2006), Computerized Maintenance Management Systems Made Easy,
McGraw-Hill.
Brashaw, C. (1998), Characteristics of acoustic emission (AE) signals from ill-fitted copper
split bearings, Proc. 2nd Int. Conf. on Planned Maintenance, Reliability and Quality.
Ben-Daya, M., Duffuaa, S.O. and Raouf, A. (eds) (2001), Maintenance Modelling and
Optimisation, Kluwer Academic Publishers, London.
Boznos, D. (1998), The Use of CMMSs to Support Team-Based Maintenance, MPhil thesis,
Cranfield University.
Terje Aven
18.1 Introduction
This chapter discusses the use of risk analysis to support decision making on
maintenance activities. In recent years there has been a growing interest in the use
of risk analysis and risk-based (informed) approaches for guiding decisions on
maintenance; see, e.g., Vatn et al. (1996), Clarotti et al. (1997), Dekker (1996) and
Cepin (2002). This topic has also been given much attention in industry; see, for
example, van Manen et al. (1997), Knoll et al. (1996), Perryman et al. (1995) and
Podofillini et al. (2006). This chapter provides a critical review of some of the key
building blocks of the theories and methods developed. We also discuss some
critical factors for ensuring a successful use of risk analysis for maintenance
applications. The issues discussed include:
Risk descriptions and categorisations
Uncertainty assessments
Risk acceptance and risk informed decision making
Selection of appropriate methods and tools
An example is presented of a detailed risk analysis, showing the effect of mainten-
ance efforts on risk.
The chapter is organised as follows. First in Section 18.2 we review the basic
elements of risk management and risk management processes, and clarify the risk
perspective adopted in this chapter. Then in Section 18.3 we address the use of risk
analysis to support decisions on maintenance. Various types of decision situations
and analyses are covered. Section 18.4 presents the case mentioned above. In
Section 18.5 we discuss key building blocks of the theories and methods devel-
oped, as well as the critical factors for ensuring a successful use of risk analysis for
maintenance applications. Section 18.6 concludes. When not otherwise stated, we
use terminology from ISO (2002).
List of abbreviations:
PLL Potential loss of life (expected number of fatalities per year)
FAR Fatal accident rate (expected number of fatalities per 100 million
exposed hours)
ETA Event tree analysis
FTA Fault tree analysis
CCA Cause consequence analysis
FMECA Failure mode and effect and criticality analysis
HAZOP Hazard and operability studies
RIF Risk influencing factor
BORA Barrier operational risk analysis
RCM Reliability centred maintenance
HMI Human machine interface
TTS Technical condition safety
The purpose of risk management is to ensure that adequate measures are taken to
protect people, the environment and assets from the harmful consequences of the
activities being undertaken, as well as to balance different concerns, in particular
risks and costs. Risk management includes measures both to avoid the occurrence
of hazards and to reduce their potential harm. Traditionally, risk management was
based on a prescriptive regulatory regime, in which detailed requirements were set
for the design and operation of the arrangements. This regime has gradually been
replaced by a more goal-oriented regime, putting emphasis on what to achieve
rather than on particular solutions.
Risk management is an integral aspect of a goal-oriented regime. It is acknowl-
edged that risk cannot be eliminated but must be managed. There is an enormous
drive and enthusiasm in various industries, and in society as a whole, to implement
risk management in organizations, and there are high expectations that risk
management is the proper framework for obtaining high levels of performance.
To support decision making on design and operation, risk analyses are conduc-
ted. The analyses cover identification of hazards and threats, cause analyses, con-
sequence analyses and risk description. Evaluations of the results of the analyses
are carried out. The totality of the analyses and the evaluations are referred to as
risk assessments. Risk assessment is followed by risk treatment, which is the
process and implementation of measures to modify risk, including measures to
avoid, reduce (optimize), transfer or retain risk. Risk transfer means sharing with
another party the benefit or loss associated with a risk. It is typically affected
through insurance. Risk management covers all co-ordinated activities to direct and
control an organisation with regard to risk. The risk management process is the
systematic application of management policies, procedures and practices to the
tasks of establishing the context, assessing, treating, monitoring, reviewing and
communicating risks; see Figure 18.1.
Risk Analysis in Maintenance 439
[Figure 18.1: the risk management process: identify risks, evaluate risks, treat risks]
of risk analysis we obtain when models are developed to represent cause and/or
consequence scenarios. The standard tools used are FTA (fault tree analysis) and
ETA (event tree analysis) and the combination of the two, CCA (cause conse-
quence analysis). These models are important elements in a qualitative risk analy-
sis, and provide the basis for a quantitative risk analysis. These are all standard risk
analysis methods, and we refer to textbooks for descriptions and discussion of
them; see, e.g., Aven (1992) and Modarres (1993).
The models are used to identify critical systems, and thus provide a basis for
selecting appropriate maintenance activities. To illustrate this, let R be a risk index,
for example expressing the expected number of fatalities (PLL) or the probability
of a system failure, and let Ri be the risk index when subsystem i is in the
functioning state. Then a common way of ranking the different subsystems is to
compute the risk improvement potential (also referred to as the risk achievement
worth) Ii = R − Ri, i.e. the maximum potential risk improvement that can be
obtained by improving subsystem i (Aven 1992; Haimes 1998). The potential Ii is
referred to as a risk importance measure. An application of this approach is
presented in Brewer and Canady (1999). Criteria are established based on such a
ranking to identify when maintenance improvements are needed to reduce risks.
Identifying critical items is an important basis for maintenance management, and is
one of the key steps in various maintenance frameworks, e.g. the RCM (reliability
centred maintenance) approach (Andersen and Neri 1990).
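Under the definitions above, ranking subsystems by their risk improvement potential can be sketched as follows; the risk indices are invented:

```python
# Hypothetical risk indices: baseline PLL, and the PLL obtained when each
# subsystem is assumed to be in the functioning (perfect) state.
R_base = 4.0e-3                                        # baseline PLL per year
R_perfect = {"pump": 3.1e-3, "valve": 3.8e-3, "sensor": 2.5e-3}

# Risk improvement potential: the most that perfecting subsystem i can remove.
potential = {i: R_base - Ri for i, Ri in R_perfect.items()}
ranking = sorted(potential, key=potential.get, reverse=True)
print(ranking)   # most maintenance-critical first -> ['sensor', 'pump', 'valve']
```

Maintenance improvements would then be targeted at the top of this ranking.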
In risk analysis, the maintenance efforts are incorporated as shown in Figure 18.2.
[Figure 18.2: a three-level risk analysis model (high-level, intermediate and low-level analysis) in which main system objectives, historical data, expert opinions, suitable system performance models, maintenance performance and component performance feed the comparison of alternatives]
Figure 18.2. Model showing the relationship between maintenance efforts and risk
(Apeland and Aven 2000)
Traditionally, risk analyses using FTA and ETA have not had the level of detail
that is necessary to support many decisions related to maintenance. However, recent
developments within risk analysis allow for more detailed analyses taking into
account risk influencing factors, for example maintenance activities. In Section 18.4
we will look more closely at this type of risk analysis and show how maintenance
activities can be incorporated. Here we summarise the basic features of the method,
using a cause-analysis-based example as an illustration:
To carry out such an analysis there are a number of challenges, of which the
following are some of the more important:
Determine which F factors should be included in the fault tree. The F
factors are fixed, meaning that the probability assignments are conditioned
on these factors. If some of the F factors are to be considered unknown to
the analyst, these factors need to be included in the fault tree, or the factors
should be divided into two categories, reflecting unknown factors on the
one hand and given factors on the other. Such a distinction is made in
the SAM method (Pate-Cornell and Murphy 1996).
Find adequate procedures for specifying the probabilities P(Bi|F). These
procedures need to be based on models and methods used for barrier per-
formance analyses, such as human reliability analysis.
We refer to Section 18.4. The above analysis provides decision support by de-
scribing the effect of maintenance efforts on risk. To make a decision, costs and
other aspects also need to be considered, and an important issue is then how this
should be done. A standard approach is cost-benefit analysis based on compu-
tation of the expected net present value. We will discuss this issue in Section 18.5.
18.4 A Case
In this section we present a risk analysis incorporating operational and main-
tenance factors. The presentation is based on Sklet et al. (2005), and is referred to
as the BORA (barrier and operational risk analysis) approach. The approach is
inspired by the I-Risk method (Papazoglou et al. 2003). The case relates to an
offshore installation, and releases of hydrocarbons.
The BORA approach consists of the following steps:
The basic building blocks of the BORA model are barrier block diagrams, event
trees, fault trees, and influence diagrams. Barrier block diagrams are used to
illustrate the event scenarios and the effect of barrier systems on the event
sequences and consist of initiating events, barriers aimed to influence the event
sequence in a desired direction, and possible outcomes of the event sequence.
Event trees are used in the quantitative analysis of the scenarios. The performance
of the safety barriers is analyzed using fault trees. Influence diagrams are used to
analyze how the RIFs affect the initiating events in the event trees and the basic
events in the fault trees.
This case restricts attention to modeling of the containment function (prevent
release of hydrocarbons). For this function a number of release scenarios have
been modeled by use of barrier block diagrams. Each barrier block diagram com-
prises the following:
An initiating event, i.e. a deviation from the normal situation which may
cause a release of hydrocarbons.
Barrier systems aimed to prevent release of hydrocarbons.
The possible outcomes of the event sequence, which depend upon the
successful operation of the barrier system(s).
The barrier block diagram for the release scenario "Release due to valve(s) in
wrong position after maintenance" is illustrated in Figure 18.3.
As seen in Figure 18.3, several of the barriers are non-physical by nature, thus
requiring human and operational factors to be included in the risk model.
In order to perform a quantitative risk analysis, frequencies/probabilities of
three main types of events need to be quantified:
1. The frequency of the initiating event, i.e. in the example case: The fre-
quency of valve in wrong position after maintenance.
2. The probability of failure of the barrier systems, which for the example case
includes: i) failure to reveal valve(s) in wrong position after maintenance by
self control/use of checklists, ii) failure to reveal valve(s) in wrong position
after maintenance by third party control of work, and iii) failure to detect
potential release during leak test prior to start-up.
3. The (end event) frequency of release of hydrocarbons due to valve in wrong
position (needed for further analysis of the effect of the consequence
barriers).
The frequency of the initiating event is in our example a function of the annual
number of maintenance operations where valve(s) may be set in wrong position in
hydrocarbon systems, and the probability of setting a valve in wrong position per
maintenance operation.
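Assuming independent barriers, the three quantities above can be chained as in this sketch; all numbers are invented:

```python
# All numbers are invented for illustration.
n_ops = 200      # annual maintenance operations that may leave a valve misaligned
p_wrong = 0.01   # P(valve left in wrong position) per operation
f_init = n_ops * p_wrong                 # 1. initiating-event frequency per year

# 2. barrier failure probabilities: self control, third-party control, leak test
p_self, p_third, p_leak = 0.3, 0.2, 0.1

# 3. end-event frequency: a release requires every barrier to fail
f_release = f_init * p_self * p_third * p_leak
print(f_init, round(f_release, 6))   # -> 2.0 0.012
```

Treating the barriers as independent is a simplification; the BORA approach conditions their failure probabilities on risk influencing factors, as described below.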
In order to determine the probability of failure of barrier systems, the barrier
systems may be further analyzed by use of fault trees as shown in Figure 18.4.
Corresponding analysis may be performed for all barriers for all the identified
release scenarios. For further illustration of the quantification methodology in the
BORA project, we consider the initiating event and the basic events shown in
Figures 18.3 and 18.4:
Valve(s) in wrong position after maintenance that may cause release (the
initiating event).
Use of self control/checklists not specified in program (basic event A11).
Use of self control/checklists specified, but not performed (basic event A12).
The operator fails to detect valve(s) in wrong position by self control/use of
checklists (basic event A13).
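Treating these three basic events as inputs to an OR gate gives a first approximation of the barrier failure probability. The probabilities are invented; strictly, A12 and A13 are conditional on the preceding steps, so the independence assumed here is a simplification:

```python
# Invented basic-event probabilities for the self-control barrier:
p_a11 = 0.10   # self control/checklists not specified in the programme
p_a12 = 0.15   # specified, but not performed
p_a13 = 0.05   # performed, but the operator fails to detect the wrong position

# OR gate: the barrier fails if any basic event occurs (independence assumed).
p_barrier_fail = 1 - (1 - p_a11) * (1 - p_a12) * (1 - p_a13)
print(round(p_barrier_fail, 5))
```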
The first step in the quantification process is to assign industry average frequencies
and probabilities for all the initiating events in the event trees and basic events in
the fault trees.
Generic data may be found in generic databases or company internal databases.
Alternatively, industry average values can be established by use of expert judg-
ment. For our example case, Table 18.1 shows the assigned industry average fre-
quencies and probabilities for the initiating events and basic events in Figure 18.4.
RIFs for every initiating event in the event trees and every basic event in the fault
trees need to be identified. An example of an influence diagram for the basic event
"Operator fails to detect a valve in wrong position by self check/checklist" is
shown in Figure 18.5.
Figure 18.5. Influence diagram for the basic event Operator fails to detect a valve in wrong
position by self check/checklist
Table 18.2 shows the RIFs for all the relevant events in our example case.
Table 18.2. Proposed RIFs for basic events in the example case

Valve in wrong position after maintenance:
  process complexity; maintainability/accessibility; HMI (valve labeling and
  position feedback features); time pressure; competence (of area technician);
  work permit

Self control/use of checklists not specified:
  program for self control

Self control/use of checklists not performed:
  work practice (regarding use of self control/checklists); time pressure;
  work permit

Area technician fails to detect valve(s) in wrong position by self control/use
of checklists:
  HMI (valve labeling and position feedback features); maintainability/
  accessibility; time pressure; competence (of area technician); procedures for
  self control; work permit
The first step is to assess the status of the RIFs. Two schemes are being used for
scoring of RIFs.
Scheme 1. Use of results from existing projects like the technical condition safety
(TTS) project (Thomassen and Sørum 2002), the risk level on the Norwegian continental
shelf (PSA 2004), and investigations of incidents. The TTS project is a review
method to map and monitor the technical safety level based on the status of safety
critical elements and safety barriers, and each system is given a score (rating)
according to predefined performance standards. Table 18.3 shows the definition of
grades.
The next task is to adjust the industry average probabilities based on the scoring of
the RIFs. Three main aspects are discussed: a) the formulas for calculation of
installation specific frequencies/probabilities, b) assignment of appropriate RIF
scores, and c) weighting of RIFs. The procedure is illustrated by use of numbers
from the example case.
Prev = Pave · Σ(i=1..n) wi Qi                (18.1)

Σ(i=1..n) wi = 1                             (18.2)

Qi = Plow/Pave   if si = A
Qi = 1           if si = C                   (18.3)
Qi = Phigh/Pave  if si = F
where si denotes the score or status of RIF no. i. Hence if the score si is A, and Plow
is 10% of Pave, then Qi is equal to 0.1. And if the score si is F, and Phigh is ten times
higher than Pave, then Qi is equal to 10. If the score si is C, then Qi is equal to 1.
Furthermore, if all scores are C, then Prev = Pave, if all scores are A, then Prev = Plow,
and if all scores are F, then Prev = Phigh.
Note that in this study we use a fixed factor of ten to describe the variations
caused by different scores, from A to F. That is, if all scores are A, Plow is 10% of
Pave, and if all the scores si are F, then Phigh is ten times higher than Pave.
Furthermore, we have adopted the grade scores from the TTS project: A=3,
B=2, C=1, D=0, E=-2 and F=-5. Thus, letting Qi(j) denote the value of
Qi if the score si takes the value j, we have the results shown in Table 18.5.
By use of (18.1), Prev is equal to Pave × 1.918. In our example case, the RIF
analysis gave an increase of the probability of occurrence of the basic event by a
factor of 1.9 (from Pave = 0.01 to Prev = 0.019).
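The adjustment of Equations (18.1)–(18.3) can be sketched as follows. The weights and scores below are hypothetical (not those of Table 18.5), and for simplicity only the three anchor grades A, C and F are handled, although the chapter also assigns Q-values to the intermediate grades:

```python
# Sketch of the RIF adjustment of Equations (18.1)-(18.3).
# Weights and scores are hypothetical, not the values of Table 18.5.

P_AVE = 0.01          # industry average probability of the basic event
P_LOW = 0.1 * P_AVE   # fixed factor of ten: all scores A
P_HIGH = 10 * P_AVE   # fixed factor of ten: all scores F

def q_factor(score):
    """Q_i from Equation (18.3); only the anchor grades A, C, F are handled here."""
    anchors = {"A": P_LOW / P_AVE, "C": 1.0, "F": P_HIGH / P_AVE}
    return anchors[score]

def p_rev(p_ave, weights, scores):
    """Equation (18.1): P_rev = P_ave * sum_i w_i * Q_i, with sum_i w_i = 1."""
    assert abs(sum(weights) - 1.0) < 1e-9   # Equation (18.2)
    return p_ave * sum(w * q_factor(s) for w, s in zip(weights, scores))

# Four RIFs with equal weights; two average, one good, one poor:
weights = [0.25, 0.25, 0.25, 0.25]
scores = ["C", "C", "A", "F"]
print(p_rev(P_AVE, weights, scores))
```

The boundary cases behave as the text describes: all scores C give Prev = Pave, all A give Prev = Plow, and all F give Prev = Phigh.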
A revised value for the installation specific risk may be calculated by use of the
platform specific data (Prev) as input data in the risk model (event trees/fault trees)
described above.
18.4.7 Remarks
We refer to Sklet et al. (2005) for a detailed discussion of this approach, and
relevant references for similar methods.
Compared to a traditional QRA model, the BORA approach is more detailed
and includes considerably more risk influencing factors, which gives more
detailed information about the factors contributing to the total risk, i.e. a more
detailed risk picture. The analysis allows one to study the effect of maintenance
efforts on risk, and thus provides support for maintenance decisions. The risk
analysis can be used to identify the critical factors, as well as to express the
effect of risk reducing measures.
that the use of expected values is the appropriate criterion for determining the best
policies. The justification is the statistical property of a mean. If we consider a
large set of similar activities and Xi denotes the consequences of the i-th activity,
then the law of large numbers says that under certain conditions the mean of the
Xi's is approximately equal to E[Xi]. Also, portfolio theory supports the use of
expected values; see e.g. Abrahamsen et al. (2005).
The use of traditional cost-benefit analyses to support decision making is based
on the same type of logic. Cost-benefit analysis means that we assign monetary
values to all relevant attributes, including costs and safety, and summarise the
performance of an alternative by the expected net present value, E[NPV]. The main
principle in transformation of goods into monetary values is to find out what the
maximum amount society is willing to pay to obtain an improved performance.
Use of cost-benefit analysis is seen as a tool for obtaining efficient allocation of the
resources, by identifying which potential actions are worth undertaking and in what
fashion. By adopting the cost-benefit method the total welfare is optimised. This is
the rationale for the approach. Although cost-benefit analysis was originally
developed for the evaluation of public policy issues, the analysis is also used in
other contexts, in particular for evaluating projects and activities in firms. The
same principles apply, but using values reflecting the decision maker's benefits and
costs, and the decision maker's willingness to pay.
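The expected-value logic of such an analysis can be sketched in a few lines. All figures below are hypothetical: an investment in a safety measure today against an expected yearly reduction in accident cost:

```python
# Minimal E[NPV] sketch for evaluating a maintenance/safety alternative.
# All cash flows, probabilities and the discount rate are hypothetical.

def expected_npv(cash_flows, rate):
    """E[NPV] of a stream of expected yearly cash flows (year 0 first)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

# Invest 5.0 in a safety barrier now; each following year the expected
# accident cost drops by p * loss = 0.01 * 100 = 1.0 (hypothetical values):
p_accident, loss = 0.01, 100.0
yearly_benefit = p_accident * loss
cash_flows = [-5.0] + [yearly_benefit] * 10   # 10-year horizon
rate = 0.07

print(round(expected_npv(cash_flows, rate), 2))
```

A positive E[NPV] would recommend the investment under the pure expected-value criterion; the text goes on to argue why this criterion alone is insufficient.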
However, risk is more than expected values. The most common definition of
risk in the engineering community is that risk is the combination of consequences
and probability, i.e. the combination (X, P), where P refers to probability; see e.g.
ISO (2002). We extend this definition by using the pair (X, U), where U refers to
uncertainty. Probability is a way of expressing the uncertainties. Following these
perspectives on risk, there is a need to see beyond the expected values. The argu-
ments can be summarised as follows.
What we search for is desirable outcomes X, for example no accidents and high
profit. In practice we have a finite number of projects, and the mean numbers based
on these projects are not the same as the expected value. An accident could result
in losses that are significant also in a corporate perspective: the standard deviation
of the project loss could be significant relative to the total cash flow of the firm.
And since the uncertainties in the consequences are large, the assumptions and
suppositions made in the calculation of the expected value may influence the
results to a large extent. The assessments made should be seen as considerations
based on relevant information, but there could be different assessments, different
views and different perspectives on the uncertainties. This applies in particular to
assigned, small probabilities of rare events.
A complicating factor is that safety and risk involve the balance between
different attributes, including lives and money. The above expected value approach,
for example based on cost-benefit analyses, is based on one being able to transform
all values to one unit, the economic value. And from a business perspective, firms
may argue that this is the only relevant value. All relevant values should be trans-
formed to this unit. This means that the expected costs of accidents and lives should
be incorporated in the evaluations.
But what is the economic value of a life? For most human beings it is infinite;
most people would not be willing to give their life for a certain amount of
money. We say that a life has a value in itself. But of course, an individual may
accept a risk for certain money or other benefits. And for the firm, this is the way
of thinking: the balance of costs and risk. The challenge, however, is to perform
this balancing. What are reasonable numbers for the firm to use for valuing that a life
has a value in itself? Obviously there are no correct answers, as it is a managerial
and strategic issue. High values may be used if it can be justified that this would
produce high performance levels, on both safety and production.
Consequently, uncertainty needs to be considered, beyond the expected values,
which means that the principles of robustness and caution (precaution) have a role
to play. Risk-averse behaviour is often the result. The point is that we put more
weight on possible negative outcomes than the expected values support. Many
firms seem in principle to be in favour of a risk neutral strategy for guiding their
decisions, but in practice it turns out that they are often risk averse. The justification
is partly based on the above arguments. In the case of a large accident, the
possible total consequences could be rather extreme: the total loss for the firm in a
short and long term perspective is likely to be high due to loss of production,
penalties, loss of reputation, changes in the regulation regimes, etc. The overall
loss is difficult to quantify (the uncertainties are large) and this is seldom done in
practice, but the overall conclusion is that investments in safety are required. The
expected value is not the only basis for making this conclusion. We apply a
cautionary principle, expressing that in the face of uncertainty, caution should be a
ruling principle. For example, in a process plant, major hydrocarbon leaks might
occur, requiring investments in various safety systems and barriers to reduce the
possible consequences: we are cautious. Uncertainties in phenomena and
processes justify investments in safety.
Thus to conclude on maintenance alternatives, we need an approach which
provides decision support beyond expected values. We recommend an assessment
process following a structure as summarized in the following (Aven and Vinnem
2007).
For a specified alternative, say A, we assess the consequences or effects of this
alternative seen in relation to the defined attributes (safety, costs, reputation, etc.).
Hence we first need to identify the relevant attributes (X1, X2, …) and then assess
the consequences of the alternative for these attributes. These assessments could
involve qualitative or quantitative analysis. Regardless of the level of quantifica-
tion, the assessments need to consider both what the expected consequences are, as
well as uncertainties related to the possible consequences. Often the uncertainties
could be large. In line with the adopted perspective on risk, we recommend a struc-
ture for the assessment according to the following scheme:
Hence for each alternative and attribute we may have information covering the
following points:
Predictions of attribute (e.g. zero fatalities)
Expected value (e.g. 0.1 fatalities)
Probability distribution (e.g. expressing a probability of a major accident)
Risk description on a lower level (e.g. prediction of number of leaks,
expected number of leaks, etc.)
Aspects of the consequences
Uncertainty factors
Manageability factors
These assessments provide a basis for comparing alternatives and making a de-
cision.
Compared to standard ways of presenting risk results, this basis is much more
comprehensive. In addition, sensitivity analyses and robustness analyses are to be
performed. Of course, the depth of the analysis will be a function of the decision
situation, the risks involved and the resources to be used. The full risk descriptions
as outlined above would be used only in special situations, requiring a comprehen-
sive decision support basis.
We refer to Aven and Vinnem (2007) for further reflections on the above
issues, and in particular the use of cost-benefit analyses. A key question discussed
is to what extent it is appropriate to adjust the value of a (statistical) life and adjust
the discount rate to take into account the uncertainties.
In maintenance application there is often reference to the use of risk acceptance
criteria, as upper limits of risk acceptance expressed for example by the PLL or
FAR values; see e.g. Khan and Haddara (2003). We are sceptical of the prevailing
thinking concerning risk acceptance criteria; see Aven and Vinnem (2005, 2007).
We all agree on the need for considering risk as a basis for making decisions under
uncertainty. Such considerations must, however, be seen in relation to other
concerns, costs and benefits. Care should be shown when using pre-determined risk
acceptance criteria in order to obtain good arrangements, plans and measures, as
they easily lead to the wrong focus: using risk analysis to verify that these limits
are met, with no drive for risk reduction and safety improvements.
The use of risk acceptance criteria cannot replace managerial review and
judgement. The decision support analyses need to be evaluated in the light of the
premises, assumptions and limitations of these analyses. The analyses are based on
a background information that must be reviewed together with the results of the
analyses. Risk analysis provides decision support, not hard decisions. We refer to
Aven and Vinnem (2007).
18.6 Conclusions
This chapter has presented and discussed the use of risk analysis for the selection
and prioritisation of maintenance activities. The chapter has reviewed some critical
aspects of risk analysis important for the successful implementation of such analy-
ses in maintenance. This relates to risk descriptions and categorisations, uncer-
tainty assessments, risk acceptance and risk informed decision making, as well as
selection of appropriate methods and tools.
In the risk analysis, the maintenance efforts are incorporated by:
Showing the relation between maintenance effort and component performance
Showing the relation between component performance and overall risk indices
An example is shown in Section 18.4. This example demonstrates some of the
problems related to incorporating the maintenance efforts into the risk analysis.
The analysis needs to be rather detailed to support the decision making. Developing
suitable methodology is not straightforward, for example regarding how to assign
installation specific probabilities, based on the information available (including
reliability and maintenance data). Further research is undoubtedly required to give
confidence in the methods to be used. A detailed analysis requires substantial input
data, and the data must be relevant. Such analyses cannot be performed without
extensive use of expert judgment. However, expert judgment is not to be seen as
something negative. The risk analysis is a tool for summarising the information
available (including uncertainties), and expert judgment constitutes an important
part of this information.
18.7 References
Abrahamsen, E.B., Aven, T., Vinnem, J.E. and Wiencke, H.S. (2005) Safety management
and the use of expected values. Risk, Decision and Policy, 9, 347–358.
Andersen, R.T. and Neri, L. (1990) Reliability-Centred Maintenance: Management and
Engineering Methods, Elsevier Applied Sciences, London.
Apeland, S. and Aven, T. (2000) Risk based maintenance optimization: foundational issues.
Reliability Engineering and System Safety, 67, 285–292.
Aven, T. (1992) Reliability and Risk Analysis, Elsevier Applied Science, London.
Aven, T. and Jensen, U. (1999) Stochastic Models in Reliability, Springer-Verlag, New
York.
Aven, T. and Kristensen, V. (2005) Perspectives on risk: review and discussion of the
basis for establishing a unified and holistic approach. Reliability Engineering and
System Safety, 90, 1–14.
Aven, T. and Vinnem, J.E. (2005) On the use of risk acceptance criteria in the offshore oil
and gas industry. Reliability Engineering and System Safety, 90, 15–24.
Aven, T., Vinnem, J.E. and Wiencke, H.S. (2007) A decision framework for risk
management. Reliability Engineering and System Safety, 92, 433–448.
Aven, T. and Vinnem, J.E. (2007) Risk Management, with Applications from the Offshore
Oil and Gas Industry, Springer-Verlag, New York.
Brewer, H.D. and Canady, K.S. (1999) Probabilistic safety assessment support for the
maintenance rule at Duke Power Company. Reliability Engineering and System Safety,
63, 243–249.
Cepin, M. (2002) Optimization of safety equipment outages improves safety. Reliability
Engineering and System Safety, 77, 71–80.
Clarotti, C.A., Lannoy, A. and Procaccia, H. (1997) Probabilistic risk analysis of ageing
components which fail on demand; a Bayesian model: application to maintenance
optimization of diesel engine linings. In Proceedings of Ageing of Materials and
Methods for the Assessment of Lifetimes of Engineering Plant, Cape Town, pp. 85–94.
Dekker, R. (1996) Applications of maintenance optimization models: a review and analysis.
Reliability Engineering and System Safety, 51, 229–240.
Faber, M.H. (2002) Risk-based inspection: an introduction. Structural Engineering
International, 12, 186–194.
Haimes, Y.Y. (1998) Risk Modeling, Assessment, and Management, Wiley, New York.
ISO (2002) Risk management vocabulary. ISO/IEC Guide 73.
Khan, F.I. and Haddara, M.M. (2003) Risk-based maintenance (RBM): a quantitative
approach for maintenance/inspection scheduling and planning. Journal of Loss
Prevention, 16, 561–573.
Knoll, A., Samanta, P.K. and Vesely, W.E. (1996) Risk based optimization of the frequency
of EDG on-line maintenance at Hope Creek. In Proceedings of Probabilistic Safety
Assessment, Park City, pp. 378–384.
Modarres, M. (1993) What Every Engineer Should Know about Reliability and Risk
Analysis, Marcel Dekker, New York.
van Manen, S.E., Janssen, M.P. and van den Bunt, B. (1997) Probability-based optimization
of maintenance of the River Maas weir at Lith. In Proceedings of the European Safety
and Reliability Conference (ESREL), Lisbon, pp. 1741–1748.
Papazoglou, I.A., Bellamy, L.J., Hale, A.R., Aneziris, O.N., Post, J.G. and Oh, J.I.H. (2003)
I-Risk: development of an integrated technical and management risk methodology for
chemical installations. Journal of Loss Prevention in the Process Industries, 16, 575–591.
Paté-Cornell, E.M. and Murphy, D.M. (1996) Human and management factors in
probabilistic risk analysis: the SAM approach and observations from recent applications.
Reliability Engineering and System Safety, 53, 115–126.
Perryman, L.J., Foster, N.A. and Nicholls, D.R. (1995) Using PRA in support of
maintenance optimization. International Journal of Pressure Vessels & Piping, 61,
593–608.
Podofillini, L., Zio, E. and Vatn, J. (2006) Risk-informed optimisation of railway tracks
inspection and maintenance procedures. Reliability Engineering and System Safety, 91,
20–35.
PSA (2004) Trends in Risk Levels on the Norwegian Continental Shelf, Main Report Phase 4,
2003 (in Norwegian). The Petroleum Safety Authority Norway, Stavanger.
Rausand, M. and Høyland, A. (2003) System Reliability Theory, Wiley, New York.
Renn, O. and Klinke, A. (2002) A new approach to risk evaluation and management: risk-
based, precaution-based and discourse-based strategies. Risk Analysis, 22, 1071–1094.
Sandøy, M., Aven, T. and Ford, D. (2005) On integrating risk perspectives in project
management. Risk Management: An International Journal, 7, 7–21.
Sklet, S., Hauge, S., Aven, T. and Vinnem, J.E. (2005) Incorporating human and
organizational factors in risk analysis for offshore installations. In Proceedings of
ESREL 2005, pp. 1839–1847.
Thomassen, O. and Sørum, M. (2002) Mapping and monitoring the safety level. SPE 73923,
Society of Petroleum Engineers.
Vatn, J., Hokstad, P. and Bodsberg, L. (1996) An overall model for maintenance
optimization. Reliability Engineering and System Safety, 51, 241–257.
19
Maintenance Performance Measurement (MPM) System
19.1 Introduction
Maintenance is an important support function in businesses with significant
investment in physical assets, and it plays an important role in achieving
organizational goals. However, for many industries the cost of maintenance and
downtime is too high to ignore. For example, the cost of maintenance in a highly mechanized
mine can be 40–60% of the operating cost (Campbell 1995), the maintenance
spending in the UK's manufacturing industry ranges from 12 to 23% of the total
factory operating costs (Cross 1988) and, as per a study in Germany, the annual
spending on maintenance in Europe is around 1500 billion euros (Altmannshopfer
2006). All these figures have motivated senior managers and maintenance engineers to
measure the contribution of maintenance towards total business goals or in terms of
return on investment, etc.
Prior to the 1940s, maintenance was considered a necessary evil and the
general attitude to maintenance was "It costs what it costs". During 1950–80, with
the advent of techniques like preventive maintenance and condition monitoring, the
perception changed to "maintenance is an important support function and it can be
planned and controlled". Today maintenance is considered an integral part of the
business process and it is perceived as follows: "It creates additional value" (Liyanage and
Kumar 2003). The creation of additional value by maintenance is expressed in
terms of increased productivity, better utilisation of plant and system, lower
accident rates and a better working environment. With increasing awareness that
maintenance creates additional value in the business process, more and more
companies are treating maintenance as an integral part of the business process, and
the maintenance function has become an essential element of the strategic thinking of
many companies involved in the service and manufacturing industries. With this change
in the mindset of senior asset managers and owners, it has become essential to
measure the performance of the manufacturing process to understand the tangible and,
if possible, intangible contribution of maintenance towards business goals. However,
without any formal measures of performance, it is difficult to plan, control
460 U. Kumar and A. Parida
and improve the maintenance process. With this, the focus has shifted to measuring
the performance of maintenance. Maintenance performance needs to be measured
to evaluate, control and improve the maintenance activities to ensure achievement
of organizational goals and objectives.
In recent years, maintenance performance measurement (MPM) has received a
great amount of attention from researchers and practitioners due to a paradigm shift
in maintenance. This chapter deals with the broad topic of performance measure-
ment (PM), metrics and measures for MPM, reviews the existing MPM frame-
works, discusses various issues and challenges associated with the development
and implementation of an MPM system. The outline of the chapter is as follows: an
overview of various PM frameworks and their development are presented in
Section 19.2. Definitions of maintenance performance indicator (MPI), and MPM
system, and their salient features are discussed in Section 19.3. The important
issues associated with the development of MPM system are discussed in Section
19.4, while the MPIs under different criteria are explained in Section 19.5. The
MPM system and the framework are explained in Section 19.6. Some of the MPIs
and MPM system in different industries are discussed in Section 19.7. The final
section concludes the chapter with limitations of the current literature and practice.
Performance measure is the term used when talking about PM in general. Per-
formance indicators (PIs) are measures that describe how well an operation is
achieving its objectives. A PI of an activity is a ratio of two variables: the output to
the input of that activity. A performance measure thus can be defined as a metric for
quantifying the efficiency and/or effectiveness of past or future activities, whereas
a performance metric is the definition of the scope, content and component parts of
a broadly based performance measure (Neely et al. 2002). The characteristics of
performance measures include relevance, interpretability, timeliness, reliability and
validity (Al-Turki and Duffuaa 2003). A PI is a more specific measure: it gauges or
indicates performance. PIs are broadly classified as leading or lagging indicators.
Leading indicators are performance drivers and are used for understanding the
present status and taking corrective measures to achieve the desired target. A
leading indicator is of the non-financial and statistical type that fairly and reliably
predicts in advance. A leading indicator thus works as a performance driver and
ascertains the present status in comparison with the reference indicator level. At
the maintenance department level, condition monitoring indicators such as noise,
vibration, thermographic measurements and particles in oil can be leading indicators.
Lagging indicators are outcome measures and provide a basis for studying the
deviations after the completion of the activities. The cost of maintenance and the mean
time between failures (MTBF) are a few examples of lagging indicators. Since PIs are
just indicators of performance, a key performance indicator (KPI) is an aggregation
of various PIs in a logical way. Thus, a KPI is a more strategic and important
indicator of performance (Wireman 1998). The main purpose of a KPI is to pinpoint
possible areas for improvement within an organization.
Until 1980, PM was mostly based on financial measures. Kaplan and Norton
(1992) suggested the balanced scorecard as a more pragmatic and progressive
framework to measure performance in a balanced way. The balanced scorecard,
with its four perspectives, focuses on both tangible and intangible perspectives of
the business process: customers, internal processes, financial, and innovation
and learning. Subsequently, various researchers have developed frameworks con-
sidering non-financial measurements and intangible assets to achieve competitive
advantages (Parida and Kumar 2006). Some studies have shown that companies
using an integrated balanced PM system perform better than those which do not
measure their performance (Kennerly and Neely 2003; Lingle and Schiemann
1996). Some of the major PM frameworks developed by various authors and
researchers are the Du Pont pyramid (Chandler 1977), the PM matrix (Keegan et al. 1989),
the results and determinants matrix (Fitzgerald et al. 1991), the balanced scorecard (BSC)
(Kaplan and Norton 1992), the SMART pyramid (Lynch and Cross 1991), the integrated
PM framework (Medori and Steeple 2000), the performance prism (Neely et al. 2002),
the BSC of advanced information (Abran and Buglione 2003), and the European Founda-
tion for Quality Management (EFQM) model (Wongrassamee et al. 2003).
The most important issue in developing an MPM system is to measure the value
created by the maintenance process. As a manager, one must know that what is being
done is what is needed by the business process; if the maintenance output is
not contributing to or creating any value for the business, it needs to be restructured.
Examples of such value measures are the ratio of investment made and trends in cost per ton.
New operating and maintenance strategies are adopted and followed by industries
in quick response to market demand, for the reduction of production loss and
process waste. MPM measures the value created by the maintenance. Some of the
important questions related to strategy are as follows:
How does one assess and respond to stakeholders' (internal and external)
needs?
How does one translate the corporate goal and strategy into targets and
goals at the operational level (converting a subjective vision into objective
goals)?
How does one integrate the results and outcomes from the operational level
to develop lead indicators at the corporate level (converting objective out-
comes into strategic KPIs and linking them to strategic goals and targets)?
How to support innovation and training for the employees to facilitate an
MPM oriented culture?
Most organizations make the mistake of measuring what is easy to measure, rather
than what needs to be measured. Thus, over a period of time, the indicators
drift out of tune with the corporate strategy. Besides, a large amount of undesired
data creates data overload, and such data are rarely utilised for analysis or decision
making. Therefore, the MPIs need to be identified and selected to meet the specific
requirements of the organization and its related issues.
Today organizations are trying to adopt a flat and compact organizational structure,
a virtual work organization, and empowered, self-managing, knowledge manage-
ment work teams and work stations. The organisational maintenance issues are to
measure maintenance effectiveness and resources spent on maintenance. Typically
in an organization, the top level looks for the investment and decides the corporate
strategy, based on which the operation and maintenance strategies are formulated.
Depending on the maintenance strategy, maintenance program and policies are
defined, which are implemented by the middle level. The operational level undertakes
the actual tasks of performing the activities. The issues pertaining to organization are:
Need for developing a reliable and meaningful MPM system.
Commitment of the top management for the MPM system.
Converting the subjective corporate goals to specific targets and MPIs re-
quired to be measured.
Involvement of the employees in implementation of the MPM system.
Method and means of these measurements.
Periodicity (time period) of such measurements.
Figure 19.1. Linkages between objective outcomes at operational level to strategic level and
breaking down of goals into objective targets
As shown in the figure, when cascading down the corporate goals of a mining
company with an installed capacity of 0.6 million ton per month, the monthly
production target of 0.51 million ton of iron ore pellet cascades down
to a system availability of 96% at the tactical level, which must be translated into a
maximum allowed planned stop of 20 h per month and an unplanned plant stop of 8.8 h
per month. Similarly, when aggregating upwards, the MPIs such as planned and unplanned
stops need to be aggregated to a higher level in terms of availability and capacity
utilization. The calculations are as follows:
Plant capacity = 0.6 million ton per month
Saleable quantity = 0.51 million ton per month
Plant capacity = 835 tons per hour
Goals (tactical): Availability (A) = 96%, Speed (P) = 90%
and Quality (Q) = 99%
OEE = A × P × Q = 0.96 × 0.90 × 0.99 ≈ 0.85
Non-availability = 24 × 30 × 0.04 = 28.8 h per month
Planned stop = 20 h/month and unplanned stop = 8.8 h/month
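The calculation above can be reproduced in a few lines (Python, using the chapter's numbers):

```python
# OEE and allowed downtime calculation from the mining example.

HOURS_PER_MONTH = 24 * 30   # 720 h in a 30-day month

availability, speed, quality = 0.96, 0.90, 0.99
oee = availability * speed * quality        # about 0.855, i.e. roughly 0.85

plant_capacity = 0.6                        # million ton per month
saleable = plant_capacity * oee             # about 0.51 million ton per month

non_availability = HOURS_PER_MONTH * (1 - availability)   # 28.8 h per month
planned_stop = 20.0                         # h/month, the management target
unplanned_stop = non_availability - planned_stop          # 8.8 h per month

print(round(oee, 3), round(saleable, 3), round(non_availability, 1), round(unplanned_stop, 1))
```

This makes the cascading explicit: a 96% availability goal at the tactical level fixes the 28.8 h monthly non-availability budget, which is then split between planned (20 h) and unplanned (8.8 h) stops at the operational level.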
Since the actual production and the OEE level have gone down, the management
now has to take remedial measures and make appropriate decisions to
achieve the desired level of OEE and production.
Down-time for the number of minor and major stops. This is expressed in
hours and minutes for the total number of stops or for each minor and
major stop.
Rework. Rework due to maintenance lapses (for example, not sharpening
the tools), expressed in time (hours and minutes), the number of pieces on
which rework has been carried out and the cost of the rework undertaken.
for checking the frequency of failure and time taken to fix the failure. Some of the
MPIs considered under this criterion are:
Number of new ideas generated for improvement
Skills and competency development/training
The MPIs at functional and tactical levels get aggregated as KPIs at the
strategic level. For example, MPIs like availability, performance (production
rate) and quality at the operational level aggregate to OEE at the tactical level, and to
capacity utilization at the strategic level under the plant/equipment criteria.
Maintenance Performance Measurement (MPM) System 473
The International Atomic Energy Agency (IAEA) has been actively sponsoring
work in the area of indicators to monitor nuclear power plant (NPP) operational
safety performance since the early 1990s. The safe operation of nuclear power
plants is the accepted goal for the top management. A high level of safety results
from the integration of good design, operational safety and human performance.
In order to be effective, a holistic and integrated approach is required to be
adopted for providing a performance measurement framework and identifying the
performance indicators with desired safety attributes for the operation of the nu-
clear plant.
The NPP performance parameters include both safety and economic performance
indicators, with safety aspects overriding. To assess the operational safety of an
NPP, a set of tools such as probabilistic safety assessment (PSA), regulatory
inspection, quality assurance and self-assessment is used. Two categories of
indicators commonly applied are risk-based indicators and safety culture indicators.
The cost of maintenance and its influence on the total system effectiveness of the
oil and gas industry is too high to ignore (Kumar and Ellingsen 2000). The safe
operation of oil and gas production units is the accepted goal for the management
of the industry. A high level of safety is achieved through the integration of
good design, operational safety and human performance. To be effective, an
integrated approach must be adopted to identify the MPIs with the desired
safety attributes for the operation of the oil and gas production unit.
Some of the MPIs reported from plant level to result unit level to result area
level for the Norwegian oil and gas industry, grouped into different categories,
are as follows (Kumar and Ellingsen 2000):
Production
Produced volume (Sm3)
Planned production (Sm3)
Technical integrity
Backlog preventive maintenance (man-hours)
Backlog corrective maintenance (man-hours)
Maintenance
Maintenance man-hours total
Maintenance man-hours safety systems
Deferred production
Due to maintenance (Sm3)
Due to operation (Sm3)
Due to drilling/well operations (Sm3)
Weather and other causes (Sm3)
ance measurement frameworks. There is further scope to study the impact of
different cultural and human behavioural aspects associated with MPM.
19.8 References
Abran, A. and Buglione, L. (2003), A multidimensional performance model for consolidating
Balanced Scorecards, Advances in Engineering Software, 34, pp. 339–349
Åhrén, T. and Kumar, U. (2004), Use of maintenance performance indicators: a case study at
Banverket, Conference proceedings of the 5th Asia-Pacific Industrial Engineering and
Management Systems Conference (APIEMS2004), Gold Coast, Australia
Altmannshoffer, R. (2006), Industrielles FM, Der Facility Manager (in German), April
issue, pp. 12–13
Al-Turki, U. and Duffuaa, S. (2003), Performance measures for academic departments,
International Journal of Educational Management, Vol. 17, No. 7, pp. 330–338
Andersen, B. and Fagerhaug, T. (2002), Eight steps to a new performance measurement
system, Quality Progress, 35, 2, pp. 11–25
Campbell, J.D. (1995), Uptime: Strategies for Excellence in Maintenance Management.
Portland, OR: Productivity Press
Chandler, A.D. (1977), The Visible Hand: the Managerial Revolution in American Business,
Boston, MA, Harvard University Press, pp. 417
Cross, M. (1988), Raising the value of maintenance in the corporate environment,
Management Research News, Vol. 11, No. 3, pp. 8–11
DOE-HDBK-1148-2002 (2002), Work Smart Standard (WSS) Users' Handbook, Department
of Energy, USA, www.eh.doe.govt/tecgstds/standard/hdbk1148/hdbk11482002.pdf
Fitzgerald, L., Johnson, R., Brignall, S., Silvestro, R. and Voss, C. (1991), Performance
Measurement in Service Businesses, London, CIMA
IAEA, International Atomic Energy Agency, (2000), A Framework for the Establishment of
Plant specific Operational Safety Performance Indicators, Report, Austria
Kaplan, R.S. and Norton, D.P. (1992), The balanced scorecard: measures that drive
performance, Harvard Business Review, January–February, pp. 71–79
Keegan, D., Eiler, R. and Jones, C. (1989), Are your performance measures obsolete?
Management Accounting, June, pp. 45–50
Kennerly, M. and Neely, A. (2003), Measuring performance in a changing business
environment, International Journal of Operation and Production Management, Vol. 23,
No. 2, pp. 213–229
Kumar, U. and Ellingsen, H. P. (2000), Development and implementation of maintenance
performance indicators for the Norwegian oil and gas industry, Conference proceedings
of 15th European Maintenance Conference (Euro Maintenance 2000), Gothenburg,
Sweden
Lingle, J.H. and Schiemann, W.A. (1996), From balanced scorecard to strategy gauge: is
measurement worth it? Management Review, March, pp. 56–62
Liyanage, J.P. and Kumar, U. (2003), Towards a value-based view on operations and
maintenance performance management, Journal of Quality in Maintenance Engineering,
Vol. 9, pp. 333–350
Lynch, R.L. and Cross, K.F. (1991), Measure up!: the Essential Guide to Measuring
Business Performance, London, Mandarin
Medori, D. and Steeple, D. (2000), A framework for auditing and enhancing performance
measurement systems, International Journal of Operation & Production Management,
Vol. 20, No. 5, pp. 520533
Meyer, M.W. and Gupta, V. (1994), The performance paradox, in Straw, B.M. and
Cummings, L.L. (Eds), Research in Organizational Behavior, Vol. 16, Greenwich, CT,
JAI Press, pp. 309–369
Miles, M.B. and Huberman, A.M. (1994). Qualitative Data Analysis, Sage Publication,
California, USA.
Murthy, D.N.P., Atrens, A. and Eccleston, J.A. (2002), Strategic maintenance management,
Journal of Quality in Maintenance Engineering, Vol. 8, No. 4, pp. 287–305
Neely, A.D. (1999), The performance measurement revolution: why now and where next,
International Journal of Operation and Production Management, Vol. 19, No. 2, pp.
205–228
Neely, A., Adams, C. and Keenerly, M. (2002), The Performance Prism, Prentice Hall,
Financial Times, Harlow, UK
Parida, A., Chattopadhyay, G. and Kumar, U. (2005), Multi criteria maintenance performance
measurement: a conceptual model, in Proceedings of the 18th International Congress of
COMADEM, 31st Aug–2nd Sep 2005, Cranfield, UK, pp. 349–356
Parida, A. and Kumar, U. (2006), Maintenance performance measurement (MPM): issues
and challenges, Journal of Quality in Maintenance Engineering, Vol. 12, No. 3, pp.
239–251
Tsang, A.H.C. (1998), A strategic approach to managing maintenance performance, Journal
of Quality in Maintenance Engineering, Vol. 4, No. 2, pp. 87–94
Wealleans, D. (2000), Organizational Measurement Manual, Abingdon, Oxon, GBR,
Ashgate Publishing Limited
Wireman, T. (1998), Developing Performance Indicators for Managing Maintenance, New
York, Industrial Press, Inc.
Wongrassamee, S., Gardiner, P.D. and Simmons, J.E.L. (2003), Performance measurement
tools: the balanced scorecard and the EFQM Excellence Model, Measuring Business
Excellence, Vol. 7, pp. 14–29
20
Forecasting and Inventory Management for Service Parts
J. Boylan and A. Syntetos
20.1 Introduction
Service parts are ubiquitous in modern societies. Their need arises whenever a
component fails or requires replacement. In some sectors, such as the aerospace
and automotive industries, a very wide range of service parts are held in stock, with
significant implications for availability and inventory holding. Their management
is therefore an important task.
A distinction should be drawn between preventive maintenance and corrective
maintenance. Demand arising from preventive maintenance is scheduled and is
deterministic, at least in principle. Demand arising from corrective maintenance,
after a failure has occurred, is stochastic and requires forecasting.
Fortuin and Martin (1999) categorise the contexts for service logistics as
follows:
Technical systems under client control (e.g. machines in production
departments, transport vehicles in a warehouse);
Technical systems sold to customers (e.g. telephone exchange systems,
medical systems in hospitals);
End products used by customers (e.g. TV sets, personal computers, motor
cars).
In the first context, there is usually a specialist department within the client
organization performing maintenance activities and managing service parts in-
ventories. In the second context, a specialist department within the vendor organi-
zation will generally undertake these tasks. In both cases, a large amount of infor-
mation is known by the vendor, or can be shared with the vendor. This information
may include scheduled (preventive) maintenance activities, times between failures,
usage rates and condition of equipment.
When a wealth of data is available, it is possible to identify explanatory
variables which may be used to predict the demand of service parts. For example,
480 J. Boylan and A. Syntetos
Ghobbar and Friend (2002) showed that the average demand interval for aircraft
spare parts depends on the aircraft utilization rate, the component overhaul life and
the type of primary maintenance process. In a further study, Ghobbar and Friend
(2003) showed how forecast accuracy depends on various characteristics of the
demand process, including the seasonal period length, as well as the primary
maintenance process. Hua et al. (2006) used two zero-one explanatory variables,
plant overhaul and equipment overhaul, to help predict demand of spare parts
in the petrochemical industry. In other cases, explanatory variables have been used
to predict part of the demand for a stock keeping unit (SKU). For example,
Kalchschmidt et al. (2006) identified clusters of customers whose sales were
correlated with promotional activities and clusters of customers that were unaffec-
ted, using appropriate forecasting methods for each group.
In the third context, parts are used by consumers and much less information is
available. Fortuin and Martin (1999, p 957) commented, "Clients are anonymous,
their usage of consumer products and their maintenance concept are not known."
Most demand arises from purely corrective maintenance (e.g. on TV sets, personal
computers) required in the case of a defect. Even when preventive maintenance
occurs (e.g. on motor cars), prediction is complicated by the maintenance concept
of consumers being unknown. For example, customers may not bring in their cars
at the correct time for a service, or may not bring them in at all. In many practical
situations where end products are used by consumers, the vendor must gauge
demand for service parts from the demand history alone. Such demand patterns are
often sporadic, with occasional spikes of demand. Alternatively, demand for an
SKU may be decomposed into regular and irregular components (Kalchschmidt et
al. 2006). In both cases, sporadic demand for service parts poses a considerable
challenge to those responsible for managing inventories. It is this challenge that
will be addressed in this chapter.
The remainder of the chapter is structured as follows. In the next section we
address issues pertinent to the classification of service parts for forecasting and
inventory management related purposes. Parametric and non-parametric approaches
to forecasting service parts requirements are then discussed in Sections 20.3 and
20.4 respectively. In Section 20.5, we present various metrics appropriate for
measuring the performance of the inventory management system whereas in
Section 20.6 we review the limited number of studies that provide empirical
evidence on: i) the performance of forecasting methods for service parts and ii) the
empirical fit of statistical distributions to the corresponding underlying demand
patterns. Finally, the conclusions of our work are summarized in Section 20.7.
A product life cycle approach is often used in marketing, with three phases of
growth, maturity and decline. A similar classification may be adopted for stock
control, with the phases aligned directly to the decisions required for the inventory
management of service parts.
Fortuin (1980) suggested three phases: initial, normal and final. In the initial
phase, when the part is introduced, there are two decisions: i) should the item be
stocked and ii) if so, what are the initial stock requirements? In the normal phase,
an inventory policy must be determined and its parameters estimated. If an
order-up-to (OUT) policy is adopted, for example, then the order-up-to level must be
calculated. As the part nears the end of its life, suppliers may become reluctant to
manufacture small volumes, as required by clients, particularly if the part has high
manufacturing set-up costs. In this final phase, a decision must be taken on the size
of a single order to cover all remaining demand (sometimes known as an "all-time
buy"). Teunter (1998) analysed this problem from a theoretical perspective, while
Teunter and Fortuin (1998) reported a case-study of a company facing such a
decision.
Faster moving service parts are commonly forecast using time-series methods. The
specific method that should be employed depends on the characteristics of the
Forecasting and Inventory Management for Service Parts 483
[Figure 20.1. Monthly demand (units), January to December, contrasting a slow demand pattern with a lumpy demand pattern]
The first and second factors determine the intermittence of demand. In response
to this intermittence, for those SKUs with very few customers, it may become
feasible to liaise directly with them, and to enhance forecasts accordingly.
The third and fourth factors determine the erraticness of demand. As orders
become more irregular, exploiting early information at the customer level becomes
more attractive. Of course, such early indications are not always available. This
will often be the case when addressing consumer demand. It is also possible that
early confirmed orders may give a good indication of final orders. This is particularly
useful when there is a strong correlation between customers' demands.
The five factors, and their effect on intermittence, erraticness and lumpiness,
are summarised in Figure 20.2.
[Figure 20.2. The numerousness of customers and the frequency of individual orders drive intermittence; the heterogeneity of customers and the variety of customers' requests drive erraticness]
For those items without early indicators, forecasting must be undertaken using
a purely time-series approach. This is usually linked to a demand distribution, so
that inventory levels may be set to achieve high percentage service level targets.
Many inventory management systems make distributional assumptions of demand
according to the ABC classification. For example, A and B items may be taken to
be normally distributed, whilst C items are assumed to be Poisson. In practice,
however, many service parts have demand that is more erratic than Poisson
(sometimes known as over-dispersed). The Poisson dispersion index (ratio of the
variance to the mean of demand, including zero demands) can be used to classify
SKUs as Poisson or non-Poisson. If the index is close to unity, then a Poisson
distribution is indicated; if the index is greater than unity, then other distributions,
such as the negative binomial, may be more appropriate, or a non-parametric
approach may be required, as discussed in Section 20.4.
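The dispersion-index classification described above can be sketched as follows; the function names and the tolerance band around unity are illustrative assumptions, not part of the chapter.

```python
from statistics import mean, pvariance

def dispersion_index(demand):
    """Variance-to-mean ratio of demand per period, zero demands included."""
    return pvariance(demand) / mean(demand)

def classify(demand, tolerance=0.1):
    """Label an SKU relative to the Poisson benchmark of a unit index."""
    d = dispersion_index(demand)
    if d > 1.0 + tolerance:
        return "over-dispersed"
    if d < 1.0 - tolerance:
        return "under-dispersed"
    return "Poisson-like"

demand = [0, 3, 0, 0, 5, 0, 2, 0, 0, 4, 0, 0]   # illustrative history
print(dispersion_index(demand), classify(demand))
```

An over-dispersed result would point towards a distribution such as the negative binomial, or a non-parametric approach, rather than the Poisson.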
An obvious way to classify service parts is by frequency of demand. As demand
occurrence becomes more infrequent, with some periods having no demand at all, a
number of difficulties emerge. From a forecasting perspective, methods such as
[Figure 20.3. Categorization of demand patterns by mean inter-demand interval p (break-point p = 1.34) and squared coefficient of variation of demand sizes CV² (break-point CV² = 0.28): smooth (SES), erratic (Croston), intermittent (Croston) and lumpy (Croston)]
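This categorization, with the break-points p = 1.34 and CV² = 0.28 of Figure 20.3, can be sketched as follows. Estimating the mean inter-demand interval as the number of periods divided by the number of non-zero periods is a simplifying assumption, and the series must contain at least one non-zero demand.

```python
from statistics import mean, pstdev

def categorize(demand, p_cut=1.34, cv2_cut=0.28):
    """Return the Figure 20.3 quadrant and the recommended estimator."""
    sizes = [d for d in demand if d > 0]
    p = len(demand) / len(sizes)                 # approx. mean inter-demand interval
    cv2 = (pstdev(sizes) / mean(sizes)) ** 2     # squared CV of demand sizes
    if p < p_cut and cv2 < cv2_cut:
        return "smooth (SES)"
    if p < p_cut:
        return "erratic (Croston)"
    if cv2 < cv2_cut:
        return "intermittent (Croston)"
    return "lumpy (Croston)"

print(categorize([0, 0, 4, 0, 0, 0, 5, 0, 4, 0, 0, 5]))
```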
In summary, service parts may be in the initial, normal or final phases of their
life cycle. In this chapter, we focus attention on the normal phase. Although service
parts may be classified as A, B or C in a Pareto analysis, it is likely that most parts
will be categorized as C. The service requirements for the part may be guided by
criticality and cost considerations, as well as the ABC classification. Further
refinements are necessary to the Pareto classification in order to allocate the most
appropriate forecasting methods to each SKU.
Enhancement of the ABC classification in the manner described above gives a
coherent approach to classification according to forecasting performance and a
foundation for theoretically-informed usage of terms such as erratic, as shown in
Figure 20.4 (after Syntetos 2001, adapted by Boylan et al. 2006).
[Figure 20.4. Demand is classified as intermittent or non-intermittent by the mean inter-demand interval, and as erratic or non-erratic by the coefficient of variation of demand sizes; lumpy demand is intermittent AND erratic, clumped demand is intermittent AND non-erratic]
Demand for service parts is most commonly intermittent in nature. The demand
pattern is characterized by infrequent demands, often of variable size, occurring at
irregular intervals. Consequently, as discussed in Section 20.3.2, it is preferable to
model demand from constituent elements, i.e. the demand size and inter-demand
interval. Therefore, compound theoretical distributions (that explicitly take into
account the size-interval combination) are typically used in such contexts of
application. We first discuss some issues related to modelling demand arrivals and
the variance and the average of the demand history data of each item. The resulting
distribution of demand per period was called a package Poisson distribution. The
same distribution has appeared in the literature under the name hypothetical SKU
(h-SKU) Poisson distribution (Williams 1984), where demand is treated as if it
occurs as a multiple of some constant, or clumped Poisson distribution, for mul-
tiple item orders for the same SKU of a fixed clump size (Ritchie and Kingsman
1985) (please also refer to Figure 20.4 where a definition of clumped demand is
offered). In an earlier work, Friend (1960) also discussed the use of a Poisson
distribution for demand occurrence, combined with demands of constant size. The
package Poisson distribution requires, like the Poisson distribution itself, only an
estimate of the mean demand.
If demand occurs as a Bernoulli process and orders follow the logarithmic-
Poisson distribution (which is not the same as the Poisson-logarithmic process that
yields NBD demand) then the resulting distribution of total demand per period is
the log-zero-Poisson (Kwan 1991). The log-zero-Poisson is a three parameter dis-
tribution and requires a rather complicated estimation method. Moreover, it was
found by Kwan (1991) to be empirically outperformed by the NBD. Hence, the
log-zero Poisson cannot be recommended for practical applications. One other
compound binomial distribution appeared in the literature is that involving nor-
mally distributed demand sizes (Croston 1972, 1974). However, and as discussed
above, a normality assumption is unrealistic and therefore the distribution is not re-
commended for practical applications.
Single exponential smoothing (SES) and simple moving averages (SMA) are often
used in practice to forecast intermittent demand. Both methods have been shown to
perform satisfactorily on real service parts data. However, the standard fore-
casting method for such items is considered to be Croston's method (Croston 1972,
as corrected by Rao 1973). Croston suggested treating the size of orders (z_t) and
the intervals between them (p_t) as two separate series and combining their
exponentially weighted moving averages (obtained using SES) to achieve a forecast
of the demand per period. (Recently, some adaptations of Croston's method have
appeared in the literature that rely upon SMA rather than SES estimates; such
modifications are discussed further later in this section.)
In Croston's work, both demand sizes and intervals were assumed to have
constant means and variances, for modelling purposes, and demand sizes and
demand intervals were assumed to be mutually independent. Demand was assumed
to occur as a Bernoulli process; consequently, the inter-demand intervals are
geometrically distributed (with mean p). The demand sizes were assumed to
follow the normal distribution (with mean μ and variance σ²).
These assumptions have been challenged in respect of their realism (see, for
example, Willemain et al. 1994) and they have also been challenged in respect of
their theoretical consistency with Croston's forecasting method. The latter issue is
further discussed in Section 20.3.3.
Croston's method works in the following way: SES estimates of the average
size of the demand (ẑ_t) and the average interval between demand incidences (p̂_t)
are made after demand occurs (using the same smoothing constant value, α). If no
demand occurs, the estimates remain exactly the same. The forecast of demand per
period (Ŷ_t) is given by Ŷ_t = ẑ_t / p̂_t. If demand occurs in every time period,
Croston's estimator is identical to SES. For constant lead times of length L, the
mean lead-time demand estimate (Ŷ_L) is then obtained as follows:

Ŷ_L = L Ŷ_t    (20.1)
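The procedure just described can be sketched as a minimal implementation. Initialising the estimates from the first demand occurrence is an assumption; the chapter does not specify start-up conditions.

```python
def croston(demand, alpha=0.1):
    """One-step-ahead Croston forecasts; None until the first demand occurs."""
    z_hat = p_hat = None     # smoothed demand size and inter-demand interval
    q = 1                    # periods since the last demand
    forecasts = []
    for d in demand:
        if d > 0:
            if z_hat is None:                    # initialise on first demand
                z_hat, p_hat = float(d), float(q)
            else:                                # SES updates, same alpha
                z_hat += alpha * (d - z_hat)
                p_hat += alpha * (q - p_hat)
            q = 1
        else:
            q += 1                               # estimates stay unchanged
        forecasts.append(z_hat / p_hat if z_hat is not None else None)
    return forecasts

forecasts = croston([0, 0, 3, 0, 0, 4, 0, 5], alpha=0.1)
lead_time_forecast = 4 * forecasts[-1]   # Equation 20.1 with L = 4
```

Note that the estimates are only updated in periods with non-zero demand, so long runs of zeros leave the forecast flat.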
Croston assumed that the estimates of demand size and interval are unbiased:

E(ẑ_t) = E(z_t) = μ    (20.2a)

E(p̂_t) = E(p_t) = p    (20.2b)

According to Croston, the expected estimate of demand per period would then
be E(Ŷ_t) = E(ẑ_t / p̂_t) = E(ẑ_t) / E(p̂_t) = μ / p (i.e. the method is unbiased).
If it is assumed that the estimators of demand size and demand interval are
independent, then

E(ẑ_t / p̂_t) = E(ẑ_t) E(1 / p̂_t)    (20.3)

but

E(1 / p̂_t) ≠ 1 / E(p̂_t)    (20.4)

and therefore Croston's method is biased. It is clear that this result does not depend
on Croston's assumptions of stationarity and geometrically distributed demand
intervals.
More recently, Boylan and Syntetos (2003), Syntetos and Boylan (2005) and
Shale et al. (2006) presented correction factors to overcome the bias associated with
Croston's approach. Some of these papers discuss: i) Croston's method under a
Poisson demand arrival process and ii) estimation of demand sizes and intervals
using an SMA (using the ratio of the former to the latter as an estimate of demand
per period). The correction factors are summarized in Table 20.1 (where k is the
length of the moving average and α is the smoothing constant for SES).
Table 20.1. Bias-correction factors

        Boylan and Syntetos (2003)    Shale et al. (2006)
SMA     k / (k + 1)                   (k − 1) / k
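As a sketch, the SMA-based estimates of size and interval can be combined and multiplied by a factor from Table 20.1. The pairing of each factor with its source follows the table; the demand sizes and intervals below are illustrative.

```python
def sma_corrected_forecast(sizes, intervals, k, source="boylan_syntetos"):
    """Bias-corrected demand-per-period estimate from k-period moving averages."""
    z_bar = sum(sizes[-k:]) / k        # SMA of the last k demand sizes
    p_bar = sum(intervals[-k:]) / k    # SMA of the last k inter-demand intervals
    factor = k / (k + 1) if source == "boylan_syntetos" else (k - 1) / k
    return factor * z_bar / p_bar

# Illustrative history: last four demand sizes and preceding intervals
print(sma_corrected_forecast([3, 4, 5, 4], [2, 3, 2, 3], k=4))
print(sma_corrected_forecast([3, 4, 5, 4], [2, 3, 2, 3], k=4, source="shale"))
```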
At this point it is important to note that SMA and SES are often treated as
equivalent when the average age of the data in the estimates is the same (Brown,
1963). A relationship links the number of points in an arithmetic average (k) with
the smoothing parameter of SES (α) for stationary demand. Hence it may be used
to relate the correction factors presented in Table 20.1 for each of the two demand
generation processes considered. The linking equation is

k = (2 − α) / α    (20.5)
Snyder (2002) pointed out that Croston's model assumes stationarity of demand
intervals and yet an SES estimator is used, implying a non-stationary demand
process. The same comment applies to demand sizes. Snyder commented that this
renders the model and method inconsistent; he proposed some alternative
models and suggested a new forecasting approach based on parametric bootstrapping.
Shenstone and Hyndman (2005) developed this work by examining Snyder's
models. In their paper they commented on the wide prediction intervals that arise
for non-stationary models and recommended that stationary models should be
reconsidered. However, they concluded: "... the possible models underlying Croston's
and related methods must be non-stationary and defined on a continuous sample
space. For Croston's original method, the sample space for the underlying model
included negative values. This is inconsistent with reality that demand is always
non-negative" (Shenstone and Hyndman 2005, pp 389–390).
In summary, any potential non-stationary model assumed to underlie
Croston's method must have properties that do not match the demand data being
modeled. Obviously, this does not mean that Croston's method and its variants are
not useful. Such methods constitute the current state of the art in intermittent
demand parametric forecasting. An interesting line of further research would be to
consider stationary models for intermittent demand forecasting rather than
restricting attention to models implying Croston's method. For example, Poisson
autoregressive models have been suggested as potentially useful by Shenstone and
Hyndman (2005).
σ̂_L = √L σ̂_t    (20.6)

σ̂_t = √(MSE_t)    (20.7)

σ̂_t ≈ 1.25 MAD_t    (20.8)
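A numerical sketch of Equations 20.6 to 20.8, assuming the lead-time error standard deviation scales with the square root of the lead time; the error values below are illustrative.

```python
import math

errors = [1.2, -0.8, 0.5, -1.5, 0.9]    # illustrative one-step forecast errors
n = len(errors)

mse = sum(e * e for e in errors) / n     # mean squared error
mad = sum(abs(e) for e in errors) / n    # mean absolute deviation

sigma_from_mse = math.sqrt(mse)          # Equation 20.7
sigma_from_mad = 1.25 * mad              # Equation 20.8 (approximation)

L = 4                                    # lead time, in periods
sigma_lead_time = math.sqrt(L) * sigma_from_mse   # Equation 20.6 scaling
print(sigma_from_mse, sigma_from_mad, sigma_lead_time)
```

As the surrounding discussion notes, this simple scaling ignores autocorrelation of forecast errors across the lead time, which is precisely the point of the Johnston and Harrison (1986) correction.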
When using SES, under the above model formulation, the correct calculation of
the standard deviation of the lead-time forecast error was derived by Johnston and
Harrison (1986).
Under the stationary mean model assumption (the demand level is assumed to
be constant) the forecast error correlation still exists because of the uncertainty
associated with the variance of the forecasts, which is carried forward from one
period to another. In addition, if a biased estimator is in place to forecast future
demand requirements, the auto-correlation can be also attributed to the bias. This
issue has been analytically addressed by Strijbosch et al. (2000) and Syntetos et al.
(2005).
1. Obtain historical demand data in chosen time buckets (e.g. days, weeks,
months)
2. Estimate transition probabilities for two-state (zero vs. non-zero) Markov
model
3. Conditional on last observed demand, use Markov model to generate a
sequence of zero/non-zero values over forecast horizon
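Steps 1 to 3 listed above can be sketched as follows. The full bootstrapping method of Willemain et al. also resamples and "jitters" the historical demand sizes, which is omitted here; this two-state estimation is a minimal illustration.

```python
import random

def transition_probs(demand):
    """P(non-zero next period | current state), states: zero / non-zero."""
    counts = {0: [0, 0], 1: [0, 0]}        # state -> [n transitions, n to non-zero]
    for a, b in zip(demand, demand[1:]):
        s = int(a > 0)
        counts[s][0] += 1
        counts[s][1] += int(b > 0)
    return {s: nz / n for s, (n, nz) in counts.items() if n > 0}

def simulate_occurrences(demand, horizon, rng=random):
    """Generate a zero/non-zero path conditional on the last observed state."""
    p = transition_probs(demand)
    state = int(demand[-1] > 0)
    path = []
    for _ in range(horizon):
        state = int(rng.random() < p.get(state, 0.0))
        path.append(state)
    return path

history = [0, 2, 0, 0, 3, 0, 4, 0, 0, 2]   # illustrative demand history
print(simulate_occurrences(history, horizon=6))
```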
Willemain et al. (2004, p 381) argued that "we need to assess the quality not
of a point forecast of the mean but of a forecast of the entire distribution", but they
conceded that it is impossible to compare this on an item-specific basis. Instead,
the authors recommended pooling percentile estimators across items and measur-
ing the conformance of the observations (expressed using the corresponding
percentiles) to a uniform distribution. The researchers claimed significant
improvements in forecasting accuracy achieved by using their approach over single
exponential smoothing and Croston's method. (Issues related to assessing forecasting
performance are further considered in the next section.)
Gardner and Koehler (2005) criticized this study in terms of its methodological
arrangements and experimental structure, pointing out that:
Willemain et al. did not use the correct lead time demand distribution for
either SES or Croston's method. This was a twofold criticism consisting of
arguments against the use of Equation 20.6 for estimating the lead-time
demand variance (please refer to Section 20.3.4) and the use of the normal
distribution for representing demand
They did not consider published modifications to Croston's method, such as
the estimator proposed by Syntetos and Boylan (2005)
Further empirical evidence is required in order to develop our understanding of
the benefits offered by such a non-parametric approach. In particular, a comparison
of the recently developed adaptations of Croston's method (see Table 20.1), in
conjunction with an appropriate distribution, with the bootstrapping approach
should prove to be beneficial from both theoretical and practitioner perspectives.
argue that slow-moving service parts should attract a higher percentage cost, since
these parts are at the highest risk of obsolescence.
Service level is generally interpreted as "off the shelf" availability, but the way
in which it is measured varies. Three common measures are defined as follows
(Silver et al. 1998):
The fraction of replenishment cycles in which total demand can be
delivered from stock (known as P1). This is equivalent to a specified prob-
ability of no stock-outs during a replenishment cycle.
The fraction of total demand that can be delivered from stock (known as
P2). This is also called the fill rate.
The fraction of time during which there is stock on the shelf (known as P3).
This is sometimes used when equipment is needed for emergency purposes.
The fill rate is probably the measure with the greatest appeal to practitioners,
since it relates most directly to customer satisfaction. Care is needed with its appli-
cation, since different results are obtained if it is calculated over a lead-time or
over all time. If unsatisfied demand is back-ordered, Brown (1967) showed that
P₂ = 1 − (LS / Q)(1 − P₂^LT)    (20.10)

where P₂^LT is the measure over lead-time, P₂ is the measure over all time, L is the
lead-time, S is total demand in a year, Q is the order-quantity and it is assumed that
LS < Q.
Ronen (1982) showed that, if unsatisfied demand is lost, then

P₂ = 1 / [ (LS / Q)(1 − P₂^LT) + 1 ]    (20.11)
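The two conversions can be sketched as follows, under the stated assumption LS < Q; the numerical inputs are illustrative.

```python
def fill_rate_backorder(p2_lt, L, S, Q):
    """All-time fill rate when unsatisfied demand is back-ordered (Eq. 20.10)."""
    return 1 - (L * S / Q) * (1 - p2_lt)

def fill_rate_lost_sales(p2_lt, L, S, Q):
    """All-time fill rate when unsatisfied demand is lost (Eq. 20.11)."""
    return 1 / ((L * S / Q) * (1 - p2_lt) + 1)

# Lead-time fill rate 90%, lead time 0.1 years, annual demand 200, Q = 50:
# LS = 20 < Q, as the formulas assume.
print(fill_rate_backorder(0.90, 0.1, 200, 50))
print(fill_rate_lost_sales(0.90, 0.1, 200, 50))
```

For the same lead-time fill rate, the lost-sales measure is slightly higher, since lost demand does not accumulate as backorders.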
These measures are based on the fraction of units satisfied from stock. Some
organizations also use measures that relate to the successful completion of an
order-line for a number of units of the same SKU. Typically, these are based on
the fraction of order-lines completely satisfied (partial satisfaction does not count).
Boylan and Johnston (1994) identified relationships between such measures and
fill-rates.
In addition to these standard measures, other suggestions have been made.
Gardner (1990) recommended the use of trade-off curves, showing the effect of
inventory investment on the average delay in filling backorders. Separate curves
are drawn for each forecasting method, allowing the manager to see at a glance if
one method dominates the others.
Sani and Kingsman (1997) proposed the use of average regret measures. The
service regret is the amount each method falls short of the maximum service level
over all methods for that SKU. (The method may be a forecasting or inventory
method.) The regret is then divided by the maximum service level and the ratios
are averaged across all SKUs. A cost regret measure is defined similarly. This
approach allows more detailed assessments of the interaction between forecasting
and inventory methods. Eaves and Kingsman (2004) suggested assessment of fore-
cast performance according to implied stock-holdings. These are based on a calcu-
lation of the exact safety margin providing a maximum stock-out of zero. The
advantage of this approach is that it gives monetary values of stock-savings. How-
ever, these savings may not be achieved in practice using a standard stock control
method based on the mean and variance of lead-time forecasts.
Whilst it is essential to assess stock-holding costs and service levels, it is also
important to be able to diagnose the reasons for any deterioration in these measures.
Boylan and Syntetos (2006) argue that, since this may arise as a result of forecasting
methods or inventory rules (see Figure 20.5), the accuracy of forecasting methods
should also be monitored.
[Figure 20.5. The forecasting method feeds the stock management rules, which together determine stock-holding costs]
Two commonly used accuracy measures are the mean absolute percentage error
(MAPE) and the symmetric mean absolute percentage error (sMAPE):

MAPE = (1/n) Σ_{t=1}^{n} ( |Y_t − Ŷ_t| / Y_t ) × 100    (20.12)

sMAPE = (1/n) Σ_{t=1}^{n} ( |Y_t − Ŷ_t| / ( |Y_t + Ŷ_t| / 2 ) ) × 100    (20.13)
The mean error (ME) is defined as follows:

ME = (1/n) Σ_{t=1}^{n} ( Y_t − Ŷ_t )    (20.14)
This measure is simple to interpret: if it is close to zero, then the forecast method
is unbiased; negative values indicate that forecasts are consistently too high, while
positive values show that forecasts are too low. Its use is recommended for inter-
mittent service parts.
If a forecast method has high forecast errors, but is approximately unbiased, then
the positive and negative errors cancel one another out, yielding a mean error close
to zero. To capture the degree of error, regardless of sign, other error measures are
required. The mean absolute error (MAE) is often used for an individual SKU and is
defined as follows:
MAE = (1/n) Σ_{t=1}^{n} |Y_t − Ŷ_t|    (20.15)
This error measure should not be averaged over a whole set of parts, since it
may be dominated by a few SKUs with large errors. To avoid this problem, four
alternatives have been suggested: the MAE:Mean ratio, the geometric mean
absolute error, the percentage better measure and the mean absolute scaled error.
Each of these measures will be reviewed in turn.
Hoover (2006) proposed the application of the MAE:Mean ratio for intermittent
demand:

MAE:Mean = ( (1/n) Σ_{t=1}^{n} |Y_t − Ŷ_t| ) / ( (1/n) Σ_{t=1}^{n} Y_t ) = Σ_{t=1}^{n} |Y_t − Ŷ_t| / Σ_{t=1}^{n} Y_t    (20.16)
cism stands. Therefore, the MAE:Mean ratio can be recommended for non-trended
intermittent service parts.
A second alternative to the MAE is the geometric mean absolute error (GMAE),
defined below for a single series:

GMAE = ( Π_{t=1}^{n} |Y_t − Ŷ_t| )^{1/n}    (20.17)
This can be generalized across series by taking the geometric mean again to
obtain the geometric mean (across series) of the geometric mean (across time) of
the absolute errors (GMGMAE):
GMGMAE = ( Π_{i=1}^{N} ( Π_{t=1}^{n} |Y_{it} − Ŷ_{it}| )^{1/n} )^{1/N}    (20.18)
where Y_{it} is the observation for the i-th SKU at time t, Ŷ_{it} is the forecast of
demand for the i-th SKU at time t, and N is the number of SKUs.
An outlying observation, producing a large error by any statistical method, will
affect the GMAE similarly for all methods, and so the ratio of the GMAE for one
method to another will be robust to outliers. (The same robustness property applies
to the GMGMAE). This was first shown by Fildes (1992), using a general argu-
ment, and applied to intermittent data by Syntetos and Boylan (2005). In fact, these
authors used a slightly more complex measure, the geometric root mean square
error (GRMSE); however, Hyndman (2006) pointed out that the GRMSE and the
GMAE are identical.
Although the measure is robust to outliers, it is sensitive to zero errors (Boylan
and Syntetos 2006). Just one exact forecast will yield a zero error and a zero
GMAE, regardless of the size of the other errors. This problem may be overcome,
for stationary errors, by using the geometric mean (across series) of the arithmetic
mean (across time) of the absolute errors (GMAMAE):
\mathrm{GMAMAE} = \left[ \prod_{i=1}^{N} \left( \frac{1}{n_i} \sum_{t=1}^{n_i} |Y_{it} - \hat{Y}_{it}| \right) \right]^{1/N} \qquad (20.19)
This measure collapses to zero only if a series has zero forecast errors for all
periods of time, and so is more robust to zero errors than the GMGMAE. The
measure is also robust to occasional large forecast errors, provided the remaining
errors are stable, and are not unduly affected by trend or seasonality. It can there-
fore be recommended for application in these cases.
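The geometric measures of Equations 20.17–20.19 are easy to sketch; this illustrative version (using `math.prod`, Python 3.8+) takes pre-computed absolute errors as input, and the test case below reproduces the zero-error behaviour discussed above: one exact forecast collapses the GMGMAE but not the GMAMAE.

```python
# Sketch of the geometric-mean error measures (Equations 20.17-20.19);
# errors_by_sku[i] is the list of absolute forecast errors for SKU i.
import math

def gmae(abs_errors):
    """Geometric mean of absolute errors for one series (Equation 20.17)."""
    return math.prod(abs_errors) ** (1.0 / len(abs_errors))

def gmgmae(errors_by_sku):
    """Geometric mean across SKUs of each SKU's GMAE (Equation 20.18)."""
    N = len(errors_by_sku)
    return math.prod(gmae(e) for e in errors_by_sku) ** (1.0 / N)

def gmamae(errors_by_sku):
    """Geometric mean across SKUs of each SKU's arithmetic MAE (Equation 20.19);
    a single zero error no longer collapses the whole measure to zero."""
    N = len(errors_by_sku)
    return math.prod(sum(e) / len(e) for e in errors_by_sku) ** (1.0 / N)
```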
Another approach, which is simple to use and interpret, is the percentage better
method. According to this approach, for each service part, one forecast method is
compared to another according to a criterion such as mean error or geometric root
mean square error (Syntetos and Boylan 2005). The percentage better shows the
percentage of series for which one method has the lower error. This approach is
robust to large forecast errors and the results can be subjected to formal statistical
tests (Syntetos 2001). It is a useful measure, although it does not quantify the
degree of improvement in forecast error.
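The percentage-better comparison can be sketched as follows, assuming each method has already been reduced to one error value per series under the chosen criterion (for example ME or GRMSE); the function name is illustrative.

```python
# Sketch of the percentage-better comparison between two forecasting methods:
# errors_a[i] and errors_b[i] are the per-series error values for series i.

def percentage_better(errors_a, errors_b):
    """Share of series (in %) for which method A has strictly lower error than B."""
    wins = sum(1 for ea, eb in zip(errors_a, errors_b) if ea < eb)
    return 100.0 * wins / len(errors_a)
```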
Hyndman (2006) recently suggested a new error measure for intermittent demand. This measure, known as the mean absolute scaled error (MASE), scales each error by the in-sample MAE of the naïve forecasting method (i.e. the forecast for the next period is this period's observation). The measure is robust to outliers, and is valid for all non-constant series.
Hyndman (2006) gave an example of the application of the MASE on intermittent data from a major Australian lubricant manufacturer. He compared the out-of-sample MASE of four methods: naïve, overall mean, single exponential smoothing and Croston's method. The naïve method had the lowest MASE. This result is valid statistically, but is counter-intuitive from an inventory-management perspective. Boylan and Syntetos (2006) commented that the naïve method is sensitive to large demands and will generate high forecasts. Its use will almost
certainly lead to over-stocking and possibly to obsolescence. This example high-
lights the danger of relying on statistical error measures alone. As noted earlier in
this section, attention should always be paid to the stock-holding and service
implications of different forecasting methods. Improvements in forecasting accu-
racy do not necessarily translate into improved stock-control performance. How-
ever, if stock-control performance has deteriorated, then forecast error measures
can be used to diagnose problems with forecasting methods, and to suggest alter-
natives.
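Hyndman's scaled measure can be sketched as follows; for simplicity this illustrative version scales in-sample errors by the in-sample MAE of the naïve method, whereas a full evaluation would scale out-of-sample errors, and the function name is an assumption.

```python
# Sketch of the mean absolute scaled error (MASE): forecast errors are divided
# by the in-sample MAE of the naive method (forecast = previous observation).

def mase(y, f):
    """MASE for one series; assumes the series is not constant."""
    n = len(y)
    naive_mae = sum(abs(y[t] - y[t - 1]) for t in range(1, n)) / (n - 1)
    return sum(abs(yt - ft) for yt, ft in zip(y, f)) / n / naive_mae
```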
Kwan (1991) conducted research to identify the theoretical distributions that best fit
the empirical distributions of demand sizes, inter-demand intervals and demand per
unit time period for low demand items. Regarding inter-demand intervals, both the
500 J. Boylan and A. Syntetos
geometric and the negative exponential distribution were found to provide a good fit
to the demand patterns observed. The geometric distribution was also found to be a
reasonable approximation to the distribution of inter-demand intervals, for real
demand data, by Dunsmuir and Snyder (1989) and Willemain et al. (1994). Janssen
(1998) tested the Bernoulli demand generation process on a set of empirical data
obtained from a Dutch wholesaler of fasteners. The results indicated that the
Bernoulli demand generation process is a reasonable approximation for intermittent
demand processes. Finally, Eaves (2002) examined the demand patterns associated
with 6795 service parts from the Royal Air Force (UK). The findings of this detailed
study provide support for both Poisson and Bernoulli processes. In particular, the
geometric distribution was found to provide a statistically significant fit (5% significance level) to 91% of his sample whereas the negative exponential distribution
fitted 88% of the demand histories examined.
Kwan (1991) tested the empirical fit of the log-zero-Poisson (lzP) and negative
binomial (NBD), amongst other possible underlying demand distributions. The
NBD was found to be the best, fitting 90% of the SKUs. Boylan (1997) tested the
goodness-of-fit of four demand distributions (NBD, lzP, condensed negative bino-
mial distribution (CNBD) and gamma distribution) on real demand data. The CNBD
arises if we consider a condensed Poisson incidence distribution (censored Poisson
process in which only every second event is recorded) assuming that the mean rate
of demand incidence is not constant, but varies according to a gamma distribution.
The empirical sample used for testing goodness-of-fit contained the six months
histories of 230 SKUs, demand being recorded weekly. The analysis showed strong
support for the NBD. The results for the gamma distribution were also encouraging,
although not as good, for slow moving SKUs, as the NBD.
Willemain et al. (1994) compared SES and Croston's method on both theoretically generated and empirical intermittent demand data (54 series). They concluded that "Croston's method is robustly superior to exponential smoothing" and can provide "tangible benefits to stockists dealing with intermittent demand". A very important
feature of their research, though, was the fact that industrial results showed very
modest benefits as compared with the simulation results.
Sani and Kingsman (1997) compared the performance (service level and in-
ventory costs) of various empirical and theoretically proposed stock control poli-
cies for low demand items as well as that of various forecasting methods (SMA,
SES, Croston) on 30 service parts. Their results indicated: i) the very good overall
performance of SMA; ii) the fact that stock control policies that have been
developed in conjunction with specific distributional assumptions such as the
power approximation (explicitly built upon the assumption of a compound Poisson underlying demand pattern; please also refer to Section 20.3) perform particularly
well.
Willemain et al. (2004) assessed the forecast accuracy of SES, Croston's method (both in conjunction with a hypothesised normal distribution) and the non-
parametric approach that they proposed (please refer to Section 20.4) on 28,000
service inventory items. They concluded that the bootstrap method was the most
accurate forecasting method and that Croston's method had no significant advantage over SES. As discussed in Section 20.4, some reservations have been expressed regarding the study's methodology. Nevertheless, the bootstrapping
approach is intuitively appealing for very lumpy demand items. More empirical
studies are needed to substantiate its forecast accuracy in comparison with other
methods.
Syntetos and Boylan (2005) conducted an empirical investigation to compare
the forecast accuracy of SMA, SES, Croston's method and a bias-corrected adaptation of Croston's estimator (termed the Syntetos-Boylan approximation,
SBA; please refer to Table 20.1). The forecast accuracy of these methods was
tested, using a wide range of forecast accuracy metrics, on 3000 service parts from
the automotive industry. The results demonstrated quite conclusively the superior
forecasting performance of the SBA method. In a later project, Syntetos and
Boylan (2006) assessed the empirical stock control implications of the same esti-
mators on the same 3000 SKUs. The results demonstrated that the increased
forecast accuracy achieved by using the SBA method (also known as the Approxi-
mation Method) is translated to a better stock control performance (service level
achieved and stock volume differences). A similar finding was reported in an
earlier research project conducted by Eaves and Kingsman (2004). They compared
the empirical stock control performance (implied stock holdings given a specified
service level) of the above discussed estimators on 18750 service parts from the
Royal Air Force (UK). They concluded that "the best forecasting method for a spare parts inventory is deemed to be the approximation method" (Eaves and Kingsman 2004, p. 436).
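Croston's method and the SBA can be sketched as follows: demand sizes and inter-demand intervals are smoothed separately, and the SBA deflates Croston's ratio by the factor 1 − α/2 to correct its bias (Syntetos and Boylan 2005). The implementation details here, such as initialising on the first demand occurrence and using one smoothing constant for both series, are illustrative assumptions.

```python
# Sketch of Croston's method and the Syntetos-Boylan approximation (SBA)
# for an intermittent demand series (a sequence of per-period demands).

def croston(demand, alpha=0.1, sba=False):
    """Final per-period demand estimate; assumes at least one non-zero demand."""
    z = p = None          # smoothed demand size and smoothed inter-demand interval
    q = 1                 # periods since the last non-zero demand
    for d in demand:
        if d > 0:
            if z is None:                 # initialise on the first demand occurrence
                z, p = float(d), float(q)
            else:
                z += alpha * (d - z)      # update size estimate
                p += alpha * (q - p)      # update interval estimate
            q = 1
        else:
            q += 1
    forecast = z / p
    return forecast * (1 - alpha / 2) if sba else forecast
```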
20.7 Conclusions
Service parts, particularly those subject to corrective maintenance, present a
considerable challenge for both forecasting and inventory management. If stocking
decisions are made injudiciously, then the result will be poor service or excessive
stock-holdings, possibly leading to obsolescence. Conversely, effective forecasting
and stock control will lead to cost savings and improved customer service.
A number of stock-control methods may be employed for slow-moving service
parts. Sani and Kingsman (1997) recommended the (R, s, S) policy, based on its
inventory cost and service performance in an empirical study. However, empirical
evidence is not extensive, and further research is needed in this area.
Classification of service parts is an essential element in their management. Four
purposes are served by classification:
Determination of service targets
Establishment of inventory decisions
Choice of forecasting approach
Choice of forecasting method
20.8 References
Bartezzaghi E, Verganti R, Zotteri G, (1996) A framework for managing uncertain lumpy demand. Paper presented at the 9th International Symposium on Inventories, Budapest, Hungary
Blischke WR, Murthy DNP, (1994) Warranty cost analysis. Marcel Dekker, Inc., New York
Boylan JE, (1997) The centralisation of inventory and the modelling of demand. Unpublished Ph.D. thesis, University of Warwick, UK
Boylan JE, Johnston FR, (1994) Relationships between service level measures for inventory systems. Journal of the Operational Research Society 45: 838–844
Boylan JE, Syntetos AA, (2003) Intermittent demand forecasting: size-interval methods based on average and smoothing. Proceedings of the International Conference on Quantitative Methods in Industry and Commerce, Athens, Greece
Boylan JE, Syntetos AA, (2006) Accuracy and accuracy-implication metrics for intermittent demand. Foresight: the International Journal of Applied Forecasting 4: 39–42
Boylan JE, Syntetos AA, Karakostas GC, (2006) Classification for forecasting and stock-control: a case-study. Journal of the Operational Research Society: in press
Brown RG, (1963) Smoothing, forecasting and prediction of discrete time series. Prentice-Hall, Inc., Englewood Cliffs, N.J.
Brown RG, (1967) Decision rules for inventory management. Holt, Reinhart and Winston, Chicago
Burgin TA, (1975) The gamma distribution and inventory control. Operational Research Quarterly 26: 507–525
Burgin TA, Wild AR, (1967) Stock control experience and usable theory. Operational Research Quarterly 18: 35–52
Croston JD, (1972) Forecasting and stock control for intermittent demands. Operational Research Quarterly 23: 289–304
Croston JD, (1974) Stock levels for slow-moving items. Operational Research Quarterly 25: 123–130
Department of Defense USA, (1980) Procedures for performing a Failure Mode, Effects and Criticality Analysis. MIL-STD-1629A
Dunsmuir WTM, Snyder RD, (1989) Control of inventories with intermittent demand. European Journal of Operational Research 40: 16–21
Eaves AHC, (2002) Forecasting for the ordering and stock holding of consumable spare parts. Unpublished Ph.D. thesis, Lancaster University, UK
Eaves A, Kingsman BG, (2004) Forecasting for ordering and stock holding of spare parts. Journal of the Operational Research Society 55: 431–437
Ehrhardt R, Mosier C, (1984) A revision of the power approximation for computing (s, S) inventory policies. Management Science 30: 618–622
Fildes R, (1992) The evaluation of extrapolative forecasting methods. International Journal of Forecasting 8: 81–98
Fortuin L, (1980) The all-time requirements of spare parts for service after sales – theoretical analysis and practical results. International Journal of Operations and Production Management 1: 59–69
Fortuin L, Martin H, (1999) Control of service parts. International Journal of Operations and Production Management 19: 950–971
Friend JK, (1960) Stock control with random opportunities for replenishment. Operational Research Quarterly 11: 130–136
Gallagher DJ, (1969) Two periodic review inventory models with backorders and stuttering Poisson demands. AIIE Transactions 1: 164–171
Gardner ES, (1990) Evaluating forecast performance in an inventory control system. Management Science 36: 490–499
Gardner ES, Koehler AB, (2005) Correspondence: Comments on a patented bootstrapping method for forecasting intermittent demand. International Journal of Forecasting 21: 617–618
Ghobbar AA, Friend CH, (2002) Sources of intermittent demand for aircraft spare parts within airline operations. Journal of Air Transport Management 8: 221–231
Ghobbar AA, Friend CH, (2003) Evaluation of forecasting methods for intermittent parts demand in the field of aviation: a predictive model. Computers and Operations Research 30: 2097–2114
Hoover J, (2006) Measuring forecast accuracy: omissions in today's forecasting engines and demand-planning software. Foresight: the International Journal of Applied Forecasting 4: 32–35
Hua ZS, Zhang B, Yang J, Tan DS, (2006) A new approach of forecasting intermittent demand for spare parts inventories in the process industries. Journal of the Operational Research Society: in press
Hyndman RJ, (2006) Another look at forecast-accuracy metrics for intermittent demand. Foresight: the International Journal of Applied Forecasting 4: 43–46
Janssen FBSLP, (1998) Inventory management systems; control and information issues. Published Ph.D. thesis, Centre for Economic Research, Tilburg University, The Netherlands
Johnston FR, (1980) An interactive stock control system with a strategic management role. Journal of the Operational Research Society 31: 1069–1084
Johnston FR, Boylan JE, (1996) Forecasting for items with intermittent demand. Journal of the Operational Research Society 47: 113–121
Johnston FR, Harrison PJ, (1986) The variance of lead-time demand. Journal of the Operational Research Society 37: 303–308
Kalchschmidt M, Verganti R, Zotteri G, (2006) Forecasting demand from heterogeneous customers. International Journal of Operations and Production Management 26: 619–638
Kwan HW, (1991) On the demand distributions of slow moving items. Unpublished Ph.D. thesis, Lancaster University, UK
Makridakis S, (1993) Accuracy measures: theoretical and practical concerns. International Journal of Forecasting 9: 527–529
Makridakis S, Hibon M, (2000) The M3-Competition: results, conclusions and implications. International Journal of Forecasting 16: 451–476
Naddor E, (1975) Optimal and heuristic decisions in single and multi-item inventory systems. Management Science 21: 1234–1249
Quenouille MH, (1949) A relation between the logarithmic, Poisson and negative binomial series. Biometrics 5: 162–164
Rao AV, (1973) A comment on: Forecasting and stock control for intermittent demands. Operational Research Quarterly 24: 639–640
Ritchie E, Kingsman BG, (1985) Setting stock levels for wholesaling: performance measures and conflict of objectives between supplier and stockist. European Journal of Operational Research 20: 17–24
Ronen D, (1982) Measures of product availability. Journal of Business Logistics 3: 45–58
Sani B, Kingsman BG, (1997) Selecting the best periodic inventory control and demand forecasting methods for low demand items. Journal of the Operational Research Society 48: 700–713
Shale EA, Boylan JE, Johnston FR, (2006) Forecasting for intermittent demand: the estimation of an unbiased average. Journal of the Operational Research Society 57: 588–592
Shenstone L, Hyndman RJ, (2005) Stochastic models underlying Croston's method for intermittent demand forecasting. Journal of Forecasting 24: 389–402
Silver EA, (1970) Some ideas related to the inventory control of items having erratic demand patterns. CORS Journal 8: 87–100
Silver EA, Pyke DF, Peterson R, (1998) Inventory management and production planning and scheduling (3rd edition). John Wiley & Sons, New York
Snyder R, (2002) Forecasting sales of slow and fast moving inventories. European Journal of Operational Research 140: 684–699
Strijbosch LWG, Heuts RMJ, van der Schoot EHM, (2000) A combined forecast-inventory control procedure for spare parts. Journal of the Operational Research Society 51: 1184–1192
Syntetos AA, (2001) Forecasting of intermittent demand. Unpublished Ph.D. thesis, Buckinghamshire Chilterns University College, Brunel University, UK
Syntetos AA, Boylan JE, (2001) On the bias of intermittent demand estimates. International Journal of Production Economics 71: 457–466
Syntetos AA, Boylan JE, (2005) The accuracy of intermittent demand estimates. International Journal of Forecasting 21: 303–314
Syntetos AA, Boylan JE, Croston JD, (2005) On the categorization of demand patterns. Journal of the Operational Research Society 56: 495–503
Syntetos AA, Boylan JE, (2006) On the stock control performance of intermittent demand estimators. International Journal of Production Economics 103: 36–47
Teunter RH, (1998) Inventory control of service parts in the final phase. Published Ph.D. thesis, University of Groningen, The Netherlands
Teunter RH, Fortuin L, (1998) End-of-life-service: a case-study. European Journal of Operational Research 107: 19–34
Vereecke A, Verstraeten P, (1994) An inventory management model for an inventory consisting of lumpy items, slow movers and fast movers. International Journal of Production Economics 35: 379–389
Ward JB, (1978) Determining re-order points when demand is lumpy. Management Science 24: 623–632
Watson RB, (1987) The effects of demand-forecast fluctuations on customer service and inventory cost when demand is lumpy. Journal of the Operational Research Society 38: 75–82
Willemain TR, Smart CN, Shockor JH, DeSautels PA, (1994) Forecasting intermittent demand in manufacturing: a comparative evaluation of Croston's method. International Journal of Forecasting 10: 529–538
Willemain TR, Smart CN, Schwarz HF, (2004) A new approach to forecasting intermittent demand for service parts inventories. International Journal of Forecasting 20: 375–387
Williams TM, (1984) Stock control with sporadic and slow-moving demand. Journal of the Operational Research Society 35: 939–948
Part F
Maintenance in the Rail Industry
Jørn Vatn
21.1 Introduction
This chapter presents two case studies of maintenance optimization in the rail
industry. The first case study discusses grouping of maintenance activities into
maintenance packages. The second case study uses a life cycle cost approach to
prioritize between maintenance and renewal projects under budget constraints.
Grouping of maintenance activities into maintenance packages is an important
issue in maintenance planning and optimization. This grouping is important both
from an economic point of view in terms of minimization of set-up costs, and also
with respect to obtaining administratively manageable solutions. If several maintenance activities can be specified as one work-order in the computerized maintenance management system, we have fewer work-orders to administer. The
maintenance intervals are usually determined by considering the various compo-
nents or activities separately, and then the activities are grouped into maintenance
packages. By executing several activities at the same time, the set-up costs may be
shared by several activities. However, this will require that we have to shift the
intervals for the individual activities. If we try to put too many activities into the
same group, the gain with respect to set-up costs may be dominated by the costs of
changing the intervals for the individual activities. The case study we present for
maintenance grouping is related to train maintenance, and especially we focus on
activities related to components in the bogie.
Another problem most industries are facing is the limited resources available
for maintenance and renewal, implying that optimization has to be conducted under
budget constraints. Two main questions should then be addressed: first, whether the budget constraints should be relaxed by putting more resources into maintenance and renewal when there are more good projects than resources; second, how to prioritize given the budget constraints. In the case study we present an approach to cost-benefit analysis of the
various projects. This gives a ranked list of projects to consider for execution. The
510 J. Vatn
proposed method has been implemented by the Norwegian National Rail Admini-
stration (JBV), responsible for the Norwegian railway net.
Section 21.2 presents some general information about rail maintenance in
Norway as a basis for the two case studies. The first case study in Section 21.3
discusses grouping of maintenance activities into maintenance packages. The
second case study in Section 21.4 uses a life cycle cost approach to prioritize be-
tween maintenance and renewal projects under budget constraints.
are Carretero et al. (2003), Zoeteman (2003), Veit and Wogowitsch (2003), Vatn et al. (2003), Zarembski and Palese (2003), Pedregal et al. (2004), Meier-Hirmer et al. (2005), Budai et al. (2005) and Reddy et al. (2006). Railway research related
to maintenance is, however, dominated by wear modelling. Especially wheel-rail
wear models and track degradation models are important because the major main-
tenance and renewal costs of a railway line are due to track components. Some
important references are Bing and Gross (1983), Li and Selig (1995), Sato (1995), Bogdanski et al. (1996), Ferreira and Murray (1997), Zhang et al. (1997), Kay (1998), Zakharov et al. (1998), Salim (2004), Telliskivi and Olofsson (2004),
Grassie (2005) and Braghin et al. (2006). A complete survey of reported models is
beyond the scope of this chapter.
Rolling stock maintenance is characterized by the fact that the trains have to be
taken out of service while they are maintained in a maintenance depot. This creates many challenges in scheduling the train services while taking the need for maintenance into account. The scheduling problem is not considered here, and we
only present a rather simple model for grouping of some maintenance activities
assuming that we have access to the train whenever we want. Sriskandarajah et al.
(1998) present a methodology utilizing genetic algorithms on a much more com-
plex situation within train maintenance scheduling. In our example we only con-
sider the following cost elements:
Man-hour costs and material costs related to preventive maintenance of
each component.
Set-up costs to get access to the components to be maintained, and by
paying the set-up costs access to several components is obtained.
Costs of taking the train out of service. These costs are included in the set-
up costs from a modelling point of view.
Man-hour costs and material costs related to corrective maintenance. Typically set-up costs cannot be shared by other components unless preventive maintenance is advanced (opportunity maintenance).
Costs related to the effect of a failure, i.e. punctuality, safety and material
damage costs.
In classical maintenance optimization the objective is to find the optimum
frequency of maintenance of one component at a time. However, in the multi-
component situation there exist dependencies between the components, e.g. they
may share common set-up costs (economy of scope), the costs may be reduced if
the contract to a maintenance contractor is huge (economy of scale), etc. This will
complicate the modelling from the single component approach, e.g. see Dekker et
al. (1997) for a survey of models used in the multi-component situation. In this
chapter we only consider the situation where we can save some set-up costs by
executing several maintenance activities at the same time.
We often distinguish between the static and the dynamic planning regimes. In
the static regime the grouping is fixed during the entire system lifetime, whereas in
the dynamic regime the groups are re-established over and over again. The static
grouping situation may be easier to implement than the dynamic, and the main-
tenance effort is constant, or at least predictable. The advantage of the dynamic
grouping is that new information, unforeseen events, etc., may require a new
grouping and changing of plans. For an introduction to maintenance grouping we
refer to Wildeman (1996) who discusses these different regimes in detail. In the
example that follows we illustrate some aspects of dynamic grouping related to
maintenance activities on a train bogie.
The trains are regularly taken out of service and sent to the maintenance depot for
execution of maintenance. Several subsystems are maintained at the same time,
and this makes the definition of set-up costs rather complicated when we develop
grouping strategies. In principle, some of the set-up costs are related to the fact that
the train is sent to the depot for maintenance, whereas some other parts of the set-
up costs are specific for one subsystem. In the following, we will simplify and only
consider costs related to the bogie, i.e. we assume one fixed set-up costs related to
the bogie. We also assume that the train is available at the maintenance depot at
any time. This is also a simplification, since each train follows a schedule, and can
only enter the maintenance depot at some of the end stations for the different
services. In order to get access to the various components in the bogie some dis-
assembling is required before maintenance can be executed, and also some re-
assembling is required after execution of maintenance. The costs of disassembling
and re-assembling are here included in the set-up cost. In the model presented we
also assume that the set-up costs are the same for all activities. It is further assumed
that there is one and only one maintenance activity related to each component. This
simplifies notation because we then may alternate between failure of component i
and executing maintenance activity i where there is a unique relation between
component and activity. The basic notation to be used is as follows.
Notation
c_i^P     Planned maintenance cost, exclusive of set-up cost. Typically the cost of replacing one unit periodically.
c_i^U     Unplanned costs upon a failure. These costs include the corrective maintenance costs, safety costs, punctuality costs, and costs due to material damage.
S         Set-up costs, i.e. the costs of preparing the preventive maintenance of a group of components maintained at the same time. We assume the same set-up costs for all activities.
λ_E,i(x)  Effective failure rate for component i when maintained at intervals of length x.
M_i(x)    M_i(x) = x · c_i^U · λ_E,i(x) = expected costs due to failures in a period [0, x) for a component maintained at time 0, exclusive of planned maintenance cost.
Maintenance in the Rail Industry 513
φ_i(x, k)  φ_i(x, k) = [c_i^P + S/k + M_i(x)]/x = average costs per unit time if x is the length of the interval between planned maintenance, and the set-up costs are shared by totally k activities.
φ*_i,k    The minimum value of φ_i(x, k), i.e. minimization over x.
x*_i,k    The x-value that minimizes φ_i(x, k).
k_i,Av    Average number of components sharing the set-up costs for the i-th component, i.e. the i-th component is on average maintained together with k_i,Av − 1 other components.
φ*_i,Av   Average minimum costs per unit time over all k-values.
x*_i,Av   Optimum value of x_i over all k-values. x*_i,Av is measured in million kilometres since last maintenance on component i.
t_0       Point of time when we are planning the next group of activities. Initially t_0 = 0. t_0 is measured in running (million) kilometres since t = 0.
x_i       Age of component i at time t_0, i.e. time since preventive maintenance.
t*_i,Av   t*_i,Av = t_0 + x*_i,Av − x_i = optimum time in running (million) kilometres.
K_k       Candidate group, i.e. the set of the first k components to be maintained according to individual schedule with t*_i,Av as the basis for due time.
N         Number of activities/components.
T         End of planning horizon, i.e. we are planning from t_0 = 0 to T.
If the grouping was fixed, i.e. static grouping, the optimization problem would just be to minimize Σ_i φ_i(x, k) for all k components maintained at the same time.
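As an illustrative numerical sketch (not from the chapter), the single-component optimum can be found by grid search over the interval length x, using the average cost per unit time [c_i^P + S/k + M_i(x)]/x defined above and assuming a power-law unplanned-cost model M(x) = c_U (x/η)^β for the expected failure costs.

```python
# Sketch of minimizing phi(x, k) = [c_P + S/k + M(x)] / x over a grid of
# candidate interval lengths, with the illustrative assumption
# M(x) = c_u * (x / eta)**beta for the expected unplanned costs in [0, x).

def optimal_interval(c_p, c_u, S, k, eta, beta, x_grid):
    def phi(x):
        m = c_u * (x / eta) ** beta       # expected unplanned costs in [0, x)
        return (c_p + S / k + m) / x      # average cost per unit time
    best_x = min(x_grid, key=phi)
    return best_x, phi(best_x)
```

In practice one would solve dφ/dx = 0 analytically where the failure model allows it; the grid search simply makes the trade-off between set-up sharing and interval stretching visible.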
Static grouping will not be discussed, but we present an approach for dynamic
grouping. Mathematically, the challenge now is to establish the grouping either in a
finite or infinite time horizon. In addition to the grouping, we also have to schedule
the execution time for each group (maintenance package). The grouping and the
scheduling cannot be done separately. Generally, such optimization problems are
NP-hard (see Garey and Johnson 1979 for a definition), and heuristics are required. Before we propose our heuristic we present some motivating results.
Let φ*_i,k be the minimum average costs when one component is considered individually, and let x*_i,k be the corresponding optimum x value. It is then easy to prove that m_i(x*_i,k) = M_i′(x*_i,k) = φ*_i,k, meaning that when the instantaneous expected unplanned costs per unit time, m_i(x), exceed the average costs per unit time, maintenance should be carried out. The way to use the result is now the following.
Assume we are going to determine the first point of time to execute the maintenance, i.e. to find t = x*_i,k starting at t = 0. Further, assume that we know the average costs per unit time (φ*_i,k) but that we have for some reason lost or forgotten the value of x*_i,k. What we can then do is to find t such that m_i(t) = M_i′(t) = φ*_i,k, yielding the first point of time for maintenance. Then from time t and over the remaining planning horizon we can pay φ*_i,k as the minimum average costs per unit time. This is the traditional marginal costs approach to the problem, and brings the
same result as minimizing Equation 21.1. The advantage of the marginal thinking
is that we are now able to cope with the dynamic grouping. Assume that the time
now is t0, and xi is the age (time since last maintenance) for component i in the
group we are considering for the next execution of maintenance. Further, assume
that the planning horizon is [t_0, T). The problem now is to determine the point of time t (≥ t_0) when the next maintenance is to be executed. The total costs of executing the maintenance activities in a group is S + Σ_i c_i^P, which we pay at time t. Further, the expected unplanned costs in the period [t_0, t) are Σ_i M_i(t − t_0 + x_i) − Σ_i M_i(x_i). For the remaining time of the planning horizon the total costs are (T − t) Σ_i φ*_i,k, provided that each component i can be maintained at perfect match with k − 1 activities for the rest of the period. Since φ*_i,k depends on how many components share the set-up cost, which we do not know at this time, we use some average value φ*_i,Av. We assume that we know this average value at the first planning. To determine the next point of time for maintaining a given group of components we thus minimize:
c_1(t; k) = S + \sum_{i \in K_k} \left[ c_i^P + M_i(t - t_0 + x_i) - M_i(x_i) + (T - t)\,\varphi^*_{i,Av} \right] \qquad (21.2)
and, for the components not in the group,

c2(t; k) = Σ_{i∉Kk} [ci^P + S/ki,Av + Mi(x*i,Av) − Mi(xi) + (T − t*i,Av)λ*i,Av]   (21.3)
provided they can be maintained at perfect match with other activities, i.e. the
set-up costs are shared with ki,Av − 1 activities, and executed at time t*i,Av. The total
optimization problem related to the next group of activities is therefore to
minimize:
c(t; k) = S + Σ_{i∈Kk} [ci^P + Mi(t − t0 + xi) − Mi(xi) + (T − t)λ*i,Av]
        + Σ_{i∉Kk} [ci^P + S/ki,Av + Mi(x*i,Av) − Mi(xi) + (T − t*i,Av)λ*i,Av]   (21.4)
The idea is simple: we first determine the best group to execute next, and the
best time to execute it. Further, we assume that subsequent activities can be
executed at their local optimum. We expect to do better by taking the second
grouping into account when planning the first group, rather than only treating the
activities individually. See, e.g., Budai et al. (2005) for more advanced heuristics in
situations similar to those presented here. The heuristic is as follows.
Step 0: Initialization. This means to find initial estimates of ki,Av, and use these k-
values as the basis for minimization of Equation 21.1. This will give initial estimates
for x*i,Av and λ*i,Av. Finally, the time horizon for the scheduling is specified, i.e., we
set t0 = 0 and choose an appropriate end of the planning horizon (T).
Step 1: Prepare for defining the group of activities to execute next. First calculate
t*i = x*i,Av + t0 − xi and sort in increasing order.
Step 2: Establish the candidate groups, i.e. for k = 1 to N we use the ordered t*i
values to find a candidate group of size k to be executed next. If t*k > min_{i<k}(t*i + x*i,Av),
this means that at least one activity in the candidate group needs to be executed twice
before the last one is scheduled, which does not make sense. Hence, in this situation
the last candidate group is dropped and we do not search for more candidate
groups for the time being.
Step 3: For each candidate group Kk, minimize c(t,k) in Equation 21.4 with respect
to execution time t. Next choose the candidate group Kk that gives the minimum
cost. This group should then be executed at the corresponding optimum time t.
Step 4: Prepare for the next group, i.e. we assume that all activities in the chosen
candidate group are executed at time t. This corresponds to setting xi = 0 for i ∈ Kk
and xi = xi + t − t0 for i ∉ Kk, and then updating the current time, i.e. t0 = t. If t0 < T,
go to Step 1; else we are done.
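The Step 0–4 heuristic can be sketched in code. Everything below is an illustrative toy: the set-up cost S, the preventive costs ci^P and the polynomial form assumed for the cumulative unplanned-cost functions Mi are invented, and the minimizations of Equations 21.1 and 21.4 are done by crude grid search rather than a proper optimizer.

```python
S = 10.0                        # shared set-up cost (invented)
cP = [2.0, 3.0, 2.5]            # preventive costs c_i^P (invented)
eta = [1.0, 1.4, 1.8]           # scales of the assumed M_i functions
T = 10.0                        # end of planning horizon
N = len(cP)

def M(i, x):
    # Assumed expected cumulative unplanned costs of component i at age x.
    return (max(x, 0.0) / eta[i]) ** 3

def individual_optimum(i, k_av):
    # Step 0: minimize (S/k_av + cP[i] + M(i, x)) / x over x (Equation 21.1).
    xs = [n * T / 4000 for n in range(1, 4001)]
    x = min(xs, key=lambda x: (S / k_av + cP[i] + M(i, x)) / x)
    return x, (S / k_av + cP[i] + M(i, x)) / x

x_av, lam_av = zip(*(individual_optimum(i, float(N)) for i in range(N)))
t0, age, schedule = 0.0, [0.0] * N, []

while t0 < T:
    if T - t0 < min(x_av):      # sketch-level stopping rule near the horizon
        break
    # Step 1: due times t*_i = x*_{i,Av} + t0 - x_i, sorted.
    due = sorted(range(N), key=lambda i: x_av[i] + t0 - age[i])

    def group_cost(group, t):
        # Equation 21.4; the terms for components outside the group do not
        # depend on t and are therefore omitted from the minimization.
        return S + sum(cP[i] + M(i, t - t0 + age[i]) - M(i, age[i])
                       + (T - t) * lam_av[i] for i in group)

    # Steps 2-3: candidate groups are prefixes of the due list; pick the
    # group and execution time with the lowest cost.
    best_g, best_t, best_c = None, None, float("inf")
    for k in range(1, N + 1):
        g = due[:k]
        ts = [t0 + n * (T - t0) / 400 for n in range(1, 401)]
        t = min(ts, key=lambda t: group_cost(g, t))
        if group_cost(g, t) < best_c:
            best_g, best_t, best_c = g, t, group_cost(g, t)
    schedule.append((best_t, sorted(best_g)))
    # Step 4: reset ages in the group, age the rest, advance the clock.
    for i in range(N):
        age[i] = 0.0 if i in best_g else age[i] + best_t - t0
    t0 = best_t

for t, g in schedule:
    print(f"t = {t:.2f}: maintain components {g}")
```

The sketch stops scheduling when the remaining horizon is shorter than the shortest individual cycle; the chapter's own Step 4 simply loops while t0 < T.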
There are several ways to improve the algorithm. One intuitive improvement is
to refine the estimates of ki,Av and the corresponding x*i,Av and λ*i,Av specified in
Step 0. This is easy, since in Step 4 we get a new value of k for those activities
included in the candidate group, and when the algorithm terminates we simply set
ki,Av to the average for each activity i in the period [0, T). We may then start over
again at Step 0 with these new values of ki,Av.
516 J. Vatn
In Step 2 we establish candidate groups. For k = 12 we note that t*12 > t*1 + x*1,Av,
which means that we only process candidate groups with k < 12.
In Step 3 we calculate c(t; k), and the minimum values are shown in Table 21.3.
The minimum is found for k = 10. Further, c(t; 10) has its minimum for t* = 0.829
million kilometres. We observe that for those activities included in the first group,
the t*i values are rather close to 0.829 million kilometres.
In Step 4 we now proceed, and set xi to 0 for those activities which are executed
(i.e. i ≤ 10), whereas xi = xi + 0.829 million kilometres for i > 10. Finally we set
t0 = 0.829 million kilometres before we go to Step 1 again. The next group of
activities is similarly found to be executed at t* = 1.606 million kilometres. This next
group comprises some activities not included in the first group, but also some
activities that were executed in the first group and are now executed for the second
time. We proceed until t0 > 15.
When the procedure terminates, we have a total cost of 1.2 million Euros. We
have also recorded the average values of ki,Av which in this example ranges from
13.5 to 17 which is slightly higher than the initial assessment of ki,Av = 13. By re-
peating the entire procedure with the new values for ki,Av a small reduction in costs
of 1% is obtained.
The dynamic scheduling regime presented above is a good basis for opportunity
based maintenance. The scheduling we have proposed may be used to set up an
explicit maintenance plan for the time horizon [0, T). But even though the plan
exists, we may consider changing it as new information becomes available, either
in terms of new reliability parameter estimates, or if unforeseen failures occur. In
operation, for any time t0 we may update the scheduling of preventive mainten-
ance.
Upon a failure requiring the set-up costs to be paid, it is rather obvious that
activities that were already due if treated individually according to Equation 21.1
should be executed upon this opportunity. Further, activities not scheduled in the
next group (maintenance package) should not be executed, since they were not
even included in a group to be executed later than the time of this opportunity.
The basic question is thus which of the remaining activities in the next
due group should be executed at this opportunity. Let Kk be the set of k
activities in this group. Assume that we have found that it is favourable to execute
the first i − 1 < k activities on this opportunity. The procedure to test whether or not
activity i should also be executed is as follows:
First perform a scheduling by starting at Step 1 in Section 21.3.2, where we
assume that all activities up to i are executed on this opportunity, i.e. xj = 0 for
j ≤ i, and xj is set to the time since activity j was executed for j > i.
Let C1 be the minimum value of c(t; k) obtained in Step 3 plus the marginal
cost, ci^P, of executing activity i.
Next, we assume that only activities up to i − 1 are executed, i.e. xj = 0 for j ≤ i − 1,
and xj is set to the time since activity j was executed for j ≥ i.
Let C2 be the minimum value of c(t; k) obtained in Step 3 this second time.
If C1 > C2 it is not beneficial to do activity i.
If it was beneficial to do activity i at t0 we should test for i = i + 1 as long as i ≤ k.
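The include-or-not loop above can be sketched as follows, with a stub c_min() standing in for the minimum rescheduling cost of Step 3; the quadratic stub and all figures are invented for illustration.

```python
cP = [0.5, 0.5, 0.5, 0.5]        # marginal preventive costs (invented)
since = [0.9, 0.8, 0.5, 0.3]     # time since each activity was last executed

def c_min(ages):
    # Stub for the minimum of c(t; k) from Step 3; here simply a function
    # that rewards recently maintained (young) components.
    return sum(a * a for a in ages)

executed = 0                      # activities 1..executed are already included
while executed < len(cP):
    i = executed                  # candidate activity (0-based index)
    with_i = [0.0] * (i + 1) + since[i + 1:]     # x_j = 0 for j <= i
    without_i = [0.0] * i + since[i:]            # x_j = 0 for j <= i - 1
    C1 = c_min(with_i) + cP[i]    # reschedule cost plus marginal cost of i
    C2 = c_min(without_i)
    if C1 > C2:                   # not beneficial to include activity i
        break
    executed += 1

print(f"Execute the first {executed} activities at this opportunity")
```

With these invented numbers the test accepts the first two activities and rejects the third.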
The procedure is demonstrated by the following example.
We assume that a failure occurs at time t = 0.8 million km. From Table 21.3 we
observe that the first 10 activities were scheduled for execution at time 0.829
million km. Since the set-up cost is already paid by the corrective activity, it is
obvious that the first four activities, i.e. those with individual optimum less than
t = 0.8 million km, should be done. Then we test whether activity 5 (t*5 = 0.805)
should be done at this opportunity. We calculate C1 = 1.188267 million Euros and
C2 = 1.188274 million Euros, hence activity 5 should be done. Then we proceed
similarly, and find that activity 6 should also be executed. For activity 7 (t*7 = 0.879)
we find that it is not cost effective to execute this activity. Since the first six
activities have been executed upon this opportunity, the next planned maintenance
can be postponed from the original t = 0.829 million km to t = 0.985 million km.
The infrastructure manager usually has a limited budget for maintenance and
renewal of the railway network. This calls for a structured approach to the
prioritization of possible projects. In this section we discuss a portfolio approach to
larger projects, in contrast to Section 21.3, where the scheduling of periodic
activities was discussed. Examples of such larger projects are:
Ballast cleaning when the ballast is polluted and stones are crushed
Rail grinding when the rail surface is rough
Tamping and leveling when track geometry is degraded
Sandblasting of bridges exposed to corrosion
Renewal of overgrown ditches
Point replacement of rails, e.g. in curves with a high wear factor
Maintenance in the Rail Industry 521
Notation
C/B Cost-benefit ratio, i.e. the net present value of the benefits divided by
the net present value of the costs of the project
{RC(t)} Portfolio costs of renewals without the project
{RC*(t)} Portfolio costs of renewals with the project
{T*} Set of renewal times with the project
{T} Set of renewal times without the project
c(t) Time dependent cost at point of time t (from now)
c*(t) Time dependent cost when a maintenance or renewal project is
executed
d Factor to describe increase in time dependent cost due to degradation,
i.e. the increase from one year to the next is d·100%
LCC Life cycle cost
N Calculation period for net present value calculations
r Discount rate
RIF Risk influencing factor, i.e. a factor that influences the risk level
RLT Residual lifetime without the project
RLT* Residual lifetime with the project
Figure 21.1. Renewal costs and savings: the cost curves c(t) and c*(t) over time
Special attention will be paid to projects that aim at extending the lifelength of
a railway system. A typical example is rail grinding for lifelength extension of the
rail, but also the fastenings, sleepers and the ballast will take advantage of the rail
grinding. Figure 21.2 shows how a smart maintenance activity may suppress the increase
in c(t) and thereby extend the point of time before the costs explode and a renewal
is necessary.
From a modelling point of view the situation is rather complex because
different projects are interconnected. For example, by executing a ballast cleaning
project the track quality is increased, reducing the need for tamping and leveling.
On the other hand, by tamping and point-wise supplement of ballast in pumping
areas (surface water) we may postpone the much more expensive ballast cleaning.
A third factor to take into account is the fact that for each tamping cycle there is
some stone crushing, and hence we should also be reluctant to do too much
tamping. Despite the fact that railways have existed for over 160 years there is a
lack of documented mathematical models describing the interaction between
different components in the railway, and the effect of the various maintenance
activities. When developing a tool for prioritization it has therefore been necessary
to base the model on model parameters specified by the maintenance planners and
their experts. In the future, it is planned to improve the models based on the
findings from a joint research project between Norway and Austria.
In the following we describe the basic input for performing the cost benefit
analysis. The numerical calculations are supported by a computerized tool (PriFo).
The risk level is related not only to the number of cracks in the rails, but also to
accident consequence factors such as speed, terrain description, etc.
Table 21.4 shows an example related to the derailment frequency. In the
modelling, f0 corresponds to the average derailment frequency related to rail
problems. The value of f0 is found by analysing statistics over derailments in
Norway, where we find f0 = 3·10⁻⁴ per kilometre per year.
Figure 21.2. Variable costs c(t) and c*(t), renewal times, and residual lifelengths RLL and
RLL* with and without a smart maintenance activity, e.g. rail grinding
The variation width (w) in Table 21.4 shows the maximum negative or positive
effect of each RIF. In this model the values of the various RIFs are standardised,
which means that −1 represents the worst value of the RIF, 0 represents the base
case, and +1 represents the best value of the RIF. The interpretation of w is as
follows: if a RIF equals −1, then the derailment frequency is w times higher than
for the base case, and if the RIF equals +1 then the derailment frequency is w times
lower than for the base case. Assuming that the various RIFs act independently of
each other, an influence model for the derailment frequency may be written
f = f0 ∏_i wi^(−RIFi)   (21.5)
where wi is the variation width of RIF number i, and RIFi is the value of RIF
number i. By using Equation 21.5 with the generic weights from Table 21.4, we
may easily assess the derailment frequency simply by assessing the values of the
RIFs for a given railway line or section.
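Assuming the multiplicative reading of Equation 21.5, f = f0·∏i wi^(−RIFi) (which reproduces the stated interpretation: a RIF of −1 scales the frequency up by wi, a RIF of +1 scales it down by wi), the assessment is a one-liner. The weights and RIF scores below are invented for illustration.

```python
f0 = 3e-4                        # base derailment frequency per km per year
w = [2.0, 1.5, 3.0]              # variation widths of the RIFs (assumed)
rif = [-1.0, 0.0, 0.5]           # standardised RIF values in [-1, +1]

f = f0
for wi, ri in zip(w, rif):
    f *= wi ** (-ri)             # Equation 21.5, multiplicative reading

print(f"Derailment frequency: {f:.2e} per km per year")
```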
In addition to the current value of the risk, the future increase also has to be
described, corresponding to the two cost curves c(t) and c*(t) in Figure 21.1. For
example, we might use an exponential growth of the form c(t) = f·(1 + d)^(t−1), where d
is the degradation from one year to the next. The rationale behind an exponential
growth is that the forces driving the track deterioration are often assumed pro-
portional to the deviation from an ideal track. A simple differential equation argu-
ment would then show an exponential growth.
Table 21.5. Monetary values in Euros for each safety consequence class
Costs per passenger delayed 1 min = 0.4 Euros. A train with 250 passengers
then gives 100 Euros per minute delayed.
A life cycle cost (LCC) perspective will be taken with respect to calculating the
cost benefit ratio for the different projects. This includes a net present value analy-
sis, taking the following aspects into consideration:
Change in variable costs, c(t)
The effect of extending the lifelength
The project costs
The change in safety costs is given by

ΔLCC_S = Σ_{t=1}^{N} [c(t) − c*(t)](1 + r)^(−t)   (21.6)
where r is the discount rate, and N is the calculation period. N is here the residual
lifelength (RLL) if nothing is done. This means that we compare the situation with
and without the project in the period from now till we have to do something in any
case. Similarly we obtain the change in punctuality costs, ΔLCC_P, and the change in
maintenance and operational costs, ΔLCC_M&O.
To calculate Equation 21.6 we may in some special situations find closed
formulas. For example, if c(t) is constant, i.e. c(t) = c, the formula for the sum of a
geometric series yields

Σ_{t=1}^{N} c(1 + r)^(−t) = c·[1 − (1 + r)^(−N)]/r   (21.7)
Further, if c(t) the first year is c1 and c(t) increases by a factor (1 + d) each year, we
have

Σ_{t=1}^{N} c1(1 + d)^(t−1)(1 + r)^(−t) = c1·[1 − ((1 + d)/(1 + r))^N]/(r − d)   (21.8)
The change in renewal costs due to the lifelength extension is

ΔLCC_RLT = Σ_{t∈{T}} RC(t)(1 + r)^(−t) − Σ_{t∈{T*}} RC*(t)(1 + r)^(−t)   (21.9)
The cost benefit ratio, or more precisely the benefit cost ratio, is given by the
discounted benefits divided by the project costs,

C/B = (ΔLCC_S + ΔLCC_P + ΔLCC_M&O + ΔLCC_RLT)/LCC_I

For the rail grinding example, Equation 21.8 gives

ΔLCC_S = 110,000·[1 − ((1 + 0.07)/(1 + 0.04))^5]/(0.04 − 0.07)
       − 55,000·[1 − ((1 + 0.03)/(1 + 0.04))^5]/(0.04 − 0.03) ≈ 300,000

ΔLCC_P = 110,000·[1 − ((1 + 0.10)/(1 + 0.04))^5]/(0.04 − 0.10)
       − 40,000·[1 − ((1 + 0.03)/(1 + 0.04))^5]/(0.04 − 0.03) ≈ 400,000
ΔLCC_S = 0.3
ΔLCC_P = 0.4
ΔLCC_M&O = 2.3
ΔLCC_RLT = 12.9
LCC_I = 2.2
This yields a cost benefit ratio of C/B = 7.2, meaning that for each Euro put
into rail grinding, the payback is 7 Euros.
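The stated ratio can be reproduced from the five figures above. Note that the combining formula used below, C/B = (ΔLCC_S + ΔLCC_P + ΔLCC_M&O + ΔLCC_RLT)/LCC_I, is inferred from the fact that it reproduces the stated value of 7.2; the chapter's explicit formula is not shown in this excerpt.

```python
# The five terms are the figures stated in the text, in million Euros.
dlcc_s, dlcc_p, dlcc_mo, dlcc_rlt = 0.3, 0.4, 2.3, 12.9   # benefit terms
lcc_i = 2.2                                               # project cost

cb = (dlcc_s + dlcc_p + dlcc_mo + dlcc_rlt) / lcc_i
print(f"C/B = {cb:.1f}")
```

This prints C/B = 7.2, i.e. 15.9/2.2, matching the value given in the text.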
By calculating the cost benefit ratio for the various maintenance and renewal
projects, we get a sorted list of the most promising projects. In principle, we should
execute those projects having a cost benefit ratio, C/B, higher than one. If the
budget constraints imply that we cannot execute all projects with C/B higher than
one, it would be necessary to have a thorough discussion related to the budget for
maintenance and renewal. Since most organizations suffer from the short-term
cost-cutting syndrome, it is a hard struggle to argue for spending more money now
in order to save money in a five to ten year perspective.
Even if we cannot do much about the budget situation, we may use the results
from the cost-benefit analysis to prioritize between the various projects.
21.5 Conclusions
The two case studies presented elaborate on some of the challenges in Norwegian
rail maintenance. Both the railway undertaking (NSB) and the infrastructure
manager (JBV) aim at implementing more proactive strategies for maintenance and
renewal based on more formal methods such as RCM and NPV/CBA. These
methods require reliability parameters at a much higher level of detail than the
experience databases can offer today. Therefore both NSB and JBV have
started the process of restructuring databases, and emphasize the importance of
proper failure reporting. Due to the lack of experience data it has up to now been
necessary to utilize expert judgment to a great extent. It is further important to
emphasize that optimization models like the ones presented here should be con-
sidered as decision support, rather than decision rules. In order to improve on these
areas we believe that more systematic collection and analysis of reliability data is
an important factor, and here the rail industry may learn from the offshore industry
where joint data collection exercises have been run for 25 years (OREDA 2002).
Another challenge of such modelling is the lack of consistent degradation
models. For example, for the track there is a good qualitative understanding of
factors affecting degradation such as water in the track, contamination, geometry
failures, heavy axles, etc. However, the quantitative models for degradation taking
these factors into account are not very well developed. Research has paid much
attention to design problems to ensure long service life but it is difficult to use the
research results for maintenance and renewal considerations. More empirical re-
search on degradation mechanisms will also be important in the future.
21.6 References
Bing AJ, Gross A, (1983) Development of Railroad Track Degradation Models.
Transportation Research Record 939, Transportation Research Board, National
Research Council, National Academy Press, Washington, D.C, USA.
Bogdanski S, Olzak M, Stupnicki J, (1996) Numerical stress analysis of rail rolling contact
fatigue cracks. Wear 191:14–24
Braghin F, Lewis R, Dwyer-Joyce RS, Bruni S, (2006) A mathematical model to predict
railway wheel profile evolution due to wear. Accepted for publication in Wear.
Budai G, Huisman D, Dekker R. (2005) Scheduling Preventive Railway Maintenance
Activities. Accepted for publication in Journal of the Operational Research Society.
Carretero J, Perez JM, Garcia-Carballeira F, Calderon A, Fernandez J, Garcia JD,
Lozano A, Cardona L, Cotaina N, Prete P, (2003) Applying RCM in large scale
systems: a case study with railway networks. Reliability Engineering and System Safety
82:257–273
Dekker R, Wildeman RE, Van der Duyn Schouten FA, (1997) A Review of Multi-
Component Maintenance Models with Economic Dependence. Mathematical Methods
of Operations Research 45:411–435
Ferreira L, Murray M, (1997) Modelling rail track deterioration and maintenance: current
practices and future needs. Transport Reviews 17(3):207–221
Garey MR, Johnson DS (1979). Computers and Intractability: a Guide to the Theory of NP-
Completeness. W.H. Freeman and Company: New York.
Grassie SL (2005) Rolling contact fatigue on the British railway system: treatment. Wear
258:1310–1318
Hecke A, (1998) Effects of future mixed traffic on track deterioration. Report TRITA-FKT
1998:30, Railway Technology, Department of Vehicle Engineering, Royal Institute of
Technology, Stockholm.
Kay AJ, (1998) Behaviour of Two Layer Railway Track Ballast under Cyclic and Monotonic
Loading. PhD Thesis, University of Sheffield, UK.
Li D, Selig ET, (1995) Evaluation of railway subgrade problems. Transportation Research
Record 1489:17–25
Meier-Hirmer C, Sourget F, Roussignol M, (2005) Optimising the strategy of track
maintenance. In: Kołowrocki K (ed) Advances in Safety and Reliability. Taylor & Francis
Group, London.
OREDA, (2002) Offshore Reliability Data, 4th ed. OREDA Participants. Available from
Det Norske Veritas, NO-1322 Høvik, Norway.
Pedregal DJ, Garcia FP, Schmid F (2004) RCM2 predictive maintenance of railway
systems based on unobserved components models. Reliability Engineering and System
Safety 83:103–110
Podofillini L, Zio E, Vatn J, (2006) Risk-informed optimization of railway tracks inspection
and maintenance procedures. Reliability Engineering and System Safety 91:20–30
Reddy V, Chattopadhyay G, Larsson-Kråik PO, Hargreaves DJ, (2006) Modelling and
analysis of rail maintenance cost. Accepted for publication in International Journal of
Production Economics.
Salim W, (2004) Deformation and degradation aspects of ballast and constitutive modeling
under cyclic loading. PhD Thesis, University of Wollongong, Australia.
Sato Y, (1995) Japanese studies on deterioration of ballasted track. Vehicle System Dynamics
24:197–208
Sriskandarajah C, Jardine AKS, Chan CK (1998) Maintenance scheduling of rolling stock
using a genetic algorithm. European Journal of Operational Research 35:1–15
Telliskivi T, Olofsson U, (2004) Wheel–rail wear simulation. Wear 257:1145–1153
Vatn J, Podofillini L, Zio E (2003) A risk based approach to determine type of ultrasonic
inspection and frequencies in railway applications. World Congress on Railway
Research, Edinburgh, Scotland, 28 September – 1 October 2003.
Veit P, Wogowitsch M, (2003) Track Maintenance based on life-cycle cost calculations. In
Innovations for a cost effective Railway Track.
www.promain.org/images/publications/Innovations-LCC.pdf
Welte T, Vatn J, Heggset J, (2006) Markov state model for optimization of maintenance and
renewal of hydro power components. 9th International Conference on Probabilistic
Methods Applied to Power Systems, KTH, Stockholm, 11–15 June 2006.
Wildeman RE (1996). The art of grouping maintenance. PhD Thesis, Erasmus University
Rotterdam, Faculty of Economics.
Zakharov S, Komarovsky I, Zharov I (1998) Wheel flange/rail head wear simulation. Wear
215:18–24
Zarembski AM, Palese JW, (2003) Risk Based Ultrasonic Rail Test Scheduling: Practical
Application in Europe and North America. 6th International Conference on Contact
Mechanics and Wear of Rail/Wheel Systems (CM2003), Gothenburg, Sweden, June
10–13, 2003
Zhang YJ, Murray MH, Ferreira L, (1997) Railway track performance models: degradation
of track structures. Road and Transport Research 6(2):4–19
Zoeteman A, 2003. Life Cycle Management Plus. In Innovations for a cost effective Railway
Track. www.promain.org/images/publications/Innovations-LCC.pdf
22
Condition Monitoring of Diesel Engines
22.1 Introduction
The engine is the heart of the ship; and the lubricant is the lifeblood of the engine.
Wear is one of the main causes that lead to engine failures. It is desirable to avoid
engine breakdowns for reasons of safety and economy. This has led to an increas-
ing interest in engine condition monitoring and performance modeling so as to
provide useful information for maintenance decisions.
Generally, an engine goes through three phases: (i) a running-in phase with an
increasing wear rate, (ii) a normal operational phase with a roughly constant wear
rate, and (iii) a wear-out phase with a quickly increasing wear rate. The wear state
can be effectively monitored by a number of techniques. The most popular tech-
nique is lubrication oil testing and analysis. Other techniques such as vibration and
acoustic emission analyses also provide evidence of the wear state. A more
effective way may be an integrated use of various monitoring techniques. In this
chapter we confine our attention to oil analysis.
Oil analysis techniques fall into the following three types. The first is concen-
tration analysis of wear particles in lubricant. This can be conducted in the field or
the laboratory. The second is wear debris analysis. This deals with examination of
the shape, size, number, composition, and other characteristics of the wear particles
so as to identify the wear state. This is usually conducted in the laboratory. The
third is lubricant degradation analysis. This is used to analyze physical and chemi-
cal characteristics of lubricant and determine the state of lubricant. This can be
conducted in the field or the laboratory.
To avoid the use of expensive laboratory instrumentation for wear state identi-
fication, a usual practice is to build a quantitative relation (or discriminant model)
between the condition variables (e.g. concentrations of wear particles) and the wear
state using an observation sample obtained from both field and laboratory analysis.
Once such a relation is built and verified, only field analysis is needed in practical
applications. As a result, a key issue is to develop an effective and quantitative
condition monitoring model.
534 R. Jiang and X. Yan
In this chapter we present a case study, which deals with applying oil analysis
techniques to condition monitoring of marine diesel engines. We present a system-
atic approach to identify the important condition variables, construct a multivariate
control chart, build the quantitative relation between the condition variables and
the wear state, and establish the state discrimination criterion or critical value. The
proposed approach is formulated based on intuitive reasoning, optimization tech-
nique and real data.
The chapter is organized as follows. Section 22.2 presents a literature review on
condition-based maintenance (CBM) and its applications to diesel engines. Section
22.3 provides the background details and presents the monitoring and experimental
results. The results are analyzed and modeled in Section 22.4. Finally, we conclude
the chapter with a summary and discussion in Section 22.5.
Wear states of two 8NVD48A-2u marine main propulsion diesel engines were
experimentally investigated at the Reliability Institute of Wuhan University of
Technology, China. The overall objective of the program was to develop a CBM
technique to provide condition information for maintenance decisions for the engines.
There are three kinds of condition variables to represent the wear condition of
the engines:
Wear particle concentrations
Lubricant quality parameters such as viscosity and contamination index
Operational parameters such as vibration level, shaft torque moment and
instantaneous rotation velocity
Condition Monitoring of Diesel Engines 541
The experiment was started after an overhaul of the engines, which is assumed to
restore the engines to a good-as-new state, and finished at the time instant of the next
overhaul. During the experiment the engines cumulatively ran for 4831 h; the
engine oil was periodically sampled, and a total of 110 oil samples were taken
from the two engines.
Various pieces of equipment such as direct reading ferrograph, rotary ferro-
graph, infrared spectrum analyzer, scanning electron microscope and electronic
digital analyzer, viscosity meter, and lubricant quality meter were used to analyze
the oil samples in order to classify the wear state.
In this study, the wear is divided into two states: normal (or State 0) and
abnormal (or State 1). The wear state can be determined by analyzing the size,
composition, and type of wear particles. Several different techniques were used to
identify the wear state in the laboratory. For more details about the state classifi-
cation based on wear particle morphology, see Roylance et al. (1994), Roylance
and Raadnui (1994) and Raadnui and Roylance (1995).
Most of the observations were under normal operational conditions. A trend
analysis of concentration vs. time was carried out. The main findings were as
follows:
1. There exists a close relation between oil degradation and abnormal wear. It
was observed that the concentration of wear particles increases as the viscosity
decreases and the contamination index increases.
2. There exist some differences among the outcomes provided by different
analysis techniques; and sometimes the outcomes are in disagreement.
Table 22.2. Wear states and element concentration readings of the oil samples
j State Fe Cr Ni Mn Al Cu Pb Si
1 1 52.18 2.95 2.66 2.36 8.7 10.98 13.29 5.32
2 1 52.73 3.25 2.55 1.78 8.04 8.93 9.65 5.46
3 1 35.31 0.95 0.68 1.26 5.57 4.33 6.23 4.57
4 1 32.2 1.35 1.17 1.03 5.70 3.57 5.89 4.56
5 1 82.87 4.74 2.61 1.85 9.85 13.34 17.1 7.22
6 1 48.22 2.17 1.94 1.37 7.08 6.82 8.05 4.88
7 1 30.78 1.03 0.00 1.15 4.71 4.18 5.94 4.00
8 1 37.99 1.30 0.00 1.07 6.07 4.52 5.87 3.90
9 1 39.51 1.39 0.41 1.04 7.24 3.17 5.51 7.15
10 1 33.47 1.06 0.35 0.86 6.77 2.87 4.95 7.18
11 1 36.50 2.17 1.05 1.61 7.73 3.88 6.29 7.61
12 1 35.03 1.73 0.57 1.30 7.68 3.47 5.11 7.43
Mean 43.07 2.01 1.17 1.39 7.10 5.84 7.82 5.77
13 0 28.2 0.39 0.00 0.72 4.04 2.71 3.70 3.96
14 0 27.02 0.79 0.40 0.87 4.09 3.24 5.02 5.71
15 0 25.66 0.43 0.00 0.69 3.64 2.65 3.57 5.29
16 0 22.25 0.50 0.18 0.50 3.94 2.15 3.80 5.50
17 0 30.72 1.28 0.64 1.09 5.09 4.15 6.63 3.99
18 0 29.4 0.58 0.00 1.01 4.30 4.20 4.92 3.65
19 0 29.17 0.47 0.00 0.97 4.12 3.67 4.73 3.67
20 0 31.45 1.10 0.00 1.12 4.73 4.27 5.96 4.01
21 0 30.04 0.43 0.00 1.02 4.16 3.91 5.30 3.90
22 0 29.48 0.66 0.00 0.91 4.49 3.58 4.79 3.85
23 0 25.97 0.34 0.00 0.68 3.69 2.56 3.71 4.03
24 0 42.05 2.34 1.98 1.94 7.75 11.12 12.78 5.93
25 0 43.16 2.10 2.16 1.92 7.41 10.64 13.25 5.67
26 0 23.38 0.96 0.31 0.58 4.63 2.13 3.21 6.74
27 0 29.16 0.62 1.07 0.91 3.23 2.95 4.93 8.52
28 0 22.82 0.66 0.31 0.91 4.52 2.06 3.66 6.35
approach to model the data. Compared with the previous approaches, it appears
more straightforward and comprehensive.
t = r/√(1 − r²).   (22.1)
The critical value of t associated with the 95% level, one tail, and 12 − 1 = 11
degrees of freedom is 1.7959. This implies that the critical value of r is 0.8737.
Namely, the linear relation between two variables is significant if their correlation
coefficient is larger than 0.8737 in this application.
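The critical value of r follows by inverting Equation 22.1 at the critical t, since t = r/√(1 − r²) gives r = t/√(1 + t²); a quick check:

```python
import math

t_crit = 1.7959                  # one-tailed 95% critical t at 11 df
r_crit = t_crit / math.sqrt(1.0 + t_crit ** 2)   # invert Equation 22.1
print(f"critical r = {r_crit:.4f}")
```

This reproduces the threshold of 0.8737 used in the text.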
As can be seen from Table 22.3, there are eight correlation coefficients that are
larger than 0.8737. They are:
r(Cu, Pb) = 0.98, r(Fe, Cr) = 0.95, r(Fe, Pb) = 0.94, r(Fe, Cu) = 0.93,
r(Cr, Cu) = r(Cr, Pb) = 0.92, r(Cr, Al) = r(Cu, Ni) = 0.88. (22.2)
Equation 22.2 involves six elements: Fe, Cr, Ni, Al, Cu, and Pb. Among them, Ni
and Al appear only once and the corresponding correlation coefficients (= 0.88) are
very close to the critical value. Thus, we may classify the elements into three groups:
Strong correlation group: (Fe, Cr, Cu, Pb)
Weak correlation group: (Ni, Al)
Independent group: (Mn, Si)
Table 22.3. Correlation coefficients between the element concentration readings
Cr Ni Mn Al Cu Pb Si
Fe 0.95 0.78 0.66 0.81 0.93 0.94 0.24
0.84 0.80 0.96 0.85 0.96 0.96 -0.04
Cr 0.87 0.78 0.88 0.92 0.92 0.32
0.88 0.89 0.96 0.91 0.93 0.23
Ni 0.84 0.75 0.88 0.84 0.11
0.83 0.82 0.85 0.89 0.50
Mn 0.73 0.84 0.81 0.11
0.90 0.97 0.97 0.05
Al 0.74 0.76 0.64
0.93 0.93 0.06
Cu 0.98 0.02
0.99 0.05
Pb 0.12
0.10
When two variables are strongly correlated and their means differ significantly,
one can ignore the variable with the smaller mean and simply use the one with the
larger mean. Using this reasoning, we may delete some of the elements in the
strong correlation group. Consider the first three correlation coefficients of Equa-
tion 22.2, which have larger r values. According to the first correlation coefficient
and the means given in Table 22.2, Cu may be deleted. Similarly, Cr and Pb may
be deleted based on the second and third correlation coefficients, respectively. As a
result, only five elements (Fe, Ni, Mn, Al and Si) are retained for further analysis.
A physical interpretation of the correlation in this case study is that the wear
debris may not be pure metal and can come from different parts. Its mathematical
interpretation is that an increase (decrease) of the readings in one element implies
a possible increase (decrease) of the readings in a positively correlated element,
and the opposite in a negatively correlated one. When the absolute values of the
readings are very small, e.g. some of the readings of Si, the correlation should be
considered insignificant.
Each condition variable contributes partial information for identifying the state of
the monitored system. By quantitatively examining their contributions, we can
identify those variables which carry more state information. This study develops a
method to quantitatively evaluate contributions of the condition variables. It starts
with building the marginal distributions associated with the abnormal and normal
states for each condition variable.
The data associated with State 1 are assumed to follow a certain distribution
F1^(i)(x). The data associated with State 0 can be viewed as right-censored.
Namely, if the observed value associated with State 0 is x_ij^+, then the
corresponding value of x associated with State 1 meets the relation x > x_ij^+. Its
likelihood contribution is given by 1 − F1^(i)(x_ij^+). The overall maximum likelihood
function is given by

L1^(i) = ∏_{j=1}^{12} f1^(i)(x_ij) · ∏_{j=13}^{28} [1 − F1^(i)(x_ij^+)].   (22.3)
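The censored fit behind Equation 22.3 can be sketched as follows, using the Fe readings from Table 22.2 and a lognormal F1 (the distribution type reported for Fe in Table 22.4); the crude grid search is illustrative only, and any proper optimizer would do.

```python
import math

exact = [52.18, 52.73, 35.31, 32.2, 82.87, 48.22,     # State 1 readings (Fe)
         30.78, 37.99, 39.51, 33.47, 36.50, 35.03]
censored = [28.2, 27.02, 25.66, 22.25, 30.72, 29.4,   # State 0 readings,
            29.17, 31.45, 30.04, 29.48, 25.97, 42.05, # right-censored
            43.16, 23.38, 29.16, 22.82]               # with respect to F1

def Phi(z):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def log_lik(mu, sigma):
    # Log of Equation 22.3 for a lognormal F1: density terms for the exact
    # observations, survival terms 1 - F1(x+) for the censored ones.
    ll = 0.0
    for x in exact:
        z = (math.log(x) - mu) / sigma
        ll += -math.log(x * sigma * math.sqrt(2 * math.pi)) - 0.5 * z * z
    for x in censored:
        ll += math.log(max(1.0 - Phi((math.log(x) - mu) / sigma), 1e-300))
    return ll

# Grid-search maximum likelihood estimate of (mu, sigma).
mu_hat, sigma_hat = max(((3.0 + 0.01 * i, 0.05 + 0.01 * j)
                         for i in range(121) for j in range(46)),
                        key=lambda p: log_lik(*p))
print(f"mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}")
```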
A Gumbel distribution left-truncated at zero has the form

F(x) = 1 − exp[e^(−μ/σ) − e^((x−μ)/σ)],  x ≥ 0.   (22.4)
Figure 22.1. WPP plots of the data for the five elements (Fe, Ni, Mn, Al, Si)
546 R. Jiang and X. Yan
Their WPP plots are shown in Figure 22.2. Clearly, for each WPP plot of data in
Figure 22.1 one can find a shape that matches one of the WPP plots in Figure 22.2.
Thus, an appropriate model can be found from these three models for each variable.
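A WPP plots y = ln[−ln(1 − F)] against x = ln(t); Weibull data then fall near a straight line, and curved shapes are matched against the theoretical WPPs of the candidate models. A minimal sketch of the empirical transform follows, using hypothetical Weibull data and Bernard's median-rank approximation as one common choice for F:

```python
import numpy as np

def wpp_coordinates(data):
    """Empirical WPP coordinates: x = ln(t_(j)), y = ln(-ln(1 - F_hat))."""
    t = np.sort(np.asarray(data, dtype=float))
    n = len(t)
    F_hat = (np.arange(1, n + 1) - 0.3) / (n + 0.4)  # median-rank estimate
    return np.log(t), np.log(-np.log(1.0 - F_hat))

rng = np.random.default_rng(1)
x, y = wpp_coordinates(10.0 * rng.weibull(2.0, size=200))
slope = np.polyfit(x, y, 1)[0]   # for Weibull data, slope ~ shape parameter
print(slope)
```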
Once the model type is determined, the maximum likelihood method can be
used to obtain the estimates of the model parameters. The estimated parameters,
(μ1(i), σ1(i)), are shown in Table 22.4.
In later analysis, we need to know the means and variances of the fitted
marginal distributions. For the truncated normal distribution, the mean and
variance are given by

m = μ + σ φ(μ/σ) / [1 − Φ(−μ/σ)] ,  V = σ² + m(μ − m) , (22.5)

where μ and σ are the model parameters, and φ(·) and Φ(·) are the pdf and cdf of
the standard normal distribution, respectively. For the truncated Gumbel distribution,
the mean and variance are given by
m = μ + σ exp(e^{−μ/σ}) I1 ,  V = σ² exp(e^{−μ/σ}) [I2 − exp(e^{−μ/σ}) I1²] , (22.6)

where

I1 = ∫_{e^{−μ/σ}}^{∞} ln(s) e^{−s} ds ,  I2 = ∫_{e^{−μ/σ}}^{∞} [ln(s)]² e^{−s} ds . (22.7)
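Equation 22.5 is easy to check numerically. The sketch below evaluates it for Mn's State-1 parameters from Table 22.4 and cross-checks by direct integration of the truncated density; note that √V (≈ 0.443), rather than V itself, matches the corresponding table entry, suggesting the V rows of Table 22.4 record standard deviations.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 1.5644, 0.4440   # Mn, State-1 parameters (Table 22.4)

# Closed form of Equation 22.5 for a normal distribution truncated below at 0
m = mu + sigma * norm.pdf(mu / sigma) / (1.0 - norm.cdf(-mu / sigma))
V = sigma**2 + m * (mu - m)

# Cross-check by integrating the truncated density directly
c = 1.0 - norm.cdf(-mu / sigma)                 # normalising constant
f = lambda x: norm.pdf(x, mu, sigma) / c
m_num = quad(lambda x: x * f(x), 0.0, np.inf)[0]
V_num = quad(lambda x: (x - m_num) ** 2 * f(x), 0.0, np.inf)[0]
print(m, np.sqrt(V))
```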
Figure 22.2. WPP plots of truncated normal, lognormal, and truncated Gumbel distributions
Condition Monitoring of Diesel Engines 547
Table 22.4. Estimated model parameters and discrimination measures for the five elements

            i = 1, Fe   i = 2, Ni         i = 3, Mn         i = 4, Al         i = 5, Si
Model       Lognormal   Truncated Gumbel  Truncated normal  Truncated normal  Lognormal
μ0(i)       3.3532      1.0286            0.0548            4.4977            1.5194
σ0(i)       0.1243      0.6361            0.8948            1.1658            0.1880
m0(i)       28.8153     0.9530            0.7342            4.4980            4.6511
V0(i)       3.5965      0.8064            0.5494            1.1653            0.8820
μ1(i)       3.7554      2.0401            1.5644            7.3574            1.8317
σ1(i)       0.1897      0.7985            0.4440            1.3679            0.1964
m1(i)       43.5269     1.7723            1.5648            7.3574            6.3660
V1(i)       8.3335      1.1040            0.4433            1.3679            1.2624
xc(i)       34.3485     1.5703            1.0493            5.9022            5.3512
Err1(i)     0.0701      0.1171            0.2540            0.1142            0.2005
Err2(i)     0.1244      0.3797            0.1230            0.1437            0.2159
P(xc(i))    0.0973      0.2484            0.1885            0.1289            0.2082
Rank        1           5                 3                 2                 4
L0(i) = Π_{j=1}^{12} F0(i)(xij) · Π_{j=13}^{28} f0(i)(xij) . (22.9)
A careful examination has been carried out to determine the model type of F0(i)(x).
We found that F0(i)(x) has the same model type as F1(i)(x). The maximum
likelihood estimates of the model parameters, (μ0(i), σ0(i)), are also shown in Table
22.4.
(Figure: the pdfs f0(x) and f1(x) of the two states, with the alarm threshold xa and the abnormal threshold xc marked on the x axis.)
dP(xc(i)) / dxc(i) = 0 , or equivalently f0(i)(xc(i)) = f1(i)(xc(i)) . (22.13)
The specific values of the relevant parameters ( Err1(i ) , Err2(i ) , P( xc(i ) ), xc(i ) ) for
each element are shown in Table 22.4.
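The threshold rule of Equation 22.13 can be reproduced with a root finder. The sketch below uses Fe's fitted lognormal parameters from Table 22.4 and recovers xc, Err1, Err2 and P(xc) close to the tabulated values; the small differences come from parameter rounding.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import lognorm

# Fe marginals from Table 22.4: lognormal in both states
f0 = lognorm(s=0.1243, scale=np.exp(3.3532))   # normal state
f1 = lognorm(s=0.1897, scale=np.exp(3.7554))   # abnormal state

# Equation 22.13: the abnormal threshold solves f0(xc) = f1(xc)
xc = brentq(lambda x: f0.pdf(x) - f1.pdf(x), f0.mean(), f1.mean())

Err1 = f0.sf(xc)     # P(X > xc | normal state): false alarm
Err2 = f1.cdf(xc)    # P(X < xc | abnormal state): missed detection
P_xc = 0.5 * (Err1 + Err2)
print(xc, P_xc)
```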
22.4.2.4 Discussion
P(xc(i)) is a measure of the misjudgment probability. The smaller it is, the better the
discrimination capability of variable i; that is, the variable contains more state
information. Using it as an importance criterion, we can rank the condition
variables. The last row of Table 22.4 shows the rank of each variable.
As can be seen from the table, Fe has the best discrimination capability. This is
consistent with the result of correlation analysis, which shows that it is highly
correlated with (Cr, Cu, Pb). Namely, the concentration of Fe comprehensively
reflects the concentrations of Cr, Cu, Pb and itself, and hence the reading of Fe
reflects the wear state to a great extent. The second most significant element is Al.
This also appears reasonable since debris of Al and Cr (the latter is reflected by Fe)
mainly comes from piston and piston rings, which are the main wear parts. Mn and
Si have almost the same discrimination capability. This appears reasonable due to
their independence. Finally, it is noted that Ni has the worst state discrimination
capability. This can be explained by the dispersion of its readings (see Table 22.2),
and the fact that the wear of the transmission gears may not be a major problem.
A multivariate control chart can intuitively display the results of condition
monitoring and the evolution trend. For such a chart it is especially important to set an
alarm threshold and an abnormal threshold. Usually, the thresholds are optimized
in a CBM model. Here, our focus is on the construction of such a control chart, and
hence we only present a simple method to set the thresholds when the optimal
thresholds are unavailable.
We define xc(i ) as the abnormal threshold, and define the alarm threshold as
below:
To achieve the second and third features, we use the following relation to
transform an observed concentration xi into a normalized concentration yi without
changing the relative magnitude of the original readings:
yi = ai + bi xi , bi > 0. (22.15)
Let

Σ_{i=1}^{5} ai = 0 , (22.18)
so as to decrease the influence of the constant term in Equation 22.15. This yields
α = [ Σ_{i=1}^{5} xa(i) / (xc(i) − xa(i)) ] / [ Σ_{i=1}^{5} xc(i) / (xc(i) − xa(i)) ] . (22.19)
Table 22.5. Thresholds and rescaling coefficients for each element

         i = 1, Fe   i = 2, Ni   i = 3, Mn   i = 4, Al   i = 5, Si
xa(i)    31.2898     0.4048      0.8342      5.1075      4.5206
ai       −0.7925     0.7849      0.2215      −0.1855     −0.0284
bi       0.0522      0.1370      0.7419      0.2008      0.1922
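Equations 22.15, 22.18 and 22.19 can be verified against the tabulated coefficients. The sketch below assumes, as our reading of the normalisation, that each abnormal threshold rescales to 1 and each alarm threshold to the common level α; under that assumption it reproduces the bi row of the table above, and the ai automatically satisfy Equation 22.18.

```python
import numpy as np

elements = ["Fe", "Ni", "Mn", "Al", "Si"]
xc = np.array([34.3485, 1.5703, 1.0493, 5.9022, 5.3512])  # abnormal thresholds
xa = np.array([31.2898, 0.4048, 0.8342, 5.1075, 4.5206])  # alarm thresholds

# Equation 22.19: the common rescaled alarm level alpha
r = 1.0 / (xc - xa)
alpha = (xa * r).sum() / (xc * r).sum()

# Assumed mapping (our inference): y(xc) = 1 and y(xa) = alpha, so the
# Equation 22.15 coefficients follow directly.
b = (1.0 - alpha) * r
a = 1.0 - b * xc          # sums to zero, satisfying Equation 22.18
print(alpha)
print(dict(zip(elements, b.round(4))))
```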
combined scale is expected to have better failure (or abnormal state) prediction
capability than individual scales. Two typical models are the linear and multipli-
cative ones. Their parameters are determined by minimizing the sample coefficient
of variation (CV) of the composite scale. The minimum CV approach is hard to
apply in the presence of censored data. In this context, Jiang and Jardine (2006)
propose a simple method to estimate the model parameters in the presence of
censored data. The method transforms censored data into complete data by adding
a mean residual value to each censored datum on each scale. The new data set thus
obtained is called an equivalent complete data set and is used for parameter
estimation by the minimum CV approach, under the assumption that the
transformation does not significantly affect the composite scale model to be built.
They also conclude that a small value of CV is a necessary but insufficient con-
dition of a good prediction capability of failure for the composite scale model.
Therefore, they consider more than one alternative model, use the minimum CV
method to estimate the parameters of the alternative models, and determine the best
model based on the prediction capability of the models.
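The minimum-CV idea can be illustrated on synthetic, uncensored two-scale data (all values hypothetical); for censored data one would first build the equivalent complete data set described above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical two-scale failure data: both scales are noisy views of the
# same underlying usage T, so a composite scale can achieve a smaller CV.
rng = np.random.default_rng(2)
T = rng.gamma(shape=9.0, scale=10.0, size=200)
t1 = T + rng.normal(0.0, 12.0, size=200)    # scale 1 readings at failure
t2 = T + rng.normal(0.0, 15.0, size=200)    # scale 2 readings at failure

def cv(w):
    """Sample CV of the linear composite scale z = w*t1 + (1 - w)*t2."""
    z = w * t1 + (1.0 - w) * t2
    return z.std(ddof=1) / z.mean()

res = minimize_scalar(cv, bounds=(0.0, 1.0), method="bounded")
w_best = res.x
print(w_best, cv(w_best))
```

As the chapter notes, a small CV is only a necessary condition for good failure prediction, so the weights found this way should still be validated against prediction capability.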
Figure 22.4. Control chart of rescaled concentrations for two oil samples (No. 12 and No. 13) across the elements Fe, Al, Si, Mn and Ni, with the alarm and abnormal thresholds marked
y = Σ_{i=1}^{5} ci xi ,  Σ_{i=1}^{5} ci = 1 . (22.20)
If we want to exclude a certain variable, say xk, from the model, we just need to set
ck = 0.
According to the above assumptions, the composite scale, Y, is a normal random
variable. For State 1, the mean and variance of Y are given by

m1 = Σ_{i=1}^{5} ci m1(i) ,  V1 = Σ_{i=1}^{5} ci² V1(i) . (22.21)

Similarly, for State 0,

m0 = Σ_{i=1}^{5} ci m0(i) ,  V0 = Σ_{i=1}^{5} ci² V0(i) . (22.22)
According to Equation 22.13, the critical value of the composite scale, yc, meets
the following relation:
φ((yc − m0)/√V0) / √V0 = φ((yc − m1)/√V1) / √V1 . (22.23)
yc = m1 + √V1 [ √(d² + 2(s² − 1) ln s) − s d ] / (s² − 1) , (22.24)

where

s = √(V1 / V0) ,  d = (m1 − m0) / √V0 . (22.25)
P(yc) = [ 1 − Φ((yc − m0)/√V0) + Φ((yc − m1)/√V1) ] / 2 . (22.26)
Condition Monitoring of Diesel Engines 553
Since m0, V0, m1, and V1 are functions of the decision variables {ci}, P(yc) is a
function of {ci}. As a result, {ci} can be optimally determined by directly mini-
mizing P(yc).
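This direct minimisation can be sketched from Equations 22.21 to 22.26. The per-element means and variances below are illustrative values loosely based on Table 22.4 (means rounded, and the tabulated V entries treated as standard deviations and squared), not the chapter's exact inputs.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Illustrative per-element moments (loosely based on Table 22.4)
m0 = np.array([28.8, 0.95, 0.73, 4.50, 4.65])
V0 = np.array([12.9, 0.65, 0.30, 1.36, 0.78])
m1 = np.array([43.5, 1.77, 1.56, 7.36, 6.37])
V1 = np.array([69.4, 1.22, 0.20, 1.87, 1.59])

def P_yc(c):
    """Misjudgment probability (Equation 22.26) at the critical value yc."""
    M0, M1 = c @ m0, c @ m1                        # Equations 22.21-22.22
    S0, S1 = np.sqrt(c**2 @ V0), np.sqrt(c**2 @ V1)
    s, d = S1 / S0, (M1 - M0) / S0                 # Equation 22.25
    yc = M1 + S1 * (np.sqrt(d**2 + 2.0 * (s**2 - 1.0) * np.log(s)) - s * d) \
             / (s**2 - 1.0)                        # Equation 22.24
    return 0.5 * (1.0 - norm.cdf((yc - M0) / S0) + norm.cdf((yc - M1) / S1))

# Minimise P(yc) over the weights, subject to sum(ci) = 1
cons = {"type": "eq", "fun": lambda c: c.sum() - 1.0}
res = minimize(P_yc, x0=np.full(5, 0.2), method="SLSQP",
               bounds=[(0.0, 1.0)] * 5, constraints=cons)
print(res.x.round(3), P_yc(res.x))
```

Setting some ck = 0 (or bounding it at zero) excludes variable xk, which is how the alternative models of Table 22.6 can be compared.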
I = 1/P(yc). (22.27)
rI = 1 / [(n − 1) P(yc)] . (22.28)
It comprehensively reflects the above two requirements: a large value of rI implies
a better model. We use this criterion to select the best model. The last column
of Table 22.6 shows the values of rI. As can be seen from the table, the best model
is the three-parameter model that includes the three important elements (Fe, Al, Si).
The second best model is the three-parameter model that includes the elements
(Fe, Al, Mn). Once more, this shows that Mn and Si have almost the same
importance, as indicated in the correlation analysis.
In the current case, we take the probability level as 1%. This yields y0.01 = 8.1536. The rescaled alarm
threshold for the composite scale equals 0.9156, which is not equal to the rescaled
alarm threshold (= α) for the elements; see Figure 22.4.
1. Some additional work is needed to validate the proposed model. This can be
done by examining the agreement between the model prediction results and
the actual observations in the field.
2. The alarm threshold and oil sampling interval can be optimized so as to
obtain a balance between the acquired information and the effort involved.
3. To provide a more accurate assessment of engine condition, it appears
necessary to use multiple monitoring techniques. Thus, fusion of multi-
sensor data and aggregation of multi-state measures is an important topic
that needs further study.
4. An optimization-based maintenance decision model and a computerized
implementation software package need to be developed to promote greater
use of this approach in industry.
22.6 Acknowledgement
The authors wish to thank Prof. D.N.P. Murthy for his constructive comments on
an earlier version of this chapter.
22.7 References
Anderson DN, Hubert CJ, Johnson JH, (1983) Advances in quantitative analytical
ferrography and the evaluation of a high gradient magnetic separator for the study of
diesel engine wear. Wear 90(2): 297–333
Blischke WR, Murthy DNP, (2000) Reliability: modeling, prediction, and optimization.
John Wiley, New York
Douglas RM, Steel JA, Reuben RL, (2006) A study of the tribological behaviour of piston
ring/cylinder liner interaction in diesel engines using acoustic emission. Tribology
International 39(12): 1634–1642
Fisher RA, (1970) Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh
Goode KB, Moore J, Roylance BJ, (2000) Plant machinery working life prediction method
utilizing reliability and condition-monitoring data. Proceedings of the Institution of
Mechanical Engineers Part E: Journal of Process Mechanical Engineering 214: 109–122
Gorin N, Shay G, (1997) Diesel lubricant monitoring with new-concept shipboard test
equipment. TriboTest 3(4): 415–430
Grimmelius HT, Meiler PP, Maas HLMM, Bonnier B, Grevink JS, van Kuilenburg RF,
(1999) Three state-of-the-art methods for condition monitoring. IEEE Transactions on
Industrial Electronics 46(2): 407–416
Hargis SC, Taylor H, Gozzo JS, (1982) Condition monitoring of marine diesel engines
through ferrographic oil analysis. Wear 90(2): 225–238
Hofmann SL, (1987) Vibration analysis for preventive maintenance: a classical case history.
Marine Technology 24(4): 332–339
Hojen-Sorensen PAdFR, de Freitas N, Fog T, (2000) On-line probabilistic classification
with particle filters. Neural Networks for Signal Processing X, Proceedings of the
2000 IEEE Signal Processing Society Workshop 1: 386–395
Hountalas DT, Kouremenos AD, (1999) Development and application of a fully automatic
troubleshooting method for large marine diesel engines. Applied Thermal Engineering
19(3): 299–324
Hubert CJ, Beck JW, Johnson JH, (1983) A model and the methodology for determining
wear particle generation rate and filter efficiency in a diesel engine using ferrography.
Wear 90(2): 335–379
Jakopovic J, Bozicevic J, (1991) Approximate knowledge in LEXIT, an expert system for
assessing marine lubricant quality and diagnosing engine failures. Computers in Industry
17(1): 43–47
Jardine AKS, Ralston P, Reid N, Stafford J, (1989) Proportional hazards analysis of diesel
engine failure data. Quality and Reliability Engineering International 5(3): 207–216
Jardine AKS, Lin D, Banjevic D, (2006) A review on machinery diagnostics and prognostics
implementing condition-based maintenance. Mechanical Systems and Signal Processing
20(7): 1483–1510
Jiang R, Jardine AKS, (2006) Composite scale modeling in the presence of censored data.
Reliability Engineering and System Safety 91(7): 756–764
Johnson JH, Hubert CJ, (1983) An overview of recent advances in quantitative ferrography
as applied to diesel engines. Wear 90(2): 199–219
Liu Y, Liu Z, Xie Y, Yao Z, (2000) Research on an on-line wear condition monitoring
system for marine diesel engine. Tribology International 33(12): 829–835
Logan KP, (2005) Operational experience with intelligent software agents for shipboard
diesel and gas turbine engine health monitoring. 2005 IEEE Electric Ship
Technologies Symposium: 184–194
Lu S, Lu H, Kolarik WJ, (2001) Multivariate performance reliability prediction in real-time.
Reliability Engineering and System Safety 72: 39–45
Moubray J, (1997) Reliability-centred maintenance. Butterworth-Heinemann, Oxford
Murthy DNP, Xie M, Jiang R, (2003) Weibull Models. Wiley
Pontoppidan NH, Larsen J, (2003) Unsupervised condition change detection in large diesel
engines. 2003 IEEE XIII Workshop on Neural Networks for Signal Processing: 565–574
Priha I, (1991) FAKS: an on-line expert system based on hyperobjects. Expert Systems
with Applications 3(2): 207–217
Raadnui S, Roylance BJ, (1995) Classification of wear particle shape. Lubrication
Engineering 51(5): 432–437
Roylance BJ, Albidewi IA, Laghari MS, Luxmoore AR, Deravi F, (1994) Computer-aided
vision engineering (CAVE): quantification of wear particle morphology. Lubrication
Engineering 50(2): 111–116
Roylance BJ, Raadnui S, (1994) Morphological attributes of wear particles: their role in
identifying wear mechanisms. Wear 175(1-2): 115–121
Saranga H, (2002) Relevant condition-parameter strategy for an effective condition-based
maintenance. Journal of Quality in Maintenance Engineering 8(1): 92–105
Scherer M, Arndt M, Bertrand P, Jakoby B, (2004) Fluid condition monitoring sensors for
diesel engine control. Proceedings of IEEE Sensors 2004, 1: 459–462
Sharkey AJC, (2001) Condition monitoring, diesel engines, and intelligent sensor processing.
A DERA/IEE Workshop on Intelligent Sensor Processing: 1/1–1/6
Sun C, Pan X, Li X, (1996) The application of multisensor fusion technology in diesel
engine oil analysis. 3rd International Conference on Signal Processing 2: 1695–1698
Tang T, Zhu Y, Li J, Chen B, Lin R, (1998) A fuzzy and neural network integrated
intelligence approach for fault diagnosing and monitoring. UKACC International
Conference on Control 2: 975–980
Wang HF, Wang JP, (2000) Fault diagnosis theory: method and application based on
multisensor data fusion. Journal of Testing and Evaluation 28(6): 513–518
Wu X, Chen J, Wang W, Zhou Y, (2001) Multi-index fusion-based fault diagnosis theories
and methods. Mechanical Systems and Signal Processing 15(5): 995–1006
Zhang H, Li Z, Chen Z, (2003) Application of grey modeling method to fitting and
forecasting wear trend of marine diesel engines. Tribology International 36(10): 753–756
Zhao C, Yan X, Zhao X, Xiao H, (2003) The prediction of wear model based on stepwise
pluralistic regression. In: Proceedings of International Conference on Intelligent
Maintenance Systems (IMS), Xi'an, China: 66–72
23

Benchmarking of the Maintenance Process at Banverket
23.1 Introduction
To sustain a competitive edge in business, railway companies all over the world are
looking for ways and means to improve their maintenance performance.
Benchmarking is a very effective tool that can assist management in its pursuit of
continuous improvement of operations. The benefits are many, as benchmarking
helps to develop realistic goals and strategic targets and facilitates the achievement
of excellence in operation and maintenance (Almdal 1994).
In this chapter three different benchmarking studies are presented. These are: (1)
benchmarking of the maintenance process for cross-border operations, (2) a study of
the effectiveness of outsourcing of the maintenance process by different track regions
in Sweden, and (3) a study of the level of transparency among the European railway
administrations. In these case studies the focus is on railway infrastructure, excluding
the rolling stock. The outline of the chapter is as follows. An overview of Swedish
railway operation is presented in Section 23.2. The definition and methodology of
benchmarking in general are discussed in Section 23.3. The special demands for
benchmarking of maintenance are described in Section 23.4, and in Section 23.5 the
special considerations arising from the railway context are reviewed, generally for
railways and in more detail for the Swedish context. The case studies are discussed in
Sections 23.6–23.8. The discussion and conclusions are presented in Sections 23.9
and 23.10, respectively.
All the data pertinent to benchmarking of railway operation and maintenance are
retrieved, classified and analyzed in close cooperation with operation and main-
tenance personnel from both infrastructure owners and maintenance contractors.
The chapter discusses the pros and cons, the areas for improvement and the need for
the development of a framework and metrics for benchmarking. The focus of this
chapter is to visualize best practices in maintenance and also to propose means for
improvement in the railway sector, with special reference to railway infrastructure.
560 U. Espling and U. Kumar
(Figure: organisation chart of the deregulated Swedish railway sector, showing
Banverket as client together with the Rail Inspectorate, the in-house contractor
Svensk Banproduktion and external contractors such as Carillion; the Rail Traffic
Administration; and the traffic operators and companies formed from the former SJ,
including SJ AB, Green Cargo AB, Jernhusen AB, EuroMaint AB, ASG AB,
TrainTech AB, Interfleet, TrafficCare AB, Swebus, Sweferry, Nordwaggon AB,
Unigrid AB, MTAB, TGOJ and Connex.)
23.2.1 Maintenance
Many of the European railways have followed a similar evolution. Although many
of the countries of Europe are now members of the European Union, questions are
being raised concerning the transparency of the state-controlled railway sector in
order to make comparisons possible and to find the best practices followed within
the railway business. The European railway sector has gradually started to use
benchmarking so that the different actors may be able to learn from each other.
Such organizational measures are useful to service users and provide a clear system
for translating feedback from the analysis into strategy for corrective actions.
Most of the literature points out the fact that successful benchmarking needs a
good plan specifying what to benchmark, whom to visit (to study the best practice),
when to visit, and what types of resources are required for analysis and implemen-
tation. Often simple studies are completed at little cost and generally have no
follow-up. Good benchmarking, on the other hand, is time- and resource-consum-
ing and has well-structured follow-up plans etc. The selection of the type and scope
of the benchmarking process should be made on the basis of the impact of the
outcome on the critical success factors for the process (Mishra et al. 1998).
A benchmarking exercise is of no value, if the findings are not implemented. In
fact, without implementation it would be a waste of resources. The benefits of
benchmarking do not occur until the findings from the benchmarking project are
realized, and therefore performance improvement through benchmarking needs to
be a continuous process.
23.3.2 Metrics
(Figure: the equipment state and performance are measured during operation and
maintenance and compared with the benchmarked value.)
Wireman (2004) states that the maintenance management impact on the return
on fixed assets (ROFA) can be measured by two indicators, namely:
Maintenance cost as a percentage of the total process, production, or manufacturing cost
Maintenance cost per square foot maintained
core business under control, since planned work vs. unplanned work may have a
cost ratio as high as 1:5. Another rule of thumb concerns a high level of overtime,
which indicates reactive situations in the maintenance process. Since labour is a
large cost driver for maintenance, the amount of overtime can have a large impact
on maintenance costs. Another large cost driver is spare parts (Wireman 2004,
Hgerby 2002).
1. Two neighbouring local track areas sharing a line for railway traffic on
each side of the border. The aim was to compare the maintenance cost,
identify differences and find areas to improve.
2. Internal benchmarking for maintenance contracts in order to find the best
practice and to improve the maintenance contracts.
Benchmarking of the Maintenance Process at Banverket 567
The common denominator between these case studies is the use of benchmarking
methodologies, in order to find out whether benchmarking is useful within the
railway sector. The difference between the case studies lies in their main
objectives.
The metrics and data collected were the costs of operation and maintenance and the
outcome in terms of performance losses. The data were collected for one calendar
year from the systems for accounting, planning, failure reporting and inspection,
and comprised:
Budget vs. performed outcome for maintenance costs
Overhead costs for the local administrations
Maintenance planning
Failure statistics
The inspection remarks
However, the following information and data relevant to the study could not be
collected:
Overhead cost for the contractor (not available due to the competition
between the different contractors)
Man hours (not available, not collected in the client system from the in-
voice)
Traffic volume
Asset age, which was approximately the same (not necessary to collect,
since the traffic mix and volume were the same)
Spare part costs (not available)
23.6.1.1 Normalisation
Since the organisation and accounting structure were almost the same, it was
assumed that the missing data could be disregarded. The amount of normalisation
was restricted to adjusting the currency.
The available data and information were then sorted as shown in Table 23.1. The
maintenance costs were grouped into the categories snow removal, corrective
maintenance and preventive maintenance; see Table 23.2.
Table 23.2. Difference in percentage in maintenance costs between Track Areas A and B

Maintenance activities                                        Difference from Track Area B
Snow removal                                                  +10%
Corrective maintenance, including organisation for
preparedness (emergency service)                              +32%
Preventive maintenance, including inspection                  −62%
The benchmarking result showed that the total maintenance cost per track metre
was approximately the same in the two areas. One of the findings was that the amount
of corrective maintenance was very high in both track areas. A closer investigation
showed that Track Area A had a larger amount of corrective maintenance and
therefore less money for preventive maintenance.
Furthermore the overhead cost and other external costs such as travel costs,
costs for consultancy etc. in Track Area A were much higher compared to Track
Area B. One of the explanations was the geographical isolation of Track Area A
from its own administration, resulting in higher traveling costs and the necessity of
buying consultancy for some services that Track Area B could obtain from its
nearby regional office. Another explanation was that Track Area A had to finance
all its buildings, the electrical power and the cost for the traffic control centre,
while this was taken care of by a separate organization for Track Area B.
It was also possible to find those areas of work that could be mutually co-
ordinated, for example snow removal. However, this was something that needed to
be negotiated and was therefore considered a political matter.
The implementation phase was the responsibility of the national railway ad-
ministrations. The results were mainly used as arguments clarifying why the costs
were so much higher for the railway line in Country A compared with those of
other national lines.
maintenance contract by learning from the experience and knowledge of other re-
gional track areas in this respect.
The benchmarking process followed the standard procedure recommended for
benchmarking as stated in an earlier section (Section 23.3). The study covered nine
local track areas, named Track Areas A–I, and six of these (Track Areas D–I) were
selected for the study and follow-up qualitative interviews.
Before starting the collection of data and other relevant information, the existing
indicators and indices used by maintenance professionals available in the literature
and through professional bodies, for example the EFNMS indices (EFNMS 2006),
were examined for their suitability for the purpose of benchmarking maintenance
practices in different track regions at Banverket. Most of these metrics were not
found suitable for the purpose of this study and therefore actions were initiated to
establish indicators that would facilitate this benchmarking process. Furthermore,
information and data which were planned to be included in the study, namely
details of maintenance-related measures such as maintenance costs, maintenance
hours, material, maintenance vehicle costs, overhead costs etc., were missing or
only available in the aggregate form, due to the competitive situation.
As the deregulation of the railway transport system in Sweden has led to
competition among the traffic companies, it was not possible to get hold of traffic
data, i.e. how the track was used, because this information is being treated as a
business secret by the train operators.
Data from 2002 were collected from the systems for accounting, the failure
reports, the inspection remarks, and the asset information and from the train delay
reports. The following data were collected:
Asset data from BIS: total length of track, total length of operated track,
total number of turnouts, total number of operated turnouts, length of
electrification, number of protected level crossings. An attempt was also
made to define the standard of the assets by their age and the type of traffic
they had been exposed to; this had to be skipped, as it was not possible to
obtain complete data for all the assets and different track lines. The purpose
was to know the intensity of track utilization.
From the accounting system AGRESSO: snow removal and maintenance
costs for one year, defined per maintenance activity corresponding to the
maintenance contract (corrective, predetermined, condition-based etc.) and
cost per asset type (rail, sleeper, turnout etc.).
From BESSY (inspection remark system): the number of inspection re-
marks, classified as remarks requiring immediate attention or deployment
of corrective measures or remarks requiring attention or correction in the
near future (deferred inspections remarks).
From OFELIA (failure report system): failure reports (including asset type
and type of failure, time to fault localization and time to repair, symptoms
and causes, place, date and time), and the time to arrive at the fault location.
The data collected from the accounting system needed normalisation in particular,
due to difficulties in separating normal track maintenance activities from track
renewal activities, as these two concepts were frequently being mixed in the
database. There were also some difficulties in using the prescribed terminology:
misunderstandings in the maintenance context meant that the common structure for
reporting costs back into the system was not always used, and data had to be sorted
into the right categories afterwards. Some track areas were using maintenance
definitions and concepts from other branches, such as the building and construction
industry. Some outliers were also eliminated from the data, especially those
representing special one-time investments made to increase train punctuality or
reduce winter problems.
Cost drivers leading to non-availability of infrastructure for train operation or
affecting safety were identified. The respective train delay hours were also re-
trieved. The cost drivers for the infrastructure were failure or defects in rail,
sleepers, rail joints, turnouts, level crossings, and catenaries (overhead wire). On
further investigation it was found that the cost related to sleepers could be classi-
fied as outliers, because a large amount of the sleepers replaced in the 1990s were
delivered with inbuilt defects. These sleepers are being dealt with in a replacement
phase within the framework of a large project.
In order to find the best internal practice within the organization, two parameters,
the amount of corrective maintenance and the management indicator return on
fixed assets (ROFA), were used.
Track Areas A–I are the nine track areas, D–I are those selected by the
infrastructure manager for qualitative interviews, and Track Areas A–C are references.
The data pertaining to various costs, corrective maintenance, condition-based
maintenance and failure and delay statistics from Track Areas A–I for the year
2002 are given in Tables A.1–A.7 of the Appendix to this chapter.
When using the ROFA parameter and the rule of thumb concerning the lowest
amount of corrective maintenance, Track Areas B, G, C and H were the best
performers (see Figure 23.3), and the ROFA measurement showed a tendency that
more money spent per track metre was associated with less corrective maintenance;
see Figure 23.4.
Figure 23.3. Share of corrective maintenance and preventive maintenance for the nine track
areas studied (Espling 2004)
Figure 23.4. Maintenance cost per square metre of track area (Espling 2004)
Another comparison was made concerning the maintenance cost per metre
within the framework of the maintenance contract for each track region under
study. Track Areas H, C and G showed the best practice followed; see Figure 23.5.
It was noted that the maintenance cost varies greatly per asset or per track metre
unit among the compared track areas due to the asset standard, type of wear,
climate and type of traffic.
To compare the performance, the amount of functional failures and train delay
hours were listed as failure or delay hours per metre or per cost driving asset; see
Figure 23.6. Even here the best performance was shown by Track Areas G and H.
Figure 23.5. Maintenance cost per track metre within the maintenance contract for Track
Areas A–I, broken down into safety inspection, repair of immediate inspection remarks,
failure repair and snow removal
Figure 23.6. Functional failures and train delay hours per cost-driving asset (inspection
remarks per km; failures per level crossing, per turnout and per km; delay hours per
catenary, per turnout and per track km) for Track Areas A–I
All these results obtained from the comparison of different track regions, in
combination with the content of the maintenance contract defining work
specifications, were used for the gap analysis. The gap analysis was conducted with
the help of interviews with the track area managers for Track Areas D–I. The best
practice criteria were identified with the help of interviews and survey
questionnaires. The best practices were:
Goal-oriented maintenance contracts combined with incentives
Scorecard perspectives, quality meetings and feedback facilitate manage-
ment by objectives
Frequent meetings where top managers from the local areas participate
Forms for cooperation and an open and clear dialogue, for example partner-
ing
Focusing increased preventive maintenance on assets with frequent functional
failures and a high maintenance cost, e.g. turnouts, will give results
The use of Root Cause analysis
The best practices identified from the benchmarking study were immediately
implemented in the new purchasing procedures and documents. These were used
for floating tenders and for new contracts by the infrastructure manager for the
local track area initiating this benchmark, and resulted in maintenance contracts at
a much lower price with better control of quality and performance. The bench-
marking study also identified the best practice for gaining control over backlogs by
using SMS and other internet-based tools. Besides these, the maintenance contract
was also provided with information about goals, objectives and expected incentives
related to the execution of the maintenance contracts.
23.8.1 Metrics
In this study, many official documents, such as annual reports and regulation letters
and documents, were studied in detail in order to gain insight into the types of
measures, key performance indicators and indices used by the railway administra-
tions investigated (Åhrén et al. 2005). The collected measures were then compared
with those recommended by EFNMS in order to see if these could be used in future
benchmarking exercises. Rather soon it was found that the EFNMS indices were
developed for factories and plants and were not suitable for studying or bench-
marking the performance of infrastructures, as they did not consider the type of
asset, the age of the asset, the asset condition or the practice of outsourcing main-
tenance work in an open market.
23.8.2 Normalisation
Since data were qualitative in nature, no normalisation was carried out for the pur-
pose of this study.
Benchmarking of the Maintenance Process at Banverket 575
The next step was to group the measurements according to the unit which they measured; for example, cost went into the economy group.
The parameters collected and reported by the infrastructure managers were then classified into different categories of common denominators. These categories comprised the following: strong denominators (SDs) collected by everyone, medium denominators (SDm) collected by more than 50%, weak denominators (SDw) collected by less than 50%, and finally some indicators (I), also identified as SDs, presented as percentage values; see Figure 23.7. The results show that economic values, safety, and traffic are strong denominators, followed by quality, assets, and labour. It is important to note that traffic is the total traffic volume at a national level. These parameters could later be used to develop new benchmark measures, e.g. maintenance costs per staff member and the number of accidents per traffic volume.
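The classification rule above can be sketched as a small function. All administration and measure names below are invented for illustration; only the thresholds (collected by all, by more than 50%, by less than 50%) come from the text.

```python
# Sketch of grouping collected measures into strong, medium and weak
# common denominators, following the thresholds described in the text.
# Administration and measure names are invented for illustration.

def classify_denominators(collected_by, administrations):
    """Map each measure to SDs / SDm / SDw by the share of
    administrations that collect it."""
    n = len(administrations)
    classes = {}
    for measure, collectors in collected_by.items():
        share = len(collectors) / n
        if share == 1.0:
            classes[measure] = "SDs"   # strong: collected by everyone
        elif share > 0.5:
            classes[measure] = "SDm"   # medium: more than 50%
        else:
            classes[measure] = "SDw"   # weak: less than 50%
    return classes

admins = ["BV", "JBV", "RHK", "NR"]   # fictitious administrations
measures = {
    "maintenance cost": {"BV", "JBV", "RHK", "NR"},
    "track quality index": {"BV", "JBV", "RHK"},
    "staff turnover": {"BV"},
}
print(classify_denominators(measures, admins))
```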
Today the comparable indicators are:
- Corrective maintenance cost / total maintenance cost including renewal
- Total maintenance cost / turnover
- Maintenance and renewal costs / cost of asset replacement
- Maintenance cost / track metre
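A minimal sketch of how these four ratio indicators could be computed; every cost figure below is invented for illustration and does not come from the Banverket data.

```python
# Illustrative computation of the four comparable indicators listed above.
# All cost figures (in thousands of SEK) are invented for illustration.

corrective_cost  = 24_000   # corrective maintenance cost
preventive_cost  = 14_000
renewal_cost     = 30_000
turnover         = 250_000
replacement_cost = 900_000  # cost of replacing the asset
track_metres     = 400_000  # ~400 km of track

total_maintenance = corrective_cost + preventive_cost + renewal_cost

indicators = {
    "corrective / total maintenance (incl. renewal)": corrective_cost / total_maintenance,
    "total maintenance / turnover": total_maintenance / turnover,
    "maintenance + renewal / asset replacement cost": total_maintenance / replacement_cost,
    "maintenance cost per track metre (t SEK/m)": (corrective_cost + preventive_cost) / track_metres,
}
for name, value in indicators.items():
    print(f"{name}: {value:.3f}")
```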
When comparing the outcomes of the findings, only highly aggregated measures were used for the purpose of analysis, in terms of:
- Economy
- Punctuality
- Safety
- Number of staff employed
- Track quality
- Total traffic volume divided into passenger and freight kilometres
They can be used as benchmarking measures, i.e. lag indicators showing past performance. This indicates that these areas of interest are important for every railway administration studied. It is also important to note that the identified measures can be defined as outcome measures of the railway maintenance process. It has not been possible to find any measures reflecting the actual maintenance performance. This can probably be explained by the fact that the maintenance activities are carried out by either in-house or external maintenance contractors (Åhrén et al. 2005).
Some of the maintenance performance indicators are used by various organizations and provide railways with an opportunity to benchmark their operations internationally to improve their performance. One of the findings of the studies is that parameters are missing regarding traffic volume, infrastructure age, and the history of the performed maintenance.
[Figure 23.7. Number of collected measures per category (economy, safety, traffic, quality, assets, labour, environment, asset history, and others), grouped into the classes Others, I, SDm and SDw; the vertical axis shows the amount (0–30).]
23.9 Discussion
The reason why most plants do not enjoy best practices in maintenance is that they cannot picture how to structure a sustainable improvement process (Oliverson 2000). Benchmarking can then be a tool for waking up organisations and their management in order to find improvement areas that create more value from the business process. However, along the way there are many pitfalls to be aware of, such as starting the process without knowing the starting point and the destination (Oliverson 2000; Wireman 2004). Other pitfalls are:
- Doing only quantitative benchmarking. Quantitative numbers tell only part of the story, and the difficulty is to start the sustainable improvement process by focusing on qualitative benchmarking (Oliverson 2000). If the organisation lacks maturity or self-knowledge, it just glances at the figures and continues to do as it has always done before.
- Rejection of the results. Managers often overestimate their performance and react with disbelief to feedback that tells them that their plants are merely mediocre (Wiarda and Luria 1998).
- Not being aware of the need for normalisation of data, including the problem of outliers or comparing apples with bananas.
- Not finding the enablers (Wireman 2004).
- Using benchmarking data as a performance goal.
- Believing that it is as easy as just copying the best practice into one's own organisation, rather than learning.
- Unethical benchmarking.
The methodologies for performing benchmarking for plants are rather well
developed, but need to be adapted for infrastructure. Today it is difficult to
23.10 Conclusion
Stating that the benchmarking of maintenance provides gains with relatively little effort is a truth that needs some qualification. First of all, the theory of maintenance is a rather young science, which has resulted in a lack of common nomenclature and understanding of maintenance through value. This is one of the reasons why it is difficult to define what is included in maintenance and where to put the boundaries for renewal. Different structures may also be in use to describe what operation is and what maintenance is, and for grouping maintenance into preventive and corrective maintenance. Outsourcing maintenance has become popular in recent years, and this makes it difficult to obtain all the necessary measurements, especially if the outsourcing is carried out under a performance contract (lump sum, fixed price). The assets' complexity and condition are also difficult to compare and measure.
The multitude of entities involved in the railway systems after their restructur-
ing has made it considerably difficult to locate the organization responsible for the
problems encountered and to ascertain the course of action to be taken to rectify
them.
Benchmarking cannot be used if its results are not implemented. The benefits
from benchmarking do not occur until the findings from the benchmarking project
are implemented and systematically followed up and analyzed against the set
targets and goals.
The results from the three benchmarking studies presented show that bench-
marking is a powerful tool and its methodology can be used by other industries.
Since the focus of these case studies is the benchmarking process and not the con-
tinuous improvement process, it is important to point out the need for empowered
enablers, who will be responsible for identifying the problem, finding a solution to
the problem and implementing the solution and the continuous improvement
processes. The case studies also show that there is further improvement to be made in order to carry out the whole benchmarking process, including implementation, in an integrated manner.
23.12 Acknowledgements
The authors are grateful to Banverket (the Swedish Rail Administration) for
sponsoring this research work and providing information and statistics through free
access to their database.
Appendix
Table A.1. Failure and delay statistics from Track Areas A–I for the year 2003
(Columns: track area; train delay h/track km; train delay h/turnout; train delay h/catenary km; failures/track km; failures/turnout; failures/crossing; inspection remarks/track km. The table data are not reproduced here.)
Table A.2. Cost of various maintenance activities in thousands of SEK for each track area for the year 2003
(Columns: track area; snow removal; corrective maintenance; preventive maintenance; contract sum.)
A 15,325 24,189 14,130 53,644
B 16,801 17,792 12,941 47,534
C 12,908 28,728 10,863 52,553
D 22,085 46,772 20,537 89,394
E 18,074 44,168 21,532 83,774
F 8,250 39,181 15,991 63,442
G 4,336 22,050 26,388 52,774
H 3,041 22,854 19,131 45,026
I 4,976 46,414 31,803 83,193
Normalisation is necessary due to the investment of extra money for just one year to enhance the preparedness to deal with failures causing train delays. The figures in Table A.2 are the figures before normalisation.
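A minimal sketch of this normalisation idea (all figures below are invented, not taken from the tables): a one-off extra-preparedness cost booked only in 2003 is deducted before computing a comparable SEK-per-failure figure.

```python
# Sketch of the normalisation step behind Tables A.3 and A.4: a one-off
# "extra preparedness" investment made only in 2003 is deducted before
# cost-per-failure figures are compared. All numbers are invented.

def normalised_cost_per_failure(org_cost_tsek, emergency_cost_tsek,
                                extra_preparedness_tsek, n_failures):
    """Return (normalised total cost in t SEK, SEK per failure)."""
    total = (org_cost_tsek - extra_preparedness_tsek) + emergency_cost_tsek
    return total, total * 1000 / n_failures   # t SEK -> SEK per failure

total, per_failure = normalised_cost_per_failure(5000, 7000, 1500, 4000)
print(total, per_failure)
```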
Table A.3. Costs in thousands of SEK for corrective maintenance due to failure reports from Ofelia for the year 2003
(Columns: track area; maintenance organisation (personnel, machines, spare parts); emergency organisation; actual cost; fixed price (lump sum); total cost (t SEK); SEK/failure.)
A 2880a 7,989 10,869 5933
B 4416a 6,145 10,861 5273
C 3732a 4,128 7,860 4690
D 4701 11,448 16,150 5379
E 4776 16,078 20,854 5073
F 4884 14,095 18,897 5530
G 12,686 5838
H 3512a 7,785 11,444 6065
I 20,274 6304 28,246 145
a Extra preparedness 2003
Table A.4. Cost statistics for corrective maintenance triggered by the failure reporting system Ofelia (in thousands of SEK) after normalisation
(Columns: track area; maintenance organisation; emergency organisation; actual cost; fixed price (lump sum); total cost (t SEK); SEK/failure.)
A 7,989 7,989 1832
B 2156 6,145 8,601 2060
C 1472 4,128 5,600 1676
D 4701 11,448 16,150 3002
E 4776 16,078 20,854 4111
F 4884 14,095 18,897 3417
G 12,686 2173
H 3512 7,785 11,367 1887
I 20,274 6,304 28,246 5490
Table A.5. Reported corrective maintenance caused by inspection remarks classifying faults
as requiring immediate repair; also including activities such as inspection and condition-
based and predetermined maintenance that should have been booked under other codes in
the accounting system (before normalisation of the data)
(Columns: track area; inspection remarks calling for immediate repair; mixes of inspection remarks calling for immediate repair and CBM remarks; inspection cost including inspection remarks calling for immediate repair; operational actions due to predetermined maintenance; care of electrical assets due to predetermined maintenance; condition-based maintenance; total cost.)
A 13,320 13,320
B 6,931 6,931
C 12,355 1485 7081 20,921
D 16,361 7614 3558 3091 30,638
E 10,864 1962 4732 1486 19,044
F 9,963 3194 4289 2756 168 20,383
G 9,346
H 11,107 303 11,410
I 18,169 18,168
Table A.6. Reported corrective maintenance caused by inspection remarks classifying faults
as requiring immediate repair; also including activities such as inspection and condition-
based and predetermined maintenance that should have been booked under other codes in
the accounting system (after normalisation)
(Columns: track area; inspection remarks calling for immediate repair; inspection remarks calling for immediate repair booked under inspection; corrective maintenance booked as inspection in the accounting system; new total cost.)
A 13,320 13,320
B 6,931 6,931
C 12,355 995 13,350
D 16,361 1904 1506 19,771
E 10,864 491 1553 12,908
F 9,963 799 8 10,770
G 9,346
H 11,107 11,410
I 18,169 916 19,084
Table A.7. Condition-based maintenance bought as extra orders in thousands of SEK, but
including the so-called special maintenance activity
Track area Original accounting sum Minus defective sleepers New Sum
A 32,319 32,319
B 43,831 43,831
C 44,139 44,139
D 6,607 6,607
E 81,720 60,913 20,807
F 53,797 27,972 25,825
G 50,753 50,753
H 45,198 7,680 37,518
I 63,426 12,722 51,004
23.13 References
Almdal, W. (1994), Continuous improvement with the use of benchmarking, CIM Bulletin, Vol. 87, No. 983, pp. 21–26.
Burke, C.J. (2004), 10 Steps to Best-Practices Benchmarking. http://www.qualitydigest.com/feb/bench.html
Campbell, J.D. (1995), Uptime: Strategies for Excellence in Maintenance Management, Productivity Press, Portland, US.
Dunn, S. (2003), Benchmarking as a Maintenance Performance Measurement and Improvement Technique. Assetivity Pty Ltd, http://www.plant-maintenance.com/maintenance_articles-Performance.shhtml
EFNMS (2006), http://www.efnms.org/efnms/publications/13defined101.doc
Espling, U. (2004), Benchmarking av Basentreprenad år 2002 för drift och underhåll, Research Report, LTU 2004:16 (in Swedish).
Hägerby, M. and Johansson, M. (2002), Maintenance performance assessment: strategies and indicators. Master thesis, Linköping, Linköpings tekniska högskola, LiTH IPE Ex arb 2002:635.
Kaplan, R.S. and Norton, D.P. (1992), The Balanced Scorecard: the measures that drive performance, Harvard Business Review, Jan–Feb (1992), pp. 71–79.
Larsson, L. (2002), Utvärdering av underhållspiloterna, delrapport 1. Banverket F02-713/AL00 (in Swedish).
Liyanage, J.P. and Kumar, U. (2003), Towards a value-based view on operations and maintenance performance management, Journal of Quality in Maintenance Engineering, Vol. 9, pp. 333–350.
Malano, H. (2000), Benchmarking irrigation and drainage performance: a case study in Australia. Report on a Workshop, 3 and 4 August 2000, FAO, Rome, Italy.
Mishra, C., Dutta Roy, A., Alexander, T.C. and Tyagi, R.P. (1998), Benchmarking of maintenance practice for steel plants, Tata Search 1998, pp. 167–172.
Moulin, M. (2004), Eight essentials of performance measurements, International Journal of Health Care Quality Assurance, Vol. 17, No. 3, pp. 110–112.
Oliverson, R.J. (2000), Benchmarking: a reliability driver, Hydrocarbon Processing, August 2000, pp. 71–76.
Ramabadron, R., Dean Jr, J.W. and Evans, J.R. (1997), Benchmarking and project management: a review and organisational model, Benchmarking for Quality Management & Technology, Vol. 4, No. 1, pp. 437–458.
Stalder, O., Bente, H. and Lüking, J. (2002), The Cost of Railway Infrastructure. ProM@ain, Progress in Maintenance and Management of Railway Infrastructure, 2, pp. 32–37. http://promain.server.de
Varcoe, B.J. (1996), Business-driven facilities benchmarking, Facilities, Vol. 14, No. 3/4, March/April, pp. 42–48, MCB University Press.
Wiarda, E.A. and Luria, D.D. (1998), The Best-practice Company and Other Benchmarking Myths.
Wireman, T. (1998), Developing Performance Indicators in Maintenance. New York: Industrial Press Inc.
Wireman, T. (2004), Benchmarking Best Practice in Maintenance Management. New York: Industrial Press Inc.
Zairi, M. and Leonard, P. (1994), Practical Benchmarking: the Complete Guide. London: Chapman and Hall.
Zoeteman, A. and Swier, J. (2005), Judging the merits of life cycle cost benchmarking, in Proceedings of the International Heavy Haul Association Conference, Rio de Janeiro, June.
Åhrén, T., Espling, U. and Kumar, U. (2005), Benchmarking of maintenance process: two case studies from Banverket, Sweden, in Conference proceedings of the 8th Railway Engineering Conference, London, June 29–30.
Åhrén, T. and Espling, U. (2003), Samordnet/Felles drift av Järnvägen Kiruna–Narvik (confidential). Luleå, Luleå tekniska universitet (in Swedish).
24
Integrated e-Operations–e-Maintenance:
Applications in North Sea Offshore Assets
Jayantha P. Liyanage
24.1 Introduction
There is clear growing interest today in the development and use of e-maintenance concepts for industrial facilities. This is particularly seen in the offshore oil and gas (O&G) production environment in the North Sea in relation to a major re-engineering process termed integrated operations (IO), which began in 2004–2005 as a new development scenario for the offshore industry (OLF 2003). Major challenges to conventional operations and maintenance (O&M) practice have been seen as unavoidable under this new IO initiative. Subsequently, the industry began to develop serious interest in novel and smart solutions for O&M. The developments began in 2005, seeking long-term changes to conventional O&M practice. The change process was relatively slow during the 2005–2006 period, but seemingly has gathered gradual and steady pace by now. This is a large-scale change, and hence the current plan is to realize fully functional e-operations–e-maintenance status by the years 2012–2015 or so. Even though the integrated e-operations and e-maintenance applications in the North Sea are still at their inception, the learning process and the state of current knowledge can be very valuable for similar efforts in the development and implementation of novel solutions in other industries and/or regions of the world.
Current developments in Norway exemplify that the growing smart use of advanced information and communication technology (ICT) solutions is a principal driving factor in the development and implementation of novel solutions to realize e-maintenance (Liyanage and Langeland 2007; Liyanage et al. 2006). In principle, it seeks to establish better offshore-onshore connectivity and interactivity, enhancing decisions and work processes. The emerging O&M practice will be based on a smart blend of application technologies, novel managerial solutions, new organizational forms, etc. to enable 24/7 online real-time operating modes. The new set of O&M solutions for North Sea offshore assets is not simply about the use of some form of core technologies for electronic data acquisition and so on, but a large-scale re-engineering process dedicated to making a significant change to
586 J. Liyanage
technology in this context. This section also underlines some of the important non-
technical issues that play pivotal roles in terms of being fully integrated and fail-
safe.
political change processes. The trends of deviation from conventional wisdom and practices have become more and more clear, seeking to adopt creative, innovative, and smart solutions to manage complex systems for commercial advantage (During et al. 2004; Hosni and Khalil 2004; Russell and Taylor 2006). With the growth of business uncertainties, the enterprise risk profile has become more complex, demanding more flexible, collaborative, and open strategies to support various operational activities in industrial plants and facilities. The emerging commercial environment has by far already indicated greater reliance on new technological and managerial solutions to manage important asset processes such as O&M, establishing a new landscape for commercial activities. This seems to be a generic trend among almost all commercial business sectors, though to varying degrees, where dependence on advanced technological solutions to manage complex technical systems is rapidly growing. The resulting environment will obviously be very dynamic, enabling key stakeholders of complex technical systems to remain intact within an extended live network (Wang et al. 2006).
The production, manufacturing, and process industries are directly impacted by the new demands and the wave of subsequent changes. Technologically complex and high-risk businesses in particular cannot afford to divert their management strategies for complex assets away from the mainstream technology-driven change. Today, different industrial sectors are seen adopting various novel and integrated solutions to manage their industrial assets and internal processes to realize major commercial benefits. More often, rapid advancement in information and communication technologies (ICT) has been very catalytic to the progress in technology applications (e.g. diagnostic technologies) and data management solutions, particularly for complex systems such as offshore oil and gas (O&G) production platforms.
O&G activities on the Norwegian Continental Shelf (NCS) began in the early 1970s with the discovery of the great Ekofisk asset. Ever since, the NCS has been a major supplier of oil to the world energy market. Today, after more than 30 years of continuous production, the NCS has reached its peak level. Despite the fact that the NCS foresees a gradual decline after 2010 or so, the remaining potential is known to be substantial. But the future is known to hold a unique set of challenges, with a major need to enhance recovery efficiency so that the commercial lives of major production assets can be extended by another 40–50 years. By 2003–2004, the forthcoming challenges to O&G exploration and production activities in the North Sea became very obvious. The major part of the industry became relatively more inclined to resort to advanced application technologies to address underlying commercial risks. At the same time, the industry has been undergoing some other challenges widely acknowledged as serious impediments to future growth on the NCS. For instance, the industry has been experiencing major setbacks in attracting talent and in centralizing core competencies. The problem has been further aggravated by the ageing workforce, with no suitable remedy to solve competency gaps. Industry restructuring has been seen by the majority as a feasible solution to provide tighter integration and partnerships with the knowledge industry. Table 24.2 illustrates the complex set of economical and technical drivers that challenged the conventional practices in the North Sea O&G production environment.
Integrated e-Operations–e-Maintenance: Applications in North Sea Offshore Assets 593
Table 24.2. Technical and economical factors that contributed to a step-change in North Sea asset management practice, also introducing changes to conventional O&M practices

The emerging technological and business environment has given its own solutions to counter-attack major problems. Such solutions seem to be feasible through application technologies and ICT solutions, business-to-business collaboration forms, closer inter-disciplinary integration to jointly manage offshore activities, and standardized platforms for dynamic data and knowledge sharing.

Various other industrial circumstances have also constantly been demanding some form of change to the conventional industry practice. This is primarily due to the remaining substantial un-tapped value potential, a major need for more open and flexible partnerships, emerging competency gaps, obsolete technologies and ageing equipment, and more complex and new kinds of challenges in production settings, etc.
[Figure 24.1. The e-approach to O&M in North Sea assets in principle relies on fourfold aspects. Figure labels include: advanced technologies; digital infrastructure; sources of data; asset operator; distributed control and wireless network; data hub; IP-VPN/ADSL; fiber-optic data network; offline-online technical data access; offshore O&M contractors.]
The figure highlights that the functional landscape for the establishment of an e-based O&M setting in the North Sea is a relatively complex combination of various technical as well as social elements. The synergy among at least three elements is critical in the development of the necessary technical infrastructure, i.e.:
- Advanced process and safety technologies implemented in equipment in offshore assets that allow real-time data acquisition and transfer
- A large-scale ICT network with an appropriate bandwidth, using wireless, fiber-optic and web-enabled capabilities, to enable sharing of acquired data and communication traffic on a 24/7 basis
- Well-equipped onshore expert centers with built-in advanced data management capabilities and collaborative technologies to process and interpret data, and to stay connected with offshore assets as well as other partners to interact online for enhancing decisions and activities
Such a large-scale technical setting can perhaps be considered the heart of e-operations–e-maintenance activities, as it allows:
- Integration of geographically dispersed knowledge centers, creating a virtual workplace
- Establishment of 24/7 online net-based connectivity to provide easy and fast access to remote experience and knowledge
- Access to a reliable IT network with a higher bandwidth and speed to acquire, process, and interpret volumes of real-time data
The largest implication of such a setting is by far the significant improvement to decision-making and work processes. The connectivity and interactivity between offshore and onshore, as well as between different onshore-based competence groups and knowledge centers, allow more effective decision loops and more coordinated planning and execution of O&M activities (see Figure 24.3). Smart combination of real-time data with multi-disciplinary expertise has major
The targeted benefits of these developments within O&M, together with those in other technical disciplines, are expected continuously over a 30–40 year time span. The key value creation elements identified include, for example, methods and techniques to reduce uncertainty in data interpretation, reduced cycle time on decisions, better planning and work coordination procedures, and reduced offshore operating costs through offshore-onshore work re-organization and prolonged maintenance intervals. The overall commercial benefits expected include an approximately 10% increase in production, a 30–40% reduction in operating costs, and significant improvements in health and safety performance.
24.7 Key features of the e-Approach for O&M in North Sea Assets
As aforementioned, integrated e-operations–e-maintenance is not just an effort to introduce new technologies. It in fact represents a change in the use of technical tools, advanced methods, and joint expertise to make O&M processes more effective and efficient. It introduces a novel scenario to manage the process, stepping out of convention. However, the successful implementation and use of the e-approach depend heavily on the synergy between remote diagnostic and prognostic technology, onshore expert centers directly connected to offshore collaborative rooms, and net-based web-enabled ICT solutions (Figure 24.4).
Figure 24.4. The solid foundation of the e-approach in O&M demands a synergy between three main components (among them offshore-onshore expert centers) that establish a complex and interactive technical system
For a long time it had mostly been a challenge to make effective use of condition monitoring on the Norwegian shelf (Ellingsen et al. 2006). There had been ad hoc use of some diagnostic technologies, such as vibration monitoring on heavy rotating equipment, thermography on electrical equipment, and oil analysis, but mainly on a discontinuous need-by-need basis. In most cases the use of diagnostic expertise had been limited to on-site tapping and data acquisition after reporting of a malfunction or some abnormal technical indications. But today, many O&G producers are keen on capitalizing on the inherent potential provided by the digital infrastructure in the North Sea and advanced technologies. This implies that the use of condition monitoring to support technical and safety integrity is strengthened in the integrated environment, since:
- Data acquisition techniques have developed to an extent that experts can tap signals in real time at onshore support centers (OSCs) on critical equipment
- Online communication capability has allowed joint interpretation and trend analysis, for instance coupling to the asset operator's OSC and comparing with set alarm levels
- Expert centers have acquired the technological capability to secure connections to several offshore assets in such a way that those assets can be served simultaneously if necessary
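The comparison against set alarm levels mentioned above can be sketched as follows; the vibration thresholds and readings are invented for illustration and do not reflect any operator's actual settings.

```python
# Hypothetical sketch of comparing a streamed vibration reading against
# pre-set alarm levels at an onshore support centre. All thresholds and
# readings below are invented for illustration.

ALARM_LEVELS = {"warning": 4.5, "alarm": 7.1, "trip": 11.0}  # mm/s RMS (assumed)

def evaluate(rms_mm_s):
    """Map a vibration RMS reading to the highest alarm level exceeded."""
    status = "normal"
    for level, threshold in sorted(ALARM_LEVELS.items(), key=lambda kv: kv[1]):
        if rms_mm_s >= threshold:
            status = level
    return status

print(evaluate(3.0), evaluate(5.2), evaluate(12.4))
```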
The use of advanced networking technologies is in fact a landmark of integrated O&M solutions for North Sea assets, as opposed to offline technologies. It has brought some unique capabilities for sharing expertise. With the rapid uptake of portable communication technology, offshore personnel can also communicate effectively with OSCs, allowing more sensible use of data acquisition technologies. The current setting has given a new dimension to the diagnostic and prognostic efforts for North Sea assets today.
The OSC at SKF-Norway is, for instance, a CBM expert center that has remote diagnostic and prognostic capabilities and serves various operators in the Norwegian and Danish O&G sectors. Over the past few years it has carried out online remote vibration monitoring of critical machinery on offshore production platforms
Onshore support centres (OSCs) can be considered the active nodes of the integrated e-operations–e-maintenance setting. Such OSCs are established on the premises of both O&G producers and third parties. The functional characteristics of OSCs can vary from one to another depending on the contractual roles and specific assignments of external organizations. For instance, ConocoPhillips, as the operator of the Ekofisk asset, has two such onshore centers. One of them is called the onshore operational center (OOC) and has built-in integrated solutions for O&M planning, logistics, and other production and operation related activities (Figure 24.5).
Figure 24.5. The landscape of onshore support centers (OSCs) with built-in collaborative and decision support technologies (3D technologies and simulations landscape; real-time monitoring landscape); such centers are the active nodes of the integrated e-operations–e-maintenance environment on the NCS (courtesy: ConocoPhillips, Norway)
for expansions of substantial scale that can lead to a completely different technological setting and operating mode by the year 2010 or so. The ongoing developments would at some stage be coupled with other technologies, for instance those related to scenario simulations of technical faults and failures using 3D technologies, intelligent watchdog agents for condition prognostics, virtual tools to train O&M crews, etc.
Often, advanced ICT solutions are at the heart of the principal commercial activities of almost all industrial sectors today (Chang et al. 2004; van Oostendrep et al. 2005; Mezgaar 2006). Current developments on the Norwegian shelf have also resorted to such solutions as the basis for inducing the change. Current ICT solutions are a technical blend of more centralized LANs, primarily localized within organizational boundaries, and large-scale WAN solutions that open up transaction routes for complex business-to-business (B2B) traffic. In fact, the specific need for such robust integrated solutions for the O&G industry in the North Sea has largely been growing over the last 2–3 years, demanding more common platforms, for instance to manage complex O&M and other plant data. The large-scale ICT network established in the North Sea is called the Secure Oil Information Link (SOIL).
SOIL was introduced to the Norwegian E&P industry in 1998. It is a result of growing demands for integrated data management and B2B communication solutions. SOIL consists of a number of application services actively connecting almost all the business sectors of the Norwegian O&G industry. This network helps establish connectivity and interactivity between different parties, for instance offshore O&M teams, operators' onshore O&M support groups, third-party CBM experts, logistics contractors, etc., through the use of fiber-optic cables and wireless communications. Real-time equipment data can be acquired and jointly analyzed, and results can be exchanged online between these parties, enhancing the ability for shared interpretation and decision-making. In this context, there are two major functional features of SOIL (see also Figure 24.6):
- A highly reliable information and knowledge-sharing network to coordinate and remotely manage O&M activities in North Sea offshore assets regardless of geographical location
- Many-to-many simultaneous authorized connectivity, breaking the conventional one-to-one solution and enhancing collaboration between experts, third-party services, the asset operator, and the offshore crew
The conventional one-to-one setting only enabled connectivity between two distinctive parties, for example between an inspection engineer of a contractor and a maintenance planner of an asset owner. However, with the use of the web-enabled networking solutions available today, a number of distinctive groups can stay connected and interact simultaneously (i.e. many-to-many connectivity). This capability has major effects on improvements to the D2D and D2A processes of O&M in terms of time, cost, and quality.
602 J. Liyanage
Figure 24.6. SOIL's application solutions provide many-to-many connectivity and inter-
activity on a 24/7 online real-time basis to enhance D2D and D2A performance of O&M
From a pure CBM perspective, there is a greater demand for the use of enabling
technologies as integral parts of robust CBM solutions. As the operating environment
steps into a remote mode, where 24/7 access becomes a sensitive issue, the
experts need to ensure a tight technical coupling, for instance between:
- Signal-processing technology, with a series of toolboxes for signal processing
  and system performance evaluation, to track the health of a system/machine
  and provide diagnostic and prognostic information in order to
  achieve the goal of near-zero-downtime performance
- Application software solutions to optimally interpret monitored data signals
  regarding the execution of a maintenance action and to estimate remaining
  useful life (RUL)
The requirement on the Norwegian shelf today is a CBM technology that is not
limited to data acquisition but also integrates advanced solutions with signal
processing and decision-making capabilities, making it a more attractive and
commercially viable solution. In a series of more recent R&D efforts, the Center for
Intelligent Maintenance Systems (IMS) at the University of Wisconsin-Milwaukee and
the CBM Lab at the University of Toronto have developed such an integrated O&M
optimization platform to provide asset owners and operators with an advanced tool
for signal processing and maintenance decision-making (see Jardine et al.
1997; Banjevic et al. 2001). Figure 24.7 shows the multi-sensor performance assess-
ment framework of this technology.
This watchdog agent constitutes a toolbox with modules for signal processing,
feature extraction, degradation assessment and performance evaluation, embedded
in a common software application. It includes signal processing and feature extraction
tools built on Fourier analysis, time-frequency distribution, wavelet packet
analysis and ARMA time series models. The performance evaluation component
uses tools such as fuzzy logic, match matrix, neural networks and other advanced
algorithms. Functionally, the watchdog agent is in principle used to extract features
from a series of signals under a given condition and compare them with a
template model built from signals under a pre-identified normal condition.
The performance evaluation yields a confidence value (CV), which indicates the
health status of the system and is used as the basis for diagnostics and prognostics
under given circumstances. If the data can be directly associated with some failure
mode, then the most recent performance signatures, obtained through the signal
processing and feature extraction modules, can also be matched against signatures
extracted from faulty behavior data for proper decisions.
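As a rough illustration of this template-matching idea (a hypothetical sketch, not the watchdog agent's actual algorithm), a confidence value can be derived by comparing simple time-domain features of a signal against a template built from signals recorded under a known normal condition:

```python
import numpy as np

def extract_features(signal):
    """Simple time-domain features of a vibration-signal segment."""
    return np.array([
        np.sqrt(np.mean(signal ** 2)),   # RMS
        np.ptp(signal),                  # peak-to-peak value
        np.max(np.abs(signal)),          # peak intensity
    ])

def confidence_value(signal, template, scale):
    """Map the distance between current features and the normal-condition
    template to a CV in (0, 1]: near 1 = healthy, toward 0 = degraded."""
    d = np.linalg.norm((extract_features(signal) - template) / scale)
    return float(np.exp(-d))

# Build the template from signals under a pre-identified normal condition
rng = np.random.default_rng(0)
normal = [rng.normal(0.0, 1.0, 2048) for _ in range(20)]
feats = np.array([extract_features(s) for s in normal])
template, scale = feats.mean(axis=0), feats.std(axis=0) + 1e-9

healthy_cv = confidence_value(rng.normal(0.0, 1.0, 2048), template, scale)
faulty_cv = confidence_value(rng.normal(0.0, 3.0, 2048), template, scale)  # higher vibration energy
```

The degraded signal's features sit far from the template in normalized feature space, so its confidence value collapses toward zero while the healthy signal's stays comparatively high.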
[Figure 24.7 diagram: signal-processing modules (time-frequency/wavelet analysis, Fourier analysis, wavelet packet analysis, AR/ARMA modelling, and expert-extracted features such as intensity, peak-to-peak value and RMS) feeding performance-evaluation tools (logistic regression, statistical pattern recognition, PCA, feature-map pattern matching, neural-network pattern matching, hidden Markov model, particle filter).]
Figure 24.7. The potential for further enhancement in the use of advanced CBM technologies
such as intelligent watchdog agents is very evident for North Sea assets (courtesy: CBM
Lab, University of Toronto, Canada)
realization, a major portion of the industry has begun to adapt along a more
cautious, synchronized, and incremental development path. Initiatives by
authorities (e.g. NPD, PSA, etc.) and by socio-political sources (e.g. OLF) are critical
to establishing a more harmonized setting that ensures the necessary levels of safety and
security. Even though a systematic strategy may prolong the integration plan, the
argument is that such a systematic move will have a substantial long-term payback,
whereas a rapid solution would eventually expose major stakeholders to
unforeseen events requiring ad hoc solutions or quick fixes that would
be too costly to bear.
24.9 Conclusion
Commencing from 2003–2004, the Norwegian O&G industry has run a
dedicated program to overcome obvious commercial risks on the NCS. This
program, termed the third efficiency leap, has directly supported the implementation of
integrated e-operations/e-maintenance solutions for offshore assets in the North Sea.
This new practice has greatly challenged the conventional practices of many
disciplines, particularly O&M, demanding technological as well as managerial
change. The new O&M practice places major emphasis on the more active exploitation
of application technologies and new data and knowledge management techniques.
The change process has also begun to re-engineer the industry infrastructure to
actively integrate the O&M expertise of O&G producers with that of the external
knowledge-based industry. The large-scale ICT network called the Secure Oil
Information Link and onshore support centers mainly facilitate the rapid
development within the O&M process. The new setting has already brought major
commercial benefits by streamlining D2D and D2A processes with substantial
improvements in work processes. However, some critical challenges still remain
to be addressed, and the socio-political organizations and authorities are keen on
ensuring fully functional and fail-safe operations. The demand and the interest is to
complete the rest of the journey through more cautious and systematic strategies,
sustaining commercial benefits beyond the year 2050 without exposing the industry
to unwanted or hidden risks that would be too costly to bear.
24.10 References
Arnaiz, A., Arana, R., Maurtua, I., et al., (2005), Maintenance: future technologies,
Proceedings of the IMS (Intelligent Manufacturing System) International Forum, IMS
Forum 2004, Como, Italy, May 17–19, pp. 300–307.
Bangemann, T., Rebeuf, X., Reboul, D., et al., (2006), PROTEUS – creating distributed
maintenance systems through an integration platform, Computers in Industry, 57(6),
pp. 539–551.
Banjevic, D., Jardine, A.K.S., Makis, V. and Ennis, M., (2001), A control-limit policy and
software for condition-based maintenance optimization, INFOR, 39, pp. 32–50.
Bonissone, G., (1995), Soft computing applications in equipment maintenance and service,
ISIE '95, Proceedings of the IEEE International Symposium, 2, pp. 10–14.
Booher, H.R. (ed.) (2003). Handbook of human systems integration, Wiley-Interscience.
Chande, A., Tokekar, R., (1998), Expert-based maintenance: a study of its effectiveness,
IEEE Transactions on Reliability, 47, pp. 53–58.
Chang, Y.S., Makatsoris, H.C., Richards, H.D., (2004), Evolution of supply chain
management: symbiosis of adaptive value networks and ICT, Boston: Kluwer Academic
Publishers.
Djurdjanovic, D., Ni, J., Lee, J., (2002), Time-frequency based sensor fusion in the
assessment and monitoring of machine performance degradation, Proceedings of the
2002 ASME International Mechanical Engineering Congress and Exposition, paper
number IMECE 2002-32032.
Djurdjanovic, D., Lee, J., Ni, J., (2003), Watchdog agent – an infotronics-based prognostics
approach for product performance degradation assessment and prediction, special issue
on intelligent maintenance systems, Engineering Informatics Journal, 17(3–4), pp. 107–189.
During, W., Oakey, R., et al. (ed.) (2004). New technology-based firms in the new
millennium. Elsevier.
Ellingssen, H.P., Liyanage, J.P., Rus, R., (2006), Smart integrated operations and
maintenance solutions to manage offshore assets in the North Sea, Proceedings of the 18th
EuroMaintenance, MM Support GmbH, pp. 319–324.
Emmanouilidis, C., MacIntyre, J., Cox, C., (1998), An integrated, soft computing approach
for machine condition diagnosis, Proceedings of the Sixth European Congress on
Intelligent Techniques & Soft Computing (EUFIT'98), vol. 2, Aachen, Germany, pp.
1221–1225.
Emmanouilidis, C., Jantunen, E., MacIntyre, J., (2006), Flexible software for condition
monitoring, Computers in Industry, 57(6), pp. 516–527.
García, M.C., Sanz-Bobi, M.A., (2002), Dynamic scheduling of industrial maintenance
using genetic algorithms, Proceedings of EuroMaintenance 2002, Helsinki, Finland.
García, M.C., Sanz-Bobi, M.A., Pico, J., (2006), SIMAP: Intelligent system for predictive
maintenance: Application to the health condition monitoring of a wind-turbine gearbox,
Computers in Industry, 57(6), pp. 552–568.
Han, T., Yang, B.S., (2006), Development of an e-maintenance system integrating advanced
techniques, Computers in Industry, 57(6), pp. 569–580.
Hansen, R., Hall, D., Kurtz, S., (1994), New approach to the challenge of machinery
prognostics, Proceedings of the International Gas Turbine and Aeroengine Congress and
Exposition, American Society of Mechanical Engineers, pp. 1–8.
Health and Safety Executive (HSE). (1997). Human and organizational factors in offshore
safety. HSE, UK.
Hosni, Y.A., Khalil, T.M. (ed.) (2004). Management of technology. Elsevier.
Iung, B., (2003), From remote maintenance to MAS-based e-maintenance of an industrial
process, International Journal of Intelligent Manufacturing, 14(1), pp. 59–82.
Jardine, A.K.S., Banjevic, D., Makis, V., (1997), Optimal replacement policy and the
structure of software for condition-based maintenance, Journal of Quality in Maintenance
Engineering, 3, pp. 109–119.
Jardine, A.K.S., Makis, V., Banjevic, D., et al., (1998), Decision optimization model for
condition-based maintenance, Journal of Quality in Maintenance Engineering, 4(2), pp.
115–121.
Jardine, A.K.S., Lin, D., Banjevic, D., (2006), A review on machinery diagnostics and
prognostics implementing condition based maintenance, Mech. Syst. Signal Process., 20(7),
pp. 1483–1510.
Jantunen, E., Jokinen, H., Milne, R., (1996), Flexible expert system for automated on-line
diagnostics of tool condition, Integrated Monitoring & Diagnostics & Failure Prevention,
Technology Showcase, 50th MFPT, Mobile, Alabama.
Khatib, A.R., Dong, Z., Qiu, B., et al., (2000), Thoughts on future Internet based power
system information network architecture, in: Proceedings of the 2000 Power Engineering
Society Summer Meeting, vol. 1, Seattle, USA.
Koc, M., Lee, J., (2001), A system framework for next-generation e-maintenance system,
Proceedings of the Second International Symposium on Environmentally Conscious Design
and Inverse Manufacturing, Tokyo, Japan.
Lee, J., (1996), Measurement of machine performance degradation using a neural network
model, Computers in Industry, 30, pp. 193–209.
Lee, J., (2004), Infotronics based intelligent maintenance system and its impacts to closed
loop product life cycle systems, Proceedings of the IMS2004 International Conference on
Intelligent Maintenance Systems, Arles, France.
Liao, H.T., Lin, D.M., Qiu, H., et al., (2005), A predictive tool for remaining useful life
estimation of rotating machinery components, ASME International 20th Biennial
Conference on Mechanical Vibration and Noise, Long Beach, CA.
Liyanage, J.P., (2003), Operations and maintenance performance in oil and gas production
assets: Theoretical architecture and capital value theory in perspective, PhD thesis,
Norwegian University of Science and Technology (NTNU), Norway.
Liyanage, J.P., Herbert, M., Harestad, J., (2006), Smart integrated e-operations for high-risk
and technologically complex assets: Operational networks and collaborative partnerships
in the digital environment, in Wang, Y.C., et al. (ed.), Supply chain management: Issues in
the new era of collaboration and competition, Idea Group, USA, pp. 387–414.
Liyanage, J.P., Langeland, T., (2007), Smart assets through digital capabilities, in Mehdi
Khosrow-Pour (ed.), Encyclopaedia of Information Science and Technology, Idea Group,
USA.
Liang, E., Rodriguez, R., Husseiny, A., (1988), Prognostics/diagnostics of mechanical
equipment by neural network, Neural Networks, 1(1), p. 33.
Marseguerra, M., Zio, E., Podofilini, L., (2002), Condition-based optimisation by means of
genetic algorithms and Monte Carlo simulation, Reliability Engineering and System
Safety, 77, pp. 151–166.
Mezgár, I., (2006), Integration of ICT in smart organizations, Hershey, PA: Idea Group Pub.
Moore, W.J., Starr, A.G., (2006), An intelligent maintenance system for continuous cost-
based prioritization of maintenance activities, Computers in Industry, 57(6), pp. 595–606.
OLF (Oljeindustriens landsforening / Norwegian Oil Industry Association), (2003), eDrift
for norsk sokkel: det tredje effektiviseringsspranget (eOperations on the Norwegian
continental shelf: The third efficiency leap), OLF (www.olf.no). (in Norwegian)
Palluat, N., Racoceanu, D., Zerhouni, N., (2006), A neuro-fuzzy monitoring system:
Application to flexible production systems, Computers in Industry, 57(6), pp. 528–538.
Perrow, C. (1999). Normal accidents: Living with high-risk technologies, Princeton University
Press.
Roemer, M., Kacprzynski, G., Orsagh, R., (2001), Assessment of data and knowledge fusion
strategies for prognostics and health management, IEEE Aerospace Conference
Proceedings, vol. 6, pp. 62979–62988.
Russell, R.S., Taylor, B.W., (2006), Operations management: Quality and competitiveness
in a global environment, Hoboken, N.J.: Wiley.
Sanz-Bobi, M.A., Toribio, M.A.D., (1999), Diagnosis of electrical motors using artificial
neural networks, IEEE International Symposium on Diagnostics for Electrical Machines,
Power Electronics and Drives (SDEMPED), Gijón, Spain, pp. 369–374.
Sanz-Bobi, M.A., Palacios, R., Munoz, A., et al., (2002), ISPMAT: Intelligent system for
predictive maintenance applied to trains, Proceedings of EuroMaintenance 2002,
Helsinki, Finland.
Swanson, L., (2001), Linking maintenance strategies to performances, International Journal
of Production Economics, 70, pp. 237–244.
van Oostendorp, H., Breure, L., Dillon, A., (2005), Creation, use, and deployment of digital
information, Mahwah, N.J.: Lawrence Erlbaum Associates.
Wang, W., (2002), A stochastic control model for on line condition based maintenance
decision support, Proceedings of the Sixth World Multiconference on Systemics,
Cybernetics and Informatics, Part 6, vol. 6, pp. 370–374.
Wang, W.Y.C., Heng, M.S.H., Chau, P.Y.K., (2006), Supply chain management: Issues in
the new era of collaboration and competition, Idea Group Publishing.
Yager, R., Zadeh, L., (1992), An introduction to fuzzy logic applications in intelligent
systems, Kluwer Academic Publishers.
Yang, B.S., Lim, D.S., Lee, C.M., (2000), Development of a case-based reasoning system
for abnormal vibration diagnosis of rotating machinery, Proceedings of the International
Symposium on Machine Condition Monitoring and Diagnosis, Japan, pp. 42–48.
Yen, G.G., (2003), Online multiple-model-based fault diagnosis and accommodation, IEEE
Transactions on Industrial Electronics, 50(2).
Yu, R., Iung, B., Panetto, H., (2003), A multi-agent based e-maintenance system with case-
based reasoning decision support, Engineering Applications of Artificial Intelligence, 16,
pp. 321–333.
25
Fault Detection and Identification for Longwall Machinery Using SCADA Data
25.1 Introduction
Despite the most refined maintenance strategies, equipment failures do occur. The
degree to which an industrial process or system is affected by these depends on the
severity of the faults/failures, the time required to identify the faults and the time
required to rectify the faults. Real-time fault detection and identification (FDI)
offers maintenance personnel the ability to minimise, and potentially eliminate one
or more of these factors, thereby facilitating greater equipment utilisation and in-
creased system availability.
This case study describes, in some detail, the application of data-driven fault
detection to an underground mining operation. However specific this application
may be, the concept can be employed on any system of machines, with or without
complex machine-machine or machine-environment interactions, or to individual
plant.
In addition to detailing the implementation of an FDI system in real-time, we
propose a semi-autonomous approach to dealing with inaccurate and incomplete
records of equipment malfunction. Since past equipment performance is often the
principal information source for maintenance planning and evaluation, it is of
utmost importance that this information be as accurate as possible. The method
described allows for varying levels of confidence in the record keeping.
Section 25.2 introduces the longwall mining system, currently the most common form of
underground coal mining. The availability of longwall equipment
systems is low compared to surface systems of similar complexity. The present
approaches towards reducing equipment downtime in longwall mining are summa-
rised in Section 25.3. Common FDI approaches are summarised in Section 25.4.
Two data-driven techniques are used in this study, namely artificial neural net-
works and multi-variate statistics. The availability of quality training data is of
critical importance for either one. The issue is addressed in Sections 25.525.7.
612 D. Bongers and H. Gurgenci

Once the training data set is constructed, the application of the selected FDI
techniques is reasonably straightforward. The application and the results are
summarised in Section 25.8 with concluding remarks in Section 25.9.
                             Operating time
    Longwall availability = ------------------------------------- × 100%
                             Operating time + Maintenance delays
This KPI looks at the ability to have the machines operate for the time that they
are planned to operate. It is simply the percentage of the available planned time
that they do actually operate. The maintenance delays refer to scheduled and
breakdown maintenance. Some sites include only the breakdown maintenance in
this statistic, which leads to an inflated value of the equipment availability. Such
confusion in terms makes it difficult to benchmark practices between sites. Typical
values for this KPI average between 40% and 60%.
This KPI looks at the ability to sustain the operation of machines over periods
of time. It is a measure of how long, on average, before machines stop due to a
maintenance problem. Typical values average around 1 h.
This KPI looks at the ability to diagnose and remedy maintenance delays once
they have occurred. It is a measure of how long, on average, before machines that
have faulted are returned to operation. Typical values average around 20 min.
KPIs are typically reviewed on a weekly basis.
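As an illustrative sketch (the function name, field layout, and weekly figures are hypothetical, not from the case study), the three KPIs described above can be computed from a weekly delay log as follows:

```python
# Hypothetical sketch: computing the three longwall KPIs from a weekly
# record of maintenance delays (durations in minutes).
def longwall_kpis(planned_minutes, delays):
    """Return (availability %, MTBF minutes, MTTR minutes).

    delays: list of delay durations, covering BOTH scheduled and breakdown
    maintenance (counting only breakdowns inflates the availability figure).
    """
    downtime = sum(delays)
    operating = planned_minutes - downtime
    availability = 100.0 * operating / (operating + downtime)
    mtbf = operating / len(delays)   # mean operating time before a stoppage
    mttr = downtime / len(delays)    # mean time to return to operation
    return availability, mtbf, mttr

# One week of planned production time (100 h) with 50 recorded 20-min delays
avail, mtbf, mttr = longwall_kpis(6000, [20] * 50)
```

With these illustrative numbers the sketch reproduces the typical values quoted above: availability of roughly 83%, MTBF of 100 min, and MTTR of 20 min.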
or stop the occurrence of the fault. Typically, machine design is the responsibility
of the manufacturer, and is a very time consuming and costly operation.
25.3.1.3 Redundancy
The concept of redundancy is widely used at the component level to make machines
more reliable. The principle is rather simple: machine components such as relays,
hydraulic valves or electronic capacitors are deliberately duplicated in the design
in such a way that if one fails, another may take on its role. In some situations it
may be possible to extend this concept of redundancy to entire machines, keeping
complete units on standby in case one should fail.
Although this method in no way improves the inherent reliability of the
individual plant, it effectively makes the plant more available. The use of this sort of
redundancy is therefore suitable in situations where the time to commission
replacement machinery is small in comparison to the fault repair time (or
maintenance time).
Other factors that must be taken into account when considering the use of
redundant machinery are cost and storage. Many industries use machinery for
which purchasing spares is far too expensive, or which is too large to store economically.
25.3.3 Conclusions
All longwall mines employ preventative maintenance. It is one of the largest sub-
operations at any mine, and proves effective in that, when less attention is paid to
maintenance, more faults occur. The optimum level of planned maintenance is
difficult to determine because not all failures are age-related. In fact, many failures
follow an exponential distribution with a constant failure rate that is not related to
age. Redesign and re-engineering of major offenders has been effective in many
instances and it is believed that more improvements can be realized through such
efforts. Redundancy in design has not been fully explored by longwall machine
designers mainly due to the extra cost and the bulk associated with the redundant
systems.
It is the authors' opinion that the future of longwall mining should include
intelligent predictive systems that rely on the currently unused monitoring data.
The possibility of such a system relies on answering the question: does the
currently recorded condition monitoring data contain sufficient information regarding
imminent faults? If such information exists in the data, then information must
also exist that a fault has occurred, and specifically which fault occurred.
The outcome of the work described in this case study, the detection and
isolation of major longwall faults, should therefore be seen as a stepping stone
towards a predictive system for longwall faults/failures. A detection system would
also act as a diagnostic tool, as described above, itself contributing to the goal of
improved longwall availability.
Fault Detection and Identification for Longwall Machinery Using SCADA Data 619
Perhaps the most commonly applied FDI technique is the informal, qualitative
opinion of the expert. Analogous to the diagnostic method applied by a car
mechanic, operators (experts) use typical indicators such as heat, noise, vibration
or poor performance to ascertain the presence and nature of the fault. Typically,
faults detected using this rather subjective FDI technique must be confirmed by
further investigation.
The most rigorous of the FDI approaches, qualitative expert systems, are rule-
based methods usually relying on a large number of if-then relationships. Expert
systems truly require an expert, as they rely heavily on knowledge of the influence
of all faults on system behaviour. This approach can provide excellent FDI;
however, it is not robust to variations in system parameters or the occurrence of
unforeseen faults.
Model-based methods, as the name suggests, rely on a mathematical model of
the system of interest and/or a model of how system faults affect sensor measure-
ments; e.g. see Frank (1990). These techniques typically rely on analytical
redundancy. The principle behind analytical redundancy is simple: for a given
measured input, a mathematical model of the system may be used to generate
estimates of its output; the redundant measurements. Comparison of these and the
real output measurements allows inference to be made regarding the operating state
of the system. A commonly applied regime is that of the Kalman filter, an optimal
state estimator. The extended Kalman filter (EKF) is used when non-linearities are
dominant. In either case, the state representations can be chosen that are most
sensitive to fault induced behaviour. While originally developed for estimating
states in a control system, the Kalman filter has been applied in a wide range of
fields including control, communications, image processing, biomedical science,
meteorology, and geology. For more information on the Kalman filter and its appli-
cations, there are many excellent references available; e.g. Sorenson (1985); Gelb
(1974) and Grewal and Andrews (2001).
The inference drawn from the apparent difference between the model and
system outputs, referred to as the residual, often uses simple statistical limits.
Assumptions enforced for model validity, including the random distribution of
sensor noise, allow chi-squared confidence limits, for example, to be determined for
each element of the residual vector. Expert knowledge is then employed to
establish which faults will be evident in each of these elements.
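A minimal sketch of this residual-testing idea, assuming a known model output and Gaussian sensor noise (the model, signals, fault, and threshold here are illustrative, not taken from the case study):

```python
import numpy as np

def residual_fault_flags(y_measured, y_model, noise_std, threshold=3.84):
    """Flag observations whose normalized squared residual exceeds the
    chi-squared 95% confidence limit for one degree of freedom (3.84)."""
    r = (y_measured - y_model) / noise_std   # normalized residual
    return (r ** 2) > threshold              # True where a fault is indicated

# Toy example: a static model y = 2u with Gaussian sensor noise; a bias
# fault is injected into the second half of the record.
rng = np.random.default_rng(1)
u = np.linspace(0.0, 10.0, 200)
y_model = 2.0 * u                                  # redundant (model) measurements
y_meas = y_model + rng.normal(0.0, 0.1, u.size)    # real sensor measurements
y_meas[100:] += 1.0                                # sensor bias fault

flags = residual_fault_flags(y_meas, y_model, noise_std=0.1)
```

Under normal operation the residual stays within the chi-squared limit apart from the expected ~5% false alarms, while the injected bias of ten noise standard deviations is flagged throughout.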
In contrast to model-based approaches where a priori knowledge of the system
is required, process history based or data-driven methods require only the
availability of a large amount of historical process data. These techniques attempt to
capture the relationship between system measurements and system behaviour, with
the goal of detecting and identifying fault-affected behaviour from future measurements.
By definition, a data driven approach to fault detection and isolation is one in
which the decision criteria are based primarily or wholly on example data. Essen-
tially, a sufficiently large, example dataset representative of each fault of interest is
used to generate an algorithm which maps a single observation input to a single
fault classification output. As new or unseen observations of the systems are pre-
sented, they are subsequently classified (using these mappings), which allows both
the detection and isolation of faults.
Data driven methods are typically applied to systems for which the develop-
ment of accurate state-space or other dynamical equations is not possible or practi-
cal. Difficulty in the determination of accurate dynamical equations is common in
engineering problems for one or more of the following reasons:
Numerous journal and conference papers have been published describing the
application of data driven techniques to fault detection problems. Their popularity
is largely due to the fact that the established algorithms, namely principal compo-
nents analysis (PCA), partial least squares (PLS), linear discriminant analysis,
fuzzy logic discriminant analysis and neural networks, are simple and fast to apply
with little system knowledge. Venkatasubramanian et al. (2003) provide a compre-
hensive review of process history based methods applied to FDI, referencing over
140 such papers.
This section provides just a handful of brief descriptions of data driven FDI
applications, for the sole purpose of illustrating the methods by which the example
data classifications are typically determined.
McKay et al. (1996) described the use of an artificial neural network, or ANN
(see Section 25.4.4) to determine the acceptability of a polymer coating used to
coat copper wire. It was determined that the viscosity of the polymer as it exited
the extrusion process (during manufacture) was the most reliable indicator of
quality, short of destructive testing. A neural network was employed to estimate
this viscosity based on sensor measurements on the extrusion equipment and data
from an attached rheometer.
Network training data was developed over a period of time whereby laboratory
experiments were performed to accurately determine the viscosity of a number of
extruded polymer samples. This form of training data is manually generated, and
relies on a number of supervised sets of measurements.
Also described in McKay et al. (1996) is the use of a neural network integrated
as part of a model based predictive control scheme. In this case, a detailed model
of the process of mixing air and fuel in a combustion engine was developed, and
the model interrogated with a number of initial condition scenarios to generate a
predicted set of measurements. This set of conditions/artificial measurements
formed the training dataset for the neural network.
Chow (2000) describes the use of an ANN to detect and isolate simple faults in
a DC motor. In contrast to the two prior examples, the training process involved
expert diagnosis to classify faults/failures as they occurred. With each occurrence,
the network weights were updated. To expedite the process, faults were induced by
damaging components or changing the resistance of internal components.
The supervised approach to generating example data is typical of data-driven
FDI examples in the open literature. Such research focuses on new detection and
isolation regimes, and assumes that training data is both available and accurate.
All data-driven FDI systems need to be trained first on known data before they are
applied to unknown data. Availability of quality training or example data is an
essential requirement whether one uses statistical FDI or artificial neural networks.
Example data consist of a sufficiently large dataset with the state of the system identified
for each observation. The identification process maps every observation to a
discrete state. Below is an augmented matrix, illustrating the form in which such a
training set with associated classifications, Y , would be assembled.
            | y_11  y_12  ...  y_1p   C_1 |
        Y = | y_21  y_22  ...  y_2p   C_2 |
            |  ...   ...        ...   ... |
            | y_n1  y_n2  ...  y_np   C_n |
The last column in the above matrix includes the state descriptors assigned to
each observation vector (each row). Based on the assumption that the classifica-
tions accurately and discretely describe the state of the system, various algorithms
may be applied to generate rules (or equations) that map a single observation
vector input to a single classification output. Once generated from the training set,
these rules can be used to classify new observations of the system. As the state of
the system changes from normal operation to a state indicative of the presence of
a particular fault, this may be recognized as a fault being both detected and isolated
(identified).
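To make the observation-to-classification mapping concrete, here is a deliberately simple nearest-centroid rule (chosen for brevity; the study itself uses neural networks and multivariate statistics) trained from such an augmented set:

```python
import numpy as np

def train_centroids(Y_obs, labels):
    """Learn one centroid per state descriptor from the training set."""
    classes = sorted(set(labels))
    labels = np.array(labels)
    return {c: Y_obs[labels == c].mean(axis=0) for c in classes}

def classify(x, centroids):
    """Map a single observation vector to a single state classification."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Toy training set: two monitored variables, states "normal" and "fault A"
Y_obs = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 4.8], [5.2, 5.1]])
labels = ["normal", "normal", "fault A", "fault A"]
centroids = train_centroids(Y_obs, labels)

state = classify(np.array([4.9, 5.0]), centroids)
```

A new observation near the fault-region centroid is classified as "fault A", so the fault is simultaneously detected (the state is no longer normal) and isolated (the specific fault is named).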
Various data-driven techniques for FDI were discussed in the previous section.
The most common of these are multivariate statistical analysis (linear and
non-linear) and artificial neural networks. Both approaches have proven to be valuable
data-driven tools for the classification of multivariate observations.
The performance of an FDI system generated from example data is a function
of both the observability of each fault within the monitored variables and the
quality of the example data collected. Since these techniques are typically applied
where mathematical modeling is not feasible, a rigorous study of the observability
of each fault in observation space is not possible. The successful detection of faults
implies observability, but failure to detect certain faults does not imply non-ob-
servability. Observable faults will not be detected if the FDI function is not
sensitive to the specific changes exhibited by a fault, or if the training data set is
not of good quality.
It is paramount that one endeavours to apply a complete, unbiased and repre-
sentative training dataset in order to achieve a robust and accurate fault detection
and isolation system.
Inspired by the way the biological nervous system processes information, artificial
neural networks (ANNs) are a mathematical paradigm, composed of a large
number of interconnected elements operating in parallel. The function of the net-
work, influenced by a number of factors including its architecture, is however
largely determined by the connections between elements. Analogous to the ability
of the biological system to learn by example, particular functions can be developed
by adjusting the value of these connections, which are known as weights.
Essentially, neural networks are adjusted, or trained, so that a particular input
produces a specific target output. Based on a comparison of the output and the
target, network parameters are adjusted in an iterative process until the output
adequately matches the target. This process is known as supervised learning, which
typically involves a large number of input/target pairs.
During training, each output is set to be a binary indicator for each data
classification. Unlike linear discriminant FDI, however, the output of the network
using unseen data is not open to interpretation of the likelihood that the observation
belongs to a particular class.
Figure 25.2 shows the mathematical workings of the most basic neural network
element, often termed a neuron. Each element of the vector input x is multiplied by
a weight. These products are summed, together with the neuron bias b, to form the
Fault Detection and Identification for Longwall Machinery Using SCADA Data 623
net input, n. This net input is then applied to a transfer function to produce the
neuron output, z. The projection of the neuron element can be viewed as a discriminant
function g(x) given by

g(x) = z = f( Σ_{i=1}^{n} x_i w_i + b )
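The computation above can be sketched in a few lines of Python; the logistic transfer function and the numeric weights below are illustrative choices, not values taken from the chapter.

```python
import math

def neuron_output(x, w, b):
    """Single-neuron discriminant g(x): the weighted sum of the inputs
    plus the bias b, passed through a transfer function f (here the
    logistic function, one common choice)."""
    n = sum(xi * wi for xi, wi in zip(x, w)) + b  # net input n
    return 1.0 / (1.0 + math.exp(-n))             # output z = f(n)

# Two inputs with illustrative weights: n = 0.4 - 0.3 + 0.1 = 0.2
z = neuron_output([0.5, -1.0], [0.8, 0.3], b=0.1)
```

With a logistic transfer function the output is confined to (0, 1), which is what allows it to act as a binary class indicator during training.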
the assumption that the state of the longwall system can be classified into a finite
number of categories. The only record of the activity of the longwall is the main-
tenance log, which details all unscheduled downtime at the longwall face.
Table 25.1 is an excerpt from the maintenance log corresponding to the
condition monitoring data discussed earlier. It records the time that the delay began
and the duration of downtime experienced. The plant responsible is also recorded,
as well as a description of the delay cause.
Figure 25.4 illustrates the inaccuracy of the maintenance records. It shows
traces of motor currents and the shearer position, which are centred on a time
corresponding to a documented delay. In this case, the maintenance records show
that a delay began at observation 9059, and that the longwall was inactive for 50
observations (25 min).
Where possible, the challenges are approached in a generic manner. This will
illustrate the applicability of this research to a large number of engineering prob-
lems where system modeling is highly complex, and discrete states of the system
are not immediately apparent.
All faults considered lead to a complete longwall shutdown. That is, one or
more parameters (examples include gearbox temperatures, AFC chain tension and
earth leakage current) measure outside preset safety limits, causing all major
longwall machinery to shut down. As such, all longwall stoppages represent
candidates for each documented maintenance event. This section describes the process
by which the start time and duration of all longwall stoppages was determined, as
well as the selection criteria for candidates for each maintenance event of interest.
We consider now the selection of candidate stoppages for each maintenance event.
It is of course likely that the true event time lies in the vicinity of the documented
delay start time (DDST), and most certainly within the same 8-h working shift.
Although not shown in Table 25.1, the maintenance log contains a shift field,
which indicates day, afternoon or night shift. The shift schedule is known for the
mine from which the data was collected. Therefore, to establish a conservative
approach that will be adopted throughout this chapter, all longwall stoppages
within the same shift will be considered candidates for each fault occurrence of
interest.
25.6.1.1 Procedure
The process of determining candidate stoppages was automated using the
following procedure:
Step 1: Determine a list L of all observations for which the values of all face
equipment motor currents are below 0.01.
Step 2: Determine the observation number for each observation in L.
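A minimal sketch of this procedure in Python, assuming the SCADA record is available as one list of motor-current readings per observation and that a stoppage is indicated by all currents falling below the 0.01 threshold of Step 1; the function name and the test data are invented for illustration.

```python
def find_stoppages(currents, threshold=0.01):
    """Return (start_observation, duration) for each contiguous run of
    observations in which every face-equipment motor current is below
    the threshold, i.e. the longwall is stopped."""
    stoppages = []
    start = None
    for i, row in enumerate(currents):
        stopped = all(c < threshold for c in row)
        if stopped and start is None:
            start = i                              # stoppage begins
        elif not stopped and start is not None:
            stoppages.append((start, i - start))   # stoppage ends
            start = None
    if start is not None:                          # record ends mid-stoppage
        stoppages.append((start, len(currents) - start))
    return stoppages

# Three motors, ten observations: one stoppage from observation 4, lasting 4
data = [[5.1, 4.8, 6.0]] * 4 + [[0.0, 0.0, 0.0]] * 4 + [[5.0, 4.9, 6.1]] * 2
```

Applied to the full data set, such a scan yields the start observation and duration of every longwall stoppage, the quantities used for candidate selection below.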
628 D. Bongers and H. Gurgenci
25.6.1.2 Results
When this procedure was applied to data representing five months of longwall
operations, 2452 stoppages were identified. The average duration of each
stoppage is 69 observations, or 34.5 min (the sampling rate is two observations per
min). On average, five candidates were selected for each maintenance event of
interest using the procedure described. As further testimony to the inaccuracy of
the maintenance log, analysis showed that two particular shifts had fewer longwall
stoppages than the number of catastrophic maintenance events documented for
each shift.
To determine which candidate corresponds to each documented event, we look to
the maintenance log. The only information available
is the difference between the delay start time and duration of each candidate and
those of the documented event. We define DST as the difference between the
delay start time of a candidate and the documented delay start time. DD is
similarly defined as the difference between the duration of each candidate stoppage
and that of the documented delay. Each candidate will have associated values of
DST and DD, and these will initially be used to determine which candidate
corresponds to the documented downtime.
The discriminating metric is simply a weighted sum of the available discrimina-
tory information, in this case DST and DD. Commonly referred to as a cost
function, it provides a crude way of determining which stoppage relates to the
documented maintenance event. The form of the cost function is
Cost = |DST| + |DD|
Table 25.2 shows the maintenance log from a single shift we observed. Table
25.3 is our record of the events as they occurred at the longwall face. Clearly, there
are discrepancies in both the DST and DD. Analysis of these errors shows the
average discrepancy to be 8 min and 31 min for DD and DST respectively.
In line with the previous arguments, and given the limited comparative data, it is
assumed that, on average, |DST| will be four times larger than |DD|. Therefore,
the cost function for initial candidate selection will be
Cost = |DST| + 4|DD|
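The selection rule then reduces to evaluating this cost for every candidate stoppage and keeping the minimiser. A sketch, with invented candidate tuples:

```python
def select_candidate(candidates):
    """Pick the stoppage whose weighted cost |DST| + 4*|DD| is smallest.
    Each candidate is (id, dst, dd): the differences, in minutes, between
    its delay start time / duration and the documented values."""
    def cost(c):
        _, dst, dd = c
        return abs(dst) + 4 * abs(dd)
    return min(candidates, key=cost)

# Hypothetical candidates for one documented maintenance event
cands = [("s1", -40, 12), ("s2", 6, 3), ("s3", 25, -1)]
```

Here the costs are 88, 18 and 29 respectively, so the second stoppage would be matched to the documented event.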
Similar results were seen for the majority of maintenance events, with the
exception of seven. Candidate selection was not possible for these because either:
- no candidates were projected within the 95% confidence interval as defined
  by the T²-statistic; or
- more than one candidate was projected within this confidence interval.
Figure 25.6 shows the trace of the T²-statistic around the time of a longwall
stoppage identified as an example of a maingate drive cooling fault. The values on
the horizontal axis of this and subsequent figures have been shifted so that
observation zero represents the first measurement of longwall shutdown. There is a
clear transition from normal operation to shutdown, indicated by the values of this
statistic starting to rise a number of observations prior to shutdown.
The dashed lines represent the upper and lower confidence limits for data
representative of normal longwall operation. These were determined by conservatively
selecting data between a number of stoppages in production. The T² values
for observations in the class normal are likely to stay between these limits. It is
the violation of these limits that can be used to test whether the system is behaving
in an abnormal manner.
This particular figure shows a distinct change from what is apparently normal
operation. The four observations prior to shutdown are clearly outside the 95%
confidence limit, which suggests that these represent operation with the fault
present.
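A sketch of how such a limit test might be implemented, assuming the mean vector and inverse covariance matrix of normal-operation data have already been estimated; the two-sensor statistics and the upper limit used in the example are illustrative values, not figures from the study.

```python
def t_squared(x, mean, cov_inv):
    """Hotelling's T^2 statistic for one observation: the squared
    Mahalanobis distance (x - mean)' S^-1 (x - mean), where mean and
    S^-1 are estimated from normal-operation data."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    k = len(d)
    return sum(d[i] * cov_inv[i][j] * d[j]
               for i in range(k) for j in range(k))

def is_abnormal(x, mean, cov_inv, upper_limit):
    """Flag an observation whose T^2 value violates the upper
    confidence limit for normal operation."""
    return t_squared(x, mean, cov_inv) > upper_limit

# Toy normal-operation statistics for two sensors (illustrative values)
MEAN = [10.0, 5.0]
COV_INV = [[0.25, 0.0], [0.0, 1.0]]  # inverse of diag(4, 1)
```

Successive observations whose T² values exceed the upper limit, as in the four observations before shutdown in Figure 25.6, would then be labelled as operation with the fault present.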
Figure 25.7 shows the T² values in the vicinity of an AFC maingate blockage
fault. Once again, significant abnormal activity is observed prior to shutdown.
recall(i) = |output(i) ∩ correct(i)| / |correct(i)|

where output(i) refers to the set of all observations that the system classifies as
fault type i. The term correct(i) is the set of all observations in the input set that
are actually in fault class i. The recall is then the fraction of the observations
actually in fault class i that the system correctly classifies. It is of course
possible that |correct(i)| = 0 (when the system is presented with an input set
containing no observations of fault class i), in which case the recall is undefined.
The precision is similarly defined as

precision(i) = |output(i) ∩ correct(i)| / |output(i)|
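Treating output(i) and correct(i) as sets of observation indices, both measures can be computed directly; the observation indices below are invented for illustration, and an empty correct(i) is reported as undefined, reflecting the caveat noted above.

```python
def recall(output, correct):
    """Fraction of the observations truly in a fault class that the
    classifier also assigns to it; None when the class is absent from
    the input set (|correct(i)| = 0)."""
    if not correct:
        return None
    return len(output & correct) / len(correct)

def precision(output, correct):
    """Fraction of the observations the classifier assigns to a fault
    class that truly belong to it."""
    if not output:
        return None
    return len(output & correct) / len(output)

# Hypothetical observation indices for one fault class i
output_i = {1, 2, 3}      # classified as fault i
correct_i = {2, 3, 4, 5}  # actually fault i
```

For these sets the recall is 2/4 and the precision is 2/3: half of the true fault-i observations are found, and two thirds of the fault-i labels are correct.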
Table 25.4 presents the overall FDI performance of the neural network. For all
faults, the values of precision and recall are higher than that for the linear
discriminant algorithm. All instances of faults were both detected and isolated,
again, occasionally a few observations after the FAULT PRESENT class of data
had begun.
The results presented in this section show the successful detection and isolation
of faults using both the linear discriminant algorithm and the two-layer neural
network. The improvements in FDI performance offered by the NN suggest that
there exists some non-linearity in the relationship between sensor measurements
and the determined classifications. This is typical of most mechanical systems,
largely due to the non-linear effect of damping.
25.10 References
Bongers, D., (2004) Development of a Classification System for Fault Detection in Longwall
Systems, PhD Thesis, The University of Queensland
Chow, M.Y., (2000) Guest Editorial: Special Section on Motor Fault Detection and
Diagnosis. IEEE Transactions on Industrial Electronics, 47(5):982–983
Frank, P.M., (1990) Fault diagnosis in dynamic systems using analytical and knowledge-
based redundancy: a survey and some new results, Automatica, 26(3):459–474
Gelb, A., (1974) Applied Optimal Estimation, MIT Press, Cambridge, Massachusetts
Grewal, M.S., Andrews, A.P., (2001) Kalman Filtering: Theory and Practice Using MATLAB,
John Wiley and Sons, New York
Hotelling, H., (1931) The generalization of Student's ratio. Annals of Mathematical Statistics,
2:360–378
McKay, B., Lennox, B., Willis, M., Barton, G., Montague, G., (1996) Extruder Modelling:
A Comparison of two Paradigms. UKACC International Conference on Control '96, 2:
734–739, Exeter, UK. Conference publication No. 427
Reid, A., (2007) Longwall Shearer Cutting Force Estimation, PhD Thesis, The University of
Queensland
Sorenson, H.W., (1985) Kalman Filtering: Theory and Application, IEEE Press, New York
Todeschini, R., (1990) Weighted k-nearest neighbor method for the calculation of missing
values, Chemometrics and Intelligent Laboratory Systems, 9:201–205
Venkatasubramanian, V., Rengaswamy, R., Yin, K., Kavuri, S., (2003) Review of Process Fault
Diagnosis, Parts I, II, III. Computers and Chemical Engineering, 27(3):293–346
Willsky, A.S., (1976) A survey of design methods for failure detection in dynamic systems,
Automatica, 12:601–611
Contributor Biographies
Chapter 1
Chapter 2
Chapter 3
Jay Lee is Ohio Eminent Scholar and L.W. Scott Alter Chair Professor in
Advanced Manufacturing at the University of Cincinnati, and is the founding
director of the National Science Foundation (NSF) Industry/University Cooperative
Research Centre (I/UCRC) on Intelligent Maintenance Systems. His current research focuses
on autonomic computing and smart prognostics technologies for predictive
maintenance and self-maintenance systems, as well as closed-loop product life cycle
service model studies. He has authored or co-authored over 100 technical
publications, edited 2 books, contributed numerous book chapters, and holds 3 U.S.
patents and 2 trademarks. He received his B.S. degree from Taiwan, an M.S. in
Mechanical Engineering from the University of Wisconsin-Madison, an M.S. in
Industrial Management from the State University of New York at Stony Brook, and a
D.Sc. in Mechanical Engineering from the George Washington University. He is a Fellow
of ASME and SME.
Haixia Wang is a postdoctoral researcher at the NSF Industry/University
Cooperative Research Center (I/UCRC) on Intelligent Maintenance Systems (IMS),
headquartered at the University of Cincinnati. Her current research interest
focuses on data streamlining for machinery prognostics and health management,
manufacturing process performance and quality improvement, and design for pro-
duct reliability and serviceability. Haixia Wang received her B.S. degree in
Mechanical Engineering from Shandong University in China, a Ph.D. in Mechanical
Engineering from Southeast University in China, and an M.S. and a Ph.D. in
Industrial and Systems Engineering from the University of Wisconsin-Madison.
Chapter 4
Chapter 5
Wenbin Wang is Chair of Operational Research at the Centre for OR and Applied
Statistics, Salford Business School, University of Salford, UK. Prof. Wang
received his B.Sc. (Harbin, China) in Mechanical Engineering in 1981, M.Sc.
(Xian, China) in Operations Management in 1984 and Ph.D. in OR and Applied
Statistics from Salford University (UK) in 1992. He has over 20 years' experience
in OR modelling in general, and maintenance and reliability modelling in particular.
He has received 3 EPSRC project grants and has authored or co-authored over
80 research papers. Professor Wang is a fellow of the Royal Statistical Society, the
Operational Research Society and the Institute of Mathematics and its Applications,
and a chartered mathematician. He is also a member of the International Foundation for Research
in Maintenance. Professor Wang holds a guest professorship at Harbin Institute of
Technology, China.
Chapter 6
David Percy gained a B.Sc. degree with first class honours in mathematics from
Loughborough University in 1985 and a Ph.D. degree in statistics from Liverpool
University in 1990. He is a reader in mathematics at the University of Salford and
his research into Bayesian inference, stochastic processes and multivariate analysis
has produced 40 refereed publications and many conference presentations. He is
actively involved in collaborative research for industrial applications, particularly
concerning maintenance scheduling problems for complex systems. Dave is a
chartered scientist, chartered mathematician and member of the governing Council
for the Institute of Mathematics and its Applications.
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Chapter 11
Chapter 12
Chapter 13
Chapter 14
Chapter 15
Chapter 16
Chapter 17
Chapter 18
Chapter 19
Chapter 20
Chapter 21
Chapter 22
Chapter 23
Chapter 24
Chapter 25
Daniel Bongers received his B.E. (1999) and Ph.D. (2004) from the University of
Queensland, Australia. He is currently a research fellow at the Australian
Cooperative Research Centre for Mining, and is responsible for managing two late-stage
technology development projects. His current research interests include physiologi-
cal signal processing, fault detection and isolation, physiological fatigue detection
and signal measurement.
Hal Gurgenci received his B.Sc. (1976) and M.Sc. (1979) from the Middle East
Technical University, Turkey, and Ph.D. (1982) from the University of Miami. He
is currently a professor with the School of Engineering, The University of Queens-
land in Brisbane. Previously, he was a Vice President of the Australian Coopera-
tive Research Centre on Mining responsible for research and education activities of
the Centre. He was the principal investigator of several large projects in mining
equipment design, automation, reliability and maintenance. His current research
interests include energy generation and conservation.
Index

A
ABC classification 484
Accelerated
  Degradation testing 157
  Failure time testing 156
  Life testing plans 156
Adverse selection 388
Agency Theory 387
  Issues 388
Aging parameter 517
AHP 427
Artificial intelligence 209
Asset 4

B
Bayesian
  Approach 135
  Decision Theory 146
  Inference 136
Benchmarking
  Methodology 562
  Need 563
  Overview 561

C
Candidate group 515
Case
  Based reasoning 209, 212
  Studies 69, 124, 150, 445
CBA See cost benefit analysis
CBM 52, 54
  Applications 538
CMMS 43, 417
Composite scale 542
Condition monitoring techniques 112
Contract 402
Cost benefit ratio 525, 526, 529
Cost benefit analysis 509, 521, 529
Costs
  Down time 324
  Punctuality 526
  Safety 525
Criticality index 93

D
Data
  Acquisition 535
  Fusion 537
  Processing 536
Decision
  Charts 35
  Model 116
  Support 42
Delay time
  Bayesian approach 362
  Modelling 345
  Objective data method 364
  Subjective estimation 359
Demand
  Distribution 487
  Estimators 500
  Mean 489
  Variance 492
Dependence
  Economic 265
  Stochastic 266
  Structural 266
Diagnostics 536
  Module 66
  Technologies 598
Diesel Engine 538
Discount rate 521, 525, 528
Distributions
  Event time 627
  Posterior 139
  Predictive 142
  Prior 139
DMG 422
Dynamic grouping 512, 514, 516, 519, 527

E
Economy of scale 511
Economy of scope 511
Effective failure rate 512
E-maintenance 586
EMQ 333
E-operations 586
Equipment leasing 397
ERP 418
Extrusion press 366

F
Failure
  Information 94
  Interaction 275
  Interaction Type I 276
  Interaction Type II 278
Fault detection 611
FMECA 90, 517
Forecasting
  Non-parametric 493
  Parametric 487
FTA 442
Functional
  Block diagrams 85
  Failure analysis 84
  Failures 85
Fuzzy logic 212, 428

G
Game
  Nash 385
  Stackelberg 385
Genetic algorithm 212
Government 400

H
HAZOP 441
HIMOS 217
HSE 471

I
Industry
  Nuclear 473
  Oil and gas 474
  Process and utility 475
  Railway 475, 565
Information fusion 128
Infrastructure 376
Inspections
  Imperfect 349
  Perfect 348
Intensity function
  General proportional 193
  Reduction 405
Interval optimization 105
Inventory decision 482

K
Knowledge based systems 212
KPI 461

L
Laplace trend test 197
Lease
  Definition 397
  Finance 398
  New equipment 408
  Operating 397
  Sale and leaseback 399
  Used equipment 409
Lessee 402
Lessor 401
Life cycle cost 34, 509, 510, 525
  Calculations 525

M
Maintainability 8
Maintenance
  Actions 27
  Actions Selection 94
  Benchmarking 563
  Concepts 32
  Concepts customized 40
  Condition based 49, 424
  Context 22
  Contract 569
  Corrective 27, 379
  Design-out 30
  Failure-based 30
  Framework 4
  Grouping activities 511
  Intelligent Systems 56
  Intervals 97
  Longwall 613
  Management 9, 22
  Manager 41
  Maturity levels 45
  Measurement and control 225
  Offshore asset 589
  Opportunity based 519
  Opportunity-based 30
  Optimization 509, 511
  Outsourcing 24
  Outsourcing advantages 375
  Outsourcing disadvantages 375
  Passive 29
  Performance 6
  Performance Measurement 459
  Policies 30
  Predictive 29
  Preventive 28, 79, 199, 379, 510–513, 519
  Preventive comparison analysis 97
  Preventive optimal schedule 170, 173
  Proactive 29, 53
  Reactive 51
  Reliability centered 37
  Scheduling 199, 271
  Self 53
  Service contract 6
  Technologies 50
  Time based 30
  Total productive 37
  Usage based 30
Metrics 461, 570
Misjudgment 549
Model
  Age based 306
  Basic risk 446
  Capital replacement 303
  Competing risk 245
  Cumulative usage based 310
  Dynamic programming 306
  Economic life 290
  Finite horizon 294
  Intensity reduction 191
  Linear regression 160
  Markov 252
  Non-homogeneous Poisson process 187
  Period based 308
  Proportional hazards 190
  Proportional intensities 192
  Renewal process 187
  Repair alert 248
  Risk influence 448
  Selection 197, 553
  State discriminant 533
  Statistic based nonparametric 160
  Statistic based parametric 159
  Two-cycle 291
  Virtual age 190
Monitoring
  Off-line 615
  Oil based 114
  On-line 616
  Vibration based 113
Moral hazard 388
MPM system 469
MTTF 517, 518
Multivariate control chart 549

N
Net present value 521, 525, 526, 528
Neural network 212, 622
NPV See Net present value

O
Oil degradation 541
Opportunity maintenance 325

R
RCM
  Analysis process 79
  Data collection 80, 99
  Implementation 80, 99
Regulator 400
Reliability
  Inherent 6
  Measures 254
  Theory 8
Renewal Process
  Alternating 189
  Trend 235
Repair
  Maximal 187
  Minimal 187
Replacement model
  Age based 306
  Cumulative-usage based 310

T
Technologies
  Diagnostic 598
  Prognostic 598
Training set 637
Trend renewal process 237
  Heterogeneous 242

U
Unplanned costs 512

V
Variable costs 525
Virtual age 406

W
Warranty
  Extended 377
  Servicing 384
Wear particles 541
Weibull 517