7 Failure Prevention and Recovery

Nigel Slack, Stuart Chambers & Robert Johnston, 2004 Operations Management, 4E: Chapter 19
19.1
Failure Prevention and Recovery
Chapter coverage:
System failure
Failure detection and analysis
Improving process reliability
Recovery
19.2
Failure
There is always a chance that things might go wrong. We
must accept this, NOT ignore it.
Critical failure:
Loss of customers
High downtime
High repair cost
Injury or loss of life (company reputation)
Non-critical failure: lesser effect
Organizations must discriminate between failures and give priority to
critical ones. To do so they need to understand why things fail and how to
measure the impact of failure.
19.3
All failures can be traced back to some kind of human
failure.
A machine failure might have been caused by
someone's poor design or maintenance.
A delivery failure might have been someone's error in
managing the supply schedule.
Failures are rarely a matter of random chance.
They can be controlled to a certain extent.
Organizations can learn from failure and change accordingly.
Each failure is an opportunity to examine its causes and plan for their elimination.
Failure as an Opportunity
19.4
System Failure
Why things fail:
1) Failure resulting from within the operation:
Design failure
Facilities failure
People failure
2) Failure resulting from material or information input
Supplier failure
3) Failure resulting from customer actions
Customer failure
19.5
Why Things Fail
Design failure:
Operations may look fine on paper but cannot cope with
real circumstances.
Type 1: Characteristic of demand was overlooked or
miscalculated.
Bearing factory designed to produce 100 bearings
per day but customers demand 125 bearings per
day.
Type 2: The circumstances under which the operation
has to work are not as expected.
A factory building designed to house stationary
machinery fails when it was used to store a
vibrating machine.
19.6
Why Things Fail
Facilities failure:
All facilities (machines, equipment, buildings, fittings)
are liable to breakdown.
Type 1: Partial breakdown
Worn out carpet in a hotel
Machine can only run at half its normal rate
Type 2: Complete breakdown
Sudden stop of operation
It is the effect of the breakdown that is important: some
breakdowns could paralyse the whole operation.
Some failures have a cumulative significant impact.
19.7
Why Things Fail
People failure:
Type 1: Errors are mistakes in judgement.
A manager's decision to continue running the plant
with a partially failed heat exchanger resulted in a
more expensive complete breakdown.
Type 2: Violations are acts which are contrary to
defined operating procedures.
A machine operator's failure to lubricate the
bearings of the motor resulted in the bearings
overheating and failing.
19.8
Why Things Fail
Supplier failure:
A supplier's failure to:
Deliver
Deliver on time
Deliver quality goods and services
can lead to failure within an operation.
Customer failure:
Customer failure can result when customers misuse
products and services.
Example: Someone loading a 14 kg washing
machine with 18 kg of clothes will cause the machine
to fail.
19.9
There are three main ways of measuring failure:
Failure rate: how often a failure occurs
Reliability: the chances of failure occurring
Availability: the amount of available useful
operating time
Measuring Failure
19.10
Failure rate (FR):

$$FR = \frac{\text{number of failures}}{\text{total number of products tested}}$$

or

$$FR = \frac{\text{number of failures}}{\text{operating time}}$$

Example: If an engine fails 4 times after operating for
300 hours, it has a failure rate of 4/300 = 0.013 failures per hour
(1.3% when expressed as a percentage).
Example: If 5 out of 250 products tested for operability
failed, the failure rate is 5/250 = 0.02 (2%).
Measuring Failure
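The two failure-rate formulas can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not taken from the text.

```python
def failure_rate_by_time(failures: int, operating_hours: float) -> float:
    """Failures per hour of operating time."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failures / operating_hours


def failure_rate_by_tests(failures: int, total_tested: int) -> float:
    """Fraction of tested products that failed."""
    if total_tested <= 0:
        raise ValueError("total_tested must be positive")
    return failures / total_tested


# The two examples from the slide.
engine_fr = failure_rate_by_time(failures=4, operating_hours=300)
product_fr = failure_rate_by_tests(failures=5, total_tested=250)

print(f"Engine failure rate: {engine_fr:.4f} failures per hour")
print(f"Product failure rate: {product_fr:.4f} of units tested")
```

Multiplying either result by 100 gives the rate as a percentage.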
19.11
Failure over time: the bath-tub curve
At different stages during the life of anything, the
probability of it failing will be different.
The failure pattern of most physical entities follows the
bath-tub curve.
Measuring Failure
19.12
The bath-tub curve comprises three stages:
The infant-mortality stage, where early failures
occur, caused by defective parts or improper use.
The normal-life stage, when the failure rate is low
and reasonably constant, and failure is caused by normal
random factors.
The wear-out stage, when the failure rate increases
as the part approaches the end of its working life and
failure is caused by the ageing and deterioration of
parts.
19.13
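Bath-Tub Curve
[Figure: failure rate (vertical axis) against time (horizontal axis), divided into the infant-mortality stage, the normal-life stage and the wear-out stage, with points x and y marked on the time axis.]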
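The shape of the bath-tub curve can be sketched as a piecewise function of time. All numbers below (the stage boundaries x and y, the base rate and the slopes) are hypothetical values chosen only to reproduce the three stages; they do not come from the text.

```python
def bathtub_failure_rate(
    t: float,
    x: float = 100.0,            # assumed end of the infant-mortality stage
    y: float = 900.0,            # assumed start of the wear-out stage
    base: float = 0.001,         # assumed constant rate during normal life
    early_extra: float = 0.01,   # assumed extra rate at t = 0
    wear_slope: float = 0.00005, # assumed rate increase per unit time after y
) -> float:
    """Illustrative failure rate at time t for a bath-tub shaped curve."""
    if t < 0:
        raise ValueError("t must be non-negative")
    if t < x:
        # Infant-mortality stage: rate falls linearly towards the base rate.
        return base + early_extra * (1 - t / x)
    if t <= y:
        # Normal-life stage: low and constant.
        return base
    # Wear-out stage: rate rises as the part ages.
    return base + wear_slope * (t - y)


for t in (0, 50, 500, 1000):
    print(f"t = {t:>4}: failure rate = {bathtub_failure_rate(t):.4f}")
```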
19.14
Reliability
Measures the probability of a system, product or service
performing as expected over time.
Values between 0 and 1 (0 to 100% reliability)
Used to relate parts of the system to the system.
If components in a system are all interdependent, a
failure in any individual component will cause the
whole system to fail.
Hence, the reliability of the whole system, $R_s$, is:

$$R_s = R_1 \times R_2 \times R_3 \times \dots \times R_n$$

Where: $R_1$ = reliability of component 1
$R_2$ = reliability of component 2
$R_3$ = reliability of component 3
etc.
19.15
Worked Example
An automated pizza-making machine in a food manufacturer's factory
has five major components, with individual reliabilities (the probability
of the component not failing) as follows:
Dough mixer                 Reliability = 0.95
Dough roller and cutter     Reliability = 0.99
Tomato paste applicator     Reliability = 0.97
Cheese applicator           Reliability = 0.90
Oven                        Reliability = 0.98

If one of these parts of the production system fails, the whole system
will stop working. Thus the reliability of the whole system is:

Rs = 0.95 × 0.99 × 0.97 × 0.90 × 0.98
= 0.805
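The worked example can be reproduced in code. This is a minimal sketch that assumes, as the slide does, that every component must work for the system to work; the function name is illustrative.

```python
from math import prod


def system_reliability(component_reliabilities) -> float:
    """Reliability of a system in which every component must work."""
    values = list(component_reliabilities)
    for r in values:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"reliability must be between 0 and 1, got {r}")
    return prod(values)


# Component reliabilities from the pizza-making machine example.
pizza_machine = {
    "Dough mixer": 0.95,
    "Dough roller and cutter": 0.99,
    "Tomato paste applicator": 0.97,
    "Cheese applicator": 0.90,
    "Oven": 0.98,
}

rs = system_reliability(pizza_machine.values())
print(f"System reliability: {rs:.3f}")
```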
19.16
Worked Example
Notes:
The reliability of the whole system is only about 0.8 even though the
reliability of each individual component is higher.
If the system had more components, its reliability would be
lower.
E.g. for a system with 10 components, each with a reliability of
0.99, the reliability of the system is about 0.9 (0.99^10 ≈ 0.904), BUT if the
system has 50 components, each with a reliability of 0.99,
the reliability of the system falls to about 0.6 (0.99^50 ≈ 0.605).

Reliability chart given on page 687 of recommended text.
19.17
Availability
Availability is the degree to which the operation is
ready to work.
An operation is not available if it has either failed or is
being repaired following a failure.

$$\text{Availability } (A) = \frac{MTBF}{MTBF + MTTR}$$

Where:
MTBF = mean time between failures = operating hours / number of failures
MTTR = mean time to repair
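A short sketch of the availability calculation follows. The operating hours and number of failures reuse the engine example from the failure-rate slide; the mean time to repair of 5 hours is a hypothetical figure added for illustration.

```python
def mean_time_between_failures(operating_hours: float, failures: int) -> float:
    """MTBF = operating hours / number of failures."""
    if failures <= 0:
        raise ValueError("failures must be positive")
    return operating_hours / failures


def availability(mtbf: float, mttr: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    if mtbf < 0 or mttr < 0 or mtbf + mttr == 0:
        raise ValueError("mtbf and mttr must be non-negative and not both zero")
    return mtbf / (mtbf + mttr)


mtbf = mean_time_between_failures(operating_hours=300, failures=4)  # 75 hours
a = availability(mtbf, mttr=5.0)  # mttr is an assumed value

print(f"MTBF = {mtbf:.1f} hours, availability = {a:.4f}")
```

Under these assumptions, availability improves either by lengthening the time between failures or by shortening the repair time.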
19.18
The three tasks of failure prevention and recovery
Failure detection and analysis: finding out what is going wrong and why
Improving system reliability: stopping things going wrong
Recovery: coping when things do go wrong
19.19
Mechanisms to detect failure:
1. In-process checks
2. Machine diagnostic checks
3. Point-of-departure interviews
4. Phone surveys
5. Focus groups
6. Complaint cards or feedback sheets
7. Questionnaires
19.20
1. In-process checks: employees check, while the process is
running, that it is acceptable.
Example: "Is everything all right with your meal,
madam?"
2. Machine diagnostic checks: a machine is put through
a prescribed sequence of activities to expose any
failures or potential failures.
Example: A heat exchanger is tested for leaks, cracks and
wear.
19.21
3. Point-of-departure interviews: at the end of a
service, staff may check that the service has been
satisfactory.
4. Focus groups: groups of customers are brought
together to discuss some aspects of a product or service.
5. Phone surveys, complaint cards and questionnaires:
these can be used to ask for opinions about products or
services.
19.22
Failure analysis:
1. Accident investigation
Trained staff analyse the cause of the accident.
They make recommendations to minimize or eradicate the
chance of the failure happening again.
Specialized investigation technique suited to the type
of accident
2. Product liability
Ensures all products are traceable.
Traced back to the process, the components from
which they were produced and the supplier who
supplied them.
Goods can be recalled if necessary.
19.23
3. Complaint analysis
Complaints and compliments are recorded and taken
seriously.
Cheap and easily available source of information
about errors.
Involves tracking number of complaints over time.
4. Critical incident analysis
Requires customers to identify the elements of
products or services they found either satisfying or
not satisfying.
Especially used in service operations.
19.24
5. Failure mode and effect analysis (FMEA)
Used to identify failures before they happen so
proactive measures can be taken.
For each possible cause of failure the following type
questions are asked:
What is the likelihood a failure will occur?
What would the consequence of the failure be?
How likely is such a failure to be detected
before it affects the customer?
Risk priority number (RPN) calculated based on
these questions.
Corrective action taken based on RPN.
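The slide does not say how the RPN is computed. A common FMEA convention, assumed here, is to rate each of the three questions on a 1 to 10 scale and multiply the ratings together. The failure modes and ratings below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One possible cause of failure, rated on assumed 1-10 scales."""
    name: str
    occurrence: int  # 1 = very unlikely to occur, 10 = almost certain
    severity: int    # 1 = negligible consequence, 10 = catastrophic
    detection: int   # 1 = almost certain to be caught, 10 = likely to reach the customer

    def __post_init__(self) -> None:
        for field_name in ("occurrence", "severity", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        # Assumed convention: RPN is the product of the three ratings.
        return self.occurrence * self.severity * self.detection


# Hypothetical failure modes and ratings, for illustration only.
modes = [
    FailureMode("Oven malfunction", occurrence=2, severity=8, detection=3),
    FailureMode("Timing error by chef", occurrence=5, severity=6, detection=4),
    FailureMode("Ingredients not defrosted", occurrence=3, severity=7, detection=2),
]

# Corrective action is directed first at the highest RPN.
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for mode in ranked:
    print(f"{mode.name}: RPN = {mode.rpn}")
```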
19.25
6. Fault-tree analysis
This is a logical procedure that starts with a failure
or potential failure and works backwards to identify
all the possible causes and therefore the origins of
that failure.
Made up of branches connected by AND nodes and
OR nodes.
Branches below AND node all need to occur for the
event above the node to occur.
Only one of the branches below an OR node needs to
occur for the event above the node to occur
19.26
Fault-tree analysis for below-temperature
food being served to customers
[Figure: fault tree, shown here as an outline. In the original figure each level is joined to the one above by an AND node or an OR node, as given in its key.]
Food served to customer is below temperature
    Food is cold
        Oven malfunction
        Timing error by chef
        Ingredients not defrosted
    Plate is cold
        Cold plate used
        Plate taken too early from warmer
        Plate warmer malfunction
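The AND/OR logic of a fault tree can be evaluated mechanically. In the sketch below the tree structure follows the below-temperature food example, but the gate type at every node is an assumption (all OR), since the node symbols are not recoverable from the text. The tuple representation is also just one possible encoding.

```python
def evaluate(node, occurred: set) -> bool:
    """Return True if the event at `node` occurs.

    A node is either a basic-event name (str) or a tuple (gate, children),
    where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return node in occurred
    gate, children = node
    results = [evaluate(child, occurred) for child in children]
    if gate == "AND":
        return all(results)  # every branch below must occur
    if gate == "OR":
        return any(results)  # one branch below is enough
    raise ValueError(f"unknown gate: {gate}")


# Assumed gate types: every node is treated as an OR node.
food_is_cold = ("OR", ["Oven malfunction", "Timing error by chef",
                       "Ingredients not defrosted"])
plate_is_cold = ("OR", ["Cold plate used", "Plate taken too early from warmer",
                        "Plate warmer malfunction"])
food_served_below_temperature = ("OR", [food_is_cold, plate_is_cold])

print(evaluate(food_served_below_temperature, {"Plate warmer malfunction"}))
print(evaluate(food_served_below_temperature, set()))
```

Changing the top gate to "AND" would mean the top event occurs only when both the food and the plate are cold, which shows how much the gate type changes the analysis.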
19.27
To be continued
19.28
Improving Process Reliability
Once the cause and effect of a failure are known, the next
course of action is to try to prevent the failure from
taking place. This can be done in a number of ways:
Designing out fail points in the process
Building redundancy into the process
Fail-safeing some of the activities in the process
Maintenance of the physical facilities in the process
19.29
Designing out fail points
Identifying and then controlling process, product and
service characteristics to try to prevent failures.
Use of process maps to detect potential fail points in
operations.
Redundancy
Building redundancy into an operation means having
back-up systems in case of failure.
Increases the reliability of a component
Expensive solution
Used for breakdowns with critical impact.
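The effect of a back-up on reliability can be sketched as follows. The sketch assumes that the component and its back-up fail independently and that switching to the back-up never fails; neither assumption is stated in the text. The 0.90 figure is the cheese applicator's reliability from the worked example, and the identical back-up is hypothetical.

```python
def reliability_with_backup(r_component: float, r_backup: float) -> float:
    """Reliability of a component plus one back-up.

    The pair fails only if both fail, assuming independent failures:
    R = R_component + R_backup * (1 - R_component)
    """
    for r in (r_component, r_backup):
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"reliability must be between 0 and 1, got {r}")
    return r_component + r_backup * (1.0 - r_component)


# Cheese applicator (0.90) with a hypothetical identical back-up.
r = reliability_with_backup(0.90, 0.90)
print(f"Reliability with back-up: {r:.2f}")
```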
19.30
Fail-safeing
Called poka-yoke in Japan.
Based on the principle that human mistakes are to some
extent inevitable.
The objective is to prevent them from becoming a
defect.
Poka-yokes are simple (preferably inexpensive) devices
or systems which are incorporated into a process to
prevent inadvertent operator mistakes resulting in a
defect.
19.31
Maintenance
Maintenance is the method used by organizations to
avoid failure by taking care of their physical facilities.
Important to organizations whose physical facilities
play a central role in creating their goods and services.
Benefits of maintenance:
Enhanced safety
Increased reliability
Higher quality
Lower operating costs
Longer life span
Higher end value
19.32
Benefits of Maintenance
Enhanced safety: Well maintained facilities are less
likely to behave in an unpredictable or non-standard
way, or fail outright, all of which would pose a hazard to
staff.
Increased reliability: This leads to less time lost while
facilities are repaired, less disruption to the normal
activities of the operation, and less variation in output
rates.
Higher quality: Badly maintained equipment is more
likely to perform below standard and cause quality
errors.
19.33
Benefits of Maintenance
Lower operating costs: Many pieces of process
technology run more efficiently when regularly
serviced.
Longer life span: Regular care prolongs the effective
life of facilities by reducing the problems in operation
whose cumulative effect causes deterioration.
Higher end value: Well maintained facilities are
generally easier to dispose of into the second-hand
market.
19.34
Approaches to maintenance
1. Run to breakdown (RTB)
Allowing the facilities to continue operating until
they fail.
Maintenance work is performed after failure has
taken place.
Used when the effect of the failure is neither catastrophic nor
frequent, e.g. it does not paralyse the whole
operation.
Regular checks are sufficient.
19.35
2. Preventive maintenance (PM)
Attempts to eliminate or reduce the chances of
failure by servicing the facilities at pre-planned
intervals.
Used when the consequence of failure is
considerably more serious.
Can be used to detect impending failures.
Remedial actions can be planned for, thus
improving overall efficiency.
The useful life of certain components can be
increased beyond their recommended life span.
19.36
3. Condition-based maintenance (CBM)
Attempts to perform maintenance only when the
facilities require it.
May involve continuously monitoring parameters
(vibration, temperature, displacement) of the
facility.
The results of the monitored parameters are used to
decide whether to stop the facility to conduct
maintenance.
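The decision rule behind condition-based maintenance can be sketched as a comparison of monitored readings against limits. The parameter names, units and limit values below are hypothetical; real limits would come from the equipment supplier or from operating experience.

```python
# Hypothetical alarm limits for the monitored parameters.
LIMITS = {
    "vibration_mm_s": 7.0,
    "temperature_c": 85.0,
    "displacement_mm": 0.5,
}


def parameters_out_of_limits(reading: dict, limits: dict = LIMITS) -> list:
    """Names of monitored parameters whose reading exceeds its limit."""
    return [name for name, limit in limits.items()
            if reading.get(name, 0.0) > limit]


def needs_maintenance(reading: dict, limits: dict = LIMITS) -> bool:
    """Stop the facility for maintenance if any parameter is out of limits."""
    return bool(parameters_out_of_limits(reading, limits))


healthy = {"vibration_mm_s": 3.2, "temperature_c": 70.0, "displacement_mm": 0.1}
worn = {"vibration_mm_s": 9.1, "temperature_c": 72.0, "displacement_mm": 0.2}

print(needs_maintenance(healthy), parameters_out_of_limits(healthy))
print(needs_maintenance(worn), parameters_out_of_limits(worn))
```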
19.37
4. Mixed maintenance strategies
Most operations adopt a mixture of these
approaches because different elements of their
facilities have different characteristics.
Use ???
Use ???
Use ???
19.38
5. Run to breakdown versus preventive maintenance
The more frequently preventive maintenance is
carried out, the lower the chance of the facility breaking
down.
The cost of preventive maintenance is often high.

Infrequent preventive maintenance will cost less
but will result in a higher chance of breaking
down.
The cost of an unplanned breakdown is often
high.
19.39
Cost of Preventive Maintenance
[Figure: costs of PM (vertical axis) against amount of preventive maintenance (horizontal axis).]
19.40
Cost of Breakdown
[Figure: costs of breakdown on the vertical axis.]
19.41
Maintenance cost model 1: One model of the costs associated
with preventive maintenance shows an optimum level of
maintenance effort.
[Figure: costs (vertical axis) with three curves: total cost, cost of providing preventive maintenance, and cost of breakdowns. The optimum level of preventive maintenance is marked.]
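The idea of an optimum level in Model 1 can be illustrated numerically. The two cost functions below are hypothetical shapes chosen only so that the cost of preventive maintenance rises and the cost of breakdowns falls as maintenance effort increases; the text gives no actual cost figures.

```python
def pm_cost(level: float, cost_per_unit: float = 100.0) -> float:
    """Assumed cost of providing preventive maintenance: rises with effort."""
    return cost_per_unit * level


def breakdown_cost(level: float, cost_with_no_pm: float = 10_000.0) -> float:
    """Assumed cost of breakdowns: falls as maintenance effort increases."""
    return cost_with_no_pm / (1.0 + level)


def total_cost(level: float) -> float:
    return pm_cost(level) + breakdown_cost(level)


# Search a range of maintenance levels for the lowest total cost.
levels = range(0, 51)
optimum_level = min(levels, key=total_cost)

print(f"Optimum level of preventive maintenance: {optimum_level}")
print(f"Total cost at optimum: {total_cost(optimum_level):.2f}")
```

With different assumed cost curves the optimum moves, which is the point that the second model makes.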
19.42
Maintenance cost model 2: an optimum level of maintenance
effort.
[Figure: costs (vertical axis) comparing the actual cost of providing preventive maintenance with the Model 1 cost of providing preventive maintenance.]
19.43
[Figure, continued: the actual cost of breakdowns compared with the Model 1 cost of breakdowns.]
19.44
[Figure, continued: total cost, cost of breakdowns and cost of providing preventive maintenance.]
19.45
Notes:
In practice the cost of PM does not increase as steeply as
indicated in Model 1.
Model 1 assumes that all maintenance jobs must be
carried out by a specialist maintenance team, but Model
2 recognizes that operators themselves can carry out
simple, in-process maintenance, etc.
The cost of breakdown could be higher than indicated in
Model 1.
A breakdown may cost more than the cost of repair
and the cost of the stoppage itself: a stoppage can
take away the stability in the operation.
19.46
Run To Breakdown or Preventive
Maintenance?

Based on the arguments above, the
shift is more towards the use of
Preventive Maintenance.
19.47
6. Failure distributions
The shape of the failure probability distribution of a
facility can determine if it benefits from preventive
maintenance.
[Figure: probability of failure (vertical axis) against time (horizontal axis) for Machine A and Machine B, with times x and y marked.]
19.48
Notes:
Machine A
The probability that it will break down before time x is
relatively low.
It has high probability of breaking down between
times x and y.
If preventive maintenance was carried out just before
point x, the chances of breakdown can be reduced.
19.49
Notes:
Machine B
It has a relatively high probability of breaking down at
any time.
Its failure probability increases gradually as it passes
through time x.
Carrying out preventive maintenance at point x, or at any
other point, cannot dramatically reduce the probability of
failure.
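The contrast between Machine A and Machine B can be illustrated with a simple model. The sketch below assumes Weibull-distributed failure times and assumes that preventive maintenance restores a machine to as-new condition. The distribution, the shape and scale values, and the times are all modelling assumptions, not figures from the text.

```python
import math


def survival(t: float, scale: float, shape: float) -> float:
    """Probability of surviving beyond time t (Weibull model)."""
    return math.exp(-((t / scale) ** shape))


def prob_fail_in_window(age: float, window: float,
                        scale: float, shape: float) -> float:
    """P(failure within `window` | machine has survived to `age`)."""
    return 1.0 - survival(age + window, scale, shape) / survival(age, scale, shape)


X, WINDOW, SCALE = 800.0, 200.0, 1000.0  # hypothetical times

# Machine A: failures cluster late in life (large shape parameter).
# Machine B: failures are spread over time (shape close to 1).
for name, shape in (("Machine A", 6.0), ("Machine B", 1.2)):
    without_pm = prob_fail_in_window(X, WINDOW, SCALE, shape)
    with_pm = prob_fail_in_window(0.0, WINDOW, SCALE, shape)  # renewed at x
    print(f"{name}: P(fail between x and y) without PM = {without_pm:.3f}, "
          f"with PM at x = {with_pm:.3f}")
```

Under these assumptions, maintenance at x removes almost all of Machine A's risk in the following interval but changes Machine B's only modestly.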
19.50
Total Productive Maintenance (TPM) Approach
Total productive maintenance (TPM) is defined as:
the productive maintenance carried out by all
employees through small group activities
Where productive maintenance is:
maintenance management which recognizes
the importance of reliability, maintenance and
economic efficiency in plant design
19.51
The five goals of TPM:
1. Improve equipment effectiveness:
Examine how the facilities contribute to the
effectiveness of the operation by examining all the
losses which occur.
2. Achieve autonomous maintenance:
Allow people who operate the equipment to take
responsibility for some maintenance tasks.
Maintenance staff to take responsibility for the
improvement of maintenance performance.
19.52
There are three levels at which maintenance staff
can take responsibility for process reliability:
Repair level: staff carry out instructions but do not
predict the future; they simply react to problems.
Prevention level: staff can predict the future by
foreseeing problems, and take corrective action.
Improvement level: staff can predict the future by
foreseeing problems; they not only take corrective
action but also propose improvements to prevent
recurrence.
19.53
Example:
Suppose the screws on a machine become loose. Each week
it jams up and is passed to maintenance to be fixed.
A repair level maintenance engineer will simply
repair it and hand it back to production.
A prevention level maintenance engineer will spot
the weekly pattern to the problem and tighten the
screws in advance of their loosening.
An improvement-level maintenance engineer will
recognize that there is a design problem and modify
the machine so that the problem cannot recur.
19.54
The five goals of TPM (cont):
3. Plan maintenance:
To have a fully worked-out approach to all
maintenance activities. This includes:
the level of preventive maintenance required
for each piece of equipment
the standard for condition-based maintenance
the respective responsibilities of operating staff
and maintenance staff (see Slide 19.55)
4. Train all staff in relevant maintenance skills:
TPM emphasises appropriate and continuous
training to ensure staff have the skills to carry out
their roles.
19.55
The roles and responsibilities of operating staff and
maintenance staff in TPM

                  Maintenance staff              Operating staff
Roles             To develop:                    To take on:
                  Preventive actions             Ownership of facilities
                  Breakdown services             Care of facilities
Responsibilities  Train operators                Correct operation
                  Devise maintenance practice    Routine preventive maintenance
                  Problem-solving                Routine condition-based maintenance
                  Assess operating practice      Problem detection
19.56
The five goals of TPM (cont):
5. Achieve early equipment management:
This goal is directed at avoiding maintenance
altogether by maintenance prevention (MP).
MP involves considering root causes of failure and
maintainability of equipment during the design
stage, manufacture, installation and its
commissioning.
19.57
Reliability Centred Maintenance (RCM) Approach
1. TPM tends to recommend preventive maintenance even
when it is not appropriate.
2. RCM uses the pattern of failure for each type of failure mode
to dictate the approach to maintenance.
3. The approach of RCM is sometimes summarized as "If
we cannot stop it from happening, we had better stop it
from mattering": efforts need to be directed at reducing
the impact of the failure.
19.58
Example:
Take the process illustrated in Slide 19.59. This is a simple
shredding process which prepares vegetables prior to
freezing. The part of the process which
requires the most maintenance attention is the cutter
sub-assembly. However, the cutters have several modes of failure:
1) They require changing because they have worn out
through usage.
2) They have been damaged by small stones entering the
process.
3) They have shaken loose because they were not fitted
correctly.
19.59
One part in one process can have several
different failure modes, each of which
requires a different approach
[Figure: the cutters in the shredding process, with three plots of failures against time, one for each failure mode.]
Cutter wear-out failure pattern. Solution: preventive maintenance before end of useful life
Cutter damage failure pattern. Solution: prevent damage, fix stone screen
Cutter shake-loose failure pattern. Solution: ensure correct fitting through training