
Use of Semi Hidden Markov Models in the Prognostics of Shaft Failure

Eric Bechhoefer (1), Andreas Bernhard (2), David He (3), Pat Banerjee (3)

(1) Goodrich Fuels and Utility Systems, Vergennes, VT 05491, Eric.Bechhoefer@Goodrich.com
(2) Sikorsky Aircraft Corporation, Stratford, CT 06615
(3) Department of Mechanical & Industrial Engineering, The University of Illinois at Chicago, Chicago, IL, USA, 60607

Abstract

Vibration based mechanical diagnostics, such as those used in health and usage monitoring systems (HUMS), can successfully identify anomalous components. For shafts, where damage can be directly measured as an exceedance in shaft order 1, 2 or 3 (a vibration vector at 1, 2 or 3 times the RPM of the shaft), it is simply a matter of taking the spectrum of the vibration data and noting whether the magnitude at the shaft order frequency exceeds some threshold.

In some cases, the component manufacturer has published limits. In other cases, statistics from a population of components are calculated, and a threshold is set such that the probability of the component being normal, when exceeding the threshold, is small. This event or fault detection is diagnostic. To increase the utility of the HUMS system, the ability to predict the useful life remaining (ULR) on the component would greatly improve operational readiness, reduce the logistic footprint and, consequently, reduce maintenance cost.

The current diagnostics capability gives some marginal prognostic capacity (e.g. if the component’s health is trending up, one would suggest that maintenance will be required “soon”). Further benefit would be gained if a reliable estimate of the component’s ULR were available on a time scale similar to that of the logistic supply line. That is, 50 to 100 hours of advance notice vs. the 2 to 5 hours of notice currently available.

The paper investigates the use of the Hidden Semi-Markov Model (HSMM) to predict the ULR of a component. The intent is to investigate modeling techniques which will push the ULR prediction, reliably, to 100 hours. Real world data from 30 utility helicopters is used to train and validate the model.

Presented at the American Helicopter Society 62nd Annual Forum, Phoenix, AZ, May 9-11, 2006. Copyright © 2006 by the American Helicopter Society International, Inc. All rights reserved.

Introduction

The promise of lower operating cost from HUMS has not been fully realized. While there is a direct benefit to aircraft safety and maintenance (due to improved usage measurements, diagnostics, Rotor Track and Balance, etc.), the goal of on condition maintenance has not been accomplished. On condition maintenance would require maintenance credit, as covered under [1]. This will be difficult to achieve in the near term, in part due to:
1. Limited direct evidence (service history of HUMS equipped aircraft, seeded fault testing),
2. Verification/Validation of the COTS Ground Station, and
3. HUMS Program Items (parts life tracking).
HUMS are expensive both monetarily and, from a systems perspective, in weight. On condition maintenance would give an economic incentive to install HUMS. Yet, as noted, the certification process for this will not be achieved soon. Alternatively, maintenance cost reduction could be realized by reducing the logistic footprint if reliable component prognostics were available.

The ability to reliably predict component failure 50 to 100 hours in the future would reduce operating cost. For example, operators could order long lead time parts prior to failure. This increases operational readiness, as aircraft would not be down waiting for a part. Alternatively, operators could change the way aircraft are allocated for missions, ensuring that the “best” aircraft would be used for the most critical missions. Finally, reliable prognostics will be important for certification for credit.

In this paper, we investigate a method of equipment prognostics based on the Markov process.

The Markov Models

In Markov models, we simplify a system into a stochastic process with a finite number of state values. This set of possible values of the process will be the set of non-negative integers {0, 1, 2, …}. Without loss of generality, these state values can be mapped to various component conditions: new, good, worn, warning, alarm. If X_n = i, then the process is in state i at time n. That is:
$$P\{X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0\} = P_{ij} \tag{1}$$

for all states i_0, i_1, …, i_{n-1}, i, j, and all n > 0. This process is known as a Markov chain [see 2, 3]. Equation 1 states that the conditional distribution of a future state X_{n+1}, given the past states X_0, X_1, …, X_{n-1} and the present state X_n, depends only on the present state. The probability P_ij represents the probability that the process will, when in state i, next make a transition into state j (Figure 1).

Since the probabilities are nonnegative, and since the process must make a transition into some state (even if it is the same state), we ensure that:

$$\sum_{j=0}^{\infty} P_{i,j} = 1, \quad i = 0, 1, \ldots \tag{2}$$

Figure 1. Notional Markov State Example (state diagram: New: 1, Good: 2, Worn: 3, Warning: 4, Alarm: 5, with self-transitions P1,1 to P5,5 and forward transitions P1,2, P2,3, P3,4, P4,5)

The Hidden Markov Models

In many cases, the number of states needed to represent the process is large. The number of required parameters for representing the transition probability is of order n^(k+1). This necessarily restricts the number of states to some smaller value of k. Unfortunately, in many cases, the observed sequential data of interest does not satisfy the Markov assumption for k states. To address this model shortcoming, we can hypothesize that at time t, past data in the sequence can be summarized concisely by a state variable [see 3, 4, 5]. In essence, the Hidden Markov Model (HMM) does not assume that the observed data sequence has a Markov property. However, another, unobserved (hidden) but related variable, the state variable, is assumed to exist and to have the Markov property.

For example, we do not observe or measure the true state of a component (e.g. Health), but a Health Indicator (HI) with some associated probability function. The relation between the observed sequence HI_1^t = {HI_1, …, HI_t} and the hidden state sequence H_1^t is given by the following conditional independence assumptions:

$$P(HI_t \mid H_1^t, HI_1^{t-1}) = P(HI_t \mid H_t) \tag{3}$$

and

$$P(H_{t+1} \mid H_1^t, HI_1^t) = P(H_{t+1} \mid H_t) \tag{4}$$

This suggests that the predicted state H_{t+1} and the observed data HI_t are completely conditioned by the state variable H_t. Equations 3 and 4 assume independence so that the joint distribution of the hidden (H) and observed (HI) variables can be simplified to (equation 5):

$$P(HI_1^T, H_1^T) = P(H_1) \prod_{t=1}^{T-1} P(H_{t+1} \mid H_t) \prod_{t=1}^{T} P(HI_t \mid H_t) \tag{5}$$

This joint probability function can then be specified by the initial state probabilities P(H_1), the transition probabilities P(H_t | H_{t-1}), and the emission probabilities P(HI_t | H_t). The strategy for finding these values will be discussed later.

The Hidden Semi-Markov Model Process

Consider a drive train component which is initially new. As it wears, the component can be characterized as progressing through k distinct states: new, good, worn, warning, alarm, prior to failure. The time spent in each state is a random variable with some mean and variance. The probability of transitioning into a state is P_ij. This process is appropriately modeled by the HSMM. The HSMM is similar to the HMM, except that each state can emit a sequence of observations. Note that if the amount of time that the process spends in a state before making a transition is identically 1, then the HSMM is just an HMM process.

The HSMM has four basic problems that must be solved for it to be of use to a researcher:
1. Given a sequence of observations HI, compute efficiently the emission probabilities P(HI_t | H_t),
2. Given a sequence of observations HI, compute the transition probabilities P(H_t | H_{t-1}),
3. Given a sequence of observations HI, compute the initial state probabilities P(H_1), and finally
4. Given a sequence of observation data HI and the transition probabilities P(H_t | H_{t-1}), calculate the duration distribution for a given state.
Dynamic programming techniques have been developed to efficiently solve these problems. A detailed description is given by [4] and [5].

The Observation Space: HI Definition

The observation space is some measured artifact of the process under investigation. In this example we are interested in the health of a shaft. As noted, in HUMS the typical shaft indicators are shaft order 1, 2 and 3. It can be shown that the probability distribution function for shaft order magnitudes is a Rayleigh distribution [see 6] if the shaft is healthy (e.g. no mass imbalance, no higher order eccentricity due to the shaft being bent, and no loose or cracked coupling). We wish to map the measured shaft orders into one value that measures the health of the component. There are many ways to do this. For example, if the operator has knowledge as to the shaft order values representing damage, a multiple hypothesis technique could be employed [see 6].
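The Rayleigh claim above can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes that, for a healthy shaft, the real and imaginary parts of a shaft order vibration vector are independent, zero-mean Gaussians with a common standard deviation `sigma` (an assumption made here for the demonstration). Under that assumption the magnitude is Rayleigh distributed, with mean sigma*sqrt(pi/2) and variance (2 - pi/2)*sigma^2. All function names are hypothetical.

```python
import math
import random


def simulate_shaft_order_magnitudes(n_samples, sigma, seed=0):
    """Magnitudes of a vector with independent zero-mean Gaussian real and
    imaginary parts (assumed model of a healthy shaft order measurement)."""
    rng = random.Random(seed)
    return [math.hypot(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for _ in range(n_samples)]


def rayleigh_mean(sigma):
    """Theoretical mean of a Rayleigh distribution with scale sigma."""
    return sigma * math.sqrt(math.pi / 2.0)


def rayleigh_variance(sigma):
    """Theoretical variance of a Rayleigh distribution with scale sigma."""
    return (2.0 - math.pi / 2.0) * sigma ** 2


def unit_variance_scale():
    """Scale sigma for which the Rayleigh variance equals 1."""
    return math.sqrt(1.0 / (2.0 - math.pi / 2.0))


magnitudes = simulate_shaft_order_magnitudes(100_000, sigma=1.0)
sample_mean = sum(magnitudes) / len(magnitudes)
sample_var = sum((x - sample_mean) ** 2 for x in magnitudes) / len(magnitudes)

# Compare the simulated moments with the Rayleigh theory.
print("mean:", sample_mean, "theory:", rayleigh_mean(1.0))
print("variance:", sample_var, "theory:", rayleigh_variance(1.0))
print("scale giving unit variance:", unit_variance_scale())
```

The last line evaluates sqrt(1/(2 - pi/2)), the same expression the paper uses when it normalizes the shaft orders to unit variance.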
Generally, due to the limited service history of HUMS equipped aircraft, a more conservative approach is to test whether the component is not normal. In this case, we can design a formal hypothesis test that the component is no longer normal; we reject the null hypothesis that the component is normal and say it is damaged. The HI algorithm is then based on an n-dimensional Rayleigh hypothesis test.

The N-Dimensional Rayleigh Hypothesis Test

A convention is desired that maps the measured shaft orders into a range of zero to 1, and rejects the null hypothesis (e.g. says the component is bad) when the HI is greater than 0.9. One can then scale the mapping in such a way that for some nominal component, there is a constant probability of rejecting the null hypothesis. One says that the probability of false alarm, the PFA, is the probability of saying the component is bad when in fact it is good. This allows one to define an n-dimensional Rayleigh hypothesis test. Formally, one tests the hypothesis that:

Ho: HI < .9 (e.g. the component is not in alarm) vs.
H1: HI >= .9 (e.g. the component is in alarm)

In this convention, alarm implies that the component is in need of repair, and that the probability of a nominal shaft exceeding an HI of .9 is some small probability (the probability of false alarm, PFA), say 10^-3.

In our mapping, we wish to normalize the measured shaft order magnitudes by the standard deviation of the shaft order. One such mapping is:

$$HI = \left[ M^T \Sigma^{-1} M \right]^{0.5} \tag{6}$$

This mapping has some convenient properties. The distribution of the transformed shaft orders now has a variance of 1. The scale parameter of the underlying Rayleigh is then: σ = sqrt(1/(2 - π/2)) = 1.5264. See [6] for a more complete derivation.

As a rule, any operation performed on the distribution results in a function of that distribution, and a probability distribution function (PDF) can be found for it. This is important in that critical values can then be calculated. The distribution of this HI function can be found using the method of moments [7]. When this is done for equation 6 (where the distribution being operated on is a Rayleigh), one finds that this function has the same moments as the Nakagami distribution [8] and therefore has a Nakagami PDF. The Nakagami PDF is defined as:

$$f(x) = \frac{2}{\Gamma(m)} \left(\frac{m}{\Omega}\right)^{m} x^{2m-1} \exp\left(-\frac{m x^2}{\Omega}\right) \tag{7}$$

where m is the ratio of moments (sometimes called the fading figure) and Ω = E[x^2]. If m = 1, the Nakagami reverts to the Rayleigh. The Nakagami shape parameters can be derived explicitly from the Rayleigh when the Rayleigh has been normalized by the standard deviation. For 3 Rayleigh distributions, the mean value of the Nakagami is the expected value of the Rayleigh (which can be explicitly calculated as 1.5264*sqrt(π/2) * 3 = 5.73), and the variance (being normalized to 1) is just 3. Using these values, m is 2.869, and Ω = 36. The threshold at which the null hypothesis is rejected is then the inverse cumulative distribution function of the Nakagami with the given parameters. For this example, the PFA was set at 10^-3: the critical value is 11.736. This means that after performing the operations in equation 6, if the value is greater than 11.736, we reject the null hypothesis and say the HI is in alarm. Since our convention is to map this between zero and one, with an HI of .9 representing alarm, we now have our health algorithm:

$$HI = \left[ M^T \Sigma^{-1} M \right]^{0.5} \cdot \frac{0.9}{11.736} \tag{8}$$

For more details on functions of distributions for health algorithms, see [9].

An Example Using Generator Shafts

The HSMM method for diagnostics and prognostics is not based on a physical model of the shaft and coupling. Because of this, generating the coefficients of the emission probabilities P(HI_t | H_t), the transition probabilities P(H_t | H_{t-1}) and the initial state probabilities P(H_1) requires real world data. Goodrich’s IMD-HUMS has been installed on 30+ UH-60 aircraft. Currently over 140,000 data acquisitions have been gathered in a period of a year and a half. IMD-HUMS monitors twenty-five shafts, in addition to gear and bearing components. IMD-HUMS has identified a number of shafts that have been in the process of degradation, which in general is a rare event. As a component type, the generator shafts have provided the richest set of training data, as they have the highest wear rate.

The generator shaft is geared through the transmission accessory module. The shaft has a phenolic coupler that is a sacrificial connection between the generator shaft and the accessory gearbox. When the shaft adapter fails, the generator shaft no longer transmits torque to the generator, resulting in a power failure. The crew is then required to land at the next opportunity, causing a mission abort. The generator shaft is a good example to study as there is only one failure mode, that being the spline coupling. Shafts such as the input drive shaft could have a number of failure modes, such as:
1. Out of balance due to water, corrosion,
2. Bent Shaft
3. Broken Flange/Damaged Coupling
Prognostics on these shafts would require first determining the failure mode, then using the appropriate HSMM. Additionally, the load on a generator shaft is fairly constant. On a shaft such as the input power shaft, we would need to address the effect of torque on the transition probabilities P(H_t | H_{t-1}).

The initial training set was the time series of HIs for 30 aircraft, for both the right and left generator shafts. Of the initial set, data from aircraft that had repairs done were not used. We wanted to ensure that the transition probability was zero when moving to a lower health state. Additionally, there were 2 examples where the HI was 1. These aircraft have been identified as requiring maintenance as it was
feared this data could skew the training set. After this initial set reduction, 21 out of 43 sets were chosen randomly for training.

The HSMM was designed to use 10 states. After training, the hidden health and mean time to transition were (Table 1):

STATE          HEALTH    TIME (MIN)
New      1     0.0376    21.34
Good     2     0.1322    24.00
         3     0.2374    19.41
         4     0.3628    19.81
         5     0.4629    21.57
Worn     6     0.5477    18.47
         7     0.5886    21.97
         8     0.6886    15.85
Warning  9     0.7832    23.11
Alarm    10    0.8301    23.50
Table 1. Calculated Mean State Values and Emission Times

The transition probabilities P(H_t | H_{t-1}) were:
P1,1 = 0.99848, P1,2 = 0.0015246
P2,2 = 0.99952, P2,3 = 0.00047508
P3,3 = 0.99884, P3,4 = 0.001158
P4,4 = 0.99759, P4,5 = 0.0024131
P5,5 = 0.99809, P5,6 = 0.0019135
P6,6 = 0.99328, P6,7 = 0.0067243
P7,7 = 0.99249, P7,8 = 0.007511
P8,8 = 0.88741, P8,9 = 0.11259
P9,9 = 0.98392, P9,10 = 0.016082

The initial state probabilities P(H_1) were calculated to be (Table 2):

STATE          HEALTH    P(H1)
New      1     0.0376    0.4543
Good     2     0.1322    0.3076
         3     0.2374    0.1429
         4     0.3628    0.0238
         5     0.4629    0.0476
Worn     6     0.5477    0.0238
         7     0.5886    0
         8     0.6886    0
Warning  9     0.7832    0
Alarm    10    0.8301    0
Table 2. Calculated P(H1) for a Given State

The P(H_1) values state that 90% of the generator shafts began monitoring in good health (states 1 to 3), and only 2 percent of the generator shafts were in a worn state. With these values, diagnostics and prognostics can be performed.

HSMM for Machinery Diagnostics

Now that the four basic probabilities required to characterize the HSMM have been estimated, a number of analyses can be done to address diagnostics and prognostics. For diagnostics, one would want to state the health of the component. For the HSMM, this is the problem of sequence classification. Here one wishes to classify the measured HI into one of the k health classes. This is done by calculating the conditional probability of the health given the observed health indicator and the prior probabilities of the health state:

$$B_{i,t} = P(HI_t \mid H_t = i) \tag{9}$$

Then the Viterbi algorithm [4, 5] is used to find the most likely state path over time. Figure 2 is an example of the left generator shaft that progresses from Good (Health State 2) to Warning (Health State 9).

Figure 2. AC545 Left Generator Shaft Health Index and Health vs. Time (axes: Health Index and Health State vs. Time in hours)

Figure 3 is an example from aircraft 516, which shows the progression of the Health State from 5 (marginally worn) to 10 (alarm and in need of replacement).

Figure 3. AC516 Left Generator Shaft Health Index and Health vs. Time (axes: Health Index and Health State vs. Time in hours)

The HSMM does an exceptionally good job at identifying the appropriate health state in a noisy environment. We will now investigate the HSMM as a prognostic tool.

HSMM for Machinery Prognostics

For prognostics, one wishes to find the ULR until the component requires maintenance. Depending on the risk
associated with component failure, rules can be devised to quantify the required confidence that the part remains serviceable. As an example, take the generator shaft for AC516 (Figure 3). The component was initially in state 5 (Slightly Worn). The duration of state 5 was 14.9 hours (D(H_5) = 14.9, but it must be noted that this was an incomplete sample; there is no knowledge of when the shaft transitioned into state 5). Once in state 6, the component duration was 35.2 hours (D(H_6) = 35.2). The component transitioned through the intervening states until state 9, where maintenance would be required, after 101.46 hours of operation. At the instant when the component transitioned to state 6, the ULR was 86.56 hours. The total life of the part is then:

$$T = \sum_{i=1}^{N} D(H_i) \tag{10}$$

State Duration Model Based Prognostics

In the previous work of [5], a state duration model was proposed. The prognostics based on this model had excellent results. The state duration model based the duration density P(d_i | H_i) on the Gaussian distribution, where:

$$D(H_i) = \mu(H_i) + \rho\,\sigma^2(H_i) \tag{11}$$

$$\rho = \frac{T - \sum_{i=1}^{N} \mu(H_i)}{\sum_{i=1}^{N} \sigma^2(H_i)} \tag{12}$$

The calculated mean duration in health state i and its standard deviation are given in Table 3.

STATE          µ(Hi), hours   σ(Hi)
New      1     223.2          126.0
Good     2     135.8          161.6
         3     42.1           46.3
         4     47.9           40.1
         5     28.8           17.3
Worn     6     83.2           80.8
         7     34.4           43.5
         8     23.0           25
Warning  9     72.1
Table 3. Mean and Standard Deviation of State Duration.

Note that state 1 is underestimated because the test did not commence with the installation of the component. There was only one sample for state 9 due to the limited size of the training data, which explains the lack of a standard deviation. Table 3 allows one to derive the expected time remaining after a transition into a new state. For example, the mean time to Warning (state 9) from Good (state 2) is 394 hours. One could then put bounds on risk by assigning a probabilistic confidence. The operator may wish to be 90% confident that no repair would be required after T flight hours if the component was in state s and required maintenance in state k. Then T could be calculated as:

$$T = \sum_{i=s}^{k} \left[ D(H_i) - 1.28\,\sigma(H_i) \right] \tag{13}$$

where 1.28 is the magnitude of the inverse Gaussian cumulative distribution function at .1. Alternatively, we could use a Markov property to generate a probabilistic bound on the ULR.

Chapman-Kolmogorov Equations

The one-step transition probability is defined in equation 1. Now let one define the n-step transition probability P^n_ij to be the probability that a process in state i will be in state j after n additional transitions:

$$P\{X_{n+m} = j \mid X_m = i\} = P_{ij}^{n}, \quad n \ge 0 \tag{14}$$

The Chapman-Kolmogorov equations provide a method of computing these n-step transition probabilities. These equations are (see [2]):

$$P_{ij}^{n+m} = \sum_{k=0}^{\infty} P_{ik}^{n} P_{kj}^{m} \tag{15}$$

P^n_ik P^m_kj is the probability that, starting in state i, the process will go into state j in n+m transitions through a path which takes it into state k at the nth transition. Summing over all intermediate states k yields the probability that the process will be in state j after n+m transitions. If we let P^(n) denote the matrix of n-step transition probabilities P^n_ij, then equation 15 asserts that P^(n+m) = P^(n) P^(m). By induction, P^(n) = P^(n-1+1) = P^(n-1) P, which is simply P^n. Thus, the n-step transition probability matrix is obtained by multiplying the matrix P by itself n times.

As an example, in 500 transitions (P^500, which is 178 hours), a component starting in state 1 will have the following probability distribution:
P1 = 0.4663, P2 = 0.4680,
P3 = 0.0538, P4 = 0.0088,
P5 = 0.0025, P6 = 0.0003.
The cumulative probability of the component requiring maintenance (e.g. being in state 9 or 10) is zero. Alternatively, a component beginning in state 5 (worn) would have the following probability distribution:
P5 = 0.384, P6 = 0.139,
P7 = 0.134, P8 = 0.009,
P9 = 0.063, P10 = 0.271.
The cumulative probability of requiring maintenance on this component is 0.33, or a third.

From the operator’s perspective, this is a valuable tool for scheduling maintenance. A maintenance practice could be established that orders parts when there is a 50% probability of requiring maintenance in 100 hours. In Figure 4, the probability of requiring maintenance (e.g. the sum of the probabilities of ending in state 9 or 10) is plotted vs. time (e.g. flight hours) when starting in state 3, 4, 5 or 6. When starting in state 1 or 2, the probability of requiring maintenance over this time scale is nearly 0. If starting in state 7 or above, the probabilities quickly converge towards 1. The mean time (50% probability) of requiring maintenance is:

Starting State    Time to 50% Probability of Maintenance
H3                668.8 hours
H4                387.7 hours
H5                241.9 hours
H6                88.9 hours

Note that any maintenance policy could be enacted: a more conservative approach would decrease the probability of not
requiring maintenance (or increase the probability of requiring maintenance). For example, if one chose 90%, starting from state 5, T for H5 would drop to 96 hours.

Figure 4. Probability of Requiring Maintenance vs. Time for State 3, 4, 5 and 6 (axes: Probability of Requiring Maintenance vs. Time in hours)

Conclusions and Summary

HSMM is a powerful tool for diagnostics and prognostics. As a generalized solution strategy, HSMM provides a wealth of information on the behavior of the underlying damage process in mechanical components. In particular, the diagnostics/classification seems well suited to mapping health indicators into a discrete health state. Additionally, the prognostics methodology presented herein can give useful guidance to an operator on:
• When parts should be ordered,
• Aircraft allocation,
• Scheduling maintenance.
That said, a number of shortcomings of the technique were identified during the course of the study. First, its data requirement is intensive. This can be problematic for helicopter transmissions, which are, for the most part, very reliable. It is unlikely that there will be training data for every component in the drive train.

Further, it is well known that component damage is a function of torque, or throughput power. Ideally, the HSMM probability transition matrix would be adjusted by torque input. While the current example was well suited for HSMM, a component such as an input drive shaft that is under varying torque conditions would have greater variation in its prognostics capability (although the diagnostics would be as valid).

While HSMMs do have some shortcomings, when training data is available, they prove to be a good solution strategy for diagnostics and prognostics.

References
[1] U.S. Department of Transportation, Federal Aviation Administration, Advisory Circular AC-29 MG 15: Airworthiness Approval of Rotorcraft Health Usage Monitoring Systems (HUMS), pp. 986-1002.

[2] Ross, S., Introduction to Probability Models, Academic Press, Boston, 1989, pp. 135-189.

[3] Rabiner, L., “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proceedings of the IEEE, Vol. 77, No. 2, February 1989.

[4] Murphy, K., Hidden Semi-Markov Models, http://www.cs.ubc.ca/~murphyk/Software/index.html, November 2002.

[5] Dong, M., He, D., “Hidden Semi-Markov Models for Machinery Health Diagnosis and Prognosis”, Transactions of NAMRI/SME, Vol. 32, 2004.

[6] Bechhoefer, E., Bernhard, A., “Use of Non-Gaussian Distribution for Analysis of Shaft Components”, IEEE Aerospace, Big Sky, March 2006.

[7] Wackerly, D., Mendenhall, W., Scheaffer, R., Mathematical Statistics with Applications, Duxbury Press, Belmont, 1996.

[8] Proakis, John G., Digital Communications, McGraw-Hill, Boston, MA, 1995, pp. 45-46.

[9] Bechhoefer, E., Mayhew, E., “Mechanical Diagnostics System Engineering in IMD-HUMS”, IEEE Aerospace, Big Sky, March 2006.