
ISA TRANSACTIONS

ELSEVIER ISA Transactions 34 (1995) 311-318

The "primary integrity parameters" - Design parameters for safety systems


Steven E. Smith a, Paul Gruhn b,*
a ICS Triplex, Inc., Torrance, CA, USA
b Industrial Control Services, Inc., 10595 Westoffice Drive, Houston, TX 77042-5389, USA

Abstract

This paper discusses the Primary Integrity Parameters (PIPs) - design attributes that, as a group, determine the level of safety integrity achieved by a Programmable Electronic System (PES). These parameters include redundancy level, failure rates and modes, diagnostic coverage factor, common cause failure rates, on-line manual test interval/duration, maintainability and security. The paper demonstrates that the level of safety provided by a PES is not simply a factor of any one of these attributes, but is determined by the total combination of the PIPs. Examples are given to show the dependency of the system safety on each of the parameters.
Keywords: Safety integrity; Operational integrity; Availability; Redundancy; Failure rates; Diagnostics; Coverage; Test interval; Common-cause failures; Maintainability; Security

1. Introduction

As redundant and fault tolerant systems have gained growing acceptance for safety-related roles in process control, the industry has attempted to discover the underlying principles that contribute to the overall dependability of these systems. In the case of emergency shutdown and safety interlock systems, the industry has come to recognize that the two most significant reliability attributes of a safety-related system are:
1. its "Safety Integrity", or probability of providing the required function on demand (variously expressed by parameters such as Safety Availability, Safety Integrity, Risk (or Hazard) Reduction Factor, Probability of Performance on Demand or, conversely, Probability of Failure on Demand), and
2. its "Operational Integrity" (or "up-time"), that is, its ability to keep on operating and not be subject to a "nuisance trip" that would interfere with the overall process production (even though it is safe). This attribute is variously characterized by parameters such as Availability, Operational Availability and MTBF.

Today many safety system specifications address both of these attributes; for example, requiring that a safety-related system achieve a certain Safety Availability while maintaining some minimum mean time between nuisance trips. These parameters may be specified quantitatively or qualitatively. In the case of a qualitative specification, the level of performance is specified in terms of a relative scale (e.g., 1 to 4) without actually specifying quantitative numbers for the target safety parameters. In either the quantitative or the qualitative case, the principles remain the same. That is, the objective of the safety system specification is to specify a system that will provide the desired level of protection without interfering with normal production.

In this paper, the terms that are used are the "Hazard Reduction Factor" (HRF) to represent the level of Safety Integrity and the "Mean Time Between Nuisance Trips" (MTBNT) to represent the Operational Availability of the system. The HRF is defined as HRF = 1/PFD, where PFD = probability of failure on demand = 1 - Safety Availability [1,2]. The HRF can be viewed as the amount "more safe" that the process becomes as a result of inclusion of the safety system [3]. For representing the Operational Integrity of the safety system, this paper uses the term "Mean Time Between Nuisance Trips", which is the mean time between overt failures of the safety system. Since this number directly expresses how often, on the average, nuisance trips will occur, it is quite meaningful and intuitive.

In recent years much attention has been paid to discovering the basic system design attributes that contribute to a safety-related system achieving a given level of Safety Integrity and up-time [2,4,5]. Some of the parameters that contribute to achieving an overall level of reliability are redundancy level, failure rates of components, diagnostic coverage and test interval. The papers referenced above each explore the sensitivity of the overall Safety Integrity to one or more of these parameters. One of the challenges of the ISA SP84 committee (which is developing a standard for "Application of Safety Instrumented Systems for the Process Industries") is to understand these underlying parameters that contribute to safety. Only by identifying and characterizing the basic design parameters that contribute to system safety can we know that we achieve the levels of safety we desire.

With the objective of identifying the pertinent safety system design attributes that contribute to the overall Safety Integrity and system up-time, the ISA SP84.03 subcommittee, in conjunction with the SP84.02 subcommittee, agreed on seven "Primary Integrity Parameters" (PIPs) to characterize the design of safety systems. These parameters are:
- system redundancy configuration (e.g. triple, dual),
- failure rates and failure modes of the elements that make up the system,
- coverage factor of automatic diagnostics,
- common cause failure rates or probabilities,
- test interval and duration for on-line manual testing,
- maintainability,
- security.

By providing proper attention to these parameters in the safety system design process, one gains assurance that the resulting system will achieve the target Safety Integrity and up-time. Furthermore, reliability analyses based on these parameters can analytically predict the overall safety performance of a system in terms of the Safety Integrity and the Operational Availability.

* Corresponding author. Tel.: (713) 268-1334; fax: (713) 266-3072.
0019-0578/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved. SSDI 0019-0578(95)00025-9
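The relation HRF = 1/PFD, with PFD = 1 - Safety Availability, can be illustrated with a minimal Python sketch. The function names and the sample Safety Availability of 0.999 are illustrative assumptions and are not taken from the analyses in this paper.

```python
def pfd_from_safety_availability(safety_availability: float) -> float:
    """PFD = 1 - Safety Availability."""
    if not 0.0 <= safety_availability <= 1.0:
        raise ValueError("Safety Availability must lie in [0, 1]")
    return 1.0 - safety_availability


def hazard_reduction_factor(pfd: float) -> float:
    """HRF = 1 / PFD."""
    if pfd <= 0.0:
        raise ValueError("PFD must be positive for a finite HRF")
    return 1.0 / pfd


# Hypothetical example value, not a result from the paper.
pfd = pfd_from_safety_availability(0.999)
hrf = hazard_reduction_factor(pfd)
print(f"PFD = {pfd:.4g}, HRF = {hrf:.4g}")
```

Read this way, a Safety Availability of 0.999 corresponds to a process that is roughly a thousand times "more safe" than it would be without the safety system.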

2. Development of system model

A "generic" set of Markov models was developed that accounts for all of the Primary Integrity Parameters. These models are represented by Fig. 1. Here, various levels of redundancy are added by inserting intermediate states between the fully operational state (State 0) and the failure states (States 2 and 4). Depending on the level of redundancy, as many intermediate "failed-operational" states as required can be added (e.g. States 1 and 3). The generic framework shown can be used to model duplex and triplex systems. By eliminating States 1 and 3, simplex systems are modeled. The integrity parameters are represented in this model as follows:
- N is the redundancy level,
- Ls is the safe (overt) failure rate,
- Ld is the dangerous (covert) failure rate,
- B is the common-cause rate (% of failures that are common-cause),
- c is the coverage factor,
- k is the recovery rate from manual testing (function of Test Interval; see equation below),
- u is the repair rate (function of maintainability).

[Figure 1: Markov state diagram with State 0 (Operational), intermediate States 1 and 3, State 2 (Safe) and State 4 (Unsafe). Transition labels legible in the source: NcLs, (N-1)cLs, BLs, N(1-c)Ld, (N-1)(1-c)Ld, BLd.]
Fig. 1. General form of Markov model for analyzing PIPs.

This model is simplified in some areas that do not significantly affect the results of the analyses. Some of the transitions are worth noting: specifically, the common-cause failures (which transition at rates BLs and BLd) lead directly to failure states, because they affect all redundant units.

When the system is failed covertly (State 4), the failure will not be immediately revealed. In fact, unless some periodic proof test is done on the system to assure its readiness, it may remain in the dangerous state indefinitely. If, however, the system is proof-tested occasionally, the covert failure will be discovered and the system can be restored. This test-and-restore process is modeled by the transition from State 4 to State 0 and occurs at rate k, the restoration rate from a covert failure mode, defined as

$$k = \frac{1}{(TI/2) + \mathrm{MTTR}}$$

where TI is the mean test interval for proof testing. Note that the denominator in k has two elements: the mean time to discover the failure (TI/2) and the MTTR. The term (TI/2) is used since, on the average, faults may be assumed to occur halfway between proof tests, that is, at 1/2 of the Test Interval, TI [6].

Regular proof testing and repair has the effect of restoring the system to its original state. Fig. 2 shows a graphical example of the effect of proof testing on the reliability of a component [7]. In Fig. 2(a), the component is tested at system commissioning time, but not tested thereafter. If the component is not tested during the life of the installation, the probability that it is still operating decreases with time according to the equation

$$R = e^{-\lambda t}$$

After sufficient time, there is very low probability that the component will be working. If, however, the system is tested and restored on a regular basis, as shown in Fig. 2(b), the average reliability will be very high, since the $\lambda t$ term in the above equation remains small.

Fig. 2. Effect of periodic proof testing.
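The restoration rate and the effect of proof testing on reliability can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, each proof test is assumed to restore the component to as-new condition, and the failure rate of one failure per 10 years is used only as an example input.

```python
import math

HOURS_PER_YEAR = 8760.0


def restoration_rate(test_interval_h: float, mttr_h: float) -> float:
    """k = 1 / (TI/2 + MTTR), in restorations per hour."""
    return 1.0 / (test_interval_h / 2.0 + mttr_h)


def reliability(failure_rate_per_h: float, t_h: float) -> float:
    """R(t) = exp(-lambda * t) for a component that is never re-tested."""
    return math.exp(-failure_rate_per_h * t_h)


def mean_reliability_with_testing(failure_rate_per_h: float,
                                  test_interval_h: float) -> float:
    """Average of R(t) over one test interval, assuming every proof test
    restores the component to as-new condition (an idealization)."""
    x = failure_rate_per_h * test_interval_h
    return (1.0 - math.exp(-x)) / x


# Illustrative inputs: 12-month test interval, 4-hour MTTR,
# one failure per 10 years.
k = restoration_rate(HOURS_PER_YEAR, 4.0)
lam = 1.0 / (10.0 * HOURS_PER_YEAR)

print(f"k = {k:.3e} per hour (mean restoration time {1.0 / k:.0f} h)")
print(f"Untested, after 20 years: R = {reliability(lam, 20 * HOURS_PER_YEAR):.3f}")
print(f"Tested yearly, average:   R = {mean_reliability_with_testing(lam, HOURS_PER_YEAR):.3f}")
```

The averaged value stays close to 1 as long as the product of failure rate and test interval is small, which is the behavior described for Fig. 2(b).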

3. Analysis of the effects of the primary integrity parameters

Using the general model configuration developed above and represented in Fig. 1, the effect of each integrity parameter is studied by holding the other parameters constant and varying the parameter to be studied. To facilitate this, a "baseline" PES is assumed with the parameters given in Table 1. The baseline system is analyzed using the Markov model in Fig. 1, producing baseline values for the covert and overt failure rates and probabilities. These attributes are then converted into the parameters HRF and MTBNT to express the Safety Integrity and Operational Integrity of the system. After this, each integrity parameter is varied to show its effect on HRF and MTBNT.

Table 1
Baseline system configuration

Integrity parameter                            Baseline value
System redundancy configuration                Triple
Failure rates and failure modes of the         CPUs = 1 failure per 10 years;
  elements that make up the system             I/O modules = 1 failure per 100 years (a)
Coverage factor of diagnostics                 99%
Common cause failure rates or probabilities    0% (no percentage of the failures are "common-cause")
Test interval/duration for manual tests        12 months interval; 0 hours test duration
Maintainability                                4 hour mean time to repair any failed module in system
Security                                       100%

(a) For the purposes of these analyses, two I/O modules are considered as the number of modules required to initiate a shutdown or to inhibit a safe shutdown.
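To make the conversion from state probabilities to HRF and MTBNT concrete, the following Python sketch solves the simplex reduction of the Markov model (States 1 and 3 eliminated) in closed form. It is an illustration only, not the model used for the results in this paper. The assumptions are: overt failures move the system from State 0 to State 2 and are repaired at rate u = 1/MTTR; covert failures move it from State 0 to State 4 and are restored at rate k; PFD is taken as the steady-state probability of State 4; and the nuisance trip frequency is taken as the probability of State 0 times the overt failure rate. The function names and the sample rates are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def simplex_steady_state(overt_rate, covert_rate, repair_rate, restoration_rate):
    """Steady-state probabilities of an assumed three-state chain.

    States: 0 (operational), 2 (failed safe), 4 (failed unsafe).
    Assumed transitions (all rates per hour):
      0 -> 2 at overt_rate,   2 -> 0 at repair_rate (u)
      0 -> 4 at covert_rate,  4 -> 0 at restoration_rate (k)
    """
    safe_ratio = overt_rate / repair_rate           # P2 / P0 from balance at State 2
    unsafe_ratio = covert_rate / restoration_rate   # P4 / P0 from balance at State 4
    p0 = 1.0 / (1.0 + safe_ratio + unsafe_ratio)    # probabilities sum to 1
    return {"operational": p0,
            "safe": p0 * safe_ratio,
            "unsafe": p0 * unsafe_ratio}


def integrity_metrics(overt_rate, covert_rate, mttr_h, test_interval_h):
    """Return (HRF, MTBNT in years) for the assumed simplex chain."""
    u = 1.0 / mttr_h
    k = 1.0 / (test_interval_h / 2.0 + mttr_h)
    p = simplex_steady_state(overt_rate, covert_rate, u, k)
    pfd = p["unsafe"]                                # time fraction spent failed unsafe
    hrf = 1.0 / pfd
    trip_frequency = p["operational"] * overt_rate   # nuisance trips per hour
    mtbnt_years = 1.0 / trip_frequency / HOURS_PER_YEAR
    return hrf, mtbnt_years


# Hypothetical rates: 1e-5 overt and 1e-6 covert failures per hour, 4-hour MTTR.
for months in (12, 3):
    hrf, mtbnt = integrity_metrics(1e-5, 1e-6, 4.0, months * HOURS_PER_YEAR / 12.0)
    print(f"TI = {months:2d} months: HRF = {hrf:.0f}, MTBNT = {mtbnt:.1f} yr")
```

In this sketch, shortening the test interval raises k, lowers the time spent in State 4, and therefore raises the HRF, while the MTBNT is governed almost entirely by the overt failure rate.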

3.1. Redundancy

To consider the effects of redundancy, various configurations may be considered, including simplex, duplex and triplex. Duplex and triplex systems have the obvious advantage that they can, if designed with the proper diagnostics and fault handling mechanisms, tolerate single failures and keep on operating safely. This is the major advantage and is reflected in the comparison of the HRF and MTBNT between the baseline triplex system and a simplex system with similar PIPs (except for redundancy level). This comparison is shown graphically in Fig. 3. (Note: Dual 1oo2 stands for 1 out of 2, where only one channel must function in order to perform a shutdown. 2oo2 stands for 2 out of 2, where both channels must function in order to perform a shutdown.)

[Figure 3: two bar charts on logarithmic scales, Trip Rate and Hazard Reduction Factor, for the Simplex, Dual (1oo2), Dual (2oo2) and Triple configurations.]
Fig. 3. Impact of redundancy - comparison of simplex, duplex and triplex systems.

3.2. Failure rates and modes

Having a triplex or duplex system, however, does not guarantee adequate performance. The failure rates and failure modes of the basic hardware and software elements have a dramatic effect on the system reliability parameters. For example, by increasing the CPU and I/O failure rates to 1 failure per year and 1 failure per 10 years, respectively (one order of magnitude increase), the HRF and the MTBNT for a triplicated system are reduced by two orders of magnitude. This relationship is shown graphically in Fig. 4. This sensitivity drives home the necessity of having quality hardware and software, regardless of the redundancy level.

[Figure 4: two bar charts on logarithmic scales, Trip Rate and Hazard Reduction Factor, for Low MTBF and High MTBF.]
Fig. 4. Effect of failure rates on a triplicated system.

3.3. Coverage factor

The coverage factor is defined as the ratio of failures that are automatically detected by internal PES diagnostics to the total number of all possible failures. Coverage factor has been demonstrated to have a significant effect on the safety integrity of PESs [8]. This effect is further demonstrated by the comparison of two systems: the baseline with 99% coverage and another identical system with 90% coverage. This difference in coverage accounts for a difference in excess of 1 order of magnitude in the HRF of the systems, as shown in Fig. 5. In order for systems to be safe, redundancy and high reliability components are not sufficient; built-in diagnostics must assure that a high percentage of all possible faults will be detected and handled by the system. Any failure that goes undetected can contribute to a covert failure mode and defeat the safe operation of the system.

[Figure 5: two bar charts on logarithmic scales, Trip Rate and Hazard Reduction Factor, for Low Diagnostics and High Diagnostics.]
Fig. 5. Effects of diagnostic coverage factor on a triplicated system.

3.4. Common-cause failures

A common-cause failure is a failure mechanism that can result in multiple redundant elements of the system failing. The failures in the redundant units may be simultaneous or not. Generally, common cause failures are a concern only if the redundant units fail simultaneously or near-simultaneously, that is, within less than the test interval at which they are tested. Examples of common cause failures are (1) a sensitivity of the system to electromagnetic interference that could cause two or more redundant components to fail in the same way at the same time, or (2) a manufacturing defect that could cause sensors to drift out of calibration. This second example may not cause a problem, unless the drift rate were faster than the test and calibration interval.

The problem with common-cause failures can be observed by considering the Markov model in Fig. 1. Any common cause failure can cause the system to fail in a covert or overt failure mode immediately, circumventing all of the redundancy and diagnostics within the system. Clearly, common-cause failure sources must be eliminated to the maximum extent practical in redundant systems.

The single most identifiable potential source of common-cause failures in most redundant systems is the software system. Usually, the software in the redundant computing elements is of a single design, so if one of the redundant systems has a bug, they all have the same bug. For this reason, extensive methods are employed in the specification, generation and validation of safety-related software. Vendor-supplied and application-specific software for safety systems requires special design and maintenance techniques. The safety-system user must be aware of these considerations and impose appropriate standards and controls on the purchase and development of such software [9].

The effects of common-cause failures are shown graphically in Fig. 6. If merely 1% of the failures have a common cause, the HRF of a triplicated system is reduced by an order of magnitude (as compared to the baseline in Fig. 3). Control of common-cause failures is the key to achieving safety.

[Figure 6: two bar charts on logarithmic scales, Trip Rate and Hazard Reduction Factor, plotted against the percentage of failures that are common-cause; the values 1% and 10% are legible in the source.]
Fig. 6. Effects of common-cause failures on a triplicated system (as a percentage of total failures).

3.5. Test interval and duration

As shown earlier, periodic manual proof testing (and repair) can produce an improvement in system reliability. This is true not only of simplex systems, but of redundant systems as well. In effect, manual proof testing should catch any covert failures that are not caught due to imperfect automatic diagnostics. In systems with high diagnostic coverage, proof testing will not produce as effective returns, as the diagnostics already cover most faults. Proof testing will have dramatic benefits in systems with lower levels of automatic diagnostics. For example, to demonstrate the effects of proof testing, the baseline system is not used directly; instead, it has been modified to have only 90% fault coverage. Fig. 7 shows the effects of manual proof testing on this system. Note that the more frequent the testing, the higher the HRF. These results are consistent with the principle demonstrated in Fig. 2: a smaller $\lambda t$ will produce a more reliable system.

Fig. 7. Effects of manual proof testing.

In considering proof testing, one should take into account whether the manual proof test requires taking the system off-line. If so, then the proof test duration is critical, since the system is unavailable during the off-line tests. This directly impacts the Safety Availability, and thus the HRF. Ideally, proof testing should be done on-line, or with minimal interruption to the safety system availability. During proof testing, other means, such as manual monitoring and shutdown, should be employed to respond to safety demands. When possible, proof tests should be scheduled during normal system turn-arounds, when it is safe to take the safety system off-line.

3.6. Maintainability

Maintainability is a key attribute of all safety systems, whether redundant with on-line hot repair or simplex, requiring off-line repair. The ability to restore the system from a failure or partial failure state correctly and quickly directly impacts safety, although in fault-tolerant systems this effect is not as dramatic as that of some of the other integrity parameters. Fig. 8 shows the effects of 8-hour and 1-hour repair times on the baseline system. Since the triplex system continues to operate during the repair process, there is little effect on overall HRF. However, extremely long repair times will have an effect, as the longer a redundant system goes unrepaired, the higher the probability of a second fault occurring that could result in a system-level failure.

[Figure 8: two bar charts on logarithmic scales, Trip Rate and Hazard Reduction Factor.]
Fig. 8. Impact of maintainability on the baseline system.

3.7. Security

Security is considered a Primary Integrity Parameter, even though its effect is basically the same as a common-cause failure. It is separated out as a distinct parameter, as it is a fundamental field of study and encompasses many attributes of the system, including software and hardware configuration control, operator access and limitations, and maintenance access and operations.

3.8. Summary comparison of four systems

As a final review of the effects of the Primary Integrity Parameters, four systems are compared. These are a "good" triplex system, a "poor" triplex system, a "good" simplex system and a "poor" simplex system. Their PIP attributes are summarized in Table 2. The results of analyzing these systems are given in Table 3.

Table 2
Integrity factors for comparison systems

Integrity parameter     Good triplex   Poor triplex   Good simplex   Poor simplex
CPU fail rate           1/10 yr        1/yr           1/10 yr        1/yr
I/O module fail rate    1/100 yr       1/10 yr        1/100 yr       1/10 yr
CPU coverage            99%            95%            95%            80%
I/O coverage            99%            95%            95%            50%
Common cause            1%             10%            1%             10%
Manual test interval    3 mo           36 mo          3 mo           36 mo
MTTR                    1 hr           8 hr           1 hr           8 hr

Table 3
Safety performance of comparison systems

System configuration    Mean time to nuisance trip    Hazard reduction factor
Good triplex            1,400 yr                      1,500,000
Poor triplex            13 yr                         110
Good simplex            14 yr                         3,300
Poor simplex            1.4 yr                        5

The main conclusion to be drawn is that the PIPs are all important and must be given equal consideration. One cannot say summarily that a triplex system is better than a simplex system, because a well designed simplex system can provide better safety performance than a poorly designed triplex system. The Primary Integrity Parameters are in effect a "safety chain", only as strong as its weakest link. That is to say that redundancy alone will not provide high integrity without low-failure-rate components; redundancy and low failure rates will not provide integrity without high fault coverage, etc. Each parameter must be adequately addressed in a safety system design in order to assure that overall integrity is achieved.
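The comparison in Table 3 can be checked with a short Python script. The values are transcribed from Table 3; the data structure and function name are illustrative choices, and PFD is derived from the HRF using HRF = 1/PFD.

```python
# Values transcribed from Table 3:
# (mean time to nuisance trip in years, hazard reduction factor)
TABLE_3 = {
    "Good triplex": (1400.0, 1_500_000.0),
    "Poor triplex": (13.0, 110.0),
    "Good simplex": (14.0, 3_300.0),
    "Poor simplex": (1.4, 5.0),
}


def rank_by_hrf(table):
    """Return system names ordered from highest to lowest HRF."""
    return sorted(table, key=lambda name: table[name][1], reverse=True)


for name in rank_by_hrf(TABLE_3):
    mtbnt_yr, hrf = TABLE_3[name]
    print(f"{name:13s} HRF = {hrf:>11,.0f}  PFD = {1.0 / hrf:.2e}  "
          f"MTBNT = {mtbnt_yr:g} yr")
```

Ordered by HRF, the good simplex system ranks above the poor triplex system by a factor of 30, and it also has a slightly longer mean time to nuisance trip (14 years against 13 years).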


References

[1] P. Gruhn, "Safety system performance terms: Clearing up the confusion", Hydrocarbon Processing (February 1993).
[2] B.W. Balls, "Determination of specified availability for a process plant safety protection system", Control Expo (Chicago, 1989).
[3] Instrument Society of America, SP.84 Draft Standard for "Application of Safety Instrumented Systems for the Process Industries", Draft 17, September 1995.
[4] S.E. Smith, "Fault coverage in plant protection systems", ISA CHEMPID Conference (St. Louis, May 1990).
[5] A.A. Frederickson, "Fault tolerant control systems for use in safety applications", ISA/88 International Conference (Houston).
[6] B.W. Balls, A.B. Rentcome and J.A. Wilkenson, "Specification and design of safety systems for the process industries", 8th International System Safety Conference (New Orleans, 1987).
[7] K.L. Wade, "Programmable controller reliability improvement method for batch control operations", Engineering Society of Detroit, Programmable Controller Conference (Detroit, April 1986).
[8] S.E. Smith, "System-level reliability analysis for applying fault tolerant controls", ISA CHEMPID Symposium (Edmonton, April 1991).
[9] N.G. Levison, "Software safety: What, why and how", Association for Computing Machinery - Computing Surveys (June 1986).
