
Reliability

What is reliability
Reliability is the ability of a component or system to perform its required functions, at the specified levels of performance, without failure, in the specified environment and under the specified operational loads, for the time required. A reliable product will have fewer faults within its specified lifetime and thus lower operational costs. A reliable product will also be safer where failures have severe consequences. When discussing reliability it is important to understand the difference between component reliability, system reliability and availability. This page aims to give a practical introduction to the basics of reliability theory.

Component reliability
Component reliability is the reliability of a single component. A component could range from a small part to a larger assembly. Reliability is always related to the component's functional outputs. Component reliability can only be improved by changes to the design, material selection, surface treatments and operational conditions. Mathematical models for component reliability do not directly improve the reliability of the component itself, but they aid design optimisation and can be used to predict the reliability of existing components, for example the probability that a component will survive 10 years in operation. Such quantitative reliability estimates require access to realistic component experience data. The data should cover a large number of units operated in a relevant environment and for a long period of time (if long-term estimates are required). Most reliability estimation techniques also need a realistic mathematical model that describes the experience data. Typical expressions are:

Probability density function f(t)
Probability function F(t)
Survivor function R(t)
Failure rate z(t)
Mean time to failure E(T)

The lifetime exposure of a component is normally given in years or hours, but could also be on/off cycles (switch), km (car), rotations (shaft) and load cycles (bolt).

System reliability
System reliability is the reliability of a group of several components. System reliability depends on component reliability and on how the components are configured as a logical structure. The basic system reliability structures are:

Series structure (NooN)
Parallel structure (1ooN)
K out of N structure (KooN)

Note that KooN refers to a system that is functioning as long as K out of a total of N component parts of a system are functioning. A system is often configured as a combination of the above basic structures. System reliability can be optimised by calculations.

System reliability calculations
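
A minimal sketch of these calculations (Python; the component reliabilities used here are hypothetical, and independent components are assumed) for the three basic structures:

```python
from math import comb, prod

def series(reliabilities):
    # NooN: every component must work
    return prod(reliabilities)

def parallel(reliabilities):
    # 1ooN: system fails only if every component fails
    return 1.0 - prod(1.0 - r for r in reliabilities)

def k_out_of_n(k, n, r):
    # KooN with n identical, independent components of reliability r
    return sum(comb(n, i) * r**i * (1.0 - r)**(n - i) for i in range(k, n + 1))

# Hypothetical component reliabilities for a given mission time
print(series([0.99, 0.95, 0.98]))      # ~0.922
print(parallel([0.90, 0.90]))          # 0.99
print(k_out_of_n(2, 3, 0.95))          # ~0.993
```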

Availability
Availability is the probability that the component or system will be in an operational state at a given point in time. Availability is thus a function of reliability, maintainability and maintenance support. Availability is an indicator of component or system dependability and is often used as a measure of performance with regard to production (production availability) and safety (safety availability).


System Reliability and Availability


We have already discussed reliability and availability basics in a previous article. This article will focus on techniques for calculating system availability from the availability information for its components. The following topics are discussed in detail:

System Availability
o Availability in Series
o Availability in Parallel
o Partial Operation Availability
Availability Computation Example
o Understanding the System
o Reliability Modeling of the System
o Calculating Availability of Individual Components
o Calculating System Availability

System Availability
System availability is calculated by modeling the system as an interconnection of parts in series and parallel. The following rules are used to decide if components should be placed in series or parallel:

If failure of a part leads to the combination becoming inoperable, the two parts are considered to be operating in series.
If failure of a part leads to the other part taking over the operations of the failed part, the two parts are considered to be operating in parallel.

Availability in Series
As stated above, two parts X and Y are considered to be operating in series if failure of either of the parts results in failure of the combination. The combined system is operational only if both Part X and Part Y are available. From this it follows that the combined availability is a product of the availability of the two parts. The combined availability is shown by the equation below:

A = Ax × Ay
The implication of the above equation is that the combined availability of two components in series is always lower than the availability of either individual component. Consider a system in which Part X and Part Y are connected in series. The table below shows the availability and downtime for the individual components and for the series combination.

Component           Availability        Downtime
Part X              99% (2-nines)       3.65 days/year
Part Y              99.99% (4-nines)    52 minutes/year
X and Y Combined    98.99%              3.69 days/year

From the above table it is clear that even though a very high availability Part Y was used, the overall availability of the system was pulled down by the lower availability of Part X. This illustrates the saying that a chain is only as strong as its weakest link. More precisely, in terms of availability, the chain is even weaker than its weakest link.

Availability in Parallel

As stated above, two parts are considered to be operating in parallel if the combination is considered failed only when both parts fail. The combined system is operational if either part is available. From this it follows that the combined availability is 1 - (probability that both parts are unavailable). The combined availability is shown by the equation below:

A = 1 - (1 - Ax)(1 - Ay)

The implication of the above equation is that the combined availability of two components in parallel is always much higher than the availability of the individual components. Consider a system in which two instances of Part X are connected in parallel. The table below shows the availability for a single component and for parallel combinations.

Component                                   Availability
Single X component                          99% (2-nines)
Two X components operating in parallel      99.99% (4-nines)
Three X components operating in parallel    99.9999% (6-nines)

From the above table it is clear that even though a component with a fairly low availability (Part X) was used, the overall availability of the system is much higher. Parallel operation thus provides a very powerful mechanism for building a highly reliable system from components of lower reliability. For this reason, most mission critical systems are designed with redundant components. (Different redundancy techniques are discussed in the Hardware Fault Tolerance article.) A small sketch reproducing the numbers in the two tables above follows.
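
The following sketch (Python) reproduces the availability and downtime figures of the two tables above; the part availabilities are the ones quoted there:

```python
# Minimal sketch: series/parallel availability and annual downtime
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability):
    return (1.0 - availability) * MINUTES_PER_YEAR

def series_availability(*parts):
    a = 1.0
    for p in parts:
        a *= p
    return a

def parallel_availability(*parts):
    u = 1.0
    for p in parts:
        u *= (1.0 - p)
    return 1.0 - u

ax, ay = 0.99, 0.9999                       # Part X (2-nines), Part Y (4-nines)
print(series_availability(ax, ay))          # ~0.9899 -> about 3.69 days/year down
print(parallel_availability(ax, ax))        # 0.9999 (4-nines)
print(parallel_availability(ax, ax, ax))    # 0.999999 (6-nines)
print(downtime_minutes_per_year(0.9999))    # about 52.6 minutes/year
```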

Partial Operation Availability


Consider a system like the Xenon switching system. In Xenon, XEN cards handle the call processing for digital trunks connected to the XEN cards. The system has been designed to incrementally add XEN cards to handle subscriber load. Now consider the case of a Xenon switch configured with 10 XEN cards. Should we consider the system to be unavailable when one XEN card fails? This doesn't seem right, as 90% of subscribers are still being served. In such systems, where failure of a component leads to some users losing service, system availability has to be defined by considering the percentage of users affected by the failure. For example, in Xenon the system might be considered unavailable if 30% of the subscribers are affected. This translates to 3 XEN cards out of 10 failing. The availability of this system can be computed by calculating A(p,q) as specified below:

A(p,q) = C(q,p) × A^(q-p) × (1-A)^p

Here p is the number of failed units, q is the total number of units, A is the availability of a single unit, and C(q,p) is the number of ways of choosing p units out of q. A(p,q) is the probability that exactly p of the q units are down; the overall system availability is the sum of A(p,q) over all values of p for which the system is still considered available (here p = 0, 1 and 2).
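
A minimal sketch (Python) of this calculation, using the 10-card Xenon example; the per-card availability of 0.99 is an assumed value for illustration only:

```python
from math import comb

def partial_availability(q, max_failed, a):
    # Probability that at most max_failed of the q identical units are down,
    # i.e., the system is still considered available.
    return sum(comb(q, p) * a**(q - p) * (1.0 - a)**p for p in range(max_failed + 1))

# 10 XEN cards, system counted as available while at most 2 cards have failed;
# per-card availability a = 0.99 is assumed for illustration.
print(partial_availability(q=10, max_failed=2, a=0.99))   # ~0.99989
```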

Availability Computation Example


In this section we will compute the availability of a simple signal processing system.

Understanding the System


As a first step, we prepare a detailed block diagram of the system. This system consists of an input transducer which receives the signal and converts it to a data stream suitable for the signal processor. This output is fed to a redundant pair of signal processors. The active signal processor acts on the input, while the standby signal processor ignores the data from the input transducer; the standby just monitors the sanity of the active signal processor. The output from the two signal processor boards is combined and fed into the output transducer. Again, the active signal processor drives the data lines, while the standby keeps its data lines tristated. The output transducer outputs the signal to the external world. The input and output transducers are passive devices with no microprocessor control. The signal processor cards run a real-time operating system and signal processing applications. Also note that the system stays completely operational as long as at least one signal processor is in operation. Failure of an input or output transducer leads to complete system failure.

Reliability Modeling of the System


The second step is to prepare a reliability model of the system. At this stage we decide the parallel and serial connectivity of the system. The complete reliability model of our example system is shown below:

A few important points to note here are:

The signal processor hardware and software have been modeled as two distinct entities. The software and the hardware operate in series, as the signal processor cannot function if either the hardware or the software is not operational.
The two signal processors (software + hardware) combine to form the signal processing complex. Within the signal processing complex, the two signal processors are placed in parallel, as the system can function when one of the signal processors fails.
The input transducer, the signal processing complex and the output transducer are placed in series, as failure of any of the three parts will lead to complete failure of the system.

Calculating Availability of Individual Components


The third step involves computing the availability of individual components. MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair) values are estimated for each component (see the Reliability and Availability Basics article for details). For hardware components, MTBF information can be obtained from hardware manufacturers' data sheets. If the hardware has been developed in house, the hardware group would provide MTBF information for the board. MTTR estimates for hardware are based on the degree to which the system will be monitored by operators. Here we estimate the hardware MTTR to be around 2 hours. Once MTBF and MTTR are known, the availability of the component can be calculated using the following formula:

Availability = MTBF / (MTBF + MTTR)

Estimating software MTBF is a tricky task. Software MTBF is really the time between subsequent reboots of the software. This interval may be estimated from the defect rate of the system, or from previous experience with similar systems. Here we estimate the software MTBF to be around 2190 hours (roughly one failure every three months). The MTTR is the time taken to reboot the failed processor. Our processor supports automatic reboot, so we estimate the software MTTR to be around 5 minutes. Note that 5 minutes might seem to be on the high side, but the MTTR should include the following:

Time wasted in activities aborted due to the signal processor software crash
Time taken to detect the signal processor failure
Time taken by the failed processor to reboot and come back into service

Component                     MTBF            MTTR        Availability
Input Transducer              100,000 hours   2 hours     99.998%
Signal Processor Hardware     10,000 hours    2 hours     99.98%
Signal Processor Software     2,190 hours     5 minutes   99.9962%
Output Transducer             100,000 hours   2 hours     99.998%

Things to note from the above table are:

The availability of the software is higher than that of the hardware, even though the hardware MTBF is higher. The main reason is that the software has a much lower MTTR. In other words, the software fails more often, but it recovers quickly and therefore has less impact on system availability.
The input and output transducers have fairly high availability; thus fairly high system availability can be achieved even without making these components redundant.

Calculating System Availability


The last step involves computing the availability of the entire system. These calculations are based on the serial and parallel availability formulas given earlier.

Component                                                            Availability   Downtime
Signal Processing Complex (software + hardware)                      99.9762%       2.08 hours/year
Signal Processing Complex 0 and 1 operating in parallel (combined)   99.99999%      3.15 seconds/year
Complete System                                                      99.9960%       21.08 minutes/year
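
A minimal end-to-end sketch (Python) of the same calculation, using the MTBF/MTTR figures from the tables above:

```python
def availability(mtbf_hours, mttr_hours):
    # Steady-state availability from MTBF and MTTR
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*parts):
    a = 1.0
    for p in parts:
        a *= p
    return a

def parallel(*parts):
    u = 1.0
    for p in parts:
        u *= (1.0 - p)
    return 1.0 - u

a_in  = availability(100_000, 2)           # input transducer
a_hw  = availability(10_000, 2)            # signal processor hardware
a_sw  = availability(2_190, 5 / 60)        # signal processor software (5 minutes MTTR)
a_out = availability(100_000, 2)           # output transducer

a_complex = series(a_hw, a_sw)             # one signal processing complex: ~99.9762%
a_pair    = parallel(a_complex, a_complex) # redundant pair: ~99.99999%
a_system  = series(a_in, a_pair, a_out)    # complete system: ~99.996%

print(a_complex, a_pair, a_system)
print((1 - a_system) * 525_600, "minutes of downtime per year")  # ~21 minutes
```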

Understanding Series and Parallel Systems Reliability


Introduction
Reliability engineers often need to work with systems having elements connected in parallel and series, and to calculate their reliability. To this end, when a system consists of a combination of series and parallel segments, engineers often apply very convoluted block reliability formulas or use software calculation packages. As the underlying statistical theory behind the formulas is not always well understood, errors or misapplications may occur. The objective of this START sheet is to help the reader better understand the statistical reasoning behind reliability block formulas for series and parallel systems and to provide examples of practical ways of using them. This knowledge will allow engineers to use the software packages more correctly and to interpret their results. We start this START sheet by providing some notation and definitions that we will use in discussing non-repairable systems composed of series or parallel configurations:

1. All the "n" system component lives (X) are Exponentially distributed: f(t) = λ Exp{-λt}, for t > 0.
2. Therefore, every ith component Failure Rate (FR) is constant: λi(t) = λi.
3. All "n" system components are identical; hence, their FR are equal (λi = λ; 1 ≤ i ≤ n).
4. All "n" components (and their failure times) are statistically independent.
5. Denote the system mission time "T". Hence, the reliability of any ith component (1 ≤ i ≤ n) is Ri(T) = P(Xi > T) = Exp{-λT}.

Summarizing, in this START sheet we consider the case where life is exponentially distributed (i.e., component FR is time independent). First, examples will be given using identical components, and then examples will be considered using components with different FR. Independent components are those whose failure does not affect the performance of any other system component. Reliability is the probability of a component (or system) surviving its mission time "T". This allows us to obtain both component and system FR from their reliability specification. We will first discuss series systems, then parallel and redundant systems, and finally a combination of all these configurations, for non-repairable systems and the case of exponentially distributed lives. Examples of analyses and uses of reliability, FR, and survival functions are provided to illustrate the theory.

Reliability of Series Systems of "n" Identical and Independent Components


A series system is a configuration such that, if any one of the system components fails, the entire system fails. Conceptually, a series system is one that is as weak as its weakest link. A graphical description of a series system is shown in Figure 1.

Figure 1. Representation of a Series System of "n" Components (Click to Zoom)

Engineers are trained to work with system reliability [RS] concepts using "blocks" for each system element, each block having its own reliability for a given mission time T:

RS = R1 × R2 × ... × Rn (if the component reliabilities differ), or
RS = [Ri]^n (if all i = 1, ..., n components are identical)

However, behind the reliability block symbols lies a whole body of statistical knowledge. For, in a series system of "n" components, the following are two equivalent "events":

"System Success" = "Success of every individual component" Therefore, the probability of the two equivalent events, that define total system reliability for mission time T (denoted R(T)), must be the same:

The preceding assertion holds because Ri(T), the probability of any component succeeding in mission time T, is its reliability. All system components are assumed identical, with the same FR "λ", and independent. Hence, the product of all component reliabilities Ri(T) yields the entire system reliability R(T). This allows us to calculate R(T) using the system FR (λs = nλ), or the "nT" power of the unit-time component reliability [Ri(1)]^nT, or the "nth" power of the component reliability [Ri(T)]^n, for any mission time T. We will discuss, later in this START sheet, the case where different components have different reliabilities or FR. From all of the preceding considerations, we can summarize the following results when all elements of a system, which are identical, are connected in series:

1. The reliability of the entire system can be obtained in one of two ways:
   o R(T) = [Ri(T)]^n; i.e., the reliability Ri(T) of any component "i" to the power "n"
   o R(T) = [Ri(1)]^nT; the unit-time reliability of any component "i" to the power "nT"
2. System reliability can also be obtained by using the system FR λs: R(T) = Exp{-λs T}:
   o Since λs = λ + λ + λ + ... + λ = nλ (all component FR are identical)
   o The system FR λs is then the sum ("n" times) of all component failure rates (λ): R(T) = Exp{-(λ + λ + λ + ... + λ) T} = Exp{-nλT} = Exp{-λs T}
3. The component FR (λ) can be obtained from the system reliability R(T):
   o λ = [- ln R(T)] / (n T) (inverting the reliability results given in 1)
   o The component FR can also be obtained from the component reliability Ri(T): λ = - ln [Ri(T)]^n / (n T) = - ln [Ri(T)] / T
   o The previous expression is used for allocating the system FR λs among the system components
4. The total system FR λs can also be obtained from 3:
   o λs = [- ln R(T)] / T = - ln [Ri(T)]^n / T
   o λs = nλ remains time-independent in the series configuration
5. Allocation of component reliability Ri(T) from system requirements is obtained by solving for Ri(T) in the previous R(T) equations.
6. System "unreliability" = U(T) = 1 - R(T) = 1 - reliability.

One can calculate the various reliability and FR values for the special case of unit mission time (T = 1) by letting "T" vanish from all the formulas (i.e., substituting T by 1). One can obtain the reliability R(T) for any mission time T from R(1), the reliability for unit mission time: R(T) = [R(1)]^T = Exp{-λs T}.
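
To make the series-system results above concrete, here is a brief sketch (Python) using the five-terminal example treated in the numerical examples that follow:

```python
from math import exp, log

def series_reliability(r_component, n):
    # Result 1: R(T) = [Ri(T)]^n for n identical components in series
    return r_component ** n

def component_fr_from_system(r_system, n, t):
    # Result 3: lambda = -ln(R(T)) / (n T)
    return -log(r_system) / (n * t)

def system_fr(lam, n):
    # Results 2/4: lambda_s = n * lambda, time-independent for a series system
    return n * lam

# Five identical terminals in series, required system reliability R(1) = 0.999
n, r_sys = 5, 0.999
r_i = r_sys ** (1 / n)                            # component reliability ~0.9998
lam = component_fr_from_system(r_sys, n, 1.0)     # ~0.0002 per unit time
print(r_i, lam, series_reliability(r_i, n))       # recovers R(1) = 0.999
print(exp(-system_fr(lam, n) * 10))               # R(10) ~ 0.990
```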

Numerical Examples
The concepts discussed are best explained and understood by working out simple numerical examples. Let a computer system be composed of five identical terminals in series. Let the required system reliability, for unit mission time (T = 1) be R(1) = 0.999. We will now calculate each component's reliability, unreliability, and failure rate values.

From the data and formulas just given, each terminal reliability Ri(T) can be obtained by inverting the system reliability R(T) equation for unit mission time (T = 1):

Ri(1) = [R(1)]^(1/5) = [0.999]^0.2 = 0.9998

Component unreliability is: Ui(1) = 1 - Ri(1) = 1 - 0.9998 = 0.0002. Component FR is obtained by solving for λ in the equation for component reliability:

λ = - ln [Ri(1)] = - ln (0.9998) = 0.0002

Now, assume that component reliability for mission time T = 1 is given: Ri(1) = 0.999. We are now asked to obtain total system reliability, unreliability, and FR for the (computer) system and mission time T = 10 hours. First, for unit time:

R(1) = [Ri(1)]^5 = [0.999]^5 = 0.995; U(1) = 1 - R(1) = 0.005

Hence, the system FR is:

λs = - ln [R(1)] = - ln (0.995) = 0.005

If we require system reliability for mission time T = 10 hours, R(10), and the unit-time reliability is R(1) = 0.995, we can use either the 10th power or the FR λs:

R(10) = [R(1)]^10 = [0.995]^10 = 0.951, or equivalently R(10) = Exp{-λs × 10} = Exp{-0.05} = 0.951

If mission time T is arbitrary, then R(T) is called the "Survival Function" (of T). R(T) can then be used to find the mission time "T" that accomplishes a pre-specified reliability. Assume that R(T) = 0.98 is required and we need to find the maximum time T:

T = - ln [R(T)] / λs = - ln (0.98) / λs ≈ 4.03 hours

Hence, a Mission Time of T = 4.03 hours (or less) meets the requirement of reliability 0.98 (or more). Let's now assume that a new system, a ship, will be propelled by five identical engines. The system must meet a reliability requirement R(T) = 0.9048 for a mission time T = 10. We need to allocate reliability by engine (component reliability), for the required mission time T. We invert the formula for system reliability R(10), expressed as a function of component reliability. Then, we solve for component reliability Ri(10):

Ri(10) = [R(10)]^(1/5) = [0.9048]^0.2 = 0.98

We now calculate the system FR (λs) and MTTF for the five-engine system. These are obtained for mission time T = 10 hours and required system reliability R(10) = 0.9048:

λs = - ln [R(10)] / T = - ln (0.9048) / 10 = 0.01; MTTF = 1 / λs = 100 hours

FR and MTTF values can, equivalently, be obtained using the FR per component, yielding the same results:

λ = - ln [Ri(10)] / T = - ln (0.98) / 10 = 0.002; λs = 5λ = 0.01; MTTF = 1 / λs = 100 hours

Finally, assume that the required ship FR λs = 5λ = 0.010005 is given. We now need the component reliability, unreliability and FR, for unit mission time (T = 1):

R(1) = Exp{-λs} = Exp{-0.010005} = 0.99 = Exp{-5λ} = [Exp(-λ)]^5 = [Ri(1)]^5

Component reliability: Ri(1) = [R(1)]^(1/5) = [0.99]^0.2 = 0.998

Component unreliability: Ui (1) = 1 - Ri (1) = 1 - 0.998 = 0.002

Component FR: λ = [- ln R(1)] / (n × 1) = [- ln (0.99)] / 5 = 0.002

The Case of Different Component Reliabilities


Now, assume that different system components have different reliabilities and FR. Then:

R(T) = R1(T) × R2(T) × ... × Rn(T) = Exp{-(λ1 + λ2 + ... + λn) T} = Exp{-λs T}, where λs = Σ λi

The system Mean Time To Failure is then MTTF = 1/λs = 1/Σ λi. For example, assume that the five engines (components) in the above system (ship) have different reliabilities (maybe they come from different manufacturers, or exhibit different ages). Let their reliabilities, for mission time T = 10, be 0.99, 0.97, 0.95, 0.93, and 0.9, respectively. Then, total system reliability R(T) for T = 10 and the system FR are:

R(10) = 0.99 × 0.97 × 0.95 × 0.93 × 0.9 = 0.7636; λs = - ln [R(10)] / 10 = 0.02697

Since the system FR is λs = 0.02697, the system MTTF is 1/λs = 1/Σ λi = 1/0.02697 = 37.077 hours.
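
A short sketch (Python) reproducing this mixed-reliability example:

```python
from math import log, prod

def series_mixed(reliabilities, mission_time):
    # Series system with different component reliabilities at the same mission time
    r_sys = prod(reliabilities)
    fr_sys = -log(r_sys) / mission_time    # system failure rate lambda_s
    mttf = 1.0 / fr_sys
    return r_sys, fr_sys, mttf

engines = [0.99, 0.97, 0.95, 0.93, 0.9]    # engine reliabilities at T = 10
print(series_mixed(engines, mission_time=10))   # (~0.7636, ~0.02697, ~37.08)
```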

Reliability of Parallel Systems


A parallel system is a configuration such that, as long as not all of the system components fail, the entire system works. Conceptually, in a parallel configuration the total system reliability is higher than the reliability of any single system component. A graphical description of a parallel system of "n" components is shown in Figure 2.

Figure 2. Representation of a Parallel System of "n" Components (Click to Zoom)

Reliability engineers are trained to work with parallel systems using block concepts:

RS = 1 - Π(1 - Ri) = 1 - (1 - R1)(1 - R2) ... (1 - Rn); if the component reliabilities differ, or
RS = 1 - Π(1 - Ri) = 1 - [1 - R]^n; if all "n" components are identical [Ri = R; i = 1, ..., n]

However, behind the reliability block symbols lies a whole body of statistical knowledge. To illustrate, we analyze a simple parallel system composed of n = 2 identical components. The system can survive mission time T only if the first component, or the second component, or both components, survive mission time T (Figure 3). In the language of statistical "events":

R(T) = P{component 1 survives T, or component 2 survives T, or both survive T} = R1(T) + R2(T) - R1(T) R2(T)

Figure 3. Venn Diagram Representing the "Event" of Either Device or Both Surviving Mission Time (Click to Zoom)

This approach can easily be extended to an arbitrary number "n" of parallel components, identical or different. By expanding the formula RS = 1 - (1 - R1)(1 - R2)...(1 - Rn) into products, the well-known reliability block formulas are obtained. For example, for n = 3 blocks, when only one is needed:

RS = 1 - (1 - R1)(1 - R2)(1 - R3) = R1 + R2 + R3 - R1R2 - R1R3 - R2R3 + R1R2R3, or
RS = 1 - (1 - R)(1 - R)(1 - R) = 3R - 3R^2 + R^3 (if all components are identical: Ri = R; i = 1, ..., n)

Using instead the statistical formulation of the Survival Function R(T), we can obtain the system MTTF for an arbitrary mission time T. For, say, n = 2 arbitrary components:

MTTF = ∫ R(T) dT (integrated from 0 to infinity) = 1/λ1 + 1/λ2 - 1/(λ1 + λ2)

Finally, one can calculate the system FR λs(T) from the theoretical definition of the failure rate. For n = 2:

λs(T) = f(T) / R(T), where f(T) = -dR(T)/dT

Notice from this derivation that, even when every component FR (λ) is constant, the resulting parallel system Hazard Rate λs(T) is time-dependent. This result is very important!

Numerical Examples
Let a parallel system be composed of n = 2 identical components, each with FR λ = 0.01 and mission time T = 10 hours, only one of which is needed for system success. Then, total system reliability, by both calculations, is:

R(10) = 2 Exp{-λT} - Exp{-2λT} = 2 Exp{-0.1} - Exp{-0.2} = 1.8097 - 0.8187 = 0.991

Mean Time to Failure (in hours):

MTTF = 1/λ + 1/λ - 1/(2λ) = 3/(2λ) = 3/0.02 = 150 hours

The failure (hazard) rate for the two-component parallel system is now a function of T:

λs(T) = [2λ Exp{-λT} - 2λ Exp{-2λT}] / [2 Exp{-λT} - Exp{-2λT}]

This system hazard rate λs(T) can be calculated as a function of any mission time T, as shown in Figure 4.

Figure 4. Plot of the hazard rate λs(T) as a function of mission time T. The hazard rate λs(T) increases as time T increases. This plot can be used to find the λs(T) corresponding to a given mission time T. Say T = 10; then λs(T) is about 0.0018. (Click to Zoom)
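
A short sketch (Python) that reproduces these parallel-system numbers:

```python
from math import exp

lam, T = 0.01, 10.0   # component failure rate and mission time from the example

def r_parallel2(t, lam):
    # Reliability of two identical exponential components in parallel (1-out-of-2)
    return 2 * exp(-lam * t) - exp(-2 * lam * t)

def hazard_parallel2(t, lam):
    # System hazard rate: f(t) / R(t), with f(t) = -dR/dt
    f = 2 * lam * exp(-lam * t) - 2 * lam * exp(-2 * lam * t)
    return f / r_parallel2(t, lam)

mttf = 3 / (2 * lam)                      # 1/lam + 1/lam - 1/(2*lam)
print(r_parallel2(T, lam))                # ~0.991
print(mttf)                               # 150 hours
print(hazard_parallel2(T, lam))           # ~0.0018, matching Figure 4
```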

Reliability of "K out of N" Redundant Systems with "n" Identical Components
A "k" out of "n" redundant system is a parallel configuration where at least "k" of the system components are required to be fully operational at the completion time T of the mission, for the system to "succeed" (for k = 1 it reduces to a parallel system; for k = n, to a series one). We illustrate this using the example of a system operation depicted in Figure 5. The probability "p" of any system unit or component "i", 1 ≤ i ≤ n, surviving mission time T is:

p = Ri(T) = Exp{-λT}

Figure 5. Units Either Fail/Survive Mission Time (Click to Zoom)

All units are identical and "k" or more units, out of the "n" total, are required to be operational at mission time T, for the entire system to fulfill the mission. Therefore, the Probability of Mission Success (i.e., system reliability) is equivalent to the probability of obtaining "k" or more successes out of the possible "n" trials, with success probability p. This probability is described by the Binomial (n, p) distribution. In our case, the probability of success "p" is just the reliability Ri(T) of any independent unit or component "i", for the required mission time "T". Therefore, total system reliability R(T), for an arbitrary mission time T, is calculated by:

R(T) = Σ (i = k to n) C(n, i) p^i (1 - p)^(n - i)

Sometimes the complementary formula R(T) = 1 - Σ (i = 0 to k-1) C(n, i) p^i (1 - p)^(n - i) is used instead. This holds true because the two summations, over i = 0 to k-1 and over i = k to n, add up to one (together they cover all possible outcomes of the Binomial distribution).

The "summation" values are obtained using the Binomial Distribution tables or the corresponding Excel algorithm (formula).

Following the same approach as in the series system case, we obtain the MTTF by integrating the Survival Function R(T) over all mission times.

We can obtain all parameters for an arbitrary T by recalculating the probability p = Exp{-λT} of a component surviving this new mission time "T". In the special case of mission time T = 1, the "T" vanishes from all these formulas (i.e., substitute T by 1). Applying the immediately preceding assumptions and formulas, we obtain the following results:

The reliability R(T) of the entire system, for specified T, is obtained by:
o Providing the total number of system components (n) and the number required (k)
o Providing the reliability (for mission time T) of one component: Ri(T) = p
o Alternatively, providing the Failure Rate (FR) of one unit or component

The system MTTF can be obtained from R(T) using the preceding inputs and integrating the Survival Function: MTTF = ∫ R(T) dT (integrated from 0 to infinity).

The "Unreliability" = U(T) = 1 - Reliability = 1 - R(T)
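
A minimal sketch (Python) of the k-out-of-n reliability calculation, using the shuttle example that follows (n = 5, k = 2, p = 0.9) and the alternative design compared against it:

```python
from math import comb

def koon_reliability(k, n, p):
    # Probability of k or more successes out of n independent trials (Binomial)
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))

# 2-out-of-5 system with component reliability 0.9 for the mission time
print(koon_reliability(2, 5, 0.9))    # ~0.99954

# Less expensive 5-out-of-8 design with component reliability 0.8
print(koon_reliability(5, 8, 0.8))    # ~0.9437 (lower than the first design)
```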

Numerical Example
Let there be n = 5 identical components (computers) in a system (shuttle). Define system "success" as k = 2 or more components (computers) running during re-entry. Let every component (computer) have a reliability Ri(1) = 0.9, and let the mission "re-entry" time be T = 1. If each component has a reliability Ri(T) = p = 0.9, then total system (shuttle) reliability R(T) and the component FR (λ) are obtained as:

R(1) = Σ (i = 2 to 5) C(5, i) (0.9)^i (0.1)^(5-i) = 0.99954; λ = - ln [Ri(1)] = - ln (0.9) = 0.105

The system MTTF is then obtained by integrating the corresponding Survival Function R(T).

Now, assume that a less expensive design is being considered, consisting of n = 8 identical components in parallel. The new design requires that at least k = 5 units be working for successful completion of the mission. Assume that the mission time is T = 1 and the new component FR is λ = 0.223144. Compare the two system reliabilities and MTTFs. First, we need to obtain the new component reliability Ri(T) = p for T = 1:

p = Ri(1) = Exp{-λ} = Exp{-0.223144} = 0.8

Proceeding as before, we obtain the new total system reliability for unit mission time:

R(1) = Σ (i = 5 to 8) C(8, i) (0.8)^i (0.2)^(8-i) = 0.9437

The cheaper (second) design is, therefore, less reliable (and has a lower MTTF) than the first design.

Combinations of Configurations
Some systems are made up of combinations of several series and parallel configurations. The way to obtain system reliability in such cases is to break the total system configuration down into homogeneous subsystems. Then, consider each of these subsystems separately as a unit and calculate its reliability. Finally, put these simple units back (via series or parallel recombination) into a single system and obtain its reliability. For example, assume that we have a system composed of the combination, in series, of the examples developed in the previous two sections. The first subsystem consists of two identical components in parallel. The second subsystem consists of a "2 out of 5" (parallel) redundant configuration, also composed of five identical components (Figure 6). Assume also that the Mission Time is T = 10 hours.

Figure 6. A Combined Configuration of Two Parallel Subsystems in Series (Click to Zoom)

Using the same values as before for subsystem A (two identical components in parallel, with FR λ = 0.01 and mission time T = 10 hours), we can calculate its reliability as:

RA(10) = 2 Exp{-0.1} - Exp{-0.2} = 0.991

Similarly, subsystem B ("2 out of 5" redundant) has five identical components, of which at least two are required for subsystem mission success, with R3(1) = R4(1) = R5(1) = R6(1) = R7(1) = 0.9 for T = 1. We first recalculate the component reliability for the new mission time T = 10 and then calculate subsystem B reliability as follows:

p = Ri(10) = [Ri(1)]^10 = [0.9]^10 = 0.349; RB(10) = Σ (i = 2 to 5) C(5, i) (0.349)^i (0.651)^(5-i) = 0.569

Recombining both subsystems, we get a series system consisting of subsystems A and B. Therefore, the combined system reliability, for mission time T = 10, is:

R(10) = RA(10) × RB(10) = 0.991 × 0.569 = 0.564

This result immediately shows which subsystem is driving down the total system reliability and sheds light on possible measures that can be taken to correct the situation.
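
A short sketch (Python) combining the two subsystems as described above:

```python
from math import comb, exp

def parallel2_identical(lam, t):
    # Subsystem A: two identical exponential components, one needed
    return 2 * exp(-lam * t) - exp(-2 * lam * t)

def koon(k, n, p):
    # Subsystem B: k-out-of-n identical components with reliability p
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))

T = 10.0
r_a = parallel2_identical(lam=0.01, t=T)       # ~0.991
p_b = 0.9 ** T                                 # component reliability at T = 10
r_b = koon(2, 5, p_b)                          # ~0.569
print(r_a, r_b, r_a * r_b)                     # combined series system ~0.564
```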

Summary
The reliability analysis for the case of non-repairable systems, for configurations in series, in parallel, "k out of n" redundant, and their combinations, has been reviewed for the case of exponentially distributed lives. When component lives follow other distributions, we substitute the corresponding density function in the reliability formulas R(T) and redevelop the algebra. Of particular interest is the case when component lives have an underlying Weibull distribution, with shape parameter β and scale parameter η:

f(t) = (β/η)(t/η)^(β-1) Exp{-(t/η)^β}; R(T) = Exp{-(T/η)^β}

Here, we substitute these values into equations 1 through 5 of the first section and 1 through 6 of the second section and redevelop the algebra. Due to its complexity, this case will be the topic of a separate START sheet. Finally, for those readers interested in pursuing these studies at a more advanced level, we provide a useful bibliography in the For Further Study section.

For Further Study

1. Kececioglu, D., Reliability and Life Testing Handbook, Prentice Hall, 1993.

2. Hoyland, A. and M. Rausand, System Reliability Theory: Models and Statistical Methods, Wiley, NY, 1994.

3. Nelson, W., Applied Life Data Analysis, Wiley, NY, 1982.

4. Mann, N., R. Schafer, and N. Singpurwalla, Methods for Statistical Analysis of Reliability and Life Data, John Wiley, NY, 1974.

5. O'Connor, P., Practical Reliability Engineering, Wiley, NY, 2003.

6. Romeu, J.L., Reliability Estimations for Exponential Life, RIAC START, Volume 10, Number, http://theriac.org/DeskReference/viewDocument.php?id=214&Scope=reg

About the Author


* Note: The following information about the author(s) is the same as what was on the original document and may not be correct anymore. Dr. Jorge Luis Romeu has over thirty years of statistical and operations research experience in consulting, research, and teaching. He was a consultant for the petrochemical, construction, and agricultural industries. Dr. Romeu has also worked in statistical and simulation modeling and in data analysis of software and hardware reliability, software engineering, and ecological problems. Dr. Romeu has taught undergraduate and graduate statistics, operations research, and computer science in several American and foreign universities. He teaches short, intensive professional training courses. He is currently an Adjunct Professor of Statistics and Operations Research for Syracuse University and a Practicing Faculty of that school's Institute for Manufacturing Enterprises. For his work in education and research and for his publications and presentations, Dr. Romeu has been elected Chartered Statistician Fellow of the Royal Statistical Society, Full Member of the Operations Research Society of America, and Fellow of the Institute of Statisticians. Romeu has received several international grants and awards, including a Fulbright Senior Lectureship and a Speaker Specialist Grant from the Department of State, in Mexico. He has extensive experience in international assignments in Spain and Latin America and is fluent in Spanish, English, and French. Romeu is a senior technical advisor for reliability and advanced information technology research with Alion Science and Technology, previously IIT Research Institute (IITRI). Since rejoining Alion in 1998, Romeu has provided consulting for several statistical and operations research projects. He has written a State of the Art Report on Statistical Analysis of Materials Data, designed and taught a three-day intensive statistics course for practicing engineers, and written a series of articles on statistics and data analysis for the AMPTIAC Newsletter and RIAC Journal.

Achieving High Reliability


By: Larry H. Crow, Ph.D.

This article discusses issues related to the concepts presented by the author in his paper, "On the Initial System Reliability," Proceedings of the 1986 Annual Reliability and Maintainability Symposium, pp. 115-119, Las Vegas, NV.

Introduction
In today's environment of reduced development budgets, faster times to market, reduced test time, and the wide use of non-developmental items, attaining high reliability for complex systems is very difficult but critical, because reliability affects not only system performance but also operating and support costs. Achieving high reliability is receiving increased interest and was addressed at the June 9-10, 2000 Committee on National Statistics Workshop on Reliability Issues for DoD Systems, held at the National Academy of Sciences, Washington, DC. In the author's almost 30 years in the reliability field, he has observed why high reliability requirements are not met. He then identified eight principles that consistently yield very high reliability systems. He found that applying these principles did not increase the overall costs of the reliability program but, as implementation was refined and better understood, actually decreased them by more than one third. This methodology simply integrates sound reliability and parts management strategies in early design. This article provides a discussion of the issues being addressed and an overview of the eight principles.

Discussion
Data presented at the June workshop showed that many of today's new DoD systems fall short of their operational reliability requirements based on the results of Operational Testing (OT). Typically OT occurs after Development Testing (DT), and the OT reliability estimate is often the total test time divided by the total number of observed failures. This estimate is of an MTBF, which is often the reliability parameter of choice but may not be the most meaningful reliability parameter (particularly for systems consisting of both a repairable and a non-repairable segment). Generally, DT objectives include evaluating performance and reliability parameters, identifying problems, and making management and engineering decisions on the incorporation of corrective actions. The measure of reliability during DT is a function of several factors, including the total amount of test time and the value of reliability at the beginning of this testing. Everything else being equal, the less test time available, the higher the initial reliability must be to reach the reliability goal at the end of DT. Another potentially significant factor is delaying corrective actions until late in testing, say just prior to OT. Assessing the impact of these delayed fixes on the total system reliability is generally not straightforward and requires the use of a proper projection methodology. A commonly used method overestimates the system reliability after delayed fixes and may indicate that the reliability meets requirements when in fact it does not. These are just some of the reasons that the OT reliability may be lower than desired. The author recognized that reduced development budgets and schedules make a corresponding reduction in DT testing inevitable. Consequently, the initial reliability going into testing must be higher than it has been in the past. Initial reliability is the result of the early, basic engineering design effort for reliability and is the input into DT testing. Initial reliability is a key metric and a measure of how effective the basic reliability tasks, such as requirements analysis, trade studies, modeling, allocation, prediction, failure modes and effects analysis (FMEA), and parts and vendor selection, have been. What has the initial reliability been in the past? For an answer, we look at studies conducted by the US Army Materiel Systems Analysis Activity (AMSAA). In 1984 and 1990, AMSAA conducted two studies of Army systems. Both studies showed that the ratio of the initial MTBF to the final mature system MTBF was about 1:4 to 1:3. If the final mature reliability was 1000 hours MTBF, for example, then the initial reliability coming out of early design and entering DT was an average of 250 to 300 hours MTBF. These studies also showed that the average amount by which a failure mode's rate of occurrence was reduced because of corrective actions, the Effectiveness Factor (EF), was about 70%. That is, corrective actions increase the failure mode MTBF by an average factor of about 3.3 (conversely, a problem failure mode's rate is reduced, on average, by 70%). If we couple this fact with the concept that a valid reliability prediction estimates the inherent, mature failure rate of a failure mode, then we have a basis for a reliability growth (RG) metric in design. The 1:4 to 1:3 ratio may have been acceptable several years ago, when more DT test time was available. Today, with much less test time, such a ratio will not allow the potential reliability to be reached. A logical solution to the problem of low OT results is to increase the initial reliability in early design. This can be accomplished by performing the same reliability tasks noted earlier, but somewhat differently, and by applying a metric that estimates the initial reliability during design. If the initial reliability is actually improving in design, as it should be, then reliability growth in design is occurring. With a higher initial reliability, the RG program in DT has a better chance of success. This integrated RG is the framework for the reliability management principles presented later. The framework is based on systematically managing failure mode identification, classification, analysis, and mitigation. In this paper, a failure mode is a problem and a cause. A given problem can result from multiple causes, and corrective action takes place on a problem-and-cause basis. The 70% EF noted earlier applies to corrective action on a problem and cause, and relates to this definition of failure mode.

At the end of this discussion are listed the eight principles or features of a reliability program that the author has applied and that consistently yield highly reliable, state-of-the-art systems. Many others have successfully applied these basic principles, and examples were given at the June 2000 Workshop. In the author's applications, the programs had a preliminary design phase (PDP) and a final design phase (FDP). The PDP included requirements analysis, trade studies, preliminary modeling, allocation, redundancy analyses, preliminary prediction, preliminary FMEA, and preliminary parts and vendor selection. In the final design phase, more complete reliability tasks were conducted. Also, during this phase, potential problem failure modes were systematically identified and mitigated (RG in design), with metrics to track progress using the FMEA. In the FMEA, failure modes are classified as either a potential A mode or a potential B mode. A failure mode is a B mode until it meets the criteria for an A mode, which are:

1. There is a numerical calculation of the failure rate.
2. This numerical calculation is substantiated by at least one of the following: analysis, analogy, or test.
3. The failure rate is acceptable given the system reliability requirement or goal.

In an ideal situation, if all failure modes are classified as A modes, then the overall system failure rate should equal or be close to the reliability prediction. On the other hand, an investigation may prove that a failure mode classified as a potential B mode does not need any improvement; that is, it already satisfies the A-mode criteria. However, corrective action (i.e., reselection of a part or vendor, added redundancy, mitigation of environmental stress, better materials, wider design tolerances, or manufacturing changes) may be needed. The amount of actual improvement will depend on the EF. If an average EF is applied, such as 0.7, then the failure rate assigned to a B mode before investigation is 3.3 times the predicted failure rate. Of course, any assigned EF or B mode failure rate deemed appropriate can be applied. This approach would estimate the initial MTBF to be somewhere between 30% and 100% of the predicted value, depending on the percentage of A modes assigned to the system in the FMEA. As potential B modes are mitigated, this estimate would increase. This is the reliability growth metric discussed under Principle 5. See Figure 1.

Figure 1. Metric is estimate of MTBF when design improvement is stopped
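To make the metric concrete, the following is a minimal sketch, not the author's tool, assuming a series system model with constant failure rates and the average EF of 0.7 quoted above. The failure-rate figures and the helper function name are invented for illustration.

```python
# Hypothetical sketch of the "initial reliability" (RG in design) metric.
# Assumptions: series system, constant failure rates, average EF = 0.7,
# so an unproven B mode is assigned 1 / (1 - EF) ~ 3.3 times its predicted rate.

def initial_mtbf(modes, ef=0.7):
    """modes: list of (predicted_failure_rate_per_hour, is_a_mode) tuples."""
    total_rate = 0.0
    for predicted_rate, is_a_mode in modes:
        if is_a_mode:
            total_rate += predicted_rate               # substantiated A mode
        else:
            total_rate += predicted_rate / (1.0 - ef)  # potential B mode, inflated
    return 1.0 / total_rate

# Invented example: four modes, each predicted at 250e-6 failures/hour
# (predicted system MTBF = 1000 h); two are A modes, two remain B modes.
modes = [(250e-6, True), (250e-6, True), (250e-6, False), (250e-6, False)]
print(f"Initial MTBF estimate: {initial_mtbf(modes):.0f} h")  # ~462 h (46% of prediction)
```

With every mode still a B mode the estimate is 30% of the prediction; with every mode substantiated as an A mode it reaches 100%, matching the range quoted above.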

Recommended Principles for a Successful Reliability Program


1. Requirements and Failure Definition Analysis. Requirements must be fully understood and determined to be attainable using current technology. Also, failure should not be confused with performance. The most meaningful reliability metric may not be MTBF, particularly when the system consists of both a repairable and a non-repairable segment. A useful metric in this case is Probability of Mission Success, which considers mission length, total calendar time for the mission, reliability, repair time, and total spares allocation.

2. Integrated Reliability Growth Testing (IRGT). In many cases reliability problems surface early in engineering tests. The focus of these tests is typically on performance and not reliability. Therefore, if a problem is not brought to the attention of reliability engineering, it may not be corrected early, when correction is most cost effective and impacts schedules least. IRGT simply piggybacks reliability failure reporting, in an informal fashion, on all engineering tests: when a potential reliability failure is observed, notify reliability engineering.

3. Closed Loop Failure Mode Mitigation Process. Usually, patent or potential reliability problems can be mitigated by the reliability engineer and the product design team. Sometimes, however, a potential problem needs special management attention due to high risk, cost, criticality, additional screening or testing, or schedule impact. Without a focused approach, resolution can be time consuming and expensive. For these critical problems, a reliability mitigation process at the system engineer and program manager level can greatly decrease the time and cost of a solution. In this process, the concern is documented and assigned to the appropriate person for resolution, in much the same way as a failure is reported; in this case, however, the failure has not yet occurred. The process is most effective when managed by the program manager, system engineer, and reliability manager.

4. The Parts and Vendor Selection Process Addresses Reliability. Parts and vendor selection must be conducted in early design, since most of the parts used in early design are used in the final design. Immediately after the design engineer has determined that a part can perform the desired function, it should undergo a parts and vendor selection process for reliability assessment and approval. That is, the part must be shown to provide the function and be reliable before being approved for use. Because vendor quality for that part affects the part's reliability, this process should evaluate the part and vendor combination, not just the part. Depending on the information obtained, this assessment will lead to a reliability estimate based on data or a prediction using, for example, the RAC's PRISM model. Only if this estimate is consistent with the allocation or expectations will the part be formally approved. Some mitigation options are to consider other parts or vendors, subject the part to additional screening, incorporate redundancy, or accept more risk. When the mitigation options require additional resources or potential redesign, increase cost or schedule, or are high risk, then others (e.g., program manager, systems engineer, product team leader, design engineer, reliability engineer) may need to get involved. To do this efficiently, the closed loop failure mode mitigation process is used. This process focuses on a solution and risk management in a documented and effective manner.

5. Manage the Failure Mitigation Process with the FMEA and Calculate the Metric. The FMEA should be used to identify the system's failure modes and also to identify potential problem areas affecting reliability and safety. This purpose can easily be met by adding a column to a standard FMEA sheet and classifying each failure mode according to its A/B mode status. In the preliminary design phase the assigned reliability value for each failure mode would typically be the allocation or prediction. In the FDP, the A modes are given their calculated values and the potential B modes' failure rates are increased using the EF approach or some other method. These estimates are put into the system reliability model to generate an estimate of the initial reliability metric. As more B modes are mitigated, the metric will increase.

6. Formal Reviews for Reliability. A formal review for reliability should be held at least once in both the PDP and the FDP. These reviews give the latest reliability status of the system and baseline the reliability model to the current design. This assures that the reliability model and engineering design agree and that earlier proposed design changes (e.g., redundancy) are reflected in the current design. In the PDP the allocated values and early predictions are presented; in the FDP, the initial reliability metric is presented.

7. Link Design and Reliability Testing. For many complex systems, the initial reliability at the end of the FDP may still fall short of the requirement. This possibility should be planned for and a target minimum value of the initial reliability established. This value should be linked to the available amount of follow-on reliability DT. If it is not, and the initial reliability is too low or the allocated test time is too short, then the requirement will probably not be met.

8. Apply a Valid Methodology for Assessing Reliability in Testing. The caution here is in estimating the impact of a group of delayed corrective actions on the reliability of the system. A common approach in practice significantly overestimates the actual reliability. If this approach is applied, then the reliability may appear much higher than it actually is, and contribute to lower than expected operational reliability. A valid methodology for estimating the reliability improvement due to delayed corrective actions exists and is recommended.

General
An item or system is specified, procured, and designed to a functional requirement, and it is important that it satisfies this requirement. However, it is also desirable that the item or system be predictably available, and this depends upon its reliability and maintainability. For some disposable products in our modern society the availability requirement may be acceptably low. For a large range of consumer products, availability, based on high reliability, is an important selling point. For items and systems used in critical areas, including military equipment, process plant, and the nuclear industry, availability, reliability, and maintainability considerations are vital. The economic justification for a project is generally based on the lifetime cost of the project; a major contribution to this cost involves an evaluation of the availability, reliability, and maintainability of the equipment.

Availability
The ability of an item to be in a state to perform a required function under given conditions at a given instant of time or during a given time interval, assuming that the required external resources are provided. At its simplest level:

Availability = Uptime / (Uptime + Downtime)

The time units are generally hours and the time base is one year (8760 hours). From the design point of view, this translates to the intrinsic availability:

Ai (intrinsic availability) = MTBF / (MTBF + MTTR)

where MTBF is the mean time between failures and MTTR is the mean time to repair (or mean time to replace). Operational availability is defined differently:

Ao (operational availability) = MTBM / (MTBM + MDT)

where MTBM is the mean time between maintenance and MDT is the mean down time.
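As a minimal illustration of these two definitions (not from the source), the figures below (an MTBF of 2000 h, an MTTR of 10 h, an MTBM of 1500 h and an MDT of 48 h) are invented purely for the example:

```python
def intrinsic_availability(mtbf_hours, mttr_hours):
    """Ai = MTBF / (MTBF + MTTR): design-level availability, repair time only."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def operational_availability(mtbm_hours, mdt_hours):
    """Ao = MTBM / (MTBM + MDT): includes all maintenance and logistic down time."""
    return mtbm_hours / (mtbm_hours + mdt_hours)

# Illustrative (invented) figures over a one-year (8760 h) time base
ai = intrinsic_availability(mtbf_hours=2000, mttr_hours=10)
ao = operational_availability(mtbm_hours=1500, mdt_hours=48)
print(f"Intrinsic availability   Ai = {ai:.4f}")   # 0.9950
print(f"Operational availability Ao = {ao:.4f}")   # 0.9690
```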

Reliability
The ability of an item to perform a required function under given conditions for a given time interval. Reliability is expressed as a probability (0 to 1, or 0 to 100%). Thus the reliability of a component may be expressed as a 99% probability that it will work successfully for one year. Reliability is essentially an indication of the probability that the item will not fail in the given time period. A very generalised curve for the failure rate of components over time is the bathtub curve. It shows that in the early period a number of failures result from manufacturing, assembly, commissioning, and setting-to-work problems. When all of the teething problems have been eliminated, the remaining population has a useful life over which the items fail at a relatively low rate. After a long operating time the items fail at an increasing rate due to wear and other time-related mechanisms. This curve applies mostly to electronic components, which is why electronic products are often operated continuously for set times (burn-in) prior to delivery to users.
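As a hedged illustration, assuming a constant failure rate (the flat, useful-life region of the bathtub curve), the reliability over a time t follows the exponential model R(t) = exp(-t / MTBF); the MTBF figure below is invented to reproduce the 99%-over-one-year example above:

```python
import math

def reliability(mission_hours, mtbf_hours):
    """R(t) = exp(-t / MTBF), valid for a constant failure rate (useful-life region)."""
    return math.exp(-mission_hours / mtbf_hours)

# Invented example: an MTBF of roughly 871,600 h gives ~99% reliability over one year.
one_year = 8760
print(f"R(1 year) = {reliability(one_year, 871_600):.3f}")  # ~0.990
```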

The bathtub curve for mass-produced mechanical items is controlled to minimise the initial early-failure period by using quality control to ensure uniform production of high-reliability items. Before items are introduced onto the market they are rigorously tested to identify and correct design and manufacturing problems. A prime target of design, manufacturing and operation is to ensure that the useful life is extended by attention to the following factors.

Strength / life safety factors
Tribology considerations (prevention of wear; lubrication)
Corrosion prevention
Protection against environmental effects (temperature / humidity)
Fatigue
Vibration
Regular servicing (or elimination) of short-life components (filters, brake pads, etc.)

For systems with items in series, the overall reliability is the product of the reliabilities of the individual components. For systems with active items in parallel, the resulting reliability is improved. For example, if there are two items in parallel, A (reliability Ra) and B (reliability Rb), the overall reliability is R = 1 - (1 - Ra)(1 - Rb).
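A minimal sketch of these series and parallel rules, with invented component reliabilities:

```python
from functools import reduce

def series_reliability(reliabilities):
    """Series system: fails if any component fails, so R = product of the Ri."""
    return reduce(lambda acc, r: acc * r, reliabilities, 1.0)

def parallel_reliability(reliabilities):
    """Active parallel system: fails only if all components fail, so R = 1 - product of (1 - Ri)."""
    return 1.0 - reduce(lambda acc, r: acc * (1.0 - r), reliabilities, 1.0)

# Invented example: two components with Ra = 0.95 and Rb = 0.90
print(f"Series:   {series_reliability([0.95, 0.90]):.3f}")    # 0.855
print(f"Parallel: {parallel_reliability([0.95, 0.90]):.3f}")  # 0.995
```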

Maintainability
The ability of an item, under given conditions of use, to be retained in, or restored to, a state in which it can perform a required function, when maintenance is performed under given conditions and using stated procedures and resources. When a piece of equipment has failed it is important to get it back into an operating condition as soon as possible; the ease and speed with which this can be done is its maintainability. To calculate the maintainability, or Mean Time To Repair (MTTR), of an item, the time required to perform each anticipated repair task is multiplied by the relative frequency with which that task is performed (e.g. number of times per year). MTTR data supplied by manufacturers will be purely repair time, which assumes the fault has been correctly identified and the required spares and personnel are available. The MTTR seen by the user will also include the logistic delay, as shown below, and factors such as the skill of the maintenance engineers. MTTR user factors (a worked sketch follows the list):

Detection of fault
Start up maintenance team
Diagnose fault
Obtain spare parts
Repair (MTTR - manufacturer's information)
Test and accept repair
Start up equipment
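As promised above, a minimal sketch of the frequency-weighted MTTR calculation; the task names, times, and frequencies are invented for illustration:

```python
# Invented repair tasks: (name, hours per repair event, relative frequency per year)
tasks = [
    ("Replace filter",     0.5, 12),
    ("Replace drive belt", 2.0,  2),
    ("Overhaul pump",      8.0,  1),
]

def mean_time_to_repair(tasks):
    """Frequency-weighted average repair time across all anticipated tasks."""
    total_time = sum(hours * freq for _, hours, freq in tasks)
    total_events = sum(freq for _, _, freq in tasks)
    return total_time / total_events

print(f"MTTR = {mean_time_to_repair(tasks):.2f} hours")  # 1.20 hours
```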
