Nasa DFR

NASA Technical Memorandum 106313
Design for Reliability:
NASA Reliability Preferred Practices for Design and Test
Vincent R. Lalli
Lewis Research Center
Cleveland, Ohio
Prepared for the
Reliability and Maintainability Symposium
cosponsored by ASQC, IIE, IEEE, SOLE, IES, AIAA, SSS, and SRE
Anaheim, California, January 24-27, 1994
National Aeronautics and Space Administration
(NASA-TM-106313) DESIGN FOR RELIABILITY: NASA RELIABILITY PREFERRED PRACTICES FOR DESIGN AND TEST (NASA. Lewis Research Center) 27 p
N95-13728 Unclas G3/18 0028221
DESIGN FOR RELIABILITY
PART I--NASA RELIABILITY PREFERRED PRACTICES FOR DESIGN AND TEST
Vincent R. Lalli
National Aeronautics and Space Administration
Lewis Research Center
Cleveland, Ohio 44135
SUMMARY AND PURPOSE

This tutorial summarizes reliability experience from both NASA and industry and reflects engineering practices that support current and future civil space programs. These practices were collected from various NASA field centers and were reviewed by a committee of senior technical representatives from the participating centers (members are listed at the end). The material for this tutorial was taken from the publication issued by the NASA Reliability and Maintainability Steering Committee (NASA Reliability Preferred Practices for Design and Test. NASA TM-4322, 1991).

Reliability must be an integral part of the systems engineering process. Although both disciplines must be weighted equally with other technical and programmatic demands, the application of sound reliability principles will be the key to the effectiveness and affordability of America's space program. Our space programs have shown that reliability efforts must focus on the design characteristics that affect the frequency of failure. Herein, we emphasize that these identified design characteristics must be controlled by applying conservative engineering principles.

This tutorial should be used to assess your current reliability techniques, thus promoting an active technical interchange between reliability and design engineering that focuses on the design margins and their potential impact on maintenance and logistics requirements. By applying these practices and guidelines, reliability organizations throughout NASA and the aerospace community will continue to contribute to a systems development process which assures that

• Operating environments are well defined and independently verified.
• Design criteria drive a conservative design approach.
• Design weaknesses evident by test or analysis are identified and tracked.

Vincent R. Lalli has been at NASA Lewis Research Center since 1963 when he was hired as an aerospace technologist. Presently, as an adjunct to his work for the Office of Mission Safety and Assurance in design, analysis, and failure metrics, he is responsible for product assurance management and also teaches courses to assist with NASA's training needs. Mr. Lalli graduated from Case Western University with a B.S. and an M.S. in electrical engineering. In 1959 as a research assistant at Case, and later at Picatinny Arsenal, he helped to develop electronic fuses and special devices. From 1956 to 1963, he worked at TRW as a design, lead, and group engineer. Mr. Lalli is a registered engineer in Ohio and a member of Eta Kappa Nu, IEEE, IPC, ANSI, and ASME.

1.0 OVERVIEW

1.1 Applicability

The design practices that have contributed to NASA mission success represent the "best technical advice" on reliability design and test practices. These practices are not requirements but rather proven technical approaches that can enhance system reliability.

This tutorial is divided into two technical sections. Section II contains reliability practices, including design criteria, test procedures, and analytical techniques, that have been successfully applied in previous spaceflight programs. Section III contains reliability guidelines, including techniques currently applied to spaceflight projects, where insufficient information exists to certify that the technique will contribute to mission success.

1.2 Discussion

Experience from NASA's successful extended-duration space missions shows that four elements contribute to high reliability: (1) understanding stress factors imposed on flight hardware by the operating environment; (2) controlling the stress factors through the selection of conservative design criteria; (3) conducting an appropriate analysis to identify and track high stress points in the design (prior to qualification testing or flight use); and (4) selecting redundancy alternatives to provide the necessary function(s) should failure occur.

2.0 RELIABILITY PRACTICES

2.1 Introduction

The reliability design practices presented herein contributed to the success of previous spaceflight programs. The information is for use throughout NASA and the aerospace community to assist in the design and development of highly reliable equipment and assemblies. The practices include recommended analysis procedures, redundancy considerations, parts selection, environmental requirements considerations, and test requirements and procedures.

2.2 Format

The following format is used for reliability practices:

PRACTICE FORMAT DEFINITIONS

Practice: A brief statement of the practice
Benefit: A concise statement of the technical improvement realized from implementing the practice
Programs That Certified Usage: Identifiable programs or projects that have applied the practice
Center to Contact for More Information: Source of additional information, usually a sponsoring NASA Center (see section 4.0)
Implementation Method: A brief technical discussion, not intended to give the full details of the process but to provide a design engineer with adequate information to understand how the practice should be used
Technical Rationale: A brief technical justification for use of the practice
Impact of Nonpractice: A brief statement of what can be expected if the practice is avoided
Related Practices: Identification of other topic areas in the manual that contain related information
References: Publications that contain additional information about the practice
Sponsor of Practice

2.3 Document Referencing

The following example of the document numbering system applicable to the practices and guidelines is "Part Junction Temperature," practice number PD-ED-1204:

P D ED 12 04

1. Practice
2. Design Factors
3. Engineering Design
4. Series 12
5. Practice 04

Key to nomenclature.--The following is an explanation of the numbering system:

Position  Code
1.        G - Guideline
          P - Practice
2.        D - Design factors
          T - Test elements
3.        EC - Environmental considerations
          ED - Engineering design
          AP - Analytical procedures
          TE - Test considerations and procedures
4.        xx  Series number
5.        xx  Practice number within series

2.4 Practices as of January 1993

PD-EC-1101     Environmental Factors
PD-EC-1102 **  Meteoroids/Space Debris
PD-ED-1201     EEE Parts Derating
PD-ED-1202     High-Voltage Power Supply Design and Manufacturing Practices
PD-ED-1203     Class-S Parts in High-Reliability Applications
PD-ED-1204     Part Junction Temperature
PD-ED-1205     Welding Practices for 2219 Aluminum and Inconel 718
PD-ED-1206     Power Line Filters
PD-ED-1207     Magnetic Design Control for Science Instruments

PD-ED-1208 *   Static Cryogenic Seals for Launch Vehicle Applications
PD-ED-1209 **  Ammonia-Charged Aluminum Heat Pipes with Extruded Wicks
PD-ED-1210 *   Assessment and Control of Electrical Charges
PD-ED-1211 *   Combination Methods for Deriving Structural Design Loads Considering Vibro-Acoustic, etc., Responses
PD-ED-1212 *   Design and Analysis of Electronic Circuits for Worst-Case Environments and Part Variations
PD-ED-1213 **  Electrical Shielding of Power, Signal, and Control Cables
PD-ED-1214 **  Electrical Grounding Practices for Aerospace Hardware
PD-ED-1215.1 ** Preliminary Design Review
PD-ED-1216 **  Active Redundancy
PD-ED-1217 **  Structural Laminate Composites for Space Applications
PD-ED-1218 **  Application of Ablative Composites to Nozzles for Reusable Solid Rocket Motors
PD-ED-1219 **  Vehicle Integration/Tolerance Buildup Practices
PD-ED-1221 **  Battery Selection Practice for Aerospace Power Systems
PD-ED-1222 **  Magnetic Field Restraints for Spacecraft Systems and Subsystems
PD-AP-1301     Surface Charging and Electrostatic Discharge Analysis
PD-AP-1302 *   Independent Review of Reliability Analyses
PD-AP-1303 *   Part Electrical Stress Analysis
PD-AP-1304 *   Problem/Failure Report Independent Review/Approval
PD-AP-1305 *   Risk Rating of Problem/Failure Reports
PD-AP-1306 *   Thermal Analysis of Electronic Assemblies to the Piece Part Level
PD-AP-1307 **  Failure Modes, Effects, and Criticality Analysis (FMECA)
PT-TE-1401     EEE Parts Screening
PT-TE-1402     Thermal Cycling
PT-TE-1403     Thermographic Mapping of PC Boards
PT-TE-1404     Thermal Test Levels
PT-TE-1405     Powered-On Vibration
PT-TE-1406     Sinusoidal Vibration
PT-TE-1407 *   Assembly Acoustic Tests
PT-TE-1408 *   Pyrotechnic Shock
PT-TE-1409 *   Thermal Vacuum Versus Thermal Atmospheric Test of Electronic Assemblies
PT-TE-1410 *   Selection of Spacecraft Materials and Supporting Vacuum Outgassing Data
PT-TE-1411 **  Heat Sinks for Parts Operated in Vacuum
PT-TE-1412 **  Environmental Test Sequencing
PT-TE-1413 **  Random Vibration Testing
PT-TE-1414 **  Electrostatic Discharge (ESD) Test Practices

*New practices for January 1992.
**New practices for January 1993.

2.5 Typical Reliability Practice

A typical reliability practice is illustrated in this section. Environmental factors are very important in the system design so equipment operating conditions must be identified. Systems designed to have adequate environmental strength perform well in the field and satisfy our customers. Failure to perform a detailed life-cycle environment profile can lead to overlooking environmental factors whose effect is critical to equipment reliability. Not including these factors in the environmental design criteria and test program can lead to environment-induced failures during spaceflight operations.

Environmental Factors

• Practice (PD-EC-1101): Identify equipment operating conditions.

• Benefit: Adequate environmental strength is incorporated into design.

• Programs That Certified Usage: SERT I and II, CTS, ACTS, space experiments, launch vehicles, space power systems, and Space Station Freedom

• Center to Contact for More Information: NASA Lewis Research Center

• Implementation Method: Develop life-cycle environment profile.
    Describe anticipated events from final factory acceptance through removal from inventory.
    Identify significant natural and induced environments for each event.
    Describe environmental and stress conditions:
        Narrative
        Statistical
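The implementation method for practice PD-EC-1101 (list the events of the life cycle, identify the natural and induced environments for each event, and describe each condition both narratively and statistically) can be sketched as a small data structure. This is only an illustrative sketch: the class names, the helper functions, and every event name and number in the sample profile are hypothetical placeholders, not values taken from the practice.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    name: str        # environment name, e.g. "vibration"
    origin: str      # "natural" or "induced"
    narrative: str   # narrative description of the condition
    statistics: dict = field(default_factory=dict)  # statistical description


@dataclass
class LifeCycleEvent:
    name: str
    environments: list = field(default_factory=list)


def significant_environments(profile):
    """Sorted names of every environment that appears in any event."""
    return sorted({env.name for event in profile for env in event.environments})


def events_without_environments(profile):
    """Events whose environments have not been identified yet."""
    return [event.name for event in profile if not event.environments]


# Hypothetical profile: all event names and numbers are placeholders.
profile = [
    LifeCycleEvent("final factory acceptance", [
        Environment("high temperature", "induced", "acceptance thermal soak", {"max_C": 60}),
    ]),
    LifeCycleEvent("ground transportation", [
        Environment("vibration", "induced", "truck transport", {"grms": 1.0}),
        Environment("low temperature", "natural", "winter shipment", {"min_C": -20}),
    ]),
    LifeCycleEvent("launch", [
        Environment("vibration", "induced", "launch random vibration", {"grms": 10.0}),
        Environment("acceleration", "induced", "ascent loads", {"max_g": 5.0}),
    ]),
    LifeCycleEvent("removal from inventory"),  # environments not yet identified
]

print(significant_environments(profile))
print(events_without_environments(profile))
```

A check such as `events_without_environments` is one way to flag life-cycle events that would otherwise be overlooked when the environmental design criteria and test program are written.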
• Technical Rationale

Environment           Principal effects                       Typical failures induced

High temperature      Thermal aging: oxidation, structural    Insulation failure; alteration of electrical properties
                        change, chemical reaction
                      Softening, melting, and sublimation     Structural failure
                      Viscosity reduction/evaporation         Loss of lubrication properties
                      Physical expansion                      Structural failure; increased mechanical stress; increased wear on moving parts

Low temperature       Increased viscosity and solidification  Loss of lubrication properties
                      Ice formation                           Alteration of electrical properties
                      Embrittlement                           Loss of mechanical strength; cracking; fracture
                      Physical contraction                    Structural failure; increased wear on moving parts

High relative         Moisture absorption                     Swelling, rupture of container; physical breakdown; loss of electrical strength
  humidity            Chemical reaction: corrosion,           Loss of mechanical strength; interference with function; loss of electrical
                        electrolysis                            properties; increased conductivity of insulators

Low relative          Desiccation: embrittlement,             Loss of mechanical strength; structural collapse; alteration of electrical
  humidity              granulation                             properties; "dusting"

High pressure         Compression                             Structural collapse; penetration of sealing; interference with function

Low pressure          Expansion                               Fracture of container; explosive expansion
                      Outgassing                              Alteration of electrical properties; loss of mechanical strength
                      Reduced dielectric strength of air      Insulation breakdown and arc-over; corona and ozone formation

Solar radiation       Actinic and physicochemical             Surface deterioration; alteration of electrical properties; discoloration of
                        reactions: embrittlement                materials; ozone formation

Sand and dust         Abrasion                                Increased wear
                      Clogging                                Interference with function; alteration of electrical properties

Salt spray            Chemical reactions: corrosion           Increased wear; loss of mechanical strength; alteration of electrical
                                                                properties; interference with function
                      Electrolysis                            Surface deterioration; structural weakening; increased conductivity

Wind                  Force application                       Structural collapse; interference with function; loss of mechanical strength
                      Deposition of materials                 Mechanical interference and clogging; abrasion accelerated
                      Heat loss (low velocity)                Acceleration of low-temperature effects
                      Heat gain (high velocity)               Acceleration of high-temperature effects

Rain                  Physical stress                         Structural collapse
                      Water absorption and immersion          Increase in weight; electrical failure; structural weakening
                      Erosion                                 Removal of protective coatings; structural weakening; surface deterioration
                      Corrosion                               Enhancement of chemical reactions

Temperature shock     Mechanical stress                       Structural collapse or weakening; seal damage

High-speed particles  Heating                                 Thermal aging; oxidation
(nuclear irradiation) Transmutation and ionization            Alteration of chemical, physical, and electrical properties; production of
                                                                gases and secondary particles

Zero gravity          Mechanical stress                       Interruption of gravity-dependent functions
                      Absence of convection cooling           Aggravation of high-temperature effects

Ozone                 Chemical reactions: crazing,            Rapid oxidation; alteration of electrical properties; loss of mechanical
                        cracking, embrittlement,                strength; interference with function
                        granulation
                      Reduced dielectric strength of air      Insulation breakdown and arc-over

Explosive             Severe mechanical stress                Rupture and cracking; structural collapse
  decompression

(illegible)           Chemical reactions: contamination       Alteration of physical and electrical properties
                      Reduced dielectric strength             Insulation breakdown and arc-over

Acceleration          Mechanical stress                       Structural collapse
• Technical Rationale (concluded)

Environment           Principal effects                       Typical failures induced

Vibration             Mechanical stress                       Loss of mechanical strength; interference with function; increased wear
                      Fatigue                                 Structural collapse

Magnetic fields       Induced magnetization                   Interference with function; alteration of electrical properties; induced heating

• Impact of Nonpractice: Failure to perform a detailed life-cycle environment profile can lead to overlooking environmental factors whose effect is critical to equipment reliability. If these factors are not included in the environmental design criteria and test program, environment-induced failures may occur during spaceflight operations.

• References:

Government

1. Reliability Prediction of Electronic Equipment. MIL-HDBK-217E Notice 1, January 1990.
2. Reliability/Design Thermal Applications. MIL-HDBK-251, January 1978.
3. Electronic Reliability Design Handbook. MIL-HDBK-338-1A, October 1988.
4. Environmental Test Methods and Engineering Guidelines. MIL-STD-810E, July 1989.

Industry

5. Space Station Freedom Electric Power System Reliability and Maintainability Guidelines Document. EID-00866, Rocketdyne Division, Rockwell International, 1990.
6. Society of Automotive Engineers, Reliability, Maintainability, and Supportability Guidebook, SAE G-11, 1990.

3.0 RELIABILITY DESIGN GUIDELINES

3.1 Introduction

The reliability design guidelines for consideration by the aerospace community are presented herein. These guidelines contain information that represents a technically credible process applied to ongoing NASA programs and projects. Unlike a reliability design practice, a guideline lacks specific operational experience or data to validate its contribution to mission success. However, a guideline does contain information that represents current "best thinking" on a particular subject.

3.2 Format

The following format is used for reliability guidelines:

GUIDELINE FORMAT DEFINITIONS

Practice: A brief statement of the guideline
Benefit: A concise statement of the technical improvement realized from implementing the guideline
Center to Contact for More Information: Source of additional information, usually the sponsoring NASA Center (see section 4.0)
Implementation Method: A brief technical discussion, not intended to give the full details of the process but to provide a design engineer with adequate information to understand how the guideline should be used
Technical Rationale: A brief technical justification for use of the guideline
Impact of Nonpractice: A brief statement of what can be expected if the guideline is avoided
Related Guidelines: Identification of other topic areas in the manual that contain related information
References: Publications that contain additional information about the guideline
Sponsor of Guideline

3.3 Guidelines as of January 1993

GD-ED-2201 **  Fastener Standardization and Selection Considerations
GD-ED-2202 **  Design Considerations for Selection of Thick-Film Microelectronic Circuits
GD-ED-2203 **  Design Checklists for Microcircuits
GD-AP-2301     Earth Orbit Environmental Heating
GT-TE-2401 **  EMC Guideline for Payloads, Subsystems, and Components

**New Guidelines as of January 1993.

3.4 Typical Reliability Guideline

A typical reliability guideline is illustrated in this section. Environmental heating for Earth orbiting systems is an important design consideration. Designers should use currently accepted values for the solar constant, albedo factor, and Earth radiation when calculating the heat balance of Earth orbiters. These calculations can
accurately predict the thermal environment of orbiting devices. Failure to use these constants can result in an incomplete thermal analysis and grossly underestimated temperature variations of the components.

Analysis of Earth Orbit Environmental Heating

• Guideline (GD-AP-2301): Use currently accepted values for solar constant, albedo factor, and Earth radiation when calculating heat balance of Earth orbiters. This practice provides heating rate for blackbody case without considering spectral effects or collimation.

• Benefit: Thermal environment of orbiting devices is accurately predicted.

• Center to Contact for More Information: Goddard

• Implementation Method:

    Solar constant, W/m^2
        Nominal, 1367.5
        Winter, 1422.0
        Summer, 1318.0

    Albedo factor
        Nominal, 0.30
        Hot, 0.35
        Cold, 0.25

    Earth-emitted energy (nominal, 255 K; producing 241 W/m^2)

    Solar constant,          Albedo    Earth-emitted     Equivalent earth
    W/m^2                    factor    energy, W/m^2     temperature, K

    Nominal, 1367.5          0.25      256               259
                             0.30      239               254
                             0.35      222               250

    Winter solstice, 1422    0.25      267               263
                             0.30      249               258
                             0.35      231               253

    Summer solstice, 1318    0.25      247               256
                             0.30      231               251
                             0.35      214               246

• Technical Rationale: Modification of energy incident on a spacecraft due to Earth-Sun distance variation and accuracy of solar constant are of sufficient magnitude to be important parameters in performing a thermal analysis.

• Impact of Nonpractice: Failure to use constants results in an incomplete thermal analysis and grossly underestimated temperature variations of components.

• References:

1. Leffler, J.M.: Spacecraft External Heating Variations in Orbit. AIAA paper 87-1596, June 1987.
2. Reliability/Design, Thermal Applications. MIL-HDBK-251, 1978.
3. Incropera, F.P.; and DeWitt, D.P.: Fundamentals of Heat and Mass Transfer. Second ed. John Wiley & Sons, 1985.

4.0 NASA RELIABILITY AND MAINTAINABILITY STEERING COMMITTEE

The following members of the NASA Reliability and Maintainability Steering Committee may be contacted for more information about the practices and guidelines:

Dan Lee
Ames Research Center
MS 218-7 DQR
Moffett Field, California 94035

Jack Remez
Goddard Space Flight Center
Bldg. 6 Rm S233 Code 302
Greenbelt, Maryland 20771

Thomas Gindoff
Jet Propulsion Laboratory
California Institute of Technology
MS 301-456 SEC 521
4800 Oak Grove Drive
Pasadena, California 91109

Nancy Steisslinger
Lyndon B. Johnson Space Center
Bldg. 45 Rm 613 Code NB23
Houston, Texas 77058

Leon Migdalski
John F. Kennedy Space Center
RT-ENG-2 KSC HQS 3548
Kennedy Space Center, Florida 32899

Salvatore Bavuso
Langley Research Center
MS 478
5 Freeman Road
Hampton, Virginia 23665-5225

Vincent Lalli
Lewis Research Center
MS 501-4 Code 0152
21000 Brookpark Road
Cleveland, Ohio 44135

Donald Bush
George C. Marshall Space Flight Center
CT11 Bldg. 4103
Marshall Space Flight Center, Alabama 35812

Ronald Lisk
NASA Headquarters Code QS
Washington, DC 20546
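For the Earth orbit heating guideline (GD-AP-2301) in section 3.4, the tabulated Earth-emitted energies can be cross-checked with a short calculation. The sketch below rests on two assumptions that the guideline itself does not state: that Earth is in global radiative equilibrium, so the emitted energy is S(1 - a)/4 for solar constant S and albedo factor a, and that the equivalent temperature follows the blackbody (Stefan-Boltzmann) law. The function names are hypothetical, and the computed temperatures can differ from the tabulated ones by a kelvin or two.

```python
# Stefan-Boltzmann constant, W/(m^2 K^4)
STEFAN_BOLTZMANN = 5.670374419e-8


def earth_emitted_energy(solar_constant, albedo):
    """Global-average energy re-emitted by Earth, W/m^2.

    Assumes radiative equilibrium: absorbed sunlight S*(1 - a) falls on
    Earth's cross section and is emitted over four times that area.
    """
    return solar_constant * (1.0 - albedo) / 4.0


def equivalent_temperature(emitted_energy):
    """Blackbody temperature, K, that radiates the given flux."""
    return (emitted_energy / STEFAN_BOLTZMANN) ** 0.25


# Solar constants and albedo factors quoted in the guideline.
cases = {"nominal": 1367.5, "winter solstice": 1422.0, "summer solstice": 1318.0}
for label, solar_constant in cases.items():
    for albedo in (0.25, 0.30, 0.35):
        energy = earth_emitted_energy(solar_constant, albedo)
        temperature = equivalent_temperature(energy)
        print(f"{label:16s} a={albedo:.2f}  E={energy:6.1f} W/m^2  T={temperature:6.1f} K")
```

Running the loop shows how the hot and cold albedo cases bracket the nominal heat balance, which is the spread a thermal analysis would need to cover.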
PART II--RELIABILITY TRAINING

1.0 INTRODUCTION TO RELIABILITY

"Reliability" applies to systems consisting of people, machines, and written information. A system is reliable if those who need it can depend on it over a reasonable period of time and if it satisfies their needs. Of the people involved in a system, some rely on it, some keep it reliable, and some do both. Several machines comprise a system: mechanical, electrical, and electronic. The written information defines peoples' roles in the system: sales literature; system specifications; detailed manufacturing drawings; software, programs, and procedures; operating and repair instructions; and inventory control.

Reliability engineering is the discipline that defines specific tasks done while a system is being planned, designed, manufactured, used, and improved. Outside of the usual engineering and management tasks, these tasks ensure that the people in the system attend to all those details that keep it operating reliably.

Reliability engineering is necessary because as users of rapidly changing technology and as members of large complex systems, we cannot ensure that essential details affecting reliability are not overlooked.

1.1 Period of Awakening: Failure Analysis

The theme of this tutorial is failure physics: the study of how products, hardware, software, and systems fail and what can be done about it. Training in reliability must begin with a review of mathematics and a description of the elements that contribute to product failures. Consider the following example of a failure analysis. A semiconductor diode developed a short. Analysis showed that a surge voltage was occurring occasionally, exceeding the breakdown voltage of the diode and burning it up. The problem: stress exceeding strength, a type I failure. A transistor suddenly stopped functioning. Analysis showed that aluminum metallization opened at an oxide step on the chip, the opening accelerated by the neckdown of the metallization at the step. In classical terminology, this failure, caused by a manufacturing flaw, is a random failure (type II). These two failure types are shown in figure 1. Formerly, most of the design control efforts shown in the figure were aimed at the type I failure. Although such design controls are important, most equipment failures in the field bear no relation to the results of reasonable stress analyses during design. These failures are type II (i.e., those caused by built-in flaws).

1.2 New Direction

The new direction in reliability engineering will be toward a more realistic recognition of the causes and effects of failures. The new boundaries proposed for reliability engineering are to exclude management, applied mathematics, and double checking. These functions are important and may still be performed by reliability engineers. However, reliability engineering is to be a synthesizing function devoted to flaw control. The functions presented in figure 2 relate to the following tasks:

(1) Identify flaws and stresses and rank them for priority actions.

(2) Engage the material technologists to determine the flaw failure mechanisms.

(3) Develop flaw control techniques and send information back to the engineers responsible for design, manufacture, and support planning.

[Figure 1, panel (a): Type I failures (a design margin problem on stress/strength, fatigue, and wear): electromigration, cathode depletion, bearing wear. Panel (b): Type II failures (a flaw problem): electromigration around flaw, misaligned gear wear, oxide pinhole breakdown.]
Figure 1.--Two types of failure.
[Figure 2: block diagram. Labeled blocks: environmental stress analysis function, design function, flaw control information, support planning, manufacturing, material technology, and system/equipment user. Labeled flows: information on operational conditions, design information package, flaw (failure) information, manufacturing flaw information, flaw (failure) mechanisms, completed systems/equipment, and maintenance plan and test equipment.]
Figure 2.--Role of reliability engineering for the 1990's.
The types of output expected from reliability engineering are different from those provided by traditional engineering: stress-screening regimens; failure characteristics of parts and systems; effects of environmental stresses on flaws and failures; relationship of failure mechanisms to flaw failures; relationship of manufacturing yield to product reliability; flaw detection methods such as automated IC chip inspection and vibration signature monitoring.

Because flaws in an item depend on the design, manufacturing processes, quality control, parts, and materials, the distribution of flaws does not stay constant. Reliability engineering must act in a timely manner to provide flaw control information to the proper functions for action. It is important that customers recognize this fact and allow proper controls to be tailored to the needs of the time instead of demanding a one-time negotiation of what should be done for the total contract period.

1.3 Training as of June 1992

Although this tutorial considers only specific areas to exemplify the contents of a reliability training program, the following provides a complete list from the NASA Reference Publication 1253, "Reliability Training," available upon request from the National Technical Information Service, Springfield, Virginia; (703) 487-4650. A course evaluation form is included in the appendix.

Introduction to Reliability
  Era of Mechanical Designs
  Era of Electron Tubes
  Era of Semiconductors
  Period of Awakening
  New Direction
  Concluding Remarks
  Reliability Training

Reliability Mathematics and Failure Physics
  Mathematics Review
  Notation
  Manipulation of Exponential Functions
  Rounding Data
  Integration Formulas
  Differential Formulas
  Partial Derivatives
  Expansion of (a + b)^n
  Failure Physics
  Probability Theory
  Fundamentals
  Probability Theorems
  Concept of Reliability
  Reliability as Probability of Success
  Reliability as Absence of Failure
  Product Application
  K-Factors
  Concluding Remarks
  Reliability Training

Exponential Distribution and Reliability Models
  Exponential Distribution
  Failure Rate Definition
  Failure Rate Dimensions
  "Bathtub" Curve
  Mean Time Between Failures
  Calculations of Pc for Single Devices
  Reliability Models
  Calculation of Reliability for Series-Connected Devices
  Calculation of Reliability for Devices Connected in Parallel (Redundancy)
  Calculation of Reliability for Complete System
  Concluding Remarks
  Reliability Training

Using Failure Rate Data
  Variables Affecting Failure Rates
  Operating Life Test
  Storage Test
  Summary of Variables Affecting Failure Rates
  Part Failure Rate Data
  Improving System Reliability Through Part Derating
  Predicting Reliability by Rapid Techniques
  Use of Failure Rates in Tradeoffs
  Nonoperating Failures
  Applications of Reliability Predictions to Control of Equipment Reliability
  Standardization as a Means of Reducing Failure Rates
  Allocation of Failure Rates and Reliability
  Importance of Learning From Each Failure
  Failure Reporting, Analysis, Corrective Action, and Concurrence
  Case Study--Achieving Launch Vehicle Reliability
  Design Challenge
  Subsystem Description
  Approach to Achieving Reliability Goals
  Launch and Flight Reliability
  Field Failure Problem
  Mechanical Tests
  Runup and Rundown Tests
  Summary of Case Study
  Concluding Remarks

Applying Probability Density Functions
  Probability Density Functions
  Application of Density Functions
  Cumulative Probability Distribution
  Normal Distribution
  Normal Density Function
  Properties of Normal Distribution
  Symmetrical Two-Limit Problems
  One-Limit Problems
  Nonsymmetrical Two-Limit Problems
  Application of Normal Distribution to Test Analyses and Reliability Predictions
  Effects of Tolerance on a Product
  Notes on Tolerance Accumulation: A How-To-Do-It Guide
  Estimating Effects of Tolerance
  Concluding Remarks
  Reliability Training

Testing for Reliability
  Demonstrating Reliability
  Pc Illustrated
  Ps Illustrated
  Pw Illustrated
  K-Factors Illustrated
  Test Objectives and Methods
  Test Objectives
  Attribute Test Methods
  Test-to-Failure Methods
  Life Test Methods
  Concluding Remarks

Software Reliability
  Models
  Time Domain Models
  Data Domain Models
  Axiomatic Models
  Other Models
  Trends and Conclusions
  Software
  Categories of Software
  Processing Environments
  Severity of Software Defects
  Software Bugs Compared With Software Defects
  Hardware and Software Failures
  Manifestations of Software Bugs
  Reliability Training

Software Quality Assurance
  Concept of Quality
  Software Quality
  Software Quality Characteristics
  Software Quality Metrics
  Overall Software Quality Metrics
  Software Quality Standards
  Concluding Remarks
  Reliability Training

Reliability Management
  Roots of Reliability Management
  Planning a Reliability Management Organization
  General Management Considerations
  Program Establishment
  Goals and Objectives
  Symbolic Representation
  Logistics Support and Repair Philosophy
  Reliability Management Activities
  Performance Requirements
  Specification Targets
  Field Studies
  Human Reliability
  Analysis Methods
  Human Errors
  Example
  Presentation of Reliability
  Engineering and Manufacturing
  User or Customer
  Reliability Training

Appendixes
  A--Reliability Information
  B--Project Manager's Guide on Product Assurance
  C--Reliability Testing Examples

Bibliography

Reliability Training Answers

2.0 RELIABILITY MATHEMATICS AND FAILURE PHYSICS

2.1 Failure Physics

When most engineers think of reliability, they think of parts since parts are the building blocks of products. All agree that a reliable product must have reliable parts. But what makes a part reliable? When asked, nearly all engineers would say a reliable part is one purchased according to a certain source control document and bought from an approved vendor. Unfortunately, these two qualifications are not always guarantees of reliability. The following case illustrates this problem.
A clock purchased according to PD 4600008 was procured from an approved vendor for use in the ground support equipment of a missile system and was subjected to qualification tests as part of the reliability program. These tests consisted of high- and low-temperature, mechanical shock, temperature shock, vibration, and humidity. The clocks from the then sole-source vendor failed two of the tests: low-temperature and humidity. A failure analysis revealed that lubricants in the clock's mechanism froze and that the seals were not adequate to protect the mechanism from humidity. A second approved vendor was selected. His clocks failed the high-temperature test. In the process the dial hands and numerals turned black, making readings impossible from a distance of 2 feet. A third approved vendor's clocks passed all of the tests except mechanical shock, which cracked two of the cases. Ironically, the fourth approved vendor's clocks, though less expensive, passed all the tests.

The point of this illustration is that four clocks, each designed to the same specification and procured from a qualified vendor, all performed differently in the same environments. Why did this happen? The specification did not include the gear lubricant or the type of coating on the hands and numerals or the type of case material.

Many similar examples could be cited, ranging from requirements for glue and paint to complete assemblies and systems, and the key to answering these problems can best be stated as follows: To know how reliable a product is or how to design a reliable product, you must know how many ways its parts can fail and the types and magnitude of stresses that cause such failures. Think about this: if you knew every conceivable way a missile could fail and if you knew the type and level of stress required to produce each failure, you could build a missile that would never fail because you could eliminate

(1) As many ways of failure as possible
(2) As many stresses as possible
(3) The remaining potential failures by controlling

2.2 Reliability as Absence of Failure

Although the classical definition of reliability is adequate for most purposes, we are going to modify it somewhat and examine reliability from a slightly different viewpoint. Consider this definition: Reliability is the probability that the critical failure modes of a device will not occur during a specified period of time and under specified conditions when used in the manner and for the purpose intended. Essentially, this modification replaces the words "a device will operate successfully" with the words "critical failure modes . . . will not occur." This means that if all the possible failure modes of a device (ways the device can fail) and their probabilities of occurrence are known, the probability of success (or the reliability of a device) can be stated. It can be stated in terms of the probability that those failure modes critical to the performance of the device will not occur. Just as we needed a clear definition of success when using the classical definition, we must also have a clear definition of failure when using the modified definition.

For example, assume that a resistor has only two failure modes: it can open or it can short. If the probability that the resistor will not short is 0.99 and the probability that it will not open is 0.9, the reliability of the resistor (or the probability that the resistor will not short or open) is given by

R_resistor = Probability of no opens × Probability of no shorts
           = 0.9 × 0.99 = 0.89

Note that we have multiplied the probabilities. Probability theorem 2 therefore requires that the open-failure-mode probability and the short-failure-mode probability be independent of each other. This condition is satisfied because an open-failure mode cannot occur simultane-
the levelof the remaining stresses ously with a short mode.
Sound simple? Well, itwould be except that despitethe

thousands of failuresobserved in industry each day, we 2.3 Product Application
still
know very littleabout why thingsfailand even less
about how to control these failures. However, through (or the probabilityof
This section relatesreliability
systematic data accumulation and study, we learnmore success)to product failures.
each day.
2.3.1 Product failuremodes.--In general, critical
As stated at the outset,thistutorialintroducessome equipment failuresmay be classifiedas catastrophicpart
basic concepts of failurephysics:failuremodes (how failures,tolerance failures,
and wearout failures.The
failuresarerevealed);failuremechanisms (what produces expressionforreliabilitythen becomes
the failuremode); and failurestresses(what activatesthe
failuremechanisms). The theory and the practicaltools R = P££w
availablefor controllingfailuresare presentedalso.
I0
where

    P_c   probability that catastrophic part failures will not occur
    P_t   probability that tolerance failures will not occur
    P_w   probability that wearout failures will not occur

As in the resistor example, these probabilities are multiplied together because they are considered to be independent of each other. However, this may not always be true because an out-of-tolerance failure, for example, may evolve into or result from a catastrophic part failure. Nevertheless, in this tutorial they are considered independent and exceptions are pointed out as required.

2.3.2 Inherent product reliability.--Consider the inherent reliability R_i of a product. Think of the expression R_i = P_c P_t P_w as representing the potential reliability of a product as described by its documentation, or let it represent the reliability inherent in the design drawings instead of the reliability of the manufactured hardware. This inherent reliability is predicated on the decisions and actions of many people. If they change, the inherent reliability could change.

Why do we consider inherent reliability? Because the facts of failure are these: When a design comes off the drawing board, the parts and materials have been selected; the tolerance, error, stress, and other performance analyses have been performed; the type of packaging is firm; the manufacturing processes and fabrication techniques have been decided; and usually the test methods and the quality acceptance criteria have been selected. At this point the design documentation represents some potential reliability that can never be increased except by a design change or good maintenance. However, the possibility exists that the actual reliability observed when the documentation is transformed into hardware will be much less than the potential reliability of the design. To understand why this is true, consider the hardware to be a black box with a hole in both the top and bottom. Inside are potential failures that limit the inherent reliability of the design. When the hardware is operated, these potential failures fall out the bottom (i.e., operating failures are observed). The rate at which the failures fall out depends on how the box or hardware is operated. Unfortunately, we never have just the inherent failures to worry about because other types of failures are being added to the box through the hole in the top. These other failures are generated by the manufacturing, quality, and logistics functions, by the user or customer, and even by the reliability organization itself. We discuss these added failures and their contributors in the following paragraphs, but it is important to understand that, because of the added failures, the observed failures will be greater than the inherent failures of the design.

2.4 K-Factors

The other contributors to product failure just mentioned are called K-factors; they have a value between 0 and 1 and modify the inherent reliability:

    R_product = R_i (K_q K_m K_r K_l K_u)

• K-factors denote probabilities that inherent reliability will not be degraded by
  - K_m  manufacturing and fabrication and assembly techniques
  - K_q  quality test methods and acceptance criteria
  - K_r  reliability engineering activities
  - K_l  logistics activities
  - K_u  the user or customer

• Any K-factor can cause reliability to go to zero.

• If each K-factor equals 1 (the goal), R_product = R_i.

2.5 Variables Affecting Failure Rates

Part failure rates are affected by (1) acceptance criteria, (2) all environments, (3) application, and (4) storage. To reduce the occurrence of part failures, we observe failure modes, learn what caused the failure (the failure stress), determine why it failed (the failure mechanism), and then take action to eliminate the failure. For example, one of the failure modes observed during a storage test was an "open" in a wet tantalum capacitor. The failure mechanism was end seal deterioration, allowing the electrolyte to leak. One obvious way to avoid this failure mode in a system that must be stored for long periods without maintenance is not to use wet tantalum capacitors. If this is impossible, the best solution would be to redesign the end seals. Further testing would be required to isolate the exact failure stress that produces the failure mechanism. Once isolated, the failure mechanism can often be eliminated through redesign or additional process controls.

2.6 Use of Failure Rates in Tradeoffs

Failure rate tables and derating curves are useful to a designer because they enable him to make reliability tradeoffs and provide a more practical method of establishing derating requirements. For example, suppose we
have two design concepts for performing some function. If the failure rate of concept A is 10 times higher than that of concept B, one can expect concept B to fail one-tenth as often as concept A. If it is desirable to use concept A for other reasons, such as cost, size, performance, or weight, the derating failure rate curves can be used to improve concept A's failure rate (e.g., select components with a lower failure rate, derate the components more, or both). An even better approach is to find ways to reduce the complexity and thus the failure rate of concept A. Figure 3 illustrates the use of failure rate data in tradeoffs. This figure gives a failure-rate-versus-temperature curve for the electronics of a complex (over 35 000 parts) piece of ground support equipment. The curve was developed as follows:

(1) A failure rate prediction was performed by using component failure rates and their application factors K_A for an operating temperature of 25 °C. The resulting failure rate was chosen as a reference point.

(2) Predictions were then made by using the same method for temperatures of 50, 75, and 100 °C. The ratios of these predictions to the reference point were plotted versus component operating temperature, with the resulting curve for the equipment.

This curve was then used to provide tradeoff criteria for using air-conditioning versus blowers to cool the equipment. To illustrate, suppose the maximum operating temperatures expected are 50 °C with air-conditioning and 75 °C with blowers. Suppose further that the required failure rate for the equipment, if the equipment is to meet its reliability goal, is one failure per 50 hr. A failure rate prediction at 25 °C might indicate a failure rate of 1 per 100 hr. From the figure, note that the maximum allowable operating temperature is therefore 60 °C, since the maximum allowable failure rate ratio is A = 2; that is, at 60 °C the equipment failure rate will be (1/100) × 2 = 1/50, which is the required failure rate. If blowers are used for cooling, the equipment must operate at temperatures as high as 75 °C; if air-conditioning is used, the temperature need not exceed 50 °C. Therefore, air-conditioning must be used if we are to meet the reliability requirement.

Other factors must be examined before we make a final decision. Whatever type of cooling equipment is selected, the total system reliability now becomes

    R_T = R_e R_c

Therefore, the effect on the system of the cooling equipment's reliability must be calculated. A more important consideration is the effect on system reliability should the cooling equipment fail. Because temperature control appears to be critical, loss of it may have serious system consequences. Therefore, it is too soon to rule out blowers entirely. A failure mode, effects, and criticality analysis (FMECA) must be made on both cooling methods to examine all possible failure modes and their effects on the system. Only then will we have sufficient information to make a sound decision.

2.7 Importance of Learning From Each Failure

When a product fails, a valuable piece of information about it has been generated because we have the opportunity to learn how to improve the product if we take the right actions.

Failures can be classified as:

(1) Catastrophic (a shorted transistor or an open wire-wound resistor)
(2) Degradation (change in transistor gain or the resistor value)
(3) Wearout (brush wear in an electric motor)

These three failure categories can be subclassified further:

(1) Independent (a shorted capacitor in a radiofrequency amplifier being unrelated to a low-emission cathode in a picture tube)
(2) Cascade (the shorted capacitor in the radiofrequency amplifier causing excessive current to flow in its transistor and burning the collector beam lead open)
(3) Common mode (uncured resin being present in motors)
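The air-conditioning-versus-blowers tradeoff of section 2.6 comes down to a few lines of arithmetic, sketched below in Python. This is only an illustration under stated assumptions: the function and variable names are hypothetical, and the curve of figure 3 is not reproduced, so the one curve value the text quotes (a failure rate ratio of 2 at 60 °C) is hard-coded rather than looked up.

```python
from fractions import Fraction


def max_allowable_ratio(required_rate, reference_rate):
    """Largest failure rate ratio, relative to the 25 °C reference, that still meets the requirement."""
    return required_rate / reference_rate


def cooling_options_that_qualify(options_max_temp_c, max_allowable_temp_c):
    """Return the cooling options whose maximum operating temperature stays within the limit."""
    return [name for name, temp_c in options_max_temp_c.items()
            if temp_c <= max_allowable_temp_c]


# Values quoted in the text.
reference_rate = Fraction(1, 100)   # predicted failures per hour at 25 °C
required_rate = Fraction(1, 50)     # required failures per hour

ratio = max_allowable_ratio(required_rate, reference_rate)

# Assumption: taken from the text's reading of the figure (ratio 2 <-> 60 °C).
# A real analysis would interpolate on the predicted curve instead.
max_allowable_temp_c = 60

# Maximum expected operating temperatures, °C, as quoted in the text.
options_max_temp_c = {"air-conditioning": 50, "blowers": 75}
qualifying = cooling_options_that_qualify(options_max_temp_c, max_allowable_temp_c)

# Equipment failure rate at the temperature limit: (1/100) x 2 = 1/50 per hour.
rate_at_limit = reference_rate * ratio

print(ratio, rate_at_limit, qualifying)
```

With the quoted numbers the allowable ratio is 2 and only air-conditioning stays under the 60 °C limit, which agrees with the conclusion in the text. Exact fractions are used so that 1/100 and 1/50 are not blurred by floating-point rounding.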
[Figure 3.--Predicted failure rate ratios versus temperature for ground support equipment (electronics). Horizontal axis: component temperature, °C.]

Much can be learned from each failure by using these categories, good failure reporting, analysis, and a concurrence system and by taking corrective action. Failure analysis determines what caused the part to fail. Corrective action ensures that the cause is dealt with. Concurrence informs management of actions being taken to avoid another failure. These data enable all personnel to compare the part ratings with the use stresses and verify that the part is being used with a known margin.

2.8 Effects of Tolerance on a Product

Because tolerances must be expected in all manufacturing processes, some important questions to ask about the effects of tolerance on a product are

(1) How is the reliability affected?
(2) How can tolerances be analyzed and what methods are available?
(3) What will affect the term P_t in the product reliability model?

Electrical circuits are often affected by part tolerances (circuit gains can shift up or down, and transfer function poles or zeros can shift into the right-hand s-plane, causing oscillations). Mechanical components may not fit together or may be so loose that excessive vibration causes trouble (refs. 1 to 3).

3.0 TESTING FOR RELIABILITY

3.1 Test Objectives

It can be inferred that 1000 test samples are required to demonstrate a reliability requirement of 0.999. Because of cost and time, this approach is impractical. Furthermore, the total production of a product often may not even approach 1000 items. Because we usually cannot test the total production of a product (called product population), we must demonstrate reliability on a few samples. Thus, the main objective of a reliability test is to test an available device so that the data will allow a statistical conclusion to be reached about the reliability of similar devices that will not or cannot be tested. That is, the main objective of a reliability test is not only to evaluate the specific items tested but also to provide a sound basis for predicting the reliability of similar items that will not be tested and that often have not yet been manufactured.

To know how reliable a product is one must know how many ways it can fail and the types and magnitudes of the stresses that produce such failures. This premise leads to a secondary objective of a reliability test: to produce failures in the product so that the types and magnitudes of the stresses causing such failures can be identified. Reliability tests that result in no failures provide some measure of reliability but little information about the population failure mechanisms of like devices. (The exceptions to this are not dealt with at this time.)

In subsequent sections, we discuss confidence levels, attribute test, test-to-failure, and life test methods, explain how well these methods meet the two test objectives, show how the test results can be statistically analyzed, and introduce the subject and use of confidence limits.

3.2 Confidence Levels

Mr. Igor Bazovsky, in his book, Reliability Theory and Practice (ref. 4), defines the term "confidence" in testing:

    We know that statistical estimates are more likely to be close to the true value as the sample size increases. Thus, there is a close correlation between the accuracy of an estimate and the size of the sample from which it was obtained. Only an infinitely large sample size could give us a 100 percent confidence or certainty that a measured statistical parameter coincides with the true value. In this context, confidence is a mathematical probability relating the mutual positions of the true value of a parameter and its estimate.

    When the estimate of a parameter is obtained from a reasonably sized sample, we may logically assume that the true value of that parameter will be somewhere in the neighborhood of the estimate, to the right or to the left. Therefore, it would be more meaningful to express statistical estimates in terms of a range or interval with an associated probability or confidence that the true value lies within such interval than to express them as point estimates. This is exactly what we are doing when we assign confidence limits to point estimates obtained from statistical measurements.

In other words, rather than express statistical estimates as point estimates, it would be more meaningful to express them as a range (or interval), with an associated probability (or confidence) that the true value lies within such an interval. Confidence is a statistical term that depends on supporting data and reflects the amount of risk to be taken when stating the reliability.

3.3 Attribute Test Methods

Qualification, preflight certification, and design verification tests are categorized as attribute tests (refs. 5 and 6). They are usually go/no-go and demonstrate that a device is good or bad without showing how good or how bad. In a typical test, two samples are subjected to a selected level of environmental stress, usually the maximum anticipated operational limit. If both samples pass, the device is considered qualified, preflight certified, or verified for use in the particular environment involved (refs. 7 and 8). Occasionally, such tests are called tests to
success because the true objective is to have the device pass the test.

In summary, an attribute test is not a satisfactory method of testing for reliability because it can only identify gross design and manufacturing problems; it is an adequate method of testing for reliability only when sufficient samples are tested to establish an acceptable level of statistical confidence.

3.4 Test-To-Failure Methods

The purpose of the test-to-failure method is to develop a failure distribution for a product under one or more types of stress. The results are used to calculate the demonstrated reliability of the device for each stress. In this case the demonstrated population reliability will usually be the P_t or P_w product reliability term.

In this discussion of test-to-failure methods, the term "safety factor" S_F is included because it is often confused with safety margin S_M. Safety factor is widely used in industry to describe the assurance against failure that is built into structural products. Of the many definitions of safety factor the most commonly used is the ratio of mean strength to reliability boundary:

    S_F = (mean strength) / R_b

When we deal with materials with clearly defined, repeatable, and "tight" strength distributions, such as sheet and structural steel or aluminum, using S_F presents little risk. However, when we deal with plastics, fiberglass, and other metal substitutes or processes with wide variations in strength or repeatability, using S_M provides a clearer picture of what is happening (fig. 4). In most cases, we must know the safety margin to understand how accurate the safety factor may be.

In summary, test-to-failure methods can be used to develop a strength distribution that provides a good estimate of the P_t and P_w product reliability terms without the need for the large samples required for attribute tests; the results of a test-to-failure exposure of a device can be used to predict the reliability of similar devices that cannot or will not be tested; testing to failure provides a means of evaluating the failure modes and mechanisms of devices so that improvements can be made; confidence levels can be applied to the safety margins and to the resulting population reliability estimates; the accuracy of a safety factor can be known only if the associated safety margin is known.

[Figure 4.--Two structures with identical safety factors (S_F = 13/10 = 1.3) but different safety margins. (a) Structure A: 9.68 percent defective. (b) Structure B: S_M = 4.0, 0.003 percent defective. Each panel shows the strength distribution p(x) and the reliability boundary R_b.]

3.5 Life Test Methods

Life tests are conducted to illustrate how the failure rate of a typical system or complex subsystem varies during its operating life. Such data provide valuable guidelines for controlling product reliability. They help to establish burn-in requirements, to predict spare part requirements, and to understand the need for or lack of need for a system overhaul program. Such data are obtained through laboratory life tests or from the normal operation of a fielded system.
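As a rough illustration of how life data might be reduced to a failure-rate-versus-time picture, the Python sketch below counts failures in fixed operating-time intervals, converts the counts to failures per unit-hour, and smooths the result with a simple running average. Everything specific here is an assumption made for illustration: the function names, the invented failure times, and the simplification that failed units are repaired or replaced so that the number of units on test stays constant.

```python
def interval_failure_rates(failure_times_hr, units_on_test, interval_hr, n_intervals):
    """Failures per unit-hour in each operating-time interval.

    Assumes a constant number of units on test (failed units are replaced).
    """
    counts = [0] * n_intervals
    for t in failure_times_hr:
        index = int(t // interval_hr)
        if 0 <= index < n_intervals:
            counts[index] += 1
    unit_hours = units_on_test * interval_hr
    return [count / unit_hours for count in counts]


def running_average(values, window):
    """Simple running average over `window` consecutive intervals."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Invented failure times (hours) for 10 units observed over 1000 hr.
failure_times_hr = [5, 12, 30, 150, 420, 880, 910, 950, 990]

rates = interval_failure_rates(failure_times_hr, units_on_test=10,
                               interval_hr=100, n_intervals=10)
smoothed = running_average(rates, window=3)

for start_interval, rate in enumerate(smoothed):
    print(f"intervals {start_interval}-{start_interval + 2}: {rate:.5f} failures per unit-hour")
```

In this made-up data set the smoothed rate is higher at both ends than in the middle, the kind of pattern that would point to an early burn-in region and a later wearout region. The window length is a judgment call: a longer window hides scatter but also blurs where those regions begin and end.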
In summary, life tests are performed to evaluate product failure rate characteristics; if failures include all causes of system failure, the failure rate of the system is the only true factor available for evaluating the system's performance; life tests at the part level require large sample sizes if realistic failure rate characteristics are to be identified; laboratory life tests must simulate the major factors that influence failure rates in a device during field operations; the use of running averages in the analysis of life data will identify burn-in and wearout regions if such exist; and failure rates are statistics and therefore are subject to confidence levels when used in making predictions.

Figure 5 illustrates what might be called a failure surface for a typical product. It shows system failure rate versus operating time and environmental stress, three parameters that describe a surface such that, given an environmental stress and an operating time, the failure rate is a point on the surface.

[Figure 5.--Product failure surface.]

Test-to-failure methods generate lines on the surface parallel to the stress axis; life tests generate lines on the surface parallel to the time axis. Therefore, these tests provide a good description of the failure surface and, consequently, the reliability of a product.

Attribute tests result only in a point on the surface if failures occur and a point somewhere within the volume if failures do not occur. For this reason, attribute testing is the least desirable method for ascertaining reliability.

Of course, in the case of missile flights or other events that produce go/no-go results, an attribute analysis is the only way to determine product reliability.

4.0 SOFTWARE RELIABILITY

Software reliability management is highly dependent on how the relationship between quality and reliability is perceived. For the purposes of this tutorial, quality is closely related to the process, and reliability is closely related to the product. Thus, both span the life cycle.

Before we can stratify software reliability, the progress of hardware reliability will be reviewed. Over the past 25 years, the industry observed (1) the initial assignment of "wizard status" to hardware reliability for theory, modeling, and analysis, (2) the growth of the field, and (3) the final establishment of hardware reliability as a science. One of the major problems was aligning reliability predictions and field performance. Once that was accomplished, the wizard status was removed from hardware reliability. The emphasis in hardware reliability from now to the year 2000 will be on system failure modes and effects.

Software reliability became classified as a science for many reasons. The difficulty in assessing software reliability is analogous to the problem of assessing the reliability of a new hardware device with unknown reliability characteristics. The existence of 30 to 50 different software reliability models indicates the organization in this area. Hardware reliability began at a few companies and later became the focus of the Advisory Group on Reliability of Electronic Equipment. The field then logically progressed through different models in sequence over the years. Similarly, numerous people and companies simultaneously entered the software reliability field in their major areas: cost, complexity, and reliability. The difference is that at least 100 times as many people are now studying software reliability as those who initially studied hardware reliability. The existence of so many models and their purports tends to mask the fact that several of these models showed excellent correlations between software performance predictions and actual software field performance: the Musa model as applied to communications systems and the Xerox model as applied to office copiers. There are also reasons for not accepting software reliability as a science, and they are discussed next.

One impediment to the establishment of software reliability as a science is the tendency toward programming development philosophies such as (1) "do it right the first time" (a reliability model is not needed) or (2) "quality is a programmer's development tool," or
(3) "quality is the same as reliability and is measured by the number of defects in a program and not by its reliability." All of these philosophies tend to eliminate probabilistic measures because the managers consider a programmer to be a software factory whose quality output is controllable, adjustable, or both. In actuality, hardware design can be controlled for reliability characteristics better than software design can. Design philosophy experiments that failed to enhance hardware reliability are again being formulated for software design (ref. 9). Quality and reliability are not the same. Quality is characteristic and reliability is probabilistic. Our approach draws the line between quality and reliability because quality is concerned with the development process and reliability is concerned with the operating product. Many models have been developed and a number of the measurement models show great promise. Predictive models have been far less successful partly because a data base (such as MIL-HDBK-217E, ref. 10) is not yet available for software. Software reliability often has to use other methods; it must be concerned with the process of software product development.

4.1 Hardware and Software Failures

Microprocessor-based products have more refined definitions. Four types of failure may be considered: (1) hardware catastrophic, (2) hardware transient, (3) software catastrophic, and (4) software transient. In general, the catastrophic failures require a physical or remote hardware replacement, a manual or remote unit restart, or a software program patch. The transient failure categories can result in either restarts or reloads for the microprocessor-based systems, subsystems, or individual units and may or may not require further correction. A recent reliability analysis of such a system assigned ratios for these categories. Hardware transient faults were assumed to occur at 10 times the hardware catastrophic rate, and software transient faults were assumed to occur at 100 to 500 times the software catastrophic rate.

The time of day is of great concern in reliability modeling and analysis. Although hardware catastrophic failures occur at any time of the day, they often manifest themselves during busier system processing times. On the other hand, hardware and software transient failures generally occur during the busy hours. When a system's predicted reliability is close to the specified reliability, a sensitivity analysis must be performed.

4.2 Manifestations of Software Bugs

Many theories, models, and methods are available for quantifying software reliability. Nathan (ref. 11) stated, "It is contrary to the definition of reliability to apply reliability analysis to a system that never really works. This means that the software which still has bugs in it really has never worked in the true sense of reliability in the hardware sense." Large complex software programs used in the communications industry are usually operating with some software bugs. Thus, a reliability analysis of such software is different from a reliability analysis of established hardware. Software reliability is not alone in the need for establishing qualitative and quantitative models.

In the early 1980's, work was done on a combined hardware/software reliability model. A theory for combining well-known hardware and software models in a Markov process was developed. A consideration was the topic of software bugs and errors based on experience in the telecommunications field. To synthesize the manifestations of software bugs, some of the following hardware trends for these systems should be noted: (1) hardware transient failures increase as integrated circuits become denser; (2) hardware transient failures tend to remain constant or increase slightly with time after the burn-in; and (3) hardware (integrated circuit) catastrophic failures decrease with time after the burn-in phase. These trends affect the operational software of communications systems. If the transient failures increase, the error analysis and system security software are called into action more often. This increases the risk of misprocessing a given transaction in the communications system. A decrease in the catastrophic failure rate of integrated circuits can be significant (ref. 12). An order-of-magnitude decrease in the failure rate of 4K memory devices between the first year and the twentieth year is predicted. We also tend to oversimplify the actual situations. Even with five vendors of these 4K devices, the manufacturing quality control person may have to set up different screens to eliminate the defective devices from different vendors. Thus, the system software will see many different transient memory problems and combinations of them in operation.

Central control technology has prevailed in communications systems for 25 years. The industry has used many of its old modeling tools and applied them directly to distributed control structures. Most modeling research was performed on large duplex processors. With an evolution through forms of multiple duplex processors and load-sharing processors and on to the present forms of distributed processing architectures, the modeling tools need to be verified. With fully distributed control systems, the software reliability model must be conceptually matched to the software design in order to achieve valid predictions of reliability.

The following trends can be formulated for software transient failures: (1) software transient failures decrease
as the system architecture approaches a fully distributed control structure, and (2) software transient failures increase as the processing window decreases (i.e., less time allowed per function, fast timing mode entry, removal of error checking, removal of system ready checks).

A fully distributed control structure can be configured to operate as its own error filter. In a hierarchy of processing levels, each level acts as a barrier to the level below and prevents errors or transient faults from propagating through the system. Central control structures cannot usually prevent this type of error propagation.

If the interleaving of transaction processes in a software program is reduced, such as with a fully distributed control architecture, the transaction processes are less likely to fail. This is especially true with nonconsistent user interaction as experienced in communications systems. Another opinion on software transient failures is that the faster a software program runs, the more likely it is to cause errors (such as encountered in central control architectures).

A "missing link" needs further discussion. Several methods can be used to quantify the occurrence of software bugs. However, manifestations in the system's operations are detrimental to the reliability analysis because each manifestation could cause a failure event. The key is to categorize levels of criticality for bug manifestations and estimate their probability of occurrence and their respective distributions. The importance of this increases with the distribution of the hardware and software. Software reliability is often controlled by establishing a software reliability design process. The final measure is the system test, which includes the evaluation of priority problems and the performance of the system while under stress as defined by audits, interrupts, re-initialization, and other measurable parameters. The missing link in quantifying software bug manifestations needs to be found before we can obtain an accurate software reliability model for measuring tradeoffs in the design process on a predicted performance basis. If a software reliability modeling tool could additionally combine the effects of hardware, software, and operator faults, it would be a powerful tool for making design tradeoff decisions. Table I, an example of the missing link, presents a five-level criticality index for defects. These examples indicate the flexibility of such an approach to criticality classification.

We can choose a decreasing, constant, or increasing software bug removal rate for systems software. Although each has its application to special situations and systems, a decreasing software bug removal rate will generally be encountered. Systems software also has advantages in that certain software defects can be temporarily patched and the permanent patch postponed to a more appropriate date. Thus, this type of defect manifestation is treated in general as one that does not affect service, but it should be included in the overall software quality assessment. The missing link concerns software bug manifestations. Until the traditional separation of hardware and software systems is overcome in the design of large systems, it will be impossible to achieve a satisfactory performance benchmark. This indicates that software performance modeling has not yet focused on the specific causes of software unreliability.

4.3 Concept of Quality

Consider the concept of quality before we go on to software quality. The need for quality is universal. The concepts of "zero defects" and "doing it right the first time" have changed our perspective on quality management. We changed from measuring defects per unit and acceptable quality levels to monitoring the design and cost reduction processes. The present concepts indicate that quality is not free. One viewpoint is that a major improvement in quality can be achieved by perfecting the process of developing a product. Thus, we would characterize the process, implement factors to achieve customer satisfaction, correct defects as soon as possible, and then strive for total quality management. The key to achieving

TABLE I.--CRITICALITY INDEX

Bug              Defect         Level of      Failure type    Failure
manifestation    removal        criticality                   characteristic
rate             rate

4 per day        1 per month                  Transient       Errors come and go
3 per day        1 per week                   Transient       Errors are repeated
2 per week       1 per month                  Transient or    Service is affected
                                              catastrophic
1 per month      2 per year                   Transient or    System is partially
                                              catastrophic    down
1 per two years  1 per year                   Catastrophic    System stops

quality appears to have a third major factor in addition to product and
process: the environment. People are important. They make the process
or the product successful.

The next step is to discuss what the process of achieving quality in
software consists of and how quality management is involved. The
purpose of quality management for programming products is to ensure
that a preselected software quality level has been achieved on schedule
and in a cost-effective manner. In developing a quality management
system, the programming product's critical life-cycle-phase reviews
provide the reference base for tracking the achievement of quality
objectives. The International Electrotechnical Commission (IEC) system
life-cycle phases presented in their guidelines for reliability and
maintainability management are (1) concept and definition, (2) design
and development, (3) manufacturing, installation, and acceptance,
(4) operation and maintenance, and (5) disposal.

In general, a phase-cost study shows the increasing cost of correcting
programming defects in later phases of a programming product's life.
Also, the higher the level of software quality, the more life-cycle
costs are reduced.

4.4 Software Quality

The next step is to look at specific software quality items. Software
quality is defined as "the achievement of a preselected software
quality level within the costs, schedule, and productivity boundaries
established by management" (ref. 10). However, agreement on such a
definition is often difficult to achieve because metrics vary more than
those for hardware, software reliability management has focused on the
product, and software quality management has focused on the process. In
practice, the quality emphasis can change with respect to the specific
product application environment. Different perspectives of software
product quality have been presented over the years. However, in today's
literature there is general agreement that the proper quality level for
a particular software product should be determined in the concept and
definition phase and that quality managers should monitor the project
during the remaining life-cycle phases to ensure the proper quality
level.

The developer of a methodology for assessing the quality of a software
product must respond to the specific characteristics of the product.
There can be no single quality metric. The process of assessing the
quality of a software product begins with the selection of specific
characteristics, quality metrics, and performance criteria.

With respect to software quality, several areas of interest are
(1) characteristics, (2) metrics, (3) overall metrics, and
(4) standards. Areas (1) and (2) are applicable during both the design
and development phase and the operation and maintenance phase. In
general, area (2) is used during the design and development phase
before the acceptance phase for a given software product. The following
discussion will concern area (2).

4.5 Software Quality Metrics

The entire area of software measurements and metrics has been widely
discussed and the subject of many publications. Notable is the guide
for software reliability measurement developed by the Institute of
Electrical and Electronics Engineers (IEEE) Computer Society's working
group on metrics. A basis for software quality standardization was also
issued by the IEEE. Software metrics cannot be developed before the
cause and effect of a defect have been established for a given product
with relation to its product life cycle. A typical cause-and-effect
chart for a software product includes the process indicator. At the
testing stage of product development, the evolution of software quality
levels can be assessed by characteristics such as freedom from error,
successful test case completion, and estimate of the software bugs
remaining. For example, these process indicators can be used to predict
slippage of the product delivery date and the inability to meet
original design goals.

When the programming product enters the qualification, installation,
and acceptance phase and continues into the maintenance and
enhancements phase, the concept of performance is important in the
quality characteristic activity. This concept is shown in table II
where the 5 IEC system life-cycle phases have been expanded to
10 software life-cycle phases.

4.6 Concluding Remarks

This section presented a snapshot of software quality assurance today.
Continuing research is concerned with the use of overall software
quality metrics and better software prediction tools for determining
the defect population. In addition, simulators and code generators are
being further developed so that high-quality software can be produced.

Process indicators are closely related to software quality and some
include them as a stage in software development. In general, such
measures as (1) test cases completed versus test cases planned and
(2) the number of lines of code developed versus the number expected
give an indication of the overall company or corporate progress toward
a quality software product. Too often,

TABLE II.--MEASUREMENTS AND PROGRAMMING PRODUCT LIFE CYCLE
[The 5 International Electrotechnical Commission (IEC) life-cycle
phases have been expanded to 10 software phases.]

System life-       Software                        Order of precedence
cycle phase        life-cycle phase                Primary                    Secondary

Concept and        Conceptual planning (1)         ...                        ...
definition         Requirements definition (2)     ...                        ...
                   Product definition (3)          Quality metrics (a)        ...
Design and         Top-level design (4)            Quality metrics            Process indicators
development        Detailed design (5)             Quality metrics            Process indicators
                   Implementation (6)              Process indicators (b)     Quality metrics
Manufacturing      Testing and integration (7)     Process indicators         Performance measures
and installation   Qualification, installation,    Performance measures (c)   Quality metrics
                   and acceptance (8)
Operation and      Maintenance and                 Performance measures       ...
maintenance        enhancements (9)
Disposal           Disposal (10)                   ...                        ...

(a) Metrics, qualitative assessment, quantitative prediction, or both.
(b) Indicators, month-by-month tracking of key project parameters.
(c) Measures, quantitative performance assessment.

personnel are moved from one project to another and thus the lagging
projects improve but the leading projects decline in their process
indicators. The life cycle for programming products should not be
disrupted.

Performance measures, including such criteria as the percentage of
proper transactions, the number of system restarts, the number of
system reloads, and the percentage of uptime, should reflect the user's
viewpoint.

In general, the determination of applicable quality measures for a
given software product development is viewed as a specific task of the
software quality assurance function. The determination of the process
indicators and performance measures is a task of the software quality
standards function.

5.0 RELIABILITY MANAGEMENT

To design for successful reliability and continue to provide customers
with a reliable product, the following steps are necessary:

(1) Determine the reliability goals to be met.

(2) Construct a symbolic representation.

(3) Determine the logistics support and repair philosophy.

(4) Select the reliability analysis procedure.

(5) Select the data sources for failure rates and repair rates.

(6) Determine the failure rates and the repair rates.

(7) Perform the necessary calculations.

(8) Validate and verify the reliability.

(9) Measure reliability until customer shipment.

5.1 Goals and Objectives

Goals must be placed into the proper perspective. Because they are
often examined by using models that the producer develops, one of the
weakest links in the reliability process is the modeling. Dr. John D.
Spragins, an editor for the IEEE Transactions on Computers,
corroborates this fact with the following statement (ref. 13):

    Some standard definitions of reliability or availability, such as
    those based on the probability that all components of a system are
    operational at a given time, can be dismissed as irrelevant when
    studying large telecommunication networks. Many telecommunication
    networks are so large that the probability they are operational
    according to this criterion may be very nearly zero; at least one
    item of equipment may be down essentially all of the time. The
    typical user, however, does not see this unless he or she happens to
    be the unlucky person whose equipment fails; the system may still
    operate perfectly from this user's point of view. A more meaningful
    criterion is one based on the reliability seen by typical system
    users. The reliability apparent to system operators is another
    valid, but distinct, criterion. (Since system operators commonly
    consider systems down only after failures have been reported to
    them, and may not hear of short self-clearing outages, their
    estimates of reliability are often higher than the values seen by
    users.)

Reliability objectives can be defined differently for various systems.
An example from the telecommunications industry (ref. 14) is presented
in table III.

5.2 Specification Targets

A system can have a detailed performance or reliability specification
that is based on customer requirements. The survivability of a
telecommunications network is defined as the ability of the network to
perform under stress caused by cable cuts or sudden and lengthy traffic
overloads and after failures including equipment breakdowns. Thus,
performance and availability have been combined into a unified metric.
One area of telecommunications where these principles have been applied
is the design and implementation of fiber-based networks. Roohy-Laleh
et al. (ref. 15) state "...the statistical observation that on the
average 56 percent of the pairs in a copper cable are cut when the
cable is dug up, makes the copper network 'structurally survivable.'"
On the other hand, a fiber network can be assumed to be an all or
nothing situation with 100 percent of the circuits being affected by a
cable cut, failure, etc. In this case study (ref. 15), "...cross
connects and allocatable capacity are utilized by the intelligent
network operation system to dynamically reconfigure the network in the
case of failures." Figure 6 (from ref. 16) presents a concept for
specification targets.

5.3 Human Reliability

The major objectives of reliability management are to ensure that a
selected reliability level for a product can be achieved on schedule in
a cost-effective manner and that the customer perceives the selected
reliability level. The current emphasis in reliability management is on
meeting or exceeding customer expectations. We can view this as a
challenge, but it should be viewed as the bridge between the user and
the producer or provider. This bridge is actually "human reliability."
In the past, the producer was concerned with the process and the
product and found reliability measurements that addressed both. Often
there was no correlation between field data, the customer's perception
of reliability, and the producer's reliability metrics. Surveys then
began to indicate that the customer distinguished between reliability
performance, response to order placement, technical support, service
quality, etc.

Human reliability is defined (ref. 17) as "...the probability of
accomplishing a job or task successfully by humans at any required
stage in system operations within a specified minimum time limit (if
the time requirement is specified)." Although customers generally are
not yet requiring human reliability models in addition to the requested
hardware and software reliability models, the science of human
reliability is well established.

5.4 Customer

Reliability growth has been studied, modeled, and analyzed--usually
from the design and development viewpoint. Seldom is the process or
product studied from the customer's perspective. Furthermore, the
reliability that the first customer observes with the first shipment

TABLE III.--RELIABILITY OBJECTIVES FOR TELECOMMUNICATIONS INDUSTRY

Module or system             Objective

Telephone instrument         Mean time between failures
Electronic key system        Complete loss of service
                             Major loss of service
                             Minor loss of service
PABX                         Complete loss of service
                             Major loss of service
                             Minor loss of service
                             Mishandled calls
Traffic service              Mishandled calls
position system (TSPS)       System outage
Class 5 office               System outage
Class 4 office               Loss of service
Class 3 office               Service degradation

Figure 6.--Specification targets (ref. 16). [Plot against availability,
percent (axis marks a1, a2, 100), with the other axis marked P1, P2,
and 100. Labeled regions: fully operational; subliminal availability,
major; subliminal availability, minor; degraded operation; subliminal
performance, 75 percent at load factor B; subliminal performance,
65 percent at load factor B; unusable.]

can be quite different from the reliability that a customer will
observe with a unit or system produced 5 years later, or with the last
shipment. Because the customer's experience can vary with the maturity
of a system, reliability growth is an important concept to customers
and should be considered in their purchasing decisions.

One key to reliability growth is the ability to define the goals for
the product or service from the customer's perspective while reflecting
the actual situation in which the customer obtains the product or
service. For large telecommunications switching systems, the rule of
thumb for determining reliability growth has been that often systems
have been allowed to operate at a lower availability than the specified
availability goal for the first 6 months to 1 year of operation
(ref. 18). In addition, component part replacement rates have often
been allowed to be 50 percent higher than specified for the first
6 months of operation. These allowances accommodated craftspersons'
learning patterns, software patches, design errors, etc.

Another key to reliability growth is to have its measurement encompass
the entire life cycle of the product. The concept is not new; only here
the emphasis is placed on the customer's perspective.

Reliability growth can be specified from "day 1" in product development
and can be measured or controlled with a 10-year life until "day 5000."
We can apply the philosophy of reliability knowledge generation
principles, which is to generate reliability knowledge at the earliest
possible time in the planning process and to add to this base for the
duration of the product's useful life. To accurately measure and
control reliability growth, we must examine the entire manufacturing
life cycle. One method is the construction of a production life-cycle
reliability growth chart.

In certain large telecommunications systems, the long installation time
allows the electronic part reliability to grow so that the customer
observes both the design and the production growth. Large complex
systems often offer an environment unique to each product installation,
which dictates that a significant reliability growth will occur. Yet,
with the difference that size and complexity impose on resultant
product reliability growth, corporations with large product lines
should not present overall reliability growth curves on a corporate
basis but must present individual product-line reliability growth
pictures to achieve total customer satisfaction.
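
The early-life allowance described in section 5.4 for large
telecommunications switching systems (component part replacement rates
allowed to be 50 percent higher than specified for the first 6 months
of operation) can be written as a small check. The following is only a
minimal sketch under stated assumptions: the function names, the
parameter names, and the treatment of the allowance as a step that ends
exactly at 6 months are illustrative choices, not something the
tutorial prescribes.

```python
# Illustrative sketch only. The 50 percent and 6 month figures come from
# the rule of thumb quoted in the text; everything else (names, units,
# the step-shaped allowance) is an assumption made for this example.

def allowed_replacement_rate(specified_rate, months_in_service,
                             allowance=0.50, allowance_months=6):
    """Return the part replacement rate tolerated at a given system age.

    specified_rate    -- replacement rate from the specification (any unit)
    months_in_service -- time since the system entered operation
    allowance         -- fractional increase tolerated early in life
    allowance_months  -- length of the early-life allowance period
    """
    if months_in_service < 0:
        raise ValueError("months_in_service must be non-negative")
    if months_in_service < allowance_months:
        return specified_rate * (1.0 + allowance)
    return specified_rate


def within_allowance(observed_rate, specified_rate, months_in_service):
    """True if the observed replacement rate is acceptable at this age."""
    limit = allowed_replacement_rate(specified_rate, months_in_service)
    return observed_rate <= limit


if __name__ == "__main__":
    # Hypothetical numbers: specification of 2.0 replacements per month.
    for month, observed in [(3, 2.5), (12, 2.5)]:
        verdict = within_allowance(observed, 2.0, month)
        print(f"month {month}: observed {observed} acceptable = {verdict}")
```

A check of this kind only compares field data with the producer's
allowance; it does not by itself capture the customer's perception of
reliability growth that the section emphasizes.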

APPENDIX--COURSE EVALUATION

NASA SAFETY TRAINING CENTER (NSTC) COURSE EVALUATION

Name:                         Course Title:
Sponsor:                      Grade: (academic course only)
Date:

1. What were the strengths of this course/workshop?

2. What were the weaknesses of this course/workshop?

3. How will the skills/knowledge you gained in this course/workshop
   help you to perform better in your job?

4. Please give the course/workshop an overall rating.
       5         4         3         2         1
   Excellent            Fair                Poor

5. Please give the instructor an overall rating.
       5         4         3         2         1
   Excellent            Fair                Poor

6. Please rate the applicability of this course to your work.
       5         4         3         2         1
   Excellent            Fair                Poor

7. As a customer of the NASA Safety Training Center (NSTC), how would
   you rate our services?
       5         4         3         2         1
   Excellent            Fair                Poor
   Comments:

8. Please rate the following items (5 = Excellent, 3 = Fair, 1 = Poor):
   1. Overall course content ...................... 5  4  3  2  1
   2. Achievement of course objectives ............ 5  4  3  2  1
   3. Instructor's knowledge of subject ........... 5  4  3  2  1
   4. Instructor's presentation methods ........... 5  4  3  2  1
   5. Instructor's ability to address questions ... 5  4  3  2  1
   6. Quality of textbook/workbook (if applicable)  5  4  3  2  1
   7. Training facilities ......................... 5  4  3  2  1
   8. Time allotted for the course ................ 5  4  3  2  1
   Comments:

9. Training expense other than tuition (if applicable):
   Travel (including plane fare, taxi, car rental and tolls)
   Per Diem
   Total

10. Please send this evaluation to:
    NASA Safety Training Center
    Webb, Murray & Associates, Inc.
    1730 NASA Road One, Suite 102
    Houston, Texas 77058

THANK YOU!

REFERENCES

1. Reliability Prediction of Electronic Equipment. MIL-HDBK-217E,
   Jan. 1990.

2. Electronic Reliability Design Handbook. MIL-HDBK-338, vols. 1 and 2,
   Oct. 1988.

3. Reliability Modeling and Prediction. MIL-STD-756B, Aug. 1982.

4. Bazovsky, I.: Reliability Theory and Practice. Prentice-Hall, 1963.

5. Reliability Test Methods, Plans, and Environments for Engineering
   Development, Qualification, and Production. MIL-HDBK-781, July 1987.

6. Laubach, C.H.: Environmental Acceptance Testing. NASA SP-T-0023,
   1975.

7. Laube, R.B.: Methods to Assess the Success of Test Programs.
   J. Environ. Sci., vol. 26, no. 2, Mar.-Apr. 1983, pp. 54-58.

8. Test Requirements for Space Vehicles. MIL-STD-1540B, Oct. 1982.

9. Siewiorek, D.P.; and Swarz, R.S.: The Theory and Practice of
   Reliable System Design. Digital Press, Bedford, MA, 1982,
   pp. 206-211.

10. Reliability Prediction of Electronic Equipment. MIL-HDBK-217E,
    Jan. 1990.

11. Nathan, I.: A Deterministic Model To Predict "Error-Free" Status of
    Complex Software Development. Workshop on Quantitative Software
    Models, IEEE, New York, 1979.

12. Schick, G.J.; and Wolverton, R.W.: An Analysis of Competing
    Software Reliability Models. IEEE Trans. Software Eng., vol. SE-4,
    no. 2, Mar. 1978, pp. 104-120.

13. Spragins, J.D., et al.: Current Telecommunication Network
    Reliability Models: A Critical Assessment. IEEE J. Sel. Topics
    Commun., vol. SAC-4, no. 7, Oct. 1986, pp. 1168-1173.

14. Malec, H.A.: Reliability Optimization in Telephone Switching
    Systems Design. IEEE Trans. Rel., vol. R-26, no. 3, Aug. 1977,
    pp. 203-208.

15. Roohy-Laleh, E., et al.: A Procedure for Designing a Low Connected
    Survivable Fiber Network. IEEE J. Sel. Topics Commun., vol. SAC-4,
    no. 7, Oct. 1986, pp. 1112-1117.

16. Jones, D.R.; and Malec, H.A.: Communications Systems
    Performability: New Horizons. 1989 IEEE International Conference on
    Communications, vol. 1, IEEE, 1989, pp. 1.4.1-1.4.9.

17. Dhillon, B.S.: Human Reliability: With Human Factors. Pergamon
    Press, 1986.

18. Conroy, R.A.; Malec, H.A.; and Van Goethem, J.: The Design,
    Applications, and Performance of the System-12 Distributed Computer
    Architecture. First International Conference on Computers and
    Applications, E.A. Parrish and S. Jiang, eds., IEEE, 1984,
    pp. 186-195.

REPORT DOCUMENTATION PAGE
Form Approved, OMB No. 0704-0188

Public reporting burden for this collection of information is estimated
to average 1 hour per response, including the time for reviewing
instructions, searching existing data sources, gathering and
maintaining the data needed, and completing and reviewing the
collection of information. Send comments regarding this burden estimate
or any other aspect of this collection of information, including
suggestions for reducing this burden, to Washington Headquarters
Services, Directorate for Information Operations and Reports,
1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and
to the Office of Management and Budget, Paperwork Reduction Project
(0704-0188), Washington, DC 20503.

1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE: October 1994
3. REPORT TYPE AND DATES COVERED: Technical Memorandum
4. TITLE AND SUBTITLE: Design for Reliability: NASA Reliability
   Preferred Practices for Design and Test
5. FUNDING NUMBERS: WU-323-44-19
6. AUTHOR(S): Vincent R. Lalli
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Lewis Research
   Center, Cleveland, Ohio 44135-3191
8. PERFORMING ORGANIZATION REPORT NUMBER: E-8053
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): Washington,
   D.C. 20546-0001
10. SPONSORING/MONITORING AGENCY REPORT NUMBER: NASA TM-106313
11. SUPPLEMENTARY NOTES: Prepared for the Reliability and
    Maintainability Symposium cosponsored by ASQC, IIE, IEEE, SOLE,
    IES, AIAA, SSS, and SRE, Anaheim, California, January 24-27, 1994.
    Responsible person, Vincent R. Lalli, organization code 0152,
    (216) 433-2354.
12a. DISTRIBUTION/AVAILABILITY STATEMENT: Unclassified - Unlimited,
     Subject Category 18
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words)
    This tutorial summarizes reliability experience from both NASA and
    industry and reflects engineering practices that support current
    and future civil space programs. These practices were collected
    from various NASA field centers and were reviewed by a committee of
    senior technical representatives from the participating centers
    (members are listed at the end). The material for this tutorial was
    taken from the publication issued by the NASA Reliability and
    Maintainability Steering Committee (NASA Reliability Preferred
    Practices for Design and Test. NASA TM-4322, 1991). Reliability
    must be an integral part of the systems engineering process.
    Although both disciplines must be weighted equally with other
    technical and programmatic demands, the application of sound
    reliability principles will be the key to the effectiveness and
    affordability of America's space program. Our space programs have
    shown that reliability efforts must focus on the design
    characteristics that affect the frequency of failure. Herein, we
    emphasize that these identified design characteristics must be
    controlled by applying conservative engineering principles.
14. SUBJECT TERMS: Design; Test; Practices; Reliability; Training;
    Flight proven
15. NUMBER OF PAGES: 27
16. PRICE CODE: A03
17. SECURITY CLASSIFICATION OF REPORT: Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified
20. LIMITATION OF ABSTRACT

NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. Z39-18
298-102