
Estimating Software Reliability with

Hypothesis Testing
Denise M. Woit

October 22, 1993


CRL Report No. 263
Telecommunications Research Institute of Ontario (TRIO)
McMaster University, Hamilton, Ontario L8S 4K1
Copyright © 1993, D.M. Woit
Abstract
We present a method for estimating software reliability, based on statistical
hypothesis testing. Our estimations are based on the current version of the
software only, and our method is applicable when the current version has not
yet failed any random tests. Our method is particularly useful when one desires
reliability estimates that are practical to calculate, and that are more accurate
than estimations obtainable with current models.

Acknowledgements
I would like to thank my supervisor, David Parnas of McMaster University,
for his advice support, and guidance. Thanks also goes to Roman Viveros-
Aguilera of McMaster University for his helpful discussions regarding this
work.

Contents

1 Introduction
  1.1 Applications of our Method
    1.1.1 Module Terminology
    1.1.2 Memoryless Program Terminology
    1.1.3 Modes of Use
2 Relevant Research
  2.1 Reliability Growth Models
  2.2 Reliability Models
    2.2.1 Inappropriateness of Most Methods
    2.2.2 Inappropriateness of Miller's Method
3 Reliability Estimation Method
  3.1 Application of Hypothesis Testing to Software
    3.1.1 Reliability Calculations
    3.1.2 More Exact Calculations
4 Effectiveness of the Method
  4.1 Comparison Criterion
  4.2 Comparison Experiments
  4.3 Method Preference
5 Conclusions
A Theory of Hypothesis Testing

1 Introduction
In this paper, we present a method for estimating software reliability, based
on statistical hypothesis testing. Our estimations are based on the current
version of the software only, and our method is applicable when the current
version has not yet failed any random tests. Our method is therefore particularly
useful for estimating the reliability of the final version of the software, since
that version, by definition, has not failed any test (assuming failure precipitates
software revision).
Other reliability estimation methods have been described in the literature;
however, as detailed in Section 2, these models may be impractical or may
produce imprecise estimations. Our method overcomes the limitations of these
other methods because it avoids their unrealistic assumptions. Thus, our method
is especially useful when one desires reliability estimates that are practical to
calculate, and that are more precise than estimations obtainable with current
models.
In Section 1.1, potential applications of our method are described. In
Section 2, we outline the reliability estimation methods of the current literature
and detail the problems and limitations associated with them. In Section 3, we
describe how classical statistical hypothesis testing can be applied to software
reliability estimation and then present our estimation method; the underlying
theory of hypothesis testing is reviewed in Appendix A. Section 4 explains why
we believe our method can be effective. Conclusions are presented in Section 5.

1.1 Applications of our Method

Our method was derived specifically for estimating the operational reliability of
modules [Woi93a]. However, the method can also be applied to memoryless
programs and to non-operational reliability. In the remainder of this section, we
first define the necessary terminology; we then describe how our method can
be used to estimate the operational reliability of both modules and memoryless
programs; and we also describe how non-operational reliability estimations
can be obtained.
1.1.1 Module Terminology
A module is an information-hiding package of programs that implements objects
of a particular type. The object communicates with the outside world
only by input variables, which the object observes, and by access programs,
which provide information to and/or receive information from the object by
means of their arguments.
An object's state can only be changed by an event, which may be the
external invocation of an access program, or a change in the value of an input
variable.
A module execution refers to the sequence of events issued to a module,
beginning with module initialization (or re-initialization) and ending with the
event immediately prior to the next module re-initialization. A module execution
is described by the sequence of event descriptions, $Init.E_1.E_2.\ldots.E_{t_i}$,
where $Init$ refers to module initialization or re-initialization, $E_j$ is the $j$th event
issued after this module initialization (or re-initialization), and $E_{t_i}$ is the last
event issued before the next module re-initialization ($1 \le j \le t_i$).

A test case is a module execution; thus, it may be denoted $Init.E_1.E_2.\ldots.E_{t_i}$.

An operational profile is a description of the distribution of input values
that is expected to occur in actual module operation. Operational profiles for
modules are assumed to be conditional probability distributions, as outlined
in [Woi93b].
A test case (module execution), $E_0E_1 \ldots E_n$, is said to fail if at least
one of its constituent events, $E_k$, fails, i.e., the (module-state, input) pair
$(E_0E_1 \ldots E_{k-1}, E_k)$ does not produce the correct output or the correct module
state according to the module specifications (for some $k = 0, \ldots, n$).
The failure rate of a module is the probability that a module execution,
selected at random according to a given operational profile, will fail.

We consider the operational reliability (hereafter simply referred to as
reliability) of a software module to be the probability that a module execution,
selected at random according to a given operational profile, will not fail. Thus,
reliability = 1 - (failure rate).
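To make the terminology above concrete, the following sketch represents a test case as an initialization followed by a sequence of events, each checked against a specification. Python is not used in the report, and the counter module, its access programs, and its specification are invented here purely for illustration.

```python
class Counter:
    """Toy module: the hidden state is an integer; inc() and get() are access programs."""
    def __init__(self):          # module (re-)initialization
        self.value = 0
    def inc(self):               # event: invocation of an access program
        self.value += 1
    def get(self):               # event: invocation of an access program
        return self.value

def run_test_case(events):
    """A test case is a module execution: Init followed by a sequence of events.
    It fails if any event disagrees with the (hypothetical) specification,
    here: get() must return the number of inc() events issued so far."""
    m = Counter()                # Init
    incs = 0
    for op in events:
        if op == "inc":
            m.inc()
            incs += 1
        elif op == "get":
            if m.get() != incs:  # compare observed output against the specification
                return "fail"
    return "pass"

print(run_test_case(["inc", "inc", "get", "inc", "get"]))  # expected: pass
```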
1.1.2 Memoryless Program Terminology
When our reliability estimation method is to be applied to modules, we must
assume the definitions above. However, when we wish to use our method
to estimate the reliability of memoryless programs, the definitions of operational
profile, test case, failure, etc. become simpler:

An operational profile is an unconditional probability distribution on the
inputs of the program.
A test case consists of one program input, which we consider to be a vector
of program argument values (if the program has multiple arguments), or a
single argument value (if the program has only one argument).

A test case (input) is said to fail if the program does not produce correct
output according to the program specifications.
The failure rate is the probability that an input, selected at random
according to a given operational profile, will fail.

Operational reliability of a memoryless program is the probability that an
input, selected at random according to a given operational profile, will not fail.
1.1.3 Modes of Use
In this paper, we use the terms defined above in the description of our
reliability estimation method. The method can be used to estimate the reliability
of a module or of a memoryless program, depending on which set of definitions
is assumed. To estimate non-operational reliability, one simply selects test cases
according to a distribution other than the operational profile; for instance, to
estimate average-case reliability, test cases are selected according to a uniform
distribution.
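As a minimal sketch of this distinction, assume a memoryless program whose operational profile is given as a dictionary mapping inputs to probabilities; the inputs and probabilities below are invented for illustration. Test cases may then be selected either according to the operational profile or uniformly (for average-case reliability).

```python
import random

# Hypothetical operational profile of a memoryless program: input -> probability.
profile = {0: 0.5, 1: 0.3, 2: 0.15, 3: 0.05}

def select_operational(n):
    """Select n test cases at random according to the operational profile."""
    inputs, probs = zip(*profile.items())
    return random.choices(inputs, weights=probs, k=n)

def select_uniform(n):
    """Select n test cases uniformly (e.g., for average-case reliability)."""
    return random.choices(list(profile.keys()), k=n)

print(select_operational(10))
print(select_uniform(10))
```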

2 Relevant Research
In the current literature, two general methods of estimating software reliability
are described: reliability growth models and reliability models. Judging from
this literature, the former is the more commonly used.

2.1 Reliability Growth Models


In reliability growth models, the reliability of one version of a program is
predicted from the reliabilities of previous versions. A survey of some popular
models is available in [MIO90]. We do not wish to use reliability growth models
for estimating reliability because we believe their estimations are too imprecise
and because we wish to estimate reliability based upon only the performance
of the final version of the software. In the following list, we present a more
detailed description of why we consider reliability growth models inappropriate
in our situation.
1. In growth models the software is considered initially faulty, and reliability
poor; an increase in reliability is calculated with each successive
modification of the software. For some software, the time necessary for
it to achieve the desired level of reliability can be impractically long (or
infinite) [KM91, Lit90].
Littlewood [Lit90] presents some real-life data (inter-fail times) depicting
this problem. The data show that reliability grows quickly at first, then
progressively more slowly. This is because the faults with the highest
failure rates tend to be discovered and removed first; the remaining
faults tend to have lower failure rates and therefore take longer to find.
2. Good statistical practice requires that the reliability estimate of version
i be based only on trials involving version i. Growth models assume
that the reliability of version i can be obtained by using data from the
previous versions, 1, 2, ..., i - 1, i.e., that trials from previous versions
are relevant to the reliability of version i. Growth models also tend to
assume that the reliability of a slightly changed program is only slightly
different from the unchanged program. With software, there is no reason
to believe that these assumptions hold, because of the intricate manner
in which modifications affect the set of failure-causing inputs.
Reliability growth estimations are basically a curve-fitting and extrapolation
problem:

[Figure: the reliability estimates R1, R2, R3, R4, R5, ... of successive versions are fitted to a curve and extrapolated to predict Ri.]
It is argued that with such models, only a rough estimate of Ri is possi-
ble. Some believe that these models are best used as a managerial tool
to predict schedules, etc., but that they should not be used if a fairly
accurate estimate of reliability is required [Lit90].
3. Growth models make predictions based upon the number of failures ob-
served during testing. If no failures are observed, the predictions are
not very meaningful because they are based on values of the model pa-
rameters, which must be guessed. (Once failure data is available, the
parameters are estimated using techniques such as Maximum Likelihood
Estimation or Least Squares.)
4. Different reliability growth models can produce vastly different results
for the same data. In fact, a recent development in the area of growth
models is the use of super-models to help users select which reliability
growth model is most suitable for their particular application [KM89,
FS88, Lyu90, LBL92]. We believe this is evidence that reliability growth
model estimates are too imprecise in general, for if they were all very
precise, they would all produce similar estimations. We are unable to
determine if one model is "correct" and the others are not, because
advocates of each method can produce examples for which their model
is more accurate.

2.2 Reliability Models


Reliability models estimate software reliability based upon trials of one version
of the software only [TLN78, WW88, BL75, MMN+92, Whi92].
2.2.1 Inappropriateness of Most Methods
Most of these models [TLN78, WW88, BL75, Whi92] estimate that reliability
is 1 (perfect) when the testing reveals no failures. These models do not
incorporate information about the number of tests successfully executed (i.e.,
they do not differentiate between having executed 5 out of 5 tests successfully,
or 5,000 out of 5,000). Such models are not appropriate for our problem because
(1) we assume testing of the final version reveals no failures, and (2) we believe
that the number of successfully executed test cases should affect our estimate
of reliability.
2.2.2 Inappropriateness of Miller's Method
Miller et al. [MMN+92] present a reliability estimation method that takes
successful test results into account and produces a non-perfect estimate of
reliability when testing reveals no failures.
In Miller's method, the actual reliability is considered to be $1 - \theta$, where
$\theta$ is the actual failure rate. $\theta$ is estimated as the Bayesian point estimate
$\hat{\theta} = a/(N + a + b)$, where N is the number of tests, and a and b are parameters
embodying prior assumptions about the possible values of $\theta$; the distribution
of $\theta$ is considered to be Beta(a, b).¹ Thus, reliability is estimated as $1 - \hat{\theta}$.
This model is not directly applicable in our situation because it is limited
to memoryless programs and because it assumes that the operational profile is
an unconditional probability distribution. It is possible to modify the model to
apply to modules and to take into account conditional probabilities. However,
even with such modifications, we are not convinced that the model produces
meaningful estimates, because it uses a controversial technique of Bayesian
estimation, as explained in the following paragraph.
Problems Inherent to Bayesian Estimation: In Miller's method, one
guesses at a distribution of $\theta$ initially (by choosing values for a and b) and then
alters the guess by taking into account some data (the number of successful
test cases, N). The parameters a and b affect the degree to which the initial
guess is weighed relative to the data. Thus, $\hat{\theta}$, and therefore the estimate
of reliability, will always depend upon our initial guess, and will depend upon
the assumption that $\theta$ has a Beta(a, b) distribution.

Another difficulty with this type of method is its lack of objectivity. For
instance, if one tester guesses (a, b) = (5, 7.3) as the prior values for a given
piece of software, while another selects (a, b) = (3, 6.1), the resulting estimates,
$\hat{\theta}$, will be quite different.
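A small numerical illustration of this sensitivity, using the point estimate $\hat{\theta} = a/(N + a + b)$ quoted above and the two priors from the example; the test count N = 100 is invented for illustration.

```python
def miller_estimate(a, b, N):
    """Bayesian point estimate of the failure rate after N failure-free tests."""
    return a / (N + a + b)

N = 100
for a, b in [(5, 7.3), (3, 6.1)]:            # the two priors from the example
    theta_hat = miller_estimate(a, b, N)
    print(f"(a,b)=({a},{b}): estimated failure rate {theta_hat:.4f}, "
          f"reliability {1 - theta_hat:.4f}")
```

Even after 100 failure-free tests, the two testers obtain noticeably different reliability estimates, which is the objectivity problem described above.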
The use of an initial guess is very common in estimation methods. The
guess is required at the onset of the estimation process, when no data is yet
available. When data becomes available, the reliability estimations are no
longer calculated as a function of the guess (although the time it takes for the
process to converge to an answer may still depend on the guess.) In Miller's
method, the reliability estimate remains a function of the initial guess, even
when data becomes available. Since the reliability estimate is a function of a
guess, and is based upon an assumption for which no supporting evidence is
presented, we are not convinced that it will tend to be precise.
¹ $\theta$ is considered a random variable in the following sense: "the characteristics of
the program and the process used to develop the program determine an ensemble
of programs that could have those characteristics and development history; each of
these possible programs has an associated $\theta$. The single program that we are testing
has a fixed $\theta$, but the value of $\theta$ is unknown to us" [MMN+92].

3 Reliability Estimation Method
We expect that a reliability estimate based upon hypothesis testing would
be more appropriate than the methods in the current literature because it
produces a non-perfect estimate of reliability when testing reveals no failures,
because it considers the number of successful test cases, and because it is a
well-accepted statistical technique. A discussion of the potential of hypothesis
testing for estimating module reliability is presented in [PvSK90].
In Appendix A, we informally illustrate certain aspects of the classical
theory of statistical hypothesis testing in order to familiarize the reader with
the basics of the method, if necessary. In the following sections, we describe
the application of hypothesis testing to estimating software reliability.

3.1 Application of Hypothesis Testing to Software


Let p be the failure rate of the software, i.e., $p = \sum_{i=1}^{U} \omega(I_i)P(I_i)$, where
$I_i$ is the $i$th of the U possible inputs or module executions, $P(I_i)$ is the
probability that $I_i$ is issued according to the operational profile, and
$\omega(I_i) = 0$ if $I_i$ succeeds and 1 if $I_i$ fails. Suppose we choose a particular
value, $\theta$, for p, and we wish to test the hypothesis that the failure rate is at
most $\theta$; thus, $H_0: p \le \theta$ and $H_1: p > \theta$.

Suppose we perform N tests, randomly selected (with replacement) in accordance
with an input distribution describing operational usage, and T of the
tests fail. Since we do not wish to tolerate any failures, we reject $H_0$ if $T \ge 1$.
Then
$$\alpha = P[T \ge 1 \mid p \le \theta] = 1 - P[T = 0 \mid p \le \theta] = 1 - (1-p)^N, \quad p \le \theta.$$
Thus,
$$\alpha \le 1 - (1-\theta)^N.$$
$\beta$ is $P[T = 0 \mid p > \theta] = (1-p)^N$, $p > \theta$. Thus,
$$\beta < (1-\theta)^N. \qquad (1)$$
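As a minimal numeric reading of these bounds (Python is used only for illustration; the values of $\theta$ and N below are invented):

```python
def alpha_bound(theta, N):
    """Upper bound on the Type I error for H0: p <= theta, rejecting if any test fails."""
    return 1 - (1 - theta) ** N

def beta_bound(theta, N):
    """Bound (1): upper bound on the Type II error after N failure-free tests."""
    return (1 - theta) ** N

theta, N = 0.01, 100
print(f"alpha <= {alpha_bound(theta, N):.3f}, beta < {beta_bound(theta, N):.3f}")
```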
3.1.1 Reliability Calculations

To test the hypothesis that the operational reliability of some module or
memoryless program, P, is at least r, we test $H_0: p \le \theta$, where p is P's failure
rate and $\theta = 1 - r$ (reliability is 1 - failure rate). We perform N random
tests, selected with replacement in accordance with a distribution describing
operational usage; if no failures occur, we accept the hypothesis that operational
reliability is at least r. The probability that we have made an error
(that operational reliability is not at least r) is less than $(1-\theta)^N$ by (1).
Consider the equation
$$(1-\theta)^N = M, \qquad (2)$$
where N and $\theta$ are as above. From (1) and (2) we know $\beta < M$. Thus, given
a batch of N successful operational tests on P, the probability that we will
erroneously accept that reliability is at least r is less than M. Or, equivalently,
the probability that P has reliability less than r and still passes a batch of N
random tests is less than M.
The Type II error statistic, $\beta$, is a valuable measure of the confidence we
have in our reliability estimation. Confidence is also measured by another
statistic, the significance level, SL, which gives an indication of the agreement
between the test data and the operational profile. Automatic calculation of
significance levels is described in [Woi93a]. SL is a percentage; the higher the
percentage, the more agreement between the test data and the operational
profile. SL allows us to incorporate information about the form of the operational
profile into our reliability estimations.
Our reliability estimation model contains four variables: $\theta$, N, M, and SL.
$\theta$ depends on N and M; N depends on $\theta$, M, and SL; M depends on $\theta$ and N;
and SL depends on N. Test personnel might set the values of some variables
and calculate the values of others, depending on how they plan to use the
estimation method. Some different ways of using the method and setting
variables are outlined in the following examples:
Example: It might be decided that reliability must be at least a certain
value, with a certain probability of error ($\theta$ and M are set). Given this
information, the number of tests necessary (N) and the significance level (SL) can
be calculated.
Example: It might be the case that N random tests were executed without
failure. From this information, only the significance level (SL) can be
calculated. One of M or $\theta$ may be set in order to calculate the other; or the
estimation method might be used to obtain a number of ($\theta$, M) pairs for the
given N.
Example: We can estimate software reliability as at least $1 - \theta$, given (A)
that the software has passed a batch of N tests, randomly selected according
to an operational profile, and (B) that we require a statistical guarantee that
the probability of error is less than some given M. Given N, we can also
calculate the significance level, SL.
Example: Perhaps the testers will continue to test the software until the two
confidence measures are considered to be "satisfactory" by another source. In
this case, the number of test cases (N) is increased until both $\beta$ and SL reach
the desired values. It will tend to be the case that an increase in N will both
decrease the error, $\beta$, and increase the significance, SL, of our estimation.
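The following sketch illustrates two of the usage modes above, using relation (2), $M = (1-\theta)^N$: computing the number of tests N required for chosen $\theta$ and M, and tabulating ($\theta$, M) pairs for a given N. The numeric values are invented, and the significance level SL is not computed here, since its calculation is described in [Woi93a] rather than in this report.

```python
import math

def tests_required(theta, M):
    """Smallest N with (1 - theta)**N <= M (theta and M set, N calculated)."""
    return math.ceil(math.log(M) / math.log(1 - theta))

def error_bound(theta, N):
    """M = (1 - theta)**N for a given N (e.g., to tabulate (theta, M) pairs)."""
    return (1 - theta) ** N

print(tests_required(theta=0.001, M=0.01))   # N needed to claim r >= 0.999 with M = .01
for theta in (0.01, 0.005, 0.001):           # (theta, M) pairs for a fixed N = 2000
    print(theta, error_bound(theta, N=2000))
```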
3.1.2 More Exact Calculations
A more accurate reliability estimation can be obtained if we incorporate in-
formation about the percentage of tests executed relative to the total number
of possible tests. However, this increase in accuracy is sometimes obtained at
the cost of extensive calculation, as outlined below.
Let U be the total number of possible inputs (if the software is a memoryless
program), or the total number of unique module executions according to the
operational profile specification (if the software is a module). Let n be the
total number of unique tests in the test set (if any tests occur more than once
in the test set, then n will be less than the total number of tests in the test
set). Let $a = U\sum_{i=1}^{U} \omega(I_i)P(I_i)$, where $I_i$ is the $i$th input or module execution,
$P(I_i)$ is the probability that $I_i$ is issued according to the operational profile, and
$\omega(I_i) = 0$ if $I_i$ succeeds and 1 if $I_i$ fails.
Our hypothesis testing model can now be modified to use the hypergeometric
distribution rather than the binomial distribution, since we can assume
testing without replacement. We are still interested in estimating the probability
of failure, p, which is equal to a/U. Our null hypothesis is still $H_0: p \le \theta$,
which is equivalent to $H_0: a \le U\theta$. Similarly, the alternate hypothesis is
$H_1: a > U\theta$. Let $T = \sum_{i=1}^{n} T_i$, where $T_i$ is 1 if test i fails and 0 if it succeeds.
We accept $H_0$ if T = 0. The probability that we have made an error
in accepting $H_0$ is $\beta = P[T = 0 \mid a > U\theta]$. According to the hypergeometric
distribution,
$$\beta = \binom{U-a}{n} \Big/ \binom{U}{n}, \quad a > U\theta.$$
Thus,
$$\beta \le \binom{U-U\theta}{n} \Big/ \binom{U}{n}.$$
We will obtain more precise estimations using the hypergeometric distribution
than we would using the binomial distribution in the case where U and
n are known, since the binomial distribution only approximates the hypergeometric;
i.e., we would obtain a smaller $\beta$ for the same $\theta$.

For example, suppose U = 10 and n = 7, and assume that all 7 tests
are unique. Assume we let $\theta = .1$. Then using hypothesis testing with the
binomial distribution gives $\beta \le .47$, while using hypothesis testing with the
hypergeometric distribution gives $\beta \le .3$. Even if two of the tests were not
unique (i.e., n = 7 for the binomial and n = 6 for the hypergeometric),
hypothesis testing with the hypergeometric distribution would give $\beta \le .4$.

Thus, we can usually decrease the error at the cost of calculating U (the
total number of possible test cases) and n (the total number of unique test
cases executed).
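A minimal sketch reproducing this comparison, using the binomial bound $(1-\theta)^n$ and the hypergeometric bound $\binom{U-U\theta}{n}/\binom{U}{n}$ derived above; Python's math.comb (available from Python 3.8) supplies the binomial coefficients.

```python
from math import comb

def binomial_bound(theta, n):
    # Bound (1): beta < (1 - theta)**n.
    return (1 - theta) ** n

def hypergeometric_bound(theta, U, n):
    # Bound derived above: beta <= C(U - U*theta, n) / C(U, n),
    # with a set to the boundary value U*theta.
    a = round(U * theta)
    return comb(U - a, n) / comb(U, n)

U, theta = 10, 0.1
print(binomial_bound(theta, 7))            # about 0.478 (quoted as .47 above)
print(hypergeometric_bound(theta, U, 7))   # 0.3
print(hypergeometric_bound(theta, U, 6))   # 0.4
```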
Calculation of U and n: When the software is a memoryless program, U
can be easily calculated from the operational profile, as the number of unique
inputs. However, when the software is a module, the number of calculations
involved in calculating U can be impractically large. U can be calculated from
the operational profile specification as the total number of unique possible
module executions derivable. There are many ways to accomplish this; one is
to build a tree representation of the operational profile specification, such that
U is the number of leaves in the tree, as described in [Woi93a].

n is calculated as the total number of unique tests in the test set. When
the software is a memoryless program, n can be easily calculated by simple
textual comparison of the test cases. When the software is a module, n may be
calculated by textual comparison, or by making use of the tree representation
mentioned above, as outlined in [Woi93a].
Accuracy of Approximations: As stated above, the binomial distribution
may be used to approximate the hypergeometric distribution; thus, the
calculations of Section 3.1.1 may be used to approximate those of Section 3.1.2.
How accurate are such approximations? In general, the binomial distribution
becomes a more accurate approximation as the sample space approaches
infinity [Ric88]. In other words, the larger U is, the more accurate the binomial
approximation is. In the example above, U was small (10); the approximation
was therefore not very good, giving an inaccuracy of .17. When U = 100, the
inaccuracy decreases to .01; when U = 1000, the inaccuracy decreases to .001.
The degree of inaccuracy also depends to a certain extent on the values of $\theta$
and n. For instance, when $\theta = .01$, U = 1,000, and n = 100, the inaccuracy
is .02.
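The figures quoted in this paragraph can be checked with the same two bounds; a minimal sketch, assuming n = 7 and $\theta = .1$ as in the earlier example, followed by the $\theta = .01$, U = 1,000, n = 100 case:

```python
from math import comb

def inaccuracy(theta, U, n):
    """Binomial bound minus hypergeometric bound on beta, with a = U*theta."""
    a = round(U * theta)
    return (1 - theta) ** n - comb(U - a, n) / comb(U, n)

for U in (10, 100, 1000):
    print(U, round(inaccuracy(0.1, U, 7), 3))   # about .17, .01, .001
print(round(inaccuracy(0.01, 1000, 100), 3))    # about .02
```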
Usefulness: The above calculations for U can require much computation
in the case of modules, because of the nature of the conditional probability
distributions in the operational profiles [Woi93b]. We expect that only in the
case of simple module operational profile specifications will one be able to
take advantage of the increase in estimation accuracy obtained by incorporating
information about the number of tests executed relative to the total number of
tests possible. However, in the case of memoryless programs, we expect that
it will be feasible to use the more precise calculations, since these operational
profiles are unconditional probability distributions.

The increase in accuracy that can be obtained by using the more precise
calculations will depend largely on the size of U, and to a lesser extent on
the values of $\theta$ and n. When U is very large, we expect that the increase in
accuracy will usually be insignificant compared to the cost of calculating U
and n. However, when U is small, or when accuracy is of extreme importance,
testers may wish to use the more exact calculations outlined in Section 3.1.2.

4 Effectiveness of the Method


We would expect to be able to demonstrate the effectiveness of our reliability
estimation technique by experimentally showing that our method is superior
to similar reliability estimation methods in the current literature. However,
this is difficult because the most similar technique in the current literature,
Miller's method (MM), is still very different from our method. It is extremely
difficult to determine which one is "better" in any absolute or mathematical
sense, because the methods provide such disparate figures. In Section 4.1, we
outline the differences in the measurements produced by these two methods.
We then determine a criterion by which the two methods may be compared.
In Section 4.2, we show that any experiments that would use the criterion
to determine which method is "better" are largely meaningless because of the
nature of the figures produced by the methods. Although we cannot determine
mathematically or experimentally which of the methods is "better," we can
reason that one might prefer to employ our method over that of MM; such
reasoning is outlined in Section 4.3.

4.1 Comparison Criterion


Assume we plan to base our estimations on N successful test cases. Given N,
MM will produce $E_M$, its estimation of reliability, together with a measure of
our confidence in $E_M$ (the variance about the estimated failure rate, $1 - E_M$).
Given N, our method will produce a series of 3-tuples, $(E_H, M, SL)$, where $E_H$
is a guess at an upper bound for reliability, M is a measure of our confidence
in $E_H$, and SL is a measure of the accuracy of our test data.

Although $E_M$ and $E_H$ are both reliabilities (both in [0, 1]), it is difficult
to compare the results of our method with those of MM because our method
produces a range of possible values for reliability, from 0 to $E_H$, while MM
produces a single point in [0, 1]. To compare our method and MM, we must
determine a comparison criterion to compare a range to a point; however, the
determination of such a criterion is not obvious, as outlined in the following.
(We will divide our discussion of the comparison criterion into four cases: (1)
$E_M$ and $E_H$ are both $\le$ the true reliability, R; (2) $E_M$ and $E_H$ are both $\ge$ R;
(3) $E_H < R$ and $E_M > R$; (4) $E_H > R$ and $E_M < R$.)
(1) When $E_M$ and $E_H$ are both less than or equal to R, we can decide that
our method is superior if $E_M < E_H$. However, when $E_H < E_M$, it is not clear
that we should say that MM is superior to our method, because our method
has indeed produced a correct answer, i.e., $R \in [E_H, 1]$.

(2) When $E_H$ and $E_M$ are both greater than R, we can decide that MM is
superior to our method if $E_M < E_H$. However, when $E_H < E_M$, it is not clear
that we should say our method is superior to MM, because it is not true that
$R \in [E_H, 1]$.

(3) Similarly, when $E_H < R$ and $E_M > R$, we can say that our method is
superior if $|R - E_H| < |R - E_M|$, and inferior if $|R - E_H| > |R - E_M|$.

(4) However, when $E_H > R$ and $E_M < R$, it is not clear that we should
base our comparison simply on absolute differences, because it is not true that
$R \in [E_H, 1]$.
From the above, we conclude that it is difficult to determine a good
comparison criterion, because of the difficulties of comparing a point to a range.
We could instead consider only the two points $E_H$ and $E_M$, i.e., our method is
superior if $|R - E_H| < |R - E_M|$ and MM is superior if $|R - E_H| > |R - E_M|$.
However, in light of the above discussion, we must keep in mind that this is a
crude comparison.

4.2 Comparison Experiments


Using the comparison criterion outlined in the previous paragraph, we can
design experiments to determine if our method is superior to MM. Assume
we execute N successful tests on some software, P. Suppose we know the true
reliability, R, of P (R must be calculated from real-life usage of P). We must
determine if $E_H$ or $E_M$ is closer to R. It is clear that given any $E_M$, we can
always calculate a 3-tuple $(E_H, M, SL)$ such that $|R - E_H| \le |R - E_M|$,
i.e., we can always show our method to be superior to (or the same as) MM.
Conversely, given any $E_M$, we can always show that MM is superior to (or
the same as) our method, by selecting the appropriate 3-tuple. Thus, we
consider it to be largely meaningless to perform such experiments comparing
$E_H$ and $E_M$, since the desired results can always be obtained by numerical
manipulations.
For a particular value of M, we can determine whether MM or our method is
superior, but again, such experiments are largely meaningless because they
are not a good representation of real life (in real life, M will be determined
by test personnel, and we cannot predict what it will tend to be; also, test
personnel might not assume a fixed M).
Another possible type of experiment is to determine, for some sample
software, which values of M will cause our method to be superior. However, the
meaningfulness of such experiments is also questionable, because the superiority
or inferiority of our method will depend on the true reliability of the software,
which, in real life, will be unknown at the time of deciding whether to use our
method or MM.
Thus, we do not believe it is possible to experimentally demonstrate that
our reliability estimation method is superior (or inferior) to similar methods
in the current literature. Because our method's effectiveness cannot be
demonstrated experimentally by comparing it to similar methods, we will
instead reason that testers might prefer to use our method.

4.3 Method Preference
When module testing reveals no failures, using our method (or MM) is preferable
to the methods outlined in [TLN78, WW88, BL75, Whi92], because the
latter will all produce a reliability estimate of 1 (perfect) and will give no
indication of the potential error of this estimation. When a non-perfect reliability
estimate and a potential error are desired, our method is preferable to MM
because ours involves well-accepted, non-controversial statistical methods such
as hypothesis testing and goodness-of-fit tests², while MM involves controversial
Bayesian techniques. Our method's estimations are based on observed
results, while MM's are based on observed results and an initial guess, which
will always affect the final estimations. Further, if the initial guess is too far
off, an infeasibly large number of test cases could be required to achieve a
reliability estimate close to the true reliability, as outlined in [Woi93a]. For
the reasons above, we believe that our method may be preferable to similar
methods in the current literature.

5 Conclusions
In this paper, we have described a method for estimating the reliability
(operational or non-operational) of software modules or memoryless programs
that, in their current versions, have not yet failed any tests. Our method is
based on statistical hypothesis testing and thus produces a series of 3-tuples,
$(R, C_1, C_2)$, where R is a reliability estimate and $C_1$ and $C_2$ are measures of
our confidence in R. We described a general estimation method, which we
expect will be useful for most software; we also described a more precise method,
which may be useful in cases of simple operational profiles, when more accurate
estimations are required. We showed that our estimation technique overcomes
the problems and limitations of similar techniques, and explained why we
believe that estimations obtained with our estimation method will tend to be
more precise than estimations obtained with similar methods. Because of our
method's lack of unrealistic assumptions and its ease of use, we believe that
testers might prefer to use our method of reliability estimation.

² Goodness-of-fit tests are used in the calculation of significance levels, as discussed
in [Woi93b].
References
[BL75] J.R. Brown and M. Lipow. Testing for software reliability. In Proc.
Intl. Conf. Reliable Software (Los Angeles, CA), pages 518-527,
April 1975.
[FS88] W. Farr and O.J. Smith. A tool for statistical modeling and
estimation of reliability functions for software: SMERFS. Journal of
Systems and Software, 8(1):47-55, January 1988.
[KM89] P.A. Keiller and D.R. Miller. On the use and the performance
of software reliability growth models. Reliability Engineering and
System Safety Special Issue, pages 1-21, 1989.
[KM91] P.A. Keiller and D.R. Miller. On the use and the performance
of software reliability growth models. Reliability Engineering and
System Safety Special Issue, pages 95-117, 1991.
[LBL92] M. Lu, S. Brocklehurst, and B. Littlewood. Combination of
predictions obtained from different software reliability growth models.
In Proceedings 10th Software Reliability Symposium, June 25-26,
1992.
[Lit90] B. Littlewood. Limits to evaluation of software dependability.
Draft report-PDCS No. D8 Project 3092-PDCS, City University,
Northampton Square, London, July 1990.
[Lyu90] M.R. Lyu. Software reliability engineering and measurement at the
Jet Propulsion Laboratory. Workshop on Software Reliability,
Carleton University, Ottawa, Ontario, Canada, May 24-26, 1990.
[MIO90] J.D. Musa, A. Iannino, and K. Okumoto. Software Reliability:
Measurement, Prediction, Application. McGraw-Hill, New York,
1990.
[MMN+92] K.W. Miller, L.J. Morell, L.E. Noonan, S.K. Park, D.M. Nicol,
B.W. Murrill, and J.M. Voas. Estimating the probability of failure
when testing reveals no failures. IEEE Trans. Software Engineering,
18(1):33-42, January 1992.

[PvSK90] D.L. Parnas, J. van Schouwen, and S.P. Kwan. Evaluation stan-
dards for safety critical software. Communications of the ACM,
33(6):836-48, June 1990.
Previous versions:
Technical Report 88-220, Dept. of Comp. & Info. Sci., Queen's
University, May 1988.
Proc. Intl. Working Group on Nuclear Power Plant Control and In-
strumentation, IAEA NPPCS Specialists' Meeting, International
Atomic Energy Agency, London, United Kingdom, 10-12 May
1988.
Software Development: Tips and Techniques, U.S. Professional
Development Institute, Silver Spring, Maryland, 1989, pp. 311-
350.
Proc. Seventh Intl. Conference on Testing Computer Software, San
Francisco, June 18-21, 1990, pp. 89-117.
[Ric88] John A. Rice. Mathematical Statistics and Data Analysis.
Wadsworth and Brooks/Cole, New York, 1988.
[TLN78] T.A. Thayer, M. Lipow, and E.C. Nelson. Software Reliability.
North-Holland, New York, 1978.
[Whi92] James Whittaker. Markov chain techniques for software testing
and reliability analysis. PhD dissertation, University of Tennessee,
May 1992.
[Woi93a] D.M. Woit. Operational profile specification, automatic test case
generation and reliability estimation for modules. PhD dissertation,
Queen's University, Kingston, Ontario, 1993.
[Woi93b] D.M. Woit. Specifying operational pro les for modules. In Proceed-
ings ISSTA (International Symposium on Software Testing and
Analysis). ACM, June 28-30, 1993.
[WW88] S.N. Weiss and E.J. Weyuker. An extended domain-based
model of software reliability. IEEE Trans. Software Engineering,
14(10):1512-1524, October 1988.

A Theory of Hypothesis Testing
A hypothesis, H , is a statement about the probability distribution of a random
variable. Hypothesis testing is a process by which we may accept or reject H
based upon a sampling of the random variable referred to in H . Generally
speaking, the more consistency between our hypothesis and our sample results,
the more likely we are to accept H ; the less consistency, the more likely we
are to reject H . However, we could be wrong: we might reject H when it is
true in reality, or might accept H when it is false in reality. Such errors are
known as Type I and Type II errors, respectively; the probability of a Type I
error is denoted $\alpha$, and the probability of a Type II error is denoted $\beta$.
We refer to H (the hypothesis we wish to test) as the null hypothesis,
and denote it H0. In rejecting H0 , we are implicitly accepting some alternate
hypothesis. To be precise, we should also identify the alternate hypothesis,
denoted $H_1$ (necessary for the calculation of $\beta$). A simple hypothesis is one in
which an exact value of the unknown parameter of the assumed probability
distribution is given; in a composite hypothesis, ranges of values are specified
(for example, $\theta = .01$ is a simple hypothesis, and $\theta \le 0.1$ is a composite
hypothesis).
Example: Suppose for a particular coin we wish to test the hypothesis that
the probability of tossing "heads", p, is at least 1/2, versus the alternate
hypothesis that p is less than 1/2 (i.e., $H_0: p \ge 1/2$ and $H_1: p < 1/2$).
Suppose we base our test upon 10 tosses of the coin, and suppose we decide
to reject $H_0$ if we do not get at least 2 heads in the 10 tosses. More formally,
let F be the number of heads in 10 tosses. Then $F = \sum_{i=1}^{10} F_i$, where
$$F_i = \begin{cases} 1 & \text{if toss } i \text{ results in heads} \\ 0 & \text{if toss } i \text{ results in tails} \end{cases}$$
for $i = 1, 2, \ldots, 10$; thus, $F_1, F_2, \ldots, F_{10}$ is a random sample of 10 observations
of a Bernoulli random variable with parameter p (i.e., F has the
binomial(10, p) distribution, with $p \ge .5$ according to $H_0$). We will reject $H_0$ if
$F \le 1$.
We test $H_0$ by obtaining values for the $F_i$ (tossing the coin ourselves, or
using previously obtained data) and computing F. We reject $H_0$ if $F \le 1$, and
accept it otherwise. Next, we estimate the potential error in our decision.

The probability of a Type I error, $\alpha$, is
$$\alpha = P[F \le 1 \mid p \ge .5] = \sum_{j=0}^{1} \binom{10}{j} p^j (1-p)^{10-j}, \quad p \ge .5.$$
Because $H_0$ is composite, we cannot calculate a specific $\alpha$. However, we can
calculate $\alpha$ for specific values of p; i.e., for $p = .5$,
$$\alpha = \sum_{j=0}^{1} \binom{10}{j} (.5)^j (.5)^{10-j} = (.5)^{10} + 10(.5)(.5)^9 \approx .01.$$
Note that
$$P[F \le 1 \mid p = .5] \ge \alpha. \qquad (3)$$
The probability of a Type II error, $\beta$, is
$$\beta = P[2 \le F \le 10 \mid p < .5].$$
Because $H_1$ is composite, we cannot calculate a specific $\beta$. However, we can
calculate $\beta$ for different values of p; i.e., with $p = .25$,
$$\beta = \sum_{j=2}^{10} \binom{10}{j} (.25)^j (.75)^{10-j} \approx .76.$$
Another way of calculating $\beta$ is to use the result $P[F \le N \mid p] = 1$ for any
binomial distribution (N, p), which gives $\beta = 1 - P[F \le 1 \mid p < .5]$. This is
known as the operating characteristic (OC). It describes how the probability
of the Type II error varies with p. Notice that $P[F \le 1 \mid p = .5] < P[F \le 1 \mid p < .5]$. Thus,
$$\beta < 1 - P[F \le 1 \mid p = .5]. \qquad (4)$$
By (3) and (4), the probability of a Type I error in the hypothesis test of
$H_0$ is no larger than $P[F \le 1 \mid p = .5]$, and the probability of a Type II error
is less than $1 - P[F \le 1 \mid p = .5]$.
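A minimal check of the numbers in this example, computing $\alpha$ at p = .5 and $\beta$ at p = .25 under the rule "reject $H_0$ if F ≤ 1 heads in 10 tosses" (Python is used only for illustration):

```python
def prob_at_most_one_head(p, n=10):
    """P[F <= 1] for F ~ binomial(n, p), in closed form."""
    return (1 - p) ** n + n * p * (1 - p) ** (n - 1)

alpha = prob_at_most_one_head(0.5)        # about .01, as in the example
beta = 1 - prob_at_most_one_head(0.25)    # about .76, as in the example
print(round(alpha, 3), round(beta, 3))
```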

