Hypothesis Testing
Denise M. Woit
Acknowledgements
I would like to thank my supervisor, David Parnas of McMaster University,
for his advice, support, and guidance. Thanks also go to Roman Viveros-Aguilera
of McMaster University for his helpful discussions regarding this work.
Contents
1 Introduction
  1.1 Applications of our Method
    1.1.1 Module Terminology
    1.1.2 Memoryless Program Terminology
    1.1.3 Modes of Use
2 Relevant Research
  2.1 Reliability Growth Models
  2.2 Reliability Models
    2.2.1 Inappropriateness of Most Methods
    2.2.2 Inappropriateness of Miller's Method
3 Reliability Estimation Method
  3.1 Application of Hypothesis Testing to Software
    3.1.1 Reliability Calculations
    3.1.2 More Exact Calculations
4 Effectiveness of the Method
  4.1 Comparison Criterion
  4.2 Comparison Experiments
  4.3 Method Preference
5 Conclusions
A Theory of Hypothesis Testing
1 Introduction
In this paper, we present a method for estimating software reliability based
on statistical hypothesis testing. Our estimations are based on the current
version of the software only, and our method is applicable when the current
version has not yet failed any random tests. Our method is particularly useful
for estimating the reliability of the final version of the software, since that
version fails no tests (assuming failure precipitates software revision).
Other reliability estimation methods have been described in the literature;
however, as detailed in Section 2, these models may be impractical or may
produce imprecise estimations. Our method overcomes the limitations of these
other methods because it avoids their unrealistic assumptions. Thus, our method is
especially useful when one desires reliability estimates that are practical to
calculate, and that are more precise than estimations obtainable with current
models.
In Section 1.1, potential applications of our method are described. In
Section 2, we outline reliability estimation methods of the current literature
and detail the problems and limitations associated with them. In Section 3
we outline the theory of classical statistical hypothesis testing and describe
how it can be used in reliability estimation of software. We then describe our
reliability estimation method. Section 4 explains why we believe our method
can be effective. Conclusions are presented in Section 5.
issued after this module initialization (or re-initialization), and E_t is the last
event issued before the next module re-initialization (1 ≤ j ≤ t_i).
2 Relevant Research
In the current literature, two general methods of estimating software reliability
are described: reliability growth models and reliability models. Judging from
this literature, the former method is the most commonly used.
modification of the software. For some software, the time necessary for
it to achieve the desired level of reliability can be impractically long (or
infinite) [KM91, Lit90].
Littlewood [Lit90] presents some real-life data (inter-fail times) depicting
this problem. The data shows that reliability grows quickly at first, then
progressively more slowly. This is because the failures with the highest
failure rates tend to be discovered and removed first; the remaining
failures tend to have lower failure rates and therefore take longer to find.
2. Good statistical practice requires that the reliability estimate of version
i be based only on trials involving version i. Growth models assume
that the reliability of version i can be obtained by using data from the
previous versions, 1, 2, ..., i − 1, i.e., that trials from previous versions
are relevant to the reliability of version i. Growth models also tend to
assume that the reliability of a slightly changed program is only slightly
different from that of the unchanged program. With software, there is no reason
to believe that these assumptions hold, because of the intricate manner
in which modifications affect the set of failure-causing inputs.
Reliability growth estimations are basically a curve-fitting and extrapolation problem:

[Figure: successive reliability estimates R1, R2, R3, R4, R5 plotted against version number, with the fitted curve extrapolated to predict Ri.]
It is argued that with such models, only a rough estimate of Ri is possi-
ble. Some believe that these models are best used as a managerial tool
to predict schedules, etc., but that they should not be used if a fairly
accurate estimate of reliability is required [Lit90].
3. Growth models make predictions based upon the number of failures observed
during testing. If no failures are observed, the predictions are
not very meaningful because they are based on values of the model
parameters, which must be guessed. (Once failure data is available, the
parameters are estimated using techniques such as Maximum Likelihood
Estimates or Least Squares.)
4. Different reliability growth models can produce vastly different results
for the same data. In fact, a recent development in the area of growth
models is the use of super-models to help users select which reliability
growth model is most suitable for their particular application [KM89,
FS88, Lyu90, LBL92]. We believe this is evidence that reliability growth
model estimates are too imprecise in general, for if they were all very
precise, they would all produce similar estimations. We are unable to
determine if one model is "correct" and the others are not, because
advocates of each method can produce examples for which their model
is more accurate.
θ̂ = a/(N + a + b), where N is the number of tests, a and b are parameters embodying
prior assumptions about the possible values of θ, and the distribution
of θ is considered to be Beta(a, b).¹ Thus, reliability is estimated as 1 − θ̂.
This model is not directly applicable in our situation because it is limited
to memoryless programs and because it assumes that the operational profile is
an unconditional probability distribution. It is possible to modify the model to
apply to modules and to take into account conditional probabilities. However,
even with such modifications, we are not convinced that the model produces
meaningful estimates, because it uses a controversial technique of Bayesian
estimation, as explained in the following paragraph.
Problems Inherent to Bayesian Estimation: In Miller's method, one
guesses at a distribution of θ initially (by choosing values for a and b) and then
alters the guess by taking into account some data (the number of successful
test cases, N.) The parameters a and b affect the degree to which the initial
guess is weighed relative to the data. Thus θ̂, and therefore the estimate
of reliability, will always depend upon our initial guess, and will depend upon
the assumption that θ has a Beta(a, b) distribution.
Another difficulty with this type of method is its lack of objectivity. For
instance, if one tester guesses (a, b) = (5, 7.3) as the prior values for a given
piece of software, while another selects (a, b) = (3, 6.1), the resulting estimates,
θ̂, will be quite different.
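To make this dependence concrete, the following sketch evaluates θ̂ = a/(N + a + b) for the two prior guesses above; the test count N = 100 is an assumed value for illustration, not a figure from the text.

```python
# Sketch of a Miller-style Bayesian point estimate: with a Beta(a, b)
# prior on the failure probability theta and N successful tests,
# theta_hat = a / (N + a + b). N = 100 is an assumption; (5, 7.3) and
# (3, 6.1) are the two testers' prior guesses from the text.

def theta_hat(N: int, a: float, b: float) -> float:
    """Estimated failure probability after N failure-free tests."""
    return a / (N + a + b)

N = 100
est1 = theta_hat(N, 5.0, 7.3)   # first tester's prior
est2 = theta_hat(N, 3.0, 6.1)   # second tester's prior
print(round(est1, 4), round(est2, 4))  # 0.0445 0.0275
```

Both testers saw the same 100 failure-free tests, yet their failure estimates (and hence reliability estimates 1 − θ̂) differ, illustrating the lack of objectivity noted above.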
The use of an initial guess is very common in estimation methods. The
guess is required at the onset of the estimation process, when no data is yet
available. When data becomes available, the reliability estimations are no
longer calculated as a function of the guess (although the time it takes for the
process to converge to an answer may still depend on the guess.) In Miller's
method, the reliability estimate remains a function of the initial guess, even
when data becomes available. Since the reliability estimate is a function of a
guess, and is based upon an assumption for which no supporting evidence is
presented, we are not convinced that it will tend to be precise.
¹ θ is considered a random variable in the following sense: "the characteristics of
the program and the process used to develop the program determine an ensemble
of programs that could have those characteristics and development history; each of
these possible programs has an associated θ. The single program that we are testing
has a fixed θ, but the value of θ is unknown to us" [MMN+92].
3 Reliability Estimation Method
We expect that a reliability estimate based upon hypothesis testing would
be more appropriate than the methods in the current literature because it
produces a non-perfect estimate of reliability when testing reveals no failures,
because it considers the number of successful test cases, and because it is a
well-accepted statistical technique. A discussion of the potential of hypothesis
testing for estimating module reliability is presented in [PvSK90].
In Appendix A, we informally illustrate certain aspects of the classical
theory of statistical hypothesis testing in order to familiarize the reader with
the basics of the method, if necessary. In the following sections, we describe
the application of hypothesis testing to estimating software reliability.
operational usage; if no failures occur, we accept the hypothesis that operational
reliability is at least r. The probability that we have made an error
(that operational reliability is not at least r) is less than (1 − ε)^N by (1).
Consider the equation

(1 − ε)^N = M,   (2)

where N and ε are as above. From (1) and (2) we know β < M. Thus, given
a batch of N successful operational tests on P, the probability that we will
erroneously accept that reliability is at least r is less than M. Or, equivalently,
the probability that P has reliability less than r and still passes a batch of N
random tests is less than M.
The Type II error statistic, β, is a valuable measure of the confidence we
have in our reliability estimation. Confidence is also measured by another
statistic, the significance level, SL, which gives an indication of the agreement
between test data and the operational profile. Automatic calculation of
significance levels is described in [Woi93a]. SL is a percentage; the higher the
percentage, the more agreement between the test data and the operational
profile. SL allows us to incorporate information about the form of the operational
profile into our reliability estimations.
Our reliability estimation model contains four variables: ε, N, M, and SL.
ε depends on N and M; N depends on ε, M, and SL; M depends on ε and N;
and SL depends on N. Test personnel might set the values of some variables
and calculate the values of others, depending on how they plan to use the
estimation method. Some different ways of using the method and setting
variables are outlined in the following examples:
Example: It might be decided that reliability must be at least a certain
value, with a certain probability of error (ε and M are set). Given this
information, the number of tests necessary (N) and the significance level (SL) can
be calculated.
Example: It might be the case that N random tests were executed without
failure. From this information, only the significance level (SL) can be
calculated. One of M or ε may be set to calculate the other; or the estimation
method might be used to obtain a number of (ε, M) pairs, for the given N.
Example: We can estimate software reliability as at least 1 − ε, given (A)
that the software has passed a batch of N tests, randomly selected according
to an operational prole, and (B) that we require a statistical guarantee that
the probability of error is less than some given M . Given N , we can also
calculate the signicance level, SL.
Example: Perhaps the testers will continue to test the software until the two
confidence measures are considered to be "satisfactory" by another source. In
this case, the number of test cases (N) is increased until both β and SL reach
the desired values. It will tend to be the case that an increase in N will both
decrease the error, β, and increase the significance, SL, of our estimation.
3.1.2 More Exact Calculations
A more accurate reliability estimation can be obtained if we incorporate in-
formation about the percentage of tests executed relative to the total number
of possible tests. However, this increase in accuracy is sometimes obtained at
the cost of extensive calculation, as outlined below.
Let U be the total number of possible inputs (if the software is a memoryless
program), or the total number of unique module executions according to the
operational profile specification (if the software is a module). Let n be the
total number of unique tests in the test set (if any tests occur more than once
in the test set, then n will be less than the total number of tests in the test
set). Let a = U · Σ_{i=1}^{U} ω(I_i)P(I_i), where I_i is the ith input or module execution,
P(I_i) is the probability that I_i is issued according to the operational profile, and
ω(I_i) = 0 if I_i succeeds and 1 if I_i fails.
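As a toy illustration of these definitions (the four-input profile below is invented), note that p = a/U reduces to Σ ω(I_i)P(I_i), the operational probability of issuing a failure-causing input; under a uniform profile, a is simply the number of failure-causing inputs.

```python
# Invented four-input memoryless program: input I3 fails, the others
# succeed. a = U * sum(omega_i * P_i), so p = a / U is the
# operational-profile probability of hitting a failure-causing input.

U = 4
P = [0.4, 0.3, 0.2, 0.1]   # operational profile (sums to 1)
omega = [0, 0, 1, 0]       # omega_i = 1 iff input i fails

a = U * sum(w * pi for w, pi in zip(omega, P))
p_fail = a / U             # = sum of P_i over failing inputs
print(a, p_fail)  # 0.8 0.2
```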
Our hypothesis testing model can now be modified to use the hypergeometric
distribution rather than the binomial distribution, since we can assume
testing without replacement. We are still interested in estimating the probability
of failure, p, which is equal to a/U. Our null hypothesis is still H0: p ≤ ε,
which is equivalent to H0: a ≤ εU. Similarly, the alternate hypothesis is
H1: a > εU. Let T = Σ_{i=1}^{n} T_i, where T_i is 1 if test i fails and 0 if it
succeeds. We accept H0 if T = 0. The probability that we have made an error
in accepting H0 is β = P[T = 0 | a > εU]. According to the hypergeometric
distribution,

β = C(U − a, n) / C(U, n),   a > εU,

where C(x, y) denotes the binomial coefficient (the number of ways to choose y items from x).
Since this expression decreases as a increases,

β < C(U − εU, n) / C(U, n).
We will obtain more precise estimations using the hypergeometric distribution
than we would using the binomial distribution in the case where U and
n are known, since the binomial distribution only approximates the hypergeometric;
i.e., we would obtain a smaller β for the same ε.
For example, suppose U = 10 and n = 7, and assume that all 7 tests
are unique. Assume we let ε = .1. Then using hypothesis testing with the
binomial distribution gives β ≈ .47, while using hypothesis testing with the
hypergeometric distribution gives β ≤ .3. Even if two of the tests were not
unique (i.e., n = 7 for the binomial and n = 6 for the hypergeometric),
hypothesis testing with the hypergeometric distribution would give β ≤ .4.
Thus, we can usually decrease the error β at the cost of calculating U, the
total number of possible test cases, and n, the total number of unique test
cases executed.
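These figures can be reproduced with a short calculation (a sketch; the hypergeometric bound is evaluated at the boundary a = εU, assumed integral here):

```python
from math import comb

def beta_binomial(eps: float, n: int) -> float:
    """Binomial bound on beta: (1 - eps)**n."""
    return (1.0 - eps) ** n

def beta_hypergeometric(U: int, eps: float, n: int) -> float:
    """Hypergeometric bound C(U - eps*U, n) / C(U, n)."""
    k = round(eps * U)  # failure-causing inputs at the boundary a = eps*U
    return comb(U - k, n) / comb(U, n)

print(round(beta_binomial(0.1, 7), 3))   # 0.478
print(beta_hypergeometric(10, 0.1, 7))   # 0.3
print(beta_hypergeometric(10, 0.1, 6))   # 0.4
```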
Calculation of U and n: When the software is a memoryless program, U
can be easily calculated from the operational profile, as the number of unique
inputs. However, when the software is a module, the number of calculations
involved in calculating U can be impractically large. U can be calculated from
the operational profile specification as the total number of unique possible
module executions derivable. There are many ways to accomplish this; one is
to build a tree representation of the operational profile specification, such that
U is the number of leaves in the tree, as described in [Woi93a].
n is calculated as the total number of unique tests in the test set. When
the software is a memoryless program, n can be easily calculated by simple
textual comparison of the test cases. When the software is a module, n may be
calculated by textual comparison, or by making use of the tree representation
mentioned above, as outlined in [Woi93a].
Accuracy of Approximations: As stated above, the binomial distribution
may be used to approximate the hypergeometric distribution; thus, the
calculations of Section 3.1.1 may be used to approximate those of Section 3.1.2.
How accurate are such approximations? In general, the binomial distribution
becomes a more accurate approximation as the sample space approaches
infinity [Ric88]. In other words, the larger U is, the more accurate the binomial
approximation is. In the example above, U was small (10); the approximation
was therefore not very good, giving an inaccuracy of .17. When U = 100, the
inaccuracy decreases to .01; when U = 1000, the inaccuracy decreases to .001.
The degree of inaccuracy depends to a certain extent on the values of ε and n
as well. For instance, when ε = .01, U = 1,000, and n = 100, the inaccuracy
is .02.
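These inaccuracy figures can be re-derived numerically (our own check; the hypergeometric bound is evaluated at a = εU as in Section 3.1.2):

```python
from math import comb

def inaccuracy(U: int, eps: float, n: int) -> float:
    """Binomial bound minus hypergeometric bound at a = eps*U."""
    binom = (1.0 - eps) ** n
    hyper = comb(U - round(eps * U), n) / comb(U, n)
    return binom - hyper

# eps = 0.1, n = 7, as in the example above; the gap shrinks with U.
for U in (10, 100, 1000):
    print(U, round(inaccuracy(U, 0.1, 7), 3))
print(round(inaccuracy(1000, 0.01, 100), 2))  # 0.02
```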
Usefulness: The above calculations for U can require much computation
in the case of modules, because of the nature of the conditional probability
distributions in the operational profiles [Woi93b]. We expect that only in the
case of simple module operational profile specifications will one be able to
take advantage of increasing estimation accuracy by incorporating
information about the number of tests executed relative to the total number of
tests possible. However, in the case of memoryless programs, we expect that
it will be feasible to use the more precise calculations, since these operational
profiles are unconditional probability distributions.
The increase in accuracy that can be obtained by using the more precise
calculations will depend largely on the size of U, and to a lesser extent on
the values of ε and n. When U is very large, we expect that the increase in
accuracy will usually be insignificant compared to the cost of calculating U
and n. However, when U is small, or when accuracy is of extreme importance,
testers may wish to use the more exact calculations outlined in Section 3.1.2.
superior if |R − E_H| < |R − E_M|, and MM is superior if |R − E_H| > |R − E_M|.
However, in light of the above discussion, we must keep in mind that this is a
crude comparison.
4.3 Method Preference
When module testing reveals no failures, using our method (or MM) is preferable
to the methods outlined in [TLN78, WW88, BL75, Whi92], because the
latter will all produce a reliability estimate of 1 (perfect) and will give no
indication of the potential error of this estimation. When a non-perfect reliability
estimate and a potential error are desired, our method is preferable to MM
because ours involves well-accepted, non-controversial statistical methods such
as hypothesis testing and goodness-of-fit tests², while MM involves controversial
Bayesian techniques. Our method's estimations are based on observed
results, while MM's are based on observed results and an initial guess, which
will always affect the final estimations. Further, if the initial guess is too far
off, an infeasibly large number of test cases could be required to achieve a
reliability estimate close to the true reliability, as outlined in [Woi93a]. For
the reasons above, we believe that our method may be preferable to similar
methods in the current literature.
5 Conclusions
In this paper, we have described a method for estimating reliability (operational
or non-operational) of software modules or memoryless programs
that, in their current versions, have not yet failed any tests. Our method is
based on statistical hypothesis testing, and thus produces a series of 3-tuples,
(R, C1, C2), where R is a reliability estimate and C1 and C2 are measures of
our confidence in R. We described a general estimation method, which we
expect will be useful for most software; we also described a more precise method,
which may be useful in cases of simple operational profiles, when more accurate
estimations are required. We showed that our estimation technique overcomes
the problems and limitations of similar techniques, and explained why we
believe that estimations obtained with our estimation method will tend to be
more precise than estimations obtained with similar methods. Because our
method avoids unrealistic assumptions and is easy to use, we believe that
testers might prefer to use our method of reliability estimation.
² Goodness-of-fit tests are used in the calculation of significance levels, as discussed
in [Woi93b].
References
[BL75] J.R. Brown and M. Lipow. Testing for software reliability. In Proc.
Intl. Conf. Reliable Software (Los Angeles, CA), pages 518-527,
April 1975.
[FS88] W. Farr and O.J. Smith. A tool for statistical modeling and
estimation of reliability functions for software: SMERFS. Journal of
Systems and Software, 8(1):47-55, January 1988.
[KM89] P.A. Keiller and D.R. Miller. On the use and the performance
of software reliability growth models. Reliability Engineering and
System Safety Special Issue, pages 1-21, 1989.
[KM91] P.A. Keiller and D.R. Miller. On the use and the performance
of software reliability growth models. Reliability Engineering and
System Safety Special Issue, pages 95-117, 1991.
[LBL92] M. Lu, S. Brocklehurst, and B. Littlewood. Combination of
predictions obtained from different software reliability growth models.
In Proceedings 10th Software Reliability Symposium, June 25-26,
1992.
[Lit90] B. Littlewood. Limits to evaluation of software dependability.
Draft report PDCS No. D8, Project 3092-PDCS, City University,
Northampton Square, London, July 1990.
[Lyu90] M.R. Lyu. Software reliability engineering and measurement at the
Jet Propulsion Laboratory. Workshop on Software Reliability,
Carleton University, Ottawa, Ontario, Canada, May 24-26, 1990.
[MIO90] J.D. Musa, A. Iannino, and K. Okumoto. Software Reliability:
Measurement, Prediction, Application. McGraw-Hill, New York,
1990.
[MMN+92] K.W. Miller, L.J. Morell, L.E. Noonan, S.K. Park, D.M. Nicol,
B.W. Murrill, and J.M. Voas. Estimating the probability of failure
when testing reveals no failures. IEEE Trans. Software Engineering,
18(1):33-42, January 1992.
[PvSK90] D.L. Parnas, J. van Schouwen, and S.P. Kwan. Evaluation
standards for safety critical software. Communications of the ACM,
33(6):836-848, June 1990.
Previous versions:
Technical Report 88-220, Dept. of Comp. & Info. Sci., Queen's
University, May 1988.
Proc. Intl. Working Group on Nuclear Power Plant Control and In-
strumentation, IAEA NPPCS Specialists' Meeting, International
Atomic Energy Agency, London, United Kingdom, 10-12 May
1988.
Software Development: Tips and Techniques, U.S. Professional
Development Institute, Silver Spring, Maryland, 1989, pp. 311-
350.
Proc. Seventh Intl. Conference on Testing Computer Software, San
Francisco, June 18-21, 1990, pp. 89-117.
[Ric88] John A. Rice. Mathematical Statistics and Data Analysis.
Wadsworth and Brooks/Cole, New York, 1988.
[TLN78] T.A. Thayer, M. Lipow, and E.C. Nelson. Software Reliability.
North-Holland, New York, 1978.
[Whi92] James Whittaker. Markov chain techniques for software testing
and reliability analysis. PhD dissertation, University of Tennessee,
May 1992.
[Woi93a] D.M. Woit. Operational profile specification, automatic test case
generation and reliability estimation for modules. PhD dissertation,
Queen's University, Kingston, Ontario, 1993.
[Woi93b] D.M. Woit. Specifying operational proles for modules. In Proceed-
ings ISSTA (International Symposium on Software Testing and
Analysis). ACM, June 28-30, 1993.
[WW88] S.N. Weiss and E.J. Weyuker. An extended domain-based
model of software reliability. IEEE Trans. Software Engineering,
14(10):1512-1524, October 1988.
A Theory of Hypothesis Testing
A hypothesis, H, is a statement about the probability distribution of a random
variable. Hypothesis testing is a process by which we may accept or reject H
based upon a sampling of the random variable referred to in H. Generally
speaking, the more consistency between our hypothesis and our sample results,
the more likely we are to accept H; the less consistency, the more likely we
are to reject H. However, we could be wrong: we might reject H when it is
true in reality, or might accept H when it is false in reality. Such errors are
known as Type I and Type II errors, respectively; the probability of a Type I
error is denoted α, and the probability of a Type II error is denoted β.
We refer to H (the hypothesis we wish to test) as the null hypothesis,
and denote it H0. In rejecting H0, we are implicitly accepting some alternate
hypothesis. To be precise, we should also identify the alternate hypothesis,
denoted H1 (necessary for calculation of β.) A simple hypothesis is one in
which an exact value of the unknown parameter of the assumed probability
distribution is given; in a composite hypothesis, ranges of values are specified
(for example, θ = .01 is a simple hypothesis, and θ ≤ 0.1 is a composite
hypothesis.)
Example: Suppose for a particular coin we wish to test the hypothesis that
the probability of tossing "heads", p, is at least 1/2, versus the alternate
hypothesis that p is less than 1/2 (i.e., H0: p ≥ 1/2 and H1: p < 1/2.)
Suppose we base our test upon 10 tosses of the coin, and suppose we decide
to reject H0 if we do not get at least 2 heads in the 10 tosses. More formally,
let F be the number of heads in 10 tosses. Then F = Σ_{i=1}^{10} F_i, where

F_i = 1 if toss i results in heads, and 0 if toss i results in tails,

for i = 1, 2, ..., 10; thus, F_1, F_2, ..., F_10 is a random sample of 10 observations
of a Bernoulli random variable with parameter p (i.e., F has the
binomial distribution (10, p), p ≥ .5, according to H0.) We will reject H0 if
F ≤ 1.
We test H0 by obtaining values for the F_i (tossing the coin ourselves, or
using previously obtained data) and computing F. We reject H0 if F ≤ 1, and
accept it otherwise. Next, we estimate the potential error in our decision.
The probability of a Type I error, α, is

α = P[F ≤ 1 | p ≥ .5] = Σ_{j=0}^{1} C(10, j) p^j (1 − p)^{10−j},  p ≥ .5
  ≤ (.5)^{10} + 10(.5)(.5)^9
  ≈ .01.

Note that
α ≤ P[F ≤ 1 | p = .5].   (3)
The probability of a Type II error, β, is

β = P[2 ≤ F ≤ 10 | p < .5].

Because H1 is composite, we cannot calculate a specific β. However, we can
calculate β for different values of p; i.e., with p = .25,

β = Σ_{j=2}^{10} C(10, j) (.25)^j (.75)^{10−j} ≈ .76.
Another way of calculating β is to use the result P[F ≤ N | p] = 1 for any
binomial distribution (N, p), which gives: β = 1 − P[F ≤ 1 | p < .5]. This is
known as the operating characteristic (OC). It describes how the probability
of the Type II error varies with p. Notice that P[F ≤ 1 | p = .5] < P[F ≤ 1 |
p < .5]. Thus,

β < 1 − P[F ≤ 1 | p = .5].   (4)
By (3) and (4), the probability of a Type I error in the hypothesis test of
H0 is no larger than P[F ≤ 1 | p = .5], and the probability of a Type II error
is less than 1 − P[F ≤ 1 | p = .5].
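The α and β values in this coin-toss example are easy to verify with a short calculation (a sketch using the binomial CDF):

```python
from math import comb

# Coin-toss example: reject H0 (p >= .5) iff F <= 1 heads in 10 tosses.

def p_at_most(k: int, n: int, p: float) -> float:
    """P[F <= k] for F ~ binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j)
               for j in range(k + 1))

alpha_bound = p_at_most(1, 10, 0.5)        # Type I error bound
beta_at_25 = 1.0 - p_at_most(1, 10, 0.25)  # Type II error at p = .25
print(round(alpha_bound, 2), round(beta_at_25, 2))  # 0.01 0.76
```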