Hypothesis Testing
Denise M. Woit
Acknowledgements
I would like to thank my supervisor, David Parnas of McMaster University,
for his advice, support, and guidance. Thanks also go to Roman Viveros-Aguilera
of McMaster University for his helpful discussions regarding this work.
Contents
1 Introduction
  1.1 Applications of our Method
    1.1.1 Module Terminology
    1.1.2 Memoryless Program Terminology
    1.1.3 Modes of Use
2 Relevant Research
  2.1 Reliability Growth Models
  2.2 Reliability Models
    2.2.1 Inappropriateness of Most Methods
    2.2.2 Inappropriateness of Miller's Method
3 Reliability Estimation Method
  3.1 Application of Hypothesis Testing to Software
    3.1.1 Reliability Calculations
    3.1.2 More Exact Calculations
4 Effectiveness of the Method
  4.1 Comparison Criterion
  4.2 Comparison Experiments
  4.3 Method Preference
5 Conclusions
A Theory of Hypothesis Testing
1 Introduction
In this paper, we present a method for estimating software reliability based
on statistical hypothesis testing. Our estimations are based on the current
version of the software only, and our method is applicable when the current
version has not yet failed any random tests. Our method is particularly useful
for estimating the reliability of the final version of the software, since that
version fails no tests (assuming failure precipitates software revision).
Other reliability estimation methods have been described in the literature;
however, as detailed in Section 2, these models may be impractical or may
produce imprecise estimations. Our method overcomes the limitations of these
other methods because it avoids their unrealistic assumptions. Thus, our method is
especially useful when one desires reliability estimates that are practical to
calculate, and that are more precise than estimations obtainable with current
models.
In Section 1.1, potential applications of our method are described. In
Section 2, we outline reliability estimation methods of the current literature
and detail the problems and limitations associated with them. In Section 3
we outline the theory of classical statistical hypothesis testing and describe
how it can be used in reliability estimation of software. We then describe our
reliability estimation method. Section 4 explains why we believe our method
can be effective. Conclusions are presented in Section 5.
issued after this module initialization (or re-initialization), and E_t is the last
event issued before the next module re-initialization (1 ≤ j ≤ t_i).
2 Relevant Research
In the current literature, two general methods of estimating software reliability
are described: reliability growth models and reliability models. Judging from
this literature, the former method is the most commonly used.
modification of the software. For some software, the time necessary for
it to achieve the desired level of reliability can be impractically long (or
infinite) [KM91, Lit90].
Littlewood [Lit90] presents some real-life data (inter-fail times) depicting
this problem. The data shows that reliability grows quickly at first, then
progressively more slowly. This is because the failures with the highest
failure rates tend to be discovered and removed first; the remaining
failures tend to have lower failure rates and therefore take longer to find.
2. Good statistical practice requires that the reliability estimate of version
i be based only on trials involving version i. Growth models assume
that the reliability of version i can be obtained by using data from the
previous versions, 1, 2, ..., i − 1, i.e., that trials from previous versions
are relevant to the reliability of version i. Growth models also tend to
assume that the reliability of a slightly changed program is only slightly
different from that of the unchanged program. With software, there is no reason
to believe that these assumptions hold, because of the intricate manner
in which modifications affect the set of failure-causing inputs.
Reliability growth estimations are basically a curve-fitting and extrapolation problem:

[Figure: successive reliability estimates R1, R2, R3, R4, R5 plotted against version number, with the fitted curve extrapolated to predict Ri.]
It is argued that with such models, only a rough estimate of Ri is possi-
ble. Some believe that these models are best used as a managerial tool
to predict schedules, etc., but that they should not be used if a fairly
accurate estimate of reliability is required [Lit90].
3. Growth models make predictions based upon the number of failures observed
during testing. If no failures are observed, the predictions are
not very meaningful because they are based on values of the model
parameters, which must be guessed. (Once failure data is available, the
parameters are estimated using techniques such as Maximum Likelihood
Estimates or Least Squares.)
4. Different reliability growth models can produce vastly different results
for the same data. In fact, a recent development in the area of growth
models is the use of super-models to help users select which reliability
growth model is most suitable for their particular application [KM89,
FS88, Lyu90, LBL92]. We believe this is evidence that reliability growth
model estimates are too imprecise in general, for if they were all very
precise, they would all produce similar estimations. We are unable to
determine if one model is "correct" and the others are not, because
advocates of each method can produce examples for which their model
is more accurate.
θ̂ = a/(N + a + b), where N is the number of tests, a and b are parameters embodying
prior assumptions about the possible values of θ, and the distribution
of θ is considered to be Beta(a, b).¹ Thus, reliability is estimated as 1 − θ̂.
This model is not directly applicable in our situation because it is limited
to memoryless programs and because it assumes that the operational profile is
an unconditional probability distribution. It is possible to modify the model to
apply to modules and to take into account conditional probabilities. However,
even with such modifications, we are not convinced that the model produces
meaningful estimates, because it uses a controversial technique of Bayesian
estimation, as explained in the following paragraph.
Problems Inherent to Bayesian Estimation: In Miller's method, one
guesses at a distribution of θ initially (by choosing values for a and b) and then
alters the guess by taking into account some data (the number of successful
test cases, N.) The parameters a and b affect the degree to which the initial
guess is weighed relative to the data. Thus θ̂, and therefore the estimate
of reliability, will always depend upon our initial guess, and will depend upon
the assumption that θ has a Beta(a, b) distribution.
Another difficulty with this type of method is its lack of objectivity. For
instance, if one tester guesses (a, b) = (5, 7.3) as the prior values for a given
piece of software, while another selects (a, b) = (3, 6.1), the resulting estimates,
θ̂, will be quite different.
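To make this dependence concrete, the following sketch evaluates θ̂ = a/(N + a + b) for the two prior guesses above; the test count N = 100 is an assumed value for illustration, not a figure from the text.

```python
# Sketch of a Miller-style Bayesian point estimate: with a Beta(a, b)
# prior on the failure probability theta and N successful tests,
# theta_hat = a / (N + a + b). N = 100 is an assumption; (5, 7.3) and
# (3, 6.1) are the two testers' prior guesses from the text.

def theta_hat(N: int, a: float, b: float) -> float:
    """Estimated failure probability after N failure-free tests."""
    return a / (N + a + b)

N = 100
est1 = theta_hat(N, 5.0, 7.3)   # first tester's prior
est2 = theta_hat(N, 3.0, 6.1)   # second tester's prior
print(round(est1, 4), round(est2, 4))  # 0.0445 0.0275
```

Both testers saw the same 100 failure-free tests, yet their failure estimates (and hence reliability estimates 1 − θ̂) differ, illustrating the lack of objectivity noted above.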
The use of an initial guess is very common in estimation methods. The
guess is required at the onset of the estimation process, when no data is yet
available. When data becomes available, the reliability estimations are no
longer calculated as a function of the guess (although the time it takes for the
process to converge to an answer may still depend on the guess.) In Miller's
method, the reliability estimate remains a function of the initial guess, even
when data becomes available. Since the reliability estimate is a function of a
guess, and is based upon an assumption for which no supporting evidence is
presented, we are not convinced that it will tend to be precise.
¹ θ is considered a random variable in the following sense: "the characteristics of
the program and the process used to develop the program determine an ensemble
of programs that could have those characteristics and development history; each of
these possible programs has an associated θ. The single program that we are testing
has a fixed θ, but the value of θ is unknown to us" [MMN+92].
3 Reliability Estimation Method
We expect that a reliability estimate based upon hypothesis testing would
be more appropriate than the methods in the current literature because it
produces a non-perfect estimate of reliability when testing reveals no failures,
because it considers the number of successful test cases, and because it is a
well-accepted statistical technique. A discussion of the potential of hypothesis
testing for estimating module reliability is presented in [PvSK90].
In Appendix A, we informally illustrate certain aspects of the classical
theory of statistical hypothesis testing in order to familiarize the reader with
the basics of the method, if necessary. In the following sections, we describe
the application of hypothesis testing to estimating software reliability.
operational usage; if no failures occur, we accept the hypothesis that operational
reliability is at least r. The probability that we have made an error
(that operational reliability is not at least r) is less than (1 − ε)^N by (1).
Consider the equation

(1 − ε)^N = M,   (2)

where N and ε are as above. From (1) and (2) we know β < M. Thus, given
a batch of N successful operational tests on P, the probability that we will
erroneously accept that reliability is at least r is less than M. Or, equivalently,
the probability that P has reliability less than r and still passes a batch of N
random tests is less than M.
The Type II error statistic, β, is a valuable measure of the confidence we
have in our reliability estimation. Confidence is also measured by another
statistic, the significance level, SL, which gives an indication of the agreement
between test data and the operational profile. Automatic calculation of
significance levels is described in [Woi93a]. SL is a percentage; the higher the
percentage, the more agreement between the test data and the operational
profile. SL allows us to incorporate information about the form of the operational
profile into our reliability estimations.
Our reliability estimation model contains four variables: ε, N, M, and SL.
ε depends on N and M; N depends on ε, M, and SL; M depends on ε and N;
and SL depends on N. Test personnel might set the values of some variables
and calculate the values of others, depending on how they plan to use the
estimation method. Some different ways of using the method and setting
variables are outlined in the following examples:
Example: It might be decided that reliability must be at least a certain
value, with a certain probability of error (ε and M are set). Given this
information, the number of tests necessary (N) and the significance level (SL) can
be calculated.
Example: It might be the case that N random tests were executed without
failure. From this information, only the significance level (SL) can be
calculated. One of M or ε may be set to calculate the other; or the estimation
method might be used to obtain a number of (ε, M) pairs, for the given N.
Example: We can estimate software reliability as at least 1 − ε, given (A)
that the software has passed a batch of N tests, randomly selected according
to an operational prole, and (B) that we require a statistical guarantee that
the probability of error is less than some given M . Given N , we can also
calculate the signicance level, SL.
Example: Perhaps the testers will continue to test the software until the two
confidence measures are considered to be "satisfactory" by another source. In
this case, the number of test cases (N) is increased until both β and SL reach
the desired values. It will tend to be the case that an increase in N will both
decrease the error, β, and increase the significance, SL, of our estimation.
3.1.2 More Exact Calculations
A more accurate reliability estimation can be obtained if we incorporate in-
formation about the percentage of tests executed relative to the total number
of possible tests. However, this increase in accuracy is sometimes obtained at
the cost of extensive calculation, as outlined below.
Let U be the total number of possible inputs (if the software is a memoryless
program), or the total number of unique module executions according to the
operational profile specification (if the software is a module). Let n be the
total number of unique tests in the test set (if any tests occur more than once
in the test set, then n will be less than the total number of tests in the test
set). Let a = U · Σ_{i=1}^{U} ω(I_i)P(I_i), where I_i is the ith input or module execution,
P(I_i) is the probability that I_i is issued according to the operational profile, and
ω(I_i) = 0 if I_i succeeds and 1 if I_i fails.
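As a toy illustration of these definitions (the four-input profile below is invented), note that p = a/U reduces to Σ ω(I_i)P(I_i), the operational probability of issuing a failure-causing input; under a uniform profile, a is simply the number of failure-causing inputs.

```python
# Invented four-input memoryless program: input I3 fails, the others
# succeed. a = U * sum(omega_i * P_i), so p = a / U is the
# operational-profile probability of hitting a failure-causing input.

U = 4
P = [0.4, 0.3, 0.2, 0.1]   # operational profile (sums to 1)
omega = [0, 0, 1, 0]       # omega_i = 1 iff input i fails

a = U * sum(w * pi for w, pi in zip(omega, P))
p_fail = a / U             # = sum of P_i over failing inputs
print(a, p_fail)  # 0.8 0.2
```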
Our hypothesis testing model can now be modified to use the hypergeometric
distribution rather than the binomial distribution, since we can assume
testing without replacement. We are still interested in estimating the probability
of failure, p, which is equal to a/U. Our null hypothesis is still H0: p ≤ ε,
which is equivalent to H0: a ≤ εU. Similarly, the alternate hypothesis is
H1: a > εU. Let T = Σ_{i=1}^{n} T_i, where T_i is 1 if test i fails and 0 if it
succeeds. We accept H0 if T = 0. The probability that we have made an error
in accepting H0 is β = P[T = 0 | a > εU]. According to the hypergeometric
distribution,

β = C(U − a, n) / C(U, n),   a > εU,

where C(x, y) denotes the binomial coefficient (the number of ways to choose y items from x).
Since this expression decreases as a increases,

β < C(U − εU, n) / C(U, n).
We will obtain more precise estimations using the hypergeometric distribution
than we would using the binomial distribution in the case where U and
n are known, since the binomial distribution only approximates the hypergeometric;
i.e., we would obtain a smaller β for the same ε.
For example, suppose U = 10 and n = 7, and assume that all 7 tests
are unique. Assume we let ε = .1. Then using hypothesis testing with the
binomial distribution gives β ≈ .47, while using hypothesis testing with the
hypergeometric distribution gives β ≤ .3. Even if two of the tests were not
unique (i.e., n = 7 for the binomial and n = 6 for the hypergeometric),
hypothesis testing with the hypergeometric distribution would give β ≤ .4.
Thus, we can usually decrease the error β at the cost of calculating U, the
total number of possible test cases, and n, the total number of unique test
cases executed.
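These figures can be reproduced with a short calculation (a sketch; the hypergeometric bound is evaluated at the boundary a = εU, assumed integral here):

```python
from math import comb

def beta_binomial(eps: float, n: int) -> float:
    """Binomial bound on beta: (1 - eps)**n."""
    return (1.0 - eps) ** n

def beta_hypergeometric(U: int, eps: float, n: int) -> float:
    """Hypergeometric bound C(U - eps*U, n) / C(U, n)."""
    k = round(eps * U)  # failure-causing inputs at the boundary a = eps*U
    return comb(U - k, n) / comb(U, n)

print(round(beta_binomial(0.1, 7), 3))   # 0.478
print(beta_hypergeometric(10, 0.1, 7))   # 0.3
print(beta_hypergeometric(10, 0.1, 6))   # 0.4
```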
Calculation of U and n: When the software is a memoryless program, U
can be easily calculated from the operational profile, as the number of unique
inputs. However, when the software is a module, the number of calculations
involved in calculating U can be impractically large. U can be calculated from
the operational profile specification as the total number of unique possible
module executions derivable. There are many ways to accomplish this; one is
to build a tree representation of the operational profile specification, such that
U is the number of leaves in the tree, as described in [Woi93a].
n is calculated as the total number of unique tests in the test set. When
the software is a memoryless program, n can be easily calculated by simple
textual comparison of the test cases. When the software is a module, n may be
calculated by textual comparison, or by making use of the tree representation
mentioned above, as outlined in [Woi93a].
Accuracy of Approximations: As stated above, the binomial distribution
may be used to approximate the hypergeometric distribution; thus, the
calculations of Section 3.1.1 may be used to approximate those of Section 3.1.2.
How accurate are such approximations? In general, the binomial distribution
becomes a more accurate approximation as the sample space approaches
infinity [Ric88]. In other words, the larger U is, the more accurate the binomial
approximation is. In the example above, U was small (10); the approximation
was therefore not very good, giving an inaccuracy of .17. When U = 100, the
inaccuracy decreases to .01; when U = 1000, the inaccuracy decreases to .001.
The degree of inaccuracy depends to a certain extent on the values of ε and n
as well. For instance, when ε = .01, U = 1,000, and n = 100, the inaccuracy
is .02.
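These inaccuracy figures can be re-derived numerically (our own check; the hypergeometric bound is evaluated at a = εU as in Section 3.1.2):

```python
from math import comb

def inaccuracy(U: int, eps: float, n: int) -> float:
    """Binomial bound minus hypergeometric bound at a = eps*U."""
    binom = (1.0 - eps) ** n
    hyper = comb(U - round(eps * U), n) / comb(U, n)
    return binom - hyper

# eps = 0.1, n = 7, as in the example above; the gap shrinks with U.
for U in (10, 100, 1000):
    print(U, round(inaccuracy(U, 0.1, 7), 3))
print(round(inaccuracy(1000, 0.01, 100), 2))  # 0.02
```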
Usefulness: The above calculations for U can require much computation
in the case of modules, because of the nature of the conditional probability
distributions in the operational profiles [Woi93b]. We expect that only in the
case of simple module operational profile specifications will one be able to
take advantage of increasing estimation accuracy by incorporating
information about the number of tests executed relative to the total number of
tests possible. However, in the case of memoryless programs, we expect that
it will be feasible to use the more precise calculations, since these operational
profiles are unconditional probability distributions.
The increase in accuracy that can be obtained by using the more precise
calculations will depend largely on the size of U, and to a lesser extent on
the values of ε and n. When U is very large, we expect that the increase in
accuracy will usually be insignificant compared to the cost of calculating U
and n. However, when U is small, or when accuracy is of extreme importance,
testers may wish to use the more exact calculations outlined in Section 3.1.2.
superior if |R − E_H| < |R − E_M|, and MM is superior if |R − E_H| > |R − E_M|.
However, in light of the above discussion, we must keep in mind that this is a
crude comparison.
4.3 Method Preference
When module testing reveals no failures, using our method (or MM) is preferable
to the methods outlined in [TLN78, WW88, BL75, Whi92], because the
latter will all produce a reliability estimate of 1 (perfect) and will give no
indication of the potential error of this estimation. When a non-perfect reliability
estimate and a potential error are desired, our method is preferable to MM
because ours involves well-accepted, non-controversial statistical methods such
as hypothesis testing and goodness-of-fit tests², while MM involves controversial
Bayesian techniques. Our method's estimations are based on observed
results, while MM's are based on observed results and an initial guess, which
will always affect the final estimations. Further, if the initial guess is too far
off, an infeasibly large number of test cases could be required to achieve a
reliability estimate close to the true reliability, as outlined in [Woi93a]. For
the reasons above, we believe that our method may be preferable to similar
methods in the current literature.
5 Conclusions
In this paper, we have described a method for estimating reliability (operational
or non-operational) of software modules or memoryless programs
that, in their current versions, have not yet failed any tests. Our method is
based on statistical hypothesis testing, and thus produces a series of 3-tuples,
(R, C1, C2), where R is a reliability estimate and C1 and C2 are measures of
our confidence in R. We described a general estimation method, which we
expect will be useful for most software; we also described a more precise method,
which may be useful in cases of simple operational profiles, when more accurate
estimations are required. We showed that our estimation technique overcomes
the problems and limitations of similar techniques, and explained why we
believe that estimations obtained with our estimation method will tend to be
more precise than estimations obtained with similar methods. Because our
method avoids unrealistic assumptions and is easy to use, we believe that
testers might prefer to use our method of reliability estimation.
² Goodness-of-fit tests are used in the calculation of significance levels, as discussed
in [Woi93b].
References
[BL75] J.R. Brown and M. Lipow. Testing for software reliability. In Proc.
Intl. Conf. Reliable Software (Los Angeles, CA), pages 518-527,
April 1975.
[FS88] W. Farr and O.J. Smith. A tool for statistical modeling and
estimation of reliability functions for software: SMERFS. Journal of
Systems and Software, 8(1):47-55, January 1988.
[KM89] P.A. Keiller and D.R. Miller. On the use and the performance
of software reliability growth models. Reliability Engineering and
System Safety Special Issue, pages 1-21, 1989.
[KM91] P.A. Keiller and D.R. Miller. On the use and the performance
of software reliability growth models. Reliability Engineering and
System Safety Special Issue, pages 95-117, 1991.
[LBL92] M. Lu, S. Brocklehurst, and B. Littlewood. Combination of
predictions obtained from different software reliability growth models.
In Proceedings 10th Software Reliability Symposium, June 25-26,
1992.
[Lit90] B. Littlewood. Limits to evaluation of software dependability.
Draft report PDCS No. D8, Project 3092-PDCS, City University,
Northampton Square, London, July 1990.
[Lyu90] M.R. Lyu. Software reliability engineering and measurement at the
Jet Propulsion Laboratory. Workshop on Software Reliability,
Carleton University, Ottawa, Ontario, Canada, May 24-26, 1990.
[MIO90] J.D. Musa, A. Iannino, and K. Okumoto. Software Reliability:
Measurement, Prediction, Application. McGraw-Hill, New York,
1990.
[MMN+92] K.W. Miller, L.J. Morell, L.E. Noonan, S.K. Park, D.M. Nicol,
B.W. Murrill, and J.M. Voas. Estimating the probability of failure
when testing reveals no failures. IEEE Trans. Software Engineering,
18(1):33-42, January 1992.
[PvSK90] D.L. Parnas, J. van Schouwen, and S.P. Kwan. Evaluation
standards for safety critical software. Communications of the ACM,
33(6):836-848, June 1990.
Previous versions:
Technical Report 88-220, Dept. of Comp. & Info. Sci., Queen's
University, May 1988.
Proc. Intl. Working Group on Nuclear Power Plant Control and In-
strumentation, IAEA NPPCS Specialists' Meeting, International
Atomic Energy Agency, London, United Kingdom, 10-12 May
1988.
Software Development: Tips and Techniques, U.S. Professional
Development Institute, Silver Spring, Maryland, 1989, pp. 311-
350.
Proc. Seventh Intl. Conference on Testing Computer Software, San
Francisco, June 18-21, 1990, pp. 89-117.
[Ric88] John A. Rice. Mathematical Statistics and Data Analysis.
Wadsworth and Brooks/Cole, New York, 1988.
[TLN78] T.A. Thayer, M. Lipow, and E.C. Nelson. Software Reliability.
North-Holland, New York, 1978.
[Whi92] James Whittaker. Markov chain techniques for software testing
and reliability analysis. PhD dissertation, University of Tennessee,
May 1992.
[Woi93a] D.M. Woit. Operational profile specification, automatic test case
generation and reliability estimation for modules. PhD dissertation,
Queen's University, Kingston, Ontario, 1993.
[Woi93b] D.M. Woit. Specifying operational proles for modules. In Proceed-
ings ISSTA (International Symposium on Software Testing and
Analysis). ACM, June 28-30, 1993.
[WW88] S.N. Weiss and E.J. Weyuker. An extended domain-based
model of software reliability. IEEE Trans. Software Engineering,
14(10):1512-1524, October 1988.
A Theory of Hypothesis Testing
A hypothesis, H, is a statement about the probability distribution of a random
variable. Hypothesis testing is a process by which we may accept or reject H
based upon a sampling of the random variable referred to in H. Generally
speaking, the more consistency between our hypothesis and our sample results,
the more likely we are to accept H; the less consistency, the more likely we
are to reject H. However, we could be wrong: we might reject H when it is
true in reality, or might accept H when it is false in reality. Such errors are
known as Type I and Type II errors, respectively; the probability of a Type I
error is denoted α, and the probability of a Type II error is denoted β.
We refer to H (the hypothesis we wish to test) as the null hypothesis,
and denote it H0. In rejecting H0, we are implicitly accepting some alternate
hypothesis. To be precise, we should also identify the alternate hypothesis,
denoted H1 (necessary for calculation of β.) A simple hypothesis is one in
which an exact value of the unknown parameter of the assumed probability
distribution is given; in a composite hypothesis, ranges of values are specified
(for example, θ = .01 is a simple hypothesis, and θ ≤ 0.1 is a composite
hypothesis.)
Example: Suppose for a particular coin we wish to test the hypothesis that
the probability of tossing "heads", p, is at least 1/2, versus the alternate
hypothesis that p is less than 1/2 (i.e., H0: p ≥ 1/2 and H1: p < 1/2.)
Suppose we base our test upon 10 tosses of the coin, and suppose we decide
to reject H0 if we do not get at least 2 heads in the 10 tosses. More formally,
let F be the number of heads in 10 tosses. Then F = Σ_{i=1}^{10} F_i, where

F_i = 1 if toss i results in heads, and 0 if toss i results in tails,

for i = 1, 2, ..., 10; thus, F_1, F_2, ..., F_10 is a random sample of 10 observations
of a Bernoulli random variable with parameter p (i.e., F has the
binomial distribution (10, p), p ≥ .5, according to H0.) We will reject H0 if
F ≤ 1.
We test H0 by obtaining values for the F_i (tossing the coin ourselves, or
using previously obtained data) and computing F. We reject H0 if F ≤ 1, and
accept it otherwise. Next, we estimate the potential error in our decision.
The probability of a Type I error, α, is

α = P[F ≤ 1 | p ≥ .5] = Σ_{j=0}^{1} C(10, j) p^j (1 − p)^{10−j},  p ≥ .5
  ≤ (.5)^{10} + 10(.5)(.5)^9
  ≈ .01.

Note that
α ≤ P[F ≤ 1 | p = .5].   (3)
The probability of a Type II error, β, is

β = P[2 ≤ F ≤ 10 | p < .5].

Because H1 is composite, we cannot calculate a specific β. However, we can
calculate β for different values of p; i.e., with p = .25,

β = Σ_{j=2}^{10} C(10, j) (.25)^j (.75)^{10−j} ≈ .76.
Another way of calculating β is to use the result P[F ≤ N | p] = 1 for any
binomial distribution (N, p), which gives: β = 1 − P[F ≤ 1 | p < .5]. This is
known as the operating characteristic (OC). It describes how the probability
of the Type II error varies with p. Notice that P[F ≤ 1 | p = .5] < P[F ≤ 1 |
p < .5]. Thus,

β < 1 − P[F ≤ 1 | p = .5].   (4)
By (3) and (4), the probability of a Type I error in the hypothesis test of
H0 is no larger than P[F ≤ 1 | p = .5], and the probability of a Type II error
is less than 1 − P[F ≤ 1 | p = .5].
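The α and β values in this coin-toss example are easy to verify with a short calculation (a sketch using the binomial CDF):

```python
from math import comb

# Coin-toss example: reject H0 (p >= .5) iff F <= 1 heads in 10 tosses.

def p_at_most(k: int, n: int, p: float) -> float:
    """P[F <= k] for F ~ binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j)
               for j in range(k + 1))

alpha_bound = p_at_most(1, 10, 0.5)        # Type I error bound
beta_at_25 = 1.0 - p_at_most(1, 10, 0.25)  # Type II error at p = .25
print(round(alpha_bound, 2), round(beta_at_25, 2))  # 0.01 0.76
```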