
Bayes Factors

James Berger and Luis Pericchi


November 18, 2014

[This article was originally published online in 2006 in the Encyclopedia of Statistical Sciences, © John Wiley & Sons, Inc., and republished in Wiley StatsRef: Statistics Reference Online, 2014. It was revised and updated by James Berger and Luis Pericchi, November 2014.]

Bayes factors are the primary tool used in Bayesian inference for hypothesis test-
ing and model selection. They also are used by non-Bayesians in the construction of test
statistics. For instance, in the testing of simple hypotheses, the Bayes factor equals the
ordinary likelihood ratio. Bayesian Model Selection discusses the use of Bayes fac-
tors in model selection. Statistical Evidence discusses the general use of Bayes factors in
quantifying statistical evidence. To avoid overlap with these entries, we concentrate here on
the motivation for using Bayes factors, particularly in hypothesis testing.

1 Introduction
Suppose we are interested in testing two hypotheses:

$$H_1: X \text{ has density } f_1(x \mid \theta_1) \qquad \text{vs.} \qquad H_2: X \text{ has density } f_2(x \mid \theta_2).$$

If the parameters $\theta_1$ and $\theta_2$ are unknown, a Bayesian generally specifies prior densities $\pi_1(\theta_1)$
and $\pi_2(\theta_2)$ for these parameters, and then computes the Bayes factor of $H_1$ to $H_2$ as the
ratio of the marginal densities of $x$ under $H_1$ and $H_2$,

$$B = \frac{m_1(x)}{m_2(x)},$$

where
$$m_i(x) = \int f_i(x \mid \theta_i)\, \pi_i(\theta_i)\, d\theta_i.$$

The logarithm of the Bayes factor is called the weight of evidence provided by the data, a
quantity used extensively by Turing and Good in breaking Nazi codes during World War II
(Good [35]).
The Bayes factor is typically interpreted as the odds provided by the data for H1 to H2 ,
although this interpretation is strictly valid only if the hypotheses are simple; otherwise, B
will also depend on the prior distributions. Note, however, that B does not depend on
the prior probabilities of the hypotheses. If one wants a full Bayesian analysis, these prior
probabilities, P (H1 ) and P (H2 ) (which, of course, sum to one), must also be specified.
Then the posterior probability of $H_1$ is given by
$$P(H_1 \mid x) = \frac{B}{B + [P(H_2)/P(H_1)]}. \qquad (1)$$
Clearly, the posterior probability of $H_2$ is just $1 - P(H_1 \mid x)$.


One of the attractive features of using a Bayes factor to communicate evidence about
hypotheses is precisely that it does not depend on P (H1 ) and P (H2 ), which can vary con-
siderably among consumers of a study. Any such consumer can, if they wish, determine
their own P (H1 ) and convert the reported Bayes factor to a posterior probability using (1).
Although B will still depend on $\pi_1(\theta_1)$ and/or $\pi_2(\theta_2)$, the influence of these priors is typically
less significant than the influence of P(H1) and P(H2). Also, objective (default) choices of
$\pi_1(\theta_1)$ and $\pi_2(\theta_2)$ are increasingly available (see below), in which case B can be used as an
objective measure of evidence. General discussion of Bayes factors can be found in Jeffreys
[38], Berger [4], Kass and Raftery [39], Berger and Pericchi [14], [16], and Pericchi [50].
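
As a purely illustrative sketch of these definitions, one can compute the two marginal densities by numerical quadrature and then convert the resulting Bayes factor into a posterior probability via (1); the two models, their priors, the data value, and the prior probability P(H1) below are all hypothetical choices made only for the example.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

x = 1.3  # a single hypothetical observation

# Hypothetical hypotheses: H1: X ~ N(theta, 1) with prior theta ~ N(0, 1);
#                          H2: X ~ N(theta, 1) with prior theta ~ N(2, 1).
def marginal(prior_mean):
    # m(x) = integral of f(x | theta) * pi(theta) dtheta
    integrand = lambda t: (stats.norm.pdf(x, loc=t, scale=1.0) *
                           stats.norm.pdf(t, loc=prior_mean, scale=1.0))
    return quad(integrand, -np.inf, np.inf)[0]

B = marginal(0.0) / marginal(2.0)      # Bayes factor of H1 to H2
p_H1 = 0.5                             # prior probability of H1 (user supplied)
post_H1 = B / (B + (1 - p_H1) / p_H1)  # equation (1)
print(B, post_H1)
```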

Example 1. Suppose we observe an i.i.d. normal sample, $X_1, X_2, \ldots, X_n$, from an $N(\theta, \sigma^2)$
distribution, with $\sigma^2$ known. It is desired to test $H_1: \theta = \theta_0$ vs. $H_2: \theta \neq \theta_0$. For
the prior distribution of $\theta$ under $H_2$, it is common to choose an $N(\theta_0, \tau^2)$ distribution, where
the standard deviation $\tau$ is chosen to reflect the believed plausible range of $\theta$ if $H_2$ were true.
(One could, of course, also choose prior means other than $\theta_0$.)
A simple computation then shows that the Bayes factor of $H_1$ to $H_2$ is
$$B = \left(1 + \frac{n\tau^2}{\sigma^2}\right)^{1/2} \exp\left(-\frac{\tfrac{1}{2} z^2}{1 + \sigma^2/(n\tau^2)}\right), \qquad (2)$$
where $z = \sqrt{n}\,(\bar{x} - \theta_0)/\sigma$ is the usual standardized test statistic. A frequently used default
choice of $\tau$ is $\tau^2 = 2\sigma^2$ (the quartiles of the prior on $\theta$ are then roughly $\theta_0 \pm \sigma$), in which
case
$$B = \sqrt{1 + 2n}\; \exp\left(-\frac{z^2}{2 + (1/n)}\right).$$
For instance, if $n = 20$ and $|z| = 1.96$, then $B = 0.983$. As this very nearly equals 1,
which would correspond to equal odds for $H_1$ and $H_2$, the conclusion would be that the data
provide essentially equal evidence for $H_1$ and $H_2$.
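
A quick numerical check of these numbers (a minimal sketch; the function below is simply the default form of (2) with $\tau^2 = 2\sigma^2$ substituted in):

```python
import numpy as np

def default_bayes_factor(z, n):
    # Equation (2) with the default choice tau^2 = 2 sigma^2
    return np.sqrt(1 + 2 * n) * np.exp(-z**2 / (2 + 1.0 / n))

print(default_bayes_factor(1.96, 20))   # roughly 0.98
```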

Note that |z| = 1.96 corresponds to a p-value of 0.05, which is typically considered
to be significant evidence against H1 , in contradiction to the message conveyed by B =
0.983. This conflict between p-values and Bayes factors is quite general and is discussed from
several perspectives in this article.

2 The Jeffreys-Lindley Paradox
It is interesting to note what happens in (2) as n grows. For instance, if n = 100 but still
|z| = 1.96, then the Bayes factor actually favors the null hypothesis by a factor of
roughly two to one. Indeed, for any fixed z, as $n \to \infty$ (or $\tau^2 \to \infty$), it is immediate from
(2) that $B \to \infty$, indicating overwhelming evidence for H1, even though z was any (fixed)
large value (and the p-value was, correspondingly, any fixed small value).
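
The growth of B with n at a fixed z is easy to see numerically (a minimal sketch, again using the default form of (2)):

```python
import numpy as np

def default_bayes_factor(z, n):
    # Equation (2) with the default choice tau^2 = 2 sigma^2
    return np.sqrt(1 + 2 * n) * np.exp(-z**2 / (2 + 1.0 / n))

for n in (20, 100, 1_000, 10_000, 100_000):
    # For |z| = 1.96 the Bayes factor favors H1 more and more strongly as n grows
    print(n, round(default_bayes_factor(1.96, n), 2))
```
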
Various versions of this phenomenon have become known as Jeffreys' paradox [38], Lindley's
paradox [42], and Bartlett's paradox [2] (see Sharp Null Hypotheses). The labeling
of this phenomenon as a "paradox" is somewhat misleading, since it concerns an
incontrovertible mathematical fact. It is a paradox, however, in the sense that Bayesians
and non-Bayesians can reach quite different conclusions in testing, and this is probably the
primary situation with important practical consequences for which this is so.
The mathematical paradox depends on H1 being a believable exact point null hypothesis,
such as "does this particular gene have no effect whatsoever on the presence of a
specific illness?" Even if the real null is a small interval null, however, it can be accurately
approximated by the point null, as long as the sample size n is not too large (roughly, the
size of the interval null must be considerably smaller than the standard deviation of the
estimate of $\theta$ under H2). Thus, as $n \to \infty$ as required in the paradox, it would generally
no longer be adequate to approximate the small interval null by a point null, in which case
the paradox essentially becomes vacuous; see Berger and Delampady [10] for discussion
of this, including the condition under which a small interval null can be approximated by a
point null.
In spite of this resolution of the paradox, the fact remains that the condition for replac-
ing a small interval null by a point null is very often satisfied, and then the Bayes factor
will typically differ dramatically from the p-value. This has led many to view p-values as
being highly misleading when, say, a p-value of 0.05 is (erroneously) interpreted as an error
probability or as saying the odds are 20 to 1 in favor of the alternative.
Another resolution of the paradox arises from the recognition that it is wrong to consider
only the Type I error, as is implicit in the use of p-values that are "significant," e.g., smaller
than some critical level (such as the ubiquitous $\alpha = 0.05$). Good testing practice also
requires consideration of Type II error (or power); see [52] and references therein. For
instance, it can be shown that, for simple versus simple hypothesis testing, the paradox
disappears if, instead of fixing the Type I error and minimizing the Type II error (the
traditional method), one minimizes a weighted sum of the Type I and Type II errors; see [52]
for this result and generalizations. In some fields (e.g., genome-wide association studies), a
related solution has been adopted: the ratio power/(Type I error), rather than a weighted sum
of the two errors, is used as the test evaluation, interpreted as the odds of obtaining a true
rejection to obtaining a false rejection; cf. [61].

3 Motivation for Using Bayes Factors
Since posterior probabilities are an integral component of Bayesian hypothesis testing and
model selection, and since Bayes factors are directly related to posterior probabilities, their
role in Bayesian analysis is indisputable. We concentrate here, therefore, on reasons why
their use should be seriously considered by all statisticians.
Reason 1. Classical P-values can be highly misleading when testing precise hypotheses. This
has been extensively discussed in Edwards et al. [26], Berger and Sellke [18], Berger and
Delampady [10], and Berger and Mortera [11]. In Example 1, for instance, we saw that the
P-value and the posterior probability of the null hypothesis could differ very substantially. To
understand that the problem here is with the P-value, imagine that one faces a long series
of tests of new drugs for AIDS. To fix our thinking, let us suppose that 50% of the drugs
that will be tested have an effect on AIDS, and that 50% are ineffective. (One could make
essentially the same point with any particular fraction of effective drugs.) Each drug is tested
in an independent experiment, corresponding to a normal test of no effect, as in Example
1. (The experiments could all have different sample sizes and variances, however.) For each
drug, the P-value is computed, and those with P-values smaller than 0.05 are deemed to be
effective. (While this is an unfair caricature of the standard practice of statisticians, it is
very common practice among non-statisticians who conduct tests.)
Suppose a doctor reads the results of the published studies, but feels confused about
the meaning of P-values. (Let us even assume here that all studies are published, whether
they obtain statistical significance or not; the real situation of publication selection bias
only worsens the situation.) So the doctor asks the resident statistician to answer a simple
question: "A number of these published studies have P-values that are between 0.04 and
0.05; of these, what fraction of the corresponding drugs are ineffective?"
The statistician cannot provide a firm answer to this question, but can provide useful
bounds if the doctor is willing to postulate a prior opinion that a certain percentage of
the drugs being originally tested (say, 50% as above) were ineffective. In particular, it is
then the case that at least 23% of the drugs having P-values between 0.04 and 0.05 are
ineffective and, in practice, typically 50% or more will be ineffective (see Berger and Sellke
[18]). Relating to this last number, the doctor concludes: "So if I start out believing that a
certain percentage of the drugs will be ineffective, say 50%, then a P-value near 0.05 does
not change my opinion much at all; I should still think that about 50% are ineffective."
This is essentially right, and this is essentially what the Bayes factor conveys. In Example
1 we saw that the Bayes factor is approximately one when the P-value is near 0.05
(for moderate sample sizes). And a Bayes factor of one roughly means that the data are
equally supportive of the null and alternative hypotheses, so that posterior beliefs about the
hypotheses will essentially equal the prior beliefs.
We cast the above discussion in a frequentist framework to emphasize that this is a fundamental
fact about P-values; in situations such as that above, a P-value of 0.05 essentially
does not provide any evidence against the null hypothesis. (Note, however, that the situation
is different when there is a well-specified interval hypothesis or a well-justified one-sided
hypothesis; then P-values and posterior probabilities often happen to be reasonably similar,
at least for moderate and large samples; see Casella and Berger [22].) That the meaning
of P-values is commonly misinterpreted is hardly the fault of consumers of statistics. It is
the fault of statisticians for providing a concept so ambiguous in meaning. The real point
here is that the Bayes factor essentially conveys the right message easily and immediately.
Notice that for one-sided and well-specified interval hypotheses, Bayes factors are also well
defined, and give the correct answer once the priors have been specified on the appropriate
hypothesis sets.

Reason 2. Bayes factors are consistent for hypothesis testing and model selection. Consistency
is a very basic property. Its meaning is that, if one of the entertained hypotheses (or
entertained models) is actually true, then a statistical procedure should guarantee selection of
the true hypothesis (or true model) if enough data are observed. Use of Bayes factors guarantees
consistency (under very mild conditions), while use of most classical selection tools,
such as P-values, $C_p$, and AIC (see Model Selection: Akaike's Information Criterion),
and even DIC [58], frequently used by Bayesians, does not guarantee consistency (cf.
Gelfand and Dey [32]).

In model selection it is sometimes argued that consistency is not a very relevant concept
because no models being considered are likely to be exactly true. There are several possible
counterarguments. The first is that, even though it is indeed typically important to recognize
that entertained models are merely approximations, one should not use a procedure that
fails the most basic property of consistency when one does happen to have the correct model
under consideration. A second counterargument is based on the results of Berk [19] and
Dmochowski [24]; they show that asymptotically (under mild conditions), use of the Bayes
factor for model selection will choose the model that is closest to the true model in terms
of Kullback-Leibler divergence (see Information, Kullback). This is a remarkable and
compelling property of the use of Bayes factors. Other criticisms of Bayes factors, when the
true model is not among those considered, have been raised in predictive settings; see Shibata
[56] and Findley [28] for examples. These findings, however, seem to have been based on
the assumption that the Bayesian would use the highest-probability model; model averaging
and more recent developments, such as the median probability model [1] together with the
notion of the effective sample size [6], appear to address these concerns.

Reason 3. Bayes factors behave as automatic Ockham's razors (see Parsimony, Principle
of), favoring simple models over more complex models, if the data provide roughly
comparable fits for the models. Overfitting is a continual problem in model selection, since
more complex models will always provide a somewhat better fit to the data than will simple
models. In classical statistics overfitting is avoided by introduction of an ad hoc penalty term
(as in AIC), which increases as the complexity (i.e., the number of unknown parameters)
of the model increases. Bayes factors (and their approximation, BIC) act naturally to penalize
model complexity, and hence need no ad hoc penalty terms. For an interesting historical
example and general discussion and references, see Jefferys and Berger [37].

Reason 4. The Bayesian approach can easily be used for multiple hypotheses or models.
Whereas classical testing has difficulty with more than two hypotheses, their consideration
poses no additional difficulty in the Bayesian approach. For instance, one can easily extend
the Bayesian argument in Example 1 to test between $H_0: \theta = 0$, $H_1: \theta < 0$, and $H_2: \theta > 0$;
cf. [11], [51].

Reason 5. The Bayesian approach does not require nested hypotheses or models, standard
distributions, or regular asymptotics. Classical hypothesis testing has difficulty if the hy-
potheses are not nested or if the distributions are not standard. There are general classical
approaches based on asymptotics, but the Bayesian approach does not require any of the
assumptions under which an asymptotic analysis can be justified. The following example
illustrates some of these notions.

Example 2. Suppose we observe an i.i.d. sample, $X_1, X_2, \ldots, X_n$, from either a normal
or a Cauchy distribution f, and wish to test the hypotheses
$$H_1: f \text{ is } N(\mu, \sigma^2) \qquad \text{vs.} \qquad H_2: f \text{ is } C(\mu, \sigma^2).$$

This is awkward to do classically, as there is no natural test statistic and even no natural
null hypothesis. (One can obtain very different answers depending on which test statistic is
used and which hypothesis is considered to be the null hypothesis.) Also, computations of
error probabilities are difficult, essentially requiring expensive simulations.
In contrast, there is a natural and standard automatic Bayesian test for such hypotheses.
In fact, for comparison of any location-scale distributions, it is shown in Berger et al. [15]
that one can legitimately compute Bayes factors using the standard noninformative prior
density $\pi(\mu, \sigma^2) = 1/\sigma^2$ (see Jeffreys' Prior Distribution). For testing H1 vs. H2
above, the resulting Bayes factor is available in closed form (see Franck [30] and Spiegelhalter
[57]).
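
The closed-form expression is not reproduced here, but the Bayes factor can also be approximated by brute-force numerical integration, as in the following sketch (the data are hypothetical, and the infinite parameter ranges are truncated for the quadrature):

```python
import numpy as np
from scipy import stats
from scipy.integrate import dblquad

x = np.array([0.3, -1.2, 0.8, 2.1, -0.4, 0.6])   # hypothetical observations

def marginal(logpdf, x):
    # m(x) = Int Int [prod_i f((x_i - mu)/sigma)/sigma] pi(mu, sigma) dmu dsigma,
    # with the noninformative prior written as pi(mu, sigma) proportional to 1/sigma.
    def integrand(mu, sigma):
        return np.exp(np.sum(logpdf(x, loc=mu, scale=sigma))) / sigma
    val, _ = dblquad(integrand,
                     1e-2, 50.0,                       # sigma (truncated range)
                     lambda s: -50.0, lambda s: 50.0)  # mu (truncated range)
    return val

B = marginal(stats.norm.logpdf, x) / marginal(stats.cauchy.logpdf, x)
print(B)   # Bayes factor of the normal model to the Cauchy model
```

Note that the arbitrary constant in the improper prior cancels in the ratio, since the same prior is used under both models; this is what makes the noninformative prior legitimate here.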

Reason 6. The Bayesian approach can allow for model uncertainty, through what is called
Bayesian model averaging [50], which is often predictively optimal. Selecting a hypothesis or model
on the basis of data, and then using the same data to estimate model parameters or make
predictions based upon the model, is well known to yield (often seriously) overoptimistic
estimates of accuracy. In the classical approach it is often thus recommended to use part of
the data to select a model and the remaining part for estimation and prediction. When only
limited data are available, this can be difficult.

The Bayesian approach takes a different tack: ideally, all models are left in the analysis,
and, say, prediction is done using a weighted average (Bayesian model averaging) of the
predictive distributions from each model, the weights being determined from the posterior
probabilities (or Bayes factors) of each model. See Geisser [31] and Draper [25] for discussion
and references.
Although keeping all models in the analysis is an ideal, it can be cumbersome for communication
and descriptive purposes. If only one or two models receive substantial posterior
probability, it would not be an egregious sin to eliminate the other models from consideration.
Even if one must report only one model, the fact mentioned above, that Bayes
factors act as a strong Ockham's razor, means that at least the selected model will not be
an overly complex one, and so estimates and predictions based on this model will not be
quite so overoptimistic. Indeed, one can even establish certain formal optimality properties
of selecting models on the basis of Bayes factors. Here is one such:
Result 1. Suppose it is of interest to predict a future observation Y under, say, a symmetric
loss $L(|Y - \hat{Y}|)$ (see Decision Theory), where L is nondecreasing. Assume that
two models, M1 and M2, are under consideration for the data (present and future), that any
unknown parameters are assigned proper prior distributions, and that the prior probabilities
of M1 and M2 are both equal to 1/2. Then the optimal model to use for predicting Y is M1,
or M2, as the Bayes factor exceeds, or is less than, one.
For more than two models, though, and especially for many entertained models, it is not clear
that the model with the highest posterior probability is nearly optimal, since the highest
probability may be very small. But here again, Bayes factors and posterior model probabilities
come to the rescue; see the median probability model [1] discussed below.
Reason 7. Bayes factors seem to yield optimal conditional frequentist tests. The standard
frequentist testing procedure, based on the Neyman-Pearson lemma, has the disadvan-
tage of requiring the reporting of fixed error probabilities, no matter what the data. (The
data-adaptive versions of such testing, namely P -values, are not true frequentist procedures
and suffer from the rather severe interpretational problems discussed earlier.) In a recent
surprising development (based on ideas of Kiefer [41]), Berger et al. [8], [9] show for simple
versus simple testing and for testing a precise hypothesis, respectively, that tests based on
Bayes factors (with, say, equal prior probabilities of the hypotheses) yield posterior proba-
bilities that have direct interpretations as conditional frequentist error probabilities. Indeed,
the posterior probability of H1 is the conditional Type I frequentist error probability, and the
posterior probability of H2 is a type of average conditional Type II error probability (when
the alternative is a composite hypothesis). The reported error probabilities thus vary with the
data, exactly as do the posterior probabilities. Another benefit that accrues is the fact that
one can accept H1 with a specified error probability (again data-dependent).
The necessary technical detail to make this work is the defining of suitable conditioning
sets upon which to compute the conditional error probabilities. These sets necessarily include
data in both the acceptance region and the rejection region, and can roughly be described
as the sets which include data points providing equivalent strength of evidence (in terms
of Bayes factors) for and against H1 . Computation of these sets is, however, irrelevant to
practical implementation of the procedure.
The primary limitation of this Bayesian-frequentist equivalence is that there will typi-
cally be a region, which is called the no-decision region, in which frequentist and Bayesian
interpretations are incompatible. Hence this region is excluded from the decision space. In
Example 1, for instance, if the default $N(\theta_0, 2\sigma^2)$ prior is used and n = 20, then the no-decision
region is the set of all points where the usual z-statistic is between 1.18 and 1.95. In
all examples we have studied, the no-decision region is a region where both frequentists and

7
Bayesians would feel indecisive, and hence its presence in the procedure is not detrimental
from a practical perspective.
More surprises arise from this equivalence of Bayesian and conditional frequentist testing.
One is that, in sequential analysis using these tests, the stopping rule (see Optimal
Stopping Rules) is largely irrelevant to the stated error probabilities. In contrast, with
classical sequential testing, the error probabilities depend very much on the stopping rule.
For instance, consider a sequential clinical trial which is to involve up to 1000 patients. If
one allows interim looks at the data (after, say, each 100 patients), with the possibility of
stopping the experiment if the evidence appears to be conclusive at an interim stage, then
the classical error probability will be substantially larger than if one had not allowed such
interim analysis. Furthermore, computations of classical sequential error probabilities can
be very formidable. It is thus a considerable surprise that the conditional frequentist tests
mentioned above not only provide the freedom to perform interim analysis without penalty,
but also are much simpler than the classical tests. The final surprise is that these conditional
frequentist tests provide frequentist support for the stopping-rule principle (see Berger and
Berry [7]).

4 Using Bayes Factors


Several important issues that arise in using Bayes factors for statistical inference are discussed
here.

4.1 Computation
Computation of Bayes factors can be difficult. A useful simple approximation is the Laplace
approximation; see Bayesian Model Selection for its application to Bayes factors. The
standard method of numerically computing Bayes factors has long been Monte Carlo impor-
tance sampling. There have recently been a large variety of other proposals for computing
Bayes factors; see Bayesian Model Selection and Kass and Raftery [39] for discussion
and references.
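
As an illustration of the importance sampling idea only (not of any particular published algorithm), a marginal density might be estimated as follows; the model, prior, and proposal below are hypothetical choices made for the sketch.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=30)   # hypothetical data

def log_lik(theta):
    # N(theta, 1) likelihood of the data, evaluated for an array of theta values
    return np.sum(stats.norm.logpdf(x[:, None], loc=theta, scale=1.0), axis=0)

def log_prior(theta):
    return stats.norm.logpdf(theta, loc=0.0, scale=2.0)   # hypothetical proper prior

# Importance sampling estimate of m(x) = Int f(x | theta) pi(theta) dtheta,
# using a proposal q roughly matched to the posterior.
q_loc, q_scale = np.mean(x), 1.0 / np.sqrt(len(x))
theta = rng.normal(q_loc, q_scale, size=50_000)
log_w = log_lik(theta) + log_prior(theta) - stats.norm.logpdf(theta, q_loc, q_scale)
log_m_hat = logsumexp(log_w) - np.log(theta.size)
print(np.exp(log_m_hat))   # estimate of the marginal density m(x)
```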

4.2 Minimal Statistical Reports and Decision Theory


While Bayes factors do summarize the evidence for various hypotheses or models, they are
obviously not complete summaries of the information from an experiment. In Example 1,
for instance, along with the Bayes factor one would typically want to know the location of
$\theta$ given that H2 were true. Providing the posterior distribution of $\theta$, conditional on the data
and H2 being true, would clearly suffice in this regard. A minimal report would, perhaps,
be a credible set for $\theta$ based on this posterior, along with the Bayes factor, of course. (It is
worth emphasizing that the credible set alone would not suffice as a report, as in frequentist
statistics with the assumption of a fixed confidence level; the Bayes factor is needed to
measure the strength of evidence against H1. Often it is argued, by frequentist statisticians
who are aware of the dangers of p-values, that a confidence interval solves the problems.
This is simply not so: an interval conditional on the alternative hypothesis assumes the
alternative hypothesis to be true.)
We should also emphasize that, often, it is best to approach hypothesis testing and
model selection from the perspective of decision analysis (Bernardo and Smith [20] discuss a
variety of decision and utility-based approaches to testing and model selection). It should be
noted that Bayes factors do not necessarily arise as components of such analysis. Frequently,
however, the statistician's goal is not to perform a formal decision analysis, but to summarize
information from a study in such a way that others can perform decision analysis (perhaps
informally) based on this information. In this regard, the minimal type of report discussed
above will often suffice as the statistical summary needed for the decision maker.

4.3 Updating Bayes Factors


Full posterior distributions have the pleasant property of summarizing all available informa-
tion, in the sense that if new, independent information becomes available, one can simply
update the posterior with the new information through Bayes rule. The same is not gen-
erally true with Bayes factors. Updating Bayes factors in the presence of new information
typically also requires knowledge of the full posterior distributions (or, at least, the original
likelihoods). This should be kept in mind when reporting results.

4.4 Multiple Hypotheses Or Models


When considering k (greater than two) hypotheses or models, Bayes factors are rather cum-
bersome as a communication device, since they involve only pairwise comparisons. There are
two obvious solutions. One is just to report the marginal densities mi (x) for all hypotheses
and models. But since the scaling of these is arbitrary, it is typically more reasonable to
report a scaled version, such as
$$P_i = \frac{m_i(x)}{\sum_{j=1}^{k} m_j(x)}.$$

The Pi have the additional benefit of being interpretable as the posterior probabilities of the
hypotheses or models, if one were to assume equal prior probabilities. Note that a consumer
of such a report who has differing prior probabilities P (Hi ) can compute his or her posterior
probabilities as
$$P(H_i \mid x) = \frac{P_i\, P(H_i)}{\sum_{j=1}^{k} P_j\, P(H_j)}.$$
Within the Bayesian paradigm, the optimal analysis is then based on model averaging, where
one would assess the distribution of an uncertain quantity Y via the formula
$$p(y) = \sum_{i=1}^{k} P(H_i \mid x)\, p(y \mid x, H_i),$$
where $p(y \mid x, H_i)$ is the predictive distribution of Y under hypothesis $H_i$, given the data x.
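
A small sketch of both steps (reweighting reported scaled marginals $P_i$ by a consumer's own prior probabilities, then forming the model-averaged predictive density); all numerical values, and the normal form assumed for each predictive density, are hypothetical.

```python
import numpy as np
from scipy import stats

# Scaled marginals P_i as they might be reported for k = 3 models (hypothetical)
P = np.array([0.60, 0.30, 0.10])

# A consumer's own prior probabilities P(H_i) (hypothetical)
prior = np.array([0.50, 0.25, 0.25])
post = P * prior / np.sum(P * prior)   # posterior probabilities P(H_i | x)

# Model averaging: mix the predictive densities p(y | x, H_i).  Here each
# predictive is taken to be normal, with hypothetical means and spreads;
# in practice these would come from the fitted models.
pred_means = np.array([0.0, 0.5, 1.0])
pred_sds = np.array([1.0, 1.2, 0.9])

def p_avg(y):
    return np.sum(post * stats.norm.pdf(y, loc=pred_means, scale=pred_sds))

print(post)        # reweighted posterior model probabilities
print(p_avg(0.3))  # model-averaged predictive density at y = 0.3
```
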
Often a single model is required, for sociological reasons, and it is commonly assumed
that the maximum probability model is the optimal model. This is typically not true; the
closest approximation to the optimal model-averaged answer is usually the median probability
model, defined as follows for the simplest scenario of variable selection:
Suppose we have covariate parameters $(\beta_1, \beta_2, \ldots, \beta_p)$ in a potential model, with any
of them possibly being 0.
Suppose $H_j$ states that variables $\{i_1, \ldots, i_j\}$ are nonzero, with all others being zero.
The inclusion probability $P_i$ of variable i is then just the sum of the posterior probabilities
of all hypotheses for which variable i is nonzero.
The median probability model is then the model that contains only those variables whose
inclusion probabilities are greater than 1/2, and it is often the best predictive model, as
discussed in Barbieri and Berger (2004) [1].
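
A minimal sketch of the inclusion probabilities and the resulting median probability model for a toy variable-selection problem; the list of entertained models and their posterior probabilities are hypothetical.

```python
import numpy as np

p = 3  # number of candidate variables

# Hypothetical list of entertained models (each a set of indices of nonzero
# coefficients) and their posterior probabilities, summing to one.
models = [set(), {1}, {2}, {1, 2}, {1, 2, 3}]
post_prob = np.array([0.05, 0.40, 0.10, 0.30, 0.15])

# Inclusion probability of variable i: sum of the posterior probabilities of
# all models in which variable i appears.
incl = {i: sum(pp for m, pp in zip(models, post_prob) if i in m)
        for i in range(1, p + 1)}

# Median probability model: variables with inclusion probability > 1/2.
mpm = {i for i, q in incl.items() if q > 0.5}
print(incl)   # here {1: 0.85, 2: 0.55, 3: 0.15}
print(mpm)    # here {1, 2}
```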

5 Objective Bayes Factors


5.1 Basics
Ideally, $\pi_1(\theta_1)$ and/or $\pi_2(\theta_2)$ are derived as subjective prior distributions. In hypothesis
testing, especially nested hypothesis testing, there are strong reasons to do this. In Example
1, for instance, the Bayes factor clearly depends strongly on the prior variance, $\tau^2$. Thus, at
a minimum, one should typically specify this prior quantity (roughly the square of the prior
guess as to the possible spread of $\theta$ if H2 is true) to compute the Bayes factor. An attractive
alternative for statistical communication is to present, say, a graph of the Bayes factor as
a function of such key prior inputs, allowing consumers of a study easily to determine the
Bayes factor corresponding to their personal prior beliefs (cf. Dickey [23] and Fan and Berger
[27]).
Another possibility is to use Bayesian robustness methods, presenting conclusions that
are valid simultaneously for a large class of prior inputs. In Example 1, for instance, one
can show that the lower bound on the Bayes factor over all possible $\tau^2$, when z = 1.96,
is 0.473. While this is a useful bound here, indicating that the odds against H1 are no
more than about 2 to 1, such bounds will not always answer the question. The problem is that
the upper bounds on the Bayes factor, in situations such as this, tend to be infinite, so
that one may well be left with an indeterminate conclusion. (Of course, one might very
reasonably apply robust Bayesian methods to a reduced class of possible prior inputs, and
hope to obtain sensible upper and lower bounds on the Bayes factor.) Discussions of robust
Bayesian methods in testing can be found in [26], [4], [18], [10], and [5].
Sometimes use of noninformative priors is reasonable for computing Bayes factors. One-
sided testing provides one such example, where taking limits of symmetric proper priors as
they become increasingly vague is reasonable and can be shown to give the same answer as
use of noninformative priors (cf. Casella and Berger [22]). Another is nonnested testing,
when the models are of essentially the same type and dimension. Example 2 above was of
this type. See Berger et al. [15] for discussion of other problems of this kind.
In general, however, use of noninformative priors is not legitimate in hypothesis testing
and model selection. In Example 1, for instance, the typical noninformative prior for $\theta$
(under H2) is the constant density, but any constant could be used (since the prior is
improper regardless) and the resulting Bayes factor would vary with the arbitrary choice
of the constant. This is unfortunate, especially in model selection, because at the initial
stages of model development and comparison it is often not feasible to develop full subjective
proper prior distributions. Notice that so-called vague priors are also very dangerous in
hypothesis testing; an illustration of the danger can be seen in Example 1: letting $(\tau/\sigma)^2$
be very large will lead to the disaster of accepting the null regardless of the data. In summary,
neither fully improper priors nor vague priors are appropriate for constructing objective Bayes
factors.
This has led to a variety of alternative proposals for the development of objective or default
Bayes factors. Some of these methods are briefly discussed below. For discussion of other
methods and comparisons, see Berger and Pericchi [14], [16], Iwaki [36], Pericchi [50] and
Moreno and Pericchi [45].

5.2 BIC and AIC


Apart from the ubiquitous p-values, the two most widely used model selection criteria are
BIC and AIC. The corresponding definitions of a model score (small being good) are

$$\mathrm{AIC}(M_j) = -2 \log(\text{max.\ likelihood}) + 2p,$$

$$\mathrm{BIC}(M_j) = -2 \log(\text{max.\ likelihood}) + p \log(n),$$


where p is the number of parameters, and n is the sample size. The best model is that which
minimizes the score.
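
As a small sketch of these two scores for a single hypothetical model (a normal model with unknown mean and variance, so p = 2, fit by maximum likelihood):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(1.0, 2.0, size=40)   # hypothetical data

# Model M_j: N(mu, sigma^2) with both parameters unknown, so p = 2
mu_hat, sigma_hat = np.mean(x), np.std(x)   # maximum likelihood estimates
max_loglik = np.sum(stats.norm.logpdf(x, loc=mu_hat, scale=sigma_hat))
p, n = 2, len(x)

aic = -2 * max_loglik + 2 * p
bic = -2 * max_loglik + p * np.log(n)
print(aic, bic)   # smaller is better for both criteria
```
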
The Bayes information criterion (BIC) of Schwarz [53] arises from the Laplace approxi-
mation to Bayes factors (see Bayesian Model Selection for discussion). The BIC is often
a quite satisfactory approximation (cf. Kass and Wasserman [40]), but it avoids the problem
of prior specification simply by ignoring the term of the Laplace expansion that relates to the
prior distribution. Perhaps more importantly, BIC depends on the definition of the sample
size n (which enters in the penalty for complexity).
This is a problem, in that n is not well-defined; for instance, what is the sample size
when there are 100 observations that have correlation of 0.9? Or, in the situation where one
has p different normal means with r replicate observations on each, is n equal to the number
of observations pr, or to the number of replicates r? In applications, pr is typically used,
but this is wrong; as shown in [6], the correct definition of n here for BIC would be r, an
enormous difference.

5.3 Conventional, Intrinsic, and Fractional Bayes Factors
The original objective Bayes approach was to choose conventional prior distributions reflecting
perceived lack of information concerning the parameters of the hypotheses. Thus,
in Example 1, we commented that the $N(\theta_0, 2\sigma^2)$ distribution is a standard default prior
for this testing problem. Jeffreys [38] pioneered this approach (although he actually recommended
a $C(\theta_0, \sigma^2)$ default prior in Example 1); see also Zellner and Siow [60] and the many
references to this approach in Berger and Pericchi [14] and [16].
An attempt to directly use noninformative priors, but with a plausible argument for
choosing particular constant multiples of them when they are improper, was proposed for
linear models in Spiegelhalter and Smith [59].
Two standard objective approaches are the intrinsic Bayes factor approach of Berger
and Pericchi [12]-[14] and the fractional Bayes factor approach of O'Hagan [46], [47]. These
use, respectively, parts of the data (training samples) or a fraction of the likelihood to
create, in a sense, an objective proper prior distribution. These approaches operate essentially
automatically, and apply in great generality to hypothesis testing and model selection
problems. And the better versions of these approaches can be shown to correspond to use of
actual reasonable objective proper prior distributions. They thus provide the best general
objective methods of testing and model selection that are currently available.
The intrinsic Bayes factor can be shown to correspond to actual priors that integrate to one,
as a density on the parameters under test. In Example 1, calculation shows that the assumed
prior, a normal centered at $\theta_0$ with prior variance twice the sampling variance, is indeed
the intrinsic prior for the comparison in that example. Further, in Example 2, the intrinsic
prior is the assumed (and, in a variety of senses, optimal and objective) $\pi(\mu, \sigma^2) = 1/\sigma^2$. (Here,
since both models are location-scale with two unknown parameters, the correct objective
prior leads to the same correction in the intrinsic Bayes factor.) Intrinsic priors have been
applied to several statistical problems and have generated several closely related methods.
One of them is the expected-posterior prior (EP-prior) method of Perez and Berger [48],
which automatically generates empirical and/or theoretical proper priors.
The intrinsic Bayes factor was originally motivated by the use of real empirical minimal training
samples. Taking the limit over a hypothetical infinite population of training samples, however,
leads to the intrinsic prior equations, which relate the prior information of an anchor model to
that of the other models. In principle, the conditional priors in different models need not be
related to each other in any way. But in an objective schema the priors should take into consideration
the different models being compared and in some way put the models on an equal footing.
The propagation of information formula does this as follows:
$$\pi_j(\theta_j) = \int \pi_j^N(\theta_j \mid x(l))\, m_0(x(l))\, dx(l).$$
Here, $m_0$ is the marginal density under the anchor model $M_0$, $m_0(x(l)) = \int f_0(x(l) \mid \theta_0)\, \pi_0^N(\theta_0)\, d\theta_0$,
$x(l)$ is a minimal training sample for both $M_0$ and $M_j$, and $\pi_j^N(\theta_j \mid x(l))$ is the posterior density
under $M_j$ with respect to the usual noninformative prior for the parameters $\theta_j$. In this way,
the information propagates, using the anchor model as the generator of samples. The
propagation of information formula is the basis for both intrinsic priors and EP-priors. The
difference lies in the assumed anchor model $M_0$. When $M_0$ is the simplest model, the formula
leads to the intrinsic prior. But the anchor model can even be the empirical distribution
of the data, in which case the equation leads to the empirical EP-prior.
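
A Monte Carlo sketch of this formula in the setting of Example 1 (with $\theta_0 = 0$ and $\sigma = 1$ assumed for simplicity): the anchor model is $M_0: N(0, 1)$, the minimal training sample is a single observation, and the noninformative prior under $M_j: N(\theta, 1)$ is the constant density, so the averaged posteriors should approximate the $N(\theta_0, 2\sigma^2)$ intrinsic prior mentioned above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Anchor model M0: X ~ N(0, 1).  Model M_j: X ~ N(theta, 1) with the constant
# noninformative prior, so the posterior given one training observation x is N(x, 1).
T = 100_000
x_train = rng.normal(0.0, 1.0, size=T)   # minimal training samples drawn from m_0

theta_grid = np.linspace(-4.0, 4.0, 9)
# pi_j(theta) is approximated by averaging pi_j^N(theta | x(l)) over training samples
intrinsic = np.array([np.mean(stats.norm.pdf(t, loc=x_train, scale=1.0))
                      for t in theta_grid])

# Compare with the N(0, 2) density, i.e., the N(theta_0, 2 sigma^2) prior of Example 1
print(np.c_[theta_grid, intrinsic, stats.norm.pdf(theta_grid, 0.0, np.sqrt(2.0))])
```
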
Among the many articles about intrinsic priors and related methods are: [3], [17], [21], [29],
[34],[43], and [44]. For a recent overview of intrinsic priors see Moreno and Pericchi (2014)
[45].

6 Moving Forward Practically


Science needs to abandon p-values and adopt Bayes factors. But p-values are so easy, and
determination of Bayes factors is a considerably higher-level process. Luckily, there is a
calibration of p-values that can be employed which, while not perfect, is virtually guaranteed
to be an improvement on p-values.
This calibration could be called a robust lower bound on the Bayes factor of the null to
the alternative hypothesis [55]. The bound is obtained in different ways, as the minimum
over large and sensible classes of priors (unimodal and symmetric around the null), or
under large classes of alternative hypotheses. The bound takes the following form, when p
is the p-value:
$$B(p) \ge -e\, p \log(p) \equiv \underline{B}(p).$$
Another form of this bound is as a lower bound on the conditional frequentist error probability
$\alpha(p)$ corresponding to a p-value, i.e., a bound on the actual frequentist error probability
that corresponds to a specified p-value, which is also a bound on the posterior probability of
H1, for $P(H_1) = P(H_2)$:
$$\underline{\alpha}(p) = \frac{1}{1 + 1/(-e\, p \log(p))} = \frac{1}{1 + 1/\underline{B}(p)}.$$

This lower bound is remarkable in several respects. One is its startling simplicity. The
second is that, despite it being extremely general, the difference with p-values is substantial.
For example, coming back to Example 1, if the observed p-value is 0.05 then the Bayes
factor is greater than or equal to $\underline{B}(p) \approx 0.41$ (this is smaller than the Bayes factor of 0.983
found in Example 1, but recall this is a lower bound) and the conditional frequentist error
probability satisfies $\underline{\alpha}(p) \approx 0.29$. Focus on the latter for a minute: many scientists view a
p-value of 0.05 as meaning that the null hypothesis has only probability 0.05 of being true,
whereas an absolute lower bound on this probability is 0.29. Recognition of this mathematical
fact would dramatically change science.
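
A minimal sketch of this calibration (the bound applies for p < 1/e):

```python
import numpy as np

def B_lower(p):
    # Robust lower bound on the Bayes factor of H1 to H2: -e p log(p), for p < 1/e
    return -np.e * p * np.log(p)

def alpha_lower(p):
    # Corresponding lower bound on the conditional frequentist error probability,
    # i.e., on the posterior probability of H1 when P(H1) = P(H2) = 1/2
    return 1.0 / (1.0 + 1.0 / B_lower(p))

for p in (0.05, 0.01, 0.001):
    print(p, round(B_lower(p), 3), round(alpha_lower(p), 3))
# p = 0.05 gives roughly B >= 0.41 and error probability >= 0.29, as quoted above.
```
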
One problem with this lower bound is that it does not depend on the sample size n, in
contradiction with the behavior of Bayes factors as exhibited in Example 1. There have been
many attempts to find calibrations that depend on n (see [49] and [52] for the latest efforts
and history).

Related Articles
Bayesian Model Selection; Decision Theory; Model Selection: Bayesian Infor-
mation Criterion; Parsimony, Principle of ; P-Values; Sharp Null Hypotheses;
Statistical Evidence; P value and Bayesian statistics; Bayesian model selection
in survival analysis; Bayesian Inference; Bayes p-Values.

References
[1] Barbieri, M. and Berger, J. (2004). Optimal Predictive Model Selection. Annals of Statistics, 32, 870-897.

[2] Bartlett, M. S. (1957). A comment on D. V. Lindley's statistical paradox. Biometrika, 44, 533-534.

[3] Bayarri, M.J., Berger, J., Forte, A., and García-Donato, G. (2012). Criteria for Bayesian model choice with application to variable selection. The Annals of Statistics, 40, 1550-1577.

[4] Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd edition. Springer-Verlag, New York.

[5] Berger, J. (1994). An overview of robust Bayesian analysis. Test, 3, 5-124.

[6] Berger, J.O., Bayarri, M.J. and Pericchi, L.R. (2013). The Effective Sample Size. Econometrics Review, 33, 197-217.

[7] Berger, J. and Berry, D. (1988). The relevance of stopping rules in statistical inference. In Statistical Decision Theory and Related Topics IV, S. S. Gupta and J. Berger, eds. Springer-Verlag, New York, pp. 29-48.

[8] Berger, J., Brown, L., and Wolpert, R. (1994). A unified conditional frequentist and Bayesian test for fixed and sequential hypothesis testing. Ann. Statist., 22, 1787-1807.

[9] Berger, J., Boukai, B., and Wang, Y. (1994). Unified frequentist and Bayesian testing of a precise hypothesis. Statist. Sci., 12, 133-156.

[10] Berger, J. and Delampady, M. (1987). Testing precise hypotheses. Statist. Sci., 3, 317-352.

[11] Berger, J. and Mortera, J. (1991). Interpreting the stars in precise hypothesis testing. Int. Statist. Rev., 59, 337-353.

[12] Berger, J. and Pericchi, L. R. (1995). The intrinsic Bayes factor for linear models. In Bayesian Statistics 5, J. M. Bernardo et al., eds. Oxford University Press, London, pp. 23-42.

[13] Berger, J. and Pericchi, L. R. (1996). On the justification of default and intrinsic Bayes factors. In Modeling and Prediction, J. C. Lee et al., eds. Springer-Verlag, New York, pp. 276-293.

[14] Berger, J. and Pericchi, L. R. (1996). The intrinsic Bayes factor for model selection and prediction. J. Amer. Statist. Ass., 91, 109-122.

[15] Berger, J., Pericchi, L., and Varshavsky, J. (1998). Bayes factors and marginal distributions in invariant situations. Sankhya A, 60, 307-321.

[16] Berger, J.O. and Pericchi, L.R. (2001). Objective Bayesian Model Selection: Introduction and Comparisons. In Lecture Notes of the Institute of Mathematical Statistics: Model Selection, editor P. Lahiri, pp. 135-207.

[17] Berger, J.O. and Pericchi, L.R. (2004). Training samples in objective Bayesian model selection. Annals of Statistics, 32, 841-869.

[18] Berger, J. and Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of P-values and evidence. J. Amer. Statist. Ass., 82, 112-122.

[19] Berk, R. (1966). Limiting behavior of posterior distributions when the model is incorrect. Ann. Math. Statist., 37, 51-58.

[20] Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. Wiley, New York.

[21] Cano, J.A. and Salmeron, D. (2013). Integral Priors and Constrained Imaginary Train-
ing Samples for Nested and Non-nested Bayesian Model Comparison. Bayesian Analysis.
8, 361-380.

[22] Casella, G. and Berger, R. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem. J. Amer. Statist. Ass., 82, 106-111.

[23] Dickey, J. (1973). Scientific reporting. J. R. Statist. Soc. B, 35, 285-305.

[24] Dmochowski, J. (1996). Intrinsic Bayes factors via Kullback-Leibler geometry. In Bayesian Statistics 5, J. M. Bernardo et al., eds. Oxford University Press, London, pp. 543-550.

[25] Draper, D. (1995). Assessment and propagation of model uncertainty (with discussion). J. R. Statist. Soc. B, 57, 45-98.

[26] Edwards, W., Lindman, H., and Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psych. Rev., 70, 193-242.

[27] Fan, T. H. and Berger, J. (1995). Robust Bayesian Displays for Standard Inferences
Concerning a Normal Mean. Technical Report, Purdue University, West Lafayette, Ind.

[28] Findley, D. (1991). Counterexamples to parsimony and BIC. Ann. Inst. Statist. Math., 43, 505-514.

[29] Fouskakis, D., Ntzoufras, I. and Draper, D. (2014). Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models. Bayesian Analysis (to appear).

[30] Franck, W. E. (1981). The most powerful invariant test of normal versus Cauchy with applications to stable alternatives. J. Amer. Statist. Ass., 76, 1002-1005.

[31] Geisser, S. (1993). Predictive Inference: An Introduction. Chapman and Hall, London.

[32] Gelfand, A. and Dey, D. (1994). Bayesian model choice: asymptotics and exact calculations. J. R. Statist. Soc. B, 56, 501-514.

[33] George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. J. Amer. Statist. Ass., 88, 881-889.

[34] Giron, F.J., Martinez, M.L., Moreno, E. and Torres, F. (2006). Objective Testing Procedures in Linear Models: Calibration of the p-values. Scandinavian Journal of Statistics, 33, 765-784.

[35] Good, I.J. (1985) Weight of evidence: a brief survey. In: Bernardo, J.M. et al (Eds.)
Bayesian Statistics, vol. 2. North-Holland, Amsterdam, pp. 249-270.

[36] Iwaki, K. (1995). Posterior expected marginal likelihood for testing hypotheses. J. Econ. (Asia Univ.), 21, 105-134.

[37] Jefferys, W. and Berger, J. (1992). Ockham's razor and Bayesian analysis. Amer. Sci., 80, 64-72.

[38] Jeffreys, H. (1961). Theory of Probability, 3rd ed. Clarendon, Oxford.

[39] Kass, R. E. and Raftery, A. (1995). Bayes factors and model uncertainty. J. Amer. Statist. Ass., 90, 773-795.

[40] Kass, R. E. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. Amer. Statist. Ass., 90, 928-934.

[41] Kiefer, J. (1977). Conditional confidence statements and confidence estimators. J. Amer. Statist. Ass., 72, 789-827.

[42] Lindley, D. V. (1957). A statistical paradox. Biometrika, 44, 187-192.

[43] Moreno, E., Bertolino, F. and Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypothesis testing. J. Amer. Statist. Ass., 93, 1451-1460.

[44] Moreno, E., Giron, F. J. and Casella, G. (2010). Consistency of objective Bayes factors
as the model dimension grows. The Annals of Statistics., 38, 1937-1952.

[45] Moreno E and Pericchi L.R (2014) Intrinsic Priors for Objective Bayesian Model Selec-
tion. Advances in Econometrics. (In press).
[46] O'Hagan, A. (1994). Bayesian Inference. Edward Arnold, London.
[47] O'Hagan, A. (1995). Fractional Bayes factors for model comparisons. J. R. Statist. Soc. B, 57, 99-138.
[48] Perez, J.M. and Berger J.O. (2002). Expected-posterior prior distributions for model
selection. Biometrika 89, 491-511.
[49] Perez, M.E. and Pericchi, L.R. (2014). Changing Statistical Significance with the Amount of Information: The adaptive alpha significance level. Statistics and Probability Letters, 85, 20-24. http://www.sciencedirect.com/science/article/pii/S0167715213003611
[50] Pericchi, L.R. (2005). Model Selection and Hypothesis Testing based on Objective Probabilities and Bayes Factors. In Handbook of Statistics, vol. 25, Elsevier B.V., pp. 115-149.
[51] Pericchi, L.R., Liu, G. and Torres, D. (2008). Objective Bayes Factors for Informed Hypothesis: Completing the Informed Hypothesis and Splitting the Bayes Factors. In Practical Bayesian Approaches to Testing Behavioral and Social Science Hypotheses, Chapter 7 (H. Hoijtink, I. Klugkist and P. A. Boelen, eds.). Springer, pp. 131-154.
[52] Pericchi, L.R. and Pereira C.A.B. (2014) Adaptive Significance Levels using Optimal
Decision Rules: Balancing by Weighting the Error Probabilities. Brazilian Journal of
Probability and Statistics. (Accepted).
[53] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6, 461-464.
[54] Scott, J.O. and Berger, J.O. (2010). Bayes and empirical-Bayes multiplicity adjustment
in the variable-selection problem. The Annals of Statistics, 38, 2587-2619.
[55] Sellke, T., Bayarri M.J. and Berger J.O. (2001) Calibration of P-values for testing precise
null hypothesis. Amer. Statist., 55, 62-71.
[56] Shibata, R. (1981). An optimal selection of regression variables. Biometrika, 68, 45-54.
[57] Spiegelhalter, D. J. (1985). Exact Bayesian inference on the parameters of a Cauchy distribution with vague prior information. In Bayesian Statistics 2, J. M. Bernardo et al., eds. North-Holland, Amsterdam, pp. 743-750.
[58] Spiegelhalter, David J.; Best, Nicola G.; Carlin, Bradley P.; van der Linde, Angelika
(2002). Bayesian measures of model complexity and fit (with discussion). Journal of the
Royal Statistical Society, Series B 64 (4): 583-639.
[59] Spiegelhalter, D. J. and Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. J. R. Statist. Soc. B, 44, 377-387.

[60] Zellner, A. and Siow, A. (1980). Posterior odds for selected regression hypotheses. In Bayesian Statistics 1, J. M. Bernardo et al., eds. Valencia University Press, Valencia, pp. 585-603.

[61] Burton, Paul R., et al. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447, 661-678.
