
CBMS-NSF REGIONAL CONFERENCE SERIES

IN APPLIED MATHEMATICS
A series of lectures on topics of current research interest in applied mathematics under the direction of
the Conference Board of the Mathematical Sciences, supported by the National Science Foundation and
published by SIAM.
GARRETT BIRKHOFF, The Numerical Solution of Elliptic Equations
D. V. LINDLEY, Bayesian Statistics, A Review
R. S. VARGA, Functional Analysis and Approximation Theory in Numerical Analysis
R. R. BAHADUR, Some Limit Theorems in Statistics
PATRICK BILLINGSLEY, Weak Convergence of Measures: Applications in Probability
J. L. LIONS, Some Aspects of the Optimal Control of Distributed Parameter Systems
ROGER PENROSE, Techniques of Differential Topology in Relativity
HERMAN CHERNOFF, Sequential Analysis and Optimal Design
J. DURBIN, Distribution Theory for Tests Based on the Sample Distribution Function
SOL I. RUBINOW, Mathematical Problems in the Biological Sciences
P. D. LAX, Hyperbolic Systems of Conservation Laws and the Mathematical Theory of Shock
Waves
I. J. SCHOENBERG, Cardinal Spline Interpolation
IVAN SINGER, The Theory of Best Approximation and Functional Analysis
WERNER C. RHEINBOLDT, Methods of Solving Systems of Nonlinear Equations
HANS F. WEINBERGER, Variational Methods for Eigenvalue Approximation
R. TYRRELL ROCKAFELLAR, Conjugate Duality and Optimization
SIR JAMES LIGHTHILL, Mathematical Biofluiddynamics
GERARD SALTON, Theory of Indexing
CATHLEEN S. MORAWETZ, Notes on Time Decay and Scattering for Some Hyperbolic Problems
F. HOPPENSTEADT, Mathematical Theories of Populations: Demographics, Genetics and Epidemics
RICHARD ASKEY, Orthogonal Polynomials and Special Functions
L. E. PAYNE, Improperly Posed Problems in Partial Differential Equations
S. ROSEN, Lectures on the Measurement and Evaluation of the Performance of Computing Systems
HERBERT B. KELLER, Numerical Solution of Two Point Boundary Value Problems
J. P. LASALLE, The Stability of Dynamical Systems - Z. ARTSTEIN, Appendix A: Limiting Equations
and Stability of Nonautonomous Ordinary Differential Equations
D. GOTTLIEB AND S. A. ORSZAG, Numerical Analysis of Spectral Methods: Theory and Applications
PETER J. HUBER, Robust Statistical Procedures
HERBERT SOLOMON, Geometric Probability
FRED S. ROBERTS, Graph Theory and Its Applications to Problems of Society
JURIS HARTMANIS, Feasible Computations and Provable Complexity Properties
ZOHAR MANNA, Lectures on the Logic of Computer Programming
ELLIS L. JOHNSON, Integer Programming: Facets, Subadditivity, and Duality for Group and Semi-Group Problems
SHMUEL WINOGRAD, Arithmetic Complexity of Computations
J. F. C. KINGMAN, Mathematics of Genetic Diversity
MORTON E. GURTIN, Topics in Finite Elasticity
THOMAS G. KURTZ, Approximation of Population Processes
(continued on inside back cover)


Glenn Shafer

Rutgers University
Newark, New Jersey

Probabilistic Expert
Systems


SOCIETY FOR INDUSTRIAL AND APPLIED MATHEMATICS


PHILADELPHIA

Copyright 1996 by the Society for Industrial and Applied Mathematics.


10 9 8 7 6 5 4 3 2 1
All rights reserved. Printed in the United States of America. No part of this book may be
reproduced, stored, or transmitted in any manner without the written permission of the
publisher. For information, write to the Society for Industrial and Applied Mathematics,
3600 University City Science Center, Philadelphia, PA 19104-2688.
Library of Congress Cataloging-in-Publication Data
Shafer, Glenn, 1946-
   Probabilistic expert systems / Glenn Shafer.
p. cm. -- (CBMS-NSF regional conference series in applied
mathematics ; 67)
"Sponsored by Conference Board of the Mathematical Sciences"--Cover.
Includes bibliographical references and index.
ISBN 0-89871-373-0 (pbk.)
1. Expert systems (Computer science) 2. Probabilities.
I. Conference Board of the Mathematical Sciences. II. Title.
III. Series.
QA76.76.E95S486 1996
006.3'3--dc20
96-18757

SIAM is a registered trademark.

Contents

Preface                                                   vii

Chapter 1. Multivariate Probability                         1
  1.1 Probability distributions                             2
  1.2 Marginalization                                       3
  1.3 Conditionals                                          5
  1.4 Continuation                                          7
  1.5 Posterior distributions                              10
  1.6 Expectation                                          12
  1.7 Classifying probability distributions                13
  1.8 A limitation                                         14

Chapter 2. Construction Sequences                          17
  2.1 Multiplying conditionals                             18
  2.2 DAGs and belief nets                                 20
  2.3 Bubble graphs                                        27
  2.4 Other graphical representations                      30

Chapter 3. Propagation in Join Trees                       35
  3.1 Variable-by-variable summing out                     37
  3.2 The elementary architecture                          41
  3.3 The Shafer-Shenoy architecture                       44
  3.4 The Lauritzen-Spiegelhalter architecture             50
  3.5 The Aalborg architecture                             56
  3.6 COLLECT and DISTRIBUTE                               63
  3.7 Scope and alternatives                               66

Chapter 4. Resources and References                        69
  4.1 Meetings                                             69
  4.2 Software                                             69
  4.3 Books                                                70
  4.4 Review articles                                      71
  4.5 Other sources                                        73

Index                                                      79

Preface

Based on lectures at an NSF/CBMS Regional Conference at the University of
North Dakota at Grand Forks during the week of June 1-5, 1992, this monograph
analyzes join-tree methods for the computation of prior and posterior
probabilities in belief nets. These methods, pioneered by Pearl [42], [8],
Lauritzen and Spiegelhalter [37], and Shafer, Shenoy, and Mellouli [45] in the late
1980s, continue to be central to the theory and practice of probabilistic expert
systems.
In the North Dakota lectures, I began with the topics discussed here and
then moved on in two directions. First, I discussed how the basic architectures
for join-tree computation apply to other methods for combining evidence, especially the belief-function (Dempster^Shafer) method, and also how they apply to
many other problems in applied mathematics and operations research. Second,
I looked at other aspects of computation in expert systems, especially Markov
chain Monte Carlo approximation, computation for model selection, and computation for model evaluation.
I completed a draft of the three chapters that form the body of this monograph in the summer of 1992, shortly after delivering the lectures. Unfortunately,
I set the project aside at the end of that summer, expecting to return in a few
months to write additional chapters covering at least the other major topics I
had discussed in Grand Forks. As it turned out, my return to the project was
delayed for three years, as I found myself increasingly concerned with another
set of ideas: the use of probability trees to understand probability and causality.
Rather than extend this monograph, I completed a new and much longer book,
The Art of Causal Conjecture (MIT Press, 1996).
The field of probabilistic expert systems has continued to flourish in the
past three years, yet the understanding of join-tree architectures set out in my
original three chapters is still missing from the literature. Moreover, the broader
research question that motivated my presentation -- how well a general theory of
propagation along the same lines can account for the wide variety of recursive
computation in applied mathematics -- remains open. I have decided, therefore,
to publish these three chapters on their own, essentially as they were written in
1992. I have resisted even attempting a brief survey of related topics. Instead I

have added a brief chapter on resources, which gives information on software and
includes an annotated bibliography. I have also added some exercises that will
help the reader begin to explore the problem of generalizing from probability to
broader domains of recursive computation.
The resulting monograph should be useful to scholars and students in artificial
intelligence, operations research, and the various branches of applied statistics
that use probabilistic methods. Probabilistic expert systems are now used in
areas ranging from diagnosis (in medicine, software maintenance, and space exploration) and auditing to tutoring, and the computational methods described
here are basic to nearly all implementations in all these areas.
I wish to thank Lonnie Winnrich, who organized the conference in North
Dakota, as well as the other participants. They made the week very pleasant
and productive for me. I also wish to thank the many students and colleagues,
at the University of Kansas and around the world, who helped me learn about
expert systems in the late 1980s and early 1990s. Foremost among them is
Prakash P. Shenoy, my colleague in the School of Business at the University
of Kansas from 1984 to 1992. I am grateful for his steadfast friendship and
indispensable collaboration.
Augustine Kong and A. P. Dempster, who joined with Shenoy and me in
the early 1980s in the study of join-tree computation for belief functions, were
also important in the development of the ideas reported here. Section 3.1 is inspired by an unpublished memorandum by Kong. Other colleagues and students
with whom I collaborated particularly closely during this period include Khalid
Mellouli, Debra K. Zarley, and Rajendra P. Srivastava.
Special thanks are due Nevin Lianwen Zhang, Chingfu Chang, and the late
George Kryrollos, all of whom made useful comments on the 1992 draft of the
monograph.
I would also like to acknowledge the friendship and encouragement of many
other scholars whose work is reported here, especially A. P. Dawid, Finn V.
Jensen, Steffen L. Lauritzen, Judea Pearl, and David Spiegelhalter. The field of
probabilistic expert systems has benefited not only from their energy, intellect,
and vision, but also from their generosity and good humor.
Finally, at an even more personal level, I would like to thank my wife, Nell
Irvin Painter, who has supported this and my other scholarly work through thick
and thin.

CHAPTER 1

Multivariate Probability

This chapter reviews the basic ingredients of the theory of multivariate probability: marginals, conditionals, and expectations. These will be familiar topics
for many readers, but our approach will take us down some relatively unexplored paths. One of these paths opens when we develop an explicit notation
for marginalization. This notation allows us to recognize properties of marginalization that are shared by many types of recursive computation. Another path
opens when we distinguish among probability distributions on the basis of how
they are stored. We distinguish between tabular distributions, which are simply tables of probabilities, and algorithmic distributions, which are algorithms
for computing probabilities. A parametric distribution is a special kind of algorithmic distribution; it consists of a few numerical parameters and a relatively simple algorithm, usually a formula, for computing probabilities from those
parameters.
The most complex topic in this chapter is conditional probability. Our purposes require that we understand conditional probability from several viewpoints,
and we rely on some careful terminology to keep the viewpoints distinct. We
distinguish between conditional probabilities in general, which can stand on
their own, without reference to any prior probability distribution, and posterior probabilities, which are conditional probabilities obtained by conditioning
a probability distribution on observations. And we distinguish two kinds of
tables of conditional probabilities: conditionals and posterior distributions. A
conditional consists of many probability distributions for a set of variables (the
conditional's head), one for each configuration of another set of variables (its
tail). A posterior distribution is a single probability distribution consisting of
posterior probabilities.
In the next chapter, we study how to construct a probability distribution
by multiplying conditional probabilities -- or, more precisely, by multiplying conditionals. When we multiply the conditionals in an appropriate order, each
multiplication produces a larger marginal of the final distribution. This means
that each conditional is a continuer for the final distribution; it continues it from
a smaller to a larger set of variables. The concept of a continuer will help us
minimize complications arising from the presence of zero probabilities, which are
unavoidable in expert systems, where much of our knowledge is in the form of

TABLE 1.1
A discrete tabular probability distribution for three variables.

                         female                    male
                Dem     ind     Rep       Dem     ind     Rep
young           .08     .16     .08       .02     .04     .02
middle-aged     .05     .05     .05       .00     .00     .00
old             .05     .05     .05       .10     .10     .10

rules that do not admit exceptions. Continuers will also help us, in Chapter 3,
to understand architectures for recursive computation.
This chapter is about multivariate probability, not about probability in general. Not all probability models are multivariate. The chapter concludes with a
brief explanation of why multivariate models are sometimes inadequate.
1.1. Probability distributions.

The quickest way to orient those not familiar with multivariate probability is
to give an example. Table 1.1 gives a probability distribution for three variables:
Age, Sex, and Party. Notice that the numbers are nonnegative and add to one.
This is what it takes to be a discrete probability distribution.
We will write Ω_X for the set of possible values of a variable X, and we will
write Ω_x for the set of configurations of a set of variables x. We call Ω_X and Ω_x
the frames for X and x, respectively. In general, Ω_x is the Cartesian product of
the frames of the individual variables: Ω_x = ∏_{X∈x} Ω_X. In Table 1.1, we assume
that

    Ω_Age = {young, middle-aged, old},
    Ω_Sex = {male, female},

and

    Ω_Party = {Democrat, independent, Republican}.

Thus the frame Ω_{Age,Sex,Party} consists of eighteen configurations:

    (young, male, Democrat), (old, male, independent), ...

and Table 1.1 gives a probability for each of them. In general, as in this example,
a discrete probability distribution for x gives a probability to every element of
Ω_x; abstractly, it is a nonnegative function on Ω_x whose values add to one.
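Readers who want to experiment can hold a tabular distribution like Table 1.1 in a few lines of code. The following sketch (Python; the variable and helper names are ours, not part of the text) builds the frame Ω_{Age,Sex,Party} as a Cartesian product and checks the two defining conditions of a discrete probability distribution.

```python
from itertools import product

# Frames for the three variables of Table 1.1.
frames = {
    "Age": ["young", "middle-aged", "old"],
    "Sex": ["female", "male"],
    "Party": ["Democrat", "independent", "Republican"],
}

# The frame for a set of variables is the Cartesian product of the
# individual frames; here it has 3 * 2 * 3 = 18 configurations.
variables = ("Age", "Sex", "Party")
frame = list(product(*(frames[v] for v in variables)))
assert len(frame) == 18

# Table 1.1 as a map from configurations to probabilities; each row of
# the printed table lists female Dem/ind/Rep, then male Dem/ind/Rep.
rows = {"young":       [.08, .16, .08, .02, .04, .02],
        "middle-aged": [.05, .05, .05, .00, .00, .00],
        "old":         [.05, .05, .05, .10, .10, .10]}
P = {}
for age, probs in rows.items():
    for (sex, party), p in zip(product(frames["Sex"], frames["Party"]), probs):
        P[(age, sex, party)] = p

# What it takes to be a discrete probability distribution:
# nonnegative values that add to one.
assert all(p >= 0 for p in P.values())
assert abs(sum(P.values()) - 1.0) < 1e-9
```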
If we add together the numbers for males and females in Table 1.1, we get
marginal probabilities for Age and Party, as in Table 1.2. Adding further, we
get marginal probabilities for Age, as in Table 1.3.
Some readers may be puzzled by the name "marginal." The name is derived
from the example of a bivariate table, where it is convenient and conventional
to write the sums of the rows and columns in the margins. In Table 1.4, for



We can write a formula for P^↓w:

    P^↓w(c) = Σ_{d∈Ω_{x\w}} P(c.d)                                  (1.1)

for each configuration c of w. Here x \ w consists of the variables in x but not in
w, and c.d is the configuration of x that we get by combining the configuration c
of w and the configuration d of x \ w. For example, if x = {Age, Sex, Party} and
w = {Age, Party}, then x \ w = {Sex}; if c = (old, Democrat) and d = (male),
then c.d = (old, male, Democrat).
The arrow notation emphasizes the variables that remain when we marginalize. Sometimes we use instead a notation that emphasizes the variables we sum
out: P^−y is the marginal obtained when we sum out the variables in y. Thus
when x = w ∪ y, where w and y are disjoint sets of variables, and P is a probability distribution on x, both P^↓w and P^−y will represent P's marginal on w.
Though we are concerned primarily with probability distributions, any
numerical² function f on a set of variables x has a marginal f^↓w for every subset
w of x. The function f need not be nonnegative or sum to one. If w is not
empty, then f^↓w is a function on w:

    f^↓w(c) = Σ_{d∈Ω_{x\w}} f(c.d)                                  (1.2)

for each configuration c of w. If w is empty, then f^↓w is simply a number:

    f^↓∅ = Σ_{c∈Ω_x} f(c).                                          (1.3)

The number f^↓∅ will be equal to one if f is a probability distribution. The
function f^↓w will be equal to f if w = x.
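Equation (1.2) translates directly into code. The sketch below (Python, with illustrative names of our own choosing) sums a table over the variables being removed; summing out every variable yields the single number f^↓∅.

```python
def marginalize(f, variables, keep):
    """f↓w: sum a table f (dict: configuration tuple -> number) over the
    variables of its domain that are not in `keep` (equation (1.2))."""
    idx = [i for i, v in enumerate(variables) if v in keep]
    out = {}
    for config, value in f.items():
        key = tuple(config[i] for i in idx)
        out[key] = out.get(key, 0.0) + value
    return out

# A small table on x = {A, B}; it need not be nonnegative or sum to one.
f = {("a1", "b1"): 2.0, ("a1", "b2"): 3.0,
     ("a2", "b1"): 1.0, ("a2", "b2"): 4.0}

f_A = marginalize(f, ("A", "B"), {"A"})        # f↓{A}, a function on {A}
f_empty = marginalize(f, ("A", "B"), set())    # f↓∅, a single number

assert f_A == {("a1",): 5.0, ("a2",): 5.0}
assert f_empty == {(): 10.0}
```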
Here are two important properties of marginalization:
    Property 1. If f is a function on y, and w ⊆ x ⊆ y, then (f^↓x)^↓w = f^↓w.
    Property 2. If f is a function on x, and g is a function on y, then (fg)^↓x = f(g^↓(x∩y)).³
We leave it to the reader to derive these properties from equation (1.2).
    It is informative to rewrite Properties 1 and 2 using the f^−y notation. This
gives the following:
    ² A numerical function is one that takes real numbers as values. We will consider only
numerical functions in this monograph.
    ³ In order to understand this equation, we must recognize that the product fg is a function
on x ∪ y. Its value for a configuration c of x ∪ y is given by (fg)(c) = f(c^↓x) g(c^↓y), where
c^↓x is the result of dropping from c the values for variables not in x. For example, if f is a
function on {Age, Party} and g is a function on {Sex, Party}, then (fg)(old, male, Democrat) =
f(old, Democrat) g(male, Democrat).


FIG. 1.1. Removing y\x from y leaves x ∩ y; removing y\x from x ∪ y leaves x.
    Property 1. If f is a function on y, and u and v are disjoint subsets
of y, then

    (f^−u)^−v = f^−(u∪v).

    Property 2. If f is a function on x, and g is a function on y, then

    (fg)^−(y\x) = f (g^−(y\x)).

This version of Property 2 makes it clear that we are summing out the same
variables on both sides of the equation (fg)^↓x = f(g^↓(x∩y)). Summing these
variables out of fg, which is a function on x ∪ y, leaves the variables in x, but
summing them out of g, which is a function on y, leaves the variables in x ∩ y
(see Figure 1.1).
    The second version of Property 2 also suggests the following generalization:
    Property 3. If f is a function on x, g is a function on y, and z is a subset
of y disjoint from x, then

    (fg)^−z = f (g^−z).

We leave it to the reader to derive this property also from equation (1.2).
As we will see in Chapter 3, Properties 1 and 2 are responsible for the possibility of recursively computing marginals of probability distributions given as
products of tables. These properties also hold and justify recursive computation
in other domains, where we work with different objects and different meanings
for marginalization and multiplication. Because of their generality, we call Properties 1 and 2 axioms; Property 1 is the transitivity axiom, and Property 2 is the
combination axiom.
The definition of marginalization, equation (1.2), together with the proofs
of Properties 1, 2, and 3, can be adapted to the continuous case by replacing
summation with integration. We leave this to the reader. We also leave aside
complications that arise if infinities are allowed -- if the sum or integral is over an
infinite frame or an unbounded function. Our primary interest is in distributions
given by tables, and here the frames are both discrete and finite.
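The transitivity and combination axioms are easy to spot-check numerically. The following sketch (Python, illustrative names; a spot check on random tables, not a proof) verifies both.

```python
import random
from itertools import product

def marginalize(f, variables, keep):
    # f↓w: sum out the variables not in `keep` (equation (1.2)).
    idx = [i for i, v in enumerate(variables) if v in keep]
    out = {}
    for config, value in f.items():
        key = tuple(config[i] for i in idx)
        out[key] = out.get(key, 0.0) + value
    return out

random.seed(0)
frames = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}

# Transitivity axiom: marginalizing in stages equals marginalizing in
# one step.  f is a table on {A, B, C}.
f = {c: random.random() for c in product(frames["A"], frames["B"], frames["C"])}
staged = marginalize(marginalize(f, ("A", "B", "C"), {"A", "B"}), ("A", "B"), {"A"})
direct = marginalize(f, ("A", "B", "C"), {"A"})
assert all(abs(staged[k] - direct[k]) < 1e-12 for k in direct)

# Combination axiom: (fg)↓x = f · (g↓(x∩y)) with f2 on x = {A, B} and
# g on y = {B, C}; here x∩y = {B} and we sum out y\x = {C}.
f2 = {c: random.random() for c in product(frames["A"], frames["B"])}
g = {c: random.random() for c in product(frames["B"], frames["C"])}
fg = {(a, b, c): f2[(a, b)] * g[(b, c)]
      for a in frames["A"] for b in frames["B"] for c in frames["C"]}
lhs = marginalize(fg, ("A", "B", "C"), {"A", "B"})
g_B = marginalize(g, ("B", "C"), {"B"})
assert all(abs(lhs[(a, b)] - f2[(a, b)] * g_B[(b,)]) < 1e-12 for (a, b) in f2)
```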

1.3. Conditionals.

Table 1.5 gives conditional probabilities for Party given Age and Sex. We call
these numbers conditional probabilities because they are nonnegative and each
group of three (the three probabilities for Party given each Age-Sex configuration) sums to one. In other words, the marginal for {Age,Sex}, Table 1.6, consists
of ones.
We call Table 1.5 as a whole a conditional. We call {Party} its head, and
we call {Age,Sex} its tail. In general, a conditional is a nonnegative function Q

TABLE 1.5
A conditional for Party given Age and Sex.

                         female                    male
                Dem     ind     Rep       Dem     ind     Rep
young           1/4     1/2     1/4       1/4     1/2     1/4
middle-aged     1/3     1/3     1/3       1/5     1/5     3/5
old             1/3     1/3     1/3       1/3     1/3     1/3

TABLE 1.6
The marginal of Table 1.5 on its tail.

                female    male
young              1        1
middle-aged        1        1
old                1        1

on the union of two disjoint sets of variables, its head h and its tail t, with the
property that Q^↓t = 1_t, where 1_t is the function on t that is identically equal to
one.
Two special cases deserve mention. If t is empty, then Q is a probability
distribution for h. If h is empty, then Q = 1_t. We are interested in conditionals
not for their own sake but because we can multiply them together to construct
probability distributions. This is the topic of the next chapter.
Frequently, we are interested only in a subtable of a conditional. In Table 1.5,
for example, we might be interested only in the conditional probabilities for
females -- the subtable shown in Table 1.7. We call such a subtable a slice. In
general, if f is a table on x and c is a configuration of a subset w of x, then we
write f|_{w=c} for the table on x \ w given by

    f|_{w=c}(d) = f(c.d)                                            (1.4)

and we call f|_{w=c} the slice of f on w = c. We leave it to the reader to verify the
following proposition.
    PROPOSITION 1.1. Suppose Q is a conditional with head h and tail t, and
suppose w ⊆ t. Then Q|_{w=c} is a conditional with head h and tail t \ w.
    Table 1.7 illustrates Proposition 1.1; it is a conditional with {Party} as its
head and {Age} as its tail.
We will sometimes find it convenient to generalize the notation for slicing by
allowing the variables whose values we fix to include variables that are outside
the domain of the table and hence have no effect on the result. In general, if f is
a table on x, w is a set of variables, and c is a configuration of w, then we write
f|_{w=c} for the table on x \ w given by

    f|_{w=c}(d) = f((c.d)^↓x)                                       (1.5)

for each configuration d of x \ w.
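Slicing is simply fixing some coordinates and dropping them. A sketch (Python, illustrative names), applied to the conditional of Table 1.5, also confirms Proposition 1.1 numerically:

```python
def slice_table(f, variables, fixed):
    """The slice f|w=c: fix the variables in `fixed` (a dict from
    variable to value) and return a table on the remaining variables.
    Variables in `fixed` outside f's domain are ignored, as in the
    generalized definition."""
    keep = [i for i, v in enumerate(variables) if v not in fixed]
    out = {}
    for config, value in f.items():
        if all(config[i] == fixed[v]
               for i, v in enumerate(variables) if v in fixed):
            out[tuple(config[i] for i in keep)] = value
    return out

# The conditional of Table 1.5: rows indexed by (Age, Sex), with
# probabilities for Democrat, independent, Republican.
rows = {("young", "female"):       (1/4, 1/2, 1/4),
        ("young", "male"):         (1/4, 1/2, 1/4),
        ("middle-aged", "female"): (1/3, 1/3, 1/3),
        ("middle-aged", "male"):   (1/5, 1/5, 3/5),
        ("old", "female"):         (1/3, 1/3, 1/3),
        ("old", "male"):           (1/3, 1/3, 1/3)}
parties = ("Democrat", "independent", "Republican")
Q = {(age, sex, p): v for (age, sex), vals in rows.items()
     for p, v in zip(parties, vals)}

# The slice of Table 1.5 on Sex = female is Table 1.7.
T17 = slice_table(Q, ("Age", "Sex", "Party"), {"Sex": "female"})
assert T17[("young", "Democrat")] == 1/4

# Proposition 1.1: the slice is again a conditional -- each row sums to one.
for age in ("young", "middle-aged", "old"):
    assert abs(sum(T17[(age, p)] for p in parties) - 1.0) < 1e-12
```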

TABLE 1.7
The slice of Table 1.5 on Sex = female.

                Dem     ind     Rep
young           1/4     1/2     1/4
middle-aged     1/3     1/3     1/3
old             1/3     1/3     1/3

TABLE 1.8
The marginal of Table 1.1 for Age and Sex.

                female    male
young             .32      .08
middle-aged       .15      .00
old               .15      .30

1.4. Continuation.

If f is a function on x, w ⊆ x, and

    f = f^↓w Q,                                                     (1.6)

then we say that Q continues f from w to x.
    Here is an example. Suppose x = {Age, Sex, Party} and w = {Age, Sex}, and
consider the probability distribution P given by Table 1.1 and the conditional
Q given by Table 1.5. The marginal P^↓w is given by Table 1.8, and the reader
can easily check that P = P^↓w Q.⁴ Thus Q continues P from w to x.
When do continuers exist, and when are they unique?
    PROPOSITION 1.2. Suppose f is a function on x, and suppose w ⊆ x.
    1. If all of f's values are positive, then there is a unique function Q on x
that continues f from w to x. This continuer Q is a conditional.
    2. If all of f's values are nonnegative, then there is at least one function Q
on x that continues f from w to x. We can choose Q to be a conditional.
    Proof. Try to divide both sides of equation (1.6) by f^↓w to obtain

    Q = f / f^↓w,                                                   (1.7)

or

    Q(c.d) = f(c.d) / f^↓w(c),                                      (1.8)

where c is a configuration of w and d is a configuration of x \ w. If the values of
f are all positive, then the values of f^↓w are as well, and the division succeeds;

    ⁴ Bear in mind that P = P^↓w Q means P(c) = P^↓w(c^↓w) Q(c). Thus each entry in Table 1.8
multiplies a whole row (three entries) in Table 1.5.


it produces the unique Q on x satisfying equation (1.6). Using the combination
axiom, we find that

    Q^↓w = (f (1/f^↓w))^↓w = (1/f^↓w) f^↓w = 1_w,

so Q is a conditional with tail w. If the values of f are merely all nonnegative,
then the division in equation (1.8) may fail for some c, but if f^↓w(c) = 0, then
f(c.d) = 0 for all d, and hence equation (1.6) will be satisfied with arbitrary values of Q(c.d) for that c. In particular, we may choose the Q(c.d) to be nonnegative
and add to one for each such c, so that Q is a conditional.
Since a probability distribution has nonnegative but not necessarily all positive values, it has continuers but not necessarily unique continuers. In our
example, the nonuniqueness is in the conditional probabilities for middle-aged
males. Since middle-aged males have probability zero in Table 1.8, we can change
the numbers 1/5, 1/5, and 3/5 in Table 1.5 however we want without falsifying
equation (1.6).
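The proof of Proposition 1.2 is effectively an algorithm: divide f by f^↓w where the division succeeds, and fill in any row that sums to one where it fails. The sketch below (Python, illustrative names; the uniform fill-in is our arbitrary choice, which the text permits) recovers Table 1.5 from Table 1.1 except on the zero-probability row.

```python
def continuer(f, variables, tail, frames):
    """A conditional Q with tail `tail` satisfying f = f↓tail · Q
    (equation (1.6)).  Where f↓tail vanishes, Q's row is arbitrary;
    this sketch fills in the uniform distribution."""
    tidx = [i for i, v in enumerate(variables) if v in tail]
    head = [v for v in variables if v not in tail]
    # f↓tail, by summing out the head variables.
    marg = {}
    for config, value in f.items():
        key = tuple(config[i] for i in tidx)
        marg[key] = marg.get(key, 0.0) + value
    n_head = 1
    for v in head:
        n_head *= len(frames[v])
    return {config: (value / marg[tuple(config[i] for i in tidx)]
                     if marg[tuple(config[i] for i in tidx)] > 0
                     else 1.0 / n_head)
            for config, value in f.items()}

frames = {"Age": ["young", "middle-aged", "old"],
          "Sex": ["female", "male"],
          "Party": ["Democrat", "independent", "Republican"]}
rows = {"young":       [.08, .16, .08, .02, .04, .02],
        "middle-aged": [.05, .05, .05, .00, .00, .00],
        "old":         [.05, .05, .05, .10, .10, .10]}
P = {(age, sex, party): p
     for age, probs in rows.items()
     for (sex, party), p in zip(
         [(s, q) for s in frames["Sex"] for q in frames["Party"]], probs)}

Q = continuer(P, ("Age", "Sex", "Party"), {"Age", "Sex"}, frames)
# Where P↓{Age,Sex} is positive, Q reproduces Table 1.5 ...
assert abs(Q[("young", "female", "independent")] - 1/2) < 1e-12
# ... and on the zero-probability row (middle-aged males) this sketch
# picks 1/3, 1/3, 1/3 rather than Table 1.5's 1/5, 1/5, 3/5; both
# choices give continuers of P.
assert abs(Q[("middle-aged", "male", "Republican")] - 1/3) < 1e-12
```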
In addition to continuation to the whole domain of a function, we are also
interested in continuation to subsets. So we generalize the definition of continuation: If f is a function on y, w ⊆ x ⊆ y, and

    f^↓x = f^↓w Q,                                                  (1.9)

then we say that Q continues f from w to x.

In the following chapters, we will frequently be interested in marginals and
continuers for probability distributions that are proportional to a given function.
The next proposition lists some relatively obvious but important aspects of this
situation.

    PROPOSITION 1.3.
    1. Suppose f is proportional to g. Then any marginal of f is proportional to
the corresponding marginal of g, with the same constant of proportionality. In
other words, if f = kg and w is a subset of the domain of f, then f^↓w = kg^↓w.
    2. Suppose f is proportional to g: f = kg for some nonzero constant k. Then
any continuer for f is also a continuer for g.
    3. Suppose the probability distribution P is proportional to the function f on
x. Then the constant of proportionality is 1/f^↓∅, and P is f's unique continuer
from ∅ to x:

    P = (1/f^↓∅) f                                                  (1.10)

and P is the unique Q satisfying

    f = f^↓∅ Q.                                                     (1.11)

Moreover,

    P^↓w = (1/f^↓∅) f^↓w                                            (1.12)

for every subset w of x.

4. A probability distribution is its own unique continuer from the empty set
to its domain.
    Proof. Statement 1 follows directly from the definition of marginalization,
equation (1.2).
    To prove statement 2, we substitute kg for f in equation (1.9), obtaining
(kg)^↓x = (kg)^↓w Q. By the combination axiom, this becomes kg^↓x = kg^↓w Q, or
g^↓x = g^↓w Q.
    Again by the combination axiom, P = kf implies P^↓∅ = kf^↓∅. Since P
is a probability distribution, P^↓∅ = 1, whence k = 1/f^↓∅. So equation (1.10)
holds. Since f^↓∅ is a positive number, equation (1.10) is the unique solution of
equation (1.11); P is the unique continuer of f from ∅ to x.
    To prove statement 4, substitute P for f in equation (1.11) and again apply
the combination axiom.
Equations (1.6) and (1.9) do not require that Q be a function on x. They
require only that Q's domain, say v, should satisfy x = w ∪ v or, equivalently,
x \ w ⊆ v ⊆ x. In some cases (when the right-hand side of equation (1.8) does
not depend on all the coordinates of c), there is a continuer with a domain v
that is smaller than x. The situation is illustrated in Figure 1.2, where we have
written u_1 for w\v, u_2 for w∩v, and u_3 for v\w. We may say, in this situation,
that u_2 is sufficient for the continuation from w to x; the other variables in w,
those in u_1, can be neglected.
If the function f that we are continuing is a probability distribution, then
the idea of sufficiency can be elaborated in terms of the meaning of the probabilities. If we give the probabilities an objective interpretation, then we can say
that once the configuration of u_2 is determined, the configuration of u_1 will not
affect the determination of the configuration of u_3. If we give the probabilities a
subjective interpretation, then we can say that once we know the configuration
of u_2, information about the configuration of u_1 will not affect our beliefs about
the configuration of u_3.
The philosophy of probability that underlies this monograph is neither strictly
objective nor strictly subjective. Instead, it is constructive. We see a probability
distribution as something we deliberately construct in order to make predictions.
Though these predictions may be the best we can do, we need not be fully committed to them as beliefs. And though they should be evaluated empirically,
they need not individually represent stable frequencies. In terms of this constructive interpretation, sufficiency simply means adequacy for prediction. Once
the configuration of u_2 is specified, we ignore information about u_1 when we
predict u_3.
Instead of saying that u_2 is sufficient for the continuation from w to x, we
may say that u_3 is independent of u_1 given u_2. The concept of conditional independence thus defined is mathematically interesting. Its properties include the
symmetry suggested by Figure 1.2: if u_3 is independent of u_1 given u_2, then u_1
is independent of u_3 given u_2 (see Dawid [27], Pearl [8], or Appendix F of Shafer
[9]). Conditional independence is an important concept for both the objective
and subjective interpretations of probability. In the objective interpretation, a

FIG. 1.2. Sufficiency and conditional independence.

conditional independence relation is a hypothesis about population frequencies


or perhaps about causation. In the subjective interpretation, it is a hypothesis
about a person's beliefs. It is also important for the constructive interpretation
of probability, but it does not play a large role in the purely computational issues
considered in this monograph.
1.5. Posterior distributions.

Suppose the probability distribution P on x expresses our beliefs about the values
of the variables in x. And suppose we now observe the values of the variables
in a subset w of x; we observe that w has the configuration c. How should this
change our beliefs about the remaining variables, the variables in x \ w?
    The standard answer is that we should change our beliefs by conditioning P
on w = c. This means that we should change our belief that x \ w = d from
P^↓(x\w)(d) to

    P(c.d) / P^↓w(c).                                               (1.13)

We call this number P's posterior probability for d given c. It exists only if
P^↓w(c) > 0, but we may suppose that if P^↓w(c) is zero we will not observe
w = c.
    Equation (1.13) defines a whole probability distribution -- a distribution on
x \ w that we may designate by P^{x\w|w=c}:

    P^{x\w|w=c}(d) = P(c.d) / P^↓w(c).                              (1.14)

We call this distribution P's posterior distribution for w = c. As the following


proposition notes, it is proportional to a subtable of P, and it is equal to a
subtable of any continuer of P from w to x.
    PROPOSITION 1.4. Suppose P is a probability distribution on x, w ⊆ x, and
c is a configuration of w such that P^↓w(c) > 0.
    1. P^{x\w|w=c} ∝ P|_{w=c}.
    2. If Q continues P from w to x, then P^{x\w|w=c} = Q|_{w=c}.
    Proof. Statement 1 follows from equation (1.14) and the definition of slice,
equation (1.4). Statement 2 follows from equations (1.8) and (1.14).


Sometimes it is convenient to consider the posterior probability distribution
not just for x \ w but for the entire set of variables x. This is the probability
distribution P^{|w=c} on x given by

    P^{|w=c}(e) = P(e) / P^↓w(c) if e^↓w = c, and P^{|w=c}(e) = 0 otherwise.   (1.15)

We will refer to P^{|w=c} as P's extended posterior distribution for w = c. It
consists mostly of zeros. The posterior for the remaining variables, P^{x\w|w=c}, is
related to P^{|w=c} in two ways. It is a slice:

    P^{x\w|w=c} = P^{|w=c}|_{w=c}.

And it is also a marginal:

    P^{x\w|w=c} = (P^{|w=c})^↓(x\w).

    Equation (1.15) says that P^{|w=c} is equal to the product of P and the function
on w that assigns the value 1/P^↓w(c) to the configuration c and the value 0 to
all other configurations. It follows that P^{|w=c} is proportional to the product
of P and a function on w that assigns 1 to c and 0 to all other configurations.
This point is sufficiently important to merit being stated in symbols. To this
end, we write I_{w=c} for the function on w that assigns 1 to c and 0 to all other
configurations:

    I_{w=c}(d) = 1 if d = c, and I_{w=c}(d) = 0 otherwise,
and we state the following proposition.
    PROPOSITION 1.5. If P is a probability distribution on x and c is a configuration of a subset w of x such that P^↓w(c) > 0, then

    P^{|w=c} ∝ P I_{w=c},

and the constant of proportionality is 1/P^↓w(c).
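Propositions 1.4 and 1.5 give a two-step recipe for conditioning: slice, then renormalize. A sketch (Python, illustrative names), conditioning Table 1.1 on Sex = female:

```python
def condition(P, variables, fixed):
    """The posterior on the remaining variables: slice P on w = c and
    renormalize by P↓w(c) (Propositions 1.4 and 1.5)."""
    keep = [i for i, v in enumerate(variables) if v not in fixed]
    sliced = {}
    for config, value in P.items():
        if all(config[i] == fixed[v]
               for i, v in enumerate(variables) if v in fixed):
            sliced[tuple(config[i] for i in keep)] = value
    total = sum(sliced.values())  # P↓w(c); must be positive
    return {k: v / total for k, v in sliced.items()}

# Table 1.1 again.
rows = {"young":       [.08, .16, .08, .02, .04, .02],
        "middle-aged": [.05, .05, .05, .00, .00, .00],
        "old":         [.05, .05, .05, .10, .10, .10]}
cols = [("female", p) for p in ("Democrat", "independent", "Republican")] + \
       [("male", p) for p in ("Democrat", "independent", "Republican")]
P = {(age, sex, party): p for age, probs in rows.items()
     for (sex, party), p in zip(cols, probs)}

post = condition(P, ("Age", "Sex", "Party"), {"Sex": "female"})
# P↓{Sex}(female) = .62, so the posterior for (young, independent) is .16/.62.
assert abs(post[("young", "independent")] - .16 / .62) < 1e-12
assert abs(sum(post.values()) - 1.0) < 1e-12
```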


In the following chapters, we will be interested in a probability distribution
P given in factored form, say

    P ∝ f_1 f_2 ⋯ f_k,                                              (1.17)

where the f_i are tables of reasonable size, but the number of variables involved
altogether is too large to allow the actual computation and storage of the table P.
(It will not be difficult to compute the value of P for a particular configuration,
at least if we know the constant of proportionality. But there may be too many
configurations for us to compute the value of P for all of them.) In this situation,
as we will see, we can often work from the factorization to find marginals for P,
even though we cannot compute P itself. We may also be interested in computing
marginals for posteriors of P, and therefore we will be interested in transforming


(1.17) into a factorization of the posterior. The following proposition tells us


how to do this.
    PROPOSITION 1.6. Suppose P is a probability distribution on x,

    P ∝ f_1 f_2 ⋯ f_k,

and c is a configuration of a subset w of x such that P^↓w(c) > 0. Suppose
w = {X_1, ..., X_n} and c = (c_1, ..., c_n). Then

    P^{x\w|w=c} ∝ f_1|_{w=c} f_2|_{w=c} ⋯ f_k|_{w=c}                (1.18)

and

    P^{|w=c} ∝ f_1 f_2 ⋯ f_k I_{X_1=c_1} ⋯ I_{X_n=c_n}.             (1.19)

    Proof. Equation (1.18) follows from statement 1 of Proposition 1.4, together
with the fact that a slice of a product is the product of the corresponding slices
of the factors.
    Equation (1.19) follows from Proposition 1.5, together with the fact that
I_{w=c} = I_{X_1=c_1} ⋯ I_{X_n=c_n}.
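Proposition 1.6 says that conditioning a factored distribution requires only slicing each factor; the full table need never be formed. A sketch (Python; the two small factors are a made-up example, not from the text):

```python
def slice_table(f, variables, fixed):
    # Generalized slice f|w=c: variables in `fixed` outside f's domain
    # are ignored (equation (1.5)).
    keep = [i for i, v in enumerate(variables) if v not in fixed]
    return {tuple(c[i] for i in keep): v for c, v in f.items()
            if all(c[i] == fixed[w]
                   for i, w in enumerate(variables) if w in fixed)}

# P ∝ f1·f2 on {A, B, C}, with f1 a table on {A, B} and f2 on {B, C}.
f1 = {("a1", "b1"): 2.0, ("a1", "b2"): 1.0,
      ("a2", "b1"): 1.0, ("a2", "b2"): 3.0}
f2 = {("b1", "c1"): 1.0, ("b1", "c2"): 2.0,
      ("b2", "c1"): 4.0, ("b2", "c2"): 1.0}

# Condition on B = b1 by slicing each factor separately (equation
# (1.18)); the table on {A, B, C} is never multiplied out.
g1 = slice_table(f1, ("A", "B"), {"B": "b1"})   # a table on {A}
g2 = slice_table(f2, ("B", "C"), {"B": "b1"})   # a table on {C}
unnorm = {(a, c): g1[(a,)] * g2[(c,)] for (a,) in g1 for (c,) in g2}
Z = sum(unnorm.values())
posterior = {k: v / Z for k, v in unnorm.items()}

# The unnormalized values are 2, 4, 1, 2, so Z = 9.
assert abs(posterior[("a1", "c2")] - 4 / 9) < 1e-12
assert abs(sum(posterior.values()) - 1.0) < 1e-12
```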

1.6. Expectation.

Most readers will be familiar with the idea of the expectation of a function
V on x with respect to a probability distribution P on x. This is a number,
usually denoted by E_P(V). In the discrete case, it is obtained by multiplying
corresponding values of P and V and adding the products. Thus

    E_P(V) = Σ_{c∈Ω_x} P(c) V(c).                                   (1.20)
Expectation generalizes to conditional expectation. If w is a subset of x, and
Q is a continuer of P from w to x, then we call the function E_P(V|w) on w given
by

    E_P(V|w) = (QV)^↓w                                              (1.21)

a conditional expectation of V given w (if we are out of breath, we may neglect
to say "with respect to P"). If P is strictly positive, so that it has only one
continuer from w to x, then the conditional expectation of V given w is also
unique; in fact, equation (1.21) can be written

    E_P(V|w) = (PV)^↓w / P^↓w.                                      (1.22)
The ratio on the right-hand side of (1.22) is unchanged if we substitute for P
any function f proportional to P.
If w is not empty, then the conditional expectation E_P(V|w) is a function,
not a single number. It assigns a value to every configuration c of w. Usually,
however, we write E_P(V|w = c) instead of (E_P(V|w))(c). If P^w(c) > 0, then
E_P(V|w = c) is uniquely defined; it is equal to (PV)^w(c)/P^w(c).
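As a concrete illustration, the ratio (PV)^w(c)/P^w(c) can be computed directly from small tables. The sketch below uses invented toy tables P and V on two binary variables; the helper names are ours, not the book's.

```python
# A toy distribution P and function V on variables x = (A, B),
# with A and B each taking values 0 and 1. (All numbers invented.)
P = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
V = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 4.0}

def marginal(table, keep):
    """Marginal on the variable positions in `keep` (sum out the rest)."""
    out = {}
    for config, value in table.items():
        key = tuple(config[i] for i in keep)
        out[key] = out.get(key, 0.0) + value
    return out

def cond_expectation(P, V, keep, c):
    """E_P(V | w = c) = (PV)^w(c) / P^w(c), where w is the positions in `keep`."""
    PV = {cfg: P[cfg] * V[cfg] for cfg in P}
    return marginal(PV, keep)[c] / marginal(P, keep)[c]

# Condition on the first variable A = 1:
# (PV)^A(1) = 0.3*2 + 0.4*4 = 2.2 and P^A(1) = 0.7, so E = 2.2/0.7.
print(cond_expectation(P, V, keep=(0,), c=(1,)))
```

As the text notes below (1.22), replacing P by any table proportional to P leaves this ratio unchanged, since the constant cancels in numerator and denominator.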

MULTIVARIATE PROBABILITY

1.7. Classifying probability distributions.

The probability distributions we have been studying are tabular. A tabular distribution is a table that gives a probability for each configuration. We will find
it useful to distinguish tabular distributions from algorithmic distributions. An
algorithmic distribution consists of an algorithm, together possibly with some
numerical information, that enables us to compute the probabilities of individual configurations. Algorithmic distributions can involve more or less complex
algorithms and more or less numerical information. At one extreme are distributions such as the Poisson, which are specified by a single number (the
mean in the case of the Poisson) and a simple formula. At another extreme
are the posterior distributions that arise in Bayesian statistics, which may involve many numbers and complicated algorithms. In the next few chapters,
we will be concerned with an intermediate case; we define a distribution for
a large number of variables as the product of many tables of numbers, each
involving only a few variables. Here there are many numbers but a simple
algorithm: multiply.
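The point can be made concrete with a small sketch (tables and names invented for illustration): storing only the factors, we can evaluate the product at any single configuration without ever building the joint table.

```python
# A distribution on four binary variables given in factored form
# P ∝ f1 f2 f3, where each factor touches only two variables.
# Each factor: (tuple of variable indices it involves, table as a dict).
factors = [
    ((0, 1), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6}),
    ((1, 2), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}),
    ((2, 3), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.9}),
]

def unnormalized_p(config):
    """Value of the product f1 f2 f3 at one configuration (a tuple of 0/1s)."""
    value = 1.0
    for variables, table in factors:
        value *= table[tuple(config[v] for v in variables)]
    return value

print(unnormalized_p((1, 0, 0, 1)))  # 0.4 * 0.7 * 0.5, i.e. 0.14 up to rounding
```

With many variables the full table has exponentially many entries, but each evaluation above touches only one entry per factor.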
The line between tabular and algorithmic distributions cuts across the line
between discrete and continuous distributions. A continuous distribution, like a
discrete distribution, can be either tabular or algorithmic. In the tabular case, we
store the values of the density at a sufficiently large number of configurations. In
the algorithmic case, we store instead a formula or algorithm that enables us to
compute the value of the density at any configuration. To some extent, the line
also cuts across the line between numerical and categorical variables. (Variables
like Age, Sex, and Party are called categorical, because they have categories
(e.g., young, old, and middle-aged) rather than numbers as possible values.)
Distributions for categorical variables are usually tabular, but distributions for
numerical variables can be tabular or algorithmic.
When an algorithmic distribution involves only a few numbers, we call the
numbers parameters, and we call the distribution parametric. The distributions
with names (Poisson, multinomial, Gaussian, and so on) are parametric.
The terms tabular, parametric, and algorithmic can be applied to conditionals
and other functions as well as to distributions. These terms can help us keep track
of complications involved in finding marginals and continuers of distributions and
in multiplying conditionals. Figure 1.3 shows the main points. When we compute
marginals, we generally stay in the same class of distributions; a marginal of a
table is a table, a marginal of a Gaussian is a Gaussian, and so on. A continuer
or posterior for a tabular distribution is tabular, but only in a few cases (such as
the multinomial and the Gaussian) do continuers or posteriors stay in the same
parametric family as their distributions. Multiplication usually takes us out of
the class of tabular distributions. Given a collection of tables for the same small
set of variables, we can perform the multiplication to obtain a new table, but
given tables for many different small sets of variables, the size of the frame for
all the variables may prevent us from computing and storing the product; we
may have to settle for thinking of the multiplication as an algorithm that allows
us to find the probability for a particular configuration when we want it.

FIG. 1.3. The effect of computation.

The distinction between tabular and algorithmic distributions is based on the


handling of probabilities or density values for individual configurations. It is only
probabilities for individual configurations that are explicitly stored by a tabular distribution; probabilities for sets of configurations must still be computed.
This emphasis on individual configurations is appropriate for expert systems,
but it is not appropriate for all applications of probability. It is inappropriate
for advanced mathematical probability, which is concerned with infinitely many
variables.
1.8. A limitation.

Though the multivariate framework for probability is widely used, it has its
limitations. A principal limitation is that it requires every variable to have a
value no matter how matters come out. This is often appropriate in statistical
work; in our example, every individual has an age and a sex, and we invent the
category "independent" so that every individual will have a party affiliation. It is
less appropriate in expert-system work, where the meaningfulness of a variable
often depends on the values of other variables. A particular medical test or
procedure only has a result if it is carried out, and we carry it out only for
some patients. A particular phoneme has a certain characteristic in the seventh
millisecond only if it lasts that long, and sometimes it may not. "Number of
pregnancies" is applicable only to women, not to men and children. We can
pretend that these variables always have values, but when there are many of
them, this is computationally awkward as well as artificial.
It is one thing to recognize this limitation and another to correct it. The
multivariate framework is flexible as well as expressive, and the obvious alternatives lack much of its flexibility. A tree, for example, allows us to represent
some variables as being meaningful only if others have certain values, but allows access to the variables only in a certain order. Consequently, most work in
probability, both theory and application, is carried out within the multivariate
framework, and extensions to the framework are developed and used on a fairly
ad hoc basis.
The graphical models that we will study in the following chapters are squarely
within the multivariate framework. For some ideas about going beyond it, see
Dempster [16] and Chapter 16 of Shafer [9].


Exercises.
EXERCISE 1.1. Derive the three properties of marginalization listed in 1.2
from equation (1.2).
EXERCISE 1.2. Here are some familiar problems, each with its own concept
of combination and its own concept of marginalization. Discuss, in each case,
how to formalize the problems so that the axioms of transitivity and combination
are satisfied.
1. Systems of equations (or, more generally, systems of constraints
on numerical variables) are combined by pooling and marginalized
(we usually say "reduced") by eliminating variables.
2. Linear programming problems can be combined by adding (or
perhaps multiplying) their objective functions and pooling their constraints. They can be reduced by maximizing their objective functions
over variables that are eliminated.
3. Discrete belief functions are combined by Dempster's rule and
marginalized by restricting the events for which beliefs are demanded.
(One formalization is provided by Shafer, Shenoy, and Mellouli [45]
and another by Shenoy and Shafer [48].)
In which of these problems do continuers exist?
EXERCISE 1.3. Fix a set of variables X, and consider all pairs of the form
(f, V), where f is a strictly positive table on some subset x of X, and V is an
arbitrary table on the same set of variables x. Call x the domain of (f, V).
Define multiplication for such pairs by setting

(f1, V1)(f2, V2) = (f1 f2, V1 + V2).
Define marginalization by setting

(f, V)^w = (f^w, (fV)^w / f^w).
Show that these operations satisfy the axioms of transitivity and combination.
(Compare equation (1.22).) This example, suggested to the author by Robert
Cowell, is relevant to computation in decision theory, where f may represent
a probability distribution and V may represent a utility function.
EXERCISE 1.4. Consider a function f on a set of variables x, together with a
collection h_X, X ∈ x, of functions on the individual variables in x. For each subset
w of x, let f^{⇒w} be the marginal on w of the function obtained by multiplying f
by the h_X for X not in w. In symbols,

f^{⇒w} = (f ∏_{X ∈ x\w} h_X)^w.

The function f^{⇒w} is called the out-marginal of f on w, since it involves leaving
certain factors out (Cowell and Dawid [25]).
Show that out-marginalization and multiplication satisfy the axioms of transitivity and combination. What is the meaning of out-marginalization in the
context of equation (1.19)?
EXERCISE 1.5. The numerical functions on a given set of discrete variables
and its subsets form a commutative semigroup under multiplication. The sets of
variables themselves form a lattice. Each element of the semigroup is labeled by
an element of the lattice. Marginalization reduces an element of the semigroup
to an element with a smaller label.
Formulate axioms of transitivity and combination in the abstract setting of a
commutative semigroup and associated lattice. Give examples where continuers
do and do not exist.
EXERCISE 1.6. In unpublished work [28], A. P. Dempster has shown how the
Kalman filter can be understood in terms of the combination and multiplication
of belief functions. Dempster calls the belief functions involved normal belief
functions. A normal belief function on a given linear space of variables consists
of a linear functional and an inner product on a subspace of the linear space.
Intuitively, the linear functional tells the expected values of variables in the subspace, and the inner product tells their covariances. Marginalization amounts
to restricting the linear functional and inner product to a yet smaller subspace.
Combination is most easily described in the dual of the linear space of variables,
the linear space of configurations. Here the normal belief function looks like an
inner product (the dual of the covariance inner product) on a hyperplane, and
combination amounts to intersecting hyperplanes and adding the inner products.
Verify that the axioms of transitivity and combination are satisfied in this
geometric framework.

CHAPTER 2

Construction Sequences

Under certain conditions on the heads and tails of a sequence of conditionals, the
product of the conditionals will be a probability distribution. We call a sequence
of conditionals satisfying these conditions a construction sequence.
As we will see, the conditionals in a construction sequence are continuers for
the probability distribution obtained by multiplying them together. Initial segments of the sequence produce marginals of this probability distribution. Thus
the construction sequence represents a step-by-step construction of the probability distribution.
After constructing a probability distribution, we may want to find a marginal
for it or one of its posteriors. This may be difficult computationally, especially
if the joint frame of all the variables is too large to permit us to carry out the
multiplication of the conditionals. Were we able to carry out this multiplication,
we could store the resulting table and work directly with it to find marginals.
But if we are obliged to keep the probability distribution stored as a product of
tables, then we must look for less direct methods.
In some cases, as we will see in this chapter, a computationally inexpensive
adaptation of a construction sequence will produce a construction sequence for
the marginal we desire. To obtain the marginal for the variables in an initial
segment of a construction sequence, we need only omit the later factors from the
construction sequence. To obtain the posterior for later variables given values
of the variables in an initial segment, we need only slice the later factors. If the
construction sequence is a chain, then we can find a construction sequence for
the variables in a final segment by a simple forward propagation. The general
case, however, requires the more general methods that we will study in the next
chapter, methods that apply to any distribution stored as a product of tables,
whether or not the tables form a construction sequence.
If each new conditional in a construction sequence involves a single new variable, then the most essential qualitative aspects of the construction sequence
can be represented by a directed acyclic graph (DAG). Such graphs have been
widely used for knowledge acquisition for probabilistic expert systems, and on
the theoretical side, they have been studied as a representation of conditional independence relations (Pearl [8]). Here we emphasize the value of DAGs for representing alternative construction sequences, construction sequences that use the
TABLE 2.1
Q1, a probability distribution for Age. (This is a conditional with an empty tail and with
Age as its head.)

young        .40
middle-aged  .15
old          .45

TABLE 2.2
Q2, a conditional with Age as its tail and Sex as its head.

             female  male
young        4/5     1/5
middle-aged  1       0
old          1/3     2/3

TABLE 2.3
Q1Q2, a probability distribution for Age and Sex.

             female  male
young        .32     .08
middle-aged  .15     .00
old          .15     .30
same conditionals but order them differently. By bringing these alternative orderings into the picture, a DAG enlarges the number of marginals and posteriors
that we can find by simple manipulations. In the general case, where each new
conditional is allowed to involve more than one new variable, we can similarly
indicate alternative orderings with a bubble graph, which is slightly more general
than a DAG.
2.1. Multiplying conditionals.

Table 2.1 gives a probability distribution Q1 for Age (its single column adds to
one), and Table 2.2 gives a conditional Q2 for Sex given Age (each row adds to
one). When we multiply these two tables, we get Table 2.3, which qualifies as a
probability distribution for Age and Sex (its six entries add to one). Notice that
Q1 is a marginal of this probability distribution, and hence Q2 is a continuer.
We need not carry out the numerical multiplication in order to see that the
product Q1Q2 is a probability distribution. We can instead perform an abstract
computation:

(2.1)  Σ_{Age,Sex} Q1Q2 = Σ_{Age} Σ_{Sex} Q1Q2 = Σ_{Age} Q1 (Σ_{Sex} Q2) = Σ_{Age} Q1 = 1.

Here we have first broken the summation into a summation over Sex followed
by a summation over Age. Since Q1 does not involve Sex, it can be factored out
of the first summation, leaving Q2, which sums to one over Sex because it is a
conditional. This leaves us with the sum of Q1 over Age, which is one because
Q1 is a probability distribution.
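The numerical multiplication itself is a one-line computation. The following sketch reproduces Table 2.3 from Tables 2.1 and 2.2, using exact arithmetic via Python fractions (the variable names are ours):

```python
# Multiply Q1 (Table 2.1) by the conditional Q2 (Table 2.2) to obtain the
# joint distribution Q1Q2 of Table 2.3, and check that it sums to one.
from fractions import Fraction as F

Q1 = {"young": F(40, 100), "middle-aged": F(15, 100), "old": F(45, 100)}
Q2 = {  # rows indexed by Age, columns by Sex; each row sums to one
    "young":       {"female": F(4, 5), "male": F(1, 5)},
    "middle-aged": {"female": F(1),    "male": F(0)},
    "old":         {"female": F(1, 3), "male": F(2, 3)},
}

joint = {(age, sex): Q1[age] * Q2[age][sex]
         for age in Q1 for sex in ("female", "male")}

print(joint[("young", "female")])  # 8/25, i.e. .32 as in Table 2.3
print(sum(joint.values()))         # 1
```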
Consider more generally any two conditionals Q1 and Q2. Write t_i for the
tail, h_i for the head, and d_i for the domain of Q_i. (Recall that d_i = t_i ∪ h_i.) Our
example generalizes to the following proposition.
PROPOSITION 2.1. Suppose t1 is empty, t2 is contained in d1, and h2 is
disjoint from d1.
1. The product Q1Q2 is a probability distribution on d1 ∪ d2.
2. The conditional Q1 is Q1Q2's marginal on d1.
3. The conditional Q2 continues Q1Q2 from d1 to d1 ∪ d2.
Proof. Since we do not have symbols for individual variables, we will not use
summations like those in equation (2.1); instead, we will use our notation for
marginalization. We prove statement 1 by writing

(Q1Q2)^∅ = ((Q1Q2)^{d1})^∅ = (Q1 Q2^{t2})^∅ = (Q1)^∅ = 1.

Here we have used both the transitivity and the combination axioms.
Since Q1 has an empty tail, it is a probability distribution. By the combination axiom,

(Q1Q2)^{d1} = Q1 Q2^{d1 ∩ d2} = Q1 Q2^{t2} = Q1.

Thus Q1 is Q1Q2's marginal on d1, and therefore, by the definition of continuer,
Q2 continues Q1Q2 from d1 to d1 ∪ d2.
Now consider a sequence of n conditionals, Q1, ..., Qn. Proposition 2.1 generalizes, by induction, as follows.
PROPOSITION 2.2. Suppose t1 is empty. Suppose t_i is contained in d1 ∪ ··· ∪
d_{i-1} and h_i is disjoint from d1 ∪ ··· ∪ d_{i-1} for i = 2, ..., n.
1. Q1 ··· Qn is a probability distribution with domain d1 ∪ ··· ∪ dn.
2. For i = 1, ..., n-1, Q1 ··· Qi is the marginal of Q1 ··· Qn on d1 ∪ ··· ∪ di.
3. For i = 2, ..., n, Qi continues Q1 ··· Qn from d1 ∪ ··· ∪ d_{i-1} to d1 ∪ ··· ∪ di.
4. More generally, if 1 ≤ i ≤ j ≤ n, then Qi ··· Qj continues Q1 ··· Qn from
d1 ∪ ··· ∪ d_{i-1} to d1 ∪ ··· ∪ dj.
When the hypotheses of Proposition 2.2 are satisfied, we call the sequence
Q1, ..., Qn a construction sequence for the probability distribution Q1 ··· Qn,


FIG. 2.1. Left: the first tail is empty. The second tail is contained in the first domain,
and the second head is disjoint from the first domain. Right: two more head-tail pairs have
been added. Each time, the new tail is contained in the existing domain, and the new head is
disjoint from it.

and we say that the construction sequence represents this probability distribution. The restrictions on the head-tail structure of a construction sequence are
illustrated in Figure 2.1.
Statement 2 of Proposition 2.2 indicates one way that we can exploit a construction sequence. If we are interested only in the variables in d1 ∪ ··· ∪ di and
not in the remaining variables, those in h_{i+1} ∪ ··· ∪ hn, then we can simply
omit the last n - i conditionals from the construction sequence: Q1, ..., Qi is a
construction sequence for the marginal probability distribution on d1 ∪ ··· ∪ di.
Another way to exploit a construction sequence is to fix the values of variables
we have observed. If these variables appear at the beginning of the construction
sequence, then this produces a construction sequence for the posterior distribution.
PROPOSITION 2.3. Suppose Q1, ..., Qn is a construction sequence. Suppose
1 ≤ i < n. Write d for ∪_{j=1}^n h_j, the domain of Q1 ··· Qn, and write t for ∪_{j=1}^i h_j,
the domain of Q1 ··· Qi. Suppose c is a configuration of t. Then

(Q1 ··· Qn)^{t=c} = (Q_{i+1})_{t=c} ··· (Qn)_{t=c}.

Proof. By statement 4 of Proposition 2.2, Q_{i+1} ··· Qn continues Q1 ··· Qn
from t to d. So the proposition follows from Proposition 1.4, together with the
fact that a slice of a product is equal to the product of the corresponding slices
of the factors.
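The slicing operation in Proposition 2.3 is easy to sketch in code. The representation below (configurations stored as tuples of variable-value pairs) and the helper name are invented for illustration:

```python
# Slicing a conditional's table on observed values: keep only the entries
# consistent with the observations, and drop the observed variables.

def slice_table(table, observed):
    """Slice `table` on the values in `observed` ({variable: value})."""
    out = {}
    for config, value in table.items():
        cfg = dict(config)
        # keep this entry only if it agrees with every observed variable
        if all(cfg.get(v, val) == val for v, val in observed.items()):
            remaining = tuple(sorted((v, x) for v, x in cfg.items()
                                     if v not in observed))
            out[remaining] = value
    return out

# A toy conditional for C given A, with configurations as pair-tuples.
q = {(("A", 0), ("C", 0)): 0.9, (("A", 0), ("C", 1)): 0.1,
     (("A", 1), ("C", 0)): 0.2, (("A", 1), ("C", 1)): 0.8}

print(slice_table(q, {"A": 1}))  # {(('C', 0),): 0.2, (('C', 1),): 0.8}
```

Applying this to each of the later conditionals Q_{i+1}, ..., Qn, with the initial-segment variables fixed to c, yields the sliced factors in the proposition.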

2.2. DAGs and belief nets.

The expert-systems literature has devoted considerable attention to construction
sequences that add one new variable at a time, i.e., construction sequences in
which each head consists of a single variable. In this case, we can write

(2.2)  P = Q1 Q2 ··· Qn,

where P is the probability distribution being constructed, Xi is the single variable
in the head of Qi, and t_i ⊆ {X1, ..., X_{i-1}}. We began the chapter with an
example of equation (2.2):

(2.3)  P_{Age,Sex} = Q1 Q2.


TABLE 2.4
A conditional for Party given Age.

             Dem  Ind  Rep
young        1/4  1/2  1/4
middle-aged  1/3  1/3  1/3
old          1/3  1/3  1/3

We leave it to the reader to check that if we also multiply in the conditional Q3
given by Table 1.5, then we obtain the probability distribution P_{Age,Sex,Party}
given by Table 1.1:

(2.4)  P_{Age,Sex,Party} = Q1 Q2 Q3.

Notice that if we use instead the conditional Q'3 given by Table 2.4, then we
obtain the same probability distribution P_{Age,Sex,Party}:

(2.5)  P_{Age,Sex,Party} = Q1 Q2 Q'3.
Like equation (2.3), equations (2.4) and (2.5) represent one-new-variable-at-a-time construction sequences.
When one new variable is added at a time, the head-tail structure of the
construction sequence can be represented by a directed acyclic graph (DAG for
short). This graph has the variables as nodes, and it has arrows to Xi from
each element of t_i, for i = 2, ..., n. We call this graph directed because the
links between the nodes are arrows, and we call it acyclic because there are no
cycles following the arrows.[5] (Since the arrows we draw to each Xi are all from
Xj with j < i, any path following the arrows always goes in the direction of
increasing indices; it cannot cycle back to a smaller index.) Figure 2.2 shows
DAGs for the construction sequences represented by equations (2.3), (2.4), and
(2.5), respectively. Figure 2.3 shows the DAG for the more complex construction
sequence represented by the equation

The middle graph in Figure 2.2 and the graph in Figure 2.3 both have cycles,
but not cycles following the arrows. The cycle X1, X3, X4, X1 in Figure 2.3, for
example, goes against an arrow on its last step.
A belief net is a finite DAG with variables as nodes, together with, for each
node X, a conditional that has X as its head and X's immediate predecessors
[5] Some authors prefer the name acyclic directed graph in order to emphasize that only
directed cycles are forbidden; a path that does not always follow the arrows is allowed to be a
cycle. But the name directed acyclic graph and the acronym DAG are strongly established in
the literature.


FIG. 2.2. DAGs for the numerical example.

FIG. 2.3. A more complex DAG.

in the DAG as its tail.[6] We have just explained how a construction sequence
determines a belief net. It is also true that the conditionals in a belief net can
always be ordered so as to form a construction sequence. This follows from the
following lemma.
LEMMA 2.1. The nodes of a finite DAG can always be ordered so that each
variable's immediate predecessors in the DAG precede it in the ordering. In other
words, we can find an ordering X1, ..., Xn such that the immediate predecessors
of Xi in the DAG are a subset of {X1, ..., X_{i-1}}. (In particular, X1 has no
predecessors in the DAG.)
Proof. The simplest proof is by induction on n, the number of variables in
the DAG. There is at least one node in the DAG that has no successors; if
every node had a successor, then we could form a cycle by going from each node
to a successor until (because there are only finitely many nodes) we repeated
ourselves. If we choose a node with no successors as Xn, and if we then remove
this node and the arrows to it, then we obtain a DAG with only n - 1 nodes
which, by the inductive hypothesis, has an ordering X1, ..., X_{n-1} satisfying the
condition. The ordering X1, ..., Xn then also satisfies the condition.
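The ordering promised by Lemma 2.1 is what computer scientists call a topological ordering, and the proof's remove-a-node argument translates directly into code. A minimal sketch (the mirror-image variant that repeatedly picks a node with no remaining predecessors; the data structures are invented):

```python
# Topological sort of a finite DAG, given as {node: set of immediate
# predecessors}. Assumes the graph is acyclic; a cycle would leave
# `pending` nonempty with no predecessor-free node.

def construction_ordering(predecessors):
    pending = {n: set(p) for n, p in predecessors.items()}  # working copy
    ordering = []
    while pending:
        # a node all of whose predecessors are already placed must exist
        node = next(n for n, preds in pending.items() if not preds)
        ordering.append(node)
        del pending[node]
        for preds in pending.values():
            preds.discard(node)
    return ordering

# An invented four-node DAG (not the DAG of Figure 2.3).
dag = {"X1": set(), "X2": {"X1"}, "X3": {"X1"}, "X4": {"X2", "X3"}}
print(construction_ordering(dag))  # ['X1', 'X2', 'X3', 'X4']
```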
We may call an ordering of the nodes of a DAG that satisfies the conditions
of Lemma 2.1 a DAG construction ordering. Unless a DAG is merely a chain,
it has more than one DAG construction ordering. The DAG in Figure 2.3, for
example, has five:

[6] A variety of other names are also in use, including Bayesian network and graphical model.


Every DAG construction ordering for the DAG of a belief net gives, of course,
an ordering of its conditionals that is a construction sequence for the probability distribution represented by the belief net. Thus the five DAG construction
orderings we just listed produce five construction sequences for the probability distribution in equation (2.6), five ways to permute the Qi and still have a
construction sequence.
We can talk about a belief net representing a probability distribution, without
reference to any particular construction sequence: a belief net represents a probability distribution P if P is equal to the product of the conditionals attached
to its DAG. We can also talk about a DAG by itself representing a probability
distribution: a DAG represents P if by attaching appropriate conditionals we
can make it into a belief net representing P, i.e., if P factors into conditionals
in the way indicated by the DAG.
Considered abstractly, a belief net represents a probability distribution more
concisely than a construction sequence does. It provides the same conditionals,
but it refrains from ordering them completely. For this reason, belief nets are
considered more fundamental than construction sequences in much of the literature on probabilistic expert systems. As a practical matter, however, belief nets
arise from a step-by-step construction that provides a complete ordering, and
we usually preserve this ordering when we store a belief net. Moreover, as we
will see in the next section, there is no practical advantage in considering only
construction sequences that introduce one new variable at a time. So in this
monograph, we take construction sequences as fundamental, and we treat belief
nets as secondary tools, tools that help us see alternative orderings for particular one-new-variable-at-a-time construction sequences. In small problems, where
we can actually draw the DAG, it enables us to see alternative orderings at a
glance. In larger problems, the idea of the DAG reminds us of the existence of
alternative orderings.
Marginals and posteriors. From a computational point of view, the alternative construction sequences that we can discern by studying a DAG are important
because they broaden the application of Propositions 2.2 and 2.3. Since we can
apply these propositions to any construction sequence consistent with the DAG,
we can obtain construction sequences for a much larger class of marginals and
posteriors than we can obtain by working with a single construction sequence.
Propositions 2.2 and 2.3 are concerned with initial segments of a construction
sequence. We may also talk about initial segments of a DAG. We say that a set
w of nodes of a DAG is an initial segment of the DAG if all the immediate
predecessors of each element of w are also in w.
LEMMA 2.2. A set w of nodes in a finite DAG is an initial segment of the
DAG if and only if the DAG has a DAG construction ordering X1, ..., Xn such
that

(2.7)  w = {X1, ..., Xk}

for some k.


Proof. It is obvious that if a DAG construction ordering satisfying the two
conditions exists, then w is an initial segment in the DAG. To derive the existence
of such an ordering from the assumption that w is an initial segment in the DAG,
we adapt the proof of Lemma 2.1. We argue by induction on m, the number of
nodes not in w. If m = 0, then the ordering exists by Lemma 2.1. If m ≠ 0, i.e.,
w does not include all the nodes in the DAG, then there is at least one node
outside w that has no successors, for if every node outside w had a successor, this
successor would also be outside w, and we could form a cycle of nodes outside w
by going from each node to a successor until we repeated ourselves. If we choose
a node that lies outside w and has no successors as Xn, and if we then remove
this node and the arrows to it, then we obtain a DAG with only m - 1 nodes
outside w which, by the inductive hypothesis, has a DAG construction ordering
X1, ..., X_{n-1} satisfying (2.7). By adding Xn to the end of this ordering, we
obtain a DAG construction ordering X1, ..., Xn for the original DAG that also
satisfies (2.7).
The definition of initial segment in a DAG, together with Lemma 2.2 and
Propositions 2.2 and 2.3, yields the following proposition.
PROPOSITION 2.4. Suppose w is an initial segment of a belief net that represents a probability distribution P.
1. Suppose we delete the nodes not in w, together with the arrows to them and
the conditionals associated with them. Then the resulting belief net represents P's
marginal on w.
2. Suppose c is a configuration of w. Suppose we delete the nodes in w,
together with the arrows from them and the conditionals associated with them,
and suppose we change the conditional on each of the remaining nodes by slicing
it on w = c. Then the resulting belief net represents P's posterior given w = c.
The simplicity and visual clarity of this proposition account for much of the
appeal of belief nets.
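Statement 1 of Proposition 2.4 amounts to a simple pruning of the stored net. A sketch using the chapter's Age-Sex-Party example (the dictionary representation and helper names are invented for illustration):

```python
# A belief net stored as {node: (set of immediate predecessors, conditional)}.
# Restricting to an initial segment w just drops the other nodes, their
# arrows, and their conditionals.

def is_initial_segment(predecessors, w):
    """True if every immediate predecessor of a node of w is also in w."""
    return all(predecessors[node] <= w for node in w)

def restrict(net, w):
    """Belief net for the marginal on the initial segment w."""
    assert is_initial_segment({n: p for n, (p, _) in net.items()}, w)
    return {node: net[node] for node in w}

net = {"Age":   (set(), "Q1"),
       "Sex":   ({"Age"}, "Q2"),
       "Party": ({"Age"}, "Q3")}
print(restrict(net, {"Age", "Sex"}))  # a belief net for the marginal on Age, Sex
```

Here {"Age", "Sex"} is an initial segment, so the restricted net represents the marginal P_{Age,Sex}; the singleton {"Sex"} is not, since its predecessor Age lies outside it.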
Proposition 2.4 can be thought of as a statement about alternative construction sequences. It says that if we begin with one construction sequence (the one
we used to construct the belief net), then we can shift to an alternative one to
get marginals and conditionals. We can say this without reference to the belief
net as follows.
PROPOSITION 2.5. Suppose Q1, ..., Qn is a one-new-variable-at-a-time construction sequence for a probability distribution P. Suppose i1, ..., ik is a sequence of distinct integers between 1 and n such that t_{i1} is empty and t_{ij} is
contained in {X_{i1}, ..., X_{i_{j-1}}} for j = 2, ..., k. Write w for {X_{i1}, ..., X_{ik}}.
1. Q_{i1}, ..., Q_{ik} is a construction sequence for P^w.
2. Suppose c is a configuration of w. Suppose we modify the sequence
Q1, ..., Qn by deleting each Q_{ij} and by slicing each of the other conditionals
on w = c. Then the result is a construction sequence for P's posterior given
w = c.
Forward propagation in chains. As we have seen, it is trivial to reduce a
belief net to a belief net for an initial segment. If the belief net is a chain, then
with a bit of work we can also reduce it to a belief net for a final segment.


FIG. 2.4. A belief chain.

We call a DAG a chain if its nodes can be ordered, as in Figure 2.4, so that
the first has no immediate predecessors in the DAG and each of the others has
its predecessor in the ordering as its only immediate predecessor in the DAG.
Notice that a chain has only one DAG construction ordering: X1, ..., Xn is the
unique DAG construction ordering for the chain X1 → ··· → Xn.
We call a belief net a belief chain if its DAG is a chain. Thus a belief chain
consists of a chain X1 → ··· → Xn and corresponding conditionals Q1, ..., Qn.
The first conditional has X1 as its head and an empty tail; the ith conditional
has Xi as its head and X_{i-1} as its tail. The idea of forward propagation in such
a chain is based on the following lemma.
LEMMA 2.3. In a belief chain,

(2.8)  (Q1 Q2 ··· Qn)^{{X2,...,Xn}} = (Q1Q2)^{{X2}} Q3 ··· Qn.

Thus (Q1Q2)^{{X2}}, Q3, ..., Qn is a construction sequence for the marginal on
{X2, ..., Xn}.
Proof. Since {X2} is the intersection of {X2, ..., Xn} with the domain of
Q1Q2, equation (2.8) is an instance of the combination axiom.
By applying Lemma 2.3 repeatedly, we can reduce our initial construction
sequence Q1, ..., Qn to a construction sequence for any final segment of the
belief chain. Indeed, once we have a construction sequence R_i, Q_{i+1}, ..., Qn for
X_i → ··· → Xn, we can obtain a construction sequence R_{i+1}, Q_{i+2}, ..., Qn for
X_{i+1} → ··· → Xn by setting R_{i+1} = (R_i Q_{i+1})^{{X_{i+1}}}.
The point of this step-by-step computation is that the tables will generally be
small enough for it to be implemented. In theory, we can move directly from the
construction sequence Q1, ..., Qn to a construction sequence for the marginal
on {X_i, ..., Xn}, for the combination axiom implies that

(Q1 ··· Qn)^{{X_i,...,Xn}} = (Q1 ··· Q_i)^{{X_i}} Q_{i+1} ··· Qn.

But Q1 ··· Q_i may be too large a table to compute.
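One forward-propagation step R_{i+1} = (R_i Q_{i+1})^{{X_{i+1}}} can be sketched with toy numbers (all names and tables invented): each step multiplies a small marginal into the next conditional and sums out the old variable, so the work per step is the size of one conditional's table.

```python
# One step of forward propagation in a belief chain.

def step(r, q):
    """r: marginal on X_i as {state: prob}; q: conditional for X_{i+1} given
    X_i as {state_i: {state_{i+1}: prob}}. Returns the marginal on X_{i+1}."""
    out = {}
    for s_i, p in r.items():
        for s_next, cond in q[s_i].items():
            # multiply the marginal into the conditional and sum out X_i
            out[s_next] = out.get(s_next, 0.0) + p * cond
    return out

r1 = {"a": 0.5, "b": 0.5}                                  # marginal of X1
q2 = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.2, "b": 0.8}}  # X2 given X1
r2 = step(r1, q2)
print(r2)  # marginal of X2: 0.55 on "a" and 0.45 on "b", up to rounding
```

Iterating `step` down the chain never builds a table larger than one conditional, which is the whole point of the step-by-step computation.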


Markov chains and hidden Markov models. Readers familiar with the
theory of Markov chains may find it illuminating to note that a finite Markov
chain is a special kind of belief net. It is a belief chain such that each variable has
the same frame and all the conditionals after the first are identical. Figure 2.5
shows a simple Markov chain.
Most of the theory of Markov chains is concerned with their repetitive nature
and hence does not extend to belief nets in general or even to belief chains in
general. For example, a Markov chain is sometimes described in terms of its state
graph. This is a directed graph (not usually acyclic) with the states (elements of
the common frame) as nodes and with an arrow from state i to state j whenever
the (i,j)th entry of the common conditional is positive. (Figure 2.6 shows the


FIG. 2.5. A Markov chain.

FIG. 2.6. The state graph for the Markov chain in Figure 2.5.

state graph for the Markov chain of Figure 2.5.) In general, we cannot draw a
state graph for a belief chain because the successive variables may have different
frames. Even if the frames are the same, the possible transitions or at least their
probabilities will vary.
In recent years, considerable use has been made of belief nets of a type
slightly more general than Markov chainshidden Markov models. To form a
hidden Markov model, we begin with a Markov chain, say X\ > + Xn, and
from each node Xi we add an arrow to a new node, say Yi, so as to obtain a
DAG as in Figure 2.7. All the Yi have the same frame (possibly different from the
frame for the Xi) and the same conditional. In applications, the Yi are observed,
while the Xi are notthe Markov chain X\ > - - Xn is hidden. We are
interested in rinding posterior probabilities for the Xi, We may, for example,
want to find the most likely configuration of Xi,... ,Xn. Since the Yi do not
form an initial segment of the belief net, we cannot use Proposition 2.4 to find
posterior probabilities for the Xi. But efficient methods for finding posterior
probabilities (and for finding most likely configurations) have been developed in
the literature on hidden Markov models, and these methods, as it turns out, are
special cases of more general methods that we will study in Chapter 3.
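One such method from the hidden-Markov-model literature is the forward-backward pair of recursions, which yields the posterior marginal of each hidden Xi given the observed values of the Yi. The sketch below is our own illustration, not the text's notation; the toy numbers are invented:

```python
# Hedged sketch of the forward-backward recursions for a hidden Markov
# model.  `init` is the initial distribution of X1, `trans` the common
# conditional of the chain, `emit` the common conditional of the Yi.

def forward_backward(init, trans, emit, obs):
    n, S = len(obs), len(init)
    # forward: alpha[i][s] is proportional to P(y1..y(i+1), X(i+1) = s)
    alpha = [[init[s] * emit[s][obs[0]] for s in range(S)]]
    for y in obs[1:]:
        prev = alpha[-1]
        alpha.append([emit[s][y] * sum(prev[r] * trans[r][s] for r in range(S))
                      for s in range(S)])
    # backward: beta[i][s] is proportional to P(y(i+2)..yn | X(i+1) = s)
    beta = [[1.0] * S for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = [sum(trans[s][r] * emit[r][obs[i + 1]] * beta[i + 1][r]
                       for r in range(S)) for s in range(S)]
    posteriors = []
    for a, b in zip(alpha, beta):
        w = [x * y for x, y in zip(a, b)]
        z = sum(w)
        posteriors.append([x / z for x in w])
    return posteriors

init = [0.5, 0.5]                      # uniform start
trans = [[0.9, 0.1], [0.1, 0.9]]       # sticky two-state chain
emit = [[0.8, 0.2], [0.2, 0.8]]        # Yi tends to copy Xi
print(forward_backward(init, trans, emit, [0, 0, 1]))
```

Each pass touches one node of the chain at a time, which is why the cost grows only linearly in n.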
Figure 2.7 represents only the simplest type of hidden Markov model; in
practice, the model is elaborated in various ways. One common elaboration
involves attaching more than one observable variable to each Xi. There may be
a fixed number of observable variables for each Xi, or this number itself may be
an observable variable. In speech recognition, for example, each Xi represents

CONSTRUCTION SEQUENCES


FIG. 2.7. A hidden Markov model.

a phoneme, and we observe features of the sound every successive millisecond


that the phoneme lasts. Since the length of a phoneme varies, the number of
observations will vary; it itself will be an observed variable. Strictly speaking,
this takes us outside the framework of the belief net; it even takes us outside
the multivariate framework. Fortunately, the computational methods needed are
natural extensions of the multivariate methods we will study in Chapter 3.
2.3. Bubble graphs.

Though the visual clarity of belief nets is very attractive, there is no practical reason to limit ourselves to construction sequences involving only one new
variable at a time. All the computational ideas we considered in the preceding
section generalize to the general case, and we can also generalize the graphical
representation itself.
The simplest graphical representation of a general construction sequence is
the bubble graph. This graph has a node for each conditional. This node, called
a bubble, contains all the variables in the head and has an arrow to it from each
variable in the tail. Figure 2.8 shows a bubble graph for a construction sequence
for ten variables:

A bubble graph is acyclic in the same sense that a DAG is acyclic: we cannot
go in a cycle following the arrows. Moreover, a bubble graph, like a DAG,
permits us to pick out alternative construction orderings for the nodes, i.e.,
alternative construction sequences for the probability distribution. In Figure 2.8,
for example, the bubbles can be ordered in seven different ways:

And hence there are seven ways of ordering the conditionals to form a construction sequence:

FIG. 2.8. A bubble graph.

Marginals and posteriors. In the general case, as in the one-new-variable-at-a-time case, we can exploit alternative construction sequences to find prior
marginals for initial segments or posterior marginals given initial segments, and
we can propagate forward in chains to find prior marginals for final segments.
The idea of initial segments is defined for bubble graphs just as for DAGs,
and Proposition 2.4 continues to hold. Translating this proposition into a direct statement about alternative construction sequences, we get the following
generalization of Proposition 2.5.
PROPOSITION 2.6. Suppose Q1, . . . , Qn is a construction sequence for P.
Suppose i1, . . . , ik is a sequence of distinct integers between 1 and n such that ti1
is empty and tij is contained in hi1 ∪ · · · ∪ hij−1 for j = 2, . . . , k. Write w for
hi1 ∪ · · · ∪ hik.
1. Qi1, . . . , Qik is a construction sequence for P↓w.
2. Suppose c is a configuration of w. Suppose we modify the sequence
Q1, . . . , Qn by deleting each Qij and by slicing each of the other conditionals
on w = c. Then the result is a construction sequence for P's posterior given
w = c.
A construction sequence Q1, . . . , Qn is a construction chain if each ti is contained in hi−1 for i = 2, . . . , n. Figure 2.9 shows a bubble graph for a construction
chain: the bubbles are ordered, and each bubble has arrows only from variables
in the preceding bubble.
Lemma 2.3 generalizes as follows.
LEMMA 2.4. Suppose Q1, . . . , Qn is a construction chain. Then

Thus (Q1Q2)↓h2, Q3, . . . , Qn is a construction chain for the marginal on h2 ∪ · · · ∪ hn.
Forward propagation proceeds, based on this lemma, just as in the one-new-variable-at-a-time case: from the sequence Ri, Qi+1, . . . , Qn for the marginal on
hi ∪ · · · ∪ hn, we obtain the sequence Ri+1, Qi+2, . . . , Qn for the marginal on
hi+1 ∪ · · · ∪ hn by setting Ri+1 = (RiQi+1)↓hi+1.

FIG. 2.9. A bubble graph for a chain.

FIG. 2.10. The join chain for Figure 2.9.
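When each conditional in the chain can be represented as a transition matrix from the frame of the previous head to the frame of the next, this forward-propagation step amounts to a repeated vector-matrix product. A minimal sketch with invented toy numbers:

```python
# Hedged sketch of forward propagation in a construction chain: each
# conditional is a transition matrix, and the next marginal is obtained
# by multiplying the current marginal by the next conditional and
# summing the old head out, i.e. a vector-matrix product.

def propagate_forward(r, chain):
    """r: marginal on the first head; chain: list of transition matrices."""
    for Q in chain:
        r = [sum(r[a] * Q[a][b] for a in range(len(r)))
             for b in range(len(Q[0]))]
    return r

r1 = [0.3, 0.7]          # marginal on the first head
Q2 = [[0.5, 0.5],
      [0.2, 0.8]]        # conditional for the second head given the first
print(propagate_forward(r1, [Q2]))  # [0.29, 0.71]
```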
Figure 2.10 shows an alternative to the bubble graph in Figure 2.9. Here,
instead of showing arrows from the individual variables, we put these variables in
the following bubble. They can still be identified; they constitute the intersection
of the two bubbles. A graph of the type shown in Figure 2.10 is called a join
chain. It has the property that the variables that a given node has in common
with any of the preceding nodes are all in the immediately preceding node. In the
next chapter, we will generalize the idea of a join chain to the idea of a join tree.
A more difficult example. For a concrete example of a construction sequence for which we cannot so easily find the marginals we want, consider the
external audit of an organization's financial statement. Figure 2.11 sketches, in
a simplified form, the structure of the evidence in one such audit. The auditor
is concerned with the accounts receivable, and she has distinguished between
the accounts receivable not allowing for bad debts and the net accounts receivable, which do allow for bad debts. The accounts receivable are fairly stated
only if they are complete, properly classified, and properly valued. The auditor
has obtained evidence for completeness by tracing a sample from a subsidiary
ledger. Customer confirmations have provided evidence that the accounts are
properly classified and properly valued. In addition, the auditor's assessment of
the internal accounting system ("review of the environment") provides evidence
for the accounts receivable being correct, and her assessment of the state of the
economy ("analytic review") provides evidence for the adequacy of the allowance
for bad debts.
The bubble graph in Figure 2.12 depicts a probability model for the situation
described by Figure 2.11. Using the abbreviations indicated in Figure 2.13, we
write

FIG. 2.11. The audit evidence.

Each abbreviation represents a variable corresponding to an assertion or item of
evidence shown in Figure 2.11. The variable N, for example, might be a binary
variable indicating whether the net accounts receivable are fairly stated (N = 1)
or not (N = 0).
The auditor's evidence consists of observed values of the variables E, R, T,
and CC, which we may designate by corresponding lower case letters. We are
interested in the posterior distribution of the remaining variables given these
observations, and according to Equation 1.18 in Proposition 1.6, this is proportional to the function obtained by substituting the observations in the right-hand
side of equation (2.11):

We are particularly interested in the marginal of this posterior for the variable
N, which corresponds to an overall judgment that the financial statement is fairly
stated. Since the observed variables do not form an initial segment of the bubble
graph, we cannot find this marginal using the methods we have studied in this
chapter. Instead, we must use the methods of the next chapter, which apply to
arbitrary factorizations.
2.4. Other graphical representations.

There are a number of alternatives to the bubble graph for representing the head-tail structure of construction sequences, including chain graphs (Wermuth and
Lauritzen [50]) and valuation networks (Shenoy [47]). Figure 2.14 shows a chain
graph and Figure 2.15 shows a valuation network corresponding to the bubble
graph of Figure 2.12. Both types of graph have uses beyond that of representing
construction sequences. In the chain graph for a construction sequence, all the

FIG. 2.12. A bubble graph for the audit.

FIG. 2.13. Variables for the construction sequence.

variables in each head are linked with each other, but by omitting some of these
links, we can represent additional conditional independence relations. By varying
the shape of the relational nodes and their arrows in a valuation network, we can
represent a wide variety of relations.
Another more complex graphical representation has been developed by Heckerman [30] under the name similarity network. A similarity network is a tool for
knowledge acquisition; it allows someone constructing a probability distribution
to allow certain variables in a construction sequence to be sufficient for other
variables given some values for earlier variables but not given other values for
these earlier variables.

Exercises.
EXERCISE 2.1. The idea of a construction sequence for a probability distribution generalizes to the idea of a construction sequence for a conditional. In

FIG. 2.14. A chain graph for the audit.

FIG. 2.15. A valuation network for the audit.

this generalization, we no longer require that the first tail be empty and that each
new tail be contained in the existing domain. We require only that each new head
be disjoint from the existing domain.
Consider first two conditionals Q1 and Q2. Under the hypothesis that h2 is
disjoint from d1 (Figure 2.16), prove the following statements:
1. The product Q1Q2 is a conditional with head h1 ∪ h2 and domain d1 ∪ d2.
2. The product Q1 1t2 is Q1Q2's marginal on d1 ∪ t2.
3. The conditional Q2 continues Q1Q2 from d1 ∪ t2 to d1 ∪ d2.
Then consider a sequence of conditionals Q1, . . . , Qn. Under the hypothesis that
hi is disjoint from d1 ∪ · · · ∪ di−1 for i = 2, . . . , n, prove the following statements:
1. The product Q1 · · · Qn is a conditional with head h1 ∪ · · · ∪ hn
and domain d1 ∪ · · · ∪ dn.
2. For i = 2, . . . , n, Q1 · · · Qi−1 1(d1∪···∪dn)\(hi∪···∪hn) is the
marginal of Q1 · · · Qn on (d1 ∪ · · · ∪ dn) \ (hi ∪ · · · ∪ hn).


FIG. 2.16. Here we ask only that the second head be disjoint from the first domain.

FIG. 2.17. The "and" structure of the audit.

3. For i = 2, . . . , n, the conditional Qi continues Q1 · · · Qn from
(d1 ∪ · · · ∪ dn) \ (hi ∪ · · · ∪ hn) to (d1 ∪ · · · ∪ dn) \ (hi+1 ∪ · · · ∪ hn).
4. More generally, if 1 ≤ i ≤ j ≤ n, then the product Qi · · · Qj
continues Q1 · · · Qn from (d1 ∪ · · · ∪ dn) \ (hi ∪ · · · ∪ hn) to (d1 ∪ · · · ∪
dn) \ (hj+1 ∪ · · · ∪ hn).
When hi is disjoint from d1 ∪ · · · ∪ di−1 for i = 2, . . . , n, we say that Q1, . . . , Qn
is a construction sequence for the conditional Q1 · · · Qn. Notice that any subsequence of a construction sequence is itself a construction sequence.
EXERCISE 2.2. Discuss how the idea of a state graph for a Markov chain can
be generalized so as to apply to more general belief chains.
EXERCISE 2.3. Devise graphical representations for hidden Markov models
in which the number of observed variables attached to a node in the Markov
chain is itself an observed variable.
EXERCISE 2.4. The basic graph in Figure 2.11 can be interpreted as an "and
graph": N = 1 if and only if A = 1 and B = 1, and A = 1 if and only
if C = 1, PC = 1, and PV = 1. This suggests arrows pointing the other
way, as in Figure 2.17. Show that the marginal on {N, A, B, C, PC, PV} of
a probability distribution of the form provided by equation (2.11) will not, in
general, be represented by the DAG in Figure 2.17.
EXERCISE 2.5. The conditionals involving a particular set of variables form
only a partial commutative semigroup, since products and marginals are not always conditionals.
Generalize the axioms of transitivity and combination you formulated in Exercise 1.5 to the case where the semigroup may be only partial. Consider also
the case where labels are binary: head and tail.


CHAPTER 3

Propagation in Join Trees

In this chapter, we study the problem, which we encountered in the preceding
chapter, of computing marginals of a function given as a product of tables on
different sets of variables, say

f = f1 f2 · · · fk,    (3.1)

where fi is a table on xi. We want to compute f's marginal on a particular
variable X, on one of the sets xi, or on some other set x of variables. The frame
of all the variables, Ω∪xi, is too large for us to compute the table f and then sum
variables out of this table. So our task is to compute marginals for f without
computing f itself.
The approach we take in this chapter is the obvious one: we exploit the
factorization as we sum variables out. We sum variables out one at a time, and
we deal each time only with factors that involve the variable we are summing out;
the others we factor out of the summation. Each step produces a new product of
the same form as the right-hand side of equation (3.1), possibly involving some
larger clusters of variables (when we sum Y out, we must multiply together
all the fi involving Y, and the resulting cluster may be large even after Y is
removed). The next step must deal with these larger clusters, but with luck and
a good choice of the order in which we sum variables out, we may be able to
compute a given marginal without encountering a prohibitively large cluster.
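One step of this scheme can be sketched concretely. The representation below is ours, not the author's: a "table" is a pair of a variable tuple and a dict of values over binary frames.

```python
from itertools import product

# Hedged sketch of one elimination step: pull out the tables involving
# `var`, multiply them together, sum `var` out, and return the new list
# of tables.  Frames are binary here for simplicity.

def eliminate(tables, var):
    involved = [t for t in tables if var in t[0]]
    rest = [t for t in tables if var not in t[0]]
    # the new cluster: union of the involved clusters, minus var
    cluster = tuple(sorted({v for t in involved for v in t[0]} - {var}))
    values = {}
    for cfg in product([0, 1], repeat=len(cluster)):
        env = dict(zip(cluster, cfg))
        total = 0.0
        for x in [0, 1]:                      # sum var out
            env[var] = x
            p = 1.0
            for vs, vals in involved:
                p *= vals[tuple(env[v] for v in vs)]
            total += p
        values[cfg] = total
    return rest + [(cluster, values)]

f1 = (('A', 'B'), {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4})
f2 = (('B', 'C'), {(0, 0): 1, (0, 1): 0, (1, 0): 2, (1, 1): 1})
print(eliminate([f1, f2], 'B'))   # one new table on the cluster ('A', 'C')
```

Repeating `eliminate` over an ordering of the variables is exactly the variable-by-variable summing out described above; the clusters it creates are the nodes of the join tree.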
As it turns out, this variable-by-variable summing out produces a join tree,
and the process can be understood directly in terms of the join tree. A join tree
is a tree with clusters of variables as nodes, with the property that any variable
in two nodes is also in any node on the path between the two (equivalently, the
nodes containing any particular variable are connected). The join tree produced
by summing variables out in a given order has the clusters produced by the
summing out as its nodes, and each summing out can be thought of in terms of
a message passed (or "propagated") from one node to a neighbor in this tree.7
7 The name "join tree" was coined in the theory of relational databases in the early 1980s
(Beeri et al. [22]). An alternative, "junction tree," is also current in the literature on belief
nets.


There are a number of ways to arrange the details of propagation in a join
tree. We can sum out more than one variable at a time. We can carry out
a multiplication after each summing out, or we can leave the multiplications
until they are required for a new summing out. In some cases, we can reduce the number of multiplications by judicious divisions. Thus we can distinguish different architectures for join-tree marginalization. In this chapter, we
study four: the elementary, Shafer-Shenoy, Lauritzen-Spiegelhalter, and Aalborg architectures. The elementary architecture produces the marginal for a
single node of the join tree. The other architectures produce marginals for all
nodes of the tree. The Shafer-Shenoy architecture achieves this by storing the
results of each summing out so they can be used for propagation in any direction. This architecture is very general; it applies not only to the problem
we study in this chapter but also to other problems of recursive computation
involving unrestricted combination and marginalization operations that satisfy
the transitivity and combination axioms. It is somewhat wasteful, however, in
its appetite for multiplication. The Lauritzen-Spiegelhalter and Aalborg architectures eliminate some of the multiplication by substituting a smaller number of
divisions.
If we are concerned only with calculating marginals of factored probability
distributions, the Aalborg architecture is the architecture of choice. Moreover,
the Aalborg architecture handles new evidence quite flexibly. Once it has computed marginals for given observations, it can adjust the marginal for a particular
variable X after the further observation of a variable Y using only the part of
the join tree that lies between X and Y. But the alternative architectures come
into play for a wide variety of collateral problems that do not, for one reason
or another, satisfy all the assumptions made by the Aalborg architecture. For
example, when observations are subject to retraction, the Aalborg architecture
cannot be used because it does not retain the original inputs; Jensen [32] resorts
to the Shafer-Shenoy architecture in this case.
The methods of this chapter require only that the function f be given as a
product of tables; it need not be a probability distribution, and even if it is,
the tables need not be conditionals. (In the case of the elementary and Shafer-Shenoy architectures, they can even have negative entries.) But we are most interested in the case where f is equal or proportional to a probability distribution.
If f is only proportional to a probability distribution P, it is usually the marginals
of P, not the marginals of f, that we want, but most of the work will be in finding
the marginals of f; we can obtain P's marginals from f's by equation (1.2).
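The final step from a marginal of f to the corresponding marginal of P is just normalization (the role played by equation (1.2)); the normalizing constant is the same for every marginal. A one-line sketch:

```python
# Normalizing a marginal of f to obtain the corresponding marginal of P,
# when f is proportional to the probability distribution P.

def normalize(marginal):
    z = sum(marginal.values())
    return {cfg: v / z for cfg, v in marginal.items()}

print(normalize({0: 3.0, 1: 1.0}))  # {0: 0.75, 1: 0.25}
```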
As noted in the preface and in the exercises at the end of this chapter, join-tree computation is much broader and older than the problem of finding marginal
posterior probabilities in probabilistic expert systems. In fact, techniques similar
to each of the architectures studied in this chapter have been applied to a variety
of problems in applied mathematics and operations research. Perhaps the oldest
such problem is that of solving a "sparse" set of linear equations: one in which
only a few variables appear in each equation. Other examples include the four-color problem, dynamic programming, and constraint propagation (Diestel [2]).


The feasibility and efficiency of join-tree computation depends, of course,
on the nodes of the tree being sufficiently small. In the case of probability
propagation, they must be small enough that multiplication and marginalization
within nodes is inexpensive. Roughly speaking, this means that the sum of
the frame sizes must be small, or even more roughly, that the largest frame must
be small. Finding a join tree that achieves either of these minima exactly is an
NP-complete problem, but it is known that such minima are always achieved by
join trees that are produced by summing variables out in some order (Mellouli
[39]). Moreover, there are good heuristics for finding reasonable join trees if they
exist (Kong [36], Kjærulff [35]).
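A typical heuristic of this kind is a one-step-look-ahead rule. The sketch below is illustrative, in the spirit of (though not necessarily identical to) the heuristics cited: at each step it eliminates a variable whose elimination would create the smallest cluster.

```python
# Illustrative greedy heuristic for an elimination order: repeatedly
# eliminate a variable whose elimination creates the smallest cluster.
# `clusters` is a list of sets of variable names.

def elimination_order(clusters):
    clusters = [set(c) for c in clusters]
    variables = set().union(*clusters)
    order = []
    while variables:
        def cluster_size(v):
            merged = set().union(*[c for c in clusters if v in c])
            return len(merged - {v})
        v = min(sorted(variables), key=cluster_size)  # sorted: deterministic ties
        merged = set().union(*[c for c in clusters if v in c]) - {v}
        clusters = [c for c in clusters if v not in c] + [merged]
        variables.remove(v)
        order.append(v)
    return order

# A chain of clusters: eliminating from the ends keeps clusters small.
print(elimination_order([{'A', 'B'}, {'B', 'C'}, {'C', 'D'}]))
```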
3.1. Variable-by-variable summing out.

A simple example will suffice to show how variable-by-variable summing out
produces a join tree and how the summing out can be interpreted as message-passing in this join tree.
Here is a function on seven variables given as a product of five tables:
The clusters of variables involved in the tables are shown in Panel 1 of Figure 3.1.
Let us imagine summing the variables out in the reverse of the order in which
they are numbered, keeping track as we go of the new clusters we create.
Summing X7 out yields

where we have written f2^7(X5) for Σ_{X7} f2(X5, X7). The clusters in this new
factorization are shown in Panel 2. Above them, we have begun to construct a
join tree by drawing a node representing the variables involved in the summation,
X5 and X7. We temporarily link this node to the single variable X5, which is
the only variable involved in the new table resulting from the summation.
Next, we sum X6 out, obtaining

FIG. 3.1. Constructing the join tree.

The result is shown in Panel 3, where we have added to the join-tree-to-be a
second node consisting of the variables involved in the summation on this step.
We have linked the new node to the cluster of variables involved in the new table
resulting from the summation.
The next step, which produces Panel 4, is more interesting. Here we sum X5
out, obtaining


We again add a node consisting of the variables involved in the summation. We
remove the clusters for tables absorbed in the summation, replacing them with a
single cluster for the new table resulting from the summation. One node already
in the picture was linked to a cluster removed from the list; it is now linked to
the new node.
The reader can write down the formulas for the remaining steps, which are
represented by Panels 5-8. At each step, we pull out from our product the factors
involving the variable we are summing out, multiply them together, perform the
summation, and give a new name to the resulting table (our system for naming
identifies the original tables involved in the subscript and the variables summed
out in the superscript, but this is of no importance). We add to our picture a node
representing the variables involved in the summation. We remove from the list
all the clusters corresponding to tables absorbed into the summation, replacing
them with the single cluster for the new table resulting from the summation;
this is the union of the clusters removed minus the variable summed out. We
link the node created to the cluster added. When a linked cluster is removed
from the list, the link is inherited by the new node that absorbs it.
The final result in Panel 8 is indeed a join tree. It is a tree with sets for
nodes, and whenever a variable is contained in two nodes, it is also contained in
all the nodes on the path joining the two. For example, the variable 2, which is
contained in both 23 and 1245, is also contained in the two nodes between them,
12 and 124.
Though we have worked in terms of an example, we have spelled out a general
algorithm. This algorithm applies to any product of tables and to any order for
summing the variables out of such a product. It identifies the clusters involved in
the variable-by-variable summing out, and it arranges these clusters in a graph.
Is this graph always a join tree?
Certainly the graph is always a tree, i.e., it is always connected and acyclic.
We introduce the nodes in a sequence. Each node except the last is linked with
some later node, so the graph is connected. (Since we can follow the links from
any node to the last node, we can follow them from one node to the last node
and then back to any other node we please.) Each node is linked with only one
later node, so there cannot be any cycles. (If there were a cycle, the earliest
node in it would have to be linked with two later nodes.)
To see that the tree is always a join tree, consider Figure 3.2, where the links
have become arrows pointing from old to new nodes, and each arrow is labeled
with the variable that was summed out when the node from which the arrow

FIG. 3.2. The join tree with arrows to the root.

comes was created. The node to which an arrow points always includes all the
variables in the node from which the arrow comes, except the variable that was
summed out. For any particular variable X, any node n containing X must be
connected to the node n' created when X is summed out, because the tables
created as we go downward from n continue to contain X until it is summed out.
It follows that all the nodes containing X are connected in the tree (i.e., form
a subtree), and this is equivalent to the tree being a join tree.
The join tree that we construct in this way is interesting because it can be
interpreted as a picture of the computations involved in the variable-by-variable
summing out. We interpret a node x as a register that can store a table for its
variables, and we interpret an arrow from x to y as an instruction to sum out a
variable from x's table and multiply y's table by the result.
We begin by putting tables in the storage registers; in Figure 3.2, for example,
we put the table f1 in 23, the table f2 in 57, the product f3f4 in 1234, and the
table f5 in 146. We put tables of ones in the other three nodes. The number
beside each arrow tells us which variable to sum out of the table in the node
preceding the arrow. Figure 3.3 shows the summations we perform when we
follow these instructions.
We summed the variables out in the reverse of the order in which they were
numbered: 7, 6, 5, 4, 3, 2. Figures 3.2 and 3.3 make it clear, however, that
this order can be varied to some extent without changing the join tree or the
computations performed. The only constraint is that we sum out of a given node
only after the node has absorbed messages from all nodes with arrows pointing
to it. Only the three nodes 23, 57, and 146 can begin the computation, 1245 can
act after 57, 124 can act after 1245 and 146, and so on.
We do not need the numbers beside the arrows in Figure 3.2. These numbers
tell us which variable to sum out, but we can also find this information by
comparing the node sending the message to the node receiving it. The sender
always sums out the variable it has that its neighbor does not have. In other
words, it marginalizes to its intersection with the neighbor.
The final result of the computation is f↓X1, the marginal of f for X1. If we
continue by summing X1 out of this table, then we obtain f↓∅, the marginal of
f on the empty set. Figure 3.2 can be extended to include this final summation;
we simply add ∅ as a node, with an arrow to it from 1.

FIG. 3.3. The successive summations.

3.2. The elementary architecture.

Marginalization in join trees can be understood directly, without any reference
to an ordering of the variables. If we place tables in the nodes of an arbitrary
join tree and propagate to a root following the algorithm just described, then
join tree and propagate to a root following the algorithm just described, then
the final table on the root will always be the marginal on the root of the product
of the initial tables. It is not necessary that the join tree or the placement of the
tables should have been determined by an ordering of the variables.
In this section, we will spell out the marginalization algorithm in terms of an
arbitrary join tree. Then we will prove, using only the transitivity and combination axioms, that the algorithm always produces the marginal on the root.
Before beginning the algorithm, we place in each node x of the join tree a
table on x, say φx. We write φ for the product of the φx: φ = Πx∈N φx, where N
is the set consisting of all the nodes in the tree. The purpose of the algorithm is
to find the marginal φ↓r for a particular node r, which we call the root of the tree.
To begin the algorithm, we make all the links in the tree into arrows in the
direction of r. (Each node other than r will then have exactly one arrow outward,
pointing to its unique neighbor in the direction of r.) Then we have each node
pass a message to its neighbor nearer r according to the rules we learned in
Figure 3.1:


Rule 1. Each node waits to send its message to its neighbor nearer
to r until it has received messages from all its other neighbors.
Rule 2. When a node is ready to send its message, it computes the
message by summing out of its current table any variables it has but
the neighbor to whom it is sending the message does not have. (This
was always a single variable in Figure 3.1, but it could be several
variables or none.) In other words, it marginalizes its current table
to its intersection with the neighbor.
Rule 3. When a node receives a message, it replaces its current table
with the product of that table and the message.
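The three rules can be sketched as a recursive collect-to-root procedure. The representation below is our own illustration, not the author's notation: nodes are sorted tuples of binary variable names, so that configuration orders of intersections agree.

```python
# Hedged sketch of Rules 1-3: `tree` maps each node (a sorted tuple of
# variable names) to its list of neighbors, and phi[node] is the table
# on that node, a dict from configurations to numbers.

def marginalize(node, table, keep):
    out = {}
    for cfg, v in table.items():
        key = tuple(c for var, c in zip(node, cfg) if var in keep)
        out[key] = out.get(key, 0.0) + v
    return out

def collect(tree, phi, node, parent=None):
    """Absorb messages from all neighbors away from the root (Rule 1)."""
    table = dict(phi[node])
    for nb in tree[node]:
        if nb == parent:
            continue
        # Rule 2: the neighbor marginalizes to its intersection with us
        msg_vars = tuple(v for v in nb if v in node)
        msg = marginalize(nb, collect(tree, phi, nb, node), msg_vars)
        # Rule 3: multiply the message into the current table
        for cfg in table:
            sub = tuple(c for var, c in zip(node, cfg) if var in msg_vars)
            table[cfg] *= msg[sub]
    return table

tree = {('x', 'y'): [('y', 'z')], ('y', 'z'): [('x', 'y')]}
phi = {('x', 'y'): {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4},
       ('y', 'z'): {(0, 0): 1, (0, 1): 1, (1, 0): 2, (1, 1): 0}}
print(collect(tree, phi, ('x', 'y')))  # marginal of the product on {x, y}
```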
Eventually, all the nodes except r will have sent messages, and r will have received a message from each of its neighbors and will have multiplied its original
table by all these messages.
Here is the proposition we need to prove.
PROPOSITION 3.1. At the end of the algorithm just described, the table on r
will be φ↓r, the marginal on r of the product of the initial tables.
Proof. Imagine for the moment that the nodes are peeled away from the join
tree as they send their messages, so that in the end only r remains. Thus a single
step of the algorithm consists of three parts: (1) a node t computes the marginal
of its table to b ∩ t, (2) the neighbor b multiplies this marginal into its current
table, and (3) the node t is removed from the tree. This allows us to state the
following lemma.
LEMMA 3.1. After each step, the product of the tables that remain is the
marginal to the variables that remain of the product of the tables before the step.
To see that Lemma 3.1 is true, write N1 for the set of nodes in the tree before
the step, N2 for the set of nodes in the tree after the step, and ψx for the table
in node x before the step. Thus the product of the tables before the step is
Πx∈N1 ψx, and the product of the tables after the step is (Πx∈N2 ψx) ψt↓b∩t (see
Figure 3.4). Since the tree is a join tree, b ∩ t = (∪N2) ∩ t. So we find, using the
combination axiom, that

which is a restatement of Lemma 3.1. Lemma 3.1, together with the transitivity
axiom, yields the next lemma.
LEMMA 3.2. After each step, the product of the tables that remain is the
marginal to the variables that remain of the product of the initial tables.


FIG. 3.4. The loaded join tree before and after t sends its inward message to b.

At the end of the algorithm, we have only one table, the table on the root,
and so we obtain Proposition 3.1 as a special case of Lemma 3.2.
We can gain some further insight into the algorithm by noting that when a
node b receives a message from a neighbor t, it is also receiving, indirectly, information from the nodes on the other side of t. After any step (message-passing
and multiplication) in the algorithm, we can identify the nodes from which a
given node b has received information, either directly or indirectly. These nodes,
together with b itself, form a subtree, which we may call b's information
branch at that point (see Figure 3.5). The steps we have taken within this subtree are the same as the steps we would have taken had we implemented the
algorithm on it alone, with b as the root. So as a corollary of Proposition 3.1,
we have the following proposition.
PROPOSITION 3.2. After each step, the table on a given node b will be the
marginal on b of the product of the initial tables in b's current information branch.
This is a generalization of Proposition 3.1, because at the end of the algorithm, the root's information branch is the whole tree.
In the course of explaining our algorithm, we have found ourselves talking
about the nodes of the join tree as storage registers and even as individual
processors. Each node can store tables for a certain set of variables, multiply
such tables, and marginalize them. In effect, we have made the join tree, together
with the algorithm, into an architecture for marginalization. We call it the
elementary architecture. In the next few sections, we consider some alternative
architectures, based on the same join tree, that are able to compute marginals
for all the nodes, not merely for a single root node.
Join-tree architectures are potentially applicable to any instance of the general problem of computing marginals of a function given as a product of tables,
as in equation (3.1), but in order to apply a join-tree architecture to such a problem, we first find a join tree that covers the product, one that includes for each
factor a node containing the domain of that factor. (If we want the marginal for a

FIG. 3.5. The dashed arrows are those over which messages have already been sent. The
circled subtree is b's information branch at this point.

cluster of variables that is not the domain of one of the factors, then we must
make sure that the join tree also has a node containing this cluster.) Once we
have such a join tree, we place each factor in a node containing its domain. If
a node x receives more than one factor, we multiply them together, and we also
multiply by 1x if necessary in order to obtain a table that involves all the variables in x. If a node x does not receive a factor, we simply assign it the table 1x.
If the join tree has more than one node containing the domain of a particular
factor, we can put the factor in whichever of these nodes we please. In Figure 3.2,
for example, we have two different nodes that can accept a table on 124. To
minimize computation, we should choose the node with the smaller frame size,
but this is a minor consideration.
The choice of the join tree is much more important. We want a join-tree
cover with nodes small enough to permit computation. If such a join-tree cover
does not exist, we will have to turn to alternative methods for marginalization,
such as Markov-chain Monte Carlo.
As we noted at the beginning of the chapter, there are heuristics that do
produce reasonable choices for join-tree covers. Some of these heuristics do
involve choosing an order for eliminating (summing out) the variables. This not
only produces a join-tree cover; it also determines a placement of the factors in
the join tree: each factor goes as close as possible to the root.

3.3. The Shafer-Shenoy architecture.

The elementary architecture allows us to find the marginal for an arbitrary root
of a join tree. If we then want to find the marginal for another node, we can
use the same join tree, but we must repeat the algorithm using the new node

PROPAGATION IN JOIN TREES


FIG. 3.6. The partial Shafer-Shenoy architecture. Like the elementary architecture, it
finds the marginal for a single root node. In each separator, we have indicated the set of
variables involved in the messages that will be stored there; this is always the intersection of
the two neighboring nodes.

as the root. This usually involves a great deal of duplication. In Figure 3.4, for
example, most of the steps for computing the marginal on w will be the same as
those for computing the marginal on r.
The Shafer-Shenoy architecture provides one way to eliminate much of this
duplication. In this architecture, each node sends messages in all directions. It
is allowed to send its message to a particular neighbor as soon as it has messages
from all its other neighbors. In order that the computations for a message in one
direction should not interfere with those for a message in another direction, a
node no longer replaces its table each time it receives a message. Instead, it keeps
its initial table, stores the incoming messages, and performs multiplications only
as needed for computing outgoing messages.
As a first step in describing the Shafer-Shenoy architecture, we will describe
a partial version, in which, as in the elementary architecture, messages are propagated only to a single root r. Figure 3.6 shows this partial architecture. The
squares on the arrows in this figure are called separators; they contain storage
registers for storing the messages sent in the direction of the arrows. As in the
elementary architecture, we begin with a table φ_x on each node x and we want
to find φ^{↓r} for a particular node r, where φ is the product of the φ_x. The storage
registers in the separators are initially empty.
Here are the rules for propagation in the partial Shafer-Shenoy architecture:
Rule 1. Each node waits to send its message to its neighbor nearer to
r until it has received messages from all its other neighbors. (More
precisely, it waits until messages have been received by the separators
between it and these other neighbors.)


Rule 2. When a node is ready to send its message to its neighbor
nearer r (or, more precisely, to the separator between it and its neighbor nearer r), it computes the message by collecting all its messages
from neighbors farther from r, multiplying its own table by these
messages, and marginalizing the product to its intersection with the
neighbor nearer r.
Rule 1 is the same as in the elementary architecture. Here, however, the messages
are intercepted by the separators, where they are stored until they are collected
in accordance with Rule 2. Rule 3, which provides for changing the tables on
nodes, has been omitted. In this architecture, propagation only has the effect of
filling the storage registers in the separators. It does not change the tables on
the nodes.
Since the rules for message-passing are the same in the partial Shafer-Shenoy
architecture as in the elementary architecture, the course of the propagation and
the messages sent will be the same. At the end of the propagation, the root r
will have a message from each neighbor stored in the separator it shares with
that neighbor. Thus we have the following proposition.
PROPOSITION 3.3. At the end of the partial Shafer-Shenoy propagation, we
can get φ^{↓r} by collecting all of r's incoming messages and multiplying r's table
by them.
The full Shafer-Shenoy architecture extends the partial architecture by
putting two storage registers in each separator, one for a message in each direction, as in Figure 3.7. Each node sends messages to all its neighbors, following
these rules:
Rule 1. Each node waits to send its message to a given neighbor until
it has received messages from all its other neighbors.
Rule 2. When a node is ready to send its message to a particular
neighbor, it computes the message by collecting all its messages from
other neighbors, multiplying its own table by these messages, and
marginalizing the product to its intersection with the neighbor to
whom it is sending.
Here, as in the partial architecture, the tables on the nodes do not change. At
the end of the propagation, each node x still has its initial table φ_x. The only
effect of the propagation is to fill all the storage registers in the separators.
A comparison of the rules for the full and partial architectures makes it clear
that the full architecture produces the same messages towards any particular
node as the partial architecture with that node as root. So once we have completed the propagation in the full architecture, we can find the marginal for
any particular node by collecting all its incoming messages and multiplying the
node's table by them.
PROPOSITION 3.4. At the end of the full Shafer-Shenoy propagation, we can
get φ^{↓x} for any node x by collecting all of x's incoming messages and multiplying
x's table by them.
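Propositions 3.3 and 3.4 can be illustrated with a small executable sketch. Everything below is our own toy representation, not the book's: a table is a pair (vars, values) over binary variables, `shafer_shenoy` caches each message once (playing the role of the two storage registers per separator), and the two-node join tree and its numbers are invented for illustration.

```python
from itertools import product

# Toy full Shafer-Shenoy propagation.  A table is (vars, values): vars is a
# tuple of variable names, values maps each 0/1 assignment tuple to a number.
# Node names are strings listing the variables the node contains.

def mult(a, b):
    (va, ta), (vb, tb) = a, b
    vs = tuple(dict.fromkeys(va + vb))          # union of the two domains
    vals = {}
    for asg in product((0, 1), repeat=len(vs)):
        env = dict(zip(vs, asg))
        vals[asg] = (ta[tuple(env[v] for v in va)]
                     * tb[tuple(env[v] for v in vb)])
    return vs, vals

def marg(a, keep):
    va, ta = a
    vs = tuple(v for v in va if v in keep)      # sum out the other variables
    vals = {}
    for asg, val in ta.items():
        key = tuple(x for v, x in zip(va, asg) if v in keep)
        vals[key] = vals.get(key, 0.0) + val
    return vs, vals

def shafer_shenoy(tree, tables):
    """Compute every message once and return the marginal for every node."""
    registers = {}                              # (sender, receiver) -> message
    def msg(src, dst):
        if (src, dst) not in registers:
            t = tables[src]
            for n in tree[src]:                 # Rule 2: all other neighbors
                if n != dst:
                    t = mult(t, msg(n, src))
            registers[(src, dst)] = marg(t, set(src) & set(dst))
        return registers[(src, dst)]
    marginals = {}
    for x in tree:                              # Proposition 3.4 at each node
        t = tables[x]
        for n in tree[x]:
            t = mult(t, msg(n, x))
        marginals[x] = t
    return marginals

tree = {"AB": ["BC"], "BC": ["AB"]}
tables = {
    "AB": (("A", "B"), {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}),
    "BC": (("B", "C"), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.25, (1, 1): 0.75}),
}
marginals = shafer_shenoy(tree, tables)
```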


FIG. 3.7. The full Shafer-Shenoy architecture. The arrow in each storage register indicates the direction of the message to be stored there.

We will find it useful, when we compare the Shafer-Shenoy architecture to
other architectures, to express its computations in formulas. Let us write m_{n→x}
for the Shafer-Shenoy message to x from neighbor n. Then Rule 2 says that the
message from x to neighbor w is given by

    m_{x→w} = (φ_x ∏_{n ∈ N_x, n ≠ w} m_{n→x})^{↓x∩w},    (3.2)

where N_x consists of x's neighbors, and Proposition 3.4 says that

    φ^{↓x} = φ_x ∏_{n ∈ N_x} m_{n→x}.    (3.3)


Because of Rule 1, the computation must begin with the leaves, the nodes
that have only one neighbor. In Figure 3.7, for example, the leaves are 1, 23, 57,
and 146. Any of these leaves can begin, and the message they send is the only
message they send in the course of the computation. The situation for the other
nodes is more complicated. Node 12, for example, can send a message to 124 as
soon as it has heard from leaves 1 and 23, but it must then wait to hear
back from 124 before it can send messages back to 1 and 23.
Figure 3.8 shows one sequence in which messages might be sent in the architecture of Figure 3.7. The messages first move inward to the node 1 and then
back outward again. The inward pass is identical to propagation to 1 in the
partial Shafer-Shenoy architecture of Figure 3.6.


FIG. 3.8. One order in which messages might be sent in the full Shafer-Shenoy architecture.

If the computations are performed serially, there will necessarily be one node,
such as 1 in Figure 3.8, that is the first to receive messages from all its neighbors.
This node can be considered the root. The propagation consists of a pass inward
to the root and another pass back outward. It is not necessary, however, to
specify the root in advance. If the computations are performed in parallel (a
possibility suggested when we talk as if the nodes were individual processors),
then which node is the first to receive all its messages will depend on the pace
of the computations for the different nodes farther out in the tree, and it is even
possible that two nodes will tie for first. This happens in Figure 3.9, where
the computations proceed in parallel and in synchrony, and 124 and 12 receive
messages from each other simultaneously on the third step of the computation.


FIG. 3.9. An example of parallel computation.

By comparing Figures 3.6 and 3.8, we can understand better why the Shafer-Shenoy architecture stores so many messages. The elementary architecture uses
and discards each message when it is sent. But what would happen if we were
to follow the inward pass of the elementary architecture with an outward pass?
In the case of Figures 3.6 and 3.8, this means that after 1 absorbed the message
from 12, it would send a message back to 12. By the usual rule, the message back
would simply be its current table, which was obtained by multiplying its original
table by the message (no marginalization is needed, because the intersection of
1 with 12 is simply 1). Intuitively, this is wrong, because it forces 12 to
absorb again the message it just sent, effectively counting it twice. The Shafer-Shenoy architecture sends instead only the original table, uncontaminated with
the message from 12. It is able to do this because it has kept both its original
table and the message. The same thing happens at each further step on the
outward pass. Node 12, for example, since it still has both its original table and
the messages from 23 and 1, is able to send a message back to 124 that is not
contaminated with the message it received from 124.
Roughly speaking, the Shafer-Shenoy architecture computes marginals for
all the nodes at about three times the price for a single marginal. We double
the computation because we compute two messages instead of one for each link,


and then we increase it by about the same amount again when we do the final
multiplications to get the marginal for each node. This contrasts with repeating the elementary architecture for each node, which multiplies the amount of
computation for a single marginal by the number of nodes.
Unfortunately, the Shafer-Shenoy architecture is still rather wasteful in its
demand for multiplication. Each node computes a message for each of its neighbors only once (in contrast to what happens if we use the elementary architecture
over and over), but the multiplication a node performs to compute the message
to one neighbor still duplicates much of the multiplication it performs to compute the message to another. In Figure 3.7, for example, node 124 will multiply
its original table by the message from 1245 once when it sends its message to
146 and again when it sends a message to 12. With yet more storage, we could
reduce this remaining duplication somewhat, but it is more effective to take another tack. Instead of trying to keep the message a node sends on the inward
pass from being included in the message it gets back, we can allow for the message's later return by dividing it out of the node's current table as it is sent.
This is the tack taken by the Lauritzen-Spiegelhalter architecture.

3.4. The Lauritzen-Spiegelhalter architecture.

The Lauritzen-Spiegelhalter architecture explicitly designates a particular node
r as the root of the propagation. It does not use separators. It begins with a
pass inward to r that duplicates the elementary architecture, except that when
a node sends a message, it divides its own table by that message. It then follows
with a pass outward from r, during which it follows the elementary architecture's
rule for propagation, without the division. This is illustrated by Figure 3.10.
Here is a precise statement of the rules for the inward pass.
Rule 1. Each node waits to send its message to its neighbor nearer r
until it has received messages from all its other neighbors.
Rule 2. When a node is ready to send its message to its neighbor
nearer to r, it computes the message by marginalizing its current table to its intersection with its neighbor. It sends this marginal to the
neighbor nearer to r, and then it divides its own current table by it.
Rule 3. When a node receives a message, it replaces its current table
with the product of that table and the message.
These rules are the same as the rules for the elementary architecture, except for
the addition of the italicized phrase in Rule 2. For the outward pass, we use the
same rules, without the divisions:
Rule 1. Each node waits to send its message to a particular neighbor outward from r until it has received messages from all its other
neighbors.
Rule 2. When a node is ready to send its message to a particular
neighbor outward from r, it computes the message by marginalizing
its current table to its intersection with this neighbor.


FIG. 3.10. Rules for the Lauritzen-Spiegelhalter architecture. The message, In or Out, is
always the marginal of the sender's current table to the sender's intersection with the receiver.

Rule 3. When a node receives a message, it replaces its current table
with the product of that table and the message.
Since each node received messages from all its outward neighbors on the inward
pass, we can restate Rule 1 for the outward pass in a simpler way: Each node
waits to send its messages outward until it has received a message from its unique
neighbor nearer to r. (This neighbor may be r itself; r must begin the outward
pass by sending one or more messages.)
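As a concrete illustration of the inward rules, here is a single Lauritzen-Spiegelhalter step on a toy two-node chain; the representation and the numbers are ours, not the book's. Node x = {A, B} marginalizes its table to B and sends the message to its inward neighbor w = {B, C}, dividing its own table by the message (Rule 2); w then multiplies the message in (Rule 3).

```python
# One inward Lauritzen-Spiegelhalter step on a toy chain x = {A,B}, w = {B,C}
# (illustrative tables; entries are indexed by 0/1 values of the variables).

def marginal_on_B(table):            # table maps (first_var, b) -> value
    out = {}
    for (_, b), v in table.items():
        out[b] = out.get(b, 0.0) + v
    return out

t_x = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 1.0, (1, 1): 1.0}   # phi_x on (A, B)
t_w = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 0.8}   # phi_w on (B, C)

message = marginal_on_B(t_x)                           # {0: 2.0, 1: 4.0}
t_x = {k: v / message[k[1]] for k, v in t_x.items()}   # Rule 2: divide it out
t_w = {k: v * message[k[0]] for k, v in t_w.items()}   # Rule 3: w absorbs it
```

Because the same message is divided out of one table and multiplied into the other, the product of the tables on the nodes is unchanged, which is the invariant noted later in this section.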
Let us check that the Lauritzen-Spiegelhalter architecture produces the appropriate marginals for all the nodes.
PROPOSITION 3.5. At the end of the Lauritzen-Spiegelhalter propagation, the
table on each node x is φ^{↓x}.
Proof. First consider the situation at the end of the inward pass. On the
inward pass, the messages sent are the same as in the elementary architecture
and hence also the same as in the Shafer-Shenoy architecture. If x is not equal to
r, then during the inward pass, x sends its inward neighbor w the Shafer-Shenoy
message m_{x→w}. At the end of the inward pass, x has received messages from
all its own outward neighbors (if any) and has sent only the message to w. This
gives the following lemma.
LEMMA 3.3. At the end of the inward pass, a node x not equal to the root
has as its table

    (φ_x ∏_{n ∈ N_x, n ≠ w} m_{n→x}) / m_{x→w},    (3.4)

where w is x's inward neighbor.


The root r, on the other hand, receives messages from all its neighbors and
sends no messages on the inward pass. So at the end of the inward pass, it has
the same table as at the end of the elementary architecture.


LEMMA 3.4. At the end of the inward pass, the table on r is φ^{↓r}.
Now consider the outward pass. On the outward pass, each node except the
root receives just one message: the message from its inward neighbor. The root
itself sends messages but does not receive any. So the table on the root does not
change, and each of the other tables changes exactly once, when it is multiplied
by the message from its inward neighbor. Since the propagation moves outward
from the root, Proposition 3.5 follows by induction from Lemma 3.4 together
with the following lemma.
LEMMA 3.5. Suppose w has φ^{↓w} as its table when it sends its message to outward neighbor x. Then after absorbing the message, x will have φ^{↓x} as its table.
To prove Lemma 3.5, we need a formula for the message w sends to x.
LEMMA 3.6. If w has φ^{↓w} as its table when it sends its message to outward
neighbor x, then the message it sends is the product of the Shafer-Shenoy messages in both directions: m_{w→x} m_{x→w}.
To prove Lemma 3.6, we note that by its hypothesis and equation (3.3), the
table on w is

    φ^{↓w} = φ_w ∏_{n ∈ N_w} m_{n→w} = m_{x→w} (φ_w ∏_{n ∈ N_w, n ≠ x} m_{n→w}).

The message w sends out to x is the marginal of this table to w ∩ x, which is
equal, by the combination axiom and equation (3.2), to

    m_{x→w} (φ_w ∏_{n ∈ N_w, n ≠ x} m_{n→w})^{↓w∩x} = m_{x→w} m_{w→x}.

When we multiply the expressions in Lemmas 3.6 and 3.3, we obtain

    m_{x→w} m_{w→x} (φ_x ∏_{n ∈ N_x, n ≠ w} m_{n→x}) / m_{x→w} = φ_x ∏_{n ∈ N_x} m_{n→x} = φ^{↓x},

which proves Lemma 3.5 (see Figure 3.11).


Since the hypothesis of Lemma 3.6 is always true, its conclusion is too: the
Lauritzen-Spiegelhalter message from w back out to x is always the product of
the Shafer-Shenoy messages in both directions. This substantiates the intuitive
characterization of the Lauritzen-Spiegelhalter architecture with which we began: dividing out the inward message when we send it compensates for the fact
that it will be part of the message that comes back.
Another equally important way of describing the message from w back out
to x is to say that it is the marginal of φ on w ∩ x. This is because w has the


After w has received messages from all its neighbors, including x and its neighbor nearer r, and before it sends a message back to x.

After w sends a message back to x.

FIG. 3.11. The node x and its neighbor w nearer the root before and after w sends a
message back to x.

marginal of φ on w as its table before sending the message, and it computes the
message by marginalizing this table to w ∩ x.
Using continuers. The alert reader will have noticed that we glossed over the
problem of zero probabilities in our description of the Lauritzen-Spiegelhalter
architecture. If the table m_{x→w} has zero values, then we will not be able to
perform the division in equation (3.4). Fortunately, it is not really necessary to
perform this division. The reasoning with which we proved Proposition 3.5 will
work if we can find a continuer, say Q_{x∩w→x}, of φ_x ∏_{n ∈ N_x, n ≠ w} m_{n→x} from x ∩ w
to x, for we can use Q_{x∩w→x} as x's table after it has sent its message inward
to w, and this will have the same effect as the division. When the message
m_{w→x} m_{x→w} comes back, we obtain

    Q_{x∩w→x} m_{x→w} m_{w→x} = φ^{↓x}    (3.5)


as our table on x, so that Lemma 3.3 and Proposition 3.5 still hold.
The requirement that continuers should exist makes the Lauritzen-Spiegelhalter architecture slightly less general than the Shafer-Shenoy architecture, which allows negative entries in the tables φ_x. Continuers may fail to exist
when negative values are allowed. But if the product of the φ_x is proportional
to a probability distribution, then we can take it for granted that all the entries
are nonnegative, because dropping minus signs will not change the product.
And, in this case, continuers exist by Proposition 1.1.
Notice the other implication of Proposition 1.1: we can choose the continuers
to be conditionals. More precisely, we can choose the continuer Q_{x∩w→x} to be a
conditional with head x \ w and tail x ∩ w.
When we look beyond probability to other problems satisfying the transitivity and combination axioms (see the exercises at the end of Chapter 1 and at the
end of this chapter), we find that the Shafer-Shenoy and Lauritzen-Spiegelhalter
architectures have overlapping but distinct ranges of application. The Shafer-Shenoy architecture works whenever there are no restrictions on multiplication
and marginalization, even if continuers do not exist. The Lauritzen-Spiegelhalter
architecture, on the other hand, can sometimes work under restrictions on
multiplication or marginalization that prevent the use of the Shafer-Shenoy
architecture.
The new construction sequence. One interesting feature of the Lauritzen-Spiegelhalter architecture is that the product of the tables on the nodes remains
equal to φ during the inward pass. This is clear when we divide: each time we
divide one of the tables by a message, we multiply another by the same message,
so the product does not change. It is equally clear in terms of continuers: each
time we factor a table into a marginal and a continuer and remove the continuer
from the node, we add it as a factor in another node.
Suppose we always choose the continuers to be conditionals. Then at the
end of the inward pass, we have transformed the original factorization of φ,
φ = ∏_{x∈N} φ_x, into a new factorization,

    φ = φ^{↓r} ∏_{x ∈ N, x ≠ r} Q_{x∩w(x)→x},    (3.6)

where w(x) is x's inward neighbor. This new factorization, as it turns out, can
be interpreted as a construction sequence.
In order to make the interpretation as a construction sequence precise, let us
take one more step, continuing the inward pass, as it were, from r to the empty

FIG. 3.12. The tables at the end of the inward pass.

set ∅. In other words, we factor the marginal φ^{↓r} into the product of φ^{↓∅} and
a continuer from ∅ to r. Since φ is proportional to a probability distribution
P, φ^{↓∅} ≠ 0, and hence the continuer is unique; it is the marginal P^{↓r}. So
equation (3.6) becomes

    φ = φ^{↓∅} P^{↓r} ∏_{x ∈ N, x ≠ r} Q_{x∩w(x)→x}.    (3.7)

If we imagine a node ∅ added to the join tree, with an arrow to it from r,
then at the end of the inward pass, we have the factors on the right-hand side
of equation (3.7) on the nodes of the tree (see Figure 3.12).
By Proposition 1.3, the probability distribution P is equal to φ/φ^{↓∅}. So
equation (3.7) tells us that

    P = P^{↓r} ∏_{x ∈ N, x ≠ r} Q_{x∩w(x)→x}.    (3.8)

It is the conditionals on the right-hand side of this equation that can be arranged
in a construction sequence for P. Indeed, suppose x_1, ..., x_m is an ordering of
the nodes of the join tree that moves outward from the root, i.e., such that x_1
is the root and each later x_i is an outward neighbor of one of x_1, ..., x_{i-1}. (Such
orderings exist in any tree.) Write Q_i for Q_{x_i∩w(x_i)→x_i}, for i = 2, ..., m. Then
we have the following lemma.
LEMMA 3.7. P^{↓r}, Q_2, ..., Q_m is a construction sequence for P.
Proof. Equation (3.8) says that P is the product of P^{↓r}, Q_2, ..., Q_m, and
the union of their heads is clearly equal to the domain of P. So to prove
the lemma, we need only show that the head of each conditional is disjoint from
the domain of the preceding ones. But this is an obvious property of join trees:
whenever we order the nodes in a sequence moving outward from a root, the
intersection of each node x_i with the preceding nodes is always contained in its
inward neighbor w(x_i), and hence x_i \ w(x_i) is disjoint from x_1 ∪ ... ∪ x_{i-1}.
Lemma 3.7 says that at the end of the inward pass, the tables on the nodes
are conditionals, and any outward sequence is a construction sequence.
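The join-tree property invoked in this proof can be checked mechanically on a small example. The sketch below is ours: it walks an outward ordering of a toy join tree and verifies that each node's intersection with the nodes already seen lies inside its inward neighbor.

```python
# Sketch: checking, on a toy join tree, the property used in the proof of
# Lemma 3.7.  Nodes are frozensets of variable names; `tree` maps each node
# to its inward neighbor (the root maps to None), and `order` is an ordering
# that moves outward from the root.

tree = {
    frozenset("12"): None,
    frozenset("124"): frozenset("12"),
    frozenset("146"): frozenset("124"),
    frozenset("23"): frozenset("12"),
}
order = [frozenset("12"), frozenset("124"), frozenset("23"), frozenset("146")]

seen = set()
ok = True
for x in order:
    w = tree[x]
    if w is not None:
        # x's intersection with the preceding nodes lies inside w(x)
        ok = ok and (x & seen) <= w
    seen |= x
print(ok)
```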


FIG. 3.13. The propagation back from r to x.

The outward pass of the Lauritzen-Spiegelhalter architecture can be understood in terms of the construction sequences produced by the inward pass. Consider, for example, the action of the outward pass on the path going outward
from the root r to a particular node x (Figure 3.13). It is evident that the
conditionals along this path form a construction chain for the marginal of P
on the variables involved, and the propagation outward in this chain is forward
propagation in the sense of Chapter 2.

3.5. The Aalborg architecture.

As we have seen, the message from x in to w in the Lauritzen-Spiegelhalter
architecture is the Shafer-Shenoy message in that direction, m_{x→w}, while the
message from w back out to x is the product of the Shafer-Shenoy messages in
both directions, m_{x→w} m_{w→x}. When we send m_{x→w} inward, we divide it out of
the table on x in order to compensate for its later return.
The Aalborg architecture takes a more direct tack. In this architecture, we
do not divide m_{x→w} out of the table on x as we send it inward. Instead, we save
m_{x→w} and divide it out of m_{x→w} m_{w→x} when this message comes back. This
requires more storage, but it saves computation, because the division is now in
the domain w ∩ x rather than in the larger domain x. Each entry in m_{x→w}
divides a whole row, as it were, in the table on x, but only a single entry in the
table m_{x→w} m_{w→x}.
Messages are stored in separators, just as in the Shafer-Shenoy architecture.
Each message is computed as in the Lauritzen-Spiegelhalter architecture: the
node marginalizes its current table to the intersection with the node to which
it is sending the message. On the inward pass, we both store the messages in
the separators (as in the Shafer-Shenoy architecture) and multiply them into the
receiving nodes (as in the elementary and Lauritzen-Spiegelhalter architectures).
On the outward pass, the separator divides the outward message by the message
it has stored before passing it on to be multiplied into the table on the receiving
node. (See Figure 3.14.) By the end, the initial table on each node x will have been
multiplied by the Shafer-Shenoy messages from all of x's neighbors. So the final
table on x will be the marginal φ^{↓x}.
When a node w computes a message for its outward neighbor x, its own table
is already its marginal, φ^{↓w}. So the message it sends to the separator is φ^{↓x∩w}.


FIG. 3.14. The inward and outward action of the Aalborg architecture between x and its
inward neighbor w. Here ψ_x and ψ_w are the tables on x and w, respectively, just before x
computes its message to w, and ψ′_x and ψ′_w are the tables just before w computes a message
to send back. The table on w may have changed one or more times as a result of messages
from other outward neighbors and its own inward neighbor.

Since we are more interested in this marginal than in the Shafer-Shenoy message,
we store it in the separator after we forward its quotient by the old message.
The action of the separator on the inward pass seems different from its action
on the outward pass, but Figure 3.15 shows how to describe it in a way that
makes it similar. Instead of beginning with the separator empty, we begin with
it containing 1_{w∩x}, a table of ones. Since In is the same as In/1_{w∩x}, we can
say that here too the separator is sending forward a quotient rather than merely
sending forward the message it receives. Thus we have the uniform action shown
in Figure 3.16; the separator always stores New but sends forward New/Old.
In summary, the Aalborg architecture uses a rooted join tree with a separator
between each pair of neighboring nodes. Initially, each node x has a table φ_x,
and each separator has a table of ones. The propagation follows these rules:
Rule 1. Each nonroot node waits to send its message to a given
neighbor until it has received messages from all its other neighbors.
Rule 2. The root waits to send messages to its neighbors until it has
received messages from them all.
Rule 3. When a node is ready to send its message to a particular
neighbor, it computes the message by marginalizing its current table
to its intersection with this neighbor, and then it sends the message
to the separator between it and the neighbor.
Rule 4. When a separator receives a message New from one of its two
nodes, it divides the message by its current table Old, sends the quotient New/Old on to the other node, and then replaces Old with New.
Rule 5. When a node receives a message, it replaces its current table
with the product of that table and the message.
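The uniform separator action (store New, forward New/Old) can be sketched as follows; the tables, numbers, and the name `separator_step` are our own illustrations, and we assume no zero entries so that plain entrywise division is defined.

```python
# Sketch of the Aalborg separator action: it always stores New and forwards
# New/Old (entrywise).  Starting with a table of ones, the inward pass
# forwards the message unchanged, as in Figure 3.15.

def separator_step(stored, incoming):
    forward = {k: incoming[k] / stored[k] for k in incoming}
    return forward, incoming          # (message passed on, new stored table)

stored = {0: 1.0, 1: 1.0}             # initially a table of ones
inward = {0: 2.0, 1: 4.0}             # message on the inward pass
fwd, stored = separator_step(stored, inward)      # fwd == inward
outward = {0: 6.0, 1: 4.0}            # message coming back on the outward pass
fwd, stored = separator_step(stored, outward)     # fwd == {0: 3.0, 1: 1.0}
```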


FIG. 3.15. If we suppose that the separator begins with a table of ones, then the inward
action is the same as the outward.

FIG. 3.16. The uniform action of the Aalborg architecture: When u sends New to its
neighbor v, the message is intercepted by the separator, which divides it by Old and passes the
quotient on.

Rules 1 and 2 force the propagation to move in to the root and then back out.
At the end of the propagation, the tables on all the nodes and separators are
marginals of φ, where φ = ∏_x φ_x.
Dealing with zeros. We have again been making the simplifying assumption
that there are no negative or zero values in the φ_x, so that division is always
possible. Now let us relax this to the assumption that there are no negative
values, which is sufficient for continuers to exist.
When zeros are not allowed in the table Old, the quotient New/Old is the
unique solution ψ of the equation Old · ψ = New. As it turns out, this equation
can still be solved when we allow zeros; the solution is not unique, but it does
not matter what solution we use. So there are two ways we can proceed. We
can stop talking about division: we can talk instead about solving the equation
Old · ψ = New. Or we can extend the definition of division by picking out a
particular solution of the equation Old · ψ = New and calling it the quotient
New/Old.


We will explore both approaches. First, let us see what happens when we
drop talk about division. Since division appears only in Rule 4, all we need to
do is replace that rule with the following rule:
Rule 4′. When a separator containing Old receives a new message,
say New, it solves the equation

    Old · ψ = New    (3.9)

for ψ and sends ψ on to its other node. It then discards Old and
stores New in its place.
As the following proposition shows, this works; it is always possible to solve
equation (3.9), and doing so produces the result we want.
PROPOSITION 3.6. If there are no negative values in the initial tables on the
nodes, then propagation under Rules 1, 2, 3, 4′, and 5 will result in each node and
separator containing its marginal of φ.
Proof. Since the propagation proceeds inward just as in the elementary architecture, the root will have its marginal at the end of the inward pass. So we can
prove the proposition by induction on the outward pass. Suppose propagation
to w on the outward pass has resulted in the table φ^{↓w} on w, and let us show
that the next step will produce φ^{↓x} on w's outward neighbor x. On the inward
pass, x had sent in m_{x→w}, and w now sends back φ^{↓x∩w}, or m_{x→w} m_{w→x}. So
equation (3.9) can be rewritten as

    m_{x→w} · ψ = φ^{↓x∩w},    (3.10)

or

    m_{x→w} · ψ = m_{x→w} m_{w→x}.    (3.11)

Equation (3.11) obviously has a solution, but it may have more than one. We
need to show that any solution will produce the marginal on x when it multiplies
the table now on x. To this end, let Q_{x∩w→x} be a Lauritzen-Spiegelhalter continuer for x. The current table on x is Q_{x∩w→x} m_{x→w}, so the result of multiplying
it by any solution of equation (3.10) is

    Q_{x∩w→x} m_{x→w} ψ = Q_{x∩w→x} m_{x→w} m_{w→x},

which is equal, by equation (3.5), to φ^{↓x}.


Though the solution ψ of equation (3.11) may not be unique, the range of
choice is simple. Since all the tables involved in the equation are the same size,
the multiplications are all entry-by-entry. When an entry in m_{x→w} is nonzero,
the corresponding entry in ψ is unique; we obtain it by division. When an entry
in m_{x→w} is zero, the corresponding entry in m_{x→w} m_{w→x} is also zero, and so we
can choose the entry in ψ however we please. It is this fact, that we can
choose the entries of ψ arbitrarily when they are not fully determined, that
allows us to handle the situation by extending the definition of division.


In the case at hand, we want to divide one table by another of the same
size, but with an eye to further developments, let us consider a more general
situation, where we want to divide one table by another of the same or possibly
smaller size. Say we want to divide a table B on y by a table A on x, where
x ⊆ y. We will show how to do so under the assumption that whenever an entry
in A is zero, everything in the corresponding row in B is zero, i.e.,

    A(c) = 0 implies B(d) = 0 for every configuration d of y with d^{↓x} = c,    (3.12)

or, equivalently,

    B(d) ≠ 0 implies A(d^{↓x}) ≠ 0.    (3.13)

We will say that A supports B when this condition is met. Given a table A on
x that supports a table B on y, we define a table B/A on y by

    (B/A)(d) = B(d)/A(d^{↓x}) if A(d^{↓x}) ≠ 0, and (B/A)(d) = 0 if A(d^{↓x}) = 0.    (3.14)

Here we have set the value of the quotient equal to zero when the value of the
denominator is zero. Any other number would do just as well for our immediate
purpose, but zero will prove convenient later.
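Equation (3.14) translates directly into code. In this sketch (ours, not the book's), `proj` projects a configuration of y down to x; the sample table A deliberately contains a zero to exercise the extended quotient, and the last line checks Lemma 3.8 numerically.

```python
# Extended division B/A as in equation (3.14): divide entrywise where the
# projected denominator is nonzero, and set the quotient to zero elsewhere.
# Here A lives on x = {B}, B lives on y = {B, C}; proj drops the C coordinate.

def supports(a, b, proj):
    """A supports B: wherever A is zero, the corresponding rows of B are zero."""
    return all(v == 0.0 for k, v in b.items() if a[proj(k)] == 0.0)

def divide(b, a, proj):
    assert supports(a, b, proj)
    return {k: (0.0 if a[proj(k)] == 0.0 else v / a[proj(k)])
            for k, v in b.items()}

A = {0: 2.0, 1: 0.0}
B = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 0.0, (1, 1): 0.0}
Q = divide(B, A, proj=lambda k: k[0])
# Lemma 3.8: multiplying back by A recovers B.
recovered = {k: A[k[0]] * v for k, v in Q.items()}
```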
This extended definition of division immediately yields the following lemma.
LEMMA 3.8. If A supports B, then

    A · (B/A) = B.    (3.15)
This lemma, in turn, yields the following proposition.


PROPOSITION 3.7. If there are no negative values in the initial tables on the
nodes, and we use equation (3.14) as the definition of division, then propagation
under Rules 1, 2, 3, 4, and 5 will result in each node and separator containing its
marginal of φ.
Proof. Since Old is m_{x→w} and New is m_{x→w} m_{w→x}, Old supports New. So
by Lemma 3.8, New/Old, defined as in equation (3.14), is a solution of equation (3.9). So Rule 4 with our extended definition of division is a special case of
Rule 4′, and the proposition follows from Proposition 3.6.
As the following lemma asserts, we can work with extended division in much
the same way that we work with ordinary division. We can combine numerators
and denominators (statement 5), and we can cancel factors in denominators by
multiplication (statement 6).

LEMMA 3.9.
1. If A supports B, then (B/A)(c) = 0 if and only if B(c) = 0.
2. If A supports B, then A supports BC.
3. If A supports B and C supports D, then AC supports BD.
4. If B is a table on y and x ⊆ y, then B^{↓x} supports B.
5. If A supports B, then (B/A) · C = BC/A.
6. If A supports B and C supports D, then (B/A)(D/C) = BD/AC.


7. If A and C both support B, then (B/(AC)) · C = B/A.
8. If A and C both support B, then (B/A)/C = B/(AC). (This may not be true if C
does not support B.)
9. If A on x supports B on y, then (B/A)^{↓x} = B^{↓x}/A.
We leave the proofs of these statements to the reader. In contrast to
Lemma 3.8, most of them (namely, 1 and 5-9) do depend on our having chosen
zero as the value of a quotient when the denominator is zero.
The Aalborg formula. Let us return, for just a moment, to the assumption
that our tables never have zero entries. Write N for the set of nodes, S for the
set of separators, T_x for the current table on the node x, and U_s for the current
table on the separator s. At the beginning of the propagation, T_x = φ_x and U_s = 1_s,
and hence

    φ = ∏_{x∈N} T_x / ∏_{s∈S} U_s.
At each step, we change the table on one node and on one separator. The table
on the node is multiplied by New/Old, and the table on the separator is changed
from Old to New; i.e., it also is multiplied by New/Old. Since the table on the
node is multiplied by the same factor as the table on the separator, the ratio

    ∏_{x∈N} T_x / ∏_{s∈S} U_s    (3.16)

does not change; it is always equal to φ:

    φ = ∏_{x∈N} T_x / ∏_{s∈S} U_s.    (3.17)

This is the Aalborg formula. In words, the function whose marginals we want is
always the ratio of the product of the tables on the nodes to the product of the
tables on the separators.
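The invariance behind the Aalborg formula can be checked numerically on the smallest possible join tree. The sketch below is our own illustration with made-up numbers: it performs one Aalborg update step (send a marginal through the separator, multiply the neighbor by New/Old, store New in the separator) and verifies that the ratio is unchanged:

```python
import numpy as np

# Tiny join tree: node1 on {X, Y}, node2 on {Y, Z}, separator on {Y}.
phi1 = np.array([[0.1, 0.4],
                 [0.3, 0.2]])   # table on (X, Y)
phi2 = np.array([[0.6, 0.4],
                 [0.5, 0.5]])   # table on (Y, Z)

T1, T2 = phi1.copy(), phi2.copy()
U = np.ones(2)                  # separator table on Y, initially 1_s

def joint(T1, T2, U):
    # Ratio of node products to separator products, as a table on (X, Y, Z).
    return T1[:, :, None] * T2[None, :, :] / U[None, :, None]

phi = joint(T1, T2, U)          # the target function phi1(x,y) * phi2(y,z)

# One Aalborg step: node1 sends its Y-marginal through the separator.
new = T1.sum(axis=0)            # New = marginal of T1 on Y (here all nonzero)
T2 = T2 * (new / U)[:, None]    # the neighbor multiplies by New/Old
U = new                         # the separator now holds New

# The ratio of products is unchanged by the step.
assert np.allclose(joint(T1, T2, U), phi)
```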
The Aalborg formula still holds even if zero entries are allowed in our tables,
but the reasoning with which we established it holds only if we plug a couple of
holes.
First, we must check that ∏_{s∈S} U_s always supports ∏_{x∈N} T_x, so that the
ratio (3.16) is defined. To check this, we write x(s) for the outward neighbor
of the separator s. Since U_s, if it is not equal to 1_s, is a marginal of T_{x(s)}, U_s
supports T_{x(s)} (statement 4 of Lemma 3.9). Hence ∏_{s∈S} U_s supports ∏_{s∈S} T_{x(s)}
(statement 3) and also T_r ∏_{s∈S} T_{x(s)}, where r is the root (statement 2), which is equal to ∏_{x∈N} T_x.
Second, we must check that multiplying the top and bottom of the ratio (3.16) by New/Old will not change it. This follows from statements 6 and 8
of Lemma 3.9, together with the fact that New/Old supports the numerator. We
know that New/Old supports the numerator because New is a marginal of one
of its factors, and by statement 1 of Lemma 3.9, New/Old supports whatever
New supports.


CHAPTER 3

There is one point of notation that should be clarified in connection with the
Aalborg formula. For simplicity, we have been using a notation that identifies
each node x with a set of variables. We could also identify each separator with a
set of variables; we could say that the separator s between the nodes u and v is
equal to u ∩ v. It is better, however, to assume that the names of the separators
are distinct from the sets of variables involved, for two or more separators might
involve the same set of variables. (We might have one pair of neighboring nodes
u₁ and v₁ and another pair u₂ and v₂ with u₁ ∩ v₁ = u₂ ∩ v₂.) It would burden
our notation unnecessarily for us to introduce distinct symbols for the separator
and its set of variables, but the distinction should be kept in mind, even when,
as will happen shortly, we write as if they are the same.
Loading the separators. Though we have presented the Aalborg architecture
under the assumption that the tables on the separators are initially tables of
ones, this assumption too can be relaxed. Suppose we put nonnegative tables
T_x and U_s on the nodes and separators in such a way that the table on each
separator supports the tables on the neighboring nodes. Then the denominator
in equation (3.16) supports the numerator. If we set the quotient equal to φ and
propagate by the Aalborg rules, then we have the following proposition.
PROPOSITION 3.8. At the end of the propagation, the tables on the nodes and
separators will be the corresponding marginals of φ.
Proof. By statements 5 and 6 of Lemma 3.9,

    ∏_{x∈N} T_x / ∏_{s∈S} U_s = T_r ∏_{s∈S} (T_{x(s)}/U_s),

where x(s) is the outward neighbor of the separator s. This suggests that we
compare propagation with U_s on s and T_x on x to propagation with 1_s on s, T_r
on r, and T_{x(s)}/U_s on x(s). Call the former the loaded propagation (because the
separators are loaded at the beginning) and the latter the adjusted propagation
(because the tables on the nodes are adjusted). We know that the adjusted
propagation results in the marginals of φ on all the nodes and separators; let us
show that the loaded propagation gives the same results.
For the moment, we reserve T_x and U_s for the initial tables in the loaded
propagation; we write T_x^loaded and U_s^loaded for the current tables in the loaded
propagation and T_x^adjusted and U_s^adjusted for the current tables in the adjusted
propagation. Initially,

    T_r^loaded = T_r^adjusted,    (3.19)
    U_s^loaded = U_s U_s^adjusted,    (3.20)

and

    T_{x(s)}^loaded = U_s T_{x(s)}^adjusted.    (3.21)
These equations will hold throughout the inward pass, for if they hold before an
inward step, they hold after it. To see this, write Mx(s^s for the message from


x(s) to s on the inward pass. We have

    M_{x(s)→s}^loaded = U_s M_{x(s)→s}^adjusted;

the inward loaded message from x(s) is multiplied by U_s in comparison with the
inward adjusted message. Since this is the new table for s, equation (3.20) will
still hold. But the loaded propagation divides U_s out before sending the message
on to the neighbor w; hence the message multiplied into w is the same in the two
propagations, and the relation between T_w^loaded and T_w^adjusted (equation (3.19) or
(3.21)) will also be unaffected.
Since the root has the same table at the end of the inward pass in the two
propagations, it sends the same messages back out. So we can complete the
proof by induction on the outward pass. We need only show that if the message
from w back out to s is the same in the two propagations, then the table on x(s)
will end up the same. But if we write M_{w→s} for the message from w back to s,
then the table we get on x(s) in the loaded propagation is

    T_{x(s)}^loaded (M_{w→s}/U_s^loaded) = U_s T_{x(s)}^adjusted M_{w→s}/(U_s U_s^adjusted) = T_{x(s)}^adjusted (M_{w→s}/U_s^adjusted),

which is the table we get in the adjusted propagation.


The Aalborg formula can be used to find a probability distribution that has
given marginals.
PROPOSITION 3.9. Suppose we are given a probability distribution T_x for
each node x in a join tree. And suppose these distributions are consistent in
the sense that for neighboring nodes x and y, T_x^{↓x∩y} = T_y^{↓x∩y}. Set U_s, for the
separator s between x and y, equal to this common marginal. Then the function φ given by equation (3.17) is a probability distribution with the T_x as its
marginals.
Proof. When we run the Aalborg propagation, nothing changes. The tables
on the separators are already the marginals of the tables on the nodes, so the
message to the separator is always identical with the table already there, and
the ratio, which is passed on to the neighboring node, is always a table of ones.
So the tables are already the marginals of φ. And any nonnegative table with a
probability distribution as a marginal is itself a probability distribution.
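Proposition 3.9 is easy to check numerically. The sketch below is our own illustration for a two-node join tree; the common marginal U here has no zero entries, so ordinary division suffices:

```python
import numpy as np

# Two consistent node distributions on the chain join tree {X,Y} - {Y,Z}.
T1 = np.array([[0.10, 0.30],
               [0.20, 0.40]])      # P(X, Y); marginal on Y is [0.3, 0.7]
T2 = np.array([[0.15, 0.15],
               [0.35, 0.35]])      # P(Y, Z); marginal on Y is [0.3, 0.7]
U = T1.sum(axis=0)                 # common marginal on the separator Y
assert np.allclose(U, T2.sum(axis=1))   # consistency condition

# The function of Proposition 3.9: phi(x,y,z) = T1(x,y) T2(y,z) / U(y).
phi = T1[:, :, None] * T2[None, :, :] / U[None, :, None]

assert np.isclose(phi.sum(), 1.0)       # phi is a probability distribution
assert np.allclose(phi.sum(axis=2), T1) # its marginal on (X, Y) is T1
assert np.allclose(phi.sum(axis=0), T2) # its marginal on (Y, Z) is T2
```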

3.6. COLLECT and DISTRIBUTE.

The three major architectures we have studied in this chapter (the Shafer-Shenoy, Lauritzen-Spiegelhalter, and Aalborg architectures) move inward in a
tree and then back outward. How should we organize or program this movement? This is a very general question, for many computations are tree recursive.
But we should take a moment to consider it.
We have described each of the three architectures by giving, along with rules
for what the nodes do, rules for when they are allowed to do it. The simplicity
of this description made it convenient for the theoretical understanding we have
been seeking, but at the programming level, it suggests rather expensive control

regimes. Were the nodes independent processors, we seem to be suggesting a
regime in which each node constantly checks on whether it is allowed to act. In
a serial machine, we seem to be suggesting a regime (as in a rule-based program)
in which we constantly search for nodes that are ready to act (rules that are
ready to fire).
A more economical approach is to use the connections of the tree to propagate
signals to act as well as the results of actions. To trigger the inward pass, we
can have the root ask for inward messages from its neighbors, which, in order
to comply with the request, must ask for inward messages from their other
neighbors, and so on. To trigger the outward pass, we can have the root send
messages to its neighbors, together with the request that they pass messages on
to their other neighbors, and so on.
If we run the propagation in this way, the root need not be specified in the
data structure representing the tree; it is merely the node at which we begin the
propagation. Having propagated with one node as the root, and perhaps then
having made changes in the input tables, we can propagate with a different node
as the root.
The tree itself can be represented in object-oriented fashion, with each node
as an object. Each node has a list of neighbors and the ability to communicate
with these neighbors. At a coarse level of description that is common to all three
architectures, a node has two actions, COLLECT, which is used on the inward
pass, and DISTRIBUTE, which is used on the outward pass. Both actions can be
called from outside the system or from a neighboring node. These actions are
recursive, and they also trigger a more basic action, SENDMESSAGE.
When the action COLLECT is called in a node from outside the system, that
node in turn calls COLLECT in all its neighbors. When COLLECT is called in a
node by a neighbor, that node calls COLLECT in all its other neighbors and also,
after the neighbors have completed their action, performs SENDMESSAGE to the
neighbor that made the call. This means that we can trigger the inward pass
simply by calling COLLECT in the node that we want to act as the root. The
call is automatically relayed out toward the leaves, and when it has reached the
leaves, the messages come back in (Figure 3.17).
When the action DISTRIBUTE is called in a node from outside the system,
that node performs SENDMESSAGE to each neighbor and then calls DISTRIBUTE
in that neighbor. When DISTRIBUTE is called in a node by a neighbor, the node
performs SENDMESSAGE to and calls DISTRIBUTE in its other neighbors. So we
can trigger the outward pass by calling DISTRIBUTE in the node we have chosen
to be the root. The call will automatically move outward in the tree, preceded
by outward messages (Figure 3.18).
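The recursive triggering of COLLECT and DISTRIBUTE just described can be sketched in Python. The class Node, the helper link, and the trace list are our illustrative scaffolding, and send_message is a stub standing in for the architecture-specific action:

```python
class Node:
    def __init__(self, name, trace):
        self.name = name
        self.neighbors = []
        self.trace = trace          # shared log of messages, for inspection

    def send_message(self, to):
        # Architecture-specific work would go here; we only record the call.
        self.trace.append((self.name, to.name))

    def collect(self, caller=None):
        for n in self.neighbors:
            if n is not caller:
                n.collect(self)         # relay the call outward ...
        if caller is not None:
            self.send_message(caller)   # ... then the messages come back in

    def distribute(self, caller=None):
        for n in self.neighbors:
            if n is not caller:
                self.send_message(n)    # the outward message precedes the call
                n.distribute(self)

def link(a, b):
    a.neighbors.append(b)
    b.neighbors.append(a)

# A small tree: r - a - b, and r - c.  Calling collect at r and then
# distribute at r performs the inward and outward passes.
trace = []
r, a, b, c = (Node(x, trace) for x in "rabc")
link(r, a); link(a, b); link(r, c)

r.collect()
inward = list(trace)    # [('b', 'a'), ('a', 'r'), ('c', 'r')]
trace.clear()
r.distribute()
outward = list(trace)   # [('r', 'a'), ('a', 'b'), ('r', 'c')]
```

As the traces show, every message on the inward pass moves toward r, and every message on the outward pass moves away from it, with no root marked in the data structure itself.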
The action SENDMESSAGE differs from architecture to architecture. In the
Lauritzen-Spiegelhalter architecture, there are actually two distinct SENDMESSAGE actions: SENDMESSAGEIN, which is used by COLLECT, and SENDMESSAGEOUT, which is used by DISTRIBUTE. But the other two architectures,
the Shafer-Shenoy architecture and the Aalborg architecture, use the same
SENDMESSAGE in COLLECT as in DISTRIBUTE.


FIG. 3.17. After COLLECT is called outward from the root, messages move inward.

In the Lauritzen-Spiegelhalter architecture, SENDMESSAGEIN affects both
the sending node and the receiving node. The message sent is divided out of
the table in the first and multiplied into the table in the second. The action
SENDMESSAGEOUT, on the other hand, affects only the receiving node.
The description of SENDMESSAGE in the Shafer-Shenoy and Aalborg architectures is affected by where we place the separators. In the case of the Shafer-Shenoy architecture, it is most convenient to split the separator and put each
storage register in the node to which its messages are directed, so that the effect
of SENDMESSAGE is to fill the storage register in the receiving node. In the case
of the Aalborg architecture, it seems most appropriate to place copies of the
separator in both nodes; when a message is sent, it is stored in the copy in the
sending node and then sent to the receiving node, where it is stored again after
being used to compute the quotient that is multiplied into the node's main table.
To complete the picture, we can also provide each node with a REPORT
action, which results in the node's marginal being sent to the user of the system.
In the Lauritzen-Spiegelhalter and Aalborg architectures, this action involves
no computation, but in the Shafer-Shenoy architecture, it requires the node to
collect the messages in its separators and multiply them all into its main table.
We can make REPORT an action that is called from outside the system, or we
can make it part of DISTRIBUTE, so that marginals are reported as the outward
pass proceeds.

FIG. 3.18. As DISTRIBUTE is called outward from the root, messages move outward.

3.7. Scope and alternatives.

Join-tree propagation may or may not succeed in finding marginals of a particular product of tables. It will not succeed if the belief net is so highly connected that no feasible join-tree cover exists. In this case, we may be able to
use approximate rather than exact methods. Presently, the most widely used
approximate methods are Gibbs sampling and its cousins, methods now collectively called "Markov-chain Monte Carlo." These methods were proposed
for probabilistic expert systems by Pearl [43], but they have been less successful for expert systems than for vision (Geman and Geman [29]) and Bayesian
statistics (Besag et al. [13]). The small or zero conditional probabilities often
encountered in expert systems, where a priori knowledge is stronger, tend to
violate the conditions that allow the Markov-chain methods to converge. A recent candidate to fill the gap left by the weakness of Markov-chain methods for
expert systems is mean-field theory, also borrowed from statistical physics (Saul
et al. [44]).
In this chapter, we have discussed only the problem of finding marginals
of probability distributions given as products of tables. In principle, join-tree
propagation is applicable to finding marginals in any other problem in which the
transitivity and combination axioms are satisfied. (Examples are given in the
exercises.) There arc, however, problems in which the axioms are satisfied but
the operations are not feasible. Join-tree propagation depends on marginaliza-

PROPAGATION IN JOIN TREES

67

tion and multiplication being computationally feasible in small domains (small


numbers of variables), and sometimes it is not. Continuous probability densities provide an example. We know how to marginalize (integrate) in many
parametric families of densities, but multiplication usually takes us outside the
parametric family, producing densities that are difficult to integrate, even if
only a few variables are involved. As a practical matter, join-tree propagation for continuous densities has been limited mainly to the multivariate normal distribution, where it is often discussed in connection with the Kalman
filter.
We should also note another limitation of the join-tree method: in general,
it only helps us find marginals for small clusters of variables. In many problems,
we want to compute other numbers: probabilities involving many variables and
expectations. Markov-chain Monte Carlo, when it works, allows us to compute
these numbers as well.
Exercises.
EXERCISE 3.1. How great is the computational advantage of the Lauritzen-Spiegelhalter architecture over the Shafer-Shenoy architecture? For a first pass
at answering this question, you may wish to assume that each nonleaf in the join
tree has the same number of neighbors (the tree's "branching factor"), that each
variable has the same number of elements in its frame, and that each node has
the same number of variables in common with its branch as well as the same
number of new variables.
EXERCISE 3.2. Compare the three architectures on the basis of their storage
requirements. Consider the case where we need to keep the initial inputs and the
case where we do not.
EXERCISE 3.3. Show how to use join-tree computation to find P^{↓w}(x) for
any set w of variables and any single configuration x of w, even if w is too large
to be contained in any node of the join tree. (Hint: Pretend x is observed, and
exploit the fact that P^{↓w}(x) is the inverse of the normalizing constant for the
posterior probabilities.)
EXERCISE 3.4. Discuss ways of measuring the amount of computation required by a join tree. (In the introduction to Chapter 3, two measures were
suggested: the sum of the sizes of the frames, and the size of the largest frame.)
Discuss the issue separately for probability propagation and for each of the problems listed in Exercise 1.2.
EXERCISE 3.5. Verify that the elementary and Shafer-Shenoy architectures
always work in the abstract framework you formulated in Exercise 1.5.
EXERCISE 3.6. Explore the analogy between the outward pass of the
Lauritzen-Spiegelhalter architecture and the outward pass in recursive dynamic
programming, in which solutions of reduced problems are used to build up an
overall solution (Mitten [40], Bertele and Brioschi [1], Shenoy [46]). Formulate
an abstract theory that includes both examples as special cases.


EXERCISE 3.7. What constraints must be imposed on the placement of conditionals in the nodes of a join tree in order for the results of Shafer-Shenoy
computations to remain within the partial semigroup of conditionals? (See Exercise 2.5.) Explore conditions on the existence of continuers that allow the
Lauritzen-Spiegelhalter architecture to work in this context.
EXERCISE 3.8. In some problems, the mathematical objects that one combines can be embedded in a larger class that comes closer to being a group, so
that the division required by the Aalborg architecture is possible. Discuss the
extent to which this is possible in the examples considered in Exercise 1.2.

CHAPTER 4

Resources and References

4.1. Meetings.

The annual Conference on Uncertainty in Artificial Intelligence (UAI) plays a
leading role in the development of probabilistic, belief-function, fuzzy, and qualitative expert systems. Papers given in its first six years (1985-1990) were collected and published by North-Holland in a series entitled Uncertainty in Artificial Intelligence. Proceedings of subsequent meetings have been published by
Morgan Kaufmann. The Association for Uncertainty in Artificial Intelligence,
the sponsor of the conference, has a site on the World-Wide Web:
http://www.auai.org/
This site gives instructions for subscribing to the association's electronic mailing
list and includes links to many other sources of information about the management of uncertainty in expert systems.
The biennial International Workshop on Artificial Intelligence and Statistics
is also devoted in part to uncertainty in expert systems. The Web site for its
sponsor, the Society for Artificial Intelligence and Statistics, is
http://www.vuse.vanderbilt.edu/~dfisher/ai-stats/society.html
This site is maintained by Douglas H. Fisher at Vanderbilt University.
Another important conference for this community is the International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU), which has been held biennially since 1986. The proceedings of the most recent conference, held in Paris in 1994, were published
by Springer-Verlag in 1995 under the title Advances in Intelligent Computing,
edited by Bernadette Bouchon-Meunier, Ronald R. Yager, and Lotfi A. Zadeh.
4.2. Software.

A number of software packages for probabilistic expert systems are available.
The most highly developed is the commercial product HUGIN. Developed at
Aalborg, Denmark, it uses the Aalborg architecture described in Chapter 3.
Information on HUGIN can be obtained at:
http://www.hugin.dk

The most thorough implementation of the Shafer-Shenoy architecture is Pulcinella. Developed by the IRIDIA research group in Brussels, it handles belief
functions, categorical judgments, and possibility measures as well as probabilities. It is implemented in Common Lisp and is distributed free. Information is
available from IRIDIA's Web site:
http://iridia.ulb.ac.be/pulcinella/
Further information on these and other packages, some commercial and some
free, is available at a Web site maintained by Russell Almond:
http://bayes.stat.washington.edu/almond/belief.html
4.3. Books.

There are now many excellent books on probabilistic expert systems and related
topics.
[1] Bertele, Umberto, and Francesco Brioschi (1972). Nonserial Dynamic Programming. Academic Press. New York. A readable treatment of join-tree computation for decomposable dynamic programming problems.
[2] Diestel, R. (1990). Graph Decompositions. Clarendon Press. Oxford. A general perspective on decompositions of the type exemplified
by join trees, with hints at the diversity of the applied problems that
inspire these decompositions.
[3] Jensen, Finn V. (1996). An Introduction to Bayesian Networks.
University College Press. London. An engaging and readable introduction to probabilistic networks, with an emphasis on construction
and computation within the Aalborg architecture.
[4] Judd, J. Stephen (1990). Neural Network Design and the Complexity of Learning. MIT Press. Cambridge. This interesting and
readable book demonstrates the relevance of join-tree ideas to the
problem of learning in neural networks.
[5] Lauritzen, Steffen L. (1996). Graphical Models. Oxford University Press. London. A superb treatment of probabilistic networks as
models for data, this book marries probabilistic expert systems with
up-to-date statistical methodology. Relatively comprehensive, it covers undirected as well as directed graphs, and continuous (normal)
as well as discrete probability distributions. Its greatest originality
lies in its treatment of mixed cases: chain graphs, which combine
directed and undirected graphs, and models with both discrete and
continuous variables.
[6] Neapolitan, Richard E. (1990). Probabilistic Reasoning in Expert Systems.
John Wiley. New York. This readable book covered the state of the



art in computation in probabilistic expert systems at the time of its
publication. It is now somewhat dated.
[7] Oliver, Robert M., and James Q. Smith, eds. (1990). Influence
Diagrams, Belief Nets, and Decision Analysis. John Wiley. New
York. Still a good introduction to the motivations behind influence
diagrams, which generalize probabilistic expert systems by including
variables representing a user's decisions. It includes an introductory essay by Ron Howard, the most influential proponent of these
diagrams.
[8] Pearl, Judea (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann. San Mateo, California. In a series of
articles preceding this book, its author initiated the study and use
of probabilistic expert systems as the term is now understood. The
book, lively and energetic, introduced them to a wide audience.
[9] Shafer, Glenn (1996). The Art of Causal Conjecture. MIT Press.
Cambridge. A study of causality in terms of the dynamics of probability, this book shows that the causal interpretation of probabilistic
expert systems, like the causal interpretation of other statistical models, is often complex: models may have more than one possible causal
interpretation. This book also explores some generalizations of the
DAG structure.
[10] Shafer, Glenn, and Judea Pearl, eds. (1990). Readings in Uncertain Reasoning. Morgan Kaufmann. San Mateo, California. This
volume collects classic and recent papers on uncertain reasoning in
artificial intelligence. Probabilistic, belief-function, fuzzy, and qualitative approaches are included.
[11] Spirtes, Peter, Clark Glymour, and Richard Scheines (1993).
Causation, Prediction, and Search. Lecture Notes in Statistics 81.
Springer-Verlag. New York. This monograph explores a variety of
non-Bayesian ideas for constructing belief nets from data. The emphasis is on using limited a priori assumptions about causal relations
among variables together with observed independencies among those
variables.
[12] Whittaker, J. (1990). Graphical Models in Applied Multivariate
Statistics. John Wiley. Chichester. A pioneering statistical treatment of belief nets, emphasizing the multivariate normal distribution.
Many examples.
4.4. Review articles.

These articles review several topics mentioned in preceding chapters.


[13] Besag, Julian, Peter Green, David Higdon, and Kerrie Mengersen
(1995). Bayesian computation and stochastic systems (with
discussion). Statistical Science. 10, pp. 1-66. A review of Markov-chain Monte Carlo methods, with an emphasis on Bayesian statistical
problems.
[14] Buntine, Wray (1996). A guide to the literature on learning
graphical models. IEEE Transactions on Knowledge and Data Engineering. An excellent review of the problem of selecting graphical
models for probabilistic expert systems on the basis of data.
[15] Charniak, Eugene (1991). Bayesian networks without tears. AI
Magazine. Winter 1991, pp. 50-63. A nontechnical introduction
to belief nets, especially useful for students with limited interest in
mathematical probability theory.
[16] Dempster, A. P. (1971). An overview of multivariate data analysis. Journal of Multivariate Analysis. 1, pp. 316-346. This classic
article includes a discussion of the limitations of the multivariate
framework, limitations still not overcome in the main body of work
in statistics and probabilistic expert systems.
[17] Neal, Radford M. (1993). Probabilistic inference using Markov
chain Monte Carlo methods. Technical Report. Department of Computer Science. University of Toronto. In contrast to Besag et al., this
review emphasizes probabilistic expert systems.
[18] Rabiner, L. R. (1989). A tutorial on hidden Markov models and
selected applications in speech recognition. Proceedings of the IEEE.
77, pp. 257-286. Still one of the best introductions to hidden Markov
models.
[19] Spiegelhalter, David J., A. Philip Dawid, Steffen L. Lauritzen,
and Robert G. Cowell (1993). Bayesian analysis in expert systems
(with discussion). Statistical Science. 8, pp. 219-283. Currently
the best brief overview of the state of the art of probabilistic expert
systems.
[20] Tatman, J. A., and Ross Shachter (1990). Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man,
and Cybernetics. 20, pp. 365-379. This article reviews influence diagrams, which generalize belief nets by including nodes for decisions,
and shows how dynamic programming can be understood within the
framework of influence diagrams.
[21] Xu, Hong, and Robert Kennes (1994). Steps towards an efficient implementation of Dempster-Shafer theory. Advances in the
Dempster-Shafer Theory of Evidence. R. R. Yager, M. Fedrizzi, and
J. Kacprzyk, eds. John Wiley. New York. Pp. 153-174. This article
reviews various ways of making the Shafer-Shenoy architecture as
efficient as possible for belief functions.

4.5. Other sources.

This is not a comprehensive bibliography of the very extensive work on probabilistic expert systems, but it contains the articles and dissertations that have
most engaged the author's attention.
[22] Beeri, Catriel, Ronald Fagin, David Maier, and Mihalis Yannakakis (1983). On the desirability of acyclic database schemes.
Journal of the Association for Computing Machinery. 30, pp. 479-513. This very widely cited paper first introduced the idea of a join
tree into the literature on relational databases. It is also responsible
for the name "join tree."
[23] Cano, Jose, Miguel Delgado, and Serafin Moral (1993). An axiomatic framework for propagating uncertainty in directed acyclic
networks. International Journal of Approximate Reasoning. 8, pp.
253-280. This article extends the axioms for join-tree computation,
discussed in Chapter 1 and in Shenoy and Shafer [48], to computation within directed acyclic graphs, in the style developed in Pearl's
Probabilistic Reasoning in Intelligent Systems [8].
[24] Cooper, Gregory F., and Edward Herskovits (1992). A Bayesian
method for the induction of probabilistic networks from data. Machine Learning. 9, pp. 309-347. An influential exposition of a
straightforward Bayesian approach to choosing and parametrizing a
DAG from data for a given set of variables. The method developed
in this article can be contrasted with the non-Bayesian methods developed in Spirtes, Glymour, and Scheines's Causation, Prediction,
and Search [11].
[25] Cowell, Robert G., and A. Philip Dawid (1992). Fast retraction
of evidence in a probabilistic expert system. Statistics and Computing. 2, pp. 37-40. Using out-marginalization (see Exercise 1.4),
this article gives a quick join-tree algorithm for adjusting marginal
probabilities to allow for the omission of previously included observations. The algorithm allows efficient computation of statistics for
monitoring the performance of a belief net.
[26] Cox, David R., and Nanny Wermuth (1993). Linear dependencies
represented by chain graphs (with discussion). Statistical Science. 8,
pp. 204-283. Taking DAGs and chain graphs as a starting point,
this article discusses a wide variety of graphical representations of
multivariate probability distributions.
[27] Dawid, A. Philip (1980). Conditional independence for statistical
operations. Annals of Statistics. 8, pp. 598-617. This pioneering
article studies general properties of conditional independence that
were later studied as axioms by Judea Pearl.


[28] Dempster, A. P. (1990). Normal belief functions and the Kalman


filter. Technical Report. Department of Statistics. Harvard University.
[29] Geman, Stuart, and Donald Geman (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.
IEEE Transactions on Pattern Analysis and Machine Intelligence. 6,
pp. 721-741. This article shows how image-analysis problems can be
modeled so that the computation problems are susceptible to resolution by Gibbs sampling. Very much influenced by the work of Ulf
Grenander, the article was in turn very influential in vision, artificial
intelligence, and Bayesian statistics.
[30] Heckerman, David (1990). Probabilistic similarity networks.
Networks. 20, pp. 607-636. This article explores an interesting generalization of belief networks, in which the factorization that permits
representation by a DAG may apply only conditionally on some values of the preceding variables.
[31] Jensen, Finn V. (1991). Calculation in HUGIN of probabilities
for specific configurations, a trick with many applications. Scandinavian Conference on Artificial Intelligence 91. IOS Press. Burke,
Virginia. Pp. 176-186. This article puts the trick of Exercise 3.3 to
use for practical tasks in probabilistic expert systems: comparison of
competing hypotheses, analysis of conflicts in data, and evaluation
of approximate calculations.
[32] Jensen, Finn V. (1995). Cautious propagation in Bayesian networks. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence. Philippe Besnard and Steve Hanks, eds. Morgan
Kaufmann. San Mateo, California. Pp. 323-328. This article uses
the Shafer-Shenoy architecture to supply a more general solution to
the problem considered by Cowell and Dawid [25].
[33] Jensen, Finn V., and Frank Jensen (1994). Optimal junction
trees. Proceedings of the 10th Conference on Uncertainty in Artificial
Intelligence. R. L. Mantaras and D. Poole, eds. Morgan Kaufmann.
San Mateo, California. Pp. 360-366. Even when sets of variables can
be arranged in a join tree, there may be more than one arrangement,
some more efficient than others. This paper presents an algorithm
for choosing an optimal one.
[34] Jensen, Finn V., Steffen L. Lauritzen, and K. G. Olesen (1990).
Bayesian updating in causal probabilistic networks by local computation. Computational Statistics Quarterly. 4, pp. 269-282. This
article, all of whose authors work at the University of Aalborg in
Aalborg, Denmark, introduced the architecture named after that city
in Chapter 3.

[35] Kjærulff, Uffe (1992). Optimal decomposition of probabilistic
networks by simulated annealing. Statistics and Computing. 2, pp.
7-17. This article suggests a sophisticated heuristic for near-optimal
join trees (or, in the terminology it uses, near-optimal "decompositions" or "triangulations"). It also gives references to other heuristics.
[36] Kong, Augustine (1986). Multivariate belief functions and graphical models. Doctoral dissertation. Department of Statistics. Harvard University. This dissertation spells out how the concept of jointree cover is related to the concept of triangulation, which is used
more often in the older literature. It also studies some heuristics for
rinding join-tree covers or triangulations.
[37] Lauritzen, Steffen, and David Spiegelhalter (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal
Statistical Society, Series B. 50, pp. 157-224. This classic article introduced probabilistic expert systems to the statistical community.
It is the source of the Lauritzen-Spiegelhalter architecture discussed
in Chapter 3. The reader of this article should be cautioned that the
heuristic it uses for finding join-tree covers, maximum cardinality
search, gives rather poor results in general. See [35] and [36].
[38] Li, Zhaoyu, and Bruce D'Ambrosio (1994). Efficient inference
in Bayes networks as a combinatorial optimization problem. International Journal of Approximate Reasoning. 11, pp. 55-81. The authors formulate the problem of finding an optimal order for summing
variables out as a combinatorial problem.
[39] Mellouli, Khaled (1987). On the propagation of beliefs in networks using the Dempster-Shafer theory of evidence. Doctoral dissertation. School of Business. University of Kansas. This dissertation
includes a demonstration that the class of join-tree covers obtained
by summing out is always large enough to include optimal join-tree
covers.
[40] Mitten, L. G. (1964). Composition principles for synthesis of
optimal multistage processes. Operations Research. 12, pp. 610-619.
An early exploration of the extent of applicability of recursive methods for optimization such as those described in Bertele and Brioschi's
book.
[41] Ndilikilikesha, Pierre C. (1994). Potential influence diagrams.
International Journal of Approximate Reasoning. 10, pp. 251-285.
This article shows how influence diagrams can be solved using a
rooted join tree.
[42] Pearl, Judea (1986). Fusion, propagation, and structuring in belief networks. Artificial Intelligence. 29, pp. 241-288. An extremely
influential contribution to computation in belief nets, emphasizing
methods that preserve a net's directed semantics in the course of
the computation. The material in this article was incorporated into
Pearl's 1988 book [8].
[43] Pearl, Judea (1987). Evidential reasoning using stochastic simulation. Artificial Intelligence. 32, pp. 245-257. This may be the first
proposal to use Markov-chain Monte Carlo for computations in belief
nets. The method had long been used in statistical physics and in
operations research.
[44] Saul, Lawrence K., Tommi Jaakkola, and Michael I. Jordan
(1995). Mean field theory for sigmoid belief networks. Computational Cognitive Science Technical Report 9501, Center for Biological
and Computational Learning. Massachusetts Institute of Technology. This article sketches a program for borrowing the idea of mean-field theory from statistical physics in order to address the problem of approximate computation in belief nets with extremely high
connectivity.
[45] Shafer, Glenn, Prakash P. Shenoy, and Khaled Mellouli (1987).
Propagating belief functions in qualitative Markov trees. International Journal of Approximate Reasoning. 1, pp. 349-400. This
paper explores a way of understanding constraint propagation and
belief-function computation abstractly, without variables.
[46] Shenoy, Prakash P. (1991). Valuation-based systems for discrete optimization. Uncertainty in Artificial Intelligence 6. P. P.
Bonissone, M. Henrion, L. N. Kanal, and J. F. Lemmer, eds. North-Holland. Amsterdam. Pp. 385-400. The abstract understanding of
inward and outward passes in join-tree computation in this article
generalizes the method of nonserial dynamic programming discussed
by Bertele and Brioschi [1].
[47] Shenoy, Prakash P. (1994). Representing conditional independence relations by valuation networks. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 2, pp. 143-165.
This article advances a general framework for propagating information in expert systems. Shenoy's framework applies not only to probability but also to belief functions and other calculi satisfying the
axioms of Chapter 1.
[48] Shenoy, Prakash P., and Glenn Shafer (1990). Axioms for probability and belief-function propagation. Uncertainty in Artificial Intelligence 4. R. D. Shachter, T. S. Levitt, L. N. Kanal, and J. F.
Lemmer, eds. North-Holland. Amsterdam. Pp. 169-198. The axioms for join-tree computation, discussed in Chapter 1, were first
isolated in this article. The article also describes the Shafer-Shenoy
architecture.

[49] Srivastava, Rajendra P., and Glenn Shafer (1992). Belief-function formulas for audit risk. The Accounting Review. 67, pp.
249-283. This article discusses the propagation of evidence for financial audits, using belief functions rather than probabilities.
[50] Wermuth, Nanny, and Steffen L. Lauritzen (1990). On substantive research hypotheses, conditional independence graphs, and
graphical chain models (with discussion). Journal of the Royal Statistical Society, Series B. 52, pp. 21-50. This wide-ranging article
includes a good introduction to the uses of chain graphs.
[51] Xu, Hong, and Philippe Smets (1996). Reasoning in evidential
networks with conditional belief functions. International Journal of
Approximate Reasoning. 14, pp. 158-185. This article adds a concept
of conditionals to the theory of belief functions and shows how they
can be implemented in join-tree computation.
[52] Zhang, Nevin Lianwen, Runping Qi, and David Poole (1994). A
computational theory of decision networks. International Journal of
Approximate Reasoning. 11, pp. 83-158. This article extends join-tree computation to influence diagrams and even to slightly more
general networks; forgetting is allowed.



Index

Aalborg architecture, 56
Aalborg formula, 61
audit evidence, 29
Bayesian network, 22
Bayesian statistics, 66
belief chain, 25, 33
belief functions, 15
belief net, 21
bubble graph, 27
categorical variables, 13
chain, 25
chain graph, 30
COLLECT, 64
combination axiom, 5
computational cost, 50, 67
conditional, 5, 18
conditional probabilities, 5
conditioning, 10
configuration, 2
constraint propagation, 36
construction chain, 28
construction sequence, 19, 54
constructive interpretation of probability, 9
continuer, 7, 15, 16, 18, 53
DAG, 21
  construction ordering, 22
  initial segment, 23
density, 3
directed acyclic graph, 21
DISTRIBUTE, 64
domain, 3
dynamic programming, 36
elementary architecture, 43
expectation, 12
extended division, 60
factorization, 35, 54
four-color problem, 36
frame, 2
Gibbs sampling, 66
graphical model, 22
head, 5
heuristics, 37
hidden Markov model, 26, 33
independence, 9
information branch, 43
join graph, 29
join tree, 35, 39
  cover, 43
  heuristics, 37
  root, 41
junction tree, 35
Kalman filter, 16, 67
lattice, 16
Lauritzen-Spiegelhalter architecture, 50
linear programming, 15
marginal, 2, 3, 18
Markov chain, 25
Markov-chain Monte Carlo, 66
mean field theory, 66
multivariate framework, 2, 14
object-oriented computation, 64
out-marginal, 16
parallel computation, 48
parameter, 13
posterior probability, 10
probability distribution, 2
  algorithmic, 13
  continuous, 3
  discrete, 2
  parametric, 13
  posterior, 10
  tabular, 13
  with given marginals, 63
recursive computation, 5
recursive dynamic programming, 67
relational database, 35
rules, 63
semigroup, 16, 33, 68
SENDMESSAGE, 64
separator, 45, 56, 62
Shafer-Shenoy architecture, 45
similarity network, 31
slice, 6
state graph, 25, 33
sufficient, 9
support, 60
systems of equations, 15, 36
tail, 5
transitivity axiom, 5
valuation network, 30
variable, 2
vision, 66
zeros, 58

(continued from inside front cover)


JERROLD E. MARSDEN, Lectures on Geometric Methods in Mathematical Physics
BRADLEY EFRON, The Jackknife, the Bootstrap, and Other Resampling Plans
M. WOODROOFE, Nonlinear Renewal Theory in Sequential Analysis
D. H. SATTINGER, Branching in the Presence of Symmetry
R. TEMAM, Navier-Stokes Equations and Nonlinear Functional Analysis
MIKLÓS CSÖRGŐ, Quantile Processes with Statistical Applications
J. D. BUCKMASTER AND G. S. S. LUDFORD, Lectures on Mathematical Combustion
R. E. TARJAN, Data Structures and Network Algorithms
PAUL WALTMAN, Competition Models in Population Biology
S. R. S. VARADHAN, Large Deviations and Applications
KIYOSI ITÔ, Foundations of Stochastic Differential Equations in Infinite Dimensional Spaces
ALAN C. NEWELL, Solitons in Mathematics and Physics
PRANAB KUMAR SEN, Theory and Applications of Sequential Nonparametrics
LÁSZLÓ LOVÁSZ, An Algorithmic Theory of Numbers, Graphs and Convexity
E. W. CHENEY, Multivariate Approximation Theory: Selected Topics
JOEL SPENCER, Ten Lectures on the Probabilistic Method
PAUL C. FIFE, Dynamics of Internal Layers and Diffusive Interfaces
CHARLES K. CHUI, Multivariate Splines
HERBERT S. WILF, Combinatorial Algorithms: An Update
HENRY C. TUCKWELL, Stochastic Processes in the Neurosciences
FRANK H. CLARKE, Methods of Dynamic and Nonsmooth Optimization
ROBERT B. GARDNER, The Method of Equivalence and Its Applications
GRACE WAHBA, Spline Models for Observational Data
RICHARD S. VARGA, Scientific Computation on Mathematical Problems and Conjectures
INGRID DAUBECHIES, Ten Lectures on Wavelets
STEPHEN F. MCCORMICK, Multilevel Projection Methods for Partial Differential Equations
HARALD NIEDERREITER, Random Number Generation and Quasi-Monte Carlo Methods
JOEL SPENCER, Ten Lectures on the Probabilistic Method, Second Edition
CHARLES A. MICCHELLI, Mathematical Aspects of Geometric Modeling
ROGER TEMAM, Navier-Stokes Equations and Nonlinear Functional Analysis, Second Edition
GLENN SHAFER, Probabilistic Expert Systems
