
Bayesian network models of gene expression data

1. Introduction
All the cells in an organism carry the same genomic data, yet their protein makeup can be
drastically different both temporally and spatially, due to regulation. Protein synthesis is
regulated by many mechanisms at its different stages. These include mechanisms for
controlling transcription initiation, RNA splicing, mRNA transport, translation initiation,
post-translational modifications, and degradation of mRNA/protein. One of the main
junctions at which regulation occurs is mRNA transcription. A major role in this
machinery is played by proteins themselves, which bind to regulatory regions along the DNA, greatly affecting the transcription of the genes they regulate.
DNA microarrays are a technique for measuring the abundance of thousands of mRNA
targets simultaneously. Such experiments collect enormous amounts of data, which
clearly reflect many aspects of the underlying biological processes. An important
challenge is to develop methodologies that are both statistically sound and
computationally tractable for analyzing such data sets and inferring biological interactions
from them.
Most of the analysis tools currently used are based on clustering algorithms.
These algorithms attempt to locate groups of genes that have similar expression patterns
over a set of experiments. Such analysis has proven to be useful in discovering genes that
are co-regulated. A more ambitious goal for analysis is revealing the structure of the
transcriptional regulation process. This is clearly a hard problem. The current data is extremely noisy. Moreover, mRNA expression data alone only gives a partial picture that does not reflect key events such as translation and protein activation. Finally, the number of samples, even in the largest experiments, does not provide enough information to construct a full, detailed model with high statistical significance.
Bayesian networks are an approach to analyzing gene expression patterns that uncovers properties of the transcriptional program by examining statistical properties of dependence and conditional independence in the data. These networks represent the
dependence structure between multiple interacting quantities (expression levels of
different genes). Bayesian networks are mathematically defined strictly in terms of
probabilities and conditional independence statements. They can also be used to infer
causality.
2. Representing distributions with Bayesian networks
Consider a finite set 𝒳 = {X_1, ..., X_n} of random variables, where each variable X_i may take on a value x_i from the domain of X_i. Sets of variables are denoted by boldface capital letters X, Y, Z. We write I(X; Y | Z) to mean that X is independent of Y conditional on Z.
A Bayesian network is a representation of a joint probability distribution. This representation consists of two components: a directed acyclic graph (DAG) G, whose vertices correspond to the random variables X_1, ..., X_n, and a set of conditional distributions, one for each variable given its parents in G. Together, these two components specify a unique distribution over X_1, ..., X_n.

The graph G represents conditional independence assumptions that allow the joint distribution to be decomposed, economizing on the number of parameters. The graph G encodes the Markov assumption:

(*) Each variable X_i is independent of its non-descendants, given its parents in G.
By applying the chain rule of probabilities and properties of conditional
independencies, any joint distribution that satisfies (*) can be decomposed into the
product form
P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | Pa_G(X_i)),     (1)

where Pa_G(X_i) is the set of parents of X_i in G. Figure 1 shows an example of a graph G, lists the Markov independencies it encodes, and the product form they imply.

Figure 1: An example of a simple Bayesian network structure. This network structure implies several conditional independence statements:
I(A; E), I(B; D | A, E), I(C; A, D, E | B), I(D; B, C, E | A), I(E; A, D).
The network structure also implies that the joint distribution has the product form
P(A, B, C, D, E) = P(A) P(E) P(B | A, E) P(C | B) P(D | A).
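To see how the Markov assumption (*) yields this form (a standard derivation, using the variable ordering A, E, B, C, D): the chain rule gives
P(A, B, C, D, E) = P(A) P(E | A) P(B | A, E) P(C | A, B, E) P(D | A, B, C, E),
and the independencies above let us drop conditioning variables: P(E | A) = P(E) by I(A; E), P(C | A, B, E) = P(C | B) by I(C; A, D, E | B), and P(D | A, B, C, E) = P(D | A) by I(D; B, C, E | A), which recovers exactly the product form.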
We also need to specify each of the conditional probabilities in the product form. The second component of the Bayesian network describes these conditional distributions by parameters Θ. Suppose that the parents of a variable X are {U_1, ..., U_k}.
For discrete variables, if each of X and U_1, ..., U_k takes values from a finite set, we can represent P(X | U_1, ..., U_k) as a table that specifies the probability of each value of X for every joint assignment to U_1, ..., U_k.
For continuous variables, a natural choice is the Gaussian distribution. For instance,
P(X | u_1, ..., u_k) ~ N(a_0 + Σ_i a_i u_i, σ²).
That is, X is normally distributed around a mean that depends linearly on the values of its parents.
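A minimal sketch (in Python) of both kinds of conditional distributions, using the Figure 1 network for the discrete case; all numeric values (table entries, coefficients, variance) are hypothetical and chosen only to make the example runnable:

import numpy as np

# Discrete case: conditional probability tables for the Figure 1 network.
# Inner dicts give the distribution of the child for each joint
# assignment of its parents; the numbers are hypothetical.
P_A = {0: 0.6, 1: 0.4}
P_E = {0: 0.7, 1: 0.3}
P_B = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},   # P(B | A, E)
       (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.2, 1: 0.8}}
P_C = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}             # P(C | B)
P_D = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}             # P(D | A)

def joint(a, b, c, d, e):
    # Product form (1) specialized to the Figure 1 structure.
    return P_A[a] * P_E[e] * P_B[(a, e)][b] * P_C[b][c] * P_D[a][d]

# Continuous case: a linear Gaussian conditional distribution.
rng = np.random.default_rng(0)

def sample_x(parent_values, a0=0.1, coeffs=(0.5, -0.3), sigma=1.0):
    # X ~ N(a_0 + sum_i a_i * u_i, sigma^2): X is normal around a mean
    # that depends linearly on the values of its parents.
    mean = a0 + sum(a * u for a, u in zip(coeffs, parent_values))
    return rng.normal(mean, sigma)

print(joint(1, 0, 1, 1, 0))   # probability of one joint assignment
print(sample_x((2.0, 1.5)))   # one draw of X given two parent values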
3. Learning Bayesian networks

The problem of learning a Bayesian network can be stated as follows. Given a training set D = {x^1, ..., x^N} of independent instances of 𝒳, find a network B = <G, Θ> that best
matches D . The theory of learning networks from data has been examined extensively
over the last decade. The common approach is to introduce a statistically motivated
scoring function that evaluates each network with respect to the training data, and to
search for the optimal network according to this score. One score definition is
S(G : D) = log P(G | D) = log P(D | G) + log P(G) + C,

where C is a constant independent of G and

P(D | G) = ∫ P(D | G, Θ) P(Θ | G) dΘ.

The particular choice of priors P(G) and P(Θ | G) for each G determines the exact Bayesian score. Skipping the detailed selection of priors, we point out only one important property of the score: decomposability. That is, the score can be rewritten as the sum

S(G : D) = Σ_i ScoreContribution(X_i, Pa_G(X_i) : D),

where the contribution of every variable X_i to the total network score depends only on its own value and the values of its parents in G. These local contributions can be computed using a closed form equation.
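The exact closed form depends on the chosen priors and is derived in the reference. As a stand-in illustration of decomposability, the sketch below scores each family with BIC (log-likelihood minus a complexity penalty), which is likewise a sum of per-family terms; the data layout (a list of dicts mapping variable names to discrete values) is an assumption of the example:

import numpy as np
from collections import Counter

def family_score(data, child, parents):
    # Contribution of one (child, parents) family: maximized
    # log-likelihood minus the BIC penalty 0.5 * log N * (#parameters).
    N = len(data)
    counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(n * np.log(n / parent_counts[pa]) for (pa, _), n in counts.items())
    n_child_values = len({row[child] for row in data})
    n_params = len(parent_counts) * (n_child_values - 1)
    return loglik - 0.5 * np.log(N) * n_params

def score(data, graph):
    # graph maps each variable to the set of its parents in G; the total
    # score is just the sum of the local, per-family contributions.
    return sum(family_score(data, x, pa) for x, pa in graph.items())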
Once the prior is specified and the data is given, learning amounts to finding the
structure G that maximizes the score. This problem is known to be NP-hard, so we
resort to heuristic search. The decomposition of the score is crucial for this optimization.
A local search procedure that changes one arc at each move can efficiently evaluate the
gains made by adding, removing or reversing a single arc. An example of such a
procedure is a greedy hill-climbing algorithm that at each step performs the local change
that results in the maximal gain, until it reaches a local maximum.
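A minimal sketch of such a greedy hill-climbing search over DAGs, compatible with the score function sketched above (graphs map each variable to its parent set; for clarity it rescores whole graphs, whereas a real implementation would exploit decomposability and rescore only the one or two families touched by each move):

from itertools import permutations

def is_acyclic(graph):
    # Kahn-style check: repeatedly strip nodes whose parents are all gone.
    parents = {x: set(pa) for x, pa in graph.items()}
    while parents:
        roots = [x for x, pa in parents.items() if not pa]
        if not roots:
            return False               # a cycle remains
        for x in roots:
            del parents[x]
        for pa in parents.values():
            pa.difference_update(roots)
    return True

def neighbors(graph):
    # All graphs reachable by adding, removing, or reversing one arc.
    for u, v in permutations(graph, 2):
        g = {x: set(pa) for x, pa in graph.items()}
        if u in g[v]:                      # arc u -> v exists
            g[v].discard(u); g[u].add(v)   # ... reverse it
            yield g
            g = {x: set(pa) for x, pa in graph.items()}
            g[v].discard(u)                # ... or remove it
            yield g
        else:
            g[v].add(u)                    # ... or add u -> v
            yield g

def hill_climb(data, variables, score_fn):
    # From the empty graph, repeatedly apply the single-arc change with
    # the largest score gain until no change improves the score.
    graph = {x: set() for x in variables}
    current = score_fn(data, graph)
    while True:
        best, best_score = None, current
        for g in neighbors(graph):
            if is_acyclic(g):
                s = score_fn(data, g)
                if s > best_score:
                    best, best_score = g, s
        if best is None:
            return graph               # local maximum reached
        graph, current = best, best_score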
A Bayesian network is a model of dependencies between multiple measurements.
Causal interpretations for Bayesian networks have been proposed. Under causal
networks, the parents of a variable are interpreted as its immediate causes. A causal
network models not only the distribution of the observations, but also the effects of
interventions. If X causes Y , then manipulating the value of X affects the value of Y .
On the other hand, if Y is a cause of X, then manipulating X will not affect Y. Thus, although X → Y and Y → X are equivalent Bayesian networks, they are not equivalent as causal networks. In the biological domain, suppose X is a transcription factor of Y. If we knock out gene X, this will affect the expression of gene Y, but a knockout of gene Y has no effect on the expression of gene X.
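To make the distinction concrete (a standard textbook contrast, not taken from the reference): in the network X → Y, intervening coincides with conditioning, P(Y | do(X = x)) = P(Y | X = x); in the network Y → X, setting X by intervention severs its dependence on Y, so P(Y | do(X = x)) = P(Y). Both networks nevertheless represent the same joint distribution P(X, Y).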
There is also a corresponding Markov assumption for causal networks, the causal Markov assumption: given the values of a variable's immediate causes, it is independent of its earlier causes.
4. Analyzing expression data
We describe the state of the system (a cell or an organism and its environment) using
random variables. These random variables denote the expression level of individual
genes. In addition, we can include random variables that denote other attributes that
affect the system, such as experimental conditions, temporal indicators (i.e., the time/stage that the sample was taken from), background variables (e.g., which clinical procedure was used to get a biopsy sample), and exogenous cellular conditions.
We thus attempt to build a model that represents a joint distribution over a collection of random variables. If we had such a model, we could answer a wide range of queries
about the system. For example, does the expression level of a particular gene depend on
the experimental condition? Is this dependence direct, or indirect? If it is indirect, which
genes mediate the dependency? Not having a model at hand, we want to learn one from
the available data and use it to answer questions about the system.
The main difficulty in learning from expression data is that, in contrast to most previous applications of Bayesian network learning, expression data involves transcript levels of thousands of genes while current data sets contain at most a few dozen samples.
This raises problems in computational complexity and the statistical significance of the
resulting networks. On the positive side, genetic regulation networks are sparse, i.e.,
given a gene, it is assumed that no more than a few dozen genes directly affect its
transcription.
The large search space of the learning process requires efficient search
algorithms. We will identify a relatively small number of candidate parents for each gene
based on simple local statistics (such as correlation). We then restrict our search to
networks in which only the candidate parents of a variable can be its parents, resulting in
a much smaller search space.
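A minimal sketch of this candidate-selection step, using absolute Pearson correlation as the local statistic (the statistic, the cutoff k, and the gene-by-sample data layout are assumptions of the example):

import numpy as np

def candidate_parents(expr, k=20):
    # expr: genes x samples matrix of expression levels. For each gene,
    # keep the k most strongly correlated genes as its only allowed
    # parents during the structure search.
    corr = np.abs(np.corrcoef(expr))
    np.fill_diagonal(corr, -np.inf)    # a gene is never its own parent
    return {g: list(np.argsort(corr[g])[-k:]) for g in range(len(expr))}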
When learning models with many variables, small data sets are not sufficiently informative to establish with significance that a single model is the right one. Instead,
many different networks should be considered as reasonable explanations of the given
data. From a Bayesian perspective, we say that the posterior probability over models is
not dominated by a single model. Our approach is to analyze this set of plausible
networks. Although this set can be very large, we might attempt to characterize features
that are common to most of these networks. Two types of features may be studied. One is
Markov relations: Is Y in the Markov blanket of X ? The Markov blanket of X is the
minimal set of variables that shield X from the rest of the variables in the model. The
second type of features is order relations: Is X an ancestor of Y in all the networks? The
statistical confidence of these features can be estimated by the bootstrap method; for details, see the reference.
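A minimal sketch of this confidence estimate, parameterized by a learner (such as the hill_climb sketch above) and a feature predicate, both of which are placeholders:

import random

def feature_confidence(data, learn, feature, m=100):
    # Learn a network from each of m bootstrap resamples and report the
    # fraction in which the feature holds (e.g., "Y is in the Markov
    # blanket of X") -- an estimate of the feature's confidence.
    hits = 0
    for _ in range(m):
        resample = [random.choice(data) for _ in data]  # with replacement
        if feature(learn(resample)):
            hits += 1
    return hits / m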
This approach to learning Bayesian networks was applied to the data of Spellman et al.'s cell cycle experiments; for the results, see the reference.
Reference
Friedman, N., Linial, M., Nachman, I., and Pe'er, D. (2000). Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7, 601-620.
