searchers have recently resorted to the asymmetric relationship between the cause and effect variables under assumptions on the generation process. The discovery ability is dramatically improved by exploiting the linear assumption (Zscheischler et al., 2012), the linear non-Gaussian assumption (Shimizu et al., 2006; 2011), the nonlinear non-Gaussian assumption (Hoyer et al., 2008), the discreteness property (Peters et al., 2010), and so on. When the variables are correlated under linear relationships and the noises follow non-Gaussian distributions, for example, LiNGAM (Shimizu et al., 2006) and its variants (Shimizu et al., 2011) are known as the best causality inference algorithms. However, the scalability of LiNGAM and its variants remains questionable, since they depend heavily on independent component analysis (ICA) during the computation. To return robust results from ICA, it is necessary to feed in a large bulk of samples, expected to be no smaller in number than the number of variables.

Motivated by the common observation of the sparsity of causal structures, i.e. that each variable usually depends on only a small number of parent variables, we derive a new general framework in this paper. The new framework helps existing causation algorithms overcome the difficulties of small sample cardinality in practice. Well-designed conditional independence tests are conducted first, to partition the problem domain into small subproblems. With the same number of samples, existing causation algorithms can generate more robust and accurate results on these small subproblems. The partial results from all subproblems are finally merged together, to return a complete picture of the causalities among all the variables. This framework is theoretically solid, as it always returns a correct and complete result under the optimal setting. Our experiments on synthetic and real datasets verify the superior scalability and effectiveness of our proposal when it is applied together with two mainstream causation analysis algorithms.

2. Related Work

Causal Bayesian network (CBN) theory is part of the theoretical background of this work. A CBN is a special case of Bayesian network whose edge directions represent the causal relations among the nodes (Pearl, 2009). CBNs have been used to model the causal structure in many real-world applications, for example, gene regulatory networks (Friedman et al., 2000; Kim et al., 2004) and causal feature selection (Cai et al., 2011). Most existing work explores structure learning approaches to learning the CBN, e.g. the well-known PC algorithm (Spirtes et al., 2001; Kalisch & Bühlmann, 2007) and Markov Blanket discovery methods (Zhu et al., 2007). These methods provide the skeleton of the causal structure, i.e. parent-child pairs and Markov Blankets. However, they usually cannot distinguish causes from consequences, and are thus unable to output exact causalities.

Pearl pioneered causality analysis theory (Pearl, 2009). Since Pearl's Inductive Causation (Pearl & Verma, 1991), a large number of extensions have been proposed. Most causality inference algorithms assume the acquisition of a sufficiently large sample base (Aliferis et al., 2010). Though there are studies aiming at the inference problem under small sample cardinality (Bromberg & Margaritis, 2009), the actual number of samples used in their empirical evaluations remains significantly larger than the number of variables. Cai's study (Cai et al., 2013) is another attempt in this category, extending the method to high-dimensional gene expression data by exploring local substructures. However, all these approaches, being based on conditional independence tests, cannot distinguish two causal structures if they come from a so-called Markov equivalence class (Pearl, 2009), for which expensive intervention experiments were previously considered essential (He & Geng, 2008).

Recently, the Additive Noise Model has been proposed to break this limitation of methods based purely on conditional independence testing, by exploiting the asymmetric property of the noises in the generative process, which offers a promising route to resolving the causal equivalence problem. The Additive Noise Model depends strongly on the type of noise and the form of causality, and existing studies along this line can be categorized by their assumptions on the noise type and the data generation mechanism. LiNGAM and its variants (Shimizu et al., 2006; 2011), for example, assume that the data generating process is linear and the noise distributions are non-Gaussian. The nonlinear non-Gaussian method (Hoyer et al., 2008) works when the data generating process is nonlinear, and a discrete model (Peters et al., 2010) has been proposed for causal inference on domains with only discrete variables. There are other studies on this topic, such as explaining the underlying theoretical foundation behind additive noise models (Janzing et al., 2012), regression-based model inference (Mooij et al., 2009) and kernel independence tests (Zhang et al., 2012). To the best of our knowledge, there is no existing work addressing the problem of sample cardinality under the Additive Noise Model.
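The pairwise additive-noise test underlying this line of work can be sketched in a few lines. The sketch below is ours, not the cited authors' code: it substitutes a plain polynomial regression and a simple biased HSIC score (with the median bandwidth heuristic) for the kernel machinery of the cited work, and the names `hsic`, `residuals` and `anm_direction` are our own.

```python
import numpy as np

def rbf_gram(z, sigma):
    """RBF kernel Gram matrix for a 1-D sample."""
    d2 = (z[:, None] - z[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(a, b):
    """Biased HSIC estimate: close to 0 when a and b are independent."""
    n = len(a)
    sa = np.median(np.abs(a[:, None] - a[None, :])) + 1e-12  # median heuristic
    sb = np.median(np.abs(b[:, None] - b[None, :])) + 1e-12
    K, L = rbf_gram(a, sa), rbf_gram(b, sb)
    H = np.eye(n) - np.ones((n, n)) / n                      # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def residuals(cause, effect, degree=3):
    """Fit effect = f(cause) + e by polynomial regression; return e."""
    coef = np.polyfit(cause, effect, degree)
    return effect - np.polyval(coef, cause)

def anm_direction(x, y):
    """Prefer the direction whose regression residuals look independent."""
    fwd = hsic(x, residuals(x, y))   # noise of the model y = f(x) + e
    bwd = hsic(y, residuals(y, x))   # noise of the model x = g(y) + e
    return 'x->y' if fwd < bwd else 'y->x'

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 300)
y = x ** 3 + rng.uniform(-1, 1, 300)  # nonlinear cause-effect pair
print(anm_direction(x, y))
```

On such a nonlinear pair with non-Gaussian noise, the forward residuals are (approximately) independent of the input while the backward residuals are not, which is exactly the asymmetry the additive noise model exploits.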
SADA: A General Framework to Support Robust Causation Discovery
3. SADA Framework

3.1. Preliminaries

Assume that all samples from the problem domain contain information on m different variables, i.e. V = {v1, v2, ..., vm}. Let D = {x1, x2, ..., xn} denote an observation sample set. Each sample xi is denoted by a vector xi = (xi1, xi2, ..., xim, yi), where xij indicates the value of sample xi on variable vj and yi is the target variable under investigation. If P is a distribution over the domain of the variables in V, we assume that there exists a causal Bayesian network N faithful to the distribution P. The network N includes a directed acyclic graph G, each edge of which indicates a conditional (in)dependence relationship between two variable nodes. Each edge is also associated with a conditional probability function which simulates the conditional probability distribution of each variable given the values of its parent variables. Following the common assumption of existing studies, we only consider problem domains satisfying the Faithfulness Condition (Koller & Friedman, 2009). Specifically, P and N are faithful to each other iff every conditional independence entailed by N corresponds to some Markov condition present in P.

Figure 1. An example probabilistic graphical model over 9 variables and two causal cuts deduced by C and C'.

Due to their probabilistic nature, it is likely that a huge number of equivalent Bayesian networks can be found. Two different Bayesian networks, N1 and N2, are independence (or Markov) equivalent if N1 and N2 entail exactly the same conditional independence relations among the variables. Among all these Bayesian networks, the causal Bayesian network (CBN) is the special one in which each edge is interpreted as a direct causal relationship between a parent variable node and a child variable node.

Generally speaking, it is difficult to distinguish the CBN from its independence-equivalent Bayesian networks unless additional assumptions are made. When the variables are correlated under linear relationships and the noises follow non-Gaussian distributions, LiNGAM and its variants (Shimizu et al., 2006; 2011) are known to return more accurate causations from uncontrollable samples. In particular, such an assumption can be formulated by the equation vi = Σ_{vj ∈ P(vi)} Aij vj + ei, where P(vi) contains all the parent variables of vi, Aij is the linear dependence weight w.r.t. vi and its parent vj, and ei is a non-Gaussian noise over vi. Assume that the variables in V are organized according to a topological order of the causal graph. The generation procedure of a sample can then be written as V = A·V + E. LiNGAM aims to find such a topological order and reconstructs the matrix A by exploiting independent component analysis (ICA) over the samples.

When assuming a nonlinear generation procedure (Hoyer et al., 2008) or a discrete data domain (Peters et al., 2010), the additive noise model provides another approach to utilizing the asymmetric relation between cause variables and consequence variables. A regression model vi = f(vj) + ei is trained for each pair of variables vi and vj. If the noise variable ei is independent of vj, variable vj is returned as the cause of variable vi. Note that algorithms under the discrete additive noise model are usually run over pairs of variables independently.

A common observation on CBNs in real-world domains is the sparsity of the causal relationships. Specifically, a variable usually has only a small number of parental causal variables in the CBN, regardless of the underlying true generative procedure. This property, however, is not fully exploited by the existing causation algorithms.

3.2. Framework

Let G = (V, E) denote a directed graph on the variable set V. A variable set C ⊆ V forms a causal cut set over G iff C deduces three non-overlapping variable subsets V1, V2 and C of V such that (1) V1 ∪ V2 ∪ C = V; and (2) there is no edge between V1 and V2 in E. Intuitively, the variables in C block all paths between the variables in V1 and V2. For each directed edge u → v in E, one of the two following cases must hold: (1) intra-causality: u, v ∈ V1, u, v ∈ V2 or u, v ∈ C; or (2) inter-causality: (u ∈ V1 ∪ V2 and v ∈ C), or (u ∈ C and v ∈ V1 ∪ V2).

In Figure 1, for example, C = {v4} is a valid causal cut, which separates the variables into V1 = {v1, v3, v6, v7} and V2 = {v2, v5, v8, v9}. Given a directed graph G, there can be different valid cuts satisfying the above conditions. In the example graph, C' = {v3, v4, v5} is another valid causal cut, with V1 = {v1, v2} and V2 = {v6, v7, v8, v9}. Note that a causal cut may not lead to d-separation, e.g. C in the exam-
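The causal cut conditions above are easy to check mechanically. The sketch below is ours; since the edge list of Figure 1 is not given in the text, `E` is an assumed reconstruction chosen only to be consistent with the cuts C = {v4} and C' = {v3, v4, v5} described above.

```python
# Check the causal-cut definition: C, V1, V2 are disjoint, they cover V,
# and no edge of E (in either direction) joins V1 and V2.
def is_causal_cut(variables, edges, cut, v1, v2):
    disjoint = not (cut & v1 or cut & v2 or v1 & v2)   # non-overlapping subsets
    covers = (cut | v1 | v2) == variables              # V1 union V2 union C = V
    separated = not any((u in v1 and v in v2) or (u in v2 and v in v1)
                        for u, v in edges)             # no V1-V2 edge
    return disjoint and covers and separated

V = set(range(1, 10))   # v1 .. v9
# Assumed edges for Figure 1 (not given in the text), consistent with C, C'.
E = [(1, 3), (1, 4), (2, 4), (2, 5), (3, 6), (3, 7),
     (4, 7), (4, 8), (5, 8), (5, 9)]

print(is_causal_cut(V, E, {4}, {1, 3, 6, 7}, {2, 5, 8, 9}))    # C: True
print(is_causal_cut(V, E, {3, 4, 5}, {1, 2}, {6, 7, 8, 9}))    # C': True
print(is_causal_cut(V, E, {5}, {1, 3, 6, 7}, {2, 4, 8, 9}))    # False
```

The last call fails because, under the assumed edge list, the edge (v1, v4) crosses directly between the two candidate sides, so {v5} blocks nothing.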
Figure 5. Causal cut errors on discrete models.

Table 3. Results on Discrete Model.
Kalisch, M. and Bühlmann, P. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res., 8:613–636, 2007.

Kim, Sunyong, Imoto, Seiya, and Miyano, Satoru. Dynamic bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data. Biosystems, 75(1-3):57–65, 2004.

Zhu, Z., Ong, Y.S., and Dash, M. Markov blanket-embedded genetic algorithm for gene selection. Pattern Recognition, 40(11):3236–3248, 2007.

Zscheischler, Jakob, Janzing, Dominik, and Zhang, Kun. Testing whether linear equations are causal: A free probability theory approach. CoRR, abs/1202.3779, 2012.