
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART B: CYBERNETICS, VOL. 42, NO. 1, FEBRUARY 2012

Geometric Decision Tree


Naresh Manwani and P. S. Sastry, Senior Member, IEEE

Abstract: In this paper, we present a new algorithm for learning
oblique decision trees. Most of the current decision tree algorithms rely
on impurity measures to assess the goodness of hyperplanes at each
node while learning a decision tree in top-down fashion. These
impurity measures do not properly capture the geometric structures
in the data. Motivated by this, our algorithm uses a strategy for
assessing the hyperplanes in such a way that the geometric structure
in the data is taken into account. At each node of the decision tree, we
find the clustering hyperplanes for both the classes and use their
angle bisectors as the split rule at that node. We show through
empirical studies that this idea leads to small decision trees and better
performance. We also present some analysis to show that the angle
bisectors of clustering hyperplanes that we use as the split rules at
each node are solutions of an interesting optimization problem and
hence argue that this is a principled method of learning a decision
tree.
Index Terms: Decision trees, generalized eigenvalue problem,
multiclass classification, oblique decision tree.

I. INTRODUCTION

THE DECISION tree is a well-known and widely used method for classification. The popularity of the decision tree is because of its simplicity and easy interpretability as a
classification rule. In a decision tree classifier, each nonleaf node
is associated with a so-called split rule or a decision function,
which is a function of the feature vector and is often binary
valued. Each leaf node in the tree is associated with a class label.
To classify a feature vector using a decision tree, at every nonleaf
node that we encounter (starting with the root node), we branch to
one of the children of that node based on the value assumed by
the split rule of that node on the given feature vector. This process
follows a path in the tree, and when we reach a leaf, the class label
of the leaf is what is assigned to that feature vector. In this paper,
we address the problem of learning an oblique decision tree, given
a set of labeled training samples. We present a novel algorithm that
attempts to build the tree by capturing the geometric structure of the
class regions.
Decision trees can be broadly classified into two types, i.e., axis
parallel and oblique [1]. In an axis-parallel decision tree, the split
rule at each node is a function of only one of the components of
the feature vector. Axis-parallel decision trees are particularly
attractive when all features are nominal; in such cases, we can have
a nonbinary tree where, at each node, we test one feature value,
and the node can have as many children as the values assumed by
that feature [2]. However, in more
general situations, we have to approximate even arbitrary linear segments in the class boundary with many axis-parallel pieces;
hence, the size of the resulting tree becomes large. The oblique
decision trees, on the other hand, use a decision function that
depends on a linear combination of all feature components. Thus,
an oblique decision tree is a binary tree where we associate a
hyperplane with each node. To classify a pattern, we follow a
path in the tree by taking the left or right child at each node based
on which side of the hyperplane (of that node) the feature vector
falls in. Oblique decision trees represent the class boundary as a
general piecewise linear surface. Oblique decision trees are more
versatile (and hence are more popular) when features are real
valued.
The approaches for learning oblique decision trees can be
classified into two broad categories. In one set of approaches, the
structure of the tree is fixed beforehand, and we try to learn the
optimal tree with this fixed structure. This methodology has been
adopted by several researchers, and different optimization
algorithms have been proposed [3]-[8]. The problem with these
approaches is that they are applicable only in situations where
we know the structure of the tree a priori, which is often not
the case. The other class of approaches learns the tree in a top-down manner. Top-down approaches have been more popular
because of their versatility.
Top-down approaches are recursive algorithms for building the
tree in a top-down fashion. We start with the given training data
and decide on the "best" hyperplane, which is assigned to the root
of the tree. Then, we partition the training examples into two sets
that go to the left child and the right child of the root node using
this hyperplane. Then, at each of the two child nodes, we repeat
the same procedure (using the appropriate subset of the training
data). The recursion stops when the set of training examples that
come to a node is pure, that is, all these training patterns are of the
same class. Then, we make it a leaf node and assign that class to
the leaf node. (We can also have other stopping criteria; for example, we can make a node a leaf node if 95% of the training examples reaching that node belong to one class.) A detailed
survey of top-down decision tree algorithms is available in [9].
There are two main issues in top-down decision tree learning
algorithms: 1) given the training examples at a node, how to rate
different hyperplanes that can be associated with this node; and 2)
given a rating function, how to find the optimal hyperplane at each
node.
One way of rating hyperplanes is to look for hyperplanes that
are reasonably good classifiers for the training data at that node. In
[10], two parallel hyperplanes are learned at each node such that
one side of each hyperplane contains points of only one class and
the space between these two hyperplanes contains the points that
are not separable. A slight variant of the aforementioned
algorithm is proposed in [11], where only



one hyperplane is learned at each decision node in such a way that


one side of the hyperplane contains points of only one class.
However, in many cases, such approaches produce very large trees
that have poor generalization performance. Another approach is to
learn a good linear classifier (e.g., least mean square error
classifier) at each node (see, e.g., [12]). Decision tree learning for
multiclass classification problems using linear- classifier-based
approaches is discussed in [13], [14]. Instead of finding a linear
classifier at each node, Cline [15], which is a family of decision
tree algorithms, uses various heuristics to determine hyperplanes
at each node. However, they do not provide any results to show
why these heuristics help or how one chooses a method. A Fisherlinear-discriminant-based decision tree algorithm is proposed in
[16]. All the previously discussed approaches produce crisp
decision boundaries. The decision tree approach giving
probabilistic decision boundary is discussed in [17]. Its fuzzy
variants are discussed in [18] and [19].
In a decision tree, each hyperplane at a nonleaf node should
split the data in such a way that it aids further classification; the
hyperplane itself need not be a good classifier at that stage. In view
of this, many classical top-down decision tree learning algorithms
are based on rating hyperplanes using the so-called impurity
measures. The main idea is given as follows: Given the set of
training patterns at a node and a hyperplane, we know the set of
patterns that go into the left and right children of this node. If each of these two sets of patterns has a predominance of one class over the others, then, presumably, the hyperplane can be considered to have contributed positively to further classification. At any stage in the learning process, the level of purity of a node is a measure of how skewed the distribution of the different classes is among the patterns landing at that node. If the class distribution is nearly uniform, then the node is highly impure; if the number of patterns of one class is much larger than that of all the others, then the purity of the node is high. The impurity measures used in these algorithms give a higher rating to a hyperplane that results in a higher purity of the
child nodes. The Gini index, entropy, and twoing rule are some of
the frequently used impurity measures [9]. A slightly different
measure is the area under the ROC curve [20], which is also called
AUCsplit and is related inversely to the Gini index. Some of the
popular algorithms that learn oblique decision trees by optimizing
some impurity measures are discussed in [9]. Many of the
impurity measures are not differentiable with respect to the
hyperplane parameters. Thus, the algorithms for decision tree
learning using impurity measures need to use some search
techniques for finding the best hyperplane at each node. For
example, CART-LC [1] uses a deterministic hill-climbing
algorithm; OC1 [21] uses a randomized search. Both of these
approaches search in one dimension at a time, which becomes
computationally cumbersome in high-dimensional feature
spaces. In contrast to these approaches, evolutionary approaches
are able to optimize in all dimensions simultaneously. Some
examples of decision tree algorithms in which the rating function
is optimized using evolutionary approaches are in [18], [22], and
[23]. Evolutionary approaches are tolerant to noisy evaluations of
the rating function and also facilitate optimizing multiple rating
functions simultaneously [24], [25].
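
For concreteness, the short sketch below shows how two widely used impurity measures, the Gini index and the entropy, could be computed for a candidate split. The function names and the representation of a split by its per-class counts are our own illustrative choices; they are not taken from any of the cited packages.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Entropy (in bits) of a node given its per-class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def split_impurity(left_counts, right_counts, measure=gini):
    """Weighted impurity of a split; a top-down algorithm rates a
    hyperplane by how small this quantity is."""
    nl, nr = sum(left_counts), sum(right_counts)
    n = nl + nr
    return (nl / n) * measure(left_counts) + (nr / n) * measure(right_counts)

# Example: 60 patterns of class 1 and 40 of class 2 split into two children.
print(split_impurity([50, 5], [10, 35]))           # Gini-based rating
print(split_impurity([50, 5], [10, 35], entropy))  # entropy-based rating
```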

A problem with all impurity measures is that they depend only


on the number of (training) patterns of different classes on either
side of the hyperplane. Thus, if we change the class regions
without changing the effective areas of class regions on either
side of a hyperplane, the impurity measure of the hyperplane will
not change. Thus, the impurity measures do not really capture the
geometric structure of class regions.
In [26], a different approach is suggested, where the function for
rating hyperplanes gives high values to hyperplanes, which
promote the "degree of linear separability" of the set of patterns
landing at the child nodes. It has been found experimentally that the
decision trees learned using this criterion are more compact than
those using impurity measures. In [26], a simple heuristic is used
to define what is meant by "degree of linear separability." This
function also does not try to capture the geometry of pattern
classes. Again, the cost function is not differentiable with respect
to the parameters of the hyperplanes, and the method uses a
stochastic search technique called Alopex [27] to find the optimal
hyperplane at each node.
In this paper, we present a new decision tree learning algorithm, which is based on the idea of capturing, to some extent, the
geometric structure of the underlying class regions. For this, we
borrow ideas from some recent variants of the support vector
machine (SVM) method, which are quite good at capturing the
(linear) geometric structure of the data.
For a two-class classification problem, the multisurface proximal SVM (GEPSVM) algorithm [28] finds two clustering
hyperplanes, i.e., one for each class. Each hyperplane is close to
patterns of one class while being far from patterns of the other
class. Then, new patterns are classified based on the nearness to
the hyperplanes. In problems where one pair of hyperplanes like
this does not give sufficient accuracy, Mangasarian and Wild [28] suggest the idea of using the kernel trick to (effectively) learn the pair of hyperplanes in a high-dimensional space to which the patterns are transformed. Motivated by GEPSVM, we
derive our decision tree approach as follows: At each node of the
tree, we find the clustering hyperplanes as in GEPSVM. After
finding these hyperplanes, we choose the split rule at this node as
the angle bisectors of the two hyperplanes. Then, we split the data
based on the angle bisector and recursively learn the left and right
subtrees of this node. Since, in general, there will be two angle
bisectors, we select that which is better based on an impurity
measure. Thus, the algorithm combines the ideas of linear
tendencies in data and purity of nodes to find better decision
trees. We also present some analysis to bring out some interesting
properties of our angle bisectors that can explain why this may
be a good technique to learn decision trees.
The rest of this paper is organized as follows.1 We describe our
algorithm in Section II. Section III presents some analysis that
brings out some properties of the angle bisectors of the clustering
hyperplanes. Based on these results, we argue that our angle
bisectors are a good choice for split rule at a node while learning
the decision tree. Experimental results are given in Section IV. We
conclude this paper in Section V.
1
A preliminary version of this work has been presented in [29], where some
experimental results for the two-class case are presented without any analysis.


Fig. 1. Example to illustrate the proposed decision tree algorithm. (a) Hyperplane learned at the root node using an algorithm (OC1) that relies on the impurity measure of
Gini index. (b, solid line) Angle bisectors of (dashed line) the clustering hyperplanes at the root node on this problem, which were obtained using our method.

II. GEOMETRIC DECISION TREE


The performance of any top-down decision tree algorithm
depends on the measure used to rate different hyperplanes at each
node. The issue of having a suitable algorithm to find the
hyperplane that optimizes the chosen rating function is also
important. For example, for all impurity measures, the optimization is difficult because finding the gradient of the impurity
function with respect to the parameters of the hyperplane is not
possible. Motivated by these considerations, here, we propose a new
criterion function to assess the suitability of a hyperplane at a
node that can capture the geometric structure of the class regions.
For our criterion function, the optimization problem can also be
solved more easily.
We first explain our method by considering a two-class
problem. Given the set of training patterns at a node, we first find
two hyperplanes, i.e., one for each class. Each hyperplane is such
that it is closest to all patterns of one class and is farthest from all
patterns of the other class. We call these hyperplanes the
clustering hyperplanes (for the two classes). Because of the way
they are defined, these clustering hyperplanes capture the
dominant linear tendencies in the examples of each class that are
useful for discriminating between the classes. Hence, a hyperplane that passes in between them could be good for splitting the
feature space. Thus, we take the hyperplane that bisects the angle
between the clustering hyperplanes as the split rule at this node.
Since, in general, there would be two angle bisectors, we choose
the bisector that is better, based on an impurity measure, i.e., the
Gini index. If the two clustering hyperplanes happen to be parallel
to each other, then we take a hyperplane midway between the two
as the split rule. Before presenting the full algorithm, we illustrate
it through an example.
Consider the 2-D classification problem shown in Fig. 1, where the two classes are not linearly separable. The hyperplane learned at the root node using OC1, which is an oblique decision tree algorithm that uses the impurity measure of the Gini index, is shown in Fig. 1(a). As can be seen, although this hyperplane promotes the (average) purity of the child nodes, it does not really simplify the classification problem; it does not capture the symmetric distribution of class regions in this problem. Fig. 1(b) shows the two clustering hyperplanes for the two classes and the two angle bisectors, obtained through our algorithm, at the root node on this problem. As can be seen, choosing either of the angle bisectors as the hyperplane at the root node to split the data results in linearly separable classification problems at both child nodes. Thus, we see here that our idea of using angle bisectors of the two clustering hyperplanes actually captures the right geometry of the classification problem. This is the reason we call our approach the "geometric decision tree (GDT)."

We also note here that neither of our angle bisectors scores high on any impurity-based measure; if we use either of these hyperplanes as the split rule at the root, both child nodes would contain roughly equal numbers of patterns of each class. This example is only for explaining the motivation behind our approach. Not all classification problems have such a nice symmetric structure in class regions. However, in most problems, our approach seems to be able to capture the geometric structure well, as seen from the results in Section IV.
A. GDT Algorithm: Two-Class Case

Let S = {(x_i, y_i) : x_i ∈ R^d, y_i ∈ {-1, +1}, i = 1, ..., n} be the training data set. Let C_+ be the set of points for which y_i = +1, and let C_- be the set of points for which y_i = -1.

For an oblique decision tree learning algorithm, the main computational task is the following: given a set of data points at a node, find the best hyperplane to split the data.

Let S_t be the set of points at node t. Let n_{t+} and n_{t-} denote the number of patterns of the two classes at that node. Let A ∈ R^{n_{t+} x d} be the matrix containing the points of class C_+ at node t as rows.^2 Similarly, let B ∈ R^{n_{t-} x d} be the matrix whose rows contain the points of class C_- at node t. Let h_1(w_1, b_1) : w_1^T x + b_1 = 0 and h_2(w_2, b_2) : w_2^T x + b_2 = 0 be the two clustering hyperplanes. Hyperplane h_1 is to be closest to all points of class C_+ and farthest from points of class C_-.

^2 We use C_+ and C_- to denote both the sets of examples of the two classes and the labels of the two classes; the meaning will be clear from context.


Similarly, hyperplane h_2 is to be closest to all points of class C_- and farthest from points of class C_+. To find the clustering hyperplanes, we use the idea as in GEPSVM [28].

The nearness of a set of points to a hyperplane is represented by the average of squared distances. The average of the squared distances of the points of class C_+ from a hyperplane w^T x + b = 0 is

D_+(w, b) = \frac{1}{n_{t+}\,\|w\|^2} \sum_{x_i \in C_+} (w^T x_i + b)^2

where \|\cdot\| denotes the standard Euclidean norm. Let \tilde{w} = [w^T \; b]^T ∈ R^{d+1} and \tilde{x} = [x^T \; 1]^T ∈ R^{d+1}. Then, w^T x_i + b = \tilde{w}^T \tilde{x}_i. Note that, by the definition of matrix A, we have \sum_{x_i \in C_+} (w^T x_i + b)^2 = \|A w + b e_{n_{t+}}\|^2, where e_{n_{t+}} is an n_{t+}-dimensional column vector^3 of ones. Now, D_+(w, b) can be further simplified as

D_+(w, b) = \frac{1}{n_{t+}\,\|w\|^2} \sum_{x_i \in C_+} \tilde{w}^T \tilde{x}_i \tilde{x}_i^T \tilde{w} = \frac{\tilde{w}^T G \tilde{w}}{\|w\|^2}

where G = (1/n_{t+}) \sum_{x_i \in C_+} \tilde{x}_i \tilde{x}_i^T = (1/n_{t+}) [A \; e_{n_{t+}}]^T [A \; e_{n_{t+}}]. Similarly, the average of the squared distances of the points of class C_- from the hyperplane will be D_-(w, b) = (1/(n_{t-}\,\|w\|^2))\, \tilde{w}^T H \tilde{w}, where H = (1/n_{t-}) \sum_{x_i \in C_-} \tilde{x}_i \tilde{x}_i^T = (1/n_{t-}) [B \; e_{n_{t-}}]^T [B \; e_{n_{t-}}] and e_{n_{t-}} is the n_{t-}-dimensional vector of ones.

^3 Unless stated otherwise, all vectors are assumed to be column vectors.

To find each clustering hyperplane, we need to find \tilde{w} such that one of D_+ or D_- is maximized while minimizing the other. Hence, the two clustering hyperplanes, which are specified by \tilde{w}_1 = [w_1^T \; b_1]^T and \tilde{w}_2 = [w_2^T \; b_2]^T, can be formalized as the solutions of the following optimization problems:

\tilde{w}_1 = \arg\min_{\tilde{w} \neq 0} \frac{D_+(w, b)}{D_-(w, b)} = \arg\min_{\tilde{w} \neq 0} \frac{\tilde{w}^T G \tilde{w}}{\tilde{w}^T H \tilde{w}} = \arg\max_{\tilde{w} \neq 0} \frac{\tilde{w}^T H \tilde{w}}{\tilde{w}^T G \tilde{w}}    (1)

\tilde{w}_2 = \arg\min_{\tilde{w} \neq 0} \frac{D_-(w, b)}{D_+(w, b)} = \arg\min_{\tilde{w} \neq 0} \frac{\tilde{w}^T H \tilde{w}}{\tilde{w}^T G \tilde{w}}.    (2)

It can easily be verified that G = (1/n_{t+}) \sum_{x_i \in C_+} \tilde{x}_i \tilde{x}_i^T is a (d+1) x (d+1) symmetric positive semidefinite matrix. G is strictly positive definite when matrix A has full column rank. Similarly, matrix H is also a positive semidefinite matrix, and it is strictly positive definite when matrix B has full column rank.

The problems given by (1) and (2) are standard optimization problems, and their solutions essentially involve solving the following generalized eigenvalue problem [30]:

H \tilde{w} = \lambda G \tilde{w}.    (3)

It can be shown [30] that any \tilde{w} that is a local solution of the optimization problems given by (1) and (2) will satisfy (3), and the value of the corresponding objective function is given by the eigenvalue. Thus, the problem of finding the parameters (w_1, b_1) and (w_2, b_2) of the two clustering hyperplanes gets reduced to finding the eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem described by (3). It is easy to see that, if \tilde{w}_1 is a solution of problem (3), then k\tilde{w}_1 also happens to be a solution for any k. Here, for our purpose, we choose k = 1/\|w_1\|. That is, the clustering hyperplanes [obtained as the eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem (3)] are \tilde{w}_1 = [w_1^T \; b_1]^T and \tilde{w}_2 = [w_2^T \; b_2]^T, and they are scaled such that \|w_1\| = \|w_2\| = 1.

We solve this generalized eigenvalue problem using the standard LU-decomposition-based method [30] in the following way: Let matrix G have full rank. Let G = F F^T, which can be done using LU factorization. Now, from (3), we get

H \tilde{w} = \lambda F F^T \tilde{w} \;\Longrightarrow\; F^{-1} H F^{-T} \tilde{y} = \lambda \tilde{y}

where \tilde{y} = F^T \tilde{w}, which means that \tilde{y} is an eigenvector of F^{-1} H F^{-T}. Since F^{-1} H F^{-T} is symmetric, we can find orthonormal eigenvectors of F^{-1} H F^{-T}. If \tilde{w} is an eigenvector corresponding to the generalized eigenvalue problem H \tilde{w} = \lambda G \tilde{w}, then F^T \tilde{w} will be an eigenvector of F^{-1} H F^{-T}.

Once we find the clustering hyperplanes, the hyperplane we associate with the current node will be one of the angle bisectors of these two hyperplanes. Let w_3^T x + b_3 = 0 and w_4^T x + b_4 = 0 be the angle bisectors of w_1^T x + b_1 = 0 and w_2^T x + b_2 = 0. Assuming w_1 \neq w_2, it is easily shown that (note that \tilde{w}_1 and \tilde{w}_2 are such that \|w_1\| = \|w_2\| = 1)

\tilde{w}_3 = \tilde{w}_1 + \tilde{w}_2, \qquad \tilde{w}_4 = \tilde{w}_1 - \tilde{w}_2.    (4)

As we mentioned earlier, we choose the angle bisector that has the lower value of the Gini index. Let \tilde{w}_t be a hyperplane that is used for dividing the set of patterns S_t into two parts S_{tl} and S_{tr}. Let n^l_{t+} and n^l_{t-} denote the numbers of patterns of the two classes in set S_{tl}, and let n^r_{t+} and n^r_{t-} denote the numbers of patterns of the two classes in set S_{tr}. Then, the Gini index of hyperplane \tilde{w}_t is given by [1]

\mathrm{Gini}(\tilde{w}_t) = \frac{n_{tl}}{n_t}\left[1 - \left(\frac{n^l_{t+}}{n_{tl}}\right)^2 - \left(\frac{n^l_{t-}}{n_{tl}}\right)^2\right] + \frac{n_{tr}}{n_t}\left[1 - \left(\frac{n^r_{t+}}{n_{tr}}\right)^2 - \left(\frac{n^r_{t-}}{n_{tr}}\right)^2\right]    (5)

where n_t = n_{t+} + n_{t-} is the number of points in S_t. In addition, n_{tl} = n^l_{t+} + n^l_{t-} is the number of points falling in set S_{tl}, and n_{tr} = n^r_{t+} + n^r_{t-} is the number of points falling in set S_{tr}. We choose \tilde{w}_3 or \tilde{w}_4 to be the split rule for S_t based on which of the two gives the lesser value of the Gini index.

When the clustering hyperplanes are parallel (that is, when w_1 = w_2), we choose the hyperplane given by \tilde{w} = (w, b) = (w_1, (b_1 + b_2)/2) as the splitting hyperplane.

As is easy to see, in our method, the optimization problem of finding the best hyperplane at each node is solved exactly rather than by relying on a search technique. The clustering hyperplanes are obtained by solving the eigenvalue problem. After that, to find the hyperplane at the node, we need to compare only two hyperplanes based on the Gini index.

The complete algorithm for learning the decision tree is given as follows: At any given node, given the set of patterns S_t, we find the two clustering hyperplanes (by solving the generalized eigenvalue problem) and choose one of the two angle bisectors, based on the Gini index, as the hyperplane to be associated with this node. We then use this hyperplane to split S_t into two sets, i.e., those that go into the left and right child nodes of this node. We then recursively do the same at the two child nodes. The recursion stops when the set of patterns at a node is such that the fraction of patterns belonging to the minority class of this set is below a user-specified threshold or the depth of the tree reaches a prespecified maximum limit.
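
The node-level computation described above is small enough to sketch directly. The following Python/NumPy fragment is a minimal sketch of the two-class case: it forms G and H from the augmented patterns, obtains the two clustering hyperplanes as the extreme eigenvectors of the generalized eigenvalue problem (3) (using scipy.linalg.eigh rather than the explicit LU-based whitening described in the text), builds the two angle bisectors as in (4), and picks the one with the lower Gini index (5). All function and variable names are our own, the small ridge term is our addition to keep G positive definite, and no degenerate-case handling from the actual GDT implementation is included.

```python
import numpy as np
from scipy.linalg import eigh

def gini_of_split(Xa, y, w_aug):
    """Gini index (5) of the split induced by the augmented hyperplane w_aug."""
    side = Xa @ w_aug >= 0
    total = len(y)
    g = 0.0
    for mask in (side, ~side):
        n = mask.sum()
        if n == 0:
            continue
        p = np.bincount(y[mask]).astype(float) / n
        g += (n / total) * (1.0 - np.sum(p ** 2))
    return g

def geometric_split(X, y, ridge=1e-8):
    """Return the angle-bisector split rule (augmented weight vector) at a node.
    X: (n, d) patterns; y: labels in {0, 1}, where 1 plays the role of C+."""
    Xa = np.hstack([X, np.ones((len(X), 1))])            # augmented patterns [x^T 1]
    A, B = Xa[y == 1], Xa[y == 0]
    G = A.T @ A / len(A) + ridge * np.eye(Xa.shape[1])   # ridge keeps G positive definite
    H = B.T @ B / len(B) + ridge * np.eye(Xa.shape[1])
    # Generalized eigenproblem H w = lambda G w; eigh returns eigenvalues in ascending order.
    vals, vecs = eigh(H, G)
    w1, w2 = vecs[:, -1], vecs[:, 0]                     # max / min eigenvalue solutions
    w1 = w1 / np.linalg.norm(w1[:-1])                    # scale so that ||w|| = 1
    w2 = w2 / np.linalg.norm(w2[:-1])
    w3, w4 = w1 + w2, w1 - w2                            # the two angle bisectors, as in (4)
    # Choose the bisector with the lower Gini index (5).
    return min((w3, w4), key=lambda w: gini_of_split(Xa, y, w))
```

In a fuller implementation, one would also fall back to the midpoint hyperplane when the two clustering hyperplanes come out (nearly) parallel, as described above.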

Algorithm 1: Multiclass GDT

Input: S = {(x_i, y_i)}_{i=1}^n, Max-Depth, epsilon_1
Output: Pointer to the root of a decision tree
begin
    Root = GrowTreeMulticlass(S);
    return Root;
end

GrowTreeMulticlass(S_t)
Input: Set of patterns at node t (S_t)
Output: Pointer to a subtree
begin
    Divide set S_t into two parts, S_{t+} and S_{t-}; S_{t+} contains points of the majority class, and S_{t-} contains points of the remaining classes;
    Find matrix A corresponding to the points of S_{t+};
    Find matrix B corresponding to the points of S_{t-};
    Find \tilde{w}_1 and \tilde{w}_2, which are the solutions of optimization problems (1) and (2);
    Find angle bisectors \tilde{w}_3 and \tilde{w}_4 using (4);
    Choose the angle bisector having the lesser Gini index [cf. (5)] value. Call it \tilde{w};
    Let \tilde{w}_t denote the split rule at node t. Assign \tilde{w}_t <- \tilde{w};
    Let S_{tl} = {x_i in S_t : \tilde{w}_t^T \tilde{x}_i < 0} and S_{tr} = {x_i in S_t : \tilde{w}_t^T \tilde{x}_i >= 0};
    Define rho_1(S_t) = max(n_{t1}, ..., n_{tK}) / n_t;
    if (Tree-Depth = Max-Depth) then
        Get a node t_l, and make t_l a leaf node;
        Assign the class label associated with the majority class to t_l;
        Make t_l the left child of t;
    else if (rho_1(S_{tl}) > 1 - epsilon_1) then
        Get a node t_l, and make t_l a leaf node;
        Assign the class label associated with the majority class in set S_{tl} to t_l;
        Make t_l the left child of t;
    else
        t_l = GrowTreeMulticlass(S_{tl});
        Make t_l the left child of t;
    if (Tree-Depth = Max-Depth) then
        Get a node t_r, and make t_r a leaf node;
        Assign the class label associated with the majority class to t_r;
        Make t_r the right child of t;
    else if (rho_1(S_{tr}) > 1 - epsilon_1) then
        Get a node t_r, and make t_r a leaf node;
        Assign the class label associated with the majority class in set S_{tr} to t_r;
        Make t_r the right child of t;
    else
        t_r = GrowTreeMulticlass(S_{tr});
        Make t_r the right child of t;
    return t
end

B. Handling the Small-Sample-Size Problem

In our method, we solve the generalized eigenvalue problem using the standard LU-decomposition-based technique. In the optimization problem (1), the LU-decomposition-based method is applicable only when matrix G has full rank (which happens when matrix A has full column rank). In general, if there are a large number of examples, then (under the usual assumption that no feature is a linear combination of others) we would have full column rank for A. (This is the case, for example, in the proximal SVM method [28], which also finds the clustering hyperplanes like this.) However, in our decision tree algorithm, as we go down the tree, the number of points falling at nonleaf nodes keeps decreasing. Hence, there might be cases where matrix G becomes rank deficient. We describe a method of handling this small-sample-size problem by adopting the technique presented in [31].

Suppose that matrix G has rank r < d + 1. Let N be the null space of G. Let Q = [\alpha_1 ... \alpha_{d+1-r}] be the matrix whose columns are an orthonormal basis for N. According to the method given in [31], we first project all the points in class C_- to the null space of G. Every vector \tilde{x} belonging to class C_- after projection becomes Q Q^T \tilde{x}. Let the matrix corresponding to H after projection be \bar{H}. Then

\bar{H} = \frac{1}{n_{t-}} \sum_{\tilde{x} \in C_-} Q Q^T \tilde{x} \tilde{x}^T Q Q^T = Q Q^T H Q Q^T.

Note that \bar{G} = Q Q^T G Q Q^T would be zero because the columns of Q span the null space of G. Now, the eigenvector corresponding to the largest eigenvalue of \bar{H} is the required solution (the clustering hyperplane for class C_+).

We now explain how this approach works. The whole analysis is based on the following result.

Theorem 1 [31]: Suppose that R is a set in the d-dimensional space and, for all x ∈ R, f(x) ≥ 0, g(x) ≥ 0, and f(x) + g(x) > 0. Let h_1(x) = f(x)/g(x) and h_2(x) = f(x)/(f(x) + g(x)). Then, h_1(x) has a maximum (including positive infinity) at point x_0 in R if h_2(x) has a maximum at point x_0.

Using Theorem 1, it is clear that the \tilde{w} which maximizes the ratio (\tilde{w}^T H \tilde{w})/(\tilde{w}^T G \tilde{w} + \tilde{w}^T H \tilde{w}) will also maximize (\tilde{w}^T H \tilde{w})/(\tilde{w}^T G \tilde{w}). It is obvious that (\tilde{w}^T H \tilde{w})/(\tilde{w}^T G \tilde{w} + \tilde{w}^T H \tilde{w}) = 1 if and only if \tilde{w}^T H \tilde{w} ≠ 0 and \tilde{w}^T G \tilde{w} = 0. This implies that, for each \tilde{w} which satisfies \tilde{w}^T H \tilde{w} ≠ 0, we want to maximize (\tilde{w}^T H \tilde{w})/(\tilde{w}^T G \tilde{w} + \tilde{w}^T H \tilde{w}). However, simply maximizing the modified ratio (\tilde{w}^T H \tilde{w})/(\tilde{w}^T (G + H) \tilde{w}) is not sufficient because \tilde{w}^T H \tilde{w} may not reach its maximum value [31]. Selecting \tilde{w} ∈ N enforces \tilde{w}^T G \tilde{w} = 0.

To summarize, when matrix G becomes rank deficient, we find its null space and project all the feature vectors of class C_- onto this null space. The clustering hyperplane for class C_+ is then chosen as the principal eigenvector of the matrix \bar{H} described earlier. The small-sample-size problem can occur only when matrix G becomes rank deficient. The rank deficiency of matrix H does not affect the solution of the optimization problem given by (1).
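
A minimal sketch of this null-space fallback is given below. The function name is our own, and NumPy's SVD is used to obtain an orthonormal basis of the null space of G; the fragment only mirrors the projection step described above under these assumptions and is not the authors' MATLAB implementation.

```python
import numpy as np

def clustering_hyperplane_small_sample(A_aug, B_aug, tol=1e-10):
    """Clustering hyperplane for class C+ when G is rank deficient.
    A_aug, B_aug: augmented patterns [x^T 1] of classes C+ and C-."""
    G = A_aug.T @ A_aug / len(A_aug)
    H = B_aug.T @ B_aug / len(B_aug)
    # Orthonormal basis Q of the null space of G, from the SVD of the symmetric matrix G.
    U, s, _ = np.linalg.svd(G)
    Q = U[:, s <= tol * s.max()]                 # columns spanning the null space of G
    if Q.shape[1] == 0:
        raise ValueError("G has full rank; solve the generalized eigenproblem (3) instead.")
    P = Q @ Q.T                                  # projector onto the null space of G
    H_bar = P @ H @ P                            # H after projecting the class C- points
    # The principal eigenvector of H_bar is taken as the clustering hyperplane for C+.
    vals, vecs = np.linalg.eigh(H_bar)
    return vecs[:, -1]
```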

C. GDT for Multiclass Classification

The algorithm presented in the previous section can be easily generalized to handle the case when we have more than two classes. Let S = {(x_i, y_i) : x_i ∈ R^d, y_i ∈ {1, ..., K}, i = 1, ..., n} be the training data set, where K is the number of classes. At a node t of the tree, we divide the set of points S_t at that node into two subsets, S_{t+} and S_{t-}. S_{t+} contains the points of the majority class in S_t, whereas S_{t-} contains the rest of the points. We learn the tree as in the binary case discussed earlier. The only difference here is that we use the fraction of the points of the majority class to decide whether a given node is a leaf node or not. A complete description of the decision tree method for multiclass classification is given in Algorithm 1. Algorithm 1 recursively calls the procedure GrowTreeMulticlass(S_t), which learns a split rule for node t and returns a subtree rooted at that node.

III. ANALYSIS

In this section, we present some analysis of our algorithm. We consider only the binary classification problem. We prove some interesting properties of the angle bisector hyperplanes to indicate why the angle bisectors may be a good choice (in a decision tree) for the split rule at a node.

Let S be a set of n patterns (feature vectors) of which n_+ are of class C_+ and n_- are of class C_-. Recall that, as per our notation, A is a matrix whose rows are the feature vectors of class C_+, and B is a matrix whose rows are the feature vectors of class C_-. Let \mu_+, \mu_- ∈ R^d be the sample means, and let \Sigma_+, \Sigma_- be the sample covariance matrices. Then, we have

\Sigma_+ = \frac{1}{n_+} (A - e_{n_+} \mu_+^T)^T (A - e_{n_+} \mu_+^T)    (6)

where e_{n_+} is an n_+-dimensional vector having all elements one. Similarly, we will have

\Sigma_- = \frac{1}{n_-} (B - e_{n_-} \mu_-^T)^T (B - e_{n_-} \mu_-^T).    (7)

Case 1 (\Sigma_+ = \Sigma_- = \Sigma): We have the following result.

Theorem 2: Let S be a set of feature vectors with equal sample covariance matrices for the two classes. Then, the angle bisector of the two clustering hyperplanes will have the same orientation as the Fisher linear discriminant hyperplane.

Proof: Given any arbitrary w ∈ R^d, b ∈ R, we can show through simple algebra that^4

\frac{1}{n_+} \|A w + b e_{n_+}\|^2 = \sigma^2 + \eta_+^2, \qquad \frac{1}{n_-} \|B w + b e_{n_-}\|^2 = \sigma^2 + \eta_-^2

where \sigma^2 = w^T \Sigma w, \eta_+ = w^T \mu_+ + b, and \eta_- = w^T \mu_- + b. Let f_1(w_1, b_1) and f_2(w_2, b_2) be the objective functions of optimization problems (1) and (2), respectively. Then, we have

f_1(w_1, b_1) = \frac{\sigma_1^2 + \eta_{1+}^2}{\sigma_1^2 + \eta_{1-}^2}, \qquad f_2(w_2, b_2) = \frac{\sigma_2^2 + \eta_{2-}^2}{\sigma_2^2 + \eta_{2+}^2}

where \sigma_j^2 = w_j^T \Sigma w_j, \eta_{j+} = w_j^T \mu_+ + b_j, and \eta_{j-} = w_j^T \mu_- + b_j, j = 1, 2. Now, by taking the derivatives of f_i(w_i, b_i) with respect to (w_i, b_i), i = 1, 2, and equating them to zero, we get a set of equations which gives us

w_1 = \gamma_1 \Sigma^{-1} (\mu_+ - \mu_-), \qquad w_2 = \gamma_2 \Sigma^{-1} (\mu_+ - \mu_-)

where \gamma_1 and \gamma_2 are some scalars. This means that both clustering hyperplanes are parallel to each other and that (w_3, b_3) is such that w_3 is proportional to \Sigma^{-1}(\mu_+ - \mu_-). This is the same direction as the Fisher linear discriminant, thus proving the theorem.

^4 To see the complete calculations in the proof, please refer to [32].

Case 2 (\mu_+ = \mu_- = \mu): Next, we discuss the case of a data distribution where both classes have the same mean.

Theorem 3: If the sample means of the two classes are the same, then the clustering hyperplanes found by solving optimization problems (1) and (2) pass through the common mean.

Proof: Optimization problem (1), which finds the clustering hyperplane for C_+, is

\max_{(w,b) \neq 0} \frac{\frac{1}{n_-} \sum_{x_i \in C_-} (x_i^T w + b)^2}{\frac{1}{n_+} \sum_{x_i \in C_+} (x_i^T w + b)^2} = \max_{\tilde{w} \neq 0} \frac{\tilde{w}^T H \tilde{w}}{\tilde{w}^T G \tilde{w}}.    (8)

This problem can be equivalently written as a constrained optimization problem in the following way:

\max_{(w,b) \neq 0} \frac{1}{n_-} \sum_{x_i \in C_-} (x_i^T w + b)^2  \quad \text{s.t.} \quad \frac{1}{n_+} \sum_{x_i \in C_+} (x_i^T w + b)^2 = 1.

Equating the derivative of the Lagrangian of the preceding problem with respect to b to zero, we get (with \lambda as the Lagrange multiplier)

\frac{2}{n_-} \sum_{x_i \in C_-} (x_i^T w + b) - \frac{2\lambda}{n_+} \sum_{x_i \in C_+} (x_i^T w + b) = 0.

Since the two classes have the same sample mean \mu, this reduces to (1 - \lambda)(w^T \mu + b) = 0, which gives b = -w^T \mu. This means that the clustering hyperplane for class C_+ passes through the common mean. Similarly, we can show that the clustering hyperplane for class C_- also passes through the common mean.

When \mu_+ = \mu_- = \mu, Theorem 3 says that b = -w^T \mu. Now, substituting this value of b in (8), we get the optimization problem for finding w as

\max_{w \neq 0} \frac{\frac{1}{n_-} \sum_{x_i \in C_-} (x_i^T w - \mu^T w)^2}{\frac{1}{n_+} \sum_{x_i \in C_+} (x_i^T w - \mu^T w)^2} = \max_{w \neq 0} \frac{w^T \Sigma_- w}{w^T \Sigma_+ w}.

Hence, w_1 and w_2 will be the eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem \Sigma_- w = \lambda \Sigma_+ w, respectively. Since an eigenvector can be determined only up to a scale factor, under our notation, we take \|w_1\| = \|w_2\| = 1. Since the ratio w^T \Sigma_- w / w^T \Sigma_+ w is invariant to the scaling of vector w, we can maximize or minimize the ratio by constraining the denominator to have any constant value. Thus, we can write w_1 and w_2 as

w_1 = \arg\max_w \; w^T \Sigma_- w \quad \text{s.t.} \quad w^T \Sigma_+ w = \Delta    (9)

w_2 = \arg\min_w \; w^T \Sigma_- w \quad \text{s.t.} \quad w^T \Sigma_+ w = \Delta    (10)

where the value of \Delta can be chosen so that it is consistent with our scaling of w_1 and w_2. Now, the parameters of the two angle bisectors can be written as (w_3, b_3) = (w_1 + w_2, b_1 + b_2) and (w_4, b_4) = (w_1 - w_2, b_1 - b_2). We show that the pair of vectors w_3 and w_4 is the solution to the following optimization problem:

\max_{w_a, w_b} \; w_a^T \Sigma_- w_b \quad \text{s.t.} \quad w_a^T \Sigma_+ w_a = 2\Delta = w_b^T \Sigma_+ w_b, \quad w_a^T \Sigma_+ w_b = 0.    (11)

Consider the possible solution to the optimization problem (11) given by w_a = w_1 + w_2 and w_b = w_1 - w_2. We know that w_1 and w_2 are feasible solutions to problems (9) and (10), respectively. In addition, because w_1 and w_2 are eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem \Sigma_- w = \lambda \Sigma_+ w, they satisfy w_1^T \Sigma_+ w_2 = 0. Thus, we see that the pair of vectors w_a = w_1 + w_2 and w_b = w_1 - w_2 satisfies all the constraints of the optimization problem (11) and hence is a feasible solution. We can rewrite the objective function of problem (11) as

w_a^T \Sigma_- w_b = \frac{1}{4}\left[(w_a + w_b)^T \Sigma_- (w_a + w_b) - (w_a - w_b)^T \Sigma_- (w_a - w_b)\right].

The difference above will be maximum when the first term is maximized and the second term is minimized. For real symmetric matrices \Sigma_- and \Sigma_+, if we want to maximize w^T \Sigma_- w subject to the constraint w^T \Sigma_+ w = constant, the solution is the eigenvector corresponding to the maximum eigenvalue of the generalized eigenvalue problem \Sigma_- w = \lambda \Sigma_+ w. Similarly, to minimize w^T \Sigma_- w subject to the same constraint, the solution is the eigenvector corresponding to the minimum eigenvalue. Hence, the optimal solution to (11) is obtained when (w_a + w_b) is the eigenvector corresponding to the maximum eigenvalue and (w_a - w_b) is the eigenvector corresponding to the minimum eigenvalue of the generalized eigenvalue problem \Sigma_- w = \lambda \Sigma_+ w. Thus, w_a = w_1 + w_2 and w_b = w_1 - w_2 constitute the solution to optimization problem (11).

Now, we try to interpret this optimization problem to argue that it is a good optimization problem to solve when we want to find the best hyperplane to split the data at a node while learning a decision tree. Let X and Y be random variables denoting the feature vectors from classes C_+ and C_-, respectively. Define new random variables X_a, X_b, Y_a, and Y_b as X_a = w_a^T X, X_b = w_b^T X, Y_a = w_a^T Y, and Y_b = w_b^T Y. Now, let us assume that we have enough samples from both classes, so that we can assume that the empirical averages are close to the expectations. We can rewrite the objective function in the optimization problem given by (11) as

w_a^T \Sigma_- w_b = \frac{1}{n_-} \sum_{x \in C_-} w_a^T (x - \mu)(x - \mu)^T w_b \approx E\left[(Y_a - E[Y_a])(Y_b - E[Y_b])\right] = \mathrm{cov}(Y_a, Y_b).

Similarly, we can rewrite the constraints of that problem as

w_a^T \Sigma_+ w_a \approx E\left[(X_a - E[X_a])^2\right] = \mathrm{var}(X_a),
w_b^T \Sigma_+ w_b \approx E\left[(X_b - E[X_b])^2\right] = \mathrm{var}(X_b),
w_a^T \Sigma_+ w_b \approx E\left[(X_a - E[X_a])(X_b - E[X_b])\right] = \mathrm{cov}(X_a, X_b).

Hence, the angle bisectors, which are the solution of (11), would be the solution of the optimization problem given as

\max_{w_a, w_b} \; \mathrm{cov}(Y_a, Y_b) \quad \text{s.t.} \quad \mathrm{var}(X_a) = \mathrm{var}(X_b) = 2\Delta, \quad \mathrm{cov}(X_a, X_b) = 0.    (12)

This optimization problem seeks to find w_a and w_b (which would be our angle bisectors) such that the covariance between Y_a and Y_b is maximized while keeping X_a and X_b uncorrelated. (The constraints on the variances are needed only to ensure that the optimization problem has a bounded solution.) Y_a and Y_b represent random variables that are projections of a class C_- feature vector onto w_a and w_b, respectively, and X_a and X_b are projections of the class C_+ feature vector onto w_a and w_b. Thus, we are looking for two directions such that the patterns of one class become uncorrelated when projected onto these two directions, whereas the correlation between the projections of the other class feature vector becomes maximum. Thus, our angle bisectors give us directions that are good for discriminating between the two classes; hence, we feel that our choice of the angle bisectors as the split rule is a sound choice while learning a decision tree.

Case 3 (\Sigma_+ ≠ \Sigma_- and \mu_+ ≠ \mu_-): We next consider the general case of different covariance matrices and different means for the two classes. Recall that the parameters of the two clustering hyperplanes are \tilde{w}_1 and \tilde{w}_2, which are the eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem H\tilde{w} = \lambda G\tilde{w}. Then, using arguments similar to those in Case 2, one can show that the angle bisectors are the solution of the following optimization problem:

\max_{\tilde{w}_a, \tilde{w}_b} \; \tilde{w}_a^T H \tilde{w}_b \quad \text{s.t.} \quad \tilde{w}_a^T G \tilde{w}_a = 2\Delta = \tilde{w}_b^T G \tilde{w}_b, \quad \tilde{w}_a^T G \tilde{w}_b = 0.    (13)

Again, consider X as a random feature vector coming from class C_+ and Y as a random feature vector coming from class C_-. We define new random variables X_a, X_b, Y_a, and Y_b as X_a = w_a^T X + b_a, X_b = w_b^T X + b_b, Y_a = w_a^T Y + b_a, and Y_b = w_b^T Y + b_b. As earlier, we assume that there are enough examples from both classes, so that the empirical averages can be replaced by expectations. Then, as in the earlier case, we can rewrite the optimization problem given by (13) as

\max_{w_a, w_b} \; E[Y_a Y_b] \quad \text{s.t.} \quad E[X_a^2] = 2\Delta = E[X_b^2], \quad E[X_a X_b] = 0.

This is very similar to the optimization problem we derived in the previous case, with the only difference being that the covariances are replaced by cross expectations or correlations. Thus, finding the two angle bisector hyperplanes is the same as finding two vectors \tilde{w}_a and \tilde{w}_b in R^{d+1} such that the cross expectation of the projections of class C_- points on these vectors is maximized while keeping the cross expectation of the projections of class C_+ points on these vectors at zero. Again, E[X_a^2] and E[X_b^2] are kept constant to ensure that the solutions of the optimization problem are bounded. Once again, we feel that the preceding discussion shows that the angle bisectors are a good choice as the split rule at a node in the decision tree.
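
The equal-covariance case (Theorem 2) is easy to check numerically. The short script below is a sketch using synthetic Gaussian data: it computes the two clustering hyperplanes from the generalized eigenvalue problem (3) and verifies that the non-degenerate angle bisector is, up to sign and sampling error, collinear with the Fisher direction \Sigma^{-1}(\mu_+ - \mu_-). The dimensions, covariance, and helper code are our own choices for illustration.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d, n = 5, 20000
Sigma = np.diag([3.0, 1.0, 2.0, 0.5, 1.5])
mu_p, mu_m = np.ones(d), -np.ones(d)
Xp = rng.multivariate_normal(mu_p, Sigma, n)        # class C+
Xm = rng.multivariate_normal(mu_m, Sigma, n)        # class C- (same covariance)

aug = lambda X: np.hstack([X, np.ones((len(X), 1))])
G = aug(Xp).T @ aug(Xp) / n
H = aug(Xm).T @ aug(Xm) / n
vals, vecs = eigh(H, G)                             # generalized eigenproblem (3)
w1, w2 = vecs[:, -1], vecs[:, 0]
w1, w2 = w1 / np.linalg.norm(w1[:-1]), w2 / np.linalg.norm(w2[:-1])
# The two clustering hyperplanes are nearly parallel here, so one bisector is
# nearly zero; keep the non-degenerate one (its w part).
w3 = max([(w1 + w2)[:-1], (w1 - w2)[:-1]], key=np.linalg.norm)

pooled = np.cov(np.vstack([Xp - mu_p, Xm - mu_m]).T)
fisher = np.linalg.solve(pooled, mu_p - mu_m)       # Fisher direction Sigma^{-1}(mu+ - mu-)
cos = abs(w3 @ fisher) / (np.linalg.norm(w3) * np.linalg.norm(fisher))
print(cos)                                          # close to 1 for large n
```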

IV. EXPERIMENTS

In this section, we present empirical results to show the effectiveness of our decision tree learning algorithm. We test the performance of our algorithm on several synthetic and real data sets. We compare our approach with OC1 [21] and CART-LC [1], which are among the best state-of-the-art oblique decision tree algorithms. We also compare our approach with the recently proposed linear-discriminant-analysis-based decision tree (LDDT) [15] and with the SVM classifier, which is among the best generic classifiers today. We also compare our approach with GEPSVM [28] on binary classification problems. The experimental comparisons are presented on four synthetic data sets and ten "real" data sets from the UCI ML repository [33].

TABLE I: DETAILS OF REAL-WORLD DATA SETS USED FROM THE UCI ML REPOSITORY

Data Set Description: We generated four synthetic data sets in different dimensions, which are described here (a generation sketch for the first of these data sets is given after the list).

1) 2 x 2 checkerboard data set: 2000 points are sampled uniformly from [-1, 1] x [-1, 1]. A point is labeled +1 if it falls in ([-1, 0] x [0, 1]) or ([0, 1] x [-1, 0]), i.e., in one of two diagonally opposite quadrants; otherwise, it is labeled -1. Out of the 2000 sampled points, 979 points are labeled +1, and 1021 points are labeled -1. Then, all the points are rotated by an angle of pi/6 with respect to the first axis in the counterclockwise direction to form the final training set.

2) 4 x 4 checkerboard data set: 2000 points are sampled uniformly from [0, 4] x [0, 4]. This whole square is divided into 16 unit squares having unit length in both dimensions. These squares are given indexes ranging from 1 to 4 on both axes. If a point falls in a unit square such that the sum of its two indexes is even, then we assign the label +1 to that point; otherwise, we assign the label -1 to it. Out of the 2000 sampled points, 997 points are labeled +1, and 1003 points are labeled -1. Then, all the points are rotated by an angle of pi/6 with respect to the first axis in the counterclockwise direction.

3) Ten-dimensional synthetic data set: 2000 points are sampled uniformly from [-1, 1]^10. Consider three hyperplanes in R^10 whose (augmented) parameter vectors are w_1 = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]^T, w_2 = [1, -1, 0, 0, 1, 0, 0, 1, 0, 0, 0]^T, and w_3 = [0, 1, -1, 0, -1, 0, 1, 1, -1, 1, 0]^T. A point x is labeled +1 if the signs of w_1^T \tilde{x}, w_2^T \tilde{x}, and w_3^T \tilde{x} follow one of four prescribed sign patterns, and it is labeled -1 otherwise. Out of the 2000 sampled points, 1020 points are labeled +1, and 980 points are labeled -1.
4) 100-dimensional synthetic data set: 2000 points are sampled uniformly from [-1, 1]^100. Consider two hyperplanes in R^100 whose (augmented) parameter vectors are given by w_1 = [2e_{50}^T, e_{50}^T, 3]^T and w_2 = [e_{50}^T, e_{50}^T, 5]^T, where e_{50} is a 50-dimensional vector whose elements are all 1. A point x is labeled +1 if the signs of w_1^T \tilde{x} and w_2^T \tilde{x} follow one of two prescribed sign patterns, and it is labeled -1 otherwise. Out of the 2000 sampled points, 862 points are labeled +1, and 1138 points are labeled -1.
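
As a concrete illustration of the synthetic data, the snippet below generates a rotated 2 x 2 checkerboard set along the lines of data set 1). The choice of which two diagonal cells carry the +1 label and the random seed are our own assumptions; only the general recipe (uniform sampling, diagonal labeling, rotation by pi/6) follows the description above.

```python
import numpy as np

def checkerboard_2x2(n=2000, angle=np.pi / 6, seed=0):
    """Rotated 2 x 2 checkerboard in [-1, 1]^2 (cf. synthetic data set 1)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    # +1 on two diagonally opposite cells (here: second and fourth quadrants).
    y = np.where((X[:, 0] <= 0) == (X[:, 1] > 0), 1, -1)
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])               # counterclockwise rotation by `angle`
    return X @ R.T, y

X, y = checkerboard_2x2()
print(X.shape, np.bincount((y == 1).astype(int)))  # sample counts per class
```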
Apart from these four data sets, we also tested GDT on several benchmark data sets downloaded from the UCI ML repository [33]. The ten data sets that we used are described in Table I. The U.S. Congressional Votes data set available in the UCI ML repository has many observations with missing values for some features. For our experiments, we choose only those observations for which there are no missing values for any feature. We also do not use all the observations in the Magic data set. It has a total of 19 020 samples of both classes. However, for our experiments, we randomly choose a total of 6000 points, with 3000 from each class.

Experimental Setup: We implemented GDT and LDDT in MATLAB. For OC1 and CART-LC, we have used the downloadable package available on the Internet [34]. To learn SVM classifiers, we use the libsvm [35] code. Libsvm-2.84 [35] uses the one-versus-rest approach for multiclass classification. We have implemented GEPSVM in MATLAB.

GDT has only one user-defined parameter, which is epsilon_1 (the threshold on the fraction of the points used to decide on any node being a leaf node). For all our experiments, we have chosen epsilon_1 using tenfold cross validation. SVM has two user-defined parameters, i.e., the penalty parameter C and the width parameter of the Gaussian kernel. The best values for these parameters are found using fivefold cross validation, and the results reported are with these parameters. Both OC1 and CART use 90% of the total number of points for training and 10% of the points for pruning. OC1 needs two more user-defined parameters: the number of restarts R and the number of random jumps J. For our experiments, we have set R = 20 and J = 5, which are the default values suggested in the package. For the cases where we use GEPSVM with the Gaussian kernel, we found the best width parameter using fivefold cross validation.

TABLE II: COMPARISON RESULTS BETWEEN GEOMETRIC DECISION TREE AND OTHER DECISION TREE APPROACHES

Simulation Results: We now discuss the performance of GDT in comparison with other approaches on different data sets. The results provided are based on ten repetitions of tenfold cross validation. We show the average values and standard deviations (computed over the ten repetitions).

Table II shows the comparison results of GDT with the other decision tree approaches. In the table, we show the average and standard deviation^5 of the accuracy, the size and depth of the tree, and the time taken for each of the algorithms on each of the problems. We can intuitively take the confidence interval of the estimated accuracy of an algorithm to be one standard deviation on either side of the average. Then, we can say that, on a problem, one algorithm has significantly better accuracy than another if the confidence interval for the accuracy of the first is completely to the right of that of the second.

From Table II, we see that the average accuracy of GDT is better than that of all the other decision tree algorithms, except on the Wine, Votes, and Heart data sets, where LDDT has the same or better average accuracy. In terms of the confidence interval of the average accuracy, the performance of GDT is comparable to the best of the other decision tree algorithms on the Breast Cancer, Bupa Liver, Magic, Heart, Votes, and Wine data sets. On the remaining eight data sets, the performance of GDT is significantly better than that of all the other decision tree approaches. Thus, overall, in terms of accuracy, the performance of GDT is quite good.

^5 We do not show the standard deviation if it is less than 0.001.
Fig. 2. Comparison of GDT with OC1 on the 4 x 4 checkerboard data. (a) Hyperplanes learned at the root node and its left child using GDT. (b) Hyperplanes learned at the root node and its left child node using the OC1 (oblique) decision tree.

Fig. 3. Sensitivity of the performance of GDT to the parameter epsilon_1. The first column shows how the average cross-validation accuracy changes with epsilon_1, and the second column shows the change of the average number of leaves with epsilon_1.

In the majority of the cases, GDT generates trees with smaller depth and fewer leaves compared with the other decision tree approaches. This supports the idea that our algorithm better exploits the geometric structure of the data set while generating decision trees.

Timewise, the GDT algorithm is much faster than OC1 and CART, as can be seen from the results in the table. In most cases, the time taken by GDT is smaller by at least a factor of ten. We feel that this is because the problem of obtaining the best split rule at each node is solved using an efficient linear algebra algorithm in the case of GDT, whereas the other approaches have to resort to search techniques because optimizing impurity-based measures is hard. In all cases, the time taken by GDT is comparable to that of LDDT. This is also to be expected because LDDT uses similar computational strategies.

TABLE III: COMPARISON RESULTS OF GEOMETRIC DECISION TREE WITH SVM AND GEPSVM
We next consider comparisons of the GDT algorithm with SVM and GEPSVM. Table III shows these comparison results. GEPSVM with the linear kernel performs the same as GDT on the 2 x 2 checkerboard problem because, for this problem, the two approaches work in a similar way. However, when more than two hyperplanes are required, GEPSVM with the Gaussian kernel performs worse than our decision tree approach. Moreover, with the Gaussian kernel, GEPSVM solves a generalized eigenvalue problem whose size is the number of points, whereas our decision tree solves a generalized eigenvalue problem of the dimension of the data at each node (which is the case for GEPSVM only when it uses the linear kernel). This gives us an extra advantage in computational cost over GEPSVM. For all binary classification problems, GDT outperforms GEPSVM.

The performance of GDT is comparable to that of SVM in terms of accuracy. GDT performs significantly better than SVM on the 10- and 100-dimensional synthetic data sets and the Balance Scale data set. GDT performs comparably to SVM on the 2 x 2 checkerboard, Bupa Liver, Pima Indian, Magic, Heart, and Votes data sets. GDT performs worse than SVM on the 4 x 4 checkerboard and the Breast Cancer, Vehicle, and Waveform data sets.

In terms of the time taken to learn the classifier, GDT is faster than SVM in the majority of the cases. At every node of the tree, we solve a generalized eigenvalue problem, which takes time on the order of (d + 1)^3, where d is the dimension of the feature space. On the other hand, SVM solves a quadratic program whose time complexity is O(n^k), where k is between 2 and 3 and n is the number of points. Thus, in general, when the number of points is large compared to the dimension of the feature space, GDT learns the classifier faster than SVM.
Finally, in Fig. 2, we show the effectiveness of our algorithm in terms of capturing the geometric structure of the classification problem. We show the first two hyperplanes learned by our approach and by OC1 for the 4 x 4 checkerboard data. We see that our approach learns the correct geometric structure of the classification boundary, whereas OC1, which uses the Gini index as the impurity measure, does not capture it.

Although GDT finds the correct decision boundary for the 4 x 4 checkerboard data set, as shown in Fig. 2, its cross-validation accuracy is lower than that of SVM. This may be because the data here are dense, and hence, numerical round-off errors can affect the classification of points near the boundary. On the other hand, if we allow some margin between the data points and the decision boundary (by ensuring that all the sampled points are at least a distance of 0.05 away from the decision boundary), then we observe that SVM and GDT both achieve 99.8% cross-validation accuracy.
In the GDT algorithm described in Section II, epsilon_1 is a parameter. If more than a (1 - epsilon_1) fraction of the points at a node falls into the majority class, then we declare that node to be a leaf node and assign the class label of the majority class to it. As we increase epsilon_1, the chances of any node becoming a leaf node increase. This leads to smaller decision trees, and the learning time also decreases. However, the accuracy may suffer. To understand the robustness of our algorithm with respect to this parameter, we show, in Fig. 3, the variation of the cross-validation accuracy and of the average number of leaves with epsilon_1. The range of values of epsilon_1 is chosen to be 0.05-0.35. We see that the cross-validation accuracy does not change much with epsilon_1. However, with increasing epsilon_1, the average number of leaves decreases. Thus, even though the tree size decreases with epsilon_1, the cross-validation accuracy remains in a small interval. This happens because, for most of the points, the decision is governed by nodes closer to the root node. The few remaining examples, which are tough to classify, lead the decision tree to grow further. However, as the value of epsilon_1 increases, only the nodes containing these tough-to-classify points become leaf nodes. From the results in Fig. 3, we can say that epsilon_1 in the range 0.1-0.3 would be appropriate for all data sets.
V. CONCLUSION
In this paper, we have presented a new algorithm for learning oblique decision trees. The novelty is in learning hyperplanes that capture the geometric structure of the class regions. At each node, we have found the two clustering hyperplanes and

chosen one of the angle bisectors as the split rule. We have


presented some analysis to derive the optimization problem for
which the angle bisectors are the solution. Based on this, we
argued that our method of choosing the hyperplane at each node is
sound. Through extensive empirical studies, we showed that the
method performs better than the other decision tree approaches in
terms of accuracy, size of the tree, and time. We have also shown
that the classifier obtained with GDT is as good as that with SVM,
whereas it is faster than SVM. Thus, overall, the algorithm
presented here is a good and novel classification method.
REFERENCES
[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and
Regression Trees. Belmont, CA: Wadsworth and Brooks, 1984, ser.
Statistics/Probability Series.
[2] J. Quinlan, "Induction of decision trees," Mach. Learn., vol. 1, no. 1,
pp. 81-106, 1986.
[3] R. O. Duda and H. Fossum, "Pattern classification by iteratively determined linear and piecewise linear discriminant functions," IEEE Trans.
Electron. Comput., vol. EC-15, no. 2, pp. 220-232, Apr. 1966.
[4] M. I. Jordan, "A statistical approach to decision tree modeling," in Proc.
7th Annu. Conf. COLT, New Brunswick, NJ, Jul. 1994, pp. 13-20.
[5] K. P. Bennett, "Global tree optimization: A non-greedy decision tree
algorithm," Comput. Sci. Statist., vol. 26, pp. 156-160, 1994.
[6] K. P. Bennett and J. A. Blue, "Optimal decision trees," Dept. Math.
Sci., Rensselaer Polytech. Inst., Troy, NY, Tech. Rep. R.P.I. Math Report No.
214, 1996.
[7] K. P. Bennett and J. A. Blue, "A support vector machine approach to
decision trees," in Proc. IEEE World Congr. Comput. Intell., Anchorage, AK,
May 1998, vol. 3, pp. 2396-2401.
[8] A. Suarez and J. F. Lutsko, "Globally optimal fuzzy decision trees for
classification and regression," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21,
no. 12, pp. 1297-1311, Dec. 1999.
[9] L. Rokach and O. Maimon, "Top-down induction of decision trees
classifiers - A survey," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol.
35, no. 4, pp. 476-487, Nov. 2005.
[10] O. L. Mangasarian, "Multisurface method of pattern separation," IEEE
Trans. Inf. Theory, vol. IT-14, no. 6, pp. 801-807, Nov. 1968.
[11] T. Lee and J. A. Richards, "Piecewise linear classification using seniority
logic committee methods, with application to remote sensing," Pattern
Recognit., vol. 17, no. 4, pp. 453-464, 1984.
[12] P. E. Utgoff and C. E. Brodley, "Linear machine decision trees," Dept.
Comput. Sci., Univ. Massachusetts, Amherst, MA, Tech. Rep. 91-10, Jan.
1991.
[13] M. Lu, C. L. P. Chen, J. Huo, and X. Wang, "Multi-stage decision tree
based on inter-class and inner-class margin of SVM," in Proc. IEEE Int. Conf.
Syst., Man, Cybern., San Antonio, TX, 2009, pp. 1875-1880.
[14] R. Tibshirani and T. Hastie, "Margin trees for high dimensional classification," J. Mach. Learn. Res., vol. 8, pp. 637-652, Mar. 2007.
[15] M. F. Amasyali and O. Ersoy, "Cline: A new decision-tree family," IEEE
Trans. Neural Netw., vol. 19, no. 2, pp. 356-363, Feb. 2008.
[16] A. Kołakowska and W. Malina, "Fisher sequential classifiers," IEEE
Trans. Syst., Man, Cybern. B, Cybern., vol. 35, no. 5, pp. 988-998, Oct. 2005.
[17] D. Dancey, Z. A. Bandar, and D. McLean, "Logistic model tree extraction
from artificial neural networks," IEEE Trans. Syst., Man, Cybern. B, Cybern.,
vol. 37, no. 4, pp. 794-802, Aug. 2007.
[18] W. Pedrycz and Z. A. Sosnowski, "Genetically optimized fuzzy decision trees," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 35, no. 3, pp.
633-641, Jun. 2005.
[19] B. Chandra and P. P. Varghese, "Fuzzy SLIQ decision tree algorithm,"
IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 5, pp. 1294- 1301,
Oct. 2008.
[20] C. Ferri, P. Flach, and J. Hernández-Orallo, "Learning decision trees using
the area under the ROC curve," in Proc. 19th ICML, San Francisco, CA, Jul.
2002, pp. 139-146.
[21] S. K. Murthy, S. Kasif, and S. Salzberg, "A system for induction of
oblique decision trees," J. Artif. Intell. Res., vol. 2, no. 1, pp. 1-32, Aug. 1994.
[22] S.-H. Cha and C. Tappert, "A genetic algorithm for constructing compact
binary decision trees," J. Pattern Recognit. Res., vol. 4, no. 1, pp. 1-13, Feb.
2009.

[23] Z. Fu, B. L. Golden, S. Lele, S. Raghavan, and E. A. Wasil, "A genetic algorithm-based approach for building accurate decision trees,"
INFORMS J. Comput., vol. 15, no. 1, pp. 3-22, Jan. 2003.
[24] E. Cantú-Paz and C. Kamath, "Inducing oblique decision trees with evolutionary algorithms," IEEE Trans. Evol. Comput., vol. 7, no. 1, pp. 54-68, Feb.
2003.
[25] J. M. Pangilinan and G. K. Janssens, "Pareto-optimality of oblique decision trees from evolutionary algorithms," J. Global Optim., pp. 1-11, Oct.
2010.
[26] S. Shah and P. S. Sastry, "New algorithms for learning and pruning oblique
decision trees," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 29, no. 4,
pp. 494-505, Nov. 1999.
[27] P. S. Sastry, M. Magesh, and K. P. Unnikrishnan, "Two timescale analysis
of the alopex algorithm for optimization," Neural Comput., vol. 14, no. 11, pp.
2729-2750, Nov. 2002.
[28] O. L. Mangasarian and E. W. Wild, "Multisurface proximal support vector
machine classification via generalized eigenvalues," IEEE Trans. Pattern Anal.
Mach. Intell., vol. 28, no. 1, pp. 69-74, Jan. 2006.
[29] N. Manwani and P. S. Sastry, "A geometric algorithm for learning
oblique decision trees," in Proc. 3rd Int. Conf. PReMI, New Delhi, India, Dec.
2009, pp. 25-31.
[30] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore,
MD: Johns Hopkins Univ. Press, 1996.
[31] L. F. Chen, H. Y. M. Liao, M. T. Ko, J. C. Lin, and G. J. Yu, "A new
LDA-based face recognition system which can solve the small sample size
problem," Pattern Recognit., vol. 33, no. 10, pp. 1713-1726, Oct. 2000.
[32] N. Manwani and P. S. Sastry, "Geometric decision tree," CoRR,
vol. abs/1009.3604, 2010. [Online]. Available: http://arxiv.org/abs/
1009.3604
[33] D. Newman and A. Asuncion, UCI Machine Learning Repository, School of Inf. Comput. Sci., Univ. California, Irvine, Irvine, CA, 2007. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[34] S. K. Murthy, S. Kasif, and S. Salzberg, The OC1 Decision Tree
Software System, 1993. [Online]. Available: http://www.cbcb.umd.edu/salzberg/announce-oc1.html
[35] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector
Machines, 2001. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm

Naresh Manwani received the B.E. degree in electronics and communication from Rajasthan University, Jaipur, India, in 2003 and the M.Tech. degree in information and communication technology from Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, India, in 2006. He is currently working toward the Ph.D. degree in the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India.
His research interests are machine learning and
pattern recognition.

P. S. Sastry (S'82-M'85-SM'97) received the B.Sc. (Hons.) degree in physics from the Indian Institute of Technology, Kharagpur, India, and the B.E. degree in electrical communications engineering and the Ph.D. degree from the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India.
Since 1986, he has been a faculty member with the Department of Electrical Engineering, Indian Institute of Science, where he is currently a Professor. He has held visiting positions at the University of Massachusetts, Amherst; the University of Michigan, Ann Arbor; and General Motors Research Labs, Warren. His research interests include machine learning, pattern recognition, data mining, and computational neuroscience.
Dr. Sastry is a Fellow of the Indian National Academy of Engineering. He is an Associate Editor for the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART B and the IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING. He was the recipient of the Sir C. V. Raman Young Scientist Award from the Government of Karnataka; the Hari Om Ashram Dr. Vikram Sarabhai Research Award from the Physical Research Laboratory, Ahmedabad, India; and the Most Valued Colleague Award from General Motors Corporation, Detroit, MI.
