Fig. 1. Example to illustrate the proposed decision tree algorithm. (a) Hyperplane learned at the root node using an algorithm (OC1) that relies on the Gini-index impurity measure. (b) Angle bisectors (solid lines) of the clustering hyperplanes (dashed lines) at the root node for this problem, obtained using our method.
²We use $C_+$ and $C_-$ to denote both the sets of examples of the two classes and the labels of the two classes; the meaning will be clear from the context.
and $(w_2, b_2)$ of the two clustering hyperplanes reduces to finding the eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem described by (3). It is easy to see that, if $\tilde{w}_1$ is a solution of problem (3), then $k\tilde{w}_1$ is also a solution for any $k \neq 0$. Here, for our purpose, we choose $k = 1/\|\tilde{w}_1\|$. That is, the clustering hyperplanes [obtained as the eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem (3)] are $\tilde{w}_1 = [\,w_1^T \; b_1\,]^T$ and $\tilde{w}_2 = [\,w_2^T \; b_2\,]^T$,
where the generalized eigenvalue problem (3) is

$$H\tilde{w} = \lambda G\tilde{w} \qquad (3)$$

with

$$G = \frac{1}{n_t^+} \sum_{x_i \in C_+} \tilde{x}_i \tilde{x}_i^T, \qquad H = \frac{1}{n_t^-} \sum_{x_i \in C_-} \tilde{x}_i \tilde{x}_i^T$$

and $\tilde{x}_i = [\,x_i^T \; 1\,]^T$. Equivalently, $G = \frac{1}{n_t^+}[A \;\; e_{n_t^+}]^T[A \;\; e_{n_t^+}]$, where $A$ is the matrix whose rows are the patterns of $C_+$ at node $t$ and $e_{n_t^+}$ is the $n_t^+$-dimensional vector with all elements one; $H$ is built in the same way from the matrix $B$ of $C_-$ patterns. When $G$ has full rank, (3) can be solved with the standard LU-decomposition-based technique: factoring $G = F_1 F_1^T$, the matrix $F_1^{-1} H F_1^{-T}$ is symmetric, so we can find its orthonormal eigenvectors, and each such eigenvector yields a solution $\tilde{w}$ of (3).

The hyperplane that we associate with the current node will be one of the angle bisectors of the two clustering hyperplanes. Because $\tilde{w}_1$ and $\tilde{w}_2$ are normalized, the two angle bisectors are

$$\tilde{w}_3 = \tilde{w}_1 + \tilde{w}_2, \qquad \tilde{w}_4 = \tilde{w}_1 - \tilde{w}_2. \qquad (4)$$

To choose between the two angle bisectors, we use the Gini index of the split that each of them induces. Suppose hyperplane $\tilde{w}_t$ splits the set $S_t$ at node $t$ into $S_{tl}$ and $S_{tr}$. Let $n_{tl}^+$ and $n_{tl}^-$ denote the number of patterns of the two classes in set $S_{tl}$, and let $n_{tr}^+$ and $n_{tr}^-$ denote the number of patterns of the two classes in set $S_{tr}$. Then, the Gini index of hyperplane $\tilde{w}_t$ is given by [1]

$$\mathrm{Gini}(\tilde{w}_t) = \frac{n_{tl}}{n_t}\left[1 - \left(\frac{n_{tl}^+}{n_{tl}}\right)^2 - \left(\frac{n_{tl}^-}{n_{tl}}\right)^2\right] + \frac{n_{tr}}{n_t}\left[1 - \left(\frac{n_{tr}^+}{n_{tr}}\right)^2 - \left(\frac{n_{tr}^-}{n_{tr}}\right)^2\right]. \qquad (5)$$

The angle bisector with the lower Gini index is chosen as the split rule at the node, and it splits $S_t$ into the two subsets $S_{tl}$ and $S_{tr}$.
These two subsets are passed on to the two child nodes. The recursion stops when the set of patterns at a node is such that the fraction of patterns belonging to the minority class of this set is below a user-specified threshold, or when the depth of the tree reaches a prespecified maximum limit.
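To make this node computation concrete, here is a minimal sketch (our illustration, not the authors' MATLAB implementation) of one node of the procedure: it builds G and H from the class patterns, solves the generalized eigenvalue problem (3) with SciPy, forms the angle bisectors (4), and returns the bisector with the lower Gini index (5). It assumes G has full rank; the rank-deficient case discussed later needs the null-space treatment.

```python
import numpy as np
from scipy.linalg import eig

def node_split(X_pos, X_neg):
    """Sketch of one GDT node: clustering hyperplanes, angle bisectors, Gini choice."""
    # Augment patterns with a constant 1 so a hyperplane is w_tilde = [w; b].
    A = np.hstack([X_pos, np.ones((len(X_pos), 1))])
    B = np.hstack([X_neg, np.ones((len(X_neg), 1))])
    G = A.T @ A / len(A)          # second-moment matrix of class C+
    H = B.T @ B / len(B)          # second-moment matrix of class C-

    # Generalized eigenvalue problem H w = lambda G w, eq. (3).
    vals, vecs = eig(H, G)
    order = np.argsort(np.real(vals))
    w2 = np.real(vecs[:, order[0]])    # minimum eigenvalue -> second clustering hyperplane
    w1 = np.real(vecs[:, order[-1]])   # maximum eigenvalue -> first clustering hyperplane
    w1, w2 = w1 / np.linalg.norm(w1), w2 / np.linalg.norm(w2)

    # Angle bisectors, eq. (4).
    w3, w4 = w1 + w2, w1 - w2

    def gini(w):
        # Gini index of the split induced by hyperplane w, eq. (5).
        s_pos = A @ w >= 0
        s_neg = B @ w >= 0
        total = len(A) + len(B)
        score = 0.0
        for side in (True, False):
            n_p, n_n = np.sum(s_pos == side), np.sum(s_neg == side)
            n = n_p + n_n
            if n > 0:
                score += (n / total) * (1.0 - (n_p / n) ** 2 - (n_n / n) ** 2)
        return score

    return w3 if gini(w3) <= gini(w4) else w4
```

For a two-class problem with data matrix X and labels y in {+1, -1}, a call such as node_split(X[y == 1], X[y == -1]) would return the split direction for the root node.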
GrowTreeMulticlass(St)
Input: Set of patterns at node t (St)
Output: Pointer to a subtree
begin
    Divide set St into two parts, i.e., St+ and St−;
In the optimization problem (1), the LU-decomposition-based method is applicable only when matrix $G$ has full rank. When $G$ is rank deficient, let the columns of $Q$ span the null space of $G$ and consider $\tilde{H} = QQ^T H QQ^T$. Note that $\tilde{G} = QQ^T G QQ^T$ would be zero because the columns of $Q$ span the null space of $G$. Now, the eigenvector corresponding to the largest eigenvalue of $\tilde{H}$ gives the required solution in this case.
We now explain how this approach works. The whole analysis is based on the following result.

Theorem 1 [31]: Suppose that $R$ is a set in the $d$-dimensional space and, $\forall x \in R$, $f(x) \ge 0$, $g(x) \ge 0$, and $f(x) + g(x) > 0$. Let $h_1(x) = f(x)/g(x)$ and $h_2(x) = f(x)/(f(x) + g(x))$. Then, $h_1(x)$ has a maximum (including positive infinity) at a point $x_0$ in $R$ if $h_2(x)$ has a maximum at $x_0$.

Using Theorem 1, it is clear that the $\tilde{w}$ which maximizes the ratio $\tilde{w}^T H \tilde{w} / (\tilde{w}^T G \tilde{w} + \tilde{w}^T H \tilde{w})$ will also maximize $\tilde{w}^T H \tilde{w} / \tilde{w}^T G \tilde{w}$.
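As a quick numerical illustration of this equivalence (our own check with randomly generated matrices, not part of the paper), the maximizer of the ratio with the summed denominator coincides with the maximizer of the original ratio, both obtained as top generalized eigenvectors:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d = 5
M1, M2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
G = M1 @ M1.T + 1e-3 * np.eye(d)   # full-rank G for this check
H = M2 @ M2.T

# Maximizer of w'Hw / w'Gw: top generalized eigenvector of (H, G).
_, V = eigh(H, G)
w_ratio = V[:, -1]

# Maximizer of w'Hw / (w'Gw + w'Hw): top generalized eigenvector of (H, G + H).
_, U = eigh(H, G + H)
w_sum = U[:, -1]

# The two maximizers agree up to scale and sign (Theorem 1).
cos = abs(w_ratio @ w_sum) / (np.linalg.norm(w_ratio) * np.linalg.norm(w_sum))
print(f"cosine of angle between the two maximizers: {cos:.6f}")  # ~1.0
```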
if (Tree-Depth = Max-Depth) then
    Get a node tr, and make tr a leaf node;
    Assign the class label associated with the majority class to tr;
    Make tr the right child of t;
else if (fraction of majority-class points in Str > 1 − ε1) then
    Get a node tr, and make tr a leaf node;
    Assign the class label associated with the majority class to tr;
    Make tr the right child of t;
else
    Grow a subtree on Str recursively, and make it the right child of t;
end
return t
end
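A compact sketch of this recursion (ours, not the authors' code) is given below; it reuses the hypothetical node_split helper from the earlier sketch, uses labels in {+1, -1}, and implements the two stopping rules stated above (minority fraction below ε1, or maximum depth reached).

```python
import numpy as np

class Node:
    def __init__(self, w=None, left=None, right=None, label=None):
        self.w, self.left, self.right, self.label = w, left, right, label

def grow_tree(X, y, eps1=0.1, max_depth=10, depth=0):
    """Recursively grow a GDT-style tree on binary labels y in {+1, -1}."""
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    majority = 1 if n_pos >= n_neg else -1
    minority_frac = min(n_pos, n_neg) / len(y)
    # Stopping rules: minority fraction below eps1, or maximum depth reached.
    if minority_frac < eps1 or depth >= max_depth:
        return Node(label=majority)
    w = node_split(X[y == 1], X[y == -1])        # hypothetical helper (see earlier sketch)
    X_aug = np.hstack([X, np.ones((len(X), 1))])
    left = X_aug @ w >= 0
    if left.all() or (~left).all():              # degenerate split: stop here
        return Node(label=majority)
    return Node(w=w,
                left=grow_tree(X[left], y[left], eps1, max_depth, depth + 1),
                right=grow_tree(X[~left], y[~left], eps1, max_depth, depth + 1))

def predict(node, x):
    # Follow hyperplane tests down to a leaf.
    while node.label is None:
        node = node.left if np.append(x, 1.0) @ node.w >= 0 else node.right
    return node.label
```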
C. GDT for Multiclass Classification
At every node $t$, the set $S_t$ is divided into two groups: $S_t^+$ contains the points of one of the classes, whereas $S_t^-$ contains the rest of the points. We then learn the tree in the same way as in the two-class case.
III. ANALYSIS
In this section, we present some analysis of our algorithm.
We consider only the binary classification problem. We prove
some interesting properties of the angle bisector hyperplanes to
indicate why angle bisectors may be a good choice (in a decision
tree) for the split rule at a node.
Let $S$ be a set of $n$ patterns (feature vectors), of which $n_+$ belong to class $C_+$ and $n_-$ belong to class $C_-$. Let $A$ and $B$ be the matrices whose rows are the patterns of $C_+$ and $C_-$, respectively, and let $e_{n_+}$ denote the $n_+$-dimensional vector with all elements one (similarly for $e_{n_-}$). Then $G = \frac{1}{n_+}[A \;\; e_{n_+}]^T[A \;\; e_{n_+}]$ and, similarly, $H = \frac{1}{n_-}[B \;\; e_{n_-}]^T[B \;\; e_{n_-}]$. The first clustering hyperplane solves

$$\max_{(w,b) \neq 0} \; \frac{\frac{1}{n_-}\sum_{x_i \in C_-}(x_i^T w + b)^2}{\frac{1}{n_+}\sum_{x_i \in C_+}(x_i^T w + b)^2} \qquad (6)$$

which is equivalent to

$$\max_{(w,b) \neq 0} \; \frac{1}{n_-}\sum_{x_i \in C_-}(x_i^T w + b)^2 \quad \text{s.t.} \quad \frac{1}{n_+}\sum_{x_i \in C_+}(x_i^T w + b)^2 = 1 \qquad (7)$$

and, in matrix form,

$$= \max_{\tilde{w} \neq 0} \; \frac{\tilde{w}^T H \tilde{w}}{\tilde{w}^T G \tilde{w}}. \qquad (8)$$

One informative special case is when the two classes have a common mean $\bar{x}$ and the hyperplane passes through it, so that $b = -w^T\bar{x}$; the ratio then reduces to $\max_{w \neq 0}\, (w^T \Sigma_- w)/(w^T \Sigma_+ w)$, where $\Sigma_+$ and $\Sigma_-$ denote the covariance matrices of the two classes.
This statistical view helps explain why the angle bisectors are a good choice for the split rule while learning a decision tree. Let $X$ and $Y$ be random variables denoting the feature vectors from classes $C_+$ and $C_-$, respectively. Define new random variables $X_a$, $X_b$, $Y_a$, and $Y_b$ as $X_a = w_a^T X$, $X_b = w_b^T X$, $Y_a = w_a^T Y$, and $Y_b = w_b^T Y$. Then

$$w_a^T \Sigma_- w_b = E\big[(Y_a - E[Y_a])(Y_b - E[Y_b])\big] = \mathrm{cov}(Y_a, Y_b)$$

and, similarly, $w_a^T \Sigma_+ w_a = \mathrm{var}(X_a)$, $w_b^T \Sigma_+ w_b = \mathrm{var}(X_b)$, and $w_a^T \Sigma_+ w_b = \mathrm{cov}(X_a, X_b)$. In this notation, the two clustering directions can be written as

$$w_1 = \arg\max_{w} \; w^T \Sigma_- w \quad \text{s.t.} \quad w^T \Sigma_+ w = \sigma^2 \qquad (9)$$

$$w_2 = \arg\min_{w} \; w^T \Sigma_- w \quad \text{s.t.} \quad w^T \Sigma_+ w = \sigma^2 \qquad (10)$$

and we consider the problem

$$\max_{w_a, w_b} \; w_a^T \Sigma_- w_b \quad \text{s.t.} \quad w_a^T \Sigma_+ w_a = 2\sigma^2 = w_b^T \Sigma_+ w_b, \quad w_a^T \Sigma_+ w_b = 0. \qquad (11)$$
Consider the possible solution to the optimization problem (11) given by $w_a = w_1 + w_2$ and $w_b = w_1 - w_2$. We know that $w_1$ and $w_2$ are feasible solutions to problems (9) and (10), respectively. In addition, because $w_1$ and $w_2$ are eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem $\Sigma_- w = \lambda \Sigma_+ w$, they satisfy $w_1^T \Sigma_+ w_2 = 0$. Thus, we see that the pair of vectors $w_a = w_1 + w_2$ and $w_b = w_1 - w_2$ satisfies all the constraints of the optimization problem (11) and hence is a feasible solution. We can rewrite the objective function of problem (11) as

$$w_a^T \Sigma_- w_b = \frac{1}{4}(w_a + w_b)^T \Sigma_- (w_a + w_b) - \frac{1}{4}(w_a - w_b)^T \Sigma_- (w_a - w_b).$$
Using similar arguments as in case 2, one can show that the angle bisectors are the solution of the following optimization problem:

$$\max_{\tilde{w}_a, \tilde{w}_b} \; \tilde{w}_a^T H \tilde{w}_b \quad \text{s.t.} \quad \tilde{w}_a^T G \tilde{w}_a = 2\sigma^2 = \tilde{w}_b^T G \tilde{w}_b, \quad \tilde{w}_a^T G \tilde{w}_b = 0. \qquad (13)$$
TABLE I. DETAILS OF REAL-WORLD DATA SETS USED FROM UCI ML REPOSITORY

Again, consider $X$ as a random feature vector coming from class $C_+$ and $Y$ as a random feature vector coming from class $C_-$, and define the random variables $X_a$, $X_b$, $Y_a$, and $Y_b$ as before. The constraints in (13) then become $E[X_a^2] = 2\sigma^2 = E[X_b^2]$ and $E[X_a X_b] = 0$. Here, $E[X_a^2]$ and $E[X_b^2]$ are kept constant to ensure that the solutions of the optimization problem are bounded. Once again, we feel that the preceding discussion shows that the angle bisectors are a good choice as the split rule at a node in the decision tree.
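The feasibility argument above is easy to check numerically. In the sketch below (ours, with random matrices standing in for the class covariance matrices), the generalized eigenvectors are rescaled so that $w^T\Sigma_+ w = \sigma^2$, and the pair $w_a = w_1 + w_2$, $w_b = w_1 - w_2$ is then seen to satisfy the constraints of problem (11).

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
d, sigma2 = 4, 1.0
P, N = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Sig_plus = P @ P.T + 1e-3 * np.eye(d)   # stand-in for Sigma_+
Sig_minus = N @ N.T                     # stand-in for Sigma_-

# Generalized eigenvectors of Sig_minus w = lambda Sig_plus w (ascending eigenvalues).
_, V = eigh(Sig_minus, Sig_plus)
w2, w1 = V[:, 0], V[:, -1]              # minimum / maximum eigenvalue directions
# Rescale so that w' Sig_plus w = sigma^2, as in (9) and (10).
w1 *= np.sqrt(sigma2 / (w1 @ Sig_plus @ w1))
w2 *= np.sqrt(sigma2 / (w2 @ Sig_plus @ w2))

wa, wb = w1 + w2, w1 - w2
print(wa @ Sig_plus @ wa)   # ~ 2*sigma^2
print(wb @ Sig_plus @ wb)   # ~ 2*sigma^2
print(wa @ Sig_plus @ wb)   # ~ 0 (Sigma_+-orthogonality of w1 and w2)
```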
IV. EXPERIMENTS

In this section, we present empirical results to show the effectiveness of the proposed approach. We use both synthetic and real-world data sets. Two of the synthetic data sets are as follows.

3) 10-dimensional synthetic data set: Points are labeled using three hyperplanes whose augmented parameter vectors are $\tilde{w}_1 = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]^T$, $\tilde{w}_2 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0]^T$, and a third vector $\tilde{w}_3$. A point $x$ is assigned $y = 1$ if the signs of $\tilde{w}_1^T \tilde{x}$, $\tilde{w}_2^T \tilde{x}$, and $\tilde{w}_3^T \tilde{x}$ match one of four specified sign patterns, and $y = -1$ otherwise.

4) 100-dimensional synthetic data set: 2000 points are sampled uniformly from $[-1, 1]^{100}$. Consider two hyperplanes in $\mathbb{R}^{100}$ whose parameters are given by vectors $\tilde{w}_1 = [\,2e_{50}^T \; e_{50}^T \; 3\,]^T$ and $\tilde{w}_2$, both built from the 50-dimensional all-ones vector $e_{50}$. A point $x$ is assigned $y = 1$ when the signs of $\tilde{w}_1^T \tilde{x}$ and $\tilde{w}_2^T \tilde{x}$ fall in one of two specified patterns (an XOR-like rule based on the two hyperplanes), and $y = -1$ otherwise.

TABLE II. COMPARISON RESULTS BETWEEN GEOMETRIC DECISION TREE AND OTHER DECISION TREE APPROACHES
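For concreteness, data of this form can be generated as in the sketch below (ours). The second parameter vector and the exact sign pattern defining y = 1 are not fully legible in the source, so the w2 used here and the same-side labeling rule are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 2000, 100
X = rng.uniform(-1.0, 1.0, size=(n, d))

e50 = np.ones(50)
w1 = np.concatenate([2 * e50, e50, [3.0]])   # [2*e50^T  e50^T  3]^T from the text
w2 = np.concatenate([e50, 2 * e50, [1.0]])   # assumed placeholder for the second vector

X_aug = np.hstack([X, np.ones((n, 1))])
s1, s2 = X_aug @ w1 >= 0, X_aug @ w2 >= 0
# Assumed labeling rule: +1 when the point lies on the same side of both hyperplanes.
y = np.where(s1 == s2, 1, -1)
print(np.bincount((y + 1) // 2))             # rough class balance
```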
We also use real-world data sets from the UCI ML repository [33]. The ten data sets that we used are described in Table I. The U.S. Congressional Votes data set available in the UCI ML repository has many observations with missing values for some features. For our experiments, we chose only those observations that have no missing values for any feature. We also do not use all the observations in the Magic data set, which has a total of 19,020 samples over both classes. For our experiments, we randomly chose a total of 6000 points, with 3000 from each class.
Experimental Setup: We implemented GDT and LDDT in MATLAB. For OC1 and CART-LC, we used the downloadable package available on the Internet [34]. To learn SVM classifiers, we use the LIBSVM code [35]; LIBSVM-2.84 uses the one-versus-rest approach for multiclass classification. We implemented GEPSVM in MATLAB.

GDT has only one user-defined parameter, ε1, the threshold on the fraction of points used to decide whether a node becomes a leaf node. For all our experiments, we chose ε1 using tenfold cross validation. SVM has two user-defined parameters, namely, the penalty parameter C and the width parameter of the Gaussian kernel. The best values for these parameters are found using fivefold cross validation, and the reported results are with these parameters. Both OC1 and CART use 90% of the total number of points for training and 10% for pruning. OC1 needs two more user-defined parameters, the number of restarts R and the number of random jumps J. For our experiments, we set R = 20 and J = 5, which are the default values suggested in the package. For the cases where we use GEPSVM with the Gaussian kernel, we found the best width parameter using fivefold cross validation.
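A minimal version of this parameter selection (our sketch; grow_tree and predict are the hypothetical helpers from the earlier sketches, and the candidate grid is an assumption) is a plain tenfold cross-validation loop over ε1.

```python
import numpy as np

def choose_eps1(X, y, candidates=(0.05, 0.1, 0.15, 0.2, 0.25, 0.3), n_folds=10, seed=0):
    """Pick eps1 by tenfold cross-validation (sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    best_eps, best_acc = None, -1.0
    for eps in candidates:
        accs = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            tree = grow_tree(X[train], y[train], eps1=eps)          # hypothetical helper
            preds = np.array([predict(tree, x) for x in X[test]])   # hypothetical helper
            accs.append(np.mean(preds == y[test]))
        if np.mean(accs) > best_acc:
            best_eps, best_acc = eps, float(np.mean(accs))
    return best_eps, best_acc
```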
Simulation Results: We now discuss the performance of GDT
in comparison with other approaches on different data sets. The
results provided are based on ten repetitions of tenfold cross
validation. We show the average values and standard deviation
(computed over the ten repetitions).
Table II shows the comparison results of GDT with other
decision tree approaches. In the table, we show the average and
standard deviation of the accuracy, size, and depth of the tree, and
the time taken for each of the algorithms on each of the problems.
We can intuitively take the confidence interval of the estimated
accuracy of an algorithm to be one standard deviation on either
side of the average. Then, we can say that, on a problem, one
algorithm has significantly better accuracy than another if the
confidence interval for the accuracy of the first is completely to
the right of that of the second.
From Table II, we see that the average accuracy of GDT is
better than all the other decision tree algorithms, except for the
Wine, Votes, and Heart data sets, where LDDT has the same or
better average accuracy. In terms of the confidence interval of the average accuracy, the performance of GDT is comparable to that of the best of the other decision tree approaches on the Breast Cancer, Bupa Liver, Magic, Heart, Votes, and Wine data sets. On the remaining eight data sets, the performance of GDT is significantly better than all the other decision tree approaches.
Fig. 2. Comparison of GDT with OC1 on the 4 × 4 checkerboard data. (a) Hyperplanes learned at the root node and its left child using GDT. (b) Hyperplanes learned at the root node and its left child node using the OC1 (oblique) decision tree.
Fig. 3. Sensitivity of the performance of GDT to the parameter ε1. The first column shows how the average cross-validation accuracy changes with ε1, and the second column shows how the average number of leaves changes with ε1.
TABLE III
COMPARISON RESULTS OF GEOMETRIC DECISION TREE
WITH SVM AND GEPSVM
and Votes data sets. GDT performs worse than SVM on the 4 × 4 checkerboard and the Breast Cancer, Vehicle, and Waveforms data sets.
In terms of the time taken to learn the classifier, GDT is faster than SVM in the majority of cases. At every node of the tree, we solve a generalized eigenvalue problem, which takes time on the order of $(d + 1)^3$, where $d$ is the dimension of the feature space. On the other hand, SVM solves a quadratic program whose time complexity is $O(n^k)$, where $k$ is between 2 and 3 and $n$ is the number of points. Thus, in general, when the number of points is large compared to the dimension of the feature space, GDT learns the classifier faster than SVM.
Finally, in Fig. 2, we show the effectiveness of our algorithm in capturing the geometric structure of the classification problem. We show the first two hyperplanes learned by our approach and by OC1 on the 4 × 4 checkerboard data. We see that our approach learns the correct geometric structure of the classification boundary, whereas OC1, which uses the Gini index as its impurity measure, does not capture it.
Although GDT finds the correct decision boundary for the 4 × 4 checkerboard data set, as shown in Fig. 2, its cross-validation accuracy is lower than that of SVM. This may be because the data here are dense, and hence, numerical round-off errors can affect the classification of points near the boundary. On the other hand, if we allow some margin between the data points and the decision boundary (by ensuring that all the sampled points are at least 0.05 away from the decision boundary), then we observed that SVM and GDT both achieve 99.8% cross-validation accuracy.
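Such a margin-separated checkerboard can be sampled as below (our sketch; the 0.05 margin and the rule that cells whose two indices sum to an even number take one class appear in the source's checkerboard construction, while the coordinate range and cell size of 1 are assumptions).

```python
import numpy as np

def checkerboard_with_margin(n=2000, cells=4, margin=0.05, seed=0):
    """Sample a cells x cells checkerboard on [0, cells]^2, keeping points
    at least `margin` away from every grid line (the class boundary)."""
    rng = np.random.default_rng(seed)
    pts, labels = [], []
    while len(pts) < n:
        p = rng.uniform(0.0, cells, size=2)
        # Distance to the nearest grid line in either coordinate.
        if np.min(np.abs(p - np.round(p))) < margin:
            continue
        i, j = np.floor(p).astype(int)
        labels.append(1 if (i + j) % 2 == 0 else -1)   # even index sum -> one class
        pts.append(p)
    return np.array(pts), np.array(labels)

X, y = checkerboard_with_margin()
```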
In the GDT algorithm described in Section II, ε1 is a user-defined parameter. If more than a (1 − ε1) fraction of the points at a node fall into the majority class, then we declare that node a leaf node and assign it the class label of the majority class. As we increase ε1, the chances of a node becoming a leaf node increase. This leads to smaller decision trees, and the learning time also decreases; however, the accuracy may suffer. To understand the robustness of our algorithm with respect to this parameter, we show, in Fig. 3, the variation of the cross-validation accuracy and of the average number of leaves with ε1. The range of values of ε1 is chosen to be 0.05-0.35. We see that the cross-validation accuracy does not change much with ε1; however, with increasing ε1, the average number of leaves decreases. Thus, even though the tree size decreases with ε1, the cross-validation accuracy remains within a small interval. This happens because, for most of the points, the decision is governed by nodes close to the root node. The few remaining examples, which are tough to classify, cause the decision tree to grow further; as ε1 increases, it is only the nodes containing these tough-to-classify points that become leaf nodes. From the results in Fig. 3, we can say that ε1 in the range of 0.1-0.3 would be appropriate for all data sets.
V. CONCLUSION
In this paper, we have presented a new algorithm for learning oblique decision trees. The novelty is in learning hyperplanes that capture the geometric structure of the class regions. At each node, we find the two clustering hyperplanes and use one of their angle bisectors as the split rule.
[23] Z. Fu, B. L. Golden, S. Lele, S. Raghavan, and E. A. Wasil, "A genetic algorithm-based approach for building accurate decision trees,"
INFORMS J. Comput., vol. 15, no. 1, pp. 3-22, Jan. 2003.
[24] E. Cantú-Paz and C. Kamath, "Inducing oblique decision trees with evolutionary algorithms," IEEE Trans. Evol. Comput., vol. 7, no. 1, pp. 54-68, Feb.
2003.
[25] J. M. Pangilinan and G. K. Janssens, "Pareto-optimality of oblique decision trees from evolutionary algorithms," J. Global Optim., pp. 1-11, Oct.
2010.
[26] S. Shah and P. S. Sastry, "New algorithms for learning and pruning oblique
decision trees," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 29, no. 4,
pp. 494-505, Nov. 1999.
[27] P. S. Sastry, M. Magesh, and K. P. Unnikrishnan, "Two timescale analysis
of the alopex algorithm for optimization," Neural Comput., vol. 14, no. 11, pp.
2729-2750, Nov. 2002.
[28] O. L. Mangasarian and E. W. Wild, "Multisurface proximal support vector
machine classification via generalized eigenvalues," IEEE Trans. Pattern Anal.
Mach. Intell., vol. 28, no. 1, pp. 69-74, Jan. 2006.
[29] N. Manwani and P. S. Sastry, "A geometric algorithm for learning
oblique decision trees," in Proc. 3rd Int. Conf. PReMI, New Delhi, India, Dec.
2009, pp. 25-31.
[30] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore,
MD: Johns Hopkins Univ. Press, 1996.
[31] L. F. Chen, H. Y. M. Liao, M. T. Ko, J. C. Lin, and G. J. Yu, "A new
LDA-based face recognition system which can solve the small sample size
problem," Pattern Recognit., vol. 33, no. 10, pp. 1713-1726, Oct. 2000.
[32] N. Manwani and P. S. Sastry, "Geometric decision tree," CoRR,
vol. abs/1009.3604, 2010. [Online]. Available: http://arxiv.org/abs/
1009.3604
[33] A. Asuncion and D. Newman, UCI Machine Learning Repository, School Inf. Comput. Sci., Univ. California, Irvine, Irvine, CA, 2007. [Online]. Available:
http://www.ics.uci.edu/~mlearn/MLRepository.html
[34] S. K. Murthy, S. Kasif, and S. Salzberg, The OC1 Decision Tree
Software System, 1993. [Online]. Available: http://www.cbcb.umd.edu/
salzberg/announce-oc1.html
[35] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector
Machines, 2001. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/
libsvm