
Cluster Oriented Ensemble Classifier: Impact of Multi-cluster

Characterisation on Ensemble Classifier Learning

B. Verma and A. Rahman

Centre of Intelligent and Networked Systems

School of Computing Sciences, CQUniversity

Rockhampton, Queensland 4702, Australia

Email: b.verma@cqu.edu.au, a.rahman@cqu.edu.au

Abstract

This paper presents a novel cluster oriented ensemble classifier. The proposed ensemble

classifier is based on original concepts such as learning of cluster boundaries by the base

classifiers and mapping of cluster confidences to class decision using a fusion classifier. The

class labelled data set is characterised into multiple clusters and fed to a number of distinctive

base classifiers. The base classifiers learn cluster boundaries and produce cluster confidence

vectors. A second level fusion classifier combines the cluster confidences and maps to class

decisions. The proposed ensemble classifier modifies the learning domain for the base

classifiers and facilitates efficient learning. The proposed approach is evaluated on

benchmark data sets from UCI machine learning repository to identify the impact of multi-

cluster boundaries on classifier learning and classification accuracy. The experimental results

and two–tailed sign test demonstrate the superiority of the proposed cluster oriented ensemble

classifier over existing ensemble classifiers published in the literature.

Keywords: Ensemble classifier, clustering, classification, fusion of classifiers




1. Introduction

An ensemble classifier is conventionally constructed from a set of base classifiers that

separately learn the class boundaries over the patterns in a training set. The decision of an

ensemble classifier on a test pattern is produced by fusing the individual decisions of the base

classifiers. Ensemble classifiers are also known as multiple classifier systems, committee of

classifiers and mixture of experts [1]. An ensemble classifier produces more accurate

classification than its individual counterparts provided the base classifier errors are

uncorrelated [3].

Contemporary ensemble generation techniques train the base classifiers on different subsets

of the training data in order to make their errors uncorrelated. The different algorithms

including bagging [4] and boosting [7] vary in terms of generating the training subsets for

base classifier training. The decisions of the base classifiers are fused into a single decision

by using either majority voting on discrete decisions [1] or algebraic combiners [15] on

continuous valued confidence measures. Although the contemporary ensemble classifiers

(detailed in Section 2) are capable of making the base classifier errors uncorrelated they fail

to establish any mechanism to improve the learning domain of the individual base classifiers.

To clarify this concern let us consider a real world data set with overlapping patterns from

different classes. The learning of class boundaries between overlapping class patterns in such

cases is a difficult problem. Excessive training of the base classifiers will lead to accurate learning of the decision boundary but will result in overfitting and thus misclassification of test instances. On the other hand, learning generalized boundaries will avoid overfitting but at the cost of always misclassifying some overlapping patterns. This problem of learning the class boundaries of overlapping patterns remains inherent in all the base classifiers and is


propagated to the decision fusion stage as well even though the base classifier errors are

uncorrelated.

We opt to bring in clustering at this point. Clustering is the process of partitioning a data set

into multiple groups where each group contains data points that are very close in Euclidean

space. The clusters have well defined and easy to learn boundaries. Let’s assume that the

patterns are labelled with their cluster number. Now if the base classifiers are trained on the

modified data set they will learn the cluster boundaries. As the clusters have well defined

easy to learn boundaries the base classifiers can learn them with high accuracy. Clusters can

contain overlapping patterns from multiple classes. A fusion classifier can be trained to

predict the class of a pattern from the predicted cluster. The proposed cluster oriented

ensemble classifier is based on the above philosophy.

With the aim to achieve better learning and improved accuracy of the ensemble classifier, in

this paper we propose an ensemble classifier approach that clusters classified data into

multiple clusters, learns the decision boundaries between the clusters using a set of base

classifiers, and combines the cluster decisions produced by the base classifiers into a class decision using a fusion classifier. Learning cluster boundaries leads to superior performance of the base classifiers. The fusion classifier maps the clustering pattern produced by the base classifiers into a class decision. Altogether, the ensemble of base and fusion classifiers aims at better learning, leading to higher classification accuracy, as evidenced by the experimental

results.

While achieving the above mentioned aim, the research presented in this paper seeks to answer four major research questions. The first research question is to

investigate the performance of different clustering approaches namely heterogeneous

clustering (i.e. clustering all the patterns from different classes) and homogeneous clustering


(i.e. clustering patterns within a class). The second research question is to investigate whether

the ensemble classifier outperforms the base classifiers significantly. The third research

question is to find out the impact of the fusion classifier. The final research question is to find the

standing of the proposed ensemble classifier with respect to other ensemble classifiers on

benchmark data sets.

This paper is organized as follows. Section 2 presents the literature review. The proposed

ensemble classifier is discussed in Section 3 and the methodology is presented in Section 4.

Section 5 describes the experimental setup used for evaluating the proposed approach.

Section 6 presents the results and comparative analysis. Finally, Section 7 concludes the

paper.

2. Literature Review

The major concentration of ensemble classifier research [1]–[2] is on (i) generation of base

classifiers for achieving diversity among them, and (ii) methods for fusing the decision of the

base classifiers. Two classifiers are diverse if they make different errors on different instances.

The ultimate objective of diversity is to make the base classifiers as unique as possible with

respect to misclassified instances. We present a review of the contemporary ensemble

classifiers related to the proposed approach in this section.

Bagging [4][6] is a sampling based ensemble classifier generation approach that was

introduced by Breiman. Bagging generates the multiple base classifiers by training them on

data subsets randomly drawn (with replacement) from the entire training set. The decisions of

the base classifiers are combined into the final decision by majority voting. The sampling

procedure of bagging creates the various training subsets by bootstrap sampling which results

in the diversity among the base classifiers. Bagging is suitable for small data sets. For large

data sets however the sampling scheme based on the bootstrap with replicates of the training


set is infeasible. Moreover, the randomness introduced by the sampling process in bagging

cannot guarantee the performance of the overall ensemble classifier. A number of variations

to bagging are observed in the literature to improve its performance and the list includes

random forests [5], ordered aggregation [11], adaptive generation and aggregation approach

[14], and fuzzy bagging [13].

Schapire proposed a method called boosting [7][8] that creates data subsets for base classifier training by re-sampling the training data such that the most informative training instances are provided to each consecutive classifier. In boosting, each of the training instances is

assigned a weight that determines how well the instance was classified in the previous

iteration. The instances of the training data that are badly classified (i.e. instances with higher weights) are included in the training set for the next iteration. This way boosting pays more

attention to instances that are hard to classify. Although boosting identifies difficult to

classify instances it does not provide any mechanism to improve the learning of base

classifiers on these instances. The problem of base classifier learning that is raised by

overlapping patterns still remains (as mentioned in the previous section), and leads to poor

base classifier performance. A number of variants of boosting can be observed in the

literature including boosting recombined weak classifiers [12], weighted instance selection

[10], Learn++ [20] and its variant Learn++.NC [21].

Random subspace [9] is an ensemble creation method that uses feature subsets to create the

different data subsets to train the base classifiers. Maclin and Shavlik proposed a neural

ensemble [22] where a number of new approaches are presented to initialise the network

weights in order to achieve diversity and generalization. Pujol and Masip presented a binary

discriminative learning technique [23] based on the approximation of the non-linear decision

boundary by a piece-wise linear smooth additive model. Chaudhuri et al. presented a hybrid

ensemble model [24] that combines the strengths of parametric and non–parametric

classifiers. In recent times there have been some works on cluster ensembles, which aim to obtain an improved clustering of the data set by combining multiple partitionings of the data set [25]. Note that the focus of an ensemble classifier, namely improved classification accuracy, is significantly different from that of cluster ensembles, which aim to achieve improved clustering accuracy.

The other key aspect of an ensemble classifier is the fusion of base classifier outputs into class

decisions. The mapping can be done on discrete class decisions or continuous class

confidence values produced by the base classifiers. The commonly used fusion methods [1]

for combining class labels are majority voting, weighted majority voting, behaviour

knowledge space, and Borda count. The commonly used fusion methods for combining

continuous outputs are algebraic combiners [15] including mean rule, weighted average,

trimmed mean, min/max/median rule, product rule, and generalized mean. A number of other

fusion rules include decision template [16], pair-wise fusion matrix [17], adaptive fusion

method [18], and non–Bayesian probabilistic fusion [19]. Note that all these approaches are

designed to fuse the class decisions from the base classifiers into a single class decision.
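To make the two families of fusion rules concrete, the following minimal sketch (our own illustration in Python, not code from the cited works) shows majority voting over discrete class labels and the mean rule over continuous class confidences.

```python
import numpy as np

def majority_vote(labels):
    """Fuse discrete class labels from the base classifiers: the most frequent label wins."""
    return np.bincount(labels).argmax()

def mean_rule(confidences):
    """Fuse continuous class confidences (one row per base classifier) by averaging."""
    return np.mean(confidences, axis=0)

print(majority_vote(np.array([1, 0, 1])))                          # -> 1
print(mean_rule(np.array([[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]])))   # -> [0.667 0.333]
```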

Summarizing, the contemporary ensemble classifier generation methods are able to produce diversity among the base classifiers by making their errors uncorrelated. They however do not provide any mechanism to improve the learning process of the individual base classifiers on

difficult to classify overlapping patterns. The proposed ensemble classifier aims to address

this issue by creating multiple boundaries through data clustering, training the base classifiers

on easy to learn cluster boundaries and handling the cluster to class mapping process by a

fusion classifier. The overall philosophy of the proposed approach is presented in the

following section.


3. The Proposed Ensemble Classifier

3.1 Motivation

The decision boundaries in real world data sets are not simple. This is primarily because of

overlapping patterns from different classes in the data set. As a result the learning of decision

boundaries in such data sets leads to either overfitting or poor generalization. In both cases it

causes classification errors. The situation is explained in Figure 1. The data set in Figure 1(a)

contains overlapping patterns from two classes. Accurate learning from the training data by a

generic classifier will result in class boundaries in Figure 1(b) leading to overfitting and thus

misclassification of test data. An alternate solution to the problem can be achieved by

reducing penalties for misclassification during training. In this case the generic classifier will

learn simple decision boundaries (Figure 1(c)) but will cause misclassification of training as

well as test data.

This is the point where we would like to introduce multiple decision boundaries for each

class through clustering. Clustering is the process of grouping similar patterns. Clustering the

data set in Figure 1(a) with overlapping patterns will result in smaller groups of patterns as in

Figure 1(d). Note that the cluster boundaries (Figure 1(e)) are simple and easy to learn. A

generic classifier, if trained on this data, now learns simple cluster boundaries that cause neither overfitting nor extreme generalization. Cluster to class mapping can be done by a fusion classifier. The underlying theoretical model and the methodology of the proposed ensemble classifier are based on the above observation and are presented below.



Figure 1: Impact of clustering on an example data set consisting of two classes. (a) The
original data set with overlapping patterns, (b) Overfitting caused by accurate learning of the
decision boundaries, (c) Generalized decision boundary with overlapping patterns of class
two considered as part of class one, (d) Clustered data set, and (e) Decision boundaries
learned on clustered data set.

3.2 Ensemble Classifier Model

Let the ensemble classifier be composed of a set of $N_{bc}$ base classifiers $\psi_1, \psi_2, \ldots, \psi_{N_{bc}}$ and a fusion classifier $\varphi$. Given a pattern $\mathbf{x}$, the ensemble classifier $f$ can be defined to achieve the following mapping:

$f(\mathbf{x}) = [t_1, \ldots, t_{N_c}]$   (1)

where $t_1, \ldots, t_{N_c}$ are class confidence values for the $N_c$ classes. The base and fusion classifiers combine to achieve the above mapping.

Assuming that the data set is partitioned into $K$ clusters, each pattern belongs to a cluster. The base classifier $\psi_i$ is set to map the input pattern $\mathbf{x}$ to a set of cluster confidence measures $w_{i1}, \ldots, w_{iK}$ as

$\psi_i(\mathbf{x}) = [w_{i1}, \ldots, w_{iK}]$.   (2)


The training set $\Gamma_b$ of a base classifier $\psi$ is made of pairs $(\mathbf{x}, [w_1, \ldots, w_K])$ where $\mathbf{x}$ represents the input and $[w_1, \ldots, w_K]$ represents the target. Given that $\mathbf{x}$ belongs to cluster $k$, the target cluster confidence vector is set as

$w_j = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{otherwise} \end{cases}$   (3)

The base classifier parameters $\theta_\psi$ are tuned such that

$\theta_\psi = \arg\min_{\hat{\theta}_\psi} \sum_{\forall (\mathbf{x}, [w_1, \ldots, w_K]) \in \Gamma_b} \varepsilon_b \big( \psi_{\hat{\theta}_\psi}(\mathbf{x}), [w_1, \ldots, w_K] \big)$   (4)

where $\varepsilon_b$ is the error function. Let $\psi_{\hat{\theta}_\psi}(\mathbf{x}) = [\gamma_1, \ldots, \gamma_K]$. The error function $\varepsilon_b$ for the base classifier is defined as

$\varepsilon_b = \sum_{k=1}^{K} |w_k - \gamma_k|$.   (5)

Given the cluster confidence vectors produced by the base classifiers, the fusion classifier $\varphi$ performs the following mapping

$\varphi \big( [w_{11}, \ldots, w_{1K}], \ldots, [w_{N_{bc}1}, \ldots, w_{N_{bc}K}] \big) = [t_1, \ldots, t_{N_c}]$   (6)

where $w_{i1}, \ldots, w_{iK}$ are the cluster confidence measures produced by base classifier $\psi_i$ and $t_1, \ldots, t_{N_c}$ are class confidence values. The training set for the fusion classifier $\Gamma_f$ is composed of pairs $(\mathbf{w}, [t_1, \ldots, t_{N_c}])$ where $\mathbf{w}$ is the merged cluster confidence vector and $[t_1, \ldots, t_{N_c}]$ is the target class confidence vector. A cluster can contain patterns from multiple classes, and in that case a unique mapping is not possible by the fusion classifier. Depending on the number of classes $N_c$, each class deserves a share of the cluster. There are thus a total of $N_c$ outputs/targets of the fusion classifier, each representing a class, and each target receives a weight during training according to the proportion of its patterns in the cluster. Let the cluster confidence vectors produced by the base classifiers in (6) correspond to cluster $k$ that contains $n_j$ patterns of class $j$ where $1 \leq j \leq N_c$. The target class confidence for the $j$-th class is set as

$t_j = \dfrac{n_j}{\sum_{j=1}^{N_c} n_j}$.   (7)

For example, a cluster containing six patterns of class 1 and four patterns of class 2 in a two class problem yields the target class confidence vector $[0.6, 0.4]$.

The parameters $\theta_\varphi$ for the fusion classifier $\varphi$ are optimized such that

$\theta_\varphi = \arg\min_{\hat{\theta}_\varphi} \sum_{\forall (\mathbf{w}, [t_1, \ldots, t_{N_c}]) \in \Gamma_f} \varepsilon_f \big( \varphi_{\hat{\theta}_\varphi}(\mathbf{w}), [t_1, \ldots, t_{N_c}] \big)$   (8)

where $\varepsilon_f$ is the error function. Assuming $\varphi_{\hat{\theta}_\varphi}(\mathbf{w}) = [\eta_1, \ldots, \eta_{N_c}]$, the error function $\varepsilon_f$ is defined as:

$\varepsilon_f = \sum_{j=1}^{N_c} |t_j - \eta_j|$.   (9)

Using (2) and (6), the ensemble classifier mapping in (1) can be enumerated as:

$f(\mathbf{x}) = \varphi \big( [\psi_1(\mathbf{x}) \circ \cdots \circ \psi_{N_{bc}}(\mathbf{x})] \big) = \varphi \big( [w_{11}, \ldots, w_{1K}] \circ \cdots \circ [w_{N_{bc}1}, \ldots, w_{N_{bc}K}] \big) = [t_1, \ldots, t_{N_c}]$   (10)

where $\circ$ denotes the concatenation of the cluster confidence vectors produced by the base classifiers.

The proposed ensemble classifier is based on the above model and the corresponding architecture is presented in Figure 2.

Figure 2: Architecture of the proposed ensemble classifier. The input pattern $\mathbf{x}$ is presented to the base classifiers $\psi_1, \psi_2, \ldots, \psi_{N_{bc}}$; their cluster confidence vectors $[w_{11}, \ldots, w_{1K}], \ldots, [w_{N_{bc}1}, \ldots, w_{N_{bc}K}]$ are merged and mapped by the fusion classifier to the class confidence vector $[t_1, \ldots, t_{N_c}]$.

The objective of the proposed Cluster Oriented Ensemble Classifier (COEC) is to improve


the learning process as well as the overall prediction accuracy by partitioning the data set,

learning cluster boundaries by the base classifiers and mapping base classifiers’ output to

class confidence vector using a fusion classifier. The novelty of the proposed method lies in:

(i) Partitioning classified data into multiple clusters for achieving better separation.

(ii) Use of base classifiers in an ensemble to learn cluster boundaries.

(iii) Fusion of cluster confidence values produced by the base classifiers into class

confidence values by a fusion classifier.

3.3 Clustering in COEC

The learning of the base and fusion classifiers in COEC depends on multiple class boundaries

produced by clustering. The outcome of the clustering algorithm depends on the similarity measure $\Delta$ between the patterns, and we have used the Euclidean distance that computes the geometric distance between two patterns $\mathbf{x}_i = \langle x_{i1}, x_{i2}, \ldots, x_{in} \rangle$ and $\mathbf{x}_j = \langle x_{j1}, x_{j2}, \ldots, x_{jn} \rangle$ in $n$-dimensional hyperspace. We performed two types of clustering

in COEC:

(i) Heterogeneous clustering to partition all the patterns in the training set independent of any

knowledge of the class of the patterns.

(ii) Homogeneous clustering for partitioning the patterns belonging to a single class only.

Patterns belonging to each class are partitioned separately.

The characteristics and outcomes of the two types of clustering are significantly different and influence the accuracy of COEC, as evidenced by the experimental results.

Assuming a set of $K$ clusters $\{\Omega_1, \Omega_2, \ldots, \Omega_K\}$ and the associated cluster centres $\{\kappa_1, \kappa_2, \ldots, \kappa_K\}$, the clustering algorithm aims to minimize an objective function

$J = \sum_{k=1}^{K} \sum_{\forall \mathbf{x}_i \in \Omega_k} \Delta(\mathbf{x}_i, \kappa_k)$   (11)

for the patterns in the corresponding training set. Considering an augmented training set $\Gamma'$ defined as $\Gamma' = \{(\mathbf{x}_1, l_1), (\mathbf{x}_2, l_2), \ldots, (\mathbf{x}_{|\Gamma'|}, l_{|\Gamma'|})\}$ where $l_i \in \{\Omega_1, \Omega_2, \ldots, \Omega_K\}$, a generic classifier $\psi_i$ learns the decision boundaries between the clusters and produces the cluster confidence vector $w_{i1}, \ldots, w_{iK}$. The fusion classifier maps the cluster confidence vector to the class confidence vector $[t_1, \ldots, t_{N_c}]$.

The performance of the fusion classifier depends on the content of the cluster. If all the

patterns in the cluster belong to the same class the mapping is unique. We refer to these

clusters as atomic clusters. Non–atomic clusters are composed of patterns from different

classes. The target vector of the fusion classifier for these clusters is set according to the

proportion of patterns from different classes during training as mentioned in (7).
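As a small worked illustration of the target vectors in (3) and (7), the sketch below (our own NumPy illustration; the function and array names are not from the paper) builds the one-hot cluster target used by a base classifier and the proportional class target of a non-atomic cluster.

```python
import numpy as np

def base_target(cluster_id, n_clusters):
    """Eq. (3): one-hot cluster confidence target [w_1, ..., w_K]."""
    w = np.zeros(n_clusters)
    w[cluster_id] = 1.0
    return w

def fusion_target(class_labels_in_cluster, n_classes):
    """Eq. (7): class shares t_j = n_j / sum_j n_j for a single cluster."""
    counts = np.bincount(class_labels_in_cluster, minlength=n_classes)
    return counts / counts.sum()

# A non-atomic cluster holding 6 patterns of class 0 and 4 patterns of class 1
labels = np.array([0] * 6 + [1] * 4)
print(base_target(2, 5))          # [0. 0. 1. 0. 0.]
print(fusion_target(labels, 2))   # [0.6 0.4]
```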

4. Learning and Prediction Methodology of COEC

The overall learning and prediction methodology of COEC is presented in Figure 3 and

Figure 4. The learning process is depicted in Figure 3 where the training data is first clustered

and the base classifiers then learn the mapping from patterns to clusters. The cluster

confidence values produced by the different base classifiers are then merged to form the

inputs for the fusion classifier and the targets are set to the original class values for learning

the cluster to class map. During prediction (Figure 4), the base classifiers produce cluster

confidence vectors for a test pattern. These vectors are merged to form the input for the

fusion classifier that produces the class confidence vector.

The different steps of learning and prediction of the ensemble of classifiers are detailed in the

following sections.


Figure 3: Training process for COEC

Figure 4: Test process for COEC


4.1 Homogeneous/Heterogeneous Clustering

The learning process starts by partitioning the training data into multiple clusters. Given the

training data set $[x_{ij}] = [d_{ij}] \circ [\mathit{class}_i]$ where $1 \leq i \leq N_{examples}$ and $1 \leq j \leq N_{features}$, the purpose of the clustering algorithm is to partition the training data set into a number of $N_{cluster}$ clusters. The output of the clustering algorithm is the modified data set $[y_{ij}] = [d_{ij}] \circ [\mathit{cluster}_i]$. Given the training data set, the clustering algorithm is presented in Figure 5. At the completion of clustering, each row of $[d_{ij}]$ is augmented with its cluster id, producing $[y_{ij}] = [d_{ij}] \circ [\mathit{cluster}_i]$.

The output of the clustering algorithm depends on the input argument type. We have used two

types of clustering in COEC – (i) Homogeneous clustering: Clustering is performed

separately on the patterns belonging to the same class, and (ii) Heterogeneous clustering:

Clustering is performed on the entire data set. We have reported our findings on both of these

clustering approaches in Section 6.

Figure 5: Homogeneous/Heterogeneous Clustering algorithm for partitioning classified data


into multiple clusters.
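As an illustration of the partitioning step of Figure 5, the sketch below uses scikit-learn's k-means as a stand-in for the clustering algorithm; the function name and the mode argument are our own, not the authors'. In homogeneous mode the cluster ids of different classes are offset so that every cluster id remains unique across the data set.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_training_data(d, classes, n_clusters, mode="heterogeneous", seed=0):
    """Return a cluster id for every training pattern.

    mode="heterogeneous": one k-means run over the whole data set.
    mode="homogeneous":   a separate k-means run per class, with the cluster ids
                          of different classes offset so they remain distinct.
    """
    if mode == "heterogeneous":
        return KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(d)

    cluster_ids = np.empty(len(d), dtype=int)
    offset = 0
    for c in np.unique(classes):
        idx = np.where(classes == c)[0]
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(d[idx])
        cluster_ids[idx] = labels + offset   # keep ids unique across classes
        offset += n_clusters
    return cluster_ids
```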


4.2 Base Classifier Training

A set of $N_{bc}$ base classifiers are trained with $[y_{ij}] = [d_{ij}] \circ [\mathit{cluster}_i]$ as produced by the clustering algorithm. The input to each base classifier is set to $[d_{ij}]$. The target for each base classifier is set to $[t_{ik}]$ such that

$t_{ik} = \begin{cases} 1 & \text{if } \mathit{cluster}_i = k \\ 0 & \text{otherwise} \end{cases}$   (12)

where $1 \leq k \leq N_{clusters}$. The aim of training the base classifiers with the target cluster matrix is that during prediction the base classifiers produce cluster confidence values for a pattern. The training parameters for each base classifier are optimized to fit the training data. The training algorithm for a generic classifier is presented in Figure 6. At the completion of training, for each base classifier $b$ a model $\hat{\theta}^b$ is obtained where $1 \leq b \leq N_{bc}$, and $[d_{ij}]$ is presented to each of the base classifiers, producing a set of $N_{bc}$ cluster confidence matrices $\{[w^b_{ik}]\}$ for the training patterns where $1 \leq b \leq N_{bc}$ and $1 \leq k \leq N_{clusters}$.

Figure 6: Base classifier training algorithm.
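The base classifier step of Figure 6 can be sketched as below, with scikit-learn's k-NN, neural network and SVM models standing in for the base classifiers used later in the paper; their predict_proba outputs play the role of the cluster confidence vectors of equation (2). This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def train_base_classifiers(d, cluster_ids):
    """Train each base classifier to predict the cluster id of a pattern.

    With integer cluster ids 0..K-1 all present in training, predict_proba
    then yields the cluster confidence vector [w_1, ..., w_K] of eq. (2).
    """
    bases = [
        KNeighborsClassifier(n_neighbors=5),
        MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
        SVC(kernel="rbf", probability=True, random_state=0),
    ]
    for clf in bases:
        clf.fit(d, cluster_ids)
    return bases

def cluster_confidences(bases, d):
    """Merge the N_bc cluster confidence vectors of every pattern into one long row."""
    return np.hstack([clf.predict_proba(d) for clf in bases])
```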


4.3 Fusion Classifier Training

The confidence matrices $\{[w^b_{ik}]\}$ produced by the base classifiers are combined to form the input to the fusion classifier $\varphi$, where $1 \leq i \leq N_{examples}$ and $1 \leq k \leq N_{clusters}$. The target matrix for $\varphi$ is composed of class confidence vectors that are set according to the proportion of class instances within the cluster. The parameters of the fusion classifier $\varphi$ are optimized to fit the above input-output pattern produced by the training examples. At the completion of training, a model $\hat{\theta}^\varphi$ for the fusion classifier is obtained. The training algorithm for the fusion classifier is presented in Figure 7.

Figure 7: Fusion classifier training algorithm.
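A corresponding sketch of the fusion step of Figure 7 is given below; a small multi-output MLPRegressor is our stand-in for the neural network fusion classifier, the proportional targets follow equation (7), and the function and variable names are ours. Integer class labels 0..N_c-1 are assumed.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fusion_targets(cluster_ids, classes, n_classes):
    """Per-pattern target = class proportions of the pattern's cluster, as in eq. (7)."""
    targets = np.zeros((len(classes), n_classes))
    for k in np.unique(cluster_ids):
        members = cluster_ids == k
        counts = np.bincount(classes[members], minlength=n_classes)
        targets[members] = counts / counts.sum()
    return targets

def train_fusion_classifier(merged_confidences, cluster_ids, classes, n_classes):
    """Fit the fusion classifier on merged cluster confidences vs. class-share targets."""
    fusion = MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000, random_state=0)
    fusion.fit(merged_confidences, fusion_targets(cluster_ids, classes, n_classes))
    return fusion
```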

4.4 Prediction

The test pattern $\mathbf{e} = \langle e_1, \ldots, e_{N_{features}} \rangle$ is presented to each of the base classifiers. Each base classifier $b$ produces $N_{clusters}$ different confidence values $\langle w^b_1, \ldots, w^b_{N_{clusters}} \rangle$ that indicate the possibility of the pattern belonging to the different clusters. The cluster confidence vectors produced by the different base classifiers are combined to produce $\langle w^1_1, \ldots, w^1_{N_{clusters}}, \ldots, w^{N_{bc}}_1, \ldots, w^{N_{bc}}_{N_{clusters}} \rangle$, which forms the input to the fusion classifier. At the output, the fusion classifier produces the class confidence values $\langle \eta_1, \ldots, \eta_{N_{class}} \rangle$ that indicate the possibility of the example belonging to the different classes. The ensemble classifier prediction algorithm is presented in Figure 8.


Figure 8: Ensemble classifier prediction algorithm.
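Chaining the illustrative helpers sketched above, prediction for test patterns (Figure 8) reduces to merging the base classifiers' cluster confidences and taking the class with the highest fused confidence.

```python
import numpy as np

def coec_predict(bases, fusion, d_test):
    """Merge the base classifiers' cluster confidences and map them to classes."""
    merged = np.hstack([clf.predict_proba(d_test) for clf in bases])
    class_conf = fusion.predict(merged)        # [eta_1, ..., eta_Nc] per pattern
    return np.argmax(class_conf, axis=1), class_conf
```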

5. Experimental Setup

We have conducted a number of experiments on benchmark data sets from UCI machine

learning repository [27] to verify the strength of COEC and investigate the research questions

mentioned in Section 1. We have used the same data sets as used in recently published

research [10]–[12][17] so that the results can be easily compared. A summary of the data sets

is presented in Table 1. The Wine data set has well defined training and test sets, so, as directed by the description of the data set [27], we have used them as they are. We have used 10–fold cross validation for reporting the classification results for all the other data sets.

Table 1: Data sets used in the experiments.


Dataset                      # instances   # attributes   # classes
Breast Cancer (Wisconsin)            699             10           2
Sonar                                208             60           2
Iris                                 150              4           3
Ionosphere                           351             34           2
Thyroid (New)                        215              5           3
Vehicle                              946             18           4
Liver                                345              7           2
Diabetes                             768              8           2
Wine                                 178             13           3
Satellite                           6435             36           6
Segment                             2310             19           7

We used the k–means clustering algorithm [26] for partitioning the data sets. Two types of

clustering were performed: (i) heterogeneous clustering: conventional clustering of the entire

data set into k clusters where a cluster can contain examples of more than one class. The


target for the fusion classifier is set as per the proportions of the class examples within each

cluster; (ii) homogeneous clustering: examples of a single class are partitioned into k clusters.

The target of the fusion classifier is set to the class for which the clustering is performed. We

have reported the impact of both types of clustering on ensemble classifier accuracy and analysed their relative merits.

We have investigated the proposed ensemble classifier by incorporating three well known and distinct classifiers, namely k–Nearest Neighbour (k–NN), Neural Network (NN), and Support Vector Machine (SVM), as the base classifiers. A Neural Network is used as the

fusion classifier. The neural networks for small data sets are trained using a single hidden

layer and tan sigmoid activation functions for the neurons. The Levenberg–Marquardt

backpropagation method is used for learning of the weights in these cases. Larger data sets

are however learned with log sigmoid activation function and gradient descent training

function. We have used the radial basis kernel for SVM and the libsvm library [28] in all the

experiments. The different parameters for the classifiers (e.g. k in k–NN classifier, sigma in

RBF kernel of SVM, and epochs, RMS error goal, learning rate in neural network) were

hand tuned for different data sets. The classification accuracies of bagging, boosting and

random subspace on the data sets in Table 1 are obtained from [17] and WEKA [31]. All the

experiments were conducted on MATLAB 7.5.0.
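For concreteness, the evaluation protocol can be sketched as below by wiring together the illustrative helpers from Section 4 on the Iris data; this is not the authors' MATLAB code, and the number of clusters and model parameters are placeholders that would be hand tuned per data set as described above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Illustrative COEC evaluation on one UCI data set (Iris), reusing the helpers
# sketched in Section 4: cluster_training_data, train_base_classifiers,
# cluster_confidences, train_fusion_classifier and coec_predict.
X, y = load_iris(return_X_y=True)
n_classes = len(np.unique(y))

accuracies = []
for train, test in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    scaler = StandardScaler().fit(X[train])
    Xtr, Xte = scaler.transform(X[train]), scaler.transform(X[test])
    cl = cluster_training_data(Xtr, y[train], n_clusters=3, mode="homogeneous")
    bases = train_base_classifiers(Xtr, cl)
    fusion = train_fusion_classifier(cluster_confidences(bases, Xtr), cl,
                                     y[train], n_classes)
    predictions, _ = coec_predict(bases, fusion, Xte)
    accuracies.append(np.mean(predictions == y[test]))

print(f"COEC 10-fold accuracy: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
```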

6. Results and Discussion

6.1 Heterogeneous and Homogeneous Clustering

6.1.1 Heterogeneous clustering

Given a set of training examples the heterogeneous clustering partitions the entire data set. In

a data set where examples of different classes are well separated in Euclidean space,


heterogeneous clustering will produce partitions each containing examples from one class

only. We use the term atomic cluster to refer to a partition containing examples from a single

class. Most of the real world data sets however contain overlapping examples from different

classes. It is thus likely to observe mostly non–atomic clusters (clusters containing examples

from multiple classes) when the data set is partitioned using heterogeneous clustering where

the number of clusters equals the number of classes. Figure 9 represents a set of co–

occurrence matrices that are obtained from different data sets by counting the number of

instances of each class belonging to a particular cluster when the data sets are partitioned into

k clusters using k–means clustering with k=# of classes.
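Such a co-occurrence matrix is simply a contingency table of cluster id against class label; a minimal sketch of how it can be computed is given below (the function name is ours, and integer class labels 0..#classes-1 are assumed).

```python
import numpy as np
from sklearn.cluster import KMeans

def cooccurrence_matrix(X, y, n_clusters, seed=0):
    """Rows: clusters, columns: classes; entry = # patterns of that class in that cluster."""
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=seed).fit_predict(X)
    n_classes = len(np.unique(y))
    M = np.zeros((n_clusters, n_classes), dtype=int)
    for k, c in zip(cluster_ids, y):
        M[k, c] += 1
    return M
```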

Ionosphere:            Class 1   Class 2
  Cluster 1                61        82
  Cluster 2               141        31

Sonar:                 Class 1   Class 2
  Cluster 1                30        32
  Cluster 2                57        68

Iris:                  Class 1   Class 2   Class 3
  Cluster 1                 0        42        45
  Cluster 2                21         3         0
  Cluster 3                24         0         0

Wine:                  Class 1   Class 2   Class 3
  Cluster 1                 0        32         0
  Cluster 2                30         1         0
  Cluster 3                 0         3        24

Figure 9: Cluster–class co–occurrence matrices when heterogeneous clustering is performed on the data sets using the k–means clustering algorithm with k = # of classes.

Note from Figure 9 that in Ionosphere and Sonar data sets each cluster contains examples

from multiple classes. This implies overlapping data points in these data sets. Nearly atomic

and atomic clusters are obtained for the Iris data set at the second and third clusters

respectively. The first cluster however contains overlapping examples from class 2 and class

3. Clustering these data sets into a higher number of partitions will lead to a higher number of atomic or nearly atomic clusters, leading to better learning of the ensemble classifier. The

clusters produced for the Wine data set are however either atomic or nearly atomic. It is easier

to produce the cluster to class mapping for these clusters by the fusion classifier in COEC.

Clustering further is unlikely to provide any benefit for the ensemble classifier learning for

such data sets. Figure 10 represents the co–occurrence matrices when the Ionosphere, Sonar

and Iris data sets are partitioned into higher number of clusters.


It can be observed from Figure 10 that a higher number of clusters improves the learning scenario for all the data sets. Six out of ten clusters in the Ionosphere data set are atomic and two clusters are near atomic. Four clusters are atomic and three clusters are near atomic for the Sonar data set. All the clusters are either atomic or near atomic for the Iris data set. These results imply that a higher number of clusters in heterogeneous clustering produces significant numbers of atomic and near atomic clusters, and it becomes easier for the fusion classifier in COEC to produce the cluster to class map, leading to better classification accuracy.

Ionosphere:            Class 1   Class 2
  Cluster 1                 0        10
  Cluster 2                35         0
  Cluster 3                 0         7
  Cluster 4               108        26
  Cluster 5                22         2
  Cluster 6                20        38
  Cluster 7                 0        15
  Cluster 8                17         1
  Cluster 9                 0         8
  Cluster 10                0         6

Sonar:                 Class 1   Class 2
  Cluster 1                17        26
  Cluster 2                11         1
  Cluster 3                 0        11
  Cluster 4                18        22
  Cluster 5                17         5
  Cluster 6                 0         9
  Cluster 7                 4         7
  Cluster 8                 0         8
  Cluster 9                20         5
  Cluster 10                0         6

Iris:                  Class 1   Class 2   Class 3
  Cluster 1                 0        24         2
  Cluster 2                 0         0        10
  Cluster 3                 0        20         2
  Cluster 4                12         0         0
  Cluster 5                 0         1        16
  Cluster 6                 9         0         0
  Cluster 7                 0         0        15
  Cluster 8                 9         0         0
  Cluster 9                15         0         0
Ionosphere Sonar Iris

Figure 10: Cluster–class co–occurrence matrices when heterogeneous clustering is performed on the data sets using the k–means clustering algorithm with a higher number of clusters.

Figure 11 presents the classification accuracies on the data sets in Table 1 at different numbers of clusters using heterogeneous clustering in COEC. The best classification accuracies are obtained for all the data sets when the number of clusters is greater than the number of classes. As the clusters have well defined boundaries the base classifiers learn the cluster boundaries easily. A higher number of clusters produces mostly atomic and near–atomic clusters for data sets like Iris, Ionosphere and Sonar (Figure 10). As a result the fusion classifier learns the cluster to class maps with high accuracy, resulting in better classification performance of COEC. Data sets like Wine have class patterns that are already well separated (Figure 9) and further clustering does not significantly improve the classification performance of COEC.


[Panels: (a) Breast cancer, (b) Sonar, (c) Iris, (d) Ionosphere, (e) Thyroid, (f) Vehicle, (g) Liver, (h) Diabetes, (i) Wine, (j) Satellite, (k) Segment; each panel plots classification accuracy against the number of clusters.]

Figure 11: Classification accuracy of COEC with heterogeneous clustering at different numbers of clusters on the test cases of the data sets in Table 1.

6.1.2 Homogeneous clustering

Homogeneous clustering partitions the examples belonging to a single class only and ignores

the instances of other classes. Consider the partitioning of the data sets in Figure 9 using

homogeneous clustering. The resultant cluster–class co–occurrence matrices are represented

in Figure 12 considering two clusters for each class. The total number of clusters equals the

number of classes times the number of clusters per class. Note that all the clusters are atomic

in nature.


Ionosphere:            Class 1   Class 2
  Cluster 1                59         0
  Cluster 2               143         0
  Cluster 3                 0        82
  Cluster 4                 0        31

Sonar:                 Class 1   Class 2
  Cluster 1                51         0
  Cluster 2                36         0
  Cluster 3                 0        61
  Cluster 4                 0        39

Iris:                  Class 1   Class 2   Class 3
  Cluster 1                21         0         0
  Cluster 2                24         0         0
  Cluster 3                 0        23         0
  Cluster 4                 0        22         0
  Cluster 5                 0         0        20
  Cluster 6                 0         0        25

Wine:                  Class 1   Class 2   Class 3
  Cluster 1                11         0         0
  Cluster 2                19         0         0
  Cluster 3                 0        18         0
  Cluster 4                 0        18         0
  Cluster 5                 0         0        15
  Cluster 6                 0         0         9
Ionosphere Sonar Iris Wine

Figure 12: Cluster–class co–occurrence matrices when homogeneous clustering is performed on the data sets using the k–means clustering algorithm with two clusters for each class.

Figure 13 represents the classification performance of COEC at different numbers of clusters on the data sets in Table 1 using homogeneous clustering. Here n clusters imply a total of n×number_of_classes clusters in the data set. For example, the Vehicle data set has four classes, so four clusters in Figure 13 means 4×4=16 clusters in the data set. Too many clusters in small data sets imply a small number of patterns in each cluster, which leads to poor learning of the fusion classifier in COEC. This explains the fall in accuracy at higher numbers of clusters for the majority of the data sets in Figure 13.

6.1.3 Comparison

Homogeneous clustering can be beneficial over heterogeneous clustering for overlapping

patterns. For clarification, consider an artificial data set in Figure 14. The data set contains

overlapping patterns from multiple classes. Heterogeneous clustering is likely to produce the

partitions presented in Figure 14(b), where a large cluster is non–atomic. Even with a higher number of partitions the situation is unlikely to change, or the produced clusters will be random with each being non–atomic. The partitions produced by homogeneous clustering under an identical situation are presented in Figure 14(c). Note that all the clusters are atomic in nature. The groups within each cluster are well separated geometrically for this data set. As the data is clustered class wise, the cluster to class mapping by the fusion classifier becomes easier. COEC thus performs better using homogeneous clustering.

[Panels: (a) Breast cancer, (b) Sonar, (c) Iris, (d) Ionosphere, (e) Thyroid, (f) Vehicle, (g) Liver, (h) Diabetes, (i) Wine, (j) Satellite, (k) Segment; each panel plots classification accuracy against the number of clusters.]


Figure 13: Classification accuracy of COEC with homogeneous clustering at different numbers of clusters on the test cases of the data sets in Table 1.


Figure 14: Clustering of an artificial data set with overlapping data points using
homogeneous and heterogeneous clustering.


To verify the above observation we have conducted a set of classification experiments on the

data sets in Table 1 using both homogeneous and heterogeneous clustering with COEC. The

10–fold cross validation results on the test sets are presented in Table 2. It can be observed

that homogeneous clustering performs 14.38% better than heterogeneous clustering on average with COEC. These real world data sets contain significantly overlapping patterns, and the performance of homogeneous clustering is better than that of heterogeneous clustering, as evidenced from Table 2. To validate this claim, we define the null and alternative hypotheses as follows:

Null Hypothesis: Homogeneous clustering is equivalent to heterogeneous clustering for

classifying data using COEC.

Alternative Hypothesis: Homogeneous clustering is significantly better than heterogeneous

clustering for classifying data using COEC.

Note that the Null Hypothesis is rejected at 0.05 significance level by two–tailed sign test

[29][30] from the comparative classification performances of heterogeneous and

homogeneous clustering in Table 2.
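The two–tailed sign test used throughout this section counts, over the data sets, how often one method beats the other and compares that count against a Binomial(n, 0.5) reference, with ties discarded. A minimal sketch, using the accuracies reported in Table 2, is given below (helper name is ours; scipy >= 1.7 is assumed for binomtest).

```python
from scipy.stats import binomtest

def two_tailed_sign_test(scores_a, scores_b):
    """p-value of the two-tailed sign test for paired accuracies of methods A and B."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b                        # ties are discarded
    return binomtest(wins_a, n, 0.5, alternative="two-sided").pvalue

homogeneous   = [97.72, 84.44, 96.00, 89.09, 94.89, 71.77, 63.33, 71.08, 99.05, 89.19, 95.97]
heterogeneous = [97.59, 67.29, 95.33, 86.55, 86.06, 52.15, 57.67, 64.80, 98.10, 76.08, 66.93]
print(two_tailed_sign_test(homogeneous, heterogeneous))   # ~0.001 < 0.05
```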

Table 2: Classification performance comparison of COEC at homogeneous clustering and


heterogeneous clustering on the test cases of the data sets in Table 1 using 10–fold cross
validation. The sign test on the results implies that homogeneous clustering is significantly
better than heterogeneous clustering with COEC.
Data Set Heterogeneous clustering Homogeneous clustering
Breast Cancer 97.59±1.29 97.72±2.23
Sonar 67.29±9.90 84.44±7.60
Iris 95.33±5.49 96.00±3.44
Ionosphere 86.55±7.03 89.09±5.49
Thyroid 86.06±9.55 94.89±6.20
Vehicle 52.15±4.45 71.77±2.99
Liver 57.67±6.28 63.33±9.05
Diabetes 64.8±3.50 71.08±5.65
Wine 98.10±0.00 99.05±0.00
Satellite 76.08±4.97 89.19±1.22
Segment 66.93±5.07 95.97±1.08


Note that the performance of COEC with clustering depends on the number of clusters. The

main objective of this paper is to observe the influence of clustering on classification

accuracy. We have adopted a stepwise search method by changing the number of clusters

within a limited range and observing its influence on classification accuracy. The actual

number of clusters is a function of the number of patterns in the data set and it is thus

required that a wider range of number of clusters be considered for finding the optimal

number of clusters at which the classification accuracy is maximum. Further research is

required for finding the optimal number of clusters.

6.2 Impact of Clusters on Diversity

In order to ascertain the impact of clusters on diversity we have computed the errors made by

the base classifiers as we change the number of clusters in COEC. Figure 15 represents the

errors made by the k–NN, NN and SVM base classifiers as the number of clusters changes. Note that the base classifier errors at each number of clusters are different for all the data sets. This is possible only if the base classifiers make different errors on identical patterns. This implies that the errors made by the base classifiers are not correlated, which in turn indicates diversity among the base classifiers.


[Panels: (a) Breast cancer, (b) Sonar, (c) Iris, (d) Ionosphere, (e) Thyroid, (f) Vehicle, (g) Liver, (h) Diabetes, (i) Wine, (j) Satellite, (k) Segment; each panel plots the normalized errors of the k–NN, NN and SVM base classifiers against the number of clusters.]

Figure 15: Change in the errors made by the base classifiers as the number of clusters changes in COEC. The errors are normalized within a range of zero to one.

6.3 Comparative Performance Analysis of COEC and Base Classifiers

Table 3 represents a comparative analysis of the classification performance of COEC and the

corresponding base classifiers. Note that different base classifiers achieve different accuracies

on the data sets. This indicates that the errors made by the base classifiers are different and that diversity among the base classifiers is achieved in COEC. On average COEC performs 3.92% better than k–NN, 7.26% better than NN and 7.39% better than SVM as the base classifiers. The fusion classifier combines the decisions from the base classifiers to find the best possible verdict, and this can be attributed to the better performance of COEC. In


order to validate the claims we define the null and alternative hypotheses for each classifier

pair in Table 4. Note that the null hypothesis is rejected at 0.05 significance level by two–

tailed sign test for each classifier pair in Table 4 implying that COEC performs significantly

better than the corresponding base classifiers.

Table 3: Classification performance comparison between COEC and the


corresponding base classifiers.
Data Set k–NN NN SVM COEC
Breast Cancer 96.78 95.09 92.02 97.72
Sonar 81.49 73.09 55.04 84.44
Iris 94 93.33 93.33 96.00
Ionosphere 80.66 77.69 84.66 89.09
Thyroid 85.17 91.33 93.83 94.89
Vehicle 68.71 65.95 68.31 71.77
Liver 61.08 62.58 61.75 63.33
Diabetes 70.29 61.27 70.64 71.08
Wine 97.14 93.50 96.07 99.05
Satellite 87.45 83.23 88.89 89.19
Segment 94.76 94.98 95.28 95.97

Table 4: Significance test for comparing the classification performance of COEC


and the corresponding base classifiers using sign test.
Classifier pair Hypothesis Test
COEC vs. k–NN Null Hypothesis: COEC is equivalent to base k–NN classifier
Alternative Hypothesis: COEC is significantly better than the base k–NN classifier
Sign-Test: Null Hypothesis rejected at 0.05 significance level from the comparative
classification performances of COEC and k–NN in Table 3

COEC vs. NN Null Hypothesis: COEC is equivalent to base NN classifier


Alternative Hypothesis: COEC is significantly better than the base NN classifier
Sign-Test: Null Hypothesis rejected at 0.05 significance level from the comparative
classification performances of COEC and NN in Table 3

COEC vs. SVM Null Hypothesis: COEC is equivalent to base SVM classifier
Alternative Hypothesis: COEC is significantly better than the base SVM classifier
Sign-Test: Null Hypothesis rejected at 0.05 significance level from the comparative
classification performances of COEC and SVM in Table 3

We also conducted classification experiments on the data sets with the base classifiers alone, without any clustering. The classification results are presented in Table 5. COEC performs 3.62% better than k–NN, 5.33% better than NN and 6.51% better than SVM classifiers. This implies that clustering has a significant impact on the learning of the ensemble classifier (Section 6.1), leading to overall better performance. We justify this claim by defining the null and alternative hypotheses in Table 6 for each pair of classifiers. Note that the null hypothesis is rejected at 0.05 significance level using two–tailed sign test for each

classifier pair. This implies that clustering significantly impacts the learning in COEC and

improves the classification performance.

Table 5: Classification performance comparison between COEC and individual


classifiers with no clustering.
Data Set k-NN NN SVM COEC
Breast Cancer 96.78 95.09 92.02 97.72
Sonar 80.53 70.89 53.29 84.44
Iris 95.33 96 94.67 96.00
Ionosphere 82.80 82.04 87.23 89.09
Thyroid 88.5 87.11 93.83 94.89
Vehicle 69.39 72.26 70.15 71.77
Liver 59.08 61.67 66.92 63.33
Diabetes 68.9 68.62 71.74 71.08
Wine 97.14 93.50 96.07 99.05
Satellite 87.45 83.23 88.89 89.19
Segment 95.24 95.46 93.33 95.97

Table 6: Significance test for comparing the pair–wise classification performance


between COEC and the individual classifiers (without clustering) using sign test.
Classifier pair Hypothesis Test
COEC vs. k–NN Null Hypothesis: COEC is equivalent to k–NN classifier
Alternative Hypothesis: COEC is significantly better than the k–NN classifier
Sign-Test: Null Hypothesis rejected at 0.05 significance level from the comparative
classification performances of COEC and k–NN in Table 5

COEC vs. NN Null Hypothesis: COEC is equivalent to NN classifier


Alternative Hypothesis: COEC is significantly better than the NN classifier
Sign-Test: Null Hypothesis rejected at 0.05 significance level from the comparative
classification performances of COEC and NN in Table 5

COEC vs. SVM Null Hypothesis: COEC is equivalent to SVM classifier


Alternative Hypothesis: COEC is significantly better than the SVM classifier
Sign-Test: Null Hypothesis rejected at 0.05 significance level from the comparative
classification performances of COEC and SVM in Table 5

6.4 Comparative Performance Analysis of Classifier Fusion and Algebraic Fusion

Conventional algebraic fusion methods fuse the class confidence values produced by the base

classifiers to produce the class confidence values of the ensemble classifier. In COEC the

base classifiers produce cluster confidence values. If conventional algebraic methods (e.g.

mean of confidence values) are used in COEC the cluster confidence values will be produced

for the ensemble classifier. The cluster–to–class mapping can then be obtained using majority

voting: the class having the maximum number of patterns in the cluster wins the vote. This process is not suitable for strongly non–atomic clusters as it disregards classes that are significantly present in the cluster but do not hold the majority, and this impacts the overall classification accuracy. A fusion classifier performs better under this circumstance. Its targets are set according to the proportions of class patterns and it is trained accordingly. The fusion classifier thus gives importance to all the classes in a cluster according to their proportions, whereas majority voting does not.
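The algebraic fusion baseline discussed above (mean rule over cluster confidences followed by a majority-vote cluster-to-class map) can be sketched as follows; the helper names are ours, and cluster_to_class would be built from the training cluster contents as indicated in the trailing comment.

```python
import numpy as np

def algebraic_fusion_predict(bases, cluster_to_class, d_test):
    """Mean rule over cluster confidences, then a majority-vote cluster-to-class map.

    cluster_to_class[k] is the class holding the most training patterns in
    cluster k (the 'winner' of the vote), which is exactly what discards the
    minority classes of a non-atomic cluster.
    """
    mean_conf = np.mean([clf.predict_proba(d_test) for clf in bases], axis=0)
    winning_cluster = np.argmax(mean_conf, axis=1)
    return cluster_to_class[winning_cluster]

# cluster_to_class can be built from the training cluster contents, e.g.
# cluster_to_class = np.array([np.bincount(y_train[cl == k]).argmax()
#                              for k in range(n_clusters)])
```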

Table 7 provides a comparative classification performance of the fusion classifier and algebraic fusion (mean confidence for the cluster and majority voting for the class) when used with COEC. Overall the fusion classifier performs 1.08% better than algebraic fusion. This implies that the use of a fusion classifier significantly improves the performance of COEC compared to algebraic fusion. To justify this claim we define the following null and alternative hypotheses:

Null Hypothesis: Fusion classifier approach is equivalent to algebraic fusion approach while

used with COEC

Alternative Hypothesis: Fusion classifier approach is significantly better than the algebraic

fusion approach while used with COEC

Note that the null hypothesis is rejected at 0.05 significance level by two–tailed sign test from

the comparative classification performances presented in Table 7.

Table 7: Classification performance comparison between algebraic fusion and classifier


fusion in COEC.
Data Set Algebraic fusion Classifier fusion
Breast Cancer 97.61 97.72
Sonar 83.89 84.44
Iris 95.33 96.00
Ionosphere 84.00 89.09
Thyroid 94.28 94.89
Vehicle 70.86 71.77
Liver 62.83 63.33
Diabetes 70.46 71.08
Wine 97.14 99.05
Satellite 89.89 89.19
Segment 96.36 95.97


6.5 Comparative Performance Analysis of COEC and Classical Ensemble Classifiers

In order to find the standing of COEC we have classified the data sets using classical ensemble classifiers, namely bagging, boosting, and the random subspace method. Figure 16 provides a summary of the classification accuracies obtained using COEC and the other ensemble classifiers. On average COEC performs 6.05% better than bagging, 8.20% better than boosting and 9.08% better than the random subspace method. As mentioned in Section 2 the classical methods aim to achieve diversity and do not provide any mechanism to improve the learning performance of the base classifiers. In COEC this issue is handled by first allowing the base classifiers to learn cluster boundaries. As clusters have well defined boundaries, it is easier for the base classifiers to learn them. The fusion classifier performs the cluster to class mapping and

as observed in the previous section it performs better than the conventional fusion methods.

This combination of cluster boundary learning and fusion classifier mapping leads to better

performance of COEC. We justify this claim by conducting a sign test as presented in Table 8.

Note that the null hypothesis is rejected in all cases either at 0.05 or 0.15 significance level

indicating that COEC performs significantly better than the conventional ensemble

classifiers.


Figure 16: Classification performance comparison between COEC and classical ensemble
classifiers.

Table 8: Significance test for comparing the pair–wise classification performance


between COEC and the classical ensemble classifiers using sign test.
Classifier pair Hypothesis Test
COEC vs. bagging Null Hypothesis: COEC is equivalent to bagging
Alternative Hypothesis: COEC is significantly better than bagging
Sign-Test: Null Hypothesis rejected at 0.05 significance level from the comparative
classification performances of COEC and bagging in Figure 16

COEC vs. boosting Null Hypothesis: COEC is equivalent to boosting


Alternative Hypothesis: COEC is significantly better than boosting
Sign-Test: Null Hypothesis rejected at 0.05 significance level from the comparative
classification performances of COEC and boosting in Figure 16

COEC vs. random Null Hypothesis: COEC is equivalent to random subspace method
subspace method Alternative Hypothesis: COEC is significantly better than random subspace method
Sign-Test: Null Hypothesis rejected at 0.15 significance level from the comparative
classification performances of COEC and random subspace method in Figure 16

7. Conclusion

We have presented a novel cluster oriented ensemble classifier (COEC) which is based on learning of cluster boundaries by the base classifiers, leading to better learning capability, and cluster–to–class mapping by a fusion classifier, leading to better classification accuracy.

The proposed COEC has been evaluated on benchmark data sets from UCI machine learning

repository. The detailed experimental results and their significance using two-tailed sign test


have been presented and analysed in Section 6. The evidence from the experimental results and the two–tailed sign test shows that (i) homogeneous clustering performs significantly better

than heterogeneous clustering with COEC. As shown in Section 6.1, overall the

homogeneous clustering performs 14.38% better than heterogeneous clustering. (ii) the

proposed COEC performs significantly better than its base counterparts. As shown in Section

6.3, overall COEC performs 3.62% better than k–NN, 5.33% better than NN and 6.51% better

than SVM classifiers. (iii) the fusion classifier performs significantly better than algebraic fusion

with COEC. As shown in Section 6.4, overall the fusion classifier performs 1.08% better than

algebraic fusion. (iv) COEC outperforms classical ensemble classifiers namely bagging,

boosting and random subspace method significantly on benchmark data sets. As shown in

Section 6.5, overall COEC performs 6.05% better than bagging, 8.20% better than boosting

and 9.08% better than random subspace method.

In our future research, we intend to focus on finding the optimal number of clusters and on the global optimization of the parameters of the base and fusion classifiers.


Author Biography

Brijesh Verma is a Chair Professor in the School of Information and Communication Technology at Central Queensland University, Australia. His research interests include pattern recognition and computational intelligence. He has published thirteen books, seven book chapters and over one hundred papers in journals and conference proceedings. He has received twelve competitive research grants and has supervised thirty-one research students in the areas of pattern recognition and computational intelligence. He has served on the program committees of over thirty international conferences and on the editorial boards of six international journals. He is a Senior Member of the IEEE and has served as Chair of the IEEE Computational Intelligence Society's Queensland Chapter (2007-2008) and as a member of the IEEE CIS Subcommittee for the Outstanding Chapter Award (2010).

Ashfaqur Rahman received his Ph.D. degree in Information Technology from Monash University, Australia, in 2008. He is currently a Research Fellow at the Centre for Intelligent and Networked Systems (CINS) at Central Queensland University (CQU), Australia. His major research interests are in the fields of data mining, multimedia signal processing and communication, and artificial intelligence. He has published more than 20 peer-reviewed journal articles and conference papers. Dr. Rahman is the recipient of numerous academic awards, including a CQU Seed Grant, the International Postgraduate Research Scholarship (IPRS), the Monash Graduate Scholarship (MGS) and the FIT Dean Scholarship from Monash University, Australia.
