
UNIVERSIDAD TÉCNICA FEDERICO SANTA MARÍA

Departamento de Informática

Valparaíso - Chile

An Ensemble of One-Class Domain Descriptors for Imbalanced Classification

Thesis submitted in partial fulfillment of the requirements for the degree of
Magíster en Ciencias de la Ingeniería Informática

Felipe Ramírez González

Evaluation Committee
Prof. Dr. rer. nat. Héctor Allende Olivares (Advisor, UTFSM)
Prof. Dr. Raúl Monge Anwandter (UTFSM)
Prof. Dr. Max Chacón Pacheco (USACH)

September, 2012
UNIVERSIDAD TÉCNICA FEDERICO SANTA MARÍA
Departamento de Informática
Valparaíso - Chile

Thesis title
An Ensemble of One-Class Domain Descriptors for Imbalanced Classification

Author
Felipe Ramírez González

Thesis submitted to the Departamento de Informática of the Universidad Técnica Federico
Santa María in partial fulfillment of the requirements for the degree of Magíster en Ciencias
de la Ingeniería Informática.

Dr. Héctor Allende O. (Advisor)
Dr. Raúl Monge A. (Internal Evaluator)
Dr. Max Chacón P. (External Evaluator)

September, 2012
to Claudia
Acknowledgments

I would like to offer my sincerest gratitude to my advisor, Dr. Héctor Allende, who has
been a fundamental element throughout the course of this project. I especially thank him for
providing me with the freedom to work in my own way and for trusting in my capabilities. I
would also like to acknowledge the assistance provided by the members of the INCA Research
Group; I deeply thank them for their valuable comments and suggestions in the elaboration
of this work.

On a personal note, I would like to acknowledge the support of my family and friends: my
parents Silvia and Sergio, my brother Álvaro, and especially my girlfriend Claudia, to whom
this work is dedicated. I thank them for their patience and companionship throughout this
year of hard work.

I would also like to acknowledge the support of the following research grants: Fondecyt
1110854, CCTVal FB0821, and DGIP 24.12.01. I especially thank DGIP and Dr. Allende
for offering me a position as scientific assistant in the Advanced Computing Laboratory at
Universidad Técnica Federico Santa María, where this research was developed.
Abstract

Over the past few years pattern recognition algorithms have made possible a growing number
of applications in biomedical engineering, such as mammography analysis for breast cancer
detection, electrocardiogram signal analysis for cardiovascular disease diagnosis, and magnetic
resonance imaging analysis for brain tumor segmentation, among many others.

Given the low rates at which some of these types of diseases occur in real life, available
observations of such phenomena are highly underrepresented, often accounting for less than
1% of all available cases. Under these circumstances, an automatic test which unconditionally
issues a negative diagnosis for any observation will be correct 99% of the time, but at the
cost of not being able to detect any diseases, which is the whole purpose of the test. This
simple example reveals that when rare cases are described, the test's empirical error is an
inappropriate performance measure, and consequently so are error minimization-based
learning models.

This situation is commonly known as the class imbalance problem; it often characterizes
applications where a highly infrequent but very important phenomenon is described, and it
hinders the performance of error minimization-based pattern recognition algorithms. New
solutions capable of compensating for this problem have been proposed under two main
approaches: applying data resampling schemes, or performing modifications to existing
traditional algorithms. This discipline is known in the literature as class imbalanced learning.

In this thesis we propose a new algorithm for imbalanced classification designed to improve
the accuracy of the minority class, hence improving the overall performance of the classifier.
Computer simulations show that the proposed strategy, which we have termed Dual Support
Vector Domain Description, outperforms related literature approaches in especially interesting
benchmark instances.

Keywords: imbalanced data, one-class learning, ensemble learning
Resumen

In recent years, pattern recognition algorithms have enabled a growing number of applications
in biomedical engineering, such as mammography analysis for cancer detection, electrocardiographic
signal analysis for the diagnosis of cardiovascular diseases, and magnetic resonance imaging
analysis for brain tumor segmentation, among many others.

Given the infrequency with which some kinds of pathology manifest themselves in real life,
the observable cases are highly underrepresented, often accounting for less than 1% of all
available data. Under these conditions, an automatic test that unconditionally issues a
negative diagnosis will be correct 99% of the time, but will be unable to detect the truly
important cases in which the disease is present. This simple example reveals that when
infrequent cases are described, the empirical error of the test is an inadequate performance
measure, and consequently, so are the learning models based on its minimization.

This situation is commonly known as the class imbalance problem; it tends to arise in
applications describing a highly infrequent but vitally important phenomenon, and it
significantly deteriorates the performance of pattern recognition algorithms based on error
minimization. Hence the need to develop algorithms capable of compensating for this
deterioration, either by applying some data resampling scheme or by modifying traditional
algorithms. In the literature this discipline is called class imbalanced learning.

This thesis proposes a new algorithm for solving imbalanced classification problems,
especially designed to improve the accuracy of the minority class, thereby improving the
overall performance of the classifier. Computer simulations show that the proposed strategy,
which we have named Dual Support Vector Domain Description, performs better than related
methods from the literature on especially interesting problem instances.

Keywords: imbalanced data, one-class learning, ensemble learning
Contents

1 Introduction
  1.1 Motivation
  1.2 Learning from Imbalanced Datasets
  1.3 Proposal
  1.4 Contributions
  1.5 Summary of the Chapter
2 Machine Learning Overview
  2.1 Definition
  2.2 Classification Tasks
  2.3 Classifier Performance
    2.3.1 Precision and Recall
    2.3.2 Sensitivity and Specificity
  2.4 Traditional Classifiers Under Class Imbalances
  2.5 Summary of the Chapter
3 State of the Art
  3.1 External Approaches
    3.1.1 Resampling
    3.1.2 Complex Data Edition
  3.2 Internal Approaches
    3.2.1 Ensemble Learning
    3.2.2 Cost-Sensitive Learning
    3.2.3 One-Class Learning
  3.3 Appropriate Performance Measures
    3.3.1 Geometric Mean
    3.3.2 F-measure
    3.3.3 ROC Analysis
    3.3.4 Optimized Precision
    3.3.5 Optimized Accuracy with Recall-Precision
    3.3.6 Index of Balanced Accuracy
  3.4 Summary of the Chapter
4 Methodology
  4.1 Proposed Method
    4.1.1 Dual Domain Descriptions
    4.1.2 Nested Aggregation Rule
  4.2 Related Methods
  4.3 Performance Assessment
    4.3.1 Measures
    4.3.2 Validation
  4.4 Summary of the Chapter
5 Results
  5.1 Experimental Setup
  5.2 Experiments with Real Data
    5.2.1 Benchmark Instances
    5.2.2 Algorithm Tuning
    5.2.3 Performance
    5.2.4 Discussion
  5.3 Experiments with Synthetic Data
    5.3.1 Domain Generation Framework
    5.3.2 Algorithm Tuning
    5.3.3 Performance
    5.3.4 Discussion
  5.4 Validation
    5.4.1 Relevant Results
    5.4.2 Hypothesis Testing
  5.5 Summary of the Chapter
6 Conclusions
  6.1 Summary
  6.2 Results
  6.3 Future Work
  6.4 Achievements
Bibliography

List of Figures

1.1 Distribution of labels of the MIT-BIH ECG database
2.1 Illustration of supervised and unsupervised learning
2.2 Performance assessment in an information retrieval system
3.1 Discrimination-based versus recognition-based classification
3.2 SVDD model trained with both target and outlier objects
3.3 G-mean as a function of sensitivity and specificity
3.4 F-measure as a function of precision and recall
3.5 Sample ROC plot
3.6 Optimized precision as a function of sensitivity and specificity
3.7 Balanced accuracy graph
3.8 Index of balanced accuracy as a function of sensitivity and specificity
4.1 Tightened target-class and outlier-class SVDDs
4.2 Data boundaries for three SVDD variations
4.3 Width of the extended decision boundary of the SVDD for three data scenarios
4.4 Width of the extended decision boundary of the SVDD for two thresholds
5.1 Testing G-mean as a function of imbalance in real instances
5.2 Testing G-mean as a function of imbalance in synthetic instances
5.3 G-mean of a decision tree for five complexity levels

List of Tables

4.1 Selection of related state-of-the-art algorithms
5.1 Summary of real benchmark datasets
5.2 Multi-class to two-class mapping applied to benchmark datasets
5.3 Optimal parameters for real datasets
5.4 Training performance over real datasets
5.5 Testing performance over real datasets
5.6 CPU performance over real datasets
5.7 Summary of synthetic benchmark datasets
5.8 Optimal parameters for synthetic datasets
5.9 Training performance over synthetic datasets
5.10 Testing performance over synthetic datasets
5.11 CPU performance over synthetic datasets
5.12 Summary of results with real and synthetic data
Chapter 1
Introduction
Machine learning has recently acquired special attention for real-world problem solving in
an increasing number of domains where new issues not previously considered have appeared,
given the nature of certain phenomena. One such difficulty is the so-called class imbalance
problem [JS02], a bias in data class distributions that is said to hinder the performance of
traditional pattern recognition algorithms such as decision trees, artificial neural networks,
and support vector machines, as their respective training algorithms assume a relatively
uniform distribution of labels [Li07; JS02; Bre96; Bis96; Tom76].

The literature uses the term class imbalance problem to refer to a series of difficulties faced
by some pattern recognition algorithms when learning from imbalanced datasets. A training
dataset is said to be imbalanced if the samples of one of the classes are severely outnumbered
by the samples of the other classes. Formally, in the case of binary classification, let n_0 and
n_1 be the number of available training examples of each class. In cases where n_1 ≪ n_0,
the training set is imbalanced, class 1 being the minority and class 0 the majority. An
imbalance level or ratio, often defined as n_0/n_1, characterizes the amount of imbalance
between classes in a given dataset. In this work, we represent this concept as the percentage
of samples that belong to the majority class.
Some well-known real benchmark datasets have imbalance ratios of up to 10³ [CHG10],
i.e., datasets where there are 10³ observations of one class per each observation of the other.
These high imbalance levels are often found in applications where uncommon but vital
phenomena need to be detected, such as computer-assisted diagnosis of rare diseases [LLH10],
oil spill detection in satellite images [BS05], and recognition of many other infrequent events.

It should be noted, however, that class imbalances do not pose a problem in themselves;
they are just the reflection of a phenomenon's underlying probability distribution when data
is sampled at random. The actual problem is that error minimization-based pattern recognition
algorithms have difficulties when fed with data drawn from these distributions.
Figure 1.1: Distribution of records among the 14 types of rhythm annotated in the MIT-BIH
Arrhythmia Database.
1.1 Motivation

Despite the fact that there is an increasing number of works addressing machine learning
methods and models, most of them focus on solving theoretical issues rather than proposing
practical applications: "so much research, so few good products" [SN98]. Our main motivation
behind this work was to address learning issues that show up when facing real-world problems.

In previous work we analyzed a database of human electrocardiogram records to extract
features for arrhythmia classification [RACVA10]. We observed that the distribution of
records among the 14 types of rhythm was heavily imbalanced: as can be seen in Figure 1.1,
normal rhythm represents almost three quarters of the entire length of the database, and
the remaining quarter is split into 13 other types. We observed that our classifier tended
to memorize these patterns of normal rhythm, resulting in overall high specificities and low
sensitivities in several tests.
1.2 Learning from Imbalanced Datasets

Most learning algorithms for pattern recognition models are based on minimizing the
discrepancy between known observations and the responses of the model given a certain task.
Usually this discrepancy (also called loss) is measured as the empirical classification error,
that is, the quotient between the number of discrepancies and the total number of available
observations. We refer to this kind of error minimization-based model as a traditional
algorithm, in the context of pattern recognition.

It has been observed that the influence of the minority class on the learning process of
traditional algorithms trained on highly imbalanced datasets is practically null. Classifiers
tend to over-fit the majority class, achieving high accuracy overall but low accuracy on the
minority (and often most important) class. Two main reasons explain why traditional
classifiers fail under class imbalances: their goal is to minimize the classification error (or
maximize the overall accuracy), and they assume that both training and testing data are
drawn from the same distribution.
Suppose an application where positive examples account for only 1% of all available
observations, such as the case of arrhythmia classification mentioned earlier. Based on the
facts stated above, a simple classifier that unconditionally issues negative responses will
achieve 99% accuracy in operation (testing performance). It may seem that the classifier has
performance comparable with state-of-the-art methods; however, by definition it is unable
to spot any positive examples, i.e., it has zero sensitivity (0% accuracy on the positive class).
Thus, the problem of class imbalances is precisely to find a way to improve the performance
on the positive class without diminishing the performance on the negative class.

Regarding the nature of the problem, in [BPM04] the authors consider that the imbalance
level is not the only factor that hinders the performance of traditional classifiers. Small
sample size, data overlapping, and subconcepts within classes are further factors that pose
additional difficulties when learning from imbalanced datasets.
1.3 Proposal

Our previous experience with imbalanced data, along with the reports in the literature,
encouraged us to continue investigating the problem in search of more suitable solutions.
Thus, in this thesis we propose a classification model that is able to address problems with
such imbalances among class distributions in general. We aim to outperform related state-of-
the-art algorithms in particular cases of interest, providing a new approach to imbalanced
classification given certain specific problem features.

In particular, we propose a supervised binary classification algorithm consisting of an
ensemble-based aggregation of data descriptors that achieves generalization ability under
class distribution imbalances. The method consists of a pair of data domain descriptors, each
one modeled over a different class, along with a rule-based combination scheme designed to
improve the accuracy of the outlier class, hence improving the overall performance.

We performed computer simulations and compared our proposal with several related
literature approaches in terms of performance measures appropriate for class imbalances. The
results show that our strategy achieves comparable performance in most instances and
outperforms other approaches in some specific ones. Moreover, our method exhibits interesting
behavior regarding performance as a function of imbalance level, which could be studied in
more detail in further work.
1.4 Contributions

This work contributes to the literature on machine learning methods and algorithms,
particularly to the fields of novelty detection and imbalanced classification. Our stated
contributions are the following:

- An in-depth survey of related literature methods for solving the class imbalance problem
- A new method able to properly solve instances of the class imbalance problem
- A comparative experimental study of the classification and computational performance
  of the proposed and related methods

The results of this research project also contributed to the publication of two papers in
international proceedings: a new related cost-sensitive approach to imbalanced classification
in medical scenarios [ORV+12], and the method proposed in this thesis [RA12].
1.5 Summary of the Chapter

In this chapter we have introduced the problem of imbalanced classification by discussing
real-world cases and stated a formal definition for it. We have also characterized classic
pattern recognition algorithms that fail to appropriately solve imbalanced problems in practice.
Finally, we described the big picture of our proposal and stated the contributions of this
thesis to the literature.

This first chapter covered an introduction to the problem and the proposed solution.
In the next chapter we discuss basic machine learning concepts that the reader needs in
order to comprehend topics covered further in the thesis. In Chapter 3 we survey the
most relevant existing methods for imbalanced classification problems. In Chapter 4 we
describe our proposed method in detail, along with the methodology followed to conduct the
experimental study. In Chapter 5 we report and analyze the results of the comparative study
and discuss its implications. Finally, in Chapter 6 we summarize this thesis, give a general
discussion of the results, and conclude with our final remarks.
Chapter 2

Machine Learning Overview

Artificial intelligence is the branch of computer science that aims to develop intelligent agents
for complex automated decision making. Intelligent agents have been defined as systems that
perceive their environment and take actions that maximize their chances of success [RN02].
Machine learning is a discipline within this branch of computer science that aims to allow
intelligent agents to evolve their behaviors based on empirical data, or experience. In this
chapter we introduce some basic machine learning concepts in order to understand the more
specific framework on which this work focuses. We first give a general description of what
this discipline is about, then we give a detailed explanation of the particular case of pattern
recognition tasks, and finally we show how the class imbalance problem introduced in Chapter
1 affects traditional machine learning algorithms.
2.1 Definition

Tom Mitchell defined the notion of machine learning in [Mit97]: "a computer program is said
to learn from experience E with respect to some class of tasks T and performance measure P,
if its performance at tasks in T, as measured by P, improves with experience E." Thus, within
this context, intelligent agents as defined before are regarded as learners.

The objective of a learner is to determine a given phenomenon's underlying probability
distribution using a set of known training examples, in order to be able to answer questions
about new observations. This is often regarded as the generalization ability of a learner,
or how well it explains unknown cases based on knowledge extracted from a finite number
of training observations. Thus, the core purpose of a learning machine is to maximize its
chances of success in unknown environments.
Vapnik described the model of learning as a function estimation model in terms of three
components and a discrepancy function [Vap99]:

1. a generator of random vectors x, drawn independently from a fixed but unknown
   distribution P(x);
2. a supervisor that returns an output vector y for every input vector x, according to a
   conditional distribution function P(y|x), also fixed but unknown;
3. a learning machine capable of implementing a set of functions f(x, α), α ∈ Λ, where Λ
   is the set of parameters that defines the family of parametric learning machines.

Thus the problem is to find the function f that best fits the supervisor's response. The
selection is based on a training set of n random independent identically distributed (i.i.d.)
observations (x_1, y_1), ..., (x_n, y_n) drawn according to P(x, y) = P(x)P(y|x).
A loss function L(y, f(x, α)) measures the discrepancy between the supervisor's and the
learning machine's responses given an observation x. The goal is then to minimize the
expected value of the loss, given by the risk functional

R(α) = ∫ L(y, f(x, α)) dP(x, y).    (2.1)

In order to minimize this functional when P(x, y) is unknown, an induction principle based
on available training data is used to replace R(α) by the empirical risk functional

R_emp(α) = (1/n) Σ_{i=1}^{n} L(y_i, f(x_i, α)).    (2.2)
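As a small concrete instance of (2.2), the following sketch evaluates the empirical risk of a
fixed candidate function over a training sample, anticipating the zero-one loss of Section 2.2;
this is an illustrative snippet and all names are ours.

```python
import numpy as np

def empirical_risk(loss, f, X, y):
    """R_emp for a fixed candidate f: the average loss over the training set (2.2)."""
    return np.mean([loss(y_i, f(x_i)) for x_i, y_i in zip(X, y)])

# e.g., with a zero-one loss and a constant classifier that always answers 0
zero_one = lambda y, fy: 0.0 if y == fy else 1.0
X, y = np.zeros((4, 2)), np.array([0, 0, 1, 0])
print(empirical_risk(zero_one, lambda x: 0, X, y))  # 0.25: one error in four
```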
The input of an object x is typically represented as a set of d features (variables) believed
to carry discriminating and characterizing information about the observation, often called a
feature vector. The d-dimensional space in which these feature vectors lie is referred to as
the feature space. Depending on the nature of the supervisor's output y for a given object,
three main learning tasks can be defined:

1. the regression task, where the output takes continuous values, i.e., y ∈ R, and the
   supervisor is available,
2. the classification task, where the output takes discrete values, or labels, and the
   supervisor is also available, and
3. the clustering task, where the output takes discrete values but the supervisor is not
   always available.

In general, according to the availability of the supervisor's output, three main learning
frameworks are defined: supervised learning, where the supervisor is always available;
unsupervised learning, where the supervisor is not available; and semi-supervised learning, a
middle-ground framework where the supervisor is not always available. In this context,
regression and classification can be categorized as supervised learning tasks, and clustering
as an unsupervised task. Figure 2.1 illustrates the difference between supervised (A) and
unsupervised (B) learning models.
Figure 2.1: Difference between supervised (A) and unsupervised (B) learning models.
2.2 Classification Tasks

This work focuses on classification tasks, particularly on binary classification, in which an
observation's discrete output can take one of two possible values. In the following we refer
to an observation's feature vector as an instance, and the corresponding supervisor's output
as its label. We also refer to the instance-label tuple as a pattern. Therefore, we focus on the
problem of pattern recognition.

Let the supervisor's two possible output values be zero and one, i.e., y ∈ {0, 1}, and let
f(x, α), α ∈ Λ, be a set of functions whose possible outputs are likewise zero or one. Consider
the zero-one loss function

L(y, f(x, α)) = { 0 if y = f(x, α);  1 if y ≠ f(x, α) }    (2.3)

in which case the functional (2.1) represents the learning machine's probability of misclassifying
an instance, i.e., of the responses of the supervisor and the learning machine differing. As
a consequence, the fundamental objective of a classification learning machine is to minimize
its probability of committing classification errors.
Although we have framed classification as a task with discrete outputs, classification
algorithms do not necessarily output discrete values. Many classifiers, such as neural networks,
naturally yield a probability or score, a continuous value that represents the degree to which
an instance is a member of a class. In general, for classification tasks with c classes, the output
of a classifier can be denoted by y(x) = [y_0(x), ..., y_{c−1}(x)], where the components y_i(x) are
(estimates of) the posterior probabilities for classes 0, ..., c−1 given x, i.e., the probability
that x is a member of class i. Thus, three types of classifier can be defined [BKKP99]:

1. Crisp classifier: y_i(x) ∈ {0, 1},  Σ_i y_i(x) = 1,  ∀x ∈ R^d
2. Probabilistic classifier: y_i(x) ∈ [0, 1],  Σ_i y_i(x) = 1,  ∀x ∈ R^d
3. Possibilistic classifier: y_i(x) ∈ [0, 1],  Σ_i y_i(x) > 0,  ∀x ∈ R^d.
In general, a classifier is a function

D : R^d → [0, 1]^c → Ω    (2.4)

where Ω is the set of possible labels. In this work we use Ω = {0, 1} for convenience. A
function needs to be defined in order to translate outputs, whether crisp, probabilistic or
possibilistic, into a discrete label. Note that the general case of output is the possibilistic type,
as it can be normalized to probabilistic and, in turn, hardened to crisp. In the following we
refer to patterns labeled with 0 as negative examples, and to those labeled with 1 as positive
examples.
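The normalization and hardening steps mentioned above are simple to state in code; the
following is a minimal sketch, with names of our own choosing.

```python
import numpy as np

def normalize(scores):
    """Possibilistic -> probabilistic: rescale non-negative scores to sum to 1."""
    scores = np.asarray(scores, dtype=float)
    return scores / scores.sum()

def harden(probs):
    """Probabilistic -> crisp: one-hot vector on the most probable class."""
    crisp = np.zeros_like(probs)
    crisp[np.argmax(probs)] = 1.0
    return crisp

y = [0.2, 0.7]               # possibilistic outputs for classes {0, 1}
print(harden(normalize(y)))  # [0., 1.] -> label 1
```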
2.3 Classifier Performance

Assessing the performance of a classifier is application-dependent. In medical diagnosis, for
example, tests are often required not to miss ill patients, because doing so could have fatal
consequences. In high-frequency data processing, on the other hand, it may be necessary
to minimize false alarms that could lead to system overloads. These examples are particular
cases of a more general framework for measuring the performance of a classifier given a
certain task.
Given a classifier D and a set of instances S, a confusion matrix (also called a contingency
table) C^D_S := (c_{i,j})_{2×2} is used to display information about the four possible outcomes of D
given S:

1. c_{1,1}: positive examples correctly classified, or true positives (TP),
2. c_{1,2}: positive examples incorrectly classified as negatives, or false negatives (FN),
3. c_{2,2}: negative examples correctly classified, or true negatives (TN), and
4. c_{2,1}: negative examples incorrectly classified as positives, or false positives (FP).

In matrix form:

C = ( TP  FN
      FP  TN ).    (2.5)
Note that S could be the set of instances used for training the learning machine, also
known as the training set, in which case C contains information about how well the model
fits known data, or training performance. But S could also be a different, unknown set of
instances, known as the testing set, in which case C contains more interesting information, as
testing performance measures the generalization ability of the classifier.

According to this setting, a classifier can make two types of error: misclassifying a negative
example, known as a type I error or false positive, and misclassifying a positive example, known
as a type II error or false negative.
In terms of (2.5), the overall classification error introduced in Section 2.2 can be expressed as

Error = (FP + FN) / (TP + TN + FP + FN).    (2.6)
Figure 2.2: A depiction of the given example and its relation to the confusion matrix values.
Ticked boxes represent relevant documents, whereas crossed boxes represent irrelevant
documents. Enclosed documents represent the documents retrieved by a given system, which
has 3/4 precision and 3/7 recall.
Thus, the overall classification accuracy can be expressed as

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 1 − Error.    (2.7)
Further performance measures that take into account the different types of error can be
derived from the confusion matrix, each one with a specific use within an application domain.
In the following we explain the most commonly used ones, along with their interpretations
within their respective fields.
2.3.1 Precision and Recall

Precision and recall are two measures commonly used for assessing the performance of
information retrieval systems, such as web search engines. Precision measures the exactness
of the system, as it reports the relevance of retrieved instances, whilst recall measures its
completeness, as it indicates the fraction of all relevant instances that were retrieved.

Consider, for example, a search engine that looks through 10 documents, 7 of which
are known to be relevant to a given query. If the system retrieves 4 documents in total, 3
of which turn out to be relevant, then its precision is 3/4 and its recall is 3/7. In this case
the search engine is arguably exact, as 3 out of 4 returned documents were relevant, but lacks
completeness, as it only returned 3 of the 7 relevant documents. Figure 2.2 depicts this
example and its relationship with the values of the confusion matrix.

In terms of (2.5), precision and recall can be expressed as

Precision = TP / (TP + FP)    (2.8)

and

Recall = TP / (TP + FN).    (2.9)
2.3.2 Sensitivity and Specificity

Sensitivity and specificity are two measures generally used in fields where one type of error
is costlier than the other, so performance is analyzed separately in terms of the two types of
error. The sensitivity of a test measures its ability to detect positive instances, while its
specificity measures its ability to detect negative examples. In simpler terms, sensitivity
and specificity are measures of the accuracy of the positive and the negative class, respectively.
In terms of (2.5), sensitivity and specificity are given by

Sensitivity = TP / (TP + FN)    (2.10)

Specificity = TN / (TN + FP)    (2.11)

For a given classifier, there is often a trade-off between these measures: high sensitivity
usually means low specificity and vice versa. In medicine, for example, diagnostic tests
should be highly sensitive, so that they are unlikely to miss a disease; however, this often
comes at the cost of firing too many false alarms, i.e., having low specificity. Other applications
may need the complete opposite: for example, brain-computer interface systems receive
hundreds of signals per second, hence a low rate of false positives may be required when
searching for a pattern in order to avoid system overloads.

It should be noted that neither of these measures should be used independently for
measuring the performance of a classifier; they only make sense when used together. Consider,
for example, a dummy medical test that unconditionally issues positive diagnoses for any
given observation; it will achieve perfect sensitivity, but it is useless. It should also be noted
that expressions (2.9) and (2.10) are equivalent but have slightly different interpretations
according to their respective application domains.
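To make these definitions concrete, the following minimal sketch computes the confusion
matrix entries and the measures (2.6)–(2.11) from two binary label vectors; it is illustrative
only, and the function name is ours.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix-based measures for labels in {0, 1}, with 1 positive."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return {
        "error":       (fp + fn) / (tp + tn + fp + fn),    # (2.6)
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),    # (2.7)
        "precision":   tp / (tp + fp) if tp + fp else 0.0, # (2.8)
        "recall":      tp / (tp + fn) if tp + fn else 0.0, # (2.9) = sensitivity (2.10)
        "specificity": tn / (tn + fp) if tn + fp else 0.0, # (2.11)
    }

# The dummy classifier of Section 1.2: all-negative answers on a 1%-positive set
y_true = np.array([1] * 1 + [0] * 99)
y_pred = np.zeros_like(y_true)
print(binary_metrics(y_true, y_pred))  # 99% accuracy, but zero sensitivity
```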
2.4 Traditional Classifiers Under Class Imbalances

In the previous chapter we defined the problem of imbalanced class distributions as a
difficulty that concerns traditional pattern recognition algorithms when learning from datasets
of this nature. By Occam's razor, traditional learning algorithms try to output the simplest
hypothesis that best explains a set of training data. Under extremely imbalanced data, the
simplest hypothesis is that all samples are negative.

Many traditional pattern recognition algorithms follow this principle, such as the C4.5
algorithm, artificial neural networks (ANN), and support vector machines (SVM). The
authors of [DBFS91] observed that a feed-forward neural network trained on an imbalanced
dataset may not learn to discriminate sufficiently between classes.
2.5 Summary of the Chapter

In this chapter we have covered a considerable number of machine learning and pattern
recognition concepts needed to understand the following contents of this thesis. We defined
the different tasks of machine learning, presented the most common measures to assess the
performance of a learning machine, and discussed the reasons why traditional algorithms fail
to provide appropriate solutions for imbalanced problems.
Chapter 3

State of the Art

The literature on class imbalanced learning exhibits a considerable number of methods and
algorithms tailored to overcome the class imbalance problem. Most of these algorithms are
modified versions of traditional error-minimization models enabled to handle imbalances in
class distributions by means of a variety of mechanisms. Depending on the level at which they
operate, two major research trends can be identified: external (or data-level) approaches,
and internal (or algorithmic-level) approaches. In this chapter we survey the most relevant
state-of-the-art pattern recognition algorithms proposed in the literature for class imbalanced
problems at both levels.
3.1 External Approaches

At the data level, the proposed solutions are mainly different forms of data resampling, such
as random resampling, focused sampling, sampling with synthetic data, and combinations
of the above. This edition of the dataset aims to balance its class distribution in order to
be able to perform classification using traditional algorithms with an appropriately balanced
training set of examples, hence avoiding the problem instead of solving it.

3.1.1 Resampling

Resampling is the simplest and most intuitive technique for facing imbalanced classification
problems. In simple words, it works by modifying the training set in such a way that the class
distribution becomes balanced. This can be achieved in several ways: by randomly deleting
majority class samples, by randomly replicating minority class samples, or by applying a
combination of both approaches. In the following we describe the two most popular and
simple resampling strategies.
3.1.1.1 Random Over-sampling

Random over-sampling is the action of replicating randomly-chosen objects of a given sample.
In our particular case the technique is used to balance the class distribution of an imbalanced
dataset, i.e., to match the number of samples of both classes, by randomly replicating minority
class samples until the dataset is balanced. The artificially-balanced dataset is then used to
train a traditional pattern recognition algorithm and perform standard classification. This
procedure is described in Algorithm 1.

Algorithm 1 Random Over-sampling
1: Let D be a training set of n_1 majority class samples and n_2 minority class samples
2: Choose at random n_1 − n_2 minority class samples with replacement and append them to D
3: Train a traditional error-minimization classifier

One possible drawback of random over-sampling is that, as it increases the size of the
training set, it could also increase the amount of computational resources required for training
the classifier. Additionally, this method increases the risk that the base classifier over-fits the
replicated objects.
3.1.1.2 Random Under-sampling

Random under-sampling is the action of deleting randomly-chosen objects of a given sample.
Similar to the random over-sampling technique, this procedure is used to balance a dataset by
randomly removing samples of the majority class until the dataset becomes balanced. Then,
the balanced dataset is used to train a traditional pattern recognition algorithm to perform
standard classification. The procedure is described in Algorithm 2.

Algorithm 2 Random Under-sampling
1: Let D be a training set of n_1 majority class samples and n_2 minority class samples
2: Choose at random n_1 − n_2 majority class samples and remove them from D
3: Train a traditional error-minimization classifier

One of the disadvantages of random under-sampling is that it is prone to discard training
observations that could be critical for decision making, risking a decrease in generalization
performance.
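As an illustration of Algorithms 1 and 2, the following sketch balances a labeled dataset
either by over-sampling the minority class or by under-sampling the majority class; the
function names are ours, and class 1 is assumed to be the minority.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority=1):
    """Algorithm 1: replicate minority samples (with replacement) until balance."""
    minority_idx = np.flatnonzero(y == minority)
    majority_idx = np.flatnonzero(y != minority)
    extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx),
                       replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority=1):
    """Algorithm 2: drop random majority samples until balance."""
    minority_idx = np.flatnonzero(y == minority)
    majority_idx = np.flatnonzero(y != minority)
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([kept_majority, minority_idx])
    return X[keep], y[keep]
```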
A mixture of random over-sampling and under-sampling is proposed in [LLL98]; however,
according to the authors themselves, the approach does not provide a statistically significant
improvement over either method applied separately.
3.1.2 Complex Data Edition

Extensions of data edition methods have also been proposed to improve existing plain
resampling techniques, mainly in the form of intelligent criteria employed to select or synthesize
objects for training. For example, in [KM97] the authors propose a one-sided sampling
strategy that under-samples the majority class by removing from the training set the less
reliable objects for classification. The selection is performed according to three criteria:
objects with high class-label noise, borderline objects (objects close to the boundary between
classes in feature space), and redundant objects. Borderline samples are commonly detected
using Tomek links [Tom76].
3.1.2.1 SMOTE

SMOTE is the acronym for Synthetic Minority Over-sampling Technique, a data edition
method proposed in 2002 by Chawla et al. that synthesizes new minority class samples by
interpolating existing neighboring objects in feature space. Similar to the resampling methods
discussed in previous sections, the SMOTE algorithm is used to synthesize a number of
minority class samples so that the training dataset becomes balanced before being fed to a
traditional error-minimization pattern recognition algorithm. The described procedure is
detailed in Algorithm 3.

Algorithm 3 SMOTE
1: Input parameters: k: number of nearest neighbors
2: Let D be a training set of n_1 majority class samples and n_2 minority class samples,
   d the dimensionality of the data, x_{ij} the value of attribute j of sample x_i, and s the
   synthesized samples
3: for i from 1 to (n_1 − n_2) do
4:   Randomly choose a minority class sample x_l in D and one of its k nearest neighbors, x_n
5:   for j from 1 to d do
6:     δ ← x_{nj} − x_{lj}
7:     g ← random number between 0 and 1
8:     s_{ij} ← x_{lj} + δg
9:   end for
10: end for
11: Train a traditional error-minimization classifier
The main advantage of SMOTE over plain resampling is that it avoids the risk of over-fitting
replicated instances, as it synthesizes new ones for training. On the other hand, the
algorithm is prone to perform interpolation using misleading neighbors, as it only considers
minority class samples to choose from. Moreover, the algorithm needs to be fed with an
additional parameter that is problem-dependent and should be tuned to achieve proper
performance.
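A minimal sketch of the synthesis step of Algorithm 3 (without the subsequent classifier
training); X_min holds the minority class samples, and we assume more than k of them are
available. All names are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote(X_min, n_new, k=5):
    """Synthesize n_new samples by interpolating each randomly-picked minority
    sample with one of its k nearest minority-class neighbors."""
    n = len(X_min)
    # pairwise distances within the minority class only
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    neighbors = np.argsort(dists, axis=1)[:, 1:k + 1]  # skip self (column 0)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        l = rng.integers(n)             # a minority sample x_l
        nb = rng.choice(neighbors[l])   # one of its k neighbors x_n
        g = rng.random()                # interpolation factor in [0, 1]
        synthetic[i] = X_min[l] + g * (X_min[nb] - X_min[l])
    return synthetic
```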
Two variations of SMOTE are proposed in [HWM05]. They basically consist of a
straightforward application of the SMOTE algorithm to previously-detected borderline samples
of a given dataset. The main difference between the two variations is that one considers both
positive and negative examples for neighborhood interpolation when synthesizing, whilst the
other only considers positive examples, similar to the original setting. The authors claim
that both techniques achieve better classification accuracy than the original version.

Another variation is proposed in [BSL09]. It synthesizes minority class samples taking
into account the presence of nearby majority class instances, defining a safe level that is
ignored by the previous techniques. The authors claim to outperform all aforementioned
versions of SMOTE.
It should be noted that the objective of these advanced resampling strategies is to select a
set of training objects to be edited (deleted, replicated, synthesized) in order to maximize the
performance of the classifier when trained with the edited observations. From the point of
view of optimization, evolutionary algorithms have also contributed solutions to imbalanced
classification. An example is the evolutionary prototype selection technique [GFH06], which
uses a genetic algorithm to perform the optimal edition of the training set for classification.
3.2 Internal Approaches

We have seen that external approaches are a variety of methods that alter imbalanced
training datasets so that traditional error-minimization algorithms will perform reasonably well.
An alternative way to address imbalanced classification is to design imbalance-insensitive
algorithms able to counter the effect of the majority class without the need to modify the
training set. Methods that fall into this category are regarded as internal or algorithmic-level
approaches.

Three machine learning sub-disciplines have made contributions within this category:
ensemble learning, cost-sensitive learning, and one-class learning. In the following we cover
the most relevant works within each category.
3.2.1 Ensemble Learning

Ensembles of learning machines were originally proposed in [Nil65], and are based on the
intuitive idea that in decision making many opinions are often more useful than only one.
In this sense, it is assumed that the aggregated decision of a committee of classifiers will
outperform its average member. Formally, ensemble algorithms consist of the aggregation of
a set of local decisions {d_1, ..., d_L} to generate a function D by means of a linear combination
(not necessarily convex) of the local contributions:

D = Σ_{i=1}^{L} w_i d_i    (3.1)

It is expected that, provided an appropriate design of the aggregation function, the
previous assumption will hold, i.e., D will outperform the average of the local predictors d_i,
under the assumption that they are good enough explainers of some domain of the problem.
This property, commonly known as diversity, is critical for an ensemble of machines to work
properly.

The literature exhibits a considerable variety of ensemble designs; however, two of them
are widely used in several fields: Bagging [Bre96] and Boosting [FS95]. Bagging consists
of training several classifiers with different bootstrap samples of the available training data.
Boosting, on the other hand, adaptively adjusts the sampling weights of training examples
according to the performance of previous iterations, hence focusing on objects that are hard
to classify.

Several authors have also adapted the concept of ensemble learning to existing techniques
for solving class imbalances. A considerable number of new hybrid methods have benefited
from aggregating multiple decisions, with encouraging results [Li07; CLHB03; FSZC99].
3.2.1.1 Balanced Bagging

In [Li07] the authors aim to address imbalanced classification by partitioning the original
imbalanced problem into several smaller balanced sub-problems using a variation of Bagging.
The method consists of building a number of base classifiers, each fed with all available minority
class instances and a random sample without replacement of majority class samples, so that
the training subsets are balanced. By doing so, the members of the ensemble are trained
using all available data. In operation, the final decision for a new observation is the
majority vote of the members of the ensemble. Algorithm 4 gives a detailed explanation of
the procedure.

Algorithm 4 Balanced Bagging
1: Let D be a training set of n_1 majority class samples and n_2 minority class samples
2: Split the majority class training set into N = ⌊n_1/n_2⌋ disjoint subsets of n_2 samples
3: for i from 1 to N do
4:   Train a traditional error-minimization classifier using all available minority class
     samples and subset i of the split majority class samples
5: end for

There are two clear advantages of Balanced Bagging over resampling methods. One is
that no majority class samples are discarded from the training set, so there is no risk of losing
important information. Also, the training subsets of each classification unit have no replicated
elements, hence there is no risk of over-fitting as in random over-sampling.

A major drawback of such an ensemble method is that the diversity is introduced at the
data level, so there is no guarantee that the base classifiers will be diverse for a given problem.
Other strategies for partitioning the majority class data besides random sampling could be
explored in order to ensure diversity.
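A minimal sketch of Algorithm 4 and the majority-vote decision, using a decision tree as
the base error-minimization classifier purely for illustration and assuming labels in {0, 1}
with 1 as the minority class.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def train_balanced_bagging(X, y, minority=1):
    """Algorithm 4: one base learner per balanced majority-class subset."""
    min_idx = np.flatnonzero(y == minority)
    maj_idx = rng.permutation(np.flatnonzero(y != minority))
    n_subsets = len(maj_idx) // len(min_idx)
    members = []
    for i in range(n_subsets):
        subset = maj_idx[i * len(min_idx):(i + 1) * len(min_idx)]
        idx = np.concatenate([subset, min_idx])
        members.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return members

def predict_majority_vote(members, X):
    votes = np.stack([m.predict(X) for m in members])
    return (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote over members
```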
There is a considerable number of other ensemble-based methods for imbalanced
classification, but only a few are worth mentioning here. SMOTEBoost [CLHB03] is a combination
of the SMOTE algorithm previously described in Section 3.1.2.1 and AdaBoost. Similarly to
SMOTE, in [CHG10] a ranked minority class over-sampling technique is introduced and used
adaptively with a modification of AdaBoost to produce the RAMOBoost algorithm. Finally,
the SHRINK algorithm [KHM98] uses an aggregation strategy based on a set of nested
tests, which are evaluated and weighted by an appropriate performance measure for decision
making.
3.2.2 Cost-Sensitive Learning

Class imbalances are very common in applications where misclassification costs are
asymmetric. Consider, for example, computer-assisted diagnosis of diseases, where the cost of false
negatives is much greater than that of false positives, as we previously covered in Section
2.3. Cost-sensitive learning integrates asymmetric misclassification costs in order to compensate
for class imbalances during training.

Because of this natural simultaneous occurrence of both phenomena (class imbalance and
cost asymmetry), it is said that cost-sensitive learning methods can solve the problem of
imbalanced classification. In [ZL06] the authors show empirically that mildly imbalanced
problems can be accurately solved simply by integrating asymmetric misclassification costs
in the loss function at training time.

As with ensemble learning, a considerable number of cost-sensitive algorithms for
imbalanced classification have been proposed.
3.2.2.1 AdaCost

AdaCost [FSZC99] is a variation of the AdaBoost algorithm that includes, in the updating
rule of the sampling distribution, an adjustment term based on the asymmetry of misclassification
costs. The authors claim that with adequate parameter selection the algorithm is able to
improve the overall classification performance. The training procedure of AdaCost is described
in detail in Algorithm 5.

Algorithm 5 AdaCost
1: Input parameters: T: number of iterations, c_i: misclassification cost of instance i, C:
   cost adjustment constant
2: Initialize D_1(i) such that D_1(i) = c_i / Σ_{j=1}^{m} c_j, with m the number of training instances
3: for t from 1 to T do
4:   Train weak learner using distribution D_t
5:   Compute weak hypothesis h_t : X → R
6:   α_t ← (1/2) ln((1 + r)/(1 − r)), where r = Σ_i D_t(i) y_i h_t(x_i) β(i)
     and β(i) = −sign(y_i h_t(x_i)) c_i + C, ∀i
7:   Update D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i) β(i)) / Z_t,
     where Z_t is a normalization factor chosen so that D_{t+1} remains a distribution
8: end for
9: Output the final hypothesis: H(x) = sign(f(x)), where f(x) = Σ_{t=1}^{T} α_t h_t(x)
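The distribution update (steps 6 and 7) is the heart of the algorithm. The following is a
minimal sketch of one boosting round under the β(i) form reconstructed above, assuming
labels in {−1, +1}, costs scaled so that |r| < 1, and a decision stump as the weak learner;
these are our illustrative choices, not fixed by AdaCost.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adacost_round(X, y, c, D, C=0.5):
    """One AdaCost iteration: y in {-1, +1}, c the per-instance costs,
    D the current sampling distribution (sums to 1)."""
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    h = stump.predict(X).astype(float)          # weak hypothesis h_t
    beta = -np.sign(y * h) * c + C              # cost adjustment beta(i)
    r = np.sum(D * y * h * beta)                # assumed |r| < 1
    alpha = 0.5 * np.log((1 + r) / (1 - r))     # step 6
    D_next = D * np.exp(-alpha * y * h * beta)  # step 7, before normalizing
    return stump, alpha, D_next / D_next.sum()  # divide by Z_t
```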
However, misclassification costs are not always available and are often defined arbitrarily.
We know that false negatives are costlier than false positives in a medical scenario, but the
cost ratio is not always quantified. Since misclassification costs are at the very heart of
cost-sensitive learning, we believe that their scarce availability for a given problem is a major
drawback for their application.
3.2.3 One-Class Learning

All the aforementioned classification methods are based on discrimination, i.e., their learning
algorithms seek to fit a hyperplane between classes in feature space that minimizes the
classification error, extracting information from the observations of both classes. On the
other hand, one-class learning (also called recognition-based learning) performs classification
by tracing a hypersphere around one class of data (namely the target class) and labeling
everything that falls outside as outliers. The approach has recently been used with success for
solving class imbalance problems under extremely imbalanced datasets [FSZC99]. Figure 3.1
depicts the fundamental difference between both paradigms.

Contrary to discrimination-based classification, recognition-based methods assume the
existence of only one class for training. However, in class imbalanced learning, minority class
samples, although scarce, are available for training. In this sense, the information of the
class not being modeled (whether the majority or the minority) could be included in the
learning procedure of one-class approaches to further refine the decision boundary obtained
with the modeled class. The algorithm proposed in [TD04] takes advantage of this idea; we
cover it in more detail later in this chapter.

Now, in a two-class scenario, which of these classes should be regarded as the target class
for training remains an open question. According to [RG97], when nothing about the minority
class distribution can be assumed (or if an insufficient number of examples is available, as in
our particular case), only a description of the boundary of the majority class can be accurately
estimated. We address this issue in Chapter 4.
Figure 3.1: Depiction of the difference between discrimination-based (a) and recognition-based
(b) classification.

There are mainly two approaches for building one-class classifiers: density estimation
methods and boundary methods. Density estimation methods consist of estimating the
underlying density function of the available data and assigning a rejection threshold for outlying
objects. A common choice for the probability model is the Gaussian function [Bis96]; however,
it has been shown that a single normal distribution does not provide a flexible enough model
for achieving good generalization. Hence, mixtures of Gaussians have been introduced to
improve the performance in operation [DH73]. Moreover, kernels have been used with density
estimation methods to achieve further flexibility [Par62].

A drawback of density estimation methods is that in order to fit these models with
acceptable likelihood, a considerable number of training samples should be available, which
is not always the case. Boundary methods were proposed as an alternative to overcome
this situation, given that they only require borderline data to describe a class. In this thesis
we focus on the latter approach to one-class classification, and in the following we review a
considerable number of related works.
3.2.3.1 K-centers

A very simple boundary method is the k-centers algorithm [YYD98]. Similar to the k-means
clustering algorithm [Bis96], this method builds a boundary around data by means of k small
hyperspheres centered at training observations, also called support objects. The area encircled
by a support object is regarded as its receptive field.

In simple terms, the method consists of choosing a support set J of k objects from the
training set and the minimum radius r such that all training objects belong to some receptive
field. The search is performed with several trials of randomly chosen centers, individually
improved by swapping support objects with the best choice within their respective receptive
fields. The best remaining subset J over all trials is reported as the solution. The size of the
support set k is optimized by means of a successive approximation scheme in which the number
of support objects is increased from 1 to (at most) the number of training samples, adding in
each step the observation farthest from the current support objects.

This algorithm runs the risk of over-fitting when k approaches the size of the training set,
therefore the convergence criterion should be chosen carefully. Moreover, the algorithm does
not consider the presence of outlying observations, which could also lead to over-fitting by
design.
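A simplified sketch of the search described above, using random restarts only and omitting
the per-trial swap refinement; all names are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def k_centers_radius(X, centers_idx):
    """Smallest common radius r such that every training object falls in the
    receptive field of some support object."""
    d = np.linalg.norm(X[:, None, :] - X[centers_idx][None, :, :], axis=2)
    return d.min(axis=1).max()

def k_centers(X, k, trials=10):
    """Pick a support set J of k training objects minimizing the common radius."""
    best_idx, best_r = None, np.inf
    for _ in range(trials):
        idx = rng.choice(len(X), size=k, replace=False)
        r = k_centers_radius(X, idx)
        if r < best_r:
            best_idx, best_r = idx, r
    return best_idx, best_r  # test rule: target iff within r of some center
```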
Figure 3.2: Depiction of two SVDD models, one trained using target data only, or standard
SVDD (a), and another additionally using known outlier objects, or tightened SVDD (b).
3.2.3.2 Support Vector Domain Description
The Support Vector Domain Description (SVDD), proposed in 1999 by David Tax and Robert Duin [TD99], aims to describe an optimal hypersphere around a given set of target objects based on the structural risk minimization principle of Vapnik's Support Vector Machine (SVM) [Vap99]. Unlike most one-class methods, such as the previously discussed algorithm, the SVDD can be trained including known outlier objects to further tighten its decision
boundary and improve its performance, as can be seen in Figure 3.2. The hypersphere is characterized by a center $a$ and a radius $R$:

$$\min_{R,\,a}\; R^2 + C\sum_i \xi_i \qquad \text{s.t.}\;\; \|x_i - a\|^2 \le R^2 + \xi_i,\;\; \xi_i \ge 0 \;\;\forall i.$$
This problem can be solved by maximizing the Lagrangian $L$ with respect to the multipliers $\alpha$ using a standard quadratic program solver:

$$\max_{\alpha}\; L = \sum_i \alpha_i \,(x_i \cdot x_i) - \sum_{i,j} \alpha_i \alpha_j \,(x_i \cdot x_j)$$
$$\text{s.t.}\;\; \sum_i \alpha_i = 1, \qquad a = \sum_i \alpha_i x_i, \qquad \alpha_i = C - \gamma_i, \quad 0 \le \alpha_i \le C \;\;\forall i. \tag{3.2}$$
The inner product $(x_i \cdot x_i)$ can be generalized by a kernel function $k(x, y) = \Phi(x) \cdot \Phi(y)$, where $\Phi$ is a mapping of the data to a higher dimensional space in which the fit of the hypersphere may be improved. With such a mapping, problem (3.2) becomes
$$\max_{\alpha}\; L = \sum_i \alpha_i\, k(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j\, k(x_i, x_j) \tag{3.3}$$
$$\text{s.t.}\;\; \sum_i \alpha_i = 1 \tag{3.4}$$
$$a = \sum_i \alpha_i\, \Phi(x_i) \tag{3.5}$$
$$\alpha_i = C - \gamma_i, \quad 0 \le \alpha_i \le C \;\;\forall i. \tag{3.6}$$
Given the optimal values for $\alpha$, the center of the hypersphere $a$ and the errors $\xi_i$ can be calculated using restrictions (3.5) and (3.6). The radius $R$ is defined as the distance from the center to the support vectors on the boundary.
Thus, a test object $z$ will be accepted if the distance $\|z - a\|^2$ is smaller than or equal to the radius $R$:

$$f(z) = I\left( \|z - a\|^2 \le R^2 \right) = I\left( k(z, z) - 2\sum_i \alpha_i\, k(z, x_i) + \sum_{i,j} \alpha_i \alpha_j\, k(x_i, x_j) \le R^2 \right)$$
where $I$ is an indicator function defined as

$$I(A) = \begin{cases} \text{target} & \text{if } A \text{ is true} \\ \text{outlier} & \text{otherwise.} \end{cases} \tag{3.7}$$
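To make the preceding derivation concrete, the following Python sketch solves the kernelized dual (3.3)-(3.6) for a gaussian kernel (for which $k(x, x) = 1$), using scipy's general-purpose constrained optimizer in place of a dedicated quadratic program solver; the function names and the numerical tolerances are our own illustrative choices, not part of [TD99].

import numpy as np
from scipy.optimize import minimize

def rbf(X, Y, sigma):
    # gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_svdd(X, C=1.0, sigma=1.0):
    # Dual (3.3)-(3.6): maximize sum_i a_i k(x_i, x_i) - a' K a
    # subject to sum(a) = 1 and 0 <= a_i <= C.
    n = len(X)
    K = rbf(X, X, sigma)
    neg_L = lambda a: a @ K @ a - a @ np.diag(K)
    res = minimize(neg_L, np.full(n, 1.0 / n), bounds=[(0.0, C)] * n,
                   constraints=({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},))
    alpha = res.x
    # any support vector with 0 < a_i < C lies exactly on the boundary,
    # so it can be used to recover R^2
    on_boundary = np.flatnonzero((alpha > 1e-6) & (alpha < C - 1e-6))
    i = on_boundary[0] if len(on_boundary) else int(np.argmax(alpha))
    R2 = K[i, i] - 2 * alpha @ K[:, i] + alpha @ K @ alpha
    return alpha, R2

def accepts(X, alpha, R2, z, sigma=1.0):
    # decision rule f(z): target iff ||z - a||^2 <= R^2 in feature space
    kz = rbf(np.atleast_2d(z), X, sigma)[0]
    dist2 = 1.0 - 2 * alpha @ kz + alpha @ rbf(X, X, sigma) @ alpha
    return dist2 <= R2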
In [GCT09] an improved boundary for the SVDD is proposed. It allows more objects to fit into the description and be accepted as targets. The method widens the boundary obtained by the SVDD according to how close the enclosed target objects are to it, avoiding over-fitting: the closer they are to the boundary, the wider the new boundary will be, and the more new objects will be accepted. Thus, an object will be considered an outlier only if both of the following conditions are violated:
1. It is enclosed by the SVDD boundary.
2. The ratio of its distance to its nearest boundary point to the average distance of all enclosed objects to their boundary points is not greater than a given decision threshold.
Thus, objects which are accepted by the SVDD boundary are also accepted by the improved boundary, whereas objects which are rejected by the SVDD boundary will not necessarily be rejected by the proposed decision boundary. Algorithm 6 shows the improved rule-based boundary.
Algorithm 6 Support Vector Data Description with Improved Boundary (ISVDD)
1: Let M be a trained SVDD model over the target class, D the average distance between the enclosed target training objects and the boundary of M, T a user-defined threshold, z a testing object and d(z) the distance of the testing object to the boundary
2: if M accepts z or d(z)/D ≤ T then
3:   Y(z) ← target
4: else
5:   Y(z) ← outlier
6: end if
This improved boundary, however, does not consider the contrary case, in which outlier objects are incorrectly accepted by the description. In Chapter 4 we expand on this idea, as it represents the core goal of our proposal.
The literature reports that recognition-based classification models generally outperform discrimination-based approaches when working with highly dimensional and extremely imbalanced data [RK04].
3.3 Appropriate Performance Measures
In Section 2.3 we covered a considerable number of standard metrics used to measure the
performance of pattern recognition algorithms with their respective interpretations within
an application context: for example, the trade-off between sensitivity and specificity in a medical diagnosis scenario, and the relation between precision and recall in information retrieval systems.

Figure 3.3: G-mean (surface) as a function of sensitivity and specificity.
In this section we review additional performance measures proposed in the literature that aim to provide a better representation of a classifier's performance under class imbalances than the traditional aforementioned set of metrics. Therefore, these metrics will be regarded as appropriate measures.
In general, any metric that uses values from both rows of the confusion matrix (2.5) simultaneously will be inherently sensitive to class imbalances [Faw06], as the class distribution of the dataset is the proportion of row-wise sums. Measures such as those discussed in Section 2.3 are based on values from both rows of the confusion matrix, and in the following will be regarded as inappropriate measures for imbalanced classification.
3.3.1 Geometric Mean
A single standard performance measure is unable to give a proper representation of the performance of a pattern recognition algorithm in the task of imbalanced classification. Sensitivity and specificity report the accuracies of the positive and negative classes, respectively. Hence, for a thorough performance assessment, both measures should be analyzed together. Unfortunately, such analyses are complex from a computational point of view, as both measures need to be simultaneously optimized.
The geometric mean of the sensitivity and the specificity (or simply G-mean) is a performance measure that aims to represent a balance between sensitivity and specificity. It is defined as the square root of the product of both measures:

$$G = \sqrt{\text{Sensitivity} \times \text{Specificity}} \tag{3.8}$$
The G-mean is an arguably good performance measure for imbalanced classification as it is independent from the class distribution [KM97]. Moreover, optimizing this measure means maximizing the overall accuracy while maintaining a balance between sensitivity and specificity. Figure 3.3 shows the surface of this measure as a function of sensitivity and specificity.
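As a minimal illustration of why (3.8) is informative under imbalance, the short Python snippet below (our own example, not from [KM97]) computes the G-mean from the entries of a binary confusion matrix and applies it to a classifier that always issues a negative answer on a 99:1 dataset:

import numpy as np

def g_mean(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)      # accuracy on the positive class
    specificity = tn / (tn + fp)      # accuracy on the negative class
    return np.sqrt(sensitivity * specificity)

# A test that always answers "negative" on a 99:1 dataset is 99% accurate,
# yet its G-mean is 0, exposing that it never detects the rare class.
print(g_mean(tp=0, fn=1, tn=99, fp=0))   # -> 0.0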
Figure 3.4: F-measure (surface) as a function of precision and recall for three values of β: (a) β = 0.5, (b) β = 1, (c) β = 2.
3.3.2 F-measure
Another such summary of a pair of complementary measures, arising in the context of information retrieval, is the F-measure or F-score. It aims to provide a weighted average of precision and recall by means of a parameter β, and is defined as the (weighted) harmonic mean between precision and recall:

$$F_{\beta} = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{(\beta^2 \cdot \text{Precision}) + \text{Recall}} \tag{3.9}$$
In practice, this measure is generally used with β = 1 and called F₁; thus, it averages both metrics with equal weight. For larger values of β the relative importance of recall in relation to precision increases, and the contrary holds for smaller values, as can be seen in Figure 3.4.
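A direct transcription of (3.9), with two evaluations showing the effect of β (an illustrative snippet of ours):

def f_beta(precision, recall, beta=1.0):
    # F-measure (3.9): weighted harmonic mean of precision and recall
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.5, 0.8, beta=1.0))   # F1 weighs both measures equally
print(f_beta(0.5, 0.8, beta=2.0))   # beta > 1 shifts the weight towards recall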
An analogy between information retrieval and one-class classification is that relevant documents can be considered as target objects, and irrelevant ones as outliers. Under such a mapping, the F-score can be adequately interpreted in the context of novelty or outlier detection, and used for performance analysis.
Note in figures 3.3 and 3.4 that the G-mean and the F₁-score share similar surfaces. Thus, a similar weighting parameter could be applied to the sensitivity and specificity in the G-mean to add flexibility, in case one measure should be optimized with higher priority than the other, as occurs in cost-sensitive learning.
3.3.3 ROC Analysis
Receiver operating characteristic (ROC) graphs are a technique for visualizing the performance of a classifier that depicts the trade-off between its true positive and false positive rates [Faw06]. They have been extensively used for evaluating medical decision making systems over the past three decades, and have gradually been adopted by the machine learning community for similar purposes ever since.
Two measures extracted from the confusion matrix (2.5) need to be introduced: the true positive rate ($tp_{rate}$) and the false positive rate ($fp_{rate}$):

$$tp_{rate} = \frac{TP}{TP + FN} \tag{3.10}$$
Figure 3.5: An example ROC plot featuring interest zones (A), (B) and (C); the horizontal axis is $fp_{rate} = 1 - \text{specificity}$ and the vertical axis is $tp_{rate} = \text{sensitivity}$.
$$fp_{rate} = \frac{FP}{TN + FP}. \tag{3.11}$$
Note that expressions (3.10), (2.9) and (2.10) are equivalent, and that (3.11) is equivalent to the complement of (2.11), i.e., $fp_{rate} = 1 - \text{specificity}$.
A ROC graph displays points for each combination of $tp_{rate}$ and $fp_{rate}$ in a two-dimensional unit space, where each point represents the performance of a discrete classifier in terms of these two measures. A series of points resembling a curve in ROC space is often regarded as a ROC curve, and depicts the performance of a continuous classifier by means of stepped score thresholding: instances are sorted by score and passed through a score threshold that varies from $-\infty$ to $+\infty$ in appropriate steps; thus, a single continuous classifier turns into as many discrete classifiers as there are samples in the testing set. Then, true positive and false positive rates are computed for each threshold (classifier) and plotted in ROC space.
Several zones in ROC space are important to note. Classifiers near the northwest corner (zone A in Figure 3.5) are the most desirable, as they achieve almost perfect classification with high true positive rates and low false positive rates. Classifiers near the northeast corner (zone B in Figure 3.5) are liberal, as they issue positive responses with weak evidence, which often results in high false positive rates. Classifiers appearing at the southwest corner (zone C in Figure 3.5) are thought of as conservative, as they issue positive responses only with strong evidence, which results in a low false positive rate but at the price of a low true positive rate as well. Classifiers below the positive diagonal perform worse than chance. Figure 3.5 summarizes this description.
Another useful measure obtained from the ROC plot is the area under the curve (AUC). It gives a representation of the overall accuracy of the classifier considering the true positive-false positive trade-off.
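The thresholding procedure and the AUC can be sketched in a few lines of Python (our own illustration; a production implementation would also handle ties in the scores):

import numpy as np

def roc_points(scores, labels):
    # Sweep the decision threshold over the sorted scores, emitting one
    # (fp_rate, tp_rate) point per induced discrete classifier.
    order = np.argsort(-scores)               # descending score
    labels = np.asarray(labels)[order]
    P = labels.sum()
    N = len(labels) - P
    tp = np.cumsum(labels)                    # positives accepted so far
    fp = np.cumsum(1 - labels)                # negatives accepted so far
    return fp / N, tp / P

fpr, tpr = roc_points(np.array([0.9, 0.8, 0.6, 0.3]), [1, 0, 1, 0])
print(np.trapz(tpr, fpr))                    # AUC by the trapezoidal rule: 0.75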
The measures used in ROC analysis do not consider values from both rows of the confusion matrix simultaneously; therefore, they are insensitive to class imbalances.

Figure 3.6: Optimized precision (surface) as a function of sensitivity and specificity for three levels of class imbalance in the form n⁺ : n⁻: (a) 0.8 : 0.2, (b) 0.5 : 0.5, (c) 0.2 : 0.8.
3.3.4 Optimized Precision
The authors of [RP06] propose an improvement to the representation of the accuracy for imbalanced scenarios, which they (wrongly¹) term Optimized Precision (OP). The measure is based on two terms: the classic accuracy and a relationship index that seeks to represent the level of imbalance between sensitivity and specificity:

$$\text{OP} = \text{Accuracy} - \frac{|\text{Specificity} - \text{Sensitivity}|}{\text{Specificity} + \text{Sensitivity}} \tag{3.12}$$

¹The authors refer to precision as the percentage of overall correctly classified samples, whilst in this thesis we refer to it as the exactness or relevance of retrieved documents in the context of information systems; the concept the authors actually refer to is what we term accuracy.
Figure 3.6 illustrates OP as a function of sensitivity and specificity for three levels of imbalance. We can see that the effect of sensitivity and specificity on the measure of overall performance increases as the level of imbalance grows towards the positive and negative classes, respectively.
Although OP includes an index of balanced class accuracies to measure the level of imbalance between sensitivity and specificity, it also includes a term of overall accuracy, which could bias the measure, causing an improper representation of the performance given the level of imbalance.
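In code, (3.12) is a one-line penalty on the accuracy (an illustrative snippet of ours):

def optimized_precision(accuracy, sensitivity, specificity):
    # OP (3.12): accuracy penalized by the relative gap between
    # the two class-specific accuracies
    return accuracy - abs(specificity - sensitivity) / (specificity + sensitivity)

# a 99%-accurate test that never detects positives is heavily penalized
print(optimized_precision(0.99, 0.0, 1.0))   # -> -0.01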
3.3.5 Optimized Accuracy with Recall-Precision
Another measure, proposed in [HSMM] to properly assess the performance of an algorithm in the task of imbalanced classification, is the Optimized Accuracy with Recall-Precision (OARP). It basically seeks to exploit the benefits of accuracy, precision and recall altogether in order to achieve stability and robustness against imbalances in class distributions.
Based on the work of [LB07], the authors propose to use a similar form of the relationship index covered in Section 3.3.4 for each class-specific precision and recall metric, in order to attain a better representation of the performance from the point of view of each class.

Figure 3.7: A sample balanced accuracy graph displaying a classifier with Sensitivity = 0.6 and Specificity = 0.8; the horizontal axis is the dominance (sensitivity − specificity) and the vertical axis is G² (sensitivity · specificity). The gray line is the boundary for feasible classifiers.

For binary classification tasks the OARP is given by:
$$\text{OARP} = \text{Accuracy} - \frac{1}{2 \cdot 10^{c}} \left( \frac{\text{Precision}^{+} - \text{Recall}^{-}}{\text{Precision}^{+} + \text{Recall}^{-}} + \frac{\text{Precision}^{-} - \text{Recall}^{+}}{\text{Precision}^{-} + \text{Recall}^{+}} \right), \tag{3.13}$$
where $\text{Precision}^{+}$ and $\text{Recall}^{+}$ are the respective measures for the positive class, and $\text{Precision}^{-}$ and $\text{Recall}^{-}$ are the respective measures for the negative class. Moreover, since OARP may differ by a significant amount from the accuracy as a consequence of the second term, the authors recommend reducing it to a fraction of an arbitrary power of ten by means of the parameter $c$.
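Assuming the reconstruction of (3.13) shown above (the extraction of the original formula is partially garbled), a direct transcription would read as follows; the sign conventions inside the two ratios follow the formula as printed here:

def oarp(accuracy, prec_pos, rec_pos, prec_neg, rec_neg, c=1):
    # OARP (3.13): accuracy corrected by a per-class relationship index,
    # scaled down by a power of ten chosen through the parameter c
    t_pos = (prec_pos - rec_neg) / (prec_pos + rec_neg)
    t_neg = (prec_neg - rec_pos) / (prec_neg + rec_pos)
    return accuracy - (t_pos + t_neg) / (2 * 10 ** c)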
The authors claim that an OARP-maximization-based training algorithm will generally outperform accuracy-based ones in the task of imbalanced classification. This is due to the fact that the measure favors minority class samples when the data is heavily imbalanced, and it is robust against class distribution changes.
3.3.6 Index of Balanced Accuracy
In general, the aforementioned performance measures do not consider how dominant the accuracy of one class is over the other; hence, their individual contributions are not reported in the final measure. The authors of [GMS09] seek to overcome this representation flaw by means of a new measure that quantifies the trade-off between an unbiased metric of overall accuracy (such as G or F₁) and an index of class accuracy imbalance. This term should not be confused with class imbalance, as it refers to an imbalance in the individual class accuracies (the sensitivity-specificity trade-off). In this sense, the class with the higher accuracy rate is regarded as the dominant class.
Figure 3.8: Index of balanced accuracy (surface) as a function of sensitivity and specificity for three values of α: (a) α = 0, (b) α = 0.5, (c) α = 1.
The proposed measure, named index of balanced accuracy (IBA), is calculated as the area of a rectangular region in the balanced accuracy graph, described by a dominance measure and the G-mean squared. The dominance measure is defined as the signed difference between the sensitivity and the specificity, and is expected to report which is the prevalent class and how significant this relationship is regarding class accuracies.
Similar to ROC analysis, the outcomes of a classifier correspond to a single point in the balanced accuracy graph. Thus, the IBA is computed as the area of the rectangle defined by the point (1, 0) and the outcomes of the classifier (the dominance-G² pair). Also, the authors introduce a parameter α to weight the relevance of the dominance:

$$\text{IBA}_{\alpha} = \left(1 + \alpha \cdot (\text{Sensitivity} - \text{Specificity})\right) \cdot \text{Sensitivity} \cdot \text{Specificity} \tag{3.14}$$
Figure 3.7 illustrates a balanced accuracy graph with several interesting zones: perfect classification (A), unconditional negative classification (B), unconditional positive classification (C), and unfeasible points delimited by a gray line (D). An example classifier with Sensitivity = 0.6 and Specificity = 0.8 (E) is also displayed, along with its corresponding rectangular area defined by the axes (F).
Greater dominance means higher sensitivity relative to specificity; therefore, increasing the value of α results in overweighting the accuracy of the positive class in the index of balanced accuracy, as can be seen in Figure 3.8.
The index of balanced accuracy for the classifier depicted in Figure 3.7 is IBA₁ = 0.384. If the class accuracies of this classifier were inverted, i.e., Sensitivity = 0.8 and Specificity = 0.6, then the index would be IBA₁ = 0.576; thus, it is clear that this measure favors higher sensitivity over specificity. Moreover, it is trivial to note that if α = 0 the indices are identical in both cases, as there is no distinction between class accuracies.
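The two values quoted above can be verified directly from (3.14) (an illustrative snippet of ours):

def iba(sensitivity, specificity, alpha=1.0):
    # IBA (3.14): G^2 weighted by the (signed) dominance
    dominance = sensitivity - specificity
    return (1 + alpha * dominance) * sensitivity * specificity

print(iba(0.6, 0.8))   # -> 0.384, the classifier of Figure 3.7
print(iba(0.8, 0.6))   # -> 0.576: higher sensitivity is favored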
3.4 Summary of the Chapter
In this chapter we have covered and reviewed a considerable number of pattern recognition algorithms tailored to the task of imbalanced classification, along with several performance assessment techniques developed to attain a better representation of a classifier's capabilities in this context.
In the next chapter we address the design, evaluation and comparison of our novel solution with respect to a subset of the previously reviewed methods. Thus, given the wide variety of proposed techniques, we define a set of criteria to perform a proper selection of methods, including both the related state-of-the-art algorithms that we compare our proposal to, and the performance measures that we employ for that purpose.
Chapter 4
Methodology
In this chapter we present the methodology followed to design a novel approach for imbalanced classification and to conduct an experimental study comparing its performance to that of related state-of-the-art methods, as stated in the objectives of this thesis. We begin by describing in detail the proposed method, which seeks to overcome the research gaps found in the literature review of Chapter 3. Then we define criteria to perform a proper selection of the methods to be used in the comparative study. Finally, we describe the performance assessment techniques employed in this work to report our results in the next chapter.
4.1 Proposed Method
In this thesis we propose a new method to address the class imbalance problem, based on two previously reviewed approaches from the literature: one-class learning and ensemble learning. We have termed it Dual Support Vector Domain Description (DSVDD). Our solution works by aggregating the local decisions of one-class domain descriptors fitted to each individual class, in order to further tighten the decision boundary and improve the performance on the minority class without hindering the performance on the majority class or the overall accuracy.
Our intention is to exploit the advantages of including known outlier information in the modeling of one-class classifiers, which by their very nature are designed to model a single target class. Although a few related works on boundary tightening have been proposed in the literature, this apparently remains an open research gap. Our method works in the same sense in which a single SVDD trained with both classes (tightened SVDD) seeks to improve the performance of a standard SVDD trained with target class samples only, i.e., by correcting the decision on wrongly accepted outlier samples (see Figure 3.2). Even though the SVDD has an inner mechanism to perform boundary tightening, there are cases in which it yields worse classification than the standard approach due to class overlapping and further data concept complexity issues.
Thus, we propose an extension of the SVDD with improved boundary proposed in [GCT09] and previously discussed in Section 3.2.3.2, which seeks to overcome an unaddressed flaw. Our strategy consists in training two domain descriptors, one using the samples of the positive class as targets and the other using the samples of the negative class, along with a rule-based combination scheme designed to improve the accuracy on the minority class, hence improving the overall performance.
4.1.1 Dual Domain Descriptions
We have seen in the previous chapter that the improved boundary of the SVDD aims to accept wrongly rejected objects that actually belong to the target class. However, it does not consider the contrary case, i.e., rejecting wrongly accepted objects that actually belong to the outlier class. With our method we intend to handle this case by means of an additional description fitted to the outlier class, as described above. Thus, two tightened SVDD models are trained, using positive class samples and negative class samples as targets, respectively. By doing so, we attempt to improve the performance of the SVDD in the cases where its inner tightening mechanism fails to be an improvement by itself.
Note that both domain descriptors used in our method are tightened, whether by using outlier samples or target samples; however, for simplicity we will refer to them as the target-class SVDD, for the SVDD trained over the target class and tightened using outlier data, and the outlier-class SVDD, for the SVDD trained over the outlier class and tightened using target data (see Figure 4.1 for a depiction of this scheme).
In operation, this dual-descriptor setting admits four possible outcomes: two in which the descriptors agree and two in which they disagree. A simple aggregation rule for the individual decisions could be to accept or reject an object on agreement and reject it on disagreement. However, this rule, among other simple rules that were considered for experimentation, yields unsatisfactory results in practice. The intersection, for example, implies fewer target objects being accepted, which means low sensitivity if positive samples are being treated as targets. The union, as another example, implies fewer outlier objects being rejected, which means low specificity in the same scenario. Low class-specific accuracies yield low overall performance if the performance measure being used is appropriate for imbalanced classification, as we saw in Chapter 3; therefore, a more suitable aggregation scheme is required.

Figure 4.1: A depiction of the tightened target-class SVDD (a) and the tightened outlier-class SVDD (b). Note that in this example the minority class is considered as the target class, but this may not always be the case.
4.1.2 Nested Aggregation Rule
To combine the decisions of the two descriptions we designed a rule that seeks to improve the performance on the outlier class while complying with the restrictions of the extended receptive field of the SVDD with improved boundary. For this purpose we propose the following nested aggregation rule: if the testing object is rejected by the target-class SVDD and the ratio of its distance to its nearest boundary point, to the average distance D of all enclosed objects to their nearest boundary points, is smaller than a given threshold, then the object is classified as a target, or accepted. Up to this point the decision rule is equivalent to that of the improved boundary discussed in Section 3.2.3.2. However, in our approach, if a given test object is accepted by the target-class SVDD and rejected by the outlier-class SVDD, then it is accepted; otherwise, if it is accepted by the target-class SVDD and also accepted by the outlier-class SVDD, then it is rejected.
Algorithm 7 shows the proposed decision rule with the nested descriptor. Lines 1 through 5 are equivalent to the improved boundary proposed by [GCT09] and discussed in Section 3.2.3.2. The proposed rule with the nested descriptor is introduced in lines 6 to 12.
Algorithm 7 Dual Support Vector Domain Description (DSVDD)
1: Let M⁺ be a trained SVDD model over the target class, M⁻ a trained SVDD model over the outlier class as its target, D the average distance between the enclosed target training objects and the boundary of M⁺, T a user-defined threshold, z a testing object and d(z) the distance of the testing object to the boundary
2: if M⁺ rejects z then
3:   if d(z)/D < T then
4:     Y(z) ← target
5:   end if
6: else
7:   if M⁻ rejects z then
8:     Y(z) ← target
9:   else
10:    Y(z) ← outlier
11:  end if
12: end if
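A compact Python rendering of Algorithm 7 follows; the accepts and boundary_distance methods are a hypothetical interface of ours around the two trained descriptions, and, following the ISVDD semantics, we map the case left unassigned by line 3 (rejected by M⁺ with ratio ≥ T) to outlier:

def dsvdd_decision(m_target, m_outlier, z, D, T):
    # Nested aggregation rule of Algorithm 7 (sketch).
    if not m_target.accepts(z):
        # candidate outlier: accept anyway if close enough to the boundary
        return 'target' if m_target.boundary_distance(z) / D < T else 'outlier'
    # candidate target: ask the outlier-class description for a second opinion
    return 'target' if not m_outlier.accepts(z) else 'outlier'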
Figure 4.2 illustrates an example of the decision boundaries of the three related methods: the standard SVDD, the SVDD with expanded boundary, and the proposed method. Figure 4.2 (d) illustrates the aim of the proposed rule, which is to improve the performance on the outlier class by rejecting objects that were wrongly accepted by the boundary of the target class. The final decision regarding candidate targets can also be seen as asking the outlier-class SVDD for a second opinion about its certainty.
Figure 4.2: Examples of decision boundaries on a set of two-dimensional testing objects. Black dots are targets and white dots outliers. Panel (a) displays three decision boundaries: a standard target-class SVDD (thin line), a target-class SVDD with improved boundary (segmented line) and a standard outlier-class SVDD (bold line). In (b), (c) and (d) the receptive fields of the SVDD, the ISVDD, and the proposed DSVDD, respectively, are highlighted for comparison. Although the boundaries should be tightened, they are depicted as circumferences for simplicity.
Figure 4.3: Illustration of the extended decision boundary of the SVDD given a fixed threshold T in three possible scenarios: objects clustered near the boundary (a), objects distributed somewhat uniformly inside the description (b), and objects clustered within the description far from the boundary (c).
Similar to the SVDD with improved boundary, the parameter T is user-defined and should be properly chosen for each instance of the problem, where greater values of T mean a wider expansion of the boundary and more nearby objects being captured into the description. Additionally, the two SVDD models need to be fed with their respective hyper-parameters. In this thesis we used soft-margin SVDDs with gaussian kernels, as they provided proper flexibility for running simulations. Thus, two additional parameters need to be set for each classifier: the soft margin parameter C and the kernel width σ. We give further details on this matter later on in Chapter 5.
In general, it is not advisable to use a fixed value of T across several instances, as the extended receptive field of the description can lead to improper classification depending on features of the dataset such as complexity, size and data distribution. Figure 4.3 shows a depiction of the extended decision boundary of the SVDD given a fixed threshold T in three possible scenarios: objects clustered near the boundary (a), objects distributed somewhat uniformly inside the description (b), and objects clustered within the description far from the boundary (c). Objects clustered near the boundary yield a small value of D; hence, large values of T should be considered to overcome the small effect of the extended receptive field. On the other hand, objects clustered within the description far from the boundary yield a large value of D; hence, small values of T should be considered to avoid losing specificity.

Figure 4.4: Illustration of the extended decision boundary of the SVDD for two thresholds T₁ and T₂, where T₁ < T₂.
Figure 4.4 shows a depiction of the extended decision boundary of the SVDD for two thresholds. We can see that a greater threshold on the same data yields a wider receptive field of the expanded decision boundary. Thus, it is possible to accept previously rejected target objects and, in turn, increase the sensitivity. However, at the same time, previously rejected outlier objects could be accepted into the description, which decreases the specificity (if positive samples are being treated as targets, which, again, may not be the case). Therefore, this parameter should be chosen carefully, as it has a direct impact on the sensitivity-specificity trade-off.
Regarding the complexity of the proposed method, we can expect it to be generally higher than that of the SVDD and the ISVDD, as two descriptions need to be trained instead of one. The complexities of the SVDD and the ISVDD should not differ by a significant amount, because the ISVDD only needs minor additional calculations: the average distance of enclosed target objects to the boundary at training time, and the decision rule in operation. We cannot estimate the differences between the complexities of our proposal and other related algorithms, as they involve different data structures and training algorithms not detailed here from an implementation point of view.
Lastly, we expect our method to outperform other related state-of-the-art methods in particular instances of the class imbalance problem.
4.2 Related Methods
In Chapter 5 we compare DSVDD to other related state-of-the-art algorithms in terms of classification performance and computational complexity. The selection of these algorithms is based on three main criteria: a control sample of traditional algorithms, a sample of related external approaches, and a sample of related internal approaches. In the following we explain and identify the methods selected for the experimental study.
The control sample of traditional algorithms is intended to provide a comprehensive assessment of the effect of class imbalances on the classification performance of error-minimization approaches, i.e., to inspect whether class imbalances really pose additional difficulties to standard learning machines. This is due to the fact that some authors claim that class imbalances are not the only factor hindering traditional pattern recognition algorithms [BPM04]. For this category of methods we selected the widely-known C4.5 algorithm for building decision trees (TREE), and the C-type support vector classifier with gaussian kernel (CSVM). We believe that these classifiers respectively represent simplicity with acceptable performance, and complexity with strong performance, among traditional pattern recognition algorithms.
The samples of external and internal approaches are the actual state-of-the-art algorithms specifically designed for imbalanced classification to which we compare our solution. For external approaches we also followed a simple-versus-complex criterion, similar to the one used for the control sample of error-minimization algorithms. On one hand, we selected random resampling as the simple technique, in its two forms for internal comparison: random over-sampling (RNDO) and random under-sampling (RNDU); on the other, we selected the SMOTE algorithm (SMOT) as the complex technique. Keep in mind that the three algorithms are data pre-processing techniques designed to balance a dataset in order to perform classification with a traditional algorithm. In this work we use soft-margin SVM classifiers in all three cases in order to maximize their chances of success.
For internal approaches we chose one algorithm for each of the three sub-categories discussed in Section 3.2. We chose the Balanced Bagging algorithm (BBAG), described in Section 3.2.1.1, to represent ensemble learning, as it is the most intuitive and natural approach for imbalanced classification, yet has competitive performance. For cost-sensitive learning, although hybrid (it is a cost-sensitive ensemble), we chose the AdaCost algorithm (ACOS), thoroughly described in Section 3.2.2.1, as it integrates most of the properties of proposed cost-sensitive algorithms for imbalanced classification. Both BBAG and ACOS were implemented using decision trees as base classifiers, as these are required to output relatively weak hypotheses. Finally, for one-class learning we chose the ISVDD algorithm (SVDD), described in depth in Section 3.2.3.2, as our proposal is based on it and it is, to our knowledge, the latest improvement of the original SVDD algorithm.
In Table 4.1 we summarize the selected algorithms along with their respective identifiers. This tokenization is also used in Chapter 5 to report the results of the comparative study.

TREE   C4.5 decision trees
CSVM   Soft-margin SVM with gaussian kernel
RNDO   Soft-margin SVM with gaussian kernel fed with randomly over-sampled data
RNDU   Soft-margin SVM with gaussian kernel fed with randomly under-sampled data
SMOT   Soft-margin SVM with gaussian kernel fed with SMOTEd data
SVDD   ISVDD
BBAG   Balanced Bagging algorithm with decision trees as base classifiers
ACOS   AdaCost algorithm with decision trees as base classifiers

Table 4.1: Summary of the selected algorithms for the comparative study along with their respective identifiers.

Most of these algorithms, as well as the proposed method, need to be set up with a series of user-defined hyper-parameters that have a direct impact on their performance. However, optimal performance is not guaranteed for every instance of the problem given a fixed set of parameters; therefore, all algorithms should be tuned for proper operation in each case. In the case of CSVM, RNDO and RNDU, the parameters to be tuned are the constant C of the soft-margin SVM and the width σ of the gaussian kernel. In addition to these, SMOT requires the number of nearest neighbors k to be interpolated. For its part, ACOS needs to be tuned with the number of boosting iterations T, the misclassification costs of negative and positive objects, and the cost function constant c, which we fixed at 0.5 for all instances according to the original setting proposed by the authors. For SVDD and DSVDD (PROP) we need to define which class should be set as the target class (i.e., which class should be modeled), the kernel width σ and the threshold T of the improved boundary. Note that although PROP computes two descriptions, we use the same set of hyper-parameters for both models. Neither TREE nor BBAG need tuning, as they do not use user-defined parameters. The values used for these parameters in each particular instance of the problem are reported in sections 5.2.2 and 5.3.2 of the next chapter.
4.3 Performance Assessment
In Section 3.3 we described a considerable number of approaches and techniques proposed in the literature to properly assess the performance of a classifier under class imbalances. Similar to the previous section, in the following we state criteria to select a subset of all the performance measures described in this work (including the traditional approaches covered in Section 2.3) to be used in the experimental study to evaluate the performance of the methods selected in the previous section, along with the proposed solution. We also describe a framework to validate our results with statistical significance.
4.3.1 Measures
Similar to the selection of related methods, we defined four criteria to select a set of performance measures for the experimental study. In the first place we selected the G-mean as a simple yet effective appropriate measure for assessing classification performance in this scenario, as it summarizes the trade-off between sensitivity and specificity. Moreover, it is widely used in the literature on class-imbalanced learning and has lately become a standard for evaluating classification in comparative studies of this nature.
We additionally selected two measures to evaluate the individual performances on each class: the sensitivity and the specificity. Thus, a more comprehensive interpretation of the G-mean can be performed, given the variety of fields from which the instances used in our simulations come.
Lastly, a traditional measure of overall classification performance is needed in order to evaluate how adequately it represents a classifier's performance in relation to the aforementioned appropriate measures. According to these criteria we selected the error rate.
We also included the CPU time consumption as a measure of the computational complexity of the algorithms, in order to enhance the analysis of the results in terms of the complexity-performance trade-off.
4.3.2 Validation
Pattern recognition algorithms are usually validated on testing data to assess their generalization ability. A common scheme to do so is k-fold cross-validation, in which data is partitioned into k subsets in order to perform k training and testing routines, each time choosing k − 1 subsets for training and the remaining subset for testing.
However, under class imbalances this partition should not be arbitrary: only a few positive examples are available and, if folds are chosen at random, it is likely that some of them will not include minority class samples at all. A solution to this problem is a variation of the technique called stratified cross-validation [Oh11], which partitions the dataset in such a way that the original class distribution is preserved in each subset. Note that, according to this setting, k should be at most the number of samples of the minority class, otherwise some folds will still miss positive examples.
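A minimal sketch of such a stratified partition in Python (our own illustration of the scheme, not the implementation used for the experiments):

import numpy as np

def stratified_folds(y, k, seed=0):
    # Partition sample indices into k folds, preserving the class
    # distribution of the binary label vector y in every fold.
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        for f, chunk in enumerate(np.array_split(idx, k)):
            folds[f].extend(chunk.tolist())
    return [np.array(sorted(f)) for f in folds]

y = np.array([1] * 10 + [0] * 90)            # 10% minority class
folds = stratified_folds(y, k=10)
print([int(y[f].sum()) for f in folds])      # one positive sample per fold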
Finally, our hypotheses regarding the performances of the algorithms evaluated in the next chapter are validated using Student's t-test. In Section 5.1 we give more detail about the number of computer simulations performed and the evidence available for validation.
4.4 Summary of the Chapter
In this chapter we have described a new approach for imbalanced classification, which we have termed DSVDD. The method is expected to outperform other related state-of-the-art methods in particular instances of the problem with statistical significance.
We described the criteria followed to select a set of related imbalanced classification algorithms, and a validation scheme to perform a comparative study between the selected methods and our solution. The experimental setup, along with the results obtained, are reported and discussed in detail in the next chapter.
Chapter 5
Results
In this chapter we begin by describing the experimental setup and materials used to conduct the comparative study, from the system platform to the benchmark instances employed. We present the results of computer simulations of the proposed method and the algorithms selected in Section 4.2 over a series of benchmark instances, in terms of the performance measures described in Section 4.3. Then we perform a comparative analysis of the results obtained and discuss their implications. Finally, we test the validity of relevant hypotheses regarding the classification performance of the studied algorithms.
5.1 Experimental Setup
For a proper evaluation of the studied algorithms we performed 20 trials of stratified 10-fold cross-validation¹. Since for each fold we report the measures discussed in Section 4.3.1, we have 200 measurements of each metric for a given instance. Moreover, for each metric we compute the training and testing performances, measured on the corresponding folds. Therefore, we report the mean and standard deviation of the G-mean, sensitivity, specificity and error rate for both testing and training sets, with a population size of 200. We report both testing and training performances in order to represent the generalization ability and the data fitting properties of the evaluated methods, respectively.
Two types of data were considered in this work: on one hand, benchmark instances collected from real-world domains, such as medicine, finances, multimedia, and other miscellaneous applications; on the other, synthetic data systematically generated to analyze the effect of three factors on classification performance. A detailed description of both sets of instances can be found in sections 5.2.1 and 5.3.1, respectively. We tuned the user-defined parameters of the algorithms such that they maximized their G-mean performance, through an exhaustive search for each instance, so each algorithm-instance pair has its own set of
parameters (where applicable, since not all the algorithms have parameters to be tuned, such as decision trees).

¹ Although arbitrary, we believe that 20 simulations are enough to gather a statistically significant amount of evidence for proper validation.

Data   Description                                                     n     d
SONA   Sonar signals bounced from metal cylinders and rock             208   60
WDBC   Medical information for breast cancer occurrence diagnosis      569   30
IONO   Radar signals for detection of electrons in the ionosphere      351   34
PIMA   Diabetes diagnosis in Pima Indian women                         768   8
WINE   Chemical properties of three types of wine                      178   13
GERM   People description for good or bad credit risks                 1000  24
VEHI   Vehicle silhouette features for manufacturer identification     846   18
WPBC   Medical information for breast cancer recurrence prognosis      194   33
SEGM   Image segmentation features for pattern recognition             2310  19
ZERN   Zernike features of handwritten numerals for digit recognition  2000  47
VOWE   Audio features for steady-state vowels of British English       990   10
ABAL   Prediction of the number of rings in abalones                   731   10
GLAS   Type of glass recognition defined by oxide content              214   10
YEAS   Prediction of the cellular localization sites of proteins       483   8

Table 5.1: Description of the datasets used as benchmark instances of the problem. Column n is the size of the dataset and d the dimensionality of the feature vector.
To evaluate the CPU time consumption of the evaluated algorithms we used MATLAB's cputime function, which measures the elapsed CPU time of executing code, in seconds. In our particular case we measured the elapsed time of one round of training and classification. Similar to classification performance, we report the mean and standard deviation corresponding to the 200 runs for each instance. All computer simulations were coded using MATLAB R2011a and performed on a 2.93 GHz Intel Core i7 with 4 GB of memory running Mac OS X, version 10.7.
5.2 Experiments with Real Data
In this section we report and analyze the results of simulations performed using real-world data. We begin by introducing the instances of the problem used for experimentation, then we display the optimal parameters found for each instance, and finally we report the results and discuss their implications.
5.2.1 Benchmark Instances
The performance of the proposed method, along with that of the algorithms discussed in Chapter 3, is evaluated using 14 real-world datasets from the UCI Machine Learning Repository [AA07] and the ELENA project [GD+95]. These particular datasets were chosen to vary in size and imbalance level, for a more comprehensive performance assessment given the scope of this thesis. Table 5.1 summarizes the main features of the real datasets used in this work.
Data   Minority Class       #    Majority Class      #     %
SONA   Rock                 97   Metal cylinder      111   53
WDBC   Positive diagnosis   212  Negative diagnosis  357   62
IONO   Bad radar            126  Good radar          225   64
PIMA   Positive diagnosis   268  Negative diagnosis  500   65
WINE   Label 1              59   Labels 2 and 3      119   67
GERM   Bad credit           300  Good credit         700   70
VEHI   Class Van            199  All other classes   647   76
WPBC   Positive diagnosis   46   Negative diagnosis  148   76
SEGM   Class brickface      330  All other classes   1980  86
ZERN   Digit 9              200  Digits 0 to 8       1800  90
VOWE   Label 1              90   Labels 2 to 11      900   91
ABAL   Label 18             42   Label 9             689   94
GLAS   Class tableware      9    All other classes   205   96
YEAS   Class POX            20   Class CYT           463   96

Table 5.2: Modifications applied to the labels of the original datasets to transform them into two-class imbalanced problems. The # columns are the sizes of the minority and majority classes, and % is the imbalance level, represented as the percentage of samples that belong to the majority class.
Instances describing multi-class classification tasks were transformed into binary classification datasets by selecting one of the classes as the minority class and merging the remaining classes into one to become the majority class. This selection was performed following the suggestions in [CHG10]. Table 5.2 shows the details of the transformation applied to each dataset; the % column represents the imbalance level as the percentage of samples that belong to the majority class.
We defined three categories of instances according to their imbalance levels, in order to relate the behavior of classification performance to the severity of the imbalance. Instances with imbalance levels between 53% and 64% (SONA, WDBC, IONO) fall into the category of low imbalance, instances with imbalance levels between 65% and 76% (PIMA, WINE, GERM, VEHI, WPBC) fall into the category of mild imbalance, and instances with imbalance levels between 86% and 96% (SEGM, ZERN, VOWE, ABAL, GLAS, YEAS) fall into the category of high imbalance.
5.2.2 Algorithm Tuning
As stated earlier, each algorithm was tuned to achieve its maximum performance on each particular instance of the problem. In Table 5.3 we report the values obtained for each parameter by means of a search driven by 10-fold stratified cross-validation. Note that, in general, the misclassification cost of positive class samples in ACOS (column +) increases with the level of imbalance, as expected from the design of the algorithm. The remaining parameters have no direct interpretation within the context of the problem; we nonetheless report their values for reference.
Level  Data   CSVM: C, σ   TREE: –   RNDO: C, σ   RNDU: C, σ   SMOT: k, C, σ   SVDD: target, σ, T   BBAG: –   ACOS: T, cost−, cost+   PROP: target, σ, T
53% SONA 1E+06 1E+00 - 1E+01 1E-01 1E+06 1E+00 10 1E+01 1E-01 0 2E+00 0.5 - 30 0.0 0.0 0 2E+00 0.7
62% WDBC 1E+03 1E-05 - 1E+03 1E-06 1E+04 1E-06 20 1E+02 1E-04 0 2E+02 1.2 - 30 0.0 0.0 0 2E+02 1.5
64% IONO 1E+06 1E+00 - 1E+00 1E-01 1E+01 1E-01 10 1E+01 1E-01 0 3E+00 0.8 - 15 0.0 0.0 0 3E+00 0.7
65% PIMA 1E+06 1E-06 - 1E+03 1E-06 1E+06 1E-06 15 1E+01 1E-04 0 2E+02 0.5 - 10 0.0 0.0 0 2E+02 0.5
67% WINE 1E+06 1E-06 - 1E+05 1E-06 1E+05 1E-06 5 1E+03 1E-04 0 1E+03 1.0 - 15 0.0 0.0 1 1E+03 0.9
70% GERM 1E+04 1E-04 - 1E+04 1E-06 1E+04 1E-05 5 1E+03 1E-04 0 3E+01 0.5 - 25 0.0 0.0 0 3E+01 0.5
76% VEHI 1E+02 1E-04 - 1E+02 1E-04 1E+04 1E-04 20 1E+02 1E-04 0 6E+01 0.5 - 10 0.0 0.2 1 6E+01 1.4
76% WPBC 1E+06 1E-06 - 1E+03 1E-06 1E+03 1E-06 20 1E+02 1E-04 0 4E+02 0.5 - 25 0.0 0.2 0 4E+02 0.5
86% SEGM 1E+03 1E-04 - 1E+04 1E-05 1E+04 1E-05 10 1E+02 1E-04 1 5E+01 0.9 - 25 0.0 0.4 1 5E+01 0.6
90% ZERN 1E+03 1E-06 - 1E-01 1E-05 1E+02 1E-06 15 1E+00 1E-04 1 8E+02 1.5 - 25 0.0 0.6 0 8E+02 0.5
91% VOWE 1E+06 1E-01 - 1E+06 1E-01 1E+00 1E+00 20 1E+04 1E+00 1 3E+00 1.3 - 30 0.0 0.8 1 3E+00 1.3
94% ABAL 1E+06 1E-04 - 1E-06 1E-05 1E+05 1E-04 15 1E+04 1E-04 0 1E+00 1.5 - 15 0.0 1.0 0 1E+00 1.5
96% GLAS 1E+06 1E-02 - 1E+06 1E-02 1E+06 1E-02 20 1E+04 1E-02 1 1E+03 1.5 - 30 0.2 1.0 0 1E+03 1.5
96% YEAS 1E+06 1E-03 - 1E+05 1E-05 1E+05 1E-04 15 1E-01 1E+02 0 3E+02 1.5 - 30 0.8 1.0 0 3E+02 1.4
Table 5.3: Optimal parameters of each method obtained for each real dataset.
5.2.3 Performance
In tables 5.4 and 5.5 we report, respectively, the training and testing classification performances of the studied algorithms, resulting from the computer simulations performed with real data according to the setting described in Section 5.1. The discussion of results presented in Section 5.2.4 is based on the testing performance of the algorithms, as it represents their generalization ability; the training performance is nonetheless reported for reference. In Table 5.6 we report the CPU performance of the studied algorithms, represented as the time elapsed in one round of training and classification.
5.2.4 Discussion
In the following we discuss the results of the computer simulations performed with real data. We analyze both the generalization ability of the algorithms, expressed as their testing performance (see Table 5.5), and their complexity, represented by the CPU time consumption of the training procedure and the classification of all available objects (see Table 5.6). The training performance is also considered for reference, but not analyzed in depth.
We can see at a glance that the training performance is considerably better than that on the testing sample. This is due to the fact that testing objects are unknown to the classifier and were not considered when the model was built. Given that the classifier is designed to issue the answer that best fits the known data to the model, it is expected not to perform as well on unknown observations as it does on known data. This behavior, however, is quite common among supervised pattern recognition algorithms.
Regarding the performance of methods in the three categories of imbalance level, we can see that, in general, all algorithms perform reasonably well on instances with low imbalance. However, on instances with mild and high imbalances the overall performance of the methods cannot be generalized, as there are instances with similar imbalance levels where most approaches achieve completely different performances. This seems to support previous observations by other authors regarding the nature of decreased performance due to class imbalances, which consider additional factors such as concept complexity and data size.
For a better understanding of this phenomenon we include line charts that depict the behavior of the G-mean as a function of the imbalance level of the real instances for all studied algorithms². Figure 5.1 (a) illustrates the case of the control sample of traditional error-minimization algorithms. We can see that there is no significant difference between CSVM and TREE regarding performance, with the exception of a few particular instances. It is clear that in some cases the imbalance level is not a degrading factor at all for these two algorithms, for example in instances WINE, VEHI, SEGM, VOWE and GLAS. However, we can also see that the performance on the remaining instances tends to decrease as the imbalance increases, which indicates that there are cases where the imbalance level hinders classification, and that the bigger the imbalance, the bigger the performance drop.
Figure 5.1 (b) illustrates the case of external approaches. We can see that RNDO and RNDU have nearly the exact same performance pattern, which is consistent with reports in the literature of nearly nonexistent differences between these two methods. In Table 5.5 we can see in more detail that the variation in performance (if any) is very slight. On the other hand, the performance of SMOT varies significantly on instances WPBC, ZERN and ABAL, with emphasis on the second, which suggests that the plain resampling methods are more appropriate for that instance. In any case, the performance of external approaches as a function of class imbalance behaves similarly to that of the control sample of algorithms, i.e., it tends to drop as the level of imbalance increases on affected instances, although less dramatically and almost stably in highly imbalanced scenarios.
Regarding internal approaches, a depiction of their performance is presented in Figure 5.1 (c). The performance pattern in this case is somewhat similar to the ones presented previously; however, we can see dramatic variations from one method to another, especially on highly imbalanced instances of the problem. The performance pattern of the related internal approaches is very similar to that of the external approaches in Figure 5.1 (b); our solution, however, is remarkably stable at high levels of imbalance. Note that it performs worse than other approaches on instances where algorithms are in general not affected by class imbalances, such as VOWE and GLAS, but it is clearly better on instances where algorithms are affected by imbalances, such as ABAL and YEAS. As a result, the performance pattern of our method stabilizes towards high levels of imbalance, which indicates a certain level of robustness against this factor. The overall trend of the performance in this case is similar to that of the sample of external algorithms.
The sawtooth shapes observed in the cases of mild to high imbalance levels suggest that there are instances in which most of the algorithms are not affected by class imbalances, and others in which the algorithms are prone to the phenomenon, which supports the hypothesis that this factor alone is not a sufficient condition for a decrease in performance. In the following we refer to WINE, VEHI, SEGM, VOWE and GLAS as easy instances since, although they are characterized by considerable levels of imbalance, they do not seem to pose a major difficulty for the methods employed, even traditional error-minimization algorithms. Instances PIMA, GERM, WPBC, ZERN, ABAL and YEAS, on the other hand, will be consid-
² Note that in figures 5.1 and 5.2 we plot the performances of almost all the algorithms with the same line style, as there are no major differences to note and our aim is only to show the general trend. One of the methods in each sub-figure is highlighted with a different style, for particular reasons in each case.
CSVM   TREE   RNDO   RNDU   SMOT   SVDD   BBAG   ACOS   PROP   (each cell: mean followed by standard deviation)

Error
53% SONA 0.000 0.000 0.025 0.012 0.059 0.011 0.008 0.006 0.054 0.009 0.155 0.013 0.068 0.018 0.403 0.311 0.182 0.027
62% WDBC 0.020 0.003 0.011 0.004 0.037 0.004 0.032 0.004 0.013 0.003 0.068 0.005 0.034 0.008 0.006 0.002 0.068 0.007
64% IONO 0.000 0.000 0.019 0.007 0.030 0.005 0.014 0.005 0.007 0.002 0.061 0.005 0.065 0.013 0.018 0.007 0.088 0.041
65% PIMA 0.192 0.006 0.074 0.008 0.241 0.007 0.211 0.010 0.209 0.007 0.269 0.006 0.146 0.012 0.057 0.008 0.290 0.008
67% WINE 0.000 0.000 0.009 0.005 0.000 0.001 0.013 0.011 0.000 0.000 0.096 0.078 0.017 0.009 0.009 0.006 0.087 0.012
70% GERM 0.139 0.005 0.088 0.007 0.264 0.011 0.248 0.010 0.201 0.007 0.268 0.006 0.166 0.013 0.040 0.005 0.290 0.014
76% VEHI 0.003 0.001 0.012 0.003 0.002 0.001 0.020 0.007 0.004 0.001 0.087 0.005 0.046 0.007 0.008 0.003 0.093 0.011
76% WPBC 0.019 0.006 0.050 0.015 0.179 0.022 0.256 0.031 0.000 0.001 0.215 0.011 0.209 0.041 0.027 0.011 0.227 0.022
86% SEGM 0.001 0.000 0.001 0.000 0.001 0.001 0.008 0.004 0.001 0.001 0.038 0.004 0.011 0.003 0.001 0.001 0.026 0.003
90% ZERN 0.065 0.002 0.027 0.003 0.112 0.004 0.133 0.009 0.055 0.002 0.134 0.010 0.133 0.007 0.019 0.003 0.208 0.063
91% VOWE 0.000 0.000 0.005 0.002 0.000 0.000 0.026 0.007 0.000 0.000 0.028 0.003 0.071 0.009 0.002 0.002 0.058 0.004
94% ABAL 0.023 0.002 0.024 0.004 0.303 0.053 0.221 0.071 0.088 0.011 0.023 0.001 0.228 0.021 0.014 0.003 0.029 0.006
96% GLAS 0.000 0.000 0.000 0.002 0.000 0.000 0.036 0.017 0.000 0.000 0.001 0.002 0.082 0.032 0.002 0.003 0.002 0.002
96% YEAS 0.021 0.002 0.014 0.002 0.047 0.030 0.067 0.039 0.128 0.076 0.053 0.027 0.148 0.048 0.018 0.003 0.065 0.009
Sensitivity
53% SONA 1.000 0.000 0.972 0.023 0.926 0.019 1.000 0.000 0.939 0.015 0.718 0.026 0.980 0.018 0.586 0.306 0.811 0.053
62% WDBC 0.964 0.006 0.980 0.011 0.954 0.008 0.966 0.008 0.986 0.007 0.903 0.008 0.986 0.008 0.987 0.006 0.923 0.011
64% IONO 1.000 0.000 0.984 0.013 0.953 0.012 0.989 0.005 0.990 0.004 0.887 0.015 0.978 0.015 0.972 0.014 0.912 0.019
65% PIMA 0.616 0.016 0.880 0.022 0.690 0.022 0.795 0.024 0.814 0.016 0.535 0.011 0.928 0.017 0.900 0.019 0.744 0.014
67% WINE 1.000 0.000 0.974 0.014 1.000 0.003 1.000 0.003 1.000 0.000 0.839 0.087 0.978 0.012 0.980 0.016 0.944 0.020
70% GERM 0.654 0.016 0.826 0.025 0.754 0.015 0.768 0.016 0.779 0.012 0.447 0.016 0.946 0.014 0.891 0.015 0.670 0.029
76% VEHI 0.990 0.003 0.979 0.011 0.997 0.004 1.000 0.000 0.997 0.003 0.727 0.022 0.991 0.006 0.986 0.009 0.905 0.020
76% WPBC 0.956 0.019 0.871 0.060 0.847 0.046 0.879 0.045 1.000 0.000 0.346 0.037 0.974 0.027 0.894 0.045 0.587 0.068
86% SEGM 0.997 0.002 0.996 0.003 0.999 0.002 0.999 0.002 0.998 0.002 0.987 0.005 1.000 0.001 0.993 0.005 0.948 0.010
90% ZERN 0.681 0.037 0.849 0.035 0.956 0.010 0.984 0.008 0.988 0.006 0.699 0.091 0.988 0.005 0.915 0.022 0.442 0.076
91% VOWE 1.000 0.000 0.970 0.023 1.000 0.000 1.000 0.000 1.000 0.000 0.984 0.010 1.000 0.000 0.986 0.014 0.608 0.029
94% ABAL 0.632 0.035 0.676 0.082 0.581 0.062 0.848 0.047 0.786 0.031 0.595 0.025 0.978 0.025 0.796 0.053 0.689 0.054
96% GLAS 1.000 0.000 0.997 0.017 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 0.992 0.032 0.967 0.051
96% YEAS 0.550 0.039 0.731 0.071 0.656 0.073 0.672 0.058 0.769 0.038 0.637 0.041 0.978 0.038 0.618 0.057 0.657 0.041
Specificity
53% SONA 1.000 0.000 0.978 0.016 0.954 0.012 0.984 0.011 0.952 0.011 0.957 0.015 0.890 0.031 0.606 0.319 0.824 0.037
62% WDBC 0.989 0.002 0.994 0.005 0.968 0.005 0.969 0.007 0.989 0.003 0.950 0.007 0.954 0.012 0.998 0.002 0.937 0.010
64% IONO 1.000 0.000 0.979 0.010 0.980 0.004 0.984 0.008 0.994 0.003 0.967 0.004 0.910 0.020 0.988 0.008 0.912 0.068
65% PIMA 0.911 0.007 0.951 0.012 0.797 0.016 0.786 0.019 0.779 0.010 0.836 0.008 0.814 0.020 0.966 0.008 0.691 0.011
67% WINE 1.000 0.000 1.000 0.002 1.000 0.001 0.981 0.017 1.000 0.000 0.937 0.110 0.985 0.012 0.997 0.005 0.897 0.019
70% GERM 0.950 0.005 0.949 0.010 0.728 0.018 0.745 0.016 0.808 0.010 0.854 0.007 0.785 0.021 0.990 0.004 0.726 0.023
76% VEHI 0.999 0.001 0.991 0.003 0.998 0.002 0.974 0.009 0.996 0.002 0.970 0.003 0.942 0.009 0.994 0.003 0.908 0.013
76% WPBC 0.989 0.006 0.974 0.016 0.813 0.034 0.702 0.045 1.000 0.002 0.922 0.010 0.735 0.055 0.998 0.004 0.831 0.031
86% SEGM 1.000 0.000 0.999 0.000 0.999 0.001 0.990 0.005 0.999 0.001 0.957 0.005 0.987 0.004 1.000 0.000 0.978 0.003
90% ZERN 0.963 0.004 0.987 0.003 0.880 0.004 0.854 0.010 0.941 0.002 0.885 0.019 0.853 0.008 0.988 0.003 0.830 0.075
91% VOWE 1.000 0.000 0.997 0.002 1.000 0.000 0.972 0.008 1.000 0.000 0.971 0.003 0.922 0.010 0.999 0.001 0.976 0.005
94% ABAL 0.998 0.001 0.994 0.004 0.704 0.059 0.775 0.077 0.920 0.012 1.000 0.000 0.760 0.023 0.998 0.002 0.988 0.006
96% GLAS 1.000 0.000 1.000 0.001 1.000 0.000 0.962 0.018 1.000 0.000 0.999 0.002 0.914 0.034 0.999 0.002 1.000 0.000
96% YEAS 0.998 0.001 0.997 0.002 0.966 0.033 0.944 0.041 0.876 0.080 0.960 0.028 0.847 0.051 0.998 0.001 0.947 0.009
Specificity
53% SONA 1.000 0.000 0.975 0.013 0.940 0.011 0.992 0.006 0.945 0.009 0.829 0.015 0.934 0.018 0.595 0.312 0.816 0.028
62% WDBC 0.977 0.003 0.987 0.005 0.961 0.004 0.967 0.004 0.987 0.004 0.926 0.005 0.970 0.007 0.992 0.003 0.930 0.007
64% IONO 1.000 0.000 0.982 0.007 0.966 0.006 0.986 0.004 0.992 0.002 0.926 0.007 0.943 0.011 0.980 0.008 0.910 0.059
65% PIMA 0.749 0.010 0.915 0.010 0.741 0.008 0.790 0.010 0.796 0.008 0.669 0.007 0.869 0.011 0.933 0.010 0.717 0.009
67% WINE 1.000 0.000 0.987 0.007 1.000 0.001 0.990 0.009 1.000 0.000 0.879 0.112 0.982 0.008 0.988 0.008 0.920 0.012
70% GERM 0.788 0.009 0.885 0.011 0.741 0.008 0.757 0.008 0.793 0.007 0.618 0.011 0.862 0.011 0.939 0.008 0.697 0.013
76% VEHI 0.995 0.001 0.985 0.006 0.997 0.002 0.987 0.004 0.996 0.002 0.839 0.012 0.966 0.005 0.990 0.005 0.907 0.012
76% WPBC 0.973 0.010 0.921 0.030 0.829 0.020 0.785 0.024 1.000 0.001 0.564 0.030 0.845 0.030 0.944 0.024 0.696 0.036
86% SEGM 0.998 0.001 0.997 0.002 0.999 0.001 0.995 0.002 0.998 0.001 0.972 0.004 0.993 0.002 0.996 0.002 0.963 0.005
90% ZERN 0.809 0.021 0.915 0.018 0.917 0.005 0.917 0.005 0.964 0.003 0.783 0.053 0.918 0.004 0.951 0.011 0.601 0.051
91% VOWE 1.000 0.000 0.983 0.012 1.000 0.000 0.986 0.004 1.000 0.000 0.977 0.005 0.960 0.005 0.993 0.007 0.770 0.018
94% ABAL 0.794 0.022 0.818 0.049 0.637 0.028 0.809 0.040 0.850 0.016 0.771 0.016 0.862 0.014 0.891 0.030 0.825 0.032
96% GLAS 1.000 0.000 0.998 0.009 1.000 0.000 0.981 0.009 1.000 0.000 0.999 0.001 0.956 0.018 0.995 0.017 0.983 0.026
96% YEAS 0.740 0.026 0.853 0.040 0.794 0.040 0.795 0.034 0.819 0.036 0.781 0.025 0.909 0.024 0.784 0.036 0.788 0.025
Geometric Mean
Table 5.4: Training performance over real datasets
CSVM TREE RNDO RNDU SMOT SVDD BBAG ACOS PROP (per method: mean followed by standard deviation)
53% SONA 0.128 0.071 0.277 0.085 0.151 0.074 0.130 0.076 0.153 0.081 0.267 0.088 0.298 0.103 0.440 0.193 0.292 0.097
62% WDBC 0.040 0.024 0.078 0.035 0.045 0.027 0.045 0.025 0.058 0.032 0.078 0.035 0.085 0.038 0.049 0.027 0.088 0.040
64% IONO 0.072 0.044 0.121 0.058 0.056 0.038 0.052 0.038 0.051 0.038 0.086 0.044 0.139 0.057 0.088 0.049 0.136 0.067
65% PIMA 0.235 0.044 0.297 0.048 0.259 0.046 0.264 0.046 0.261 0.044 0.279 0.048 0.306 0.048 0.252 0.052 0.303 0.052
67% WINE 0.026 0.039 0.053 0.048 0.022 0.038 0.036 0.047 0.054 0.055 0.103 0.104 0.040 0.050 0.035 0.043 0.092 0.074
70% GERM 0.215 0.039 0.294 0.047 0.280 0.045 0.273 0.044 0.256 0.043 0.305 0.042 0.337 0.048 0.240 0.039 0.368 0.048
76% VEHI 0.015 0.014 0.063 0.026 0.018 0.014 0.032 0.018 0.015 0.012 0.131 0.035 0.066 0.029 0.052 0.025 0.099 0.030
76% WPBC 0.314 0.099 0.306 0.094 0.295 0.095 0.349 0.113 0.378 0.104 0.333 0.098 0.400 0.104 0.235 0.070 0.380 0.111
86% SEGM 0.004 0.004 0.008 0.006 0.005 0.005 0.010 0.008 0.004 0.004 0.043 0.012 0.014 0.008 0.007 0.006 0.031 0.012
90% ZERN 0.137 0.019 0.154 0.021 0.118 0.022 0.142 0.023 0.142 0.020 0.137 0.025 0.154 0.023 0.146 0.019 0.219 0.074
91% VOWE 0.001 0.004 0.028 0.017 0.001 0.003 0.028 0.018 0.000 0.002 0.032 0.018 0.084 0.028 0.025 0.015 0.059 0.019
94% ABAL 0.075 0.022 0.068 0.026 0.305 0.074 0.245 0.079 0.182 0.045 0.299 0.052 0.258 0.057 0.063 0.024 0.312 0.053
96% GLAS 0.001 0.010 0.007 0.023 0.002 0.013 0.035 0.039 0.001 0.008 0.005 0.015 0.102 0.070 0.009 0.022 0.009 0.019
96% YEAS 0.021 0.016 0.032 0.022 0.050 0.043 0.070 0.047 0.211 0.055 0.056 0.104 0.176 0.050 0.023 0.043 0.077 0.074
Error
53% SONA 0.851 0.115 0.691 0.134 0.787 0.133 0.868 0.109 0.786 0.133 0.720 0.133 0.737 0.146 0.544 0.206 0.846 0.110
62% WDBC 0.934 0.048 0.901 0.067 0.941 0.048 0.944 0.049 0.912 0.071 0.904 0.063 0.923 0.064 0.930 0.053 0.919 0.061
64% IONO 0.975 0.047 0.838 0.102 0.883 0.091 0.906 0.083 0.908 0.083 0.885 0.087 0.865 0.088 0.868 0.096 0.900 0.086
65% PIMA 0.549 0.095 0.568 0.094 0.657 0.095 0.712 0.096 0.729 0.079 0.535 0.095 0.690 0.093 0.600 0.105 0.744 0.090
67% WINE 0.973 0.080 0.929 0.111 0.970 0.083 0.977 0.058 0.918 0.110 0.842 0.160 0.950 0.086 0.940 0.094 0.926 0.121
70% GERM 0.533 0.095 0.492 0.096 0.728 0.073 0.727 0.077 0.705 0.081 0.449 0.090 0.686 0.090 0.512 0.095 0.652 0.082
76% VEHI 0.975 0.038 0.871 0.074 0.976 0.036 0.983 0.029 0.984 0.027 0.731 0.093 0.957 0.051 0.897 0.071 0.860 0.077
76% WPBC 0.384 0.224 0.340 0.238 0.601 0.223 0.684 0.222 0.475 0.220 0.339 0.209 0.630 0.227 0.289 0.211 0.532 0.223
86% SEGM 0.987 0.021 0.972 0.030 0.993 0.016 0.996 0.012 0.993 0.014 0.956 0.033 0.988 0.019 0.972 0.029 0.910 0.051
90% ZERN 0.338 0.113 0.215 0.082 0.927 0.058 0.944 0.052 0.324 0.099 0.667 0.134 0.885 0.071 0.226 0.095 0.416 0.129
91% VOWE 0.996 0.025 0.816 0.138 0.998 0.016 0.999 0.011 0.999 0.008 0.932 0.083 0.977 0.055 0.827 0.134 0.556 0.157
94% ABAL 0.180 0.189 0.307 0.227 0.561 0.240 0.628 0.257 0.592 0.269 0.596 0.233 0.741 0.206 0.286 0.213 0.702 0.234
96% GLAS 1.000 0.000 0.975 0.157 1.000 0.000 1.000 0.000 1.000 0.000 0.900 0.301 1.000 0.000 0.930 0.256 0.800 0.401
96% YEAS 0.550 0.347 0.525 0.357 0.610 0.333 0.603 0.058 0.675 0.110 0.643 0.160 0.635 0.086 0.518 0.094 0.653 0.121
Sensitivity
53% SONA 0.891 0.091 0.752 0.116 0.903 0.086 0.872 0.095 0.900 0.092 0.743 0.134 0.672 0.143 0.573 0.235 0.588 0.156
62% WDBC 0.975 0.026 0.934 0.040 0.963 0.033 0.962 0.031 0.959 0.034 0.932 0.045 0.910 0.050 0.963 0.030 0.908 0.053
64% IONO 0.902 0.062 0.901 0.069 0.979 0.031 0.972 0.036 0.972 0.035 0.931 0.049 0.859 0.077 0.936 0.056 0.844 0.095
65% PIMA 0.881 0.044 0.776 0.057 0.787 0.058 0.749 0.060 0.745 0.058 0.821 0.056 0.696 0.062 0.828 0.058 0.672 0.066
67% WINE 0.974 0.047 0.957 0.058 0.982 0.039 0.957 0.063 0.960 0.061 0.924 0.136 0.965 0.060 0.977 0.045 0.898 0.089
70% GERM 0.893 0.039 0.797 0.053 0.716 0.059 0.727 0.055 0.761 0.051 0.801 0.052 0.654 0.064 0.866 0.041 0.624 0.059
76% VEHI 0.988 0.014 0.957 0.025 0.984 0.016 0.963 0.024 0.985 0.014 0.912 0.035 0.926 0.037 0.964 0.025 0.914 0.034
76% WPBC 0.782 0.117 0.804 0.108 0.737 0.112 0.641 0.130 0.668 0.120 0.769 0.117 0.591 0.127 0.912 0.079 0.648 0.126
86% SEGM 0.997 0.004 0.996 0.004 0.996 0.005 0.989 0.009 0.997 0.004 0.957 0.014 0.985 0.010 0.997 0.004 0.979 0.011
90% ZERN 0.922 0.021 0.916 0.022 0.877 0.024 0.849 0.026 0.917 0.020 0.885 0.031 0.841 0.026 0.924 0.020 0.822 0.083
91% VOWE 0.999 0.004 0.987 0.015 0.999 0.003 0.970 0.020 1.000 0.002 0.971 0.018 0.910 0.030 0.990 0.011 0.979 0.016
94% ABAL 0.970 0.021 0.970 0.023 0.703 0.080 0.763 0.088 0.832 0.047 0.707 0.052 0.742 0.062 0.976 0.021 0.687 0.053
96% GLAS 0.999 0.010 0.993 0.019 0.998 0.014 0.964 0.041 0.999 0.009 1.000 0.005 0.893 0.073 0.994 0.017 1.000 0.000
96% YEAS 0.998 0.007 0.987 0.016 0.965 0.046 0.944 0.063 0.794 0.061 0.957 0.136 0.832 0.060 0.996 0.045 0.934 0.089
Specificity
53% SONA 0.867 0.075 0.714 0.087 0.838 0.081 0.867 0.078 0.837 0.087 0.724 0.092 0.696 0.105 0.544 0.202 0.697 0.107
62% WDBC 0.954 0.027 0.917 0.040 0.952 0.029 0.952 0.028 0.934 0.039 0.917 0.038 0.915 0.039 0.946 0.031 0.913 0.041
64% IONO 0.937 0.041 0.867 0.064 0.928 0.051 0.937 0.048 0.938 0.047 0.906 0.051 0.860 0.058 0.899 0.057 0.867 0.084
65% PIMA 0.692 0.064 0.661 0.059 0.716 0.056 0.727 0.052 0.735 0.047 0.660 0.062 0.690 0.053 0.702 0.067 0.704 0.054
67% WINE 0.972 0.047 0.940 0.063 0.975 0.050 0.966 0.045 0.937 0.067 0.870 0.142 0.956 0.056 0.957 0.055 0.909 0.081
70% GERM 0.687 0.062 0.623 0.065 0.721 0.046 0.726 0.047 0.731 0.049 0.596 0.063 0.667 0.051 0.663 0.063 0.636 0.051
76% VEHI 0.981 0.020 0.912 0.042 0.980 0.019 0.973 0.017 0.984 0.015 0.815 0.055 0.941 0.030 0.929 0.039 0.885 0.041
76% WPBC 0.505 0.205 0.455 0.251 0.646 0.151 0.648 0.139 0.537 0.168 0.463 0.211 0.591 0.140 0.439 0.261 0.565 0.160
86% SEGM 0.992 0.011 0.984 0.016 0.994 0.009 0.992 0.007 0.995 0.007 0.956 0.017 0.987 0.010 0.984 0.015 0.943 0.027
90% ZERN 0.549 0.094 0.434 0.088 0.901 0.030 0.895 0.027 0.538 0.087 0.763 0.077 0.862 0.035 0.446 0.098 0.575 0.094
91% VOWE 0.997 0.013 0.894 0.079 0.998 0.008 0.984 0.011 1.000 0.004 0.950 0.045 0.942 0.032 0.901 0.077 0.729 0.108
94% ABAL 0.312 0.277 0.470 0.277 0.603 0.167 0.665 0.168 0.670 0.204 0.633 0.147 0.731 0.110 0.455 0.271 0.682 0.139
96% GLAS 0.999 0.005 0.973 0.156 0.999 0.007 0.981 0.021 0.999 0.004 0.900 0.301 0.944 0.039 0.928 0.255 0.800 0.401
96% YEAS 0.655 0.347 0.623 0.363 0.701 0.307 0.682 0.045 0.678 0.067 0.722 0.142 0.663 0.056 0.635 0.055 0.745 0.081
Geometric Mean
Table 5.5: Testing performance over real datasets
[Figure 5.1 appears here: three line plots of testing G-mean (vertical axis, 0.2 to 1.0) against the imbalance level of the real instances (horizontal axis, 53% to 96%); panels: (a) Control Algorithms (CSVM, TREE), (b) External Algorithms (RNDO, RNDU, SMOT), (c) Internal Algorithms.]
Figure 5.1: Testing G-mean of control (a), external (b), and internal (c) algorithms, as a function
of the imbalance level of real instances.
CSVM TREE RNDO RNDU SMOT SVDD BBAG ACOS PROP (per method: mean followed by standard deviation)
53% SONA 00.01 00.00 00.08 00.01 00.01 00.00 00.01 00.00 00.02 00.01 01.65 00.22 00.15 00.01 02.61 00.22 03.11 00.25
62% WDBC 00.02 00.01 00.44 00.01 00.03 00.01 00.03 00.01 00.27 00.05 03.93 00.30 00.14 00.01 03.29 00.25 08.99 00.46
64% IONO 00.02 00.00 00.07 00.01 00.02 00.01 00.01 00.00 00.11 00.02 02.26 00.24 00.13 00.02 01.26 00.11 04.61 00.35
65% PIMA 09.83 04.02 00.12 00.02 00.10 00.01 05.95 02.71 00.47 00.08 03.78 00.24 00.20 00.01 01.35 00.07 10.07 00.45
67% WINE 00.04 00.02 00.02 00.01 00.04 00.02 00.01 00.01 00.04 00.01 00.74 00.08 00.06 00.01 00.45 00.05 01.22 00.14
70% GERM 01.22 00.17 00.27 00.03 00.30 00.03 00.15 00.02 01.22 00.14 13.22 00.66 00.60 00.03 07.11 00.34 40.07 06.26
76% VEHI 00.03 00.01 00.09 00.01 00.06 00.01 00.03 00.01 00.64 00.08 06.56 00.41 00.30 00.02 01.18 00.13 10.81 00.55
76% WPBC 03.41 01.00 00.06 00.01 00.04 00.01 00.01 00.00 00.08 00.02 01.30 00.16 00.16 00.01 01.81 00.24 02.19 00.31
86% SEGM 00.05 00.01 00.16 00.02 00.33 00.03 00.15 00.02 04.90 00.68 16.20 01.02 00.81 00.02 07.80 00.46 96.56 05.87
90% ZERN 00.41 00.05 00.44 00.04 01.45 00.18 00.24 00.03 06.48 00.63 45.32 12.57 02.05 00.14 14.23 01.65 86.59 12.43
91% VOWE 00.02 00.01 00.07 00.01 00.08 00.01 00.04 00.01 00.87 00.09 04.61 00.41 00.73 00.08 03.66 00.20 11.42 00.48
94% ABAL 00.45 00.22 00.07 00.01 00.23 00.02 00.02 00.01 03.34 00.97 06.66 00.44 00.95 00.11 01.42 00.03 09.31 00.53
96% GLAS 00.00 00.00 00.02 00.01 00.01 00.01 00.00 00.01 00.04 00.01 00.74 00.03 00.40 00.05 00.84 00.04 01.47 00.02
96% YEAS 00.04 00.01 00.04 00.01 00.04 00.01 00.01 00.01 00.17 00.03 01.90 00.07 00.86 00.10 02.41 00.30 02.96 00.03
Table 5.6: CPU performance over real datasets
Several of the datasets can be considered hard instances, as they form valleys (points of low G-mean) for almost every evaluated method. The instances with low imbalance, SONA, WDBC and IONO, are excluded from further analyses since their results were expected and are not relevant to the goals of this work.
Traditional algorithms perform reasonably well in hard instances at lower levels of imbalance; however, for instances at higher levels their performance drops progressively. The same occurs to external approaches, but less drastically. Internal approaches are robust against class imbalance to some extent, at the cost of not being able to perform perfectly in some easy instances. The analysis above is also valid from the perspective of the error rate, although the effect is less pronounced there. This occurs because the G-mean penalizes the cases in which a classifier stands at one of the edges of the sensitivity-specificity trade-off, whereas the error rate only reports overall misclassification across the two classes.
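To make this concrete, the following minimal sketch (Python; the class proportions and the degenerate classifier are hypothetical, chosen only for illustration) contrasts the two measures on a classifier that labels every object as negative:

    import math

    def g_mean(sensitivity, specificity):
        # geometric mean of the class-wise accuracies
        return math.sqrt(sensitivity * specificity)

    # Degenerate classifier on a 90%-negative dataset: it answers "negative"
    # unconditionally, so sensitivity = 0 and specificity = 1.
    sens, spec = 0.0, 1.0
    error = 0.9 * (1 - spec) + 0.1 * (1 - sens)  # error weighted by class priors
    print(error)               # 0.10 -- apparently acceptable
    print(g_mean(sens, spec))  # 0.00 -- exposes the useless classifier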
We can also see in Table 5.5 that for each performance measure there is a method that clearly outperforms the others in most instances. In the case of the error rate, CSVM achieves the best performance in most instances, which makes sense since, according to several reports in the literature, its properties make it a better error-minimization algorithm than TREE in most cases. In sensitivity, RNDU clearly outperforms the remaining approaches, even RNDO, to which no substantial difference in G-mean was previously observed. This could be caused by over-fitting of the minority-class samples as a consequence of replicated objects (see the sketch after this paragraph). In specificity, CSVM again excels in most of the instances where it achieved the lowest error rate. Finally, the best performing method in G-mean is SMOT; however, most of the cases in which it is superior lie in the range of low and mild imbalance.
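For reference, a minimal sketch of random oversampling in the spirit of RNDO (Python with NumPy; the even-split target and the label convention, minority = 1, are our simplifications) makes explicit where the replicated objects come from:

    import numpy as np

    def random_oversample(X, y, seed=None):
        # Replicate minority samples (with replacement) until the classes are
        # even; the resulting exact duplicates are what can over-fit the
        # minority class.
        rng = np.random.default_rng(seed)
        min_idx = np.flatnonzero(y == 1)
        maj_idx = np.flatnonzero(y == 0)
        extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
        idx = np.concatenate([maj_idx, min_idx, extra])
        return X[idx], y[idx]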
Regarding the G-mean of proper algorithms in each of the hard instances, we can say that external approaches are able to achieve good solutions for mildly imbalanced cases, whereas internal approaches are a better choice for extremely imbalanced ones. It is interesting to note that although the proposed method did not always provide the best solution overall, it did find the best solution for instance YEAS, which is the most imbalanced one used in this work. In the remaining hard instances it has performance comparable to the other proper approaches.
We also evaluated the adequacy of the error rate in terms of its agreement with the G-mean when selecting the best approach per instance. According to both measures, the best algorithm for instances SONA, WDBC and GLAS is CSVM, SMOT for IONO, VEHI, SEGM and VOWE, and RNDO for ZERN. We can see that, in general, both measures agree in easy and low-imbalance instances, except for VOWE, which is hard. If we consider that hard instances are the actual occurrences of the class imbalance problem, this is clear evidence that the error rate is only appropriate in standard classification. Moreover, in hard instances the G-mean is the measure whose choice of the outperforming method respects the sensitivity-specificity trade-off.
As for CPU time consumption, we can see in Table 5.6 that it bears no relation to the level of imbalance. In general there are no further observations worth mentioning regarding the relation between computational performance and the problem itself. Other than that, the results were expected: PROP is significantly costlier than SVDD since it needs to train an additional descriptor for the complementary class. We cannot anticipate, however, how much costlier it will be for a given instance, since training the descriptions of the two classes is not equally costly, a fact supported by the evidence.
We have observed that the performance of the studied algorithms in hard instances decreases as the level of imbalance increases: markedly in traditional error-minimization pattern recognition algorithms, less dramatically in external approaches for imbalanced classification, and considerably less in internal methods. Although the algorithms in this last category are weaker at lower levels of imbalance, they are somewhat robust against this factor at higher levels. Moreover, the proposed method shows stable performance in extremely imbalanced cases, and outperforms all other methods in the most imbalanced instance of the problem.
5.3 Experiments with Synthetic Data
In this section we report and analyze the results of simulations performed using synthetically generated data. We begin by describing the domain generation framework used to synthesize the data for experimentation. As in Section 5.2, we then report the optimal parameters found for each generated dataset, and finally we display the results.
5.3.1 Domain Generation Framework
In this work we employ Nathalie Japkowicz's domain generation framework [JS02] to synthesize artificial data in terms of three parameters: size, concept complexity and imbalance level. We apply this systematic scheme for a more thorough performance assessment, given the behavior observed in the previous discussion with real data. Observations are one-dimensional vectors drawn at random from a uniformly distributed backbone model over the range [0, 1], which alternately assigns a label of 0 or 1 according to the concept complexity parameter. Concept complexity is therefore introduced by alternating the labels of contiguous ranges in feature space, which can be interpreted as class overlapping.
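A sketch of such a generator is given below (Python with NumPy). The interval-to-class assignment is our assumption for illustration: we read a concept complexity of c as c label alternations over [0, 1], with even-indexed intervals assigned to the majority class; [JS02] should be consulted for the precise recipe.

    import numpy as np

    def generate_domain(size=300, complexity=5, majority=0.9, seed=None):
        # Backbone: [0, 1] split into complexity + 1 contiguous intervals with
        # alternating labels (our reading of the framework, not its exact recipe).
        rng = np.random.default_rng(seed)
        n_int = complexity + 1
        edges = np.linspace(0.0, 1.0, n_int + 1)
        maj_iv = np.arange(0, n_int, 2)        # even intervals -> label 0 (majority)
        min_iv = np.arange(1, n_int, 2)        # odd intervals  -> label 1 (minority)
        n_maj = int(round(size * majority))

        def sample(n, intervals):
            iv = rng.choice(intervals, size=n)  # pick an interval per observation
            return rng.uniform(edges[iv], edges[iv + 1])

        X = np.concatenate([sample(n_maj, maj_iv), sample(size - n_maj, min_iv)])
        y = np.concatenate([np.zeros(n_maj, dtype=int),
                            np.ones(size - n_maj, dtype=int)])
        return X.reshape(-1, 1), y

    # e.g., instance 5C90: X, y = generate_domain(size=300, complexity=5, majority=0.9)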
In this work we systematically generated datasets of size 300 with concept complexity values of 3 and 5, and imbalance levels ranging from 50% (balanced) to 90% (highly imbalanced) in steps of 10%. The size of the datasets was fixed at 300 in order to be able to observe performance decreases as a function of the imbalance level: for larger data sizes no performance issues were observed, and for smaller sizes the algorithms were simply not competent at concept complexity 5. Table 5.7 summarizes the properties of the generated data.

       50%   60%   70%   80%   90%
c = 3  3C50  3C60  3C70  3C80  3C90
c = 5  5C50  5C60  5C70  5C80  5C90
Table 5.7: Synthetic datasets used as benchmark instances of the problem with their respective concept complexity values and levels of imbalance.
5.3.2 Algorithm Tuning
Similar to Section 5.2.2, in Table 5.8 we report the values found for each parameter of the studied algorithms for each of the artificially generated instances of the problem. Note that the misclassification cost of positive-class samples in ACOS (column +) behaves similarly to that in the experiment previously reported in Section 5.2.2, which backs up our hypothesis regarding how these parameters should be set according to a given level of imbalance.
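As an illustration of the kind of heuristic this observation points to, the sketch below (Python with scikit-learn; ACOS itself is not part of scikit-learn, and the class counts are hypothetical) penalizes positive-class errors in inverse proportion to the class frequencies:

    from sklearn.svm import SVC

    n_neg, n_pos = 950, 50                    # hypothetical counts: 95% majority class
    clf = SVC(kernel="rbf", C=1.0,
              class_weight={1: n_neg / n_pos, 0: 1.0})  # positive errors cost 19x more
    # class_weight="balanced" derives a similar ratio directly from the labels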
5.3.3 Performance
As with the real data, in Tables 5.9 and 5.10 we display the classifiers' training and testing performances, respectively, obtained with the synthetic datasets. The discussion presented in Section 5.3.4 is likewise based on the testing performance of the algorithms, although the training performance is reported as well. The CPU time consumption of the algorithms on the synthetic instances is reported in Table 5.11. The aim of this analysis, however, is to verify relevant observations already reported in the previous discussion rather than to elaborate new hypotheses.
CSVM CSVM TREE RNDO RNDO RNDU RNDU SMOT SMOT SVDD SVDD BBAG ACOS ACOS ACOS PROP PROP
C C C k C c T T - + c T
50% 3C50 1E+01 1E+02 - 1E+01 1E+02 1E+06 1E+04 20 1E+01 1E+02 1 2E-01 0.5 - 25 0.0 0.0 0 2E-01 0.5
50% 5C50 1E+02 1E+03 - 1E+01 1E+03 1E+01 1E+03 15 1E+02 1E+03 0 6E-02 0.5 - 30 0.0 0.0 1 6E-02 0.5
60% 3C60 1E+06 1E+02 - 1E+01 1E+02 1E+06 1E+03 20 1E+03 1E+02 0 1E-01 0.6 - 25 0.0 0.0 1 1E-01 0.5
60% 5C60 1E+02 1E+03 - 1E+01 1E+03 1E+02 1E+03 20 1E+02 1E+03 1 3E-02 1.5 - 30 0.0 0.0 0 3E-02 1.4
70% 3C70 1E+01 1E+03 - 1E+01 1E+03 1E+01 1E+02 20 1E+03 1E+02 0 1E-01 0.7 - 30 0.0 0.0 0 1E-01 0.5
70% 5C70 1E+02 1E+03 - 1E+02 1E+03 1E+02 1E+03 5 1E+02 1E+03 0 5E-02 0.5 - 15 0.2 0.0 0 5E-02 0.5
80% 3C80 1E+02 1E+02 - 1E+02 1E+02 1E+01 1E+02 20 1E+03 1E+02 0 8E-02 0.7 - 15 0.0 0.2 0 8E-02 0.7
80% 5C80 1E+02 1E+03 - 1E+02 1E+03 1E+05 1E+03 20 1E+02 1E+03 0 5E-02 0.5 - 20 0.2 0.2 0 5E-02 0.5
90% 3C90 1E+01 1E+02 - 1E+01 1E+02 1E+02 1E+02 20 1E+02 1E+03 0 1E-01 0.5 - 30 0.2 0.2 0 1E-01 0.6
90% 5C90 1E+06 1E+06 - 1E+03 1E-03 1E-02 1E-05 10 1E+03 1E+04 0 4E-02 1.5 - 30 0.6 0.2 0 4E-02 1.5
Table 5.8: Optimal parameters of each method obtained for each synthetic dataset.
48 Results
CSVM TREE RNDO RNDU SMOT SVDD BBAG ACOS PROP (per method: mean followed by standard deviation)
50% 3C50 0.010 0.006 0.000 0.000 0.011 0.007 0.001 0.003 0.010 0.007 0.043 0.019 0.000 0.000 0.494 0.473 0.026 0.011
50% 5C50 0.000 0.000 0.042 0.012 0.012 0.009 0.014 0.009 0.000 0.002 0.447 0.022 0.058 0.039 0.523 0.190 0.399 0.046
60% 3C60 0.000 0.000 0.000 0.000 0.010 0.007 0.010 0.010 0.011 0.010 0.043 0.012 0.016 0.014 0.036 0.135 0.024 0.009
60% 5C60 0.004 0.006 0.044 0.014 0.029 0.014 0.005 0.007 0.004 0.006 0.030 0.016 0.057 0.038 0.478 0.195 0.031 0.017
70% 3C70 0.010 0.003 0.000 0.000 0.009 0.005 0.029 0.018 0.019 0.013 0.024 0.008 0.017 0.010 0.003 0.006 0.022 0.008
70% 5C70 0.002 0.004 0.048 0.014 0.006 0.006 0.081 0.036 0.035 0.018 0.143 0.040 0.242 0.051 0.442 0.210 0.149 0.038
80% 3C80 0.002 0.004 0.000 0.000 0.002 0.004 0.056 0.027 0.034 0.025 0.012 0.007 0.153 0.066 0.005 0.007 0.011 0.006
80% 5C80 0.013 0.006 0.048 0.015 0.017 0.007 0.086 0.036 0.044 0.019 0.071 0.019 0.195 0.053 0.826 0.086 0.090 0.022
90% 3C90 0.020 0.005 0.000 0.000 0.043 0.018 0.127 0.070 0.090 0.036 0.032 0.011 0.226 0.106 0.008 0.009 0.049 0.012
90% 5C90 0.000 0.000 0.026 0.009 0.494 0.029 0.488 0.036 0.028 0.018 0.022 0.006 0.456 0.085 0.803 0.198 0.053 0.019
Error
50% 3C50 0.984 0.009 1.000 0.000 0.983 0.010 0.999 0.005 0.984 0.010 1.000 0.000 1.000 0.000 0.507 0.478 0.951 0.020
50% 5C50 1.000 0.000 0.961 0.037 0.996 0.009 0.993 0.012 1.000 0.000 0.161 0.061 0.952 0.042 0.465 0.211 0.877 0.122
60% 3C60 1.000 0.000 1.000 0.000 0.994 0.012 1.000 0.000 0.996 0.011 0.892 0.030 1.000 0.000 0.955 0.136 1.000 0.000
60% 5C60 0.997 0.008 0.958 0.039 0.963 0.024 0.996 0.009 0.997 0.008 1.000 0.000 0.956 0.042 0.525 0.204 0.950 0.031
70% 3C70 0.969 0.010 1.000 0.000 0.978 0.017 0.974 0.018 0.971 0.014 0.931 0.023 1.000 0.000 0.992 0.016 0.942 0.017
70% 5C70 0.995 0.012 0.857 0.041 0.996 0.011 0.997 0.009 0.995 0.012 0.573 0.108 0.881 0.065 0.412 0.122 0.611 0.117
80% 3C80 0.995 0.016 1.000 0.000 1.000 0.004 0.979 0.029 0.985 0.025 0.939 0.034 0.969 0.064 0.979 0.031 0.944 0.032
80% 5C80 0.990 0.018 0.857 0.044 0.995 0.013 1.000 0.000 0.988 0.019 0.787 0.057 0.888 0.069 0.350 0.095 0.815 0.057
90% 3C90 0.832 0.040 1.000 0.000 0.939 0.060 0.979 0.041 0.914 0.066 0.768 0.079 0.971 0.065 0.943 0.065 0.650 0.104
90% 5C90 1.000 0.000 0.841 0.057 0.515 0.056 0.495 0.059 1.000 0.000 0.865 0.038 0.896 0.085 0.387 0.137 0.872 0.034
Sensitivity
50% 3C50 0.996 0.009 1.000 0.000 0.996 0.009 1.000 0.000 0.996 0.009 0.910 0.041 1.000 0.000 0.504 0.467 1.000 0.002
50% 5C50 1.000 0.000 0.955 0.037 0.980 0.014 0.979 0.014 0.999 0.004 0.945 0.064 0.931 0.072 0.488 0.204 0.325 0.134
60% 3C60 1.000 0.000 1.000 0.000 0.987 0.012 0.983 0.017 0.984 0.015 1.000 0.000 0.974 0.023 0.971 0.137 0.961 0.016
60% 5C60 0.995 0.010 0.954 0.040 0.980 0.020 0.994 0.011 0.995 0.010 0.940 0.032 0.930 0.072 0.518 0.213 0.989 0.015
70% 3C70 1.000 0.000 1.000 0.000 0.998 0.006 0.970 0.029 0.986 0.019 0.997 0.006 0.975 0.015 0.999 0.004 0.995 0.007
70% 5C70 1.000 0.003 0.999 0.004 0.993 0.009 0.880 0.055 0.950 0.027 0.999 0.020 0.696 0.078 0.631 0.289 0.970 0.019
80% 3C80 0.999 0.004 1.000 0.000 0.997 0.005 0.936 0.035 0.961 0.030 1.000 0.000 0.817 0.090 0.999 0.004 1.000 0.000
80% 5C80 0.986 0.009 0.999 0.004 0.978 0.011 0.871 0.054 0.940 0.029 1.000 0.000 0.763 0.086 0.087 0.113 0.957 0.023
90% 3C90 1.000 0.000 1.000 0.000 0.960 0.026 0.859 0.081 0.910 0.040 0.995 0.009 0.747 0.122 0.999 0.004 0.992 0.008
90% 5C90 1.000 0.000 1.000 0.000 0.505 0.037 0.516 0.050 0.967 0.022 1.000 0.000 0.474 0.108 0.158 0.251 0.962 0.022
Specificity
50% 3C50 0.990 0.006 1.000 0.000 0.989 0.007 0.999 0.003 0.990 0.007 0.954 0.021 1.000 0.000 0.501 0.477 0.975 0.010
50% 5C50 1.000 0.000 0.958 0.012 0.988 0.009 0.986 0.009 1.000 0.002 0.383 0.049 0.941 0.040 0.467 0.195 0.516 0.079
60% 3C60 1.000 0.000 1.000 0.000 0.991 0.007 0.991 0.009 0.990 0.009 0.944 0.016 0.987 0.012 0.963 0.136 0.980 0.008
60% 5C60 0.996 0.006 0.955 0.014 0.971 0.014 0.995 0.007 0.996 0.006 0.970 0.016 0.942 0.039 0.516 0.197 0.969 0.018
70% 3C70 0.984 0.005 1.000 0.000 0.988 0.007 0.972 0.014 0.979 0.011 0.963 0.012 0.988 0.008 0.996 0.008 0.968 0.010
70% 5C70 0.997 0.006 0.925 0.022 0.994 0.006 0.936 0.029 0.972 0.015 0.753 0.073 0.781 0.047 0.493 0.171 0.766 0.072
80% 3C80 0.997 0.008 1.000 0.000 0.999 0.003 0.957 0.019 0.973 0.020 0.969 0.017 0.887 0.042 0.989 0.016 0.971 0.016
80% 5C80 0.988 0.008 0.925 0.024 0.986 0.007 0.933 0.029 0.964 0.016 0.886 0.032 0.821 0.047 0.138 0.115 0.883 0.031
90% 3C90 0.912 0.022 1.000 0.000 0.948 0.023 0.915 0.044 0.911 0.038 0.873 0.044 0.848 0.070 0.970 0.034 0.800 0.063
90% 5C90 1.000 0.000 0.917 0.032 0.508 0.027 0.501 0.043 0.983 0.011 0.930 0.021 0.645 0.069 0.150 0.144 0.916 0.020
Geometric Mean
Table 5.9: Training performance over synthetic datasets
CSVM TREE RNDO RNDU SMOT SVDD BBAG ACOS PROP (per method: mean followed by standard deviation)
50% 3C50 0.020 0.043 0.042 0.061 0.024 0.052 0.045 0.064 0.022 0.048 0.068 0.076 0.040 0.063 0.484 0.441 0.068 0.083
50% 5C50 0.123 0.112 0.329 0.162 0.120 0.106 0.139 0.113 0.122 0.098 0.492 0.125 0.360 0.161 0.493 0.173 0.489 0.156
60% 3C60 0.038 0.067 0.058 0.069 0.042 0.056 0.053 0.066 0.054 0.074 0.068 0.083 0.054 0.064 0.072 0.141 0.073 0.081
60% 5C60 0.160 0.127 0.323 0.179 0.190 0.134 0.151 0.120 0.155 0.119 0.292 0.119 0.314 0.175 0.492 0.168 0.268 0.131
70% 3C70 0.028 0.060 0.027 0.048 0.029 0.054 0.045 0.065 0.063 0.078 0.049 0.069 0.044 0.071 0.043 0.064 0.057 0.071
70% 5C70 0.099 0.107 0.291 0.116 0.102 0.102 0.243 0.167 0.177 0.129 0.260 0.133 0.513 0.162 0.468 0.160 0.269 0.149
80% 3C80 0.027 0.051 0.048 0.063 0.027 0.056 0.080 0.080 0.078 0.087 0.032 0.054 0.256 0.148 0.050 0.066 0.037 0.066
80% 5C80 0.101 0.099 0.300 0.101 0.098 0.100 0.230 0.146 0.191 0.124 0.205 0.126 0.489 0.174 0.583 0.154 0.211 0.153
90% 3C90 0.022 0.044 0.051 0.064 0.084 0.084 0.167 0.134 0.197 0.131 0.072 0.083 0.301 0.164 0.055 0.067 0.074 0.074
90% 5C90 0.164 0.044 0.181 0.060 0.532 0.161 0.502 0.162 0.341 0.150 0.198 0.131 0.712 0.152 0.679 0.188 0.195 0.138
Error
50% 3C50 0.979 0.063 0.974 0.072 0.976 0.075 0.976 0.078 0.981 0.057 0.954 0.095 0.980 0.059 0.510 0.464 0.951 0.092
50% 5C50 0.907 0.144 0.707 0.238 0.914 0.140 0.901 0.136 0.916 0.130 0.173 0.182 0.681 0.251 0.492 0.252 0.768 0.243
60% 3C60 0.945 0.118 0.929 0.119 0.946 0.115 0.951 0.105 0.956 0.105 0.895 0.159 0.959 0.093 0.908 0.183 0.878 0.172
60% 5C60 0.844 0.179 0.656 0.259 0.792 0.198 0.850 0.165 0.849 0.165 0.472 0.222 0.675 0.243 0.498 0.236 0.955 0.100
70% 3C70 0.944 0.152 0.963 0.103 0.953 0.124 0.958 0.118 0.931 0.135 0.930 0.152 0.956 0.129 0.916 0.164 0.940 0.129
70% 5C70 0.808 0.264 0.279 0.260 0.795 0.251 0.809 0.240 0.856 0.238 0.575 0.283 0.434 0.289 0.390 0.312 0.593 0.292
80% 3C80 0.883 0.240 0.848 0.257 0.893 0.240 0.930 0.181 0.938 0.166 0.950 0.150 0.745 0.388 0.830 0.267 0.938 0.173
80% 5C80 0.826 0.220 0.218 0.225 0.852 0.216 0.811 0.248 0.870 0.218 0.775 0.256 0.403 0.277 0.801 0.238 0.785 0.244
90% 3C90 0.828 0.360 0.678 0.442 0.810 0.374 0.775 0.397 0.813 0.366 0.773 0.388 0.740 0.428 0.628 0.459 0.663 0.454
90% 5C90 0.000 0.000 0.000 0.000 0.410 0.413 0.425 0.400 0.710 0.393 0.865 0.287 0.063 0.194 0.905 0.276 0.865 0.282
Sensitivity
50% 3C50 0.980 0.065 0.940 0.104 0.976 0.077 0.932 0.115 0.976 0.076 0.909 0.131 0.940 0.111 0.523 0.434 0.912 0.153
50% 5C50 0.850 0.174 0.637 0.238 0.844 0.166 0.818 0.183 0.838 0.170 0.843 0.191 0.598 0.254 0.523 0.248 0.255 0.230
60% 3C60 0.974 0.073 0.952 0.088 0.967 0.069 0.944 0.090 0.939 0.101 0.957 0.090 0.938 0.097 0.942 0.150 0.961 0.082
60% 5C60 0.836 0.186 0.700 0.263 0.826 0.186 0.846 0.168 0.842 0.185 0.949 0.101 0.694 0.254 0.517 0.244 0.510 0.239
70% 3C70 0.985 0.047 0.977 0.053 0.980 0.053 0.954 0.080 0.940 0.100 0.961 0.072 0.957 0.083 0.978 0.053 0.944 0.087
70% 5C70 0.948 0.090 0.924 0.125 0.949 0.087 0.730 0.208 0.806 0.156 0.822 0.162 0.511 0.215 0.610 0.279 0.801 0.180
80% 3C80 0.996 0.021 0.978 0.051 0.994 0.027 0.918 0.099 0.919 0.102 0.973 0.056 0.744 0.189 0.981 0.047 0.970 0.062
80% 5C80 0.935 0.096 0.941 0.114 0.926 0.101 0.749 0.182 0.776 0.166 0.804 0.152 0.564 0.216 0.229 0.210 0.791 0.186
90% 3C90 0.999 0.016 0.988 0.040 0.930 0.089 0.841 0.149 0.802 0.143 0.950 0.078 0.695 0.194 0.988 0.039 0.963 0.069
90% 5C90 1.000 0.000 0.980 0.052 0.482 0.177 0.512 0.186 0.648 0.174 0.790 0.144 0.331 0.182 0.216 0.249 0.793 0.155
Specificity
50% 3C50 0.978 0.047 0.954 0.065 0.974 0.059 0.950 0.071 0.977 0.051 0.927 0.083 0.957 0.068 0.483 0.468 0.925 0.097
50% 5C50 0.869 0.122 0.643 0.182 0.870 0.114 0.849 0.124 0.866 0.121 0.283 0.249 0.598 0.196 0.457 0.216 0.350 0.258
60% 3C60 0.957 0.077 0.937 0.076 0.953 0.067 0.945 0.071 0.945 0.078 0.920 0.100 0.945 0.066 0.919 0.151 0.911 0.106
60% 5C60 0.828 0.139 0.636 0.225 0.795 0.144 0.839 0.127 0.834 0.132 0.638 0.197 0.656 0.195 0.464 0.203 0.663 0.215
70% 3C70 0.959 0.107 0.968 0.061 0.964 0.076 0.952 0.076 0.931 0.087 0.941 0.094 0.953 0.086 0.941 0.098 0.938 0.082
70% 5C70 0.857 0.179 0.393 0.318 0.849 0.185 0.749 0.188 0.812 0.174 0.643 0.229 0.397 0.239 0.355 0.264 0.646 0.235
80% 3C80 0.919 0.186 0.890 0.191 0.922 0.195 0.914 0.120 0.922 0.106 0.957 0.091 0.652 0.328 0.879 0.203 0.947 0.121
80% 5C80 0.868 0.143 0.333 0.303 0.877 0.145 0.759 0.176 0.803 0.164 0.767 0.179 0.411 0.247 0.332 0.256 0.769 0.183
90% 3C90 0.838 0.353 0.693 0.436 0.790 0.354 0.716 0.363 0.735 0.328 0.773 0.370 0.600 0.361 0.644 0.456 0.661 0.441
90% 5C90 0.000 0.000 0.000 0.000 0.316 0.305 0.343 0.307 0.597 0.311 0.792 0.241 0.035 0.116 0.275 0.269 0.833 0.137
Geometric Mean
Table 5.10: Testing performance over synthetic datasets
CSVM TREE RNDO RNDU SMOT SVDD BBAG ACOS PROP (per method: mean followed by standard deviation)
50% 3C50 0.001 0.002 0.013 0.005 0.001 0.003 0.002 0.004 0.001 0.003 0.491 0.022 0.015 0.006 0.452 0.008 0.789 0.011
50% 5C50 0.001 0.003 0.023 0.004 0.002 0.004 0.002 0.004 0.002 0.004 1.247 0.106 0.026 0.009 0.722 0.018 2.240 0.116
60% 3C60 0.001 0.002 0.013 0.005 0.001 0.003 0.001 0.003 0.009 0.005 0.476 0.025 0.027 0.006 0.460 0.011 0.727 0.011
60% 5C60 0.001 0.003 0.022 0.005 0.002 0.004 0.002 0.004 0.002 0.004 0.421 0.017 0.026 0.009 0.714 0.011 0.711 0.015
70% 3C70 0.001 0.003 0.013 0.005 0.002 0.004 0.001 0.003 0.012 0.005 0.497 0.016 0.040 0.006 0.546 0.011 0.721 0.011
70% 5C70 0.001 0.003 0.022 0.006 0.003 0.005 0.002 0.004 0.010 0.004 0.623 0.026 0.041 0.009 0.355 0.009 1.019 0.030
80% 3C80 0.000 0.002 0.013 0.005 0.002 0.004 0.002 0.004 0.020 0.007 0.452 0.008 0.054 0.006 0.276 0.010 0.694 0.008
80% 5C80 0.001 0.003 0.022 0.005 0.003 0.005 0.002 0.004 0.010 0.004 0.618 0.018 0.042 0.009 0.475 0.014 1.010 0.028
90% 3C90 0.001 0.002 0.013 0.005 0.003 0.004 0.002 0.004 0.015 0.006 0.478 0.007 0.097 0.007 0.548 0.013 0.710 0.009
90% 5C90 0.002 0.004 0.023 0.005 0.003 0.005 0.002 0.004 0.016 0.006 0.466 0.010 0.075 0.009 0.671 0.013 0.726 0.014
Table 5.11: CPU performance over synthetic datasets
5.3.4 Discussion
In the following we discuss the results of the computer simulations performed with the datasets synthesized following the domain generation framework described in Section 5.3.1. As in the previous discussion of the performance on real datasets, we analyze both the generalization ability of the algorithms, expressed as their testing performance, and their computational complexity, represented by the CPU time consumed by the training algorithm and by the classification of all available objects.
In Figures 5.2 (a), 5.2 (b), and 5.2 (c) we can see sawtooth-shaped performance patterns similar to those observed previously in Figures 5.1 (a), 5.1 (b), and 5.1 (c), respectively. We can clearly see that instances where c = 5 form valleys in the graph (points of low G-mean), whilst instances where c = 3 represent the peaks of the graph (points of high G-mean). Similar to the analysis of real data, we can relate instances of concept complexities 3 and 5 to easy and hard instances, respectively. We can see that CSVM, as well as all the studied external approaches, is somewhat robust against class imbalance in this range of concept complexity; however, these methods have serious difficulties with the most complex and imbalanced instance, where their performance drops significantly, except in the case of SMOT. In the case of internal approaches we can see a considerable performance deterioration in hard instances, even at low levels of imbalance. This could be due to the fact that these algorithms are designed to operate with high-dimensional data, which is not the case for the data generated with Japkowicz's framework.
Compared to the case of real data, we can see that, in general, the valleys are less deep and the peaks are lower in the performance pattern of the synthetic data. Therefore, according to this three-axis model (concept complexity, sample size and level of imbalance), hard real-world instances from the previous analysis are characterized by concept complexities greater than 5 or sample sizes less than 300, whilst easy real-world instances have concept complexities less than 5 or sample sizes less than 300.
For a better understanding of this model we systematically generated instances varying concept complexity in the range [1, 5], sample size in the range [100, 1000] and level of imbalance in the range [50%, 95%], and tested the generalization ability, in G-mean, of TREE on each of these instances. In Figure 5.3 we can see a graphical depiction of the performance obtained as a function of the three parameters. Each slice in the illustration represents a level of concept complexity. At low complexity levels we can see nearly perfect classification (empty slices); however, as we move to higher levels we begin to notice performance decreases.
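The following sketch (Python with scikit-learn) outlines how such a grid can be instantiated, reusing the generate_domain sketch from Section 5.3.1; the split proportions and the unpruned tree are our choices for illustration, not necessarily those used to produce Figure 5.3:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import recall_score

    def g_mean_score(y_true, y_pred):
        sens = recall_score(y_true, y_pred, pos_label=1)  # minority-class recall
        spec = recall_score(y_true, y_pred, pos_label=0)  # majority-class recall
        return np.sqrt(sens * spec)

    results = {}
    for c in range(1, 6):                                 # concept complexity
        for size in range(100, 1001, 100):                # sample size
            for maj in np.arange(0.50, 0.96, 0.05):       # majority-class proportion
                X, y = generate_domain(size, c, maj)      # sketch from Section 5.3.1
                Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, stratify=y)
                tree = DecisionTreeClassifier().fit(Xtr, ytr)
                results[(c, size, round(maj, 2))] = g_mean_score(yte, tree.predict(Xte))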
[Figure 5.2 appears here: three line plots of testing G-mean (vertical axis, 0.2 to 1.0) against the imbalance level of the synthetic instances (horizontal axis, 50% to 90%); panels: (a) Control Algorithms (CSVM, TREE), (b) External Algorithms (RNDO, RNDU, SMOT), (c) Internal Algorithms.]
Figure 5.2: Testing G-mean of control (a), external (b), and internal (c) algorithms, as a function
of the imbalance level in synthetic instances.
Figure 5.3: Slice representation of the geometric mean of a Decision Tree for five complexity levels. The axes represent the size of the dataset ranging from 100 to 1000, the percentage of the majority class ranging from 50% to 95%, and the complexity of the model ranging from 1 to 5. The performance is represented in gray scale, where white is perfect classification and black is total misclassification.
Although from the presented three-dimensional point of view the slices are not completely visible, the reader should be able to get a sense of what they look like from the observable pattern. We can clearly see that the performance drops significantly at higher levels of imbalance and smaller sample sizes, which supports previous observations in the analyses with both real and synthetic data.
Regarding the best performing algorithms in G-mean per instance, we can see that our algorithm achieves, on average, results comparable to the related algorithms, and outperforms them on instance 5C90, which is the most imbalanced case of synthetic data used in this work. These results agree with those obtained with real data, where our method also outperformed the remaining algorithms in the most imbalanced instance.
The error rate is again not a good performance measure here. As an example, consider once more instance 5C90, for which the best performing algorithm is CSVM according to the error rate, and PROP according to the G-mean. However, according to the sensitivity and specificity values, CSVM unconditionally issues negative responses, which makes it useless; since the G-mean is the square root of the product of sensitivity and specificity, a sensitivity of zero forces it to zero no matter how high the specificity. PROP, on the other hand, has a reasonable sensitivity-specificity trade-off, which is better represented by the G-mean.
Lastly, in Table 5.11 we present the CPU time consumption of the algorithms. We can see that, in general, time consumption is not related to the level of imbalance. In some cases, however, computational complexity seems to be related to concept complexity. Consider for example the algorithm TREE, which shows a significant increase in time consumption when stepping from complexity level 3 to 5, but none when going from 50% to 90% of imbalance. The only method whose computational performance seems to be affected by class imbalance is BBAG; however, this result was expected given the design of the algorithm, which generates additional classifiers as the level of imbalance increases, in turn increasing the computational burden of training.
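A minimal sketch of such a scheme (Python with scikit-learn; this is our simplified reading of the bagging ensemble variation [Li07], not its exact algorithm) makes the source of this cost explicit: the number of bags, and hence of trained classifiers, grows with the majority-to-minority ratio.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def fit_balanced_bags(X, y, make_base=DecisionTreeClassifier, seed=None):
        rng = np.random.default_rng(seed)
        min_idx = np.flatnonzero(y == 1)
        maj_idx = rng.permutation(np.flatnonzero(y == 0))
        n_bags = max(1, len(maj_idx) // len(min_idx))  # grows with the imbalance level
        models = []
        for chunk in np.array_split(maj_idx, n_bags):  # minority-sized majority chunks
            idx = np.concatenate([min_idx, chunk])
            models.append(make_base().fit(X[idx], y[idx]))
        return models

    def predict_majority_vote(models, X):
        votes = np.mean([m.predict(X) for m in models], axis=0)
        return (votes >= 0.5).astype(int)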
We can see that the results on synthetic data are overall similar to those on real data. The G-mean of a simple error-minimization algorithm, measured over a systematic instantiation of Japkowicz's domain generation framework, clearly shows the behavior of performance as a function of the three parameters of the model, which verifies previous conjectures and explains the case of the easy real instances, where the imbalance level does not seem to hinder classification.
5.4 Validation
Having presented and discussed all results, in this section we verify their validity with both real and synthetic data. Given the great number of variables evaluated according to the experimental setup, i.e., algorithms, benchmark instances and performance measures, we begin by summarizing the most relevant results, and then we formulate and test the hypotheses of this work in terms of the summarized evidence.
5.4.1 Relevant Results
For each factor we narrow the levels down to those relevant to our interests. Regarding algorithms, we only report the results of external and internal approaches, in other words, methods appropriate for imbalanced classification. Although the results obtained with the sample of traditional algorithms are interesting, we can only compare our proposal to related methods. For benchmark instances we only report the results of what we have been calling hard instances, in both real and synthetic data, as they are the cases where the class imbalance problem really takes place. Finally, for performance measures we only report the G-mean because, as discussed throughout this work, it is a widely used metric in the literature on imbalanced classification. According to this selection, in Table 5.12 we report the summary of results extracted from Tables 5.5 and 5.10.
We can see that for levels of imbalance below 90%, external methods are sufficient to address imbalanced classification, unlike internal methods, which achieve low performance overall. In the three most imbalanced instances, 5C90, ABAL and YEAS, however, internal approaches considerably outperform external approaches. We can therefore say that data editing techniques are limited to a certain level of class imbalance, whereas internal approaches are very effective on extremely imbalanced data. It is interesting to note that our method outperforms all other approaches in the most extremely imbalanced instances, which is remarkable given the difficulty of such instances, especially when measured in terms of G-mean.
RNDO RNDU SMOT SVDD BBAG ACOS PROP (per method: mean followed by standard deviation)
60% 5C60 0.795 0.144 0.839 0.127 0.834 0.132 0.638 0.197 0.656 0.195 0.464 0.203 0.663 0.215
65% PIMA 0.716 0.056 0.727 0.052 0.735 0.047 0.660 0.062 0.690 0.053 0.702 0.067 0.704 0.054
70% GERM 0.721 0.046 0.726 0.047 0.731 0.049 0.596 0.063 0.667 0.051 0.663 0.063 0.636 0.051
70% 5C70 0.849 0.185 0.749 0.188 0.812 0.174 0.643 0.229 0.397 0.239 0.355 0.264 0.646 0.235
76% WPBC 0.646 0.151 0.648 0.139 0.537 0.168 0.463 0.211 0.591 0.140 0.439 0.261 0.565 0.160
80% 5C80 0.877 0.145 0.759 0.176 0.803 0.164 0.767 0.179 0.411 0.247 0.332 0.256 0.769 0.183
90% ZERN 0.901 0.030 0.895 0.027 0.538 0.087 0.763 0.077 0.862 0.035 0.446 0.098 0.575 0.094
90% 5C90 0.316 0.305 0.343 0.307 0.597 0.311 0.792 0.241 0.035 0.116 0.275 0.269 0.833 0.137
94% ABAL 0.603 0.167 0.665 0.168 0.670 0.204 0.633 0.147 0.731 0.110 0.455 0.271 0.682 0.139
96% YEAS 0.701 0.307 0.682 0.045 0.678 0.067 0.722 0.142 0.663 0.056 0.635 0.055 0.745 0.081
Table 5.12: Summary of relevant results with real and synthetic data measured in G-mean.
5.4.2 Hypothesis Testing
Although our method did not excel in every instance, we still find it remarkable that it performed especially well in extremely imbalanced cases. Using Student's t-test, we tested and successfully validated the following hypotheses (a sketch of such a test is given after the list):
PROP outperforms every related method on instance YEAS
PROP outperforms every related method on instance 5C90
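The sketch below (Python with SciPy) illustrates the form of this test; the fold-wise G-means are hypothetical placeholders standing in for the actual scores behind Table 5.12, and an independent-samples test is shown, although a paired test over matched folds would be equally reasonable:

    import numpy as np
    from scipy import stats

    # Hypothetical fold-wise G-means on instance YEAS (placeholders, not our raw scores)
    prop = np.array([0.78, 0.74, 0.71, 0.76, 0.73])
    svdd = np.array([0.70, 0.74, 0.69, 0.75, 0.68])

    # H0: PROP's mean G-mean is no greater than SVDD's; reject for small p
    t, p = stats.ttest_ind(prop, svdd, alternative="greater")
    print(f"t = {t:.3f}, one-sided p = {p:.4f}")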
The proposed algorithm performs very well in cases of extremely imbalanced (real and synthetic) data. However, although our findings meet the scope of this work, we strongly believe that the method can be further improved to provide better solutions at lower levels of imbalance, improving its flexibility.
5.5 Summary of the Chapter
In this chapter we have expanded the framework used for experimentation, describing the computer platform on which the simulations were performed and the benchmark instances of the problem used to evaluate the algorithms. We then presented the results of the experiments and discussed their implications. Finally, we tested two hypotheses regarding the classification performance of our method, which were successfully validated using the available evidence.
Chapter 6
Conclusions
In this last chapter we present an overview of all the topics covered in this thesis along with our final remarks. We begin by briefly summarizing the highlights of this work, then we present our conclusions regarding the results obtained, along with possible new research directions for future work, and finally we humbly state our achievements regarding the products elaborated in this research project.
6.1 Summary
In the first chapter of this thesis we introduced the problem of imbalanced classification and stated a formal definition for it. We also characterized traditional algorithms that, by design, fail to appropriately solve imbalanced problems in practice. In the following chapter we covered a considerable number of basic machine learning and pattern recognition concepts for the consideration of the reader. We defined the different tasks of machine learning, presented the most common measures to assess the performance of a learning machine, and discussed the reason why traditional algorithms fail to provide appropriate solutions for imbalanced classification tasks. In the next chapter we reviewed a considerable number of proper pattern recognition algorithms for imbalanced classification, along with several performance assessment techniques also developed to attain a better representation of a classifier's capabilities in that context. Given the solutions in the literature, in the following chapter we introduced our solution, designed to outperform other related methods on particular instances. Then we selected a set of related imbalanced classification algorithms and a validation scheme to perform a comparative study. Finally, in the next chapter we presented the results of the experiments, discussed their implications, and tested our hypotheses regarding the classification performance of the proposed method, which were successfully validated using the available evidence.
6.2 Results
In the previous chapter we observed that in hard instances, i.e., instances where algorithms seem to be prone to class imbalance, the performance decreases as the level of imbalance increases. The results of the experiments with synthetic data supported this observation. The systematic instantiation of Japkowicz's domain generation framework clearly showed the behavior of performance as a function of the three parameters of the model, which also provided an explanation for the case of easy real instances, where the imbalance level does not seem to hinder classification. According to these observations, we can say that in pattern recognition tasks the imbalance level is a necessary condition for performance drops to appear; however, it is not a sufficient one.
Regarding the performance of the proposed method, we found that it performs reasonably
well in extremely imbalanced data. Moreover, it shows a stable performance in the range of
high imbalance levels and achieves the best G-mean in the most imbalanced real and synthetic
instances. This indicates that our method could be particularly useful in applications where
extremely infrequent events need to be detected.
6.3 Future Work
Given that our method is arguably good at extreme levels of imbalance, future work may consider exploring its capabilities on that kind of data in more detail. Future work may also consider a generalization of the proposed method to n descriptions instead of 2. A proper decision rule would have to be designed, however, in order to aggregate the responses of n classifiers, which is not trivial. Another problem to be considered in future research is that of class distributions shifting over time, for example in a medical scenario where sudden epidemics increase the occurrence of a disease at a given location and time. In such an event, the models should alter their behavior to track the newly established concept more accurately.
With regard to the research trend of addressing new application-related problems not previously considered by the theory, we believe that we should start focusing on developing methods for real-world applications with their inherent difficulties, rather than falling into the popular practice of exploiting ready-to-use data sets. We strongly believe that the class imbalance problem is a key topic in the link between academia and industry regarding real-world pattern recognition applications, and that it will continue to receive attention from the research community as many issues remain unsolved.
6.4 Achievements
Partial results of this research were published at the 22nd International Conference on Artificial Neural Networks (ICANN), in Lausanne, Switzerland [RA12]. Also, in the context of the same project, we contributed to the publication of a related work at the 17th Iberoamerican Congress on Pattern Recognition (CIARP), in Buenos Aires, Argentina [ORV+12]. The authors are currently working on a third contribution expected to be published in an indexed journal of the specialty.
Bibliography
[AA07] A. Asuncion and D.J. Newman. UCI Machine Learning Repository, 2007.
[bio06] Wiley Encyclopedia of Biomedical Engineering, 6-Volume Set. Wiley-Interscience, 2006.
[Bis96] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, USA, 1st edition, January 1996.
[BKKP99] James C. Bezdek, James Keller, Raghu Krisnapuram, and Nikhil Pal. Fuzzy Models and Algorithms for Pattern Recognition and Image Processing (The Handbooks of Fuzzy Sets). Springer, 1999.
[BPM04] Gustavo E. A. P. A. Batista, Ronaldo C. Prati, and Maria Carolina Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations, 6(1):20-29, 2004.
[Bre96] Leo Breiman. Bagging predictors. Machine Learning, 24:123-140, August 1996.
[BS05] Camilla Brekke and Anne H.S. Solberg. Oil spill detection by satellite remote sensing. Remote Sensing of Environment, 95(1):1-13, 2005.
[BSL09] Chumphol Bunkhumpornpat, Krung Sinapiromsaran, and Chidchanok Lursinsap. Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD '09, pages 475-482, Berlin, Heidelberg, 2009. Springer-Verlag.
[CHG10] Sheng Chen, Haibo He, and E.A. Garcia. RAMOBoost: Ranked minority over-sampling in boosting. Neural Networks, IEEE Transactions on, 21(10):1624-1642, October 2010.
[CLHB03] Nitesh V. Chawla, Aleksandar Lazarevic, Lawrence O. Hall, and Kevin W. Bowyer. SMOTEBoost: Improving prediction of the minority class in boosting. In Proceedings of the Principles of Knowledge Discovery in Databases, PKDD-2003, pages 107-119, 2003.
[DBFS91] E. DeRouin, J. Brown, L. Fausett, and M. Schneider. Neural network training on unequally represented classes. Intelligent Engineering Systems Through Artificial Neural Networks, pages 135-141, 1991.
[DH73] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons Inc, 1973.
[Faw06] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861-874, 2006. ROC Analysis in Pattern Recognition.
[FS95] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory, EuroCOLT '95, pages 23-37, London, UK, 1995. Springer-Verlag.
[FSZC99] Wei Fan, Salvatore J. Stolfo, Junxin Zhang, and Philip K. Chan. AdaCost: Misclassification cost-sensitive boosting. In Proceedings of the 16th International Conference on Machine Learning, pages 97-105. Morgan Kaufmann, 1999.
[GCT09] S. M. Guo, L. C. Chen, and J. S. H. Tsai. A boundary method for outlier detection based on support vector domain description. Pattern Recognition, 42:77-83, January 2009.
[GD+95] A. Guerin-Dugue et al. Deliverable R3-B1-P - Task B1: Databases. Technical report, Elena-NervesII Enhanced Learning for Evolutive Neural Architecture, ESPRIT Basic Research Project Number 6891, June 1995. Anonymous FTP: /pub/neural-nets/ELENA/Databases.ps.Z on ftp.dice.ucl.ac.be.
[GFH06] Salvador García, Alberto Fernández, and Francisco Herrera. A proposal of evolutionary prototype selection for class imbalance problems. Lecture Notes in Computer Science, 4224:1415-1423, 2006.
[GMS09] V. García, R. A. Mollineda, and J. S. Sánchez. Index of balanced accuracy: A performance measure for skewed class distributions. In Proceedings of the 4th Iberian Conference on Pattern Recognition and Image Analysis, IbPRIA '09, pages 441-448, Berlin, Heidelberg, 2009. Springer-Verlag.
[HSMM] Mohammad Hossin, Md N. Sulaiman, Aida Mustapha, and Norwati Mustapha. A novel performance metric for building an optimized classifier. Journal of Computer Science, 7(4):582-590, 2011.
[HWM05] Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. pages 878-887, 2005.
[JS02] Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6:429-449, October 2002.
[KHM98] Miroslav Kubat, Robert Holte, and Stan Matwin. Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30:195-215, 1998.
[KM97] Miroslav Kubat and Stan Matwin. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 179-186. Morgan Kaufmann, 1997.
[LB07] Pawan Lingras and Cory J. Butz. Precision and recall in rough support vector machines. In Proceedings of the 2007 IEEE International Conference on Granular Computing, GRC '07, page 654, Washington, DC, USA, 2007. IEEE Computer Society.
[Li07] Cen Li. Classifying imbalanced data using a bagging ensemble variation (BEV). In Proceedings of the 45th Annual Southeast Regional Conference, ACM-SE 45, pages 203-208, New York, NY, USA, 2007. ACM.
[LLH10] Der-Chiang Li, Chiao-Wen Liu, and Susan C. Hu. A learning method for the class imbalance problem with medical data sets. Computers in Biology and Medicine, 40(5):509-518, 2010.
[LLL98] Charles X. Ling and Chenghui Li. Data mining for direct marketing: Problems and solutions. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), pages 73-79. AAAI Press, 1998.
[Mit97] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1st edition, 1997.
[Nil65] N. J. Nilsson. Learning Machines: Foundations of Trainable Pattern Classifying Systems. McGraw-Hill, 1965.
[Oh11] Sang-Hoon Oh. Error back-propagation algorithm for classification of imbalanced data. Neurocomputing, 74(6):1058-1061, 2011.
[ORV+12] Pablo Ormeño, Felipe Ramírez, Carlos Valle, Héctor Allende-Cid, and Héctor Allende. Robust asymmetric AdaBoost. In Luis Alvarez, Marta Mejail, Luis Gómez, and Julio Jacobo, editors, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, volume 7441 of Lecture Notes in Computer Science, pages 519-526. Springer Berlin Heidelberg, 2012.
[Par62] Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065-1076, 1962.
[RA12] Felipe Ramírez and Héctor Allende. Dual support vector domain description for imbalanced classification. In Alessandro Villa, Wlodzislaw Duch, Péter Érdi, Francesco Masulli, and Günther Palm, editors, Artificial Neural Networks and Machine Learning - ICANN 2012, volume 7552 of Lecture Notes in Computer Science, pages 710-717. Springer Berlin / Heidelberg, 2012. doi:10.1007/978-3-642-33269-2_89.
[RACVA10] Felipe Ramírez, Héctor Allende-Cid, Alejandro Veloz, and Héctor Allende. Neuro-fuzzy-based arrhythmia classification using heart rate variability features. In Sergio F. Ochoa, Federico Meza, Domingo Mery, and Claudio Cubillos, editors, SCCC, pages 205-211. IEEE Computer Society, 2010.
[RG97] Gunter Ritter and María Teresa Gallegos. Outliers in statistical pattern recognition and an application to automatic chromosome classification. Pattern Recognition Letters, 18(6):525-539, 1997.
[RK04] Bhavani Raskutti and Adam Kowalczyk. Extreme re-balancing for SVMs: A case study. SIGKDD Explorations Newsletter, 6:60-69, June 2004.
[RN02] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach (2nd Edition). Prentice Hall Series in Artificial Intelligence. Prentice Hall, 2nd edition, December 2002.
[RP06] R. Ranawana and V. Palade. Optimized precision - a new measure for classifier performance evaluation. In Evolutionary Computation, 2006. CEC 2006. IEEE Congress on, pages 2254-2261, 2006.
[SN98] Lorenza Saitta and Filippo Neri. Learning in the real world. Machine Learning, 30:133-163, February 1998.
[TD99] David M.J. Tax and Robert P.W. Duin. Support vector domain description. Pattern Recognition Letters, 20:1191-1199, 1999.
[TD04] David M. J. Tax and Robert P. W. Duin. Support vector data description. Machine Learning, 54:45-66, January 2004.
[Tom76] Ivan Tomek. Two modifications of CNN. Systems, Man and Cybernetics, IEEE Transactions on, 6(11):769-772, November 1976.
[Vap99] V.N. Vapnik. An overview of statistical learning theory. Neural Networks, IEEE Transactions on, 10(5):988-999, September 1999.
[YYD98] Alexander Ypma and Robert P.W. Duin. Support objects for domain approximation. In ICANN'98, Skövde, Sweden, pages 2-4. Springer, 1998.
[ZL06] Zhi-Hua Zhou and Xu-Ying Liu. Training cost-sensitive neural networks with methods addressing the class imbalance problem. Knowledge and Data Engineering, IEEE Transactions on, 18(1):63-77, January 2006.