
Research Frontier

ENN: Extended Nearest Neighbor Method for Pattern Recognition

Bo Tang and Haibo He
Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI, USA

Digital Object Identifier: 10.1109/MCI.2015.2437512
Date of publication: 16 July 2015
Corresponding author: Haibo He (Email: he@ele.uri.edu)

Abstract

This article introduces a new supervised classification method, the extended nearest neighbor (ENN), that predicts input patterns according to the maximum gain of intra-class coherence. Unlike the classic k-nearest neighbor (KNN) method, in which only the nearest neighbors of a test sample are used to estimate a group membership, the ENN method makes a prediction in a two-way communication style: it considers not only who are the nearest neighbors of the test sample, but also who consider the test sample as their nearest neighbors. By exploiting the generalized class-wise statistics from all training data, obtained by iteratively assuming all the possible class memberships of a test sample, ENN is able to learn from the global distribution, therefore improving pattern recognition performance and providing a powerful technique for a wide range of data analysis applications.

I. Introduction

With the continuous expansion of data availability in many areas of engineering and science, such as biology, communications, social networks, global climate, and remote sensing, it becomes critical to identify patterns from vast amounts of data, to identify members of a predefined class (classification), or to group patterns based on their similarities (clustering). In the setting of classification, there are two kinds of classifiers that scientists and engineers may use: parametric classifiers, in which the underlying joint distributions/models of the data are assumed to be known but certain parameters need to be estimated, and nonparametric classifiers, in which the classification rules do not depend explicitly on the data's underlying distributions [1]. With the coming of the Big Data era [2], [3], [4], nonparametric classifiers have received particular attention because the data distributions/models of many classification problems are either unknown or very difficult to obtain in practice. Previous nonparametric classification methods, such as k-nearest neighbor (KNN), assume that data which are close together under some metric, such as Euclidean distance, are more likely to belong to the same category. Therefore, given an unknown sample to be classified, its nearest neighbors are first ranked and counted, and then a class membership assignment is made.

Nearest neighbor based pattern recognition methods have several key advantages, such as easy implementation, competitive performance, and a nonparametric computational basis which is independent of the underlying data distribution. The first modern study of the nearest neighbor approach can be traced back to 1951 by Fix and Hodges [5]. In their formalization of nonparametric discrimination, the consistency of KNN was established using a probability density estimation. They proved that if k → ∞ and k/n → 0, where k is the number of selected nearest neighbors and n is the sample size of the whole data set, the classification error of the nearest neighbor method (R_n) asymptotically converges to that of the Bayesian rule (R*): R_n → R*. In another representative work [6], Cover and Hart proved that for N-class classification problems, when k = 1 and there are infinite samples, the classification error of the nearest neighbor method is bounded by R* ≤ R_n ≤ R*(2 − (N/(N − 1))R*). This basically means that in the large sample case, the nearest neighbor method has a probability of error which is less than twice the Bayes probability of error [6]. These early classic works laid down a strong theoretical foundation for nearest neighbor based methods, which have seen considerable application in many different disciplines, such as biological and chemical data analysis [7] and disease classification and clinical outcome prediction [8], among others. In addition to pattern recognition, KNN based methods are also widely used for clustering [9], regression [10], density estimation [11], and outlier detection [12], to name a few. In fact, KNN has been identified as one of the top 10 algorithms in data mining by the IEEE International Conference on Data Mining (ICDM) held in Hong Kong in 2006 [13].


There are several important issues in the KNN method that have been of great interest since it was originally proposed. The first one is how to choose the user-defined parameter k, the number of nearest neighbors. To determine the best k, a straightforward approach is cross validation: try various values of k and choose the one with the best validation performance [14]; a minimal sketch of this selection loop is given below.
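The following sketch illustrates that selection loop in a few lines. It is not part of the authors' released MATLAB code; it assumes scikit-learn is available, and X and y are placeholders for the training features and labels.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def select_k(X, y, candidate_ks=range(1, 22, 2), folds=5):
    """Pick the k with the best mean cross-validation accuracy."""
    scores = []
    for k in candidate_ks:
        clf = KNeighborsClassifier(n_neighbors=k)
        scores.append(cross_val_score(clf, X, y, cv=folds).mean())
    return list(candidate_ks)[int(np.argmax(scores))]
```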
The second one is how to determine an appropriate dissimilarity measure. The Euclidean distance is commonly used for this purpose, but it is easily influenced by irrelevant features and therefore suffers from noisy samples, especially with high-dimensional data (i.e., the curse of dimensionality problem). Feature normalization and dimensionality reduction methods are able to remedy this issue to a certain extent. More recently, distance metric learning from training data has been of great interest for improving the prediction performance of KNN based methods [15]-[21]. For example, Goldberger et al. proposed a stochastic nearest neighbor approach for distance metric learning with the optimum leave-one-out performance on training data [15]. Weinberger et al. proposed to learn a Mahalanobis distance metric by semidefinite programming [16], [17]. Hastie and Tibshirani used a local linear discriminant analysis to estimate distance metrics for computing neighbors and building local decision boundaries for classification [21]. The third issue in the KNN method is its high computational complexity for large data sets. To address this issue, Gowda and Krishna proposed a condensed nearest neighbor approach using a concept of mutual nearest neighbors to reduce the size of the training data used for decision making [22]. Bagui et al. employed a concept of ranking to reduce the computational complexity [23]. To speed up the search for nearest neighbors and overcome memory limitations, many data structures, known as fast KNN methods, have been proposed to implement the KNN method, such as the k-d tree [24], nearest feature line [25], orthogonal search tree [26], ball tree [27], and principal axis search tree [28], to name a few.

In general, two types of errors may occur in KNN based methods: for samples in areas of higher density, the k nearest neighbors could lie in areas of slightly lower density, or vice versa [29]. These types of errors may result in misclassification, especially when the nearest neighbors are dominated by samples from other classes. Error rates are further increased when sample sizes are small or data are imbalanced (see a comprehensive survey on imbalanced learning [30]). In this paper, we introduce the extended nearest neighbor (ENN), a new method based on generalized class-wise statistics which can represent intra-class coherence. Unlike the classic KNN method, where only the nearest neighbors are considered for classification, our proposed ENN method takes advantage of all available training data to make a classification decision: it assigns a class membership to an unknown sample so as to maximize the intra-class coherence. By exploiting the information from all available data to maximize the intra-class coherence, ENN is able to learn from the global distribution, therefore improving pattern recognition performance.

Online Resource for ENN: http://www.ele.uri.edu/faculty/he/ENN/ENN.html
Supplementary materials and a source code implementation of the ENN classifier have been provided. The source code includes all three versions of the ENN classifier, ENN, ENN.V1 and ENN.V2, in Matlab M-file format.


II. Problem Formulation: Limitations of KNN

The reason for these two types of errors [29] is that the KNN method is sensitive to the scale or variance of the distributions of the predefined classes [31]. The nearest neighbors of an unknown observation will tend to be dominated by the class with the highest density. For example, in a two-class classification problem, if we assume that class 1 has a smaller variance than class 2, then the class 1 samples dominate their neighborhood with higher density (i.e., a more concentrated distribution), whereas the class 2 samples are distributed in regions with lower density (i.e., a more spread out distribution). In this case, for class 2 samples, the number of nearest neighbors from class 1 may exceed that from class 2. For instance, for those class 2 samples which are close to the region of class 1, there may be a large number of class 1 neighbors because of the higher density of class 1. Only for those class 2 samples which are far away from the region of higher density of class 1 will their nearest neighbors from class 2 be dominant.

Fig. 1 shows one example (a two-class classification scenario) of decision making in the classic KNN method for two Gaussian distributions with different means and variances, in which we assume that their prior probabilities p(ω) are the same. In this figure, the x-axis represents the data value (one-dimensional data in this case), and the y-axis represents the class-conditional probability p(x|ω). For the eight data points illustrated here, we assume that data points x_1, x_2, x_3 and x_4 are sampled from the class 1 distribution (variance σ_1^2 = 1), and data points x_5, x_6, x_7 and x_8 are sampled from the class 2 distribution (variance σ_2^2 = 4). Alongside the figure we also list their corresponding k_1/k and k_2/k ratios (here k = 5 is the parameter of KNN, and k_1 and k_2 are the numbers of nearest neighbors belonging to class 1 and class 2, respectively): the ratio k_1/k is 0.6 for x_1 through x_6 and 0.8 for x_7 and x_8, so k_1/k exceeds k_2/k for every point. In this case, the Bayesian method can correctly classify all eight data points by calculating their corresponding posterior probabilities according to the Bayes theorem. However, under the KNN method, all eight data points are predicted as class 1. This means that the four data points from class 2, namely x_5, x_6, x_7 and x_8, are misclassified because their k_1/k ratio values are larger than their corresponding k_2/k ratio values. The reason is that the nearest neighbors of these four data points are dominated by the class 1 data because of the distributions. This has been a long-standing limitation of the classic KNN method in the literature [29], [31].

[Figure 1: One example of the KNN rule in comparison with the Bayesian rule for a two-class classification problem. Data points x_1, x_2, x_3 and x_4 are sampled from the class 1 distribution (σ_1^2 = 1), and x_5, x_6, x_7 and x_8 from the class 2 distribution (σ_2^2 = 4). The Bayesian rule correctly classifies all eight points, whereas the KNN rule (with k = 5) misclassifies the four class 2 points x_5, x_6, x_7 and x_8, demonstrating the well-known limitation of the classic KNN method.]
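To make this scale sensitivity concrete, the short sketch below sets up a one-dimensional scenario in the spirit of Fig. 1 and compares a plain k = 5 KNN vote with the Bayes rule that uses the true class-conditional densities, evaluated on points drawn from the wider class 2 distribution. The means, priors, and sample sizes here are illustrative assumptions of ours, not the exact values behind the figure.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Class 1 is concentrated (variance 1), class 2 is spread out (variance 4).
m1, s1, m2, s2, n = 0.0, 1.0, 3.0, 2.0, 200
x = np.concatenate([rng.normal(m1, s1, n), rng.normal(m2, s2, n)])
y = np.concatenate([np.zeros(n, int), np.ones(n, int)])

def knn_predict(xq, k=5):
    idx = np.argsort(np.abs(x - xq))[:k]          # k nearest training points
    return np.bincount(y[idx], minlength=2).argmax()

def bayes_predict(xq):
    # Equal priors, known class-conditional densities p(x | class)
    return int(norm.pdf(xq, m2, s2) > norm.pdf(xq, m1, s1))

test = rng.normal(m2, s2, 500)                    # fresh points from class 2
knn_err = np.mean([knn_predict(t) != 1 for t in test])
bayes_err = np.mean([bayes_predict(t) != 1 for t in test])
print(f"class-2 error: KNN {knn_err:.2f} vs Bayes {bayes_err:.2f}")
```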
III. Proposed Approach: ENN for Classification

Here we describe the extended nearest neighbor (ENN) method to solve this deficiency in the classic KNN method. The advantages of the original KNN approach are retained in our ENN classifier, such as easy implementation and competitive classification performance. However, unlike the classic KNN approach, the new ENN classifier makes a prediction by considering not only who are the nearest neighbors of the test sample, but also who consider the test sample as their nearest neighbors.

A. ENN Classifier

We describe our ENN method starting with the two-class classification problem. The generalized statistic based on nearest neighbors was first proposed in the two-sample test problem to evaluate whether two distributions are mixed well or widely spread [32], [33]. In our approach, unlike the generalized statistic over all samples as discussed in [33], we first build a generalized class-wise statistic for each class from the pooled samples S_1 and S_2, using each sample along with its nearest neighbors. We define the generalized class-wise statistic T_i for class i as follows:

  T_i = (1 / (n_i k)) Σ_{x ∈ S_i} Σ_{r=1}^{k} I_r(x, S = S_1 ∪ S_2),  for i = 1, 2   (1)

where S_1 and S_2 denote the samples in class 1 and class 2, respectively, x denotes one single sample in S = S_1 ∪ S_2, n_i is the number of samples in S_i, and k is the user-defined number of nearest neighbors. The indicator function I_r(x, S) indicates whether both the sample x and its r-th nearest neighbor belong to the same class, defined as follows:

  I_r(x, S) = 1, if x ∈ S_i and NN_r(x, S) ∈ S_i;  0, otherwise   (2)

where NN_r(x, S) denotes the r-th nearest neighbor of x in S. This equation means that, for either class, if both the sample x and its r-th nearest neighbor in the pool S belong to the same class, then the indicator function I_r(x, S) equals 1; otherwise, it equals 0. In this way, the generalized class-wise statistic T_i in Eq. (1) is the ratio of the number of nearest neighbors belonging to the same class for class i with respect to the product of the sample size of class i (i.e., n_i) and the number of nearest neighbors under consideration (k). A large T_i indicates that the samples in S_i are much closer together and their nearest neighbors are dominated by samples of the same class, whereas a small T_i indicates that samples in S_i have an excess of nearest neighbors from the other class. Note that the generalized class-wise statistic satisfies 0 ≤ T_i ≤ 1, with T_i = 1 when all the nearest neighbors of class i data are also from class i, and T_i = 0 when all the nearest neighbors are from other classes. Based on this discussion, we can use T_i to represent the data distribution across multiple classes. Therefore, we introduce the concept of intra-class coherence, defined as follows:

  Θ = Σ_{i=1}^{2} T_i   (3)
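As a concrete illustration (ours, not the authors' released MATLAB implementation), the generalized class-wise statistic of Eq. (1) and the coherence of Eq. (3) can be computed with a brute-force neighbor search as follows. The quadratic distance matrix is acceptable only for small training sets, and X_train and y_train in the usage comment are placeholders.

```python
import numpy as np

def class_wise_statistics(X, y, k):
    """T_i from Eq. (1): fraction of same-class hits among the k nearest
    neighbors of every training sample of class i."""
    X, y = np.asarray(X, float), np.asarray(y)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a sample is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]           # indices of the k nearest neighbors
    same = (y[nn] == y[:, None])                # same-class indicator I_r(x, S)
    classes = np.unique(y)
    T = np.array([same[y == c].sum() / (np.sum(y == c) * k) for c in classes])
    return classes, T

# Intra-class coherence of Eq. (3): Theta = sum_i T_i
# classes, T = class_wise_statistics(X_train, y_train, k=5); theta = T.sum()
```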

Given an unknown sample Z to be classified, we iteratively assign it to class 1 and class 2, respectively, to obtain two new generalized class-wise statistics T_i^j, where j = 1, 2:

  T_i^j = (1 / (n'_i k)) Σ_{x ∈ S'_{i,j}} Σ_{r=1}^{k} I_r(x, S' = S_1 ∪ S_2 ∪ {Z}),  i, j = 1, 2   (4)

where n'_i is the size of S'_{i,j}, and S'_{i,j} is defined as

  S'_{i,j} = S_i ∪ {Z} when j = i,  and  S'_{i,j} = S_i when j ≠ i   (5)

For this two-class classification problem, we have four generalized class-wise statistics: T_1^1, T_1^2, T_2^1 and T_2^2. Unlike the previous KNN classifiers, which predict that the unknown sample Z belongs to a class based only on its nearest neighbors, the ENN classifier predicts the class membership of Z according to the following target function:

  f_ENN = arg max_{j ∈ {1, 2}} Σ_{i=1}^{2} T_i^j = arg max_{j ∈ {1, 2}} Θ_j   (6)

where

  Θ_j = Σ_{i=1}^{2} T_i^j   (7)

This means that our ENN method makes the prediction based on which decision results in the largest intra-class coherence when we iteratively assume the test sample Z to be each possible class.

We now present the detailed calculation steps for a two-class classification problem using our ENN method. First, let us assume that the observation Z should be classified as class 1. Then, for each training sample, we re-calculate its k nearest neighbors and obtain the new generalized class-wise statistics according to Eq. (4), denoted as T_1^1 for class 1 and T_2^1 for class 2. Second, we assume that the observation Z should be classified as class 2; then, for each training sample, we again re-calculate its k nearest neighbors and obtain the new generalized class-wise statistics according to Eq. (4), denoted as T_1^2 for class 1 and T_2^2 for class 2. Then a classification decision is made according to Eq. (6).

The aforementioned two-class ENN decision rule can be easily extended to multi-class classification problems. Specifically, for an N-class classification problem, we have the ENN Classifier algorithm (see the box below).

ENN Classifier: Given an unknown sample Z to be classified, we iteratively assign it to each possible class j, j = 1, 2, ..., N, and compute the generalized class-wise statistic T_i^j for each class i, i = 1, 2, ..., N. Then, the sample Z is classified according to:

  f_ENN = arg max_{j ∈ {1, 2, ..., N}} Σ_{i=1}^{N} T_i^j   (8)
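Building on the class_wise_statistics sketch above, the basic ENN rule of Eq. (8) can be written directly: hypothesize each label for Z, recompute the class-wise statistics on the augmented set, and keep the label with the largest coherence. Again, this is an illustrative sketch rather than the released code, and it deliberately pays the recomputation cost that ENN.V1 below avoids.

```python
import numpy as np

def enn_predict(X, y, z, k):
    """Basic ENN rule, Eq. (8): label z with the class hypothesis that
    maximizes the intra-class coherence sum_i T_i^j."""
    X, y = np.asarray(X, float), np.asarray(y)
    classes = np.unique(y)
    coherence = []
    for j in classes:
        X_aug = np.vstack([X, np.asarray(z, float)])   # S' = S ∪ {Z}
        y_aug = np.append(y, j)                        # Z hypothesized as class j
        _, T_j = class_wise_statistics(X_aug, y_aug, k)
        coherence.append(T_j.sum())                    # Theta_j of Eqs. (6)-(7)
    return classes[int(np.argmax(coherence))]
```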
B. ENN.V1 Classifier: An Equivalent Version of ENN

The underlying idea of the ENN rule shown in Eq. (8) is that we classify the test sample Z based on which decision results in the greatest intra-class coherence among all possible classes. We can further demonstrate that the classification of Z in our ENN rule depends not only on who are the nearest neighbors of Z, but also on who consider Z as one of their nearest neighbors.

For the new observation Z, let us assume that there are k_1 nearest neighbors from class 1 and k_2 nearest neighbors from class 2, where k_1 + k_2 = k is the total number of nearest neighbors investigated. First, let us assume the observation Z to be classified as class 1; then, we count the number of class 1 data points whose k nearest neighbors now contain an increased number of class 1 samples (denote this number as Δn_1^1), and we also count the number of class 2 data points whose k nearest neighbors now contain a decreased number of class 2 samples (denote this number as Δn_2^1). In a similar way, we further assume the observation Z to be a member of class 2, and count the number of class 1 data points with a decreased number of class 1 samples among their k nearest neighbors (denote this number as Δn_1^2), and the number of class 2 data points with an increased number of class 2 samples among their k nearest neighbors (denote this number as Δn_2^2). In this way, Δn_i^j represents how many of the k nearest neighbors from each class change because of the introduction of the new sample Z, when it is iteratively assumed to be each possible class. To clearly demonstrate this concept, we present a detailed calculation example in our Supplementary Material Section 1 for a two-class classification problem.

With these discussions, we present an equivalent version of the ENN classifier in the box ENN.V1 below.

ENN.V1: Given an unknown sample Z to be classified, we iteratively assign it to each possible class j, j = 1, 2, ..., N, and predict the class membership according to:

  f_ENN.V1 = arg max_{j ∈ {1, 2, ..., N}} { (Δn_j^j + k_j − k T_j) / ((n_j + 1) k) − Σ_{i ≠ j} Δn_i^j / (n_i k) }   (9)

where k is the user-defined number of nearest neighbors, n_i is the number of training data for class i, k_i is the number of the nearest neighbors of the test sample Z from class i, Δn_i^j represents the change in the k nearest neighbors for class i when the test sample Z is assumed to be class j, and T_i represents the generalized class-wise statistic of the original class i (i.e., without the introduction of the test sample Z).
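A sketch of the ENN.V1 rule in Eq. (9) is given below. It assumes that the per-class statistics T_i and, for every training point, the distance to and the class of its current k-th nearest neighbor have been precomputed once (a preprocessing sketch appears in the Computational Analysis subsection below). The convention that Z enters a training point's neighbor list whenever it is closer than that point's current k-th neighbor is our reading of the Δn terms, not a quotation of the released code.

```python
import numpy as np

def enn_v1_predict(z, X, y, classes, T, nn_kth_dist, nn_kth_cls, k):
    """ENN.V1 rule, Eq. (9). T, nn_kth_dist (distance from each training point
    to its current k-th neighbor) and nn_kth_cls (that neighbor's class) are
    precomputed once over the training set."""
    dz = np.linalg.norm(X - z, axis=1)
    entered = dz < nn_kth_dist                  # training points that now count Z
    k_i = np.array([(y[np.argsort(dz)[:k]] == c).sum() for c in classes])
    n_i = np.array([(y == c).sum() for c in classes])
    scores = []
    for cj in classes:
        s = 0.0
        for i, ci in enumerate(classes):
            if ci == cj:                        # Z (class j) displaces a non-i neighbor
                dn = np.sum(entered & (y == ci) & (nn_kth_cls != ci))
                s += (dn + k_i[i] - k * T[i]) / ((n_i[i] + 1) * k)
            else:                               # Z (class j) displaces a class-i neighbor
                dn = np.sum(entered & (y == ci) & (nn_kth_cls == ci))
                s -= dn / (n_i[i] * k)
        scores.append(s)
    return classes[int(np.argmax(scores))]
```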

Proof of ENN.V1. We iteratively assign the unknown sample Z to class j to obtain the new generalized class-wise statistic T_i^j for class i. According to Eq. (4), when i = j we have

  T_i^j = (1 / (n'_i k)) Σ_{X ∈ S_i ∪ {Z}} Σ_{r=1}^{k} I_r(X, S' = S_1 ∪ S_2 ∪ {Z})
        = (1 / ((n_i + 1) k)) [ Σ_{X ∈ S_i} Σ_{r=1}^{k} I_r(X, S_1 ∪ S_2 ∪ {Z}) + Σ_{r=1}^{k} I_r(Z, S_1 ∪ S_2 ∪ {Z}) ]
        = (1 / ((n_i + 1) k)) [ Σ_{X ∈ S_i} Σ_{r=1}^{k} I_r(X, S_1 ∪ S_2) + Δn_i^j + k_i ]
        = (n_i k T_i + Δn_i^j + k_i) / ((n_i + 1) k)   (10)

and when i ≠ j we have

  T_i^j = (1 / (n'_i k)) Σ_{X ∈ S_i} Σ_{r=1}^{k} I_r(X, S' = S_1 ∪ S_2 ∪ {Z})
        = (1 / (n_i k)) [ Σ_{X ∈ S_i} Σ_{r=1}^{k} I_r(X, S_1 ∪ S_2) − Δn_i^j ]
        = T_i − Δn_i^j / (n_i k)   (11)

Therefore, from Eq. (8), we have

  f_ENN = arg max_{j ∈ {1, 2, ..., N}} Σ_{i=1}^{N} T_i^j
        = arg max_{j ∈ {1, 2, ..., N}} Σ_{i=1}^{N} (T_i^j − T_i)
        = arg max_{j ∈ {1, 2, ..., N}} { (T_j^j − T_j) + Σ_{i ≠ j} (T_i^j − T_i) }
        = arg max_{j ∈ {1, 2, ..., N}} { (Δn_j^j + k_j − k T_j) / ((n_j + 1) k) − Σ_{i ≠ j} Δn_i^j / (n_i k) }
        = f_ENN.V1   (12)

This proves that our ENN.V1 is equivalent to the original ENN method.
We now present an example in Fig. 2 of the detailed implementation of our two ENN versions, i.e., f_ENN in Eq. (8) and f_ENN.V1 in Eq. (9). Note that the equivalent target function f_ENN.V1 provides the same classification result without the recalculation of T_i^j required by f_ENN. Both target functions provide a general formula for our ENN method for multi-class classification problems. In practical applications, whether to use f_ENN or f_ENN.V1 is the user's preference. We would like to note that f_ENN is more straightforward to calculate and implement, but from a computational point of view, we recommend using f_ENN.V1, as this target function does not require recalculating the generalized class-wise statistic T_i^j for every test sample.

[Figure 2: (A), (B) and (C) demonstrate the detailed procedure of the two versions of the ENN classifier on eight training samples (four per class) with k = 3. In (A), the training samples are stored and their k nearest neighbors and distances are calculated to compute the generalized class-wise statistics, giving T_1 = 0.750 and T_2 = 0.583. (B) assumes the test sample Z to be class 1, which yields T_1^1 = 0.800, T_2^1 = 0.500, Δn_1^1 = 1, Δn_2^1 = 1, k_1 = 2, k_2 = 1. (C) assumes Z to be class 2, which yields T_1^2 = 0.750, T_2^2 = 0.733, Δn_1^2 = 0, Δn_2^2 = 3. According to Eq. (8), T_1^1 + T_2^1 = 1.300 and T_1^2 + T_2^2 = 1.483; according to Eq. (9), the scores are −0.03 for j = 1 and 0.15 for j = 2. Both versions therefore assign Z to class 2, but the recalculation of the generalized class-wise statistics T_i^j is avoided in ENN.V1.]

C. ENN.V2 Classifier: An Approximate Version of ENN

The proposed ENN classifier can resolve the scale-sensitive problem in the classic KNN decision rule, as discussed previously in this paper. From Eq. (8), the ENN method classifies the test sample Z based on which decision results in the greatest intra-class coherence among all predefined classes. Moreover, from Eq. (9), the classification decision of our ENN method is based on both the samples who are the nearest neighbors of Z and the samples who regard Z as one of their nearest neighbors. We provide an approximate but straightforward version of the ENN method under certain assumptions in the box ENN.V2 below.

ENN.V2: Given an unknown sample Z to be classified, under the following two conditions:
(1) all classes have the same number of data samples n, i.e., a balanced classification problem;
(2) for all i ≠ j, Δn_i^j / ((n + 1) n k) → 0;
the ENN decision rule can be approximated as follows:

  f_ENN.V2 = arg max_{j ∈ {1, 2, ..., N}} { Δn_j + k_j − k T_j }   (13)

where k is the user-defined number of nearest neighbors, Δn_j denotes the number of samples in class j who consider Z as one of their k nearest neighbors, k_j is the number of the nearest neighbors of the test sample Z from class j, and T_j represents the generalized class-wise statistic of the original class j (i.e., without the introduction of the test sample Z).

Proof of ENN.V2. Consider the two-class classification scenario first. Without loss of generality, if the ENN method classifies Z as class 2 according to Eq. (9), we have

  (Δn_1^1 + k_1 − k T_1) / ((n + 1) k) − Δn_2^1 / (n k) < (Δn_2^2 + k_2 − k T_2) / ((n + 1) k) − Δn_1^2 / (n k)

which, using 1/(n k) = 1/((n + 1) k) + 1/((n + 1) n k), can be rearranged as

  (Δn_1^1 + Δn_1^2 + k_1 − k T_1) / ((n + 1) k) + Δn_1^2 / ((n + 1) n k)
    < (Δn_2^2 + Δn_2^1 + k_2 − k T_2) / ((n + 1) k) + Δn_2^1 / ((n + 1) n k)   (14)

When Δn_2^1 / ((n + 1) n k) and Δn_1^2 / ((n + 1) n k) are approximately zero, we have

  Δn_1^1 + Δn_1^2 + k_1 − k T_1 < Δn_2^2 + Δn_2^1 + k_2 − k T_2   (15)

We now define

  Δn_1 = Δn_1^1 + Δn_1^2   (16)
  Δn_2 = Δn_2^1 + Δn_2^2   (17)

In this way, Δn_1 represents the total number of samples in class 1 who consider Z as one of their k nearest neighbors, and Δn_2 represents the total number of samples in class 2 who consider Z as one of their k nearest neighbors. Therefore, Eq. (15) can be expressed as follows:

  Δn_1 + k_1 − k T_1 < Δn_2 + k_2 − k T_2   (18)

Therefore, for a two-class classification problem, the classification rule is

  f_ENN.V2 = arg max_{j ∈ {1, 2}} { Δn_j + k_j − k T_j }   (19)

The above derivation can be easily extended to N-class classification problems, which yields the approximate version of our ENN method shown in Eq. (13).

This approximate version of ENN in Eq. (13) also explains why the proposed ENN method can address the scale-sensitive problem in the classic KNN rule. Whereas the KNN rule only considers k_j to make a decision, the proposed ENN method considers three factors to make a prediction: two positive terms, k_j and Δn_j, and one negative term, k T_j. The two positive terms demonstrate that our ENN approach considers not only who are the nearest neighbors of the test sample, but also who consider the test sample as their nearest neighbors. The negative term means that a class with a greater generalized class-wise statistic is given a larger penalty when we estimate the class membership of an unknown sample. The combination of these three factors provides the unique advantage of our ENN method to improve classification performance. We would also like to note that in many practical applications, if the two conditions described in the ENN.V2 algorithm are satisfied, using Eq. (13) as a simple approximation of our ENN method can in general provide competitive classification performance.
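Following this discussion, Eq. (13) needs only quantities that are cheap to obtain at test time. The sketch below (ours, with the same precomputed inputs as the ENN.V1 sketch above) scores each class by Δn_j + k_j − k T_j; it trades the exactness of Eq. (9) for simplicity, which is the design intent of ENN.V2.

```python
import numpy as np

def enn_v2_predict(z, X, y, classes, T, nn_kth_dist, k):
    """Approximate ENN.V2 rule, Eq. (13): score each class j by
    Delta n_j + k_j - k*T_j, using quantities precomputed over the training set."""
    dz = np.linalg.norm(X - z, axis=1)
    entered = dz < nn_kth_dist                    # training points that now count Z
    order = np.argsort(dz)[:k]                    # Z's own k nearest neighbors
    scores = []
    for i, c in enumerate(classes):
        dn_j = np.sum(entered & (y == c))         # class-c points adopting Z
        k_j = np.sum(y[order] == c)               # Z's neighbors from class c
        scores.append(dn_j + k_j - k * T[i])
    return classes[int(np.argmax(scores))]
```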


D. Computational Analysis

As a simple and reliable technique, nearest neighbor based methods have been widely used in both research and industry for pattern recognition, regression, feature reduction, and clustering, among others. However, one major concern with this kind of method is its computational complexity. For the classic KNN method, there is no training stage (i.e., the so-called instance-based learning or lazy learning), and the computational complexity is O(M log M) in the testing stage for every test sample, where M is the number of training data. To improve the efficiency, it makes more sense to use a preprocessing stage in which data structures are built to capture the relationships among training data, such as the k-d tree [24], ball tree [27], and nearest feature line [25], or to reduce the size of the training data by eliminating data that may not provide useful information for decision making, such as the condensed nearest neighbor rule, which uses a concept of mutual nearest neighborhood to select only samples close to the decision boundary [22].

In our ENN method, a preprocessing stage can be developed to calculate the generalized class-wise statistic T_i for each class and to build weighted KNN graphs, in which training samples are the vertices and the distances from each sample to its nearest neighbors are the edges. The weighted KNN graphs can then be used to calculate Δn_i^j efficiently for a given test sample. To do so, we calculate the distances between the test sample and every training sample to obtain k_i for each class at a computational complexity of O(M log M). After that, we compare these distances against the weighted KNN graphs to obtain Δn_i^j with a computational complexity of only O(M). Therefore, the total computational complexity of our proposed ENN method is O(M log M) + O(M), which is on the same scale as the O(M log M) of the KNN method. Notice that the complexity of the preprocessing stage of building the weighted KNN graphs is O(M^2 log M) if we perform it in a direct manner (i.e., for each training sample, we calculate and sort the distances to all the other training data to find its nearest neighbors). Fortunately, the existing techniques proposed in the literature to improve the efficiency of KNN based methods can be easily integrated into our ENN method to speed up the search for nearest neighbors, and hence reduce the complexity of the ENN method. For example, using the k-d tree structure [24] can reduce the computational complexity of the preprocessing stage and the testing stage to O(M log M) and O(log M), respectively.
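A minimal sketch of this preprocessing stage, assuming SciPy's cKDTree is available, is shown below; it packages exactly the per-point quantities (k-th neighbor distance and class) and the class-wise statistics T_i that the earlier ENN.V1 and ENN.V2 sketches consume. Function and field names are ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def enn_preprocess(X, y, k):
    """One-off preprocessing for the ENN sketches: a k-d tree over the training
    set, each point's k nearest neighbors (self excluded), the distance to and
    class of its k-th neighbor, and the class-wise statistics T_i of Eq. (1)."""
    X, y = np.asarray(X, float), np.asarray(y)
    tree = cKDTree(X)
    dist, idx = tree.query(X, k=k + 1)        # k+1: the closest hit is the point itself
    dist, idx = dist[:, 1:], idx[:, 1:]
    classes = np.unique(y)
    same = (y[idx] == y[:, None])
    T = np.array([same[y == c].sum() / ((y == c).sum() * k) for c in classes])
    return dict(tree=tree, kth_dist=dist[:, -1], kth_cls=y[idx[:, -1]],
                classes=classes, T=T, X=X, y=y, k=k)

# At test time, one tree query gives Z's own neighbors, and one vectorized
# comparison against kth_dist gives the Delta n terms used by the ENN.V1 and
# ENN.V2 sketches above.
```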
IV. Experiments and Results Analysis

To evaluate the performance of our ENN classifier, we conduct several experiments with Gaussian data classification, handwritten digits classification [34], and twenty real-world datasets from the UCI Machine Learning Repository [35]. In all these experiments, we explore every odd value of k from 1 to 21, i.e., k = 1, 3, 5, 7, ..., 21.

We first test our proposed ENN classifier on 3-dimensional Gaussian data with 3 classes, in comparison with the classic KNN rule and the maximum a posteriori (MAP) rule. In the MAP rule, we estimate the parameters of the Gaussian distribution of each class from the given training data. In our current experiments, we use 250 training samples and 250 test samples for each class, with the following parameters:

  μ_1 = [8 5 5]^T,  Σ_1 = σ_1^2 I
  μ_2 = [5 8 5]^T,  Σ_2 = σ_2^2 I
  μ_3 = [5 5 8]^T,  Σ_3 = σ_3^2 I
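For readers who wish to reproduce the setup (though not the exact curves in Fig. 3, which depend on the authors' runs), a sketch of the data generator is given below. The commented lines show how it could feed the illustrative ENN functions defined earlier in this article, and the variance triple shown corresponds to Model 2 of Fig. 3.

```python
import numpy as np

def make_gaussian_data(variances, n_per_class=250, seed=0):
    """Draw one train/test split of the 3-class Gaussian model described above."""
    rng = np.random.default_rng(seed)
    means = np.array([[8, 5, 5], [5, 8, 5], [5, 5, 8]], float)
    def draw():
        X = np.vstack([rng.normal(m, np.sqrt(v), size=(n_per_class, 3))
                       for m, v in zip(means, variances)])
        y = np.repeat(np.arange(3), n_per_class)
        return X, y
    return draw(), draw()          # (train, test)

(Xtr, ytr), (Xte, yte) = make_gaussian_data((5, 20, 5))
# pre = enn_preprocess(Xtr, ytr, k=5)
# err = np.mean([enn_v2_predict(z, Xtr, ytr, pre["classes"], pre["T"],
#                               pre["kth_dist"], 5) != t for z, t in zip(Xte, yte)])
```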
To demonstrate the effectiveness of the proposed ENN classifier in overcoming this deficiency of the KNN rule, we examine the classification error rate of each class under different variances. Fig. 3 shows the overall classification error rates (in percentage) averaged over 100 random runs for four variance settings. The detailed classification performance of each class for the three comparative methods, i.e., ENN, KNN, and MAP, is provided in Supplementary Material Section 2. Our results show that the proposed ENN method performs much better than the classic KNN method.

[Figure 3: Overall classification error rate (in percentage) of ENN and KNN, as a function of k, for four Gaussian data models: Model 1 (σ_1^2 = 5, σ_2^2 = 5, σ_3^2 = 5), Model 2 (σ_1^2 = 5, σ_2^2 = 20, σ_3^2 = 5), Model 3 (σ_1^2 = 5, σ_2^2 = 5, σ_3^2 = 20), and Model 4 (σ_1^2 = 5, σ_2^2 = 20, σ_3^2 = 20).]

We also evaluate the classification performance of our proposed ENN classifier on the entire MNIST handwritten digits dataset [34], which is a widely used benchmark in the community. In this experiment, we use 60,000 images as training data and 10,000 images as test data. Fig. 4 shows the comparison of classification error rates (in percentage) between the classic KNN rule and our ENN rule, where the minimum error of 2.61% is obtained at k = 7 for our ENN method. The overall classification error rates using the ENN rule are notably lower than those of the KNN rule (Fig. 4).

[Figure 4: Overall classification error rate (in percentage) of ENN and KNN for handwritten digits classification.]


Table 1. The average test error rate and standard deviation of ENN (k = 3) compared with KNN (k = 3), naive Bayes, LDA, and a neural network. All results are shown in percentage. Each result is averaged over 100 random runs: in each run, we randomly select half of the data as the training data and the remaining half as the test data. For each dataset, the best result among the five methods is highlighted in the original table, and the ENN result is additionally marked when ENN performs significantly better than KNN under a one-tailed t-test (p = 0.01).

Datasets               ENN            KNN            Naive Bayes     LDA             Neural Network
Ionosphere             17.35 ± 2.69   18.55 ± 2.94   19.83 ± 2.86    20.68 ± 3.00    18.48 ± 2.90
Vowel                   8.50 ± 1.92   11.73 ± 1.80   43.90 ± 2.98    40.94 ± 1.97    45.17 ± 3.15
Sonar                  22.67 ± 3.97   24.49 ± 4.06   29.22 ± 4.16    33.75 ± 5.11    27.24 ± 4.37
Wine                    4.49 ± 2.16    7.08 ± 2.20    5.07 ± 1.71     2.58 ± 2.13     7.21 ± 2.88
Breast-cancer           4.04 ± 0.87    4.44 ± 1.07    5.76 ± 1.04     5.88 ± 1.18     4.57 ± 1.17
Haberman               31.32 ± 6.53   32.13 ± 5.79   36.35 ± 10.85   34.63 ± 9.92    37.40 ± 10.58
Breast tissue          36.71 ± 6.37   42.40 ± 6.19   44.02 ± 6.18    41.24 ± 6.60    67.62 ± 5.22
Movement libras        26.33 ± 2.88   32.16 ± 2.97   45.41 ± 3.39    39.90 ± 3.31    40.87 ± 4.34
Mammographic masses    21.16 ± 1.43   22.27 ± 1.55   18.96 ± 1.57    19.17 ± 1.72    49.40 ± 0.29
Segmentation           24.71 ± 3.07   27.85 ± 3.04   12.64 ± 2.93    12.79 ± 2.88    23.06 ± 5.95
ILPD                   40.04 ± 3.58   40.91 ± 3.68   26.87 ± 2.69    29.64 ± 3.39    32.09 ± 3.53
Pima Indians diabetes  31.22 ± 2.15   33.08 ± 2.37   29.44 ± 2.19    28.29 ± 2.01    25.38 ± 2.77
Knowledge              23.93 ± 4.69   27.11 ± 4.45   12.66 ± 2.45     6.97 ± 2.53    14.42 ± 3.86
Vertebral              35.13 ± 4.83   37.64 ± 5.06   47.93 ± 3.41    36.88 ± 4.83    45.11 ± 3.12
Bank note               0.09 ± 0.18    0.12 ± 0.23   15.27 ± 1.25     2.60 ± 0.60     0.18 ± 0.37
Magic                  20.10 ± 0.33   20.42 ± 0.36   25.69 ± 0.61    23.30 ± 0.34    29.62 ± 0.38
Pen digits              0.74 ± 0.15    0.94 ± 0.17   15.38 ± 0.41    11.22 ± 0.52    11.65 ± 0.70
Faults                  0.91 ± 0.52    1.65 ± 0.86    0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00
Letter                  5.60 ± 0.25    7.44 ± 0.25   40.09 ± 0.47    29.80 ± 0.37    28.33 ± 0.52
Spam                   10.08 ± 0.59   11.52 ± 0.63   10.31 ± 0.78     9.64 ± 0.61    15.32 ± 1.02

We further apply our ENN classifier to twenty real-world datasets from the UCI Machine Learning Repository [35]. Table 1 presents the classification error rates (in percentage) for these twenty UCI datasets in comparison with the classic KNN, naive Bayes, linear discriminant analysis (LDA), and a neural network. It shows that ENN always performs better than KNN and that, in 17 out of these 20 datasets, the performance improvement is significant (one-tailed t-test, p = 0.01). In the neural network implementation, we use the classic multilayer perceptron (MLP) structure with 10 hidden neurons and 800 backpropagation iterations for training at a learning rate of 0.01. The results are averaged over 100 random runs, and in every run, we randomly select half of the data as the training data and the remaining half as the test data.
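This evaluation protocol is straightforward to restate in code. The sketch below is ours: the fit_predict callables are hypothetical stand-ins for ENN and KNN, and the paired one-tailed t-test mirrors the significance test used for Table 1.

```python
import numpy as np
from scipy.stats import ttest_rel

def compare_over_splits(X, y, fit_predict_a, fit_predict_b, runs=100, seed=0):
    """Random half/half splits; paired one-tailed t-test that method A has a
    lower error than method B. Each fit_predict_* takes (Xtr, ytr, Xte) and
    returns predicted labels."""
    rng = np.random.default_rng(seed)
    err_a, err_b = [], []
    for _ in range(runs):
        perm = rng.permutation(len(y))
        half = len(y) // 2
        tr, te = perm[:half], perm[half:]
        err_a.append(np.mean(fit_predict_a(X[tr], y[tr], X[te]) != y[te]))
        err_b.append(np.mean(fit_predict_b(X[tr], y[tr], X[te]) != y[te]))
    t, p_two_sided = ttest_rel(err_a, err_b)
    p_one_tailed = p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2
    return np.mean(err_a), np.mean(err_b), p_one_tailed
```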
Let us consider the Spam database as an example. This database includes 4601 e-mail messages, of which 2788 are legitimate messages and 1813 are spam messages. Each message is represented by 57 attributes, of which 48 are the frequency of a particular word (FW), 6 are based on the frequency of a particular character (FC), and 3 are continuous attributes that reflect the use of capital letters (SCL) in the e-mails. Fig. 5 shows the detailed classification error rates (in percentage) for different values of k, which clearly demonstrates that the ENN method achieves consistently lower error rates than the KNN rule. For a detailed performance comparison between ENN and KNN on all twenty datasets, please refer to Supplementary Material Section 3.

[Figure 5: Overall classification error rate (in percentage) of ENN and KNN for spam e-mail classification.]

We would like to note that, for all these experiments, there appears to be no significant difference between ENN and KNN when k = 1. The reason might be that with such a small value, k = 1, the classification is determined by the single closest neighbor alone; therefore, when k = 1, neither method really takes the data distribution into account, which would explain why both methods show very close performance in this case.



V. Conclusions and Future Work

In this paper, we proposed an innovative ENN classification methodology based on the maximum gain of intra-class coherence. By analyzing the generalized class-wise statistics, ENN is able to learn from the global distribution to improve pattern recognition performance. Unlike the classic KNN rule, which only considers the nearest neighbors of a test sample to make a classification decision, the ENN method considers not only who are the nearest neighbors of the test sample, but also who consider the test sample as their nearest neighbors. We have developed three versions of the ENN classifier, ENN, ENN.V1 and ENN.V2, and analyzed their foundations and their relationships to each other. The experimental results on numerous benchmarks demonstrated the effectiveness of our ENN method.

As ENN is a new classification method, we are currently exploring its power and developing its variants for numerous machine learning and data mining problems, such as imbalanced learning, class-conditional density estimation, clustering, and regression, among others. For example, for imbalanced learning problems [30], [36], [37], notice that the distribution scale-sensitivity issue addressed by our ENN method can also be considered as an unequal distribution learning problem; therefore, we expect that our ENN method could be applied readily to learning from imbalanced data. Meanwhile, the idea of the proposed ENN method may also benefit class-conditional density estimation [38], [39], if one considers a different size of neighborhood for each class according to our defined generalized class-wise statistic T_i. One similar idea is adaptive or variable-bandwidth kernel density estimation, where the width of the kernels is varied for different samples [11], [40]. Furthermore, similar to the different variations of the KNN method, other forms of ENN classifiers could be developed, such as a distance-weighted ENN. As nearest neighbor based classification methods are used in many scientific applications because of their easy implementation, nonparametric nature, and competitive classification performance, we would expect the new ENN method and its future variations to find widespread use in many areas of data and information processing.

VI. Acknowledgment

This research was partially supported by the National Science Foundation (NSF) under grants ECCS 1053717 and CCF 1439011, and by the Army Research Office under grant W911NF-12-1-0378.

References
[1] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
[2] Nature. (2008). Big data. [Online]. Available: http://www.nature.com/news/specials/bigdata/index.html
[3] Z. Zhou, N. Chawla, Y. Jin, and G. Williams, "Big data opportunities and challenges: Discussions from data analytics perspectives," IEEE Comput. Intell. Mag., vol. 9, no. 4, pp. 62-74, 2014.
[4] Y. Zhai, Y. Ong, and I. Tsang, "The emerging 'big dimensionality'," IEEE Comput. Intell. Mag., vol. 9, no. 3, pp. 14-26, 2014.
[5] E. Fix and J. L. Hodges Jr., "Discriminatory analysis-nonparametric discrimination: Consistency properties," U.S. Air Force Sch. Aviation Medicine, Randolph Field, Texas, Project 21-49-004, Tech. Rep. 4, 1951.
[6] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inform. Theory, vol. 13, no. 1, pp. 21-27, 1967.
[7] P. Horton and K. Nakai, "Better prediction of protein cellular localization sites with the k nearest neighbors classifier," in Proc. 5th Int. Conf. Intelligent Systems for Molecular Biology, 1997, vol. 5, pp. 147-152.
[8] R. Parry, W. Jones, T. Stokes, J. Phan, R. Moffitt, H. Fang, L. Shi, A. Oberthuer, M. Fischer, W. Tong, and M. Wang, "k-Nearest neighbor models for microarray gene expression analysis and clinical outcome prediction," Pharmacogenomics J., vol. 10, no. 4, pp. 292-312, 2010.
[9] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, California, 1967, vol. 1, pp. 281-297.
[10] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," Amer. Stat., vol. 46, no. 3, pp. 175-185, 1992.
[11] G. R. Terrell and D. W. Scott, "Variable kernel density estimation," Ann. Stat., vol. 20, no. 3, pp. 1236-1265, 1992.
[12] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," ACM SIGMOD Rec., vol. 29, no. 2, pp. 93-104, 2000.
[13] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowl. Inform. Syst., vol. 14, no. 1, pp. 1-37, 2007.
[14] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proc. Int. Joint Conf. Artificial Intelligence, 1995, vol. 14, pp. 1137-1145.
[15] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighbourhood components analysis," in Proc. Advances in Neural Information Processing Systems, 2005, pp. 513-520.
[16] K. Q. Weinberger, J. Blitzer, and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," in Proc. Advances in Neural Information Processing Systems, 2005, pp. 1473-1480.
[17] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," J. Machine Learn. Res., vol. 10, pp. 207-244, Feb. 2009.
[18] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, 2005, pp. 539-546.
[19] J. Wang, P. Neskovic, and L. N. Cooper, "Improving nearest neighbor rule with a simple adaptive distance measure," Pattern Recognit. Lett., vol. 28, no. 2, pp. 207-213, 2007.
[20] A. Bellet, A. Habrard, and M. Sebban, "A survey on metric learning for feature vectors and structured data," Comput. Res. Repository, 2013, to be published.
[21] T. Hastie and R. Tibshirani, "Discriminant adaptive nearest neighbor classification," IEEE Trans. Pattern Anal. Machine Intell., vol. 18, no. 6, pp. 607-616, 1996.
[22] K. C. Gowda and G. Krishna, "The condensed nearest neighbor rule using the concept of mutual nearest neighborhood," IEEE Trans. Inform. Theory, vol. 25, no. 4, pp. 488-490, 1979.
[23] S. C. Bagui, S. Bagui, K. Pal, and N. R. Pal, "Breast cancer detection using rank nearest neighbor classification rules," Pattern Recognit., vol. 36, no. 1, pp. 25-34, 2003.
[24] J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An algorithm for finding best matches in logarithmic expected time," ACM Trans. Math. Softw., vol. 3, no. 3, pp. 209-226, 1977.
[25] S. Z. Li, K. L. Chan, and C. Wang, "Performance evaluation of the nearest feature line method in image classification and retrieval," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 11, pp. 1335-1349, 2000.
[26] J. McNames, "A fast nearest-neighbor algorithm based on a principal axis search tree," IEEE Trans. Pattern Anal. Machine Intell., vol. 23, no. 9, pp. 964-976, 2001.
[27] T. Liu, A. W. Moore, and A. Gray, "New algorithms for efficient high-dimensional nonparametric classification," J. Machine Learn. Res., vol. 7, pp. 1135-1158, 2006.
[28] V. Garcia, E. Debreuve, and M. Barlaud, "Fast k nearest neighbor search using GPU," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition Workshops, 2008, pp. 1-6.
[29] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality," in Proc. 30th Annu. ACM Symp. Theory of Computing, 1998, pp. 604-613.
[30] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263-1284, 2009.
[31] J. H. Friedman, S. Steppel, and J. Tukey, A Nonparametric Procedure for Comparing Multivariate Point Sets, Stanford Linear Accelerator Center Computation Research Group, Tech. Memo no. 153, 1973.
[32] M. F. Schilling, "Multivariate two-sample tests based on nearest neighbors," J. Amer. Stat. Assoc., vol. 81, no. 395, pp. 799-806, 1986.
[33] M. Schilling, "Mutual and shared neighbor probabilities: Finite- and infinite-dimensional results," Adv. Appl. Probab., vol. 18, no. 2, pp. 388-405, 1986.
[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[35] M. Lichman. (2013). UCI Machine Learning Repository. School of Information and Computer Science, Univ. of California, Irvine, CA. [Online]. Available: http://archive.ics.uci.edu/ml/
[36] H. He, Y. Bai, E. A. Garcia, and S. Li, "ADASYN: Adaptive synthetic sampling approach for imbalanced learning," in Proc. IEEE Int. Joint Conf. Neural Networks, 2008, pp. 1322-1328.
[37] S. Chen, H. He, and E. A. Garcia, "RAMOBoost: Ranked minority over-sampling in boosting," IEEE Trans. Neural Networks, vol. 21, no. 10, pp. 1624-1642, 2010.
[38] G. Biau, F. Chazal, D. Cohen-Steiner, L. Devroye, and C. Rodriguez, "A weighted k-nearest neighbor density estimate for geometric inference," Electron. J. Stat., vol. 5, pp. 204-237, 2011.
[39] U. von Luxburg and M. Alamgir, "Density estimation from unweighted k-nearest neighbor graphs: A roadmap," in Proc. Advances in Neural Information Processing Systems, 2013, pp. 225-233.
[40] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization. Hoboken, NJ: Wiley, 2009.
