Abstract—For classification applications, the role of the hidden layer neurons of a radial basis function (RBF) neural network can be interpreted as a function which maps input patterns from a nonlinearly separable space to a linearly separable space. In the new space, the responses of the hidden layer neurons form new feature vectors. The discriminative power is then determined by the RBF centers. In the present study, we propose to choose RBF centers based on the Fisher ratio class separability measure, with the objective of achieving maximum discriminative power. We implement this idea using a multistep procedure that combines the Fisher ratio, an orthogonal transform, and a forward selection search method. Our motivation for employing the orthogonal transform is to decouple the correlations among the responses of the hidden layer neurons so that the class separability provided by individual RBF neurons can be evaluated independently. The strengths of our method are twofold. First, our method selects a parsimonious network architecture. Second, this method selects centers that provide large class separation.

Index Terms—Center selection, Fisher's class separability measure, pattern classification, radial basis function (RBF) neural networks.

I. INTRODUCTION

… form templates of the input. An alternative to input clustering is input–output clustering [6], [7]. Input–output clustering differs from input clustering in that it determines center locations based not only on the input but also on the output deviations. Besides the clustering methods, the orthogonal forward selection algorithm [8]–[10] is another frequently used method for RBF center selection. The basic idea of this method is to introduce an orthogonal transform to facilitate the center selection procedure. RBF centers can also be determined using the recently developed support vector machine (SVM) method [11], [12]. The basic idea of SVM is to determine the structure of the classifier by minimizing the bounds of the training error and the generalization error. Usually, the centers selected using SVM are close to the boundary of the decision surface. In contrast, the centers selected by clustering are templates or stereotypical patterns of the training samples.

The hidden layer of the RBF neural network classifier can be viewed as a function that maps the input patterns from a nonlinearly separable space to a linearly separable space. In the new space, the responses of the hidden layer neurons form new feature vectors for pattern representation. The discriminative power is then determined by the RBF centers.
II. RBF CLASSIFIER CENTER SELECTION IN THE FRAMEWORK OF NONLINEAR APPROXIMATION
III. RBF CLASSIFIER CENTER SELECTION BASED ON FISHER RATIO CLASS SEPARABILITY MEASURE

A. Fisher Ratio for Class Separability Measure

Two measures related to class separation are the interclass difference and the intraclass spread. Let the mean and variance of the samples belonging to class $p$ and class $q$ in the direction of the $k$th feature be denoted by $(m_{p,k}, \sigma_{p,k}^2)$ and $(m_{q,k}, \sigma_{q,k}^2)$, respectively. The Fisher ratio is defined as the ratio of the interclass difference to the intraclass spread [13]

$$J_k(p,q) = \frac{(m_{p,k} - m_{q,k})^2}{\sigma_{p,k}^2 + \sigma_{q,k}^2} \qquad (7)$$

where $J_k(p,q)$ denotes the class separation between classes $p$ and $q$ in the direction of the $k$th feature. The Fisher ratio provides a good class separability measure because it is maximized when the interclass difference is maximized and the intraclass spread is minimized.

The definition of the Fisher ratio in (7) can be extended to the multiclass case. In our study, class separability is evaluated using the average value, which is defined as

$$\bar{J}_k = \frac{2}{C(C-1)} \sum_{p=1}^{C-1} \sum_{q=p+1}^{C} J_k(p,q) \qquad (8)$$

where $\bar{J}_k$ is the average class separability measure in the direction of the $k$th feature and $C$ is the total number of classes.

The Fisher ratio was originally proposed for feature selection in linearly separable problems. In the present study, the Fisher ratio can be applied to nonlinearly separable problems for RBF center selection. This is because the RBF mapping given in (6) maps patterns from the nonlinearly separable space into the linearly separable space.
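To make (7)–(8) concrete, here is a minimal numerical sketch, assuming NumPy; the function names are ours, not the paper's:

```python
import numpy as np

def fisher_ratio(f, y, p, q):
    """Fisher ratio (7) between classes p and q along one feature direction.
    f: 1-D array of feature values (one per sample), y: 1-D class labels."""
    fp, fq = f[y == p], f[y == q]
    # interclass difference divided by intraclass spread
    return (fp.mean() - fq.mean()) ** 2 / (fp.var() + fq.var())

def avg_fisher_ratio(f, y):
    """Average class separability (8): mean of (7) over all class pairs."""
    classes = np.unique(y)
    pairs = [(p, q) for i, p in enumerate(classes) for q in classes[i + 1:]]
    return float(np.mean([fisher_ratio(f, y, p, q) for p, q in pairs]))
```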
B. RBF Center Selection Using Fisher Ratio, Orthogonal Transform, and Forward Selection Procedure

We shall investigate the role of the RBF centers in determining the class separability measure. In the new linearly separable space, the mean value and the variance of the responses of the hidden layer neurons corresponding to the new features are given by

$$m_{p,k} = \frac{1}{N_p} \sum_{\mathbf{x}_i \in \text{class } p} \phi(\|\mathbf{x}_i - \mathbf{c}_k\|) \qquad (9)$$

$$\sigma_{p,k}^2 = \frac{1}{N_p} \sum_{\mathbf{x}_i \in \text{class } p} \left[\phi(\|\mathbf{x}_i - \mathbf{c}_k\|) - m_{p,k}\right]^2 \qquad (10)$$

where $N_p$ denotes the number of training samples that fall into class $p$, $\mathbf{c}_k$ is the $k$th center, and $\phi(\cdot)$ is the basis function of (6).

Equations (9)–(10) reveal that the center vectors $\mathbf{c}_k$ play a critical role in determining the mean and the variance, and hence the interclass difference and the intraclass spread. Supposing the training samples are representative, we can select a subset of the training samples as centers. The selection procedure consists of evaluating the class separability measure resulting from each of the candidate centers using the Fisher ratio, and selecting centers that provide large class separation using the forward selection procedure. However, there is a problem when implementing this idea. If $\|\mathbf{c}_i - \mathbf{c}_j\| < \varepsilon$, where $\varepsilon$ is a small positive number, $\boldsymbol{\phi}_i$ and $\boldsymbol{\phi}_j$ would be severely correlated. As a result, redundant centers may be selected if the forward selection algorithm is used without any modification. This may result in a large network architecture, where the generalization capacity of the classifier may deteriorate. In our study, we alleviate this problem by introducing the orthogonal decomposition into the procedure of class separability evaluation. We define the hidden layer neuron response matrix as

$$\boldsymbol{\Phi} = \begin{bmatrix} \phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\ \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{N1} & \phi_{N2} & \cdots & \phi_{NM} \end{bmatrix} \qquad (11)$$

where $\phi_{ij}$ is the response of the $j$th hidden layer neuron with respect to the $i$th input pattern $\mathbf{x}_i$; the components of the $j$th column $\boldsymbol{\phi}_j$ are the responses of the $j$th candidate neuron with respect to all the training patterns $\mathbf{x}_1, \ldots, \mathbf{x}_N$; and the components of the $i$th row are the responses of all candidate neurons to the $i$th input pattern $\mathbf{x}_i$. Performing the orthogonal decomposition on the matrix $\boldsymbol{\Phi}$, we obtain

$$\boldsymbol{\Phi} = \mathbf{W}\mathbf{A} \qquad (12)$$

where $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_M]$ is an orthogonal matrix and $\mathbf{A}$ is upper triangular. The columns $\mathbf{w}_j$ define orthogonal directions, and the components in each column define the distribution of the samples along that direction.
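A minimal sketch of how $\boldsymbol{\Phi}$ in (11) and the orthogonalization behind (12) might be realized follows; NumPy is assumed, and since the paper's basis function (6) is not reproduced in this excerpt, the Gaussian basis below is our assumption:

```python
import numpy as np

def rbf_responses(X, centers, width):
    """Response matrix Phi of (11): Phi[i, j] = phi(||x_i - c_j||).
    A Gaussian basis is assumed here in place of the paper's (6)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / width ** 2)

def orthogonalize(v, W):
    """One Gram-Schmidt step toward (12): strip from candidate column v its
    components along the already selected orthogonal columns in the list W."""
    for w in W:
        v = v - (w @ v) / (w @ w) * w
    return v
```

Because each orthogonalized column descends from exactly one candidate column, any separability it provides can still be attributed to a single hidden layer neuron.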
Based on different criteria, the decomposition in (12) can yield different orthogonal transforms. Principal component analysis (PCA) provides an orthogonal basis that leads to minimum loss of information resulting from dimension reduction. But the problem with the PCA method is that each resultant vector is a linear combination of the columns of $\boldsymbol{\Phi}$. Consequently, it cannot be linked back to a single column of $\boldsymbol{\Phi}$. Rank-revealing QR factorization and the orthogonal forward regression algorithm [8], [9] are orthogonal transforms in which each individual column of $\mathbf{W}$ can be linked back to an individual column of $\boldsymbol{\Phi}$, and hence to a hidden layer neuron. The rank-revealing QR factorization is based on the size of the norm of the orthogonal vectors $\mathbf{w}_j$. Columns of $\mathbf{W}$ are sorted in such an order that

$$\|\mathbf{w}_1\| \geq \|\mathbf{w}_2\| \geq \cdots \geq \|\mathbf{w}_M\| \qquad (13)$$

where $\|\mathbf{w}_j\|$ denotes the norm of $\mathbf{w}_j$.

In the framework of approximation, the orthogonal forward regression algorithm [8], [9] decomposes $\boldsymbol{\Phi}$ starting from the following regression equation:

$$\mathbf{d} = \boldsymbol{\Phi}\boldsymbol{\theta} + \mathbf{e} \qquad (14)$$

where $\mathbf{d}$ is the desired output vector, $\boldsymbol{\theta}$ is the weight vector, and $\mathbf{e}$ is the error vector. Substituting (12) into (14) yields

$$\mathbf{d} = \mathbf{W}\mathbf{g} + \mathbf{e} \qquad (15)$$

where $\mathbf{g} = \mathbf{A}\boldsymbol{\theta}$.

The orthogonal transform is formed by sorting the error reduction ratios $[\text{err}]_j = g_j^2\,\mathbf{w}_j^{\mathrm{T}}\mathbf{w}_j / (\mathbf{d}^{\mathrm{T}}\mathbf{d})$ in the following order [8], [9]:

$$[\text{err}]_1 \geq [\text{err}]_2 \geq \cdots \geq [\text{err}]_M \qquad (16)$$

In the present study, we propose an orthogonal transform that sorts the $\mathbf{w}_j$ in the following order:

$$J(\mathbf{w}_1) \geq J(\mathbf{w}_2) \geq \cdots \geq J(\mathbf{w}_M) \qquad (17)$$

where $J(\mathbf{w}_j)$ denotes the class separability measure provided by $\mathbf{w}_j$; it is obtained by applying operations (7)–(8) to $\mathbf{w}_j$. The motivation for introducing such an orthogonal transform is to select centers that provide maximum class separation. The RBF center selection procedure that combines the forward selection procedure, the orthogonal transform, and the Fisher ratio class separability measure is summarized as follows (a code sketch is given after the list).
1) Initially, take all the training examples as candidate centers. Compute the responses of all candidate hidden layer neurons with respect to all training input patterns. Form the matrix $\boldsymbol{\Phi}$ as in (11) using the responses of all hidden layer neurons.
2) In the second step, estimate the sample mean and variance of each class along the directions determined by each column of $\boldsymbol{\Phi}$, and compute the class separability measure provided by each column of $\boldsymbol{\Phi}$ using (7)–(8). The column that provides maximum class separability is selected as the first column of the matrix $\mathbf{W}$, and the corresponding neuron is selected as the first neuron to add to the network structure.
3) In the third step, orthogonalize all remaining columns of $\boldsymbol{\Phi}$ against all the columns of $\mathbf{W}$. Estimate the sample mean and variance of each class along the directions determined by each orthogonalized column, and compute the class separability measure provided by each of the orthogonalized columns. The one that yields maximum class separability is selected as the second column of $\mathbf{W}$, and the corresponding neuron is selected to add to the network structure.
4) The procedure is continued until the class separability provided by the next selected neuron is smaller than a predefined threshold.
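The following sketch puts steps 1)–4) together, reusing fisher_ratio, avg_fisher_ratio, and orthogonalize from the earlier sketches; again, the names and details are ours, not the paper's:

```python
import numpy as np

def select_centers(Phi, y, rel_threshold=0.05):
    """Greedy orthogonal forward selection of hidden layer neurons.
    At each step, every remaining column of Phi is orthogonalized against
    the selected directions and scored by the average Fisher ratio (7)-(8);
    the best column is kept (steps 2-3) until the improvement falls below
    the predefined threshold (step 4)."""
    selected, W, total_J = [], [], 0.0
    remaining = list(range(Phi.shape[1]))
    while remaining:
        best_j, best_J, best_w = None, -np.inf, None
        for j in remaining:
            w = orthogonalize(Phi[:, j].astype(float), W)
            J = avg_fisher_ratio(w, y)
            if J > best_J:
                best_j, best_J, best_w = j, J, w
        # step 4: stop once the gain drops below the threshold
        if selected and best_J < rel_threshold * total_J:
            break
        selected.append(best_j)
        W.append(best_w)
        total_J += best_J
        remaining.remove(best_j)
    return selected
```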
Once the hidden layer neurons are selected, the nonlinearly separable training patterns can be mapped into a linearly separable space. The weights that connect the hidden and output layers can then be determined using linear classifier design methods such as the linear least squares algorithm.
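For instance, with NumPy the output weights might be obtained as follows; the added bias column is our assumption, since the excerpt does not show this step:

```python
import numpy as np

def output_weights(Phi_sel, t):
    """Solve the linear least squares problem for the output layer.
    Phi_sel: N x m responses of the selected hidden neurons;
    t: N targets (e.g., +1/-1 class labels)."""
    H = np.hstack([Phi_sel, np.ones((Phi_sel.shape[0], 1))])  # bias column
    w, *_ = np.linalg.lstsq(H, t, rcond=None)
    return w
```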
An important parameter in the center selection procedure is the setting of the threshold for step 4). The value of the threshold determines the number of centers to be selected, and thus affects the network size and the performance of the classifier. A small threshold will lead to a large network structure, which tends to overfit and exhibits deteriorated generalization capacity. On the other hand, a large threshold will result in a small network structure, which carries the danger of underfitting. In our implementation, the selection procedure is terminated when the class separability improvement from adding the next selected center is smaller than 5% of the sum of the class separability of all previously selected neurons.
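In symbols, if $m$ neurons have already been selected, this stopping rule terminates the selection when

$$J(\mathbf{w}_{m+1}) < 0.05 \sum_{j=1}^{m} J(\mathbf{w}_j).$$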
Center selection is a subset search problem with the objective of finding an optimal combination of centers that provides the largest class separability. Strictly speaking, exhaustive search is the only method that guarantees finding the optimal subset. However, since the exhaustive search method has to evaluate all possible combinations of all available candidate centers, it requires an enormous amount of computation when the number of given training samples is very large. Therefore the forward selection procedure, though suboptimal, is more frequently used than exhaustive search. Also, despite the suboptimality, the forward selection procedure is capable of selecting a good combination of a small number of centers, because the center selected at each step is the one that provides the largest class separability measure. The forward selection procedure is therefore employed and combined with the orthogonal transform in the present study.

The orthogonal forward search procedure developed in the present study is similar to the orthogonal forward regression algorithms developed in [8] and [9]. Both algorithms employ the orthogonal transform and the forward selection procedure. However, the two algorithms employ different evaluation criteria in the center selection process. The orthogonal forward regression [8], [9] is developed in the context of nonlinear approximation; it evaluates candidate centers based on the approximation error reduction. Our orthogonal forward selection algorithm is developed in the context of classification; it evaluates candidate centers based on the Fisher ratio class separability measure. One advantage of our center selection algorithm is that it can deal with heavily unbalanced classes because of the employment of the Fisher ratio; it is therefore more suitable for classification tasks. As for the computational cost, our algorithm is similar to the orthogonal forward regression algorithm [8], [9].

IV. EXPERIMENTS

One synthetic example and two real-life problems from the UCI Repository of Machine Learning [20] and the Knowledge Discovery in Databases Archive [21] were used to test our algorithms.

A. Experiment 1

In the first experiment, a synthetic dataset was used to test our algorithm. The dataset has 64 samples, where 40 examples fall into class 1 and the remaining 24 examples fall into class 2. In our study, half of the examples in each class were used for training and the remaining half were used for testing. As shown in Fig. 1, the samples are nonlinearly separable in the input space. An RBF neural network classifier was used to deal with the problem.

To construct the RBF classifier, we need to select the RBF centers in the first place. Initially, all 32 training samples were taken as candidate centers. Following the center selection procedure summarized in Section III-B, we performed the orthogonal decomposition and evaluated the class separability provided by each candidate neuron. The class separability measures provided by the individual candidate neurons are shown in Fig. 2. Inclusion of a third neuron would provide just a 3.5% improvement in the class separability. Therefore, we selected two hidden layer neurons in this experiment. The role of the two RBF neurons is to map patterns from the nonlinearly separable input space into the new linearly separable feature space.
The training patterns in the new feature space are shown in Fig. 3 (training and test data are plotted with different markers). Obviously, the patterns in the new space are linearly separable. As a matter of fact, our classifier with just one hidden layer neuron is able to achieve 100% correct classification over both the training and test datasets.

For comparison, we also used the orthogonal forward regression algorithm [8], [9] to select RBF centers. As shown in Fig. 4, samples in the new feature space are still inseparable if just two neurons are used. To achieve 100% correct classification over both the training and test datasets, three hidden layer neurons were needed when the algorithm in [8], [9] was used to select the RBF centers. In this experiment, our algorithm selected more efficient hidden layer neurons. This is achieved due to the employment of the Fisher ratio class separability measure, which can deal with the unbalanced-classes problem in this example.

Fig. 1. Sample distribution in the input space for Experiment 1.
Fig. 3. Sample distribution in the new feature space for Experiment 1.
Fig. 4. Sample distribution in the new feature space (the algorithm in [8], [9]) for Experiment 1.
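For illustration only, the pipeline sketched in Section III can be exercised end to end on a stand-in dataset; the paper's actual synthetic data are not reproduced here, so the ring-versus-cluster data below are our own construction, reusing the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in unbalanced two-class problem: 40 samples on a ring (class 1),
# 24 samples clustered at the origin (class 2) -- not the paper's dataset
a = rng.uniform(0.0, 2.0 * np.pi, 40)
X1 = np.c_[np.cos(a), np.sin(a)] + 0.1 * rng.standard_normal((40, 2))
X2 = 0.3 * rng.standard_normal((24, 2))
X = np.vstack([X1, X2])
y = np.array([1] * 40 + [2] * 24)

Phi = rbf_responses(X, centers=X, width=0.5)  # all samples as candidates
picked = select_centers(Phi, y, rel_threshold=0.05)
w = output_weights(Phi[:, picked], np.where(y == 1, 1.0, -1.0))
print("selected neurons:", picked)
```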
B. Experiment 2

The Cleveland cardiology patient data [20] were used in the second experiment. The dataset has 303 samples, where each example is represented by 13 input attributes and one output attribute. Since the information provided is incomplete, the estimated class error rate is 20% [22].

Among the 303 samples, only 297 are complete; the rest have one or more missing values. In our study, missing values were replaced with the mean of the corresponding attributes. The 303 samples were divided into a training dataset and a test dataset consisting of 152 and 151 samples, respectively. Only attributes 8, 9, 12, and 13 were selected as features for pattern classification, using the sequential forward feature selection method [23].

Initially, all 152 training samples were considered as candidate center vectors. The class separability measure provided by the individual candidate hidden layer neurons was evaluated, and part of the evaluations are shown in Fig. 5. Notice that only three neurons provide significant class separation. Classifications using different numbers of hidden layer neurons were performed. As shown in Figs. 6 and 7, the classification error rate reduces as more neurons are employed in the hidden layer. However, the reduction of the classification error is trivial when the number of neurons exceeds three. This number coincides with that from the analysis based on the class separability measure.

The RBF neural network classifier has a 4–3–1 structure. The classification error rates over the training and test data are 13.2% and 19.7%, respectively. This result matches the estimated class error rate [22].
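The mean replacement of missing values described above amounts to the following, assuming missing entries are coded as NaN:

```python
import numpy as np

def impute_mean(X):
    """Replace each NaN with the mean of its attribute (column)."""
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X = X.copy()
    X[rows, cols] = col_mean[cols]
    return X
```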
Fig. 6. Classification error rate versus number of hidden layer neurons for training data of Experiment 2.
Fig. 7. Classification error rate versus number of hidden layer neurons for test data of Experiment 2.
Fig. 8. Class separability measure provided by individual neurons for Experiment 3.
Fig. 9. Number of identified owners versus number of hidden layer neurons for Experiment 3.