Downloaded by [NWFP University of Engineering & Technology - Peshawar] on 20 June 2014, at 00:39.
Publisher: Taylor & Francis.
To cite this article: J. A. Benediktsson & J. R. Sveinsson (1997) Feature extraction for
multisource data classification with artificial neural networks, International Journal of
Remote Sensing, 18:4, 727-740, DOI: 10.1080/014311697218728
int. j. remote sensing, 1997, vol. 18, no. 4, 727-740

Feature extraction for multisource data classification with artificial neural networks
1. Introduction
Representation of input data for neural networks is important and can significantly affect the classification performance of neural networks. The selection of an input representation is related to the general pattern recognition problem of selecting input classification variables, which strongly affects classifier design: if the input variables show significant differences from one class to another, the classifier can be designed more easily and with better performance. The selection of variables is therefore a key problem in pattern recognition, and is termed feature selection or feature extraction (Fukunaga 1990). Feature extraction can thus be used to transform the input data and, in some sense, find the best input representation for neural networks.
Feature selection is generally considered a process of mapping the original measurements into more effective features. If the mapping is linear, the mapping function is well defined and the task is simply to find the coefficients of the linear function that maximize or minimize a criterion. Therefore, if a proper criterion exists for evaluating the effectiveness of features, well-developed techniques of linear algebra can be used for simple criteria. On the other hand, if the criterion is too complicated for an analytical solution, iterative techniques can be applied. In many applications of pattern recognition, there are important features which are non-linear rather than linear functions of the original measurements. In that case, the basic problem is to find a proper non-linear mapping for the given data. Since no general algorithm is available to generate non-linear mappings systematically, the selection of features becomes very problem oriented (Fukunaga 1990).
Neural networks have been applied successfully in various fields. A characteristic of neural networks is that they may need a long training time but are relatively fast data classifiers. However, the principal reason for using neural network methods for classification of multisource remote sensing and geographic data is that these methods are distribution-free. Since multisource data are in general of multiple types, the data from the various sources can have different statistical distributions. The neural network approach does not require explicit modelling of the data from each source. In addition, neural network methods have been shown to approximate class-conditional probabilities for their entire training set in the mean-squared sense (Ruck et al. 1990). Consequently, there is no need to treat the data sources independently as in many statistical methods (Benediktsson et al. 1990, Benediktsson and Swain 1992). The neural network approach also avoids the problem in statistical multisource analysis of specifying how much influence each data source should have in the classification (Benediktsson and Swain 1992).
For high-dimensional data, large neural networks (with many inputs and a large number of hidden neurons) are often used. The training time of a large neural network can be very long. Also, the training methods for neural networks are based on estimating the weights and biases of the networks. If the neural networks are large, then many parameters need to be estimated from a finite number of training samples. In that case, overfitting may be observed; that is, the neural networks may not generalize well, although high classification accuracy can be achieved on training data. Also, for high-dimensional data the curse of dimensionality or the Hughes phenomenon (Fukunaga 1990) may occur. Hence, it is necessary to reduce the input dimensionality of the neural network in order to obtain a smaller network which performs well in terms of both training and test classification accuracies. This leads to the importance of feature extraction for neural networks, that is, finding the best representation of the input data in a lower dimensional space, where the representation does not lead to a significant decrease in overall classification accuracy compared to the one obtained in the original feature space. However, few feature extraction algorithms are available for neural networks. Though some of the conventional feature extraction methods, such as principal component analysis (PCA) and discriminant analysis (DA), might be used, such methods do not take full advantage of the way neural networks define complex decision boundaries.
Recently, it was shown (Lee and Landgrebe 1993 a,b,c) that all the features necessary to achieve the same classification accuracy as in the original feature space, for a given classifier, can be obtained from the decision boundary defined by the classifier. The method is called decision boundary feature extraction (DBFE). The DBFE method takes full advantage of the characteristics of a classifier by selecting features directly from its decision boundary. Therefore, this method should be considered as a candidate for both feature extraction and input data representation for neural network classifiers.
This paper applies several feature extraction methods for neural networks to reduce multisource remote sensing and geographic data to relatively few features. The goal is to do this without much loss in overall classification accuracy.
2. Feature extraction
Feature extraction can be viewed as finding a set of vectors that represent an observation while reducing the dimensionality. In pattern recognition, it is desirable to extract features that focus on discriminating between classes. Although a reduction in dimensionality is desirable, it must be achieved without sacrificing the discriminative power of classifiers. The development of feature extraction methods has been one of the most important problems in the field of pattern analysis and has been studied extensively. Feature extraction methods can be both unsupervised and supervised, and also linear and non-linear.
Below, three different linear feature extraction methods are discussed. For all these methods a feature matrix is defined, and the eigenvalues of the feature matrix are ordered in decreasing order along with their corresponding eigenvectors. The number of input dimensions corresponds to the number of eigenvectors selected (Fukunaga 1990). The transformed data are determined by
Y = W^T X    (1)
where W is the transformation matrix composed of the eigenvectors of the feature matrix, X is the data in the original feature space, and Y is the transformed data in the new feature space.
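The eigenvector transform of equation (1) can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper: the function name is hypothetical, the example uses the global covariance matrix as the feature matrix (which makes this particular instance PCA), and the code uses the row-vector convention X @ W rather than the paper's column-vector form W^T X.

```python
import numpy as np

def linear_feature_extraction(X, feature_matrix, n_features):
    """Project data onto the leading eigenvectors of a feature matrix.

    X is (n_samples, n_dims); feature_matrix is a symmetric
    (n_dims, n_dims) matrix. Returns the transformed data using the
    eigenvectors belonging to the n_features largest eigenvalues,
    i.e. equation (1) in row-vector convention.
    """
    eigvals, eigvecs = np.linalg.eigh(feature_matrix)
    order = np.argsort(eigvals)[::-1]        # decreasing eigenvalues
    W = eigvecs[:, order[:n_features]]       # (n_dims, n_features)
    return X @ W

# Example: global covariance matrix as the feature matrix (PCA).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 0] *= 10.0                              # one high-variance direction
cov = np.cov(X, rowvar=False)
Y = linear_feature_extraction(X, cov, 2)
```

Swapping in a different feature matrix (for example S_W^-1 S_B from discriminant analysis, after symmetrization or with a generalized eigensolver) yields the other linear methods discussed below.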
M_0 = sum_i P(omega_i) M_i    (5)
where M_i is the mean vector for the i-th class, S_i is the covariance matrix for the i-th class, and P(omega_i) is the prior probability of the i-th class. The criterion of optimization may be defined as
J = tr(S_W^-1 S_B)    (6)
where tr(.) denotes the trace of a matrix. New feature vectors are selected to maximize the criterion.
The necessary transformation from X to Y in equation (1) is found by taking the eigenvalue-eigenvector decomposition of the matrix S_W^-1 S_B and then taking the transformation matrix as the normalized eigenvectors corresponding to the eigenvalues in decreasing order. However, this method does have some shortcomings. For example, since discriminant analysis mainly utilizes class mean differences, the feature vectors selected by discriminant analysis are not reliable if the mean vectors are near to one another. Since the lumped covariance matrix is used in the criterion, discriminant analysis may lose information contained in class covariance differences. Also, for M classes the maximum rank of S_B is M - 1, since S_B depends on M_0. Usually S_W is of full rank and, therefore, the maximum rank of S_W^-1 S_B is M - 1. This indicates that at most M - 1 features can be extracted by this approach. Another problem is that the criterion function in equation (6) generally has no direct relationship to the error probability.
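The discriminant analysis procedure above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name is hypothetical, and it builds S_W and S_B from prior-weighted class statistics as in equation (5), then takes eigenvectors of S_W^-1 S_B. With M classes, S_B has rank at most M - 1, which is the feature limit the text describes.

```python
import numpy as np

def discriminant_analysis_features(X, y, n_features):
    """Sketch of the tr(S_W^-1 S_B) criterion of equation (6).

    Builds the within-class scatter S_W (prior-weighted class
    covariances) and the between-class scatter S_B (prior-weighted
    outer products of mean deviations from the mixture mean M_0),
    then projects onto the leading eigenvectors of S_W^-1 S_B.
    """
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    dim = X.shape[1]
    M0 = np.zeros(dim)                       # mixture mean, equation (5)
    S_W = np.zeros((dim, dim))
    means = {}
    for c, p in zip(classes, priors):
        Xc = X[y == c]
        means[c] = Xc.mean(axis=0)
        M0 += p * means[c]
        S_W += p * np.cov(Xc, rowvar=False)
    S_B = np.zeros((dim, dim))
    for c, p in zip(classes, priors):
        d = (means[c] - M0)[:, None]
        S_B += p * (d @ d.T)
    # S_W^-1 S_B is not symmetric, so use the general eigensolver.
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]   # decreasing eigenvalues
    W = eigvecs[:, order[:n_features]].real
    return X @ W
```

For two classes only one eigenvalue of S_W^-1 S_B is non-zero, so a single projected feature already carries all the mean-difference information, illustrating the M - 1 limit.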
W(R_i) = (area of R_i) / (total area of decision boundary).    (8)
It can be shown that the rank of the DBFM is the smallest dimension in which the same classification accuracy can be obtained as in the original feature space. Also, the eigenvectors of the DBFM corresponding to non-zero eigenvalues are the necessary feature vectors to achieve the same classification accuracy as in the original feature space (Lee and Landgrebe 1993 a,c).
Lee and Landgrebe (1993 b) use a non-parametric procedure to find the decision boundary numerically. From the decision boundary, normal vectors N(X) are estimated using a gradient approximation, N_i. Then the effective decision boundary feature matrix is estimated using the normal vectors as
S_EDBFM = sum_i N_i N_i^T.    (9)
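The accumulation in equation (9) can be sketched as below. This is an illustrative sketch under stated assumptions, not Lee and Landgrebe's procedure: the normals are taken as normalized gradients of a user-supplied discriminant function at points assumed to lie on the boundary, and the averaging over points is an assumption about the scaling (the rank and eigenvector structure, which is what DBFE uses, are unaffected by it).

```python
import numpy as np

def edbfm(boundary_points, grad_fn):
    """Sketch of the effective decision boundary feature matrix, eq. (9).

    boundary_points: (n, dim) samples lying (approximately) on the
    decision boundary. grad_fn(x) returns the gradient of the
    discriminant function at x; its normalized value is the unit
    normal N_i. The matrix is the average outer product N_i N_i^T.
    """
    dim = boundary_points.shape[1]
    S = np.zeros((dim, dim))
    for x in boundary_points:
        n = grad_fn(x)
        n = n / np.linalg.norm(n)            # unit normal at x
        S += np.outer(n, n)
    return S / len(boundary_points)

# Toy example: a linear boundary x_0 = 0 in 3-D. Every normal points
# along the first axis, so the EDBFM has rank 1 and a single feature
# suffices, matching the rank property stated in the text.
pts = np.column_stack([np.zeros(50), np.random.randn(50), np.random.randn(50)])
S = edbfm(pts, lambda x: np.array([1.0, 0.0, 0.0]))
```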
where p is a pattern number, n is the sample size, t_pj is the desired output of the j-th output neuron, o_pj is the actual output of the j-th output neuron, and m is the number of output neurons. The most commonly used minimization approach is gradient descent optimization of the cumulative squared error at the output of the network. Gradient descent has been shown to be computationally wasteful, and in this paper we apply the less wasteful conjugate-gradient optimization for multilayer neural networks with one hidden layer (Barnard 1992).
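Conjugate-gradient training of a one-hidden-layer network can be sketched with a generic optimizer. This is not the CGBP implementation used in the paper: the sum-of-squared-error function, the tiny XOR problem, and the use of SciPy's CG method with a numerical gradient are all illustrative choices here.

```python
import numpy as np
from scipy.optimize import minimize

def mlp_sse(theta, X, T, n_hidden):
    """Sum-of-squared-error of a one-hidden-layer sigmoid network,
    with all weights and biases packed into the flat vector theta."""
    d, m = X.shape[1], T.shape[1]
    W1 = theta[: d * n_hidden].reshape(d, n_hidden)
    b1 = theta[d * n_hidden : d * n_hidden + n_hidden]
    off = d * n_hidden + n_hidden
    W2 = theta[off : off + n_hidden * m].reshape(n_hidden, m)
    b2 = theta[off + n_hidden * m :]
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))   # hidden activations
    O = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))   # network outputs
    return np.sum((T - O) ** 2)

# Toy problem: XOR, 2 inputs, 3 hidden neurons, 1 output (13 parameters).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])
n_hidden = 3
rng = np.random.default_rng(1)
theta0 = rng.normal(scale=0.5, size=2 * n_hidden + n_hidden + n_hidden * 1 + 1)
res = minimize(mlp_sse, theta0, args=(X, T, n_hidden), method="CG")
```

A production implementation would supply the analytic backpropagation gradient instead of letting the optimizer estimate it by finite differences, which is what makes conjugate-gradient training efficient for larger networks.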
Several authors have proposed the use of neural networks for feature extraction (Lampinen and Oja 1995, Mao and Jain 1995, Oja 1995). All these authors concentrate on proposing neural networks which perform feature extraction. In their work, the neural networks can be non-linear and either supervised or unsupervised feature extractors. However, they did not focus on data representation and feature extraction for neural networks. In contrast, Fakhr et al. (1994) have proposed a neural network for nearest neighbour classification and linear feature extraction, but their feature extractor is not specified and they provide no analysis of different feature extraction methods.
Of interest here is to investigate what kind of linear feature extraction is desirable for neural networks. Linear feature extraction of input data for neural networks should be beneficial, since simpler classifiers can be trained more easily with low-dimensional data than with high-dimensional data.
4. Experiments
The data used in the experiment, the Anderson River data set, are a multisource remote sensing and geographic data set made available by the Canada Centre for Remote Sensing.
Table 1. Training and test samples for information classes in the experiment on the Anderson River data.
Table 2. Average pairwise JM-distances for three of the data sources (maximum JM-distance is 1.414).

Source        Average pairwise JM-distance
AMSS          1.19877
SAR shallow   0.46305
SAR steep     0.43109
and was trained six times with different initializations. Then the overall average accuracies were computed for each version. The average results for the CGBP networks with 30 and 35 hidden neurons are shown in figure 1 as a function of the number of training iterations. With 30 hidden neurons, 74.83 per cent overall accuracy was reached for training data and 72.18 per cent for test data. In comparison, the network with 35 hidden neurons gave an overall accuracy of 74.48 per cent for training data and 72.18 per cent for test data.
Principal component analysis (PCA) was performed on the data. The eigenvalues of the global covariance matrix are shown in table 3. From the table it can be seen that approximately 99 per cent of the variance in the data was preserved in 14 features, about 95 per cent in eight features, and 85 per cent in four features. The CGBP with one hidden layer was trained on the PCA-transformed data with different numbers of input features. In each case, the number of hidden neurons was twice the number of input features, except for the 22-feature case, where 30 hidden neurons were used. The classification results for the PCA are shown in table 4. From table 4 it can be seen that there was only about a 1 per cent decrease in overall training and test accuracies when 14 input features were used instead of 22. When fewer than 14 features were used, the classification accuracies decreased more significantly.
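The rule used above to read the eigenvalue tables, taking the smallest number of leading eigenvalues whose sum reaches a given fraction of the total variance, can be sketched as follows; the function name is hypothetical and the example eigenvalues are illustrative, not the paper's table 3.

```python
import numpy as np

def n_features_for_variance(eigvals, fraction):
    """Smallest number of leading eigenvalues whose cumulative sum
    reaches the given fraction of the total, e.g. the '99 per cent
    of the variance in 14 features' style of statement."""
    vals = np.sort(eigvals)[::-1]            # decreasing order
    cum = np.cumsum(vals) / vals.sum()       # cumulative variance fraction
    return int(np.searchsorted(cum, fraction) + 1)
```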
Figure 1. Anderson River data: average results for the CGBP. The upper curves represent
training results and the lower curves (marked with *) represent test results.
Table 4. Classification accuracies for principal component analysis (PCA).

Features   Training accuracy (%)   Test accuracy (%)
1          34.27                   34.61
2          47.84                   47.66
3          56.06                   55.29
4          58.04                   58.28
5          64.05                   62.86
6          64.14                   62.76
10         71.19                   68.82
14         73.11                   70.50
22         74.28                   71.50
Discriminant analysis (DA) was then performed on the data. Since there were six information classes in the data, it was known that discriminant analysis would give at most five input features using the criterion tr(S_W^-1 S_B). The eigenvalues of the matrix S_W^-1 S_B are shown in table 5. From the table it can be seen that
approximately 97 per cent of the variance according to the tr(S_W^-1 S_B) criterion was preserved in four features, and about 91 per cent in three features. The CGBP with one hidden layer was trained on the DA-transformed data with different numbers of input features. In each case, the number of hidden neurons was twice the number of input features. The classification results for the DA are shown in table 6. From table 6 it can be seen that the use of DA for feature extraction resulted in significantly less accurate classification by the neural network classifiers. The results are only comparable to the PCA results in table 4 for five or fewer features. For those few features the DA results are slightly higher in terms of classification accuracies. However, the DA definitely suffered from only being able to give five features in total.
Table 6. Classification accuracies for discriminant analysis (DA).

Features   Training accuracy (%)   Test accuracy (%)
1          47.15                   45.84
2          52.48                   51.35
3          59.40                   57.75
4          61.51                   60.37
5          63.22                   61.92

Finally, decision boundary feature extraction (DBFE) was performed on the data (Lee and Landgrebe 1992). The eigenvalues of the decision boundary feature matrix are shown in table 7. From the table it can be seen that approximately 99 per cent of the variance in the data was preserved in 10 features, and about 94 per cent in six features. The CGBP with one hidden layer was trained on the DBFE-transformed data with different numbers of input features. In each case, the number of hidden neurons was twice the number of input features, except for the 22-feature case, where 30 hidden neurons were used. The classification results for the DBFE are shown in
table 8. From table 8 it can be seen that there was only about a 1 per cent decrease in overall training and test accuracies when 10 input features were used instead of 22. The performance then decreased by about 2 per cent in terms of training and test accuracies when six features were used instead of 10. These results indicate that smaller neural network classifiers can be used to obtain results similar to those in the original feature space. Comparing the DBFE results in table 8 to the corresponding PCA results in table 4, it is clear that the DBFE always outperformed the PCA in terms of classification accuracies (see also figures 2 and 3). For example, the accuracies for the DBFE with six features were similar to the PCA results with 10 features.
The overall classification accuracies based on the original 22 features are summarized in table 9 (the CGBP results are given at convergence) and figures 2 and 3. There it can be seen that the classification accuracies with the DBFE were comparable to the ones obtained for the original data when 30 hidden neurons were used.
Table 8. Classification accuracies for decision boundary feature extraction (DBFE).

Features   Training accuracy (%)   Test accuracy (%)
1          46.90                   45.72
2          56.40                   55.18
3          58.83                   57.04
4          68.28                   66.13
5          71.10                   68.27
6          71.50                   69.29
10         73.95                   71.49
22         74.75                   72.48
Figure 2. Anderson River data: average training results for the different feature extraction methods as a function of the number of features used.
Figure 3. Anderson River data: average test results for the different feature extraction methods as a function of the number of features used.
Table 9. Overall training and test accuracies for the CGBP applied to the Anderson River data set.

Method   Training accuracy   Test accuracy
However, the DBFE outperformed both the PCA and DA in terms of overall classification accuracies on training and test data. The standard deviations of the classifications for the different feature extraction methods are shown in figures 4 and 5, for training and test data, respectively. From these figures it can be seen that the DBFE has lower classification variance than the PCA, especially for the lower-dimensional data. All these results clearly demonstrate that the PCA is not an optimal input representation method for neural network classifiers. On the other hand, excellent classification results were achieved by using the DBFE.
Figure 4. Anderson River data: standard deviations for classifications of training data for the different feature extraction methods as a function of the number of features used.

Figure 5. Anderson River data: standard deviations for classifications of test data for the different feature extraction methods as a function of the number of features used.

5. Conclusion
Several linear feature extraction methods were considered for neural networks. The methods included principal component analysis, discriminant analysis, and the recently proposed decision boundary feature extraction. Although principal component analysis can be shown to be optimal for data representation, it can be improved upon in terms of classification accuracies. Feature extraction methods are important for neural network classifiers and can be used to find the best representation of the input data for the networks, since the performance of neural network classifiers depends on the input representation.
In experiments, neural networks were used to classify multisource remote sensing and geographic data. Few feature extraction methods have been proposed for multisource data, but such data usually cannot be modelled by a simple multivariate statistical model. The decision boundary feature extraction method not only showed the best performance of the feature extraction methods in terms of classification accuracies when the dimensionality was reduced, but also gave the best performance for test data when the full 22-dimensional feature set was used. Since the decision boundary feature extraction algorithm does not assume any underlying probability density functions for the data, it takes full advantage of the distribution-free nature of neural networks and of how neural network models define complex decision boundaries. With a reduced feature set, it is possible to obtain simpler classifiers, and in the experiments these simpler classifiers gave accuracies similar to those of classifiers applied to the original data. The decision boundary feature extraction method also gives a feature matrix which has full rank; that is, as many features can be used as are in the original feature space. In contrast, discriminant analysis generally does not give a full-rank matrix in the original feature space. Therefore, discriminant analysis should be considered a less attractive method than decision boundary feature extraction for input representation for neural network classifiers.
With its outstanding performance on this difficult data set, decision boundary feature extraction should be considered both an excellent feature extraction method and a desirable method for input data representation for neural networks.
Acknowledgments
The authors are very grateful to Dr Chulhee Lee of Yonsei University and Professor David A. Landgrebe of Purdue University for their invaluable assistance in preparing this paper. The authors also thank Dr Joseph Hoffbeck of AT&T for providing his Matlab data analysis software to us. The Anderson River SAR/MSS data set was acquired, preprocessed, and loaned by the Canada Centre for Remote Sensing, Department of Energy, Mines, and Resources, of the Government of Canada. This work was funded in part by the Icelandic Science Council and the Research Fund of the University of Iceland.
References
Barnard, E., 1992, Optimization for training neural nets. I.E.E.E. Transactions on Neural Networks, 3, 232-240.
Benediktsson, J. A., and Swain, P. H., 1992, Consensus theoretic classification methods. I.E.E.E. Transactions on Systems, Man, and Cybernetics, 22, 688-704.
Benediktsson, J. A., Swain, P. H., and Ersoy, O. K., 1990, Neural network approaches versus statistical methods in classification of multisource remote sensing data. I.E.E.E. Transactions on Geoscience and Remote Sensing, 28, 540-552.
Benediktsson, J. A., Swain, P. H., and Ersoy, O. K., 1993, Conjugate-gradient neural networks in classification of multisource and very-high-dimensional data. International Journal of Remote Sensing, 14, 2883-2903.
Blonda, P., la Forgia, V., Pasquariello, G., and Satalino, G., 1994, Multispectral classification by a modular neural network architecture. Proceedings of the 1994 International Geoscience and Remote Sensing Symposium (New York: I.E.E.E. Press), pp. 1873-1875.
Fakhr, W., Kamel, M., and Elmasry, M. I., 1994, The adaptive feature extraction nearest neighbour classifier. Proceedings of the 1994 World Congress on Neural Networks, 3 (Hillsdale, New Jersey: Lawrence Erlbaum), pp. 123-128.
Fukunaga, K., 1990, Introduction to Statistical Pattern Recognition, 2nd edn (New York: Academic Press).
Goodenough, D. G., Goldberg, M., Plunkett, G., and Zelek, J., 1987, The CCRS SAR/MSS Anderson River data set. I.E.E.E. Transactions on Geoscience and Remote Sensing, 25, 360-367.
Lampinen, J., and Oja, E., 1995, Distortion tolerant pattern recognition based on self-organizing feature extraction. I.E.E.E. Transactions on Neural Networks, 6, 539-547.
Lee, C., and Landgrebe, D. A., 1992, Decision boundary feature selection for neural networks. Proceedings of the I.E.E.E. International Conference on Systems, Man and Cybernetics (New York: I.E.E.E. Press), pp. 1053-1057.
Lee, C., and Landgrebe, D. A., 1993 a, Feature extraction and classification algorithms for high dimensional data. Technical Report TR-EE 93-1, School of Electrical Engineering,