International Journal of Remote Sensing, 1997, vol. 18, no. 4, 727-740

Feature extraction for multisource data classification with artificial
neural networks

J. A. BENEDIKTSSON and J. R. SVEINSSON



Engineering Research Institute, University of Iceland, Hjardarhaga 2-6,
107 Reykjavik, Iceland

(Received 31 December 1995; in final form 20 June 1996)

Abstract. Classification of multisource remote sensing and geographic data by
neural networks is discussed with respect to feature extraction. Several feature
extraction methods are reviewed, including principal component analysis,
discriminant analysis, and the recently proposed decision boundary feature
extraction method. The feature extraction methods are then applied in experiments
in conjunction with classification by multilayer neural networks. The decision
boundary feature extraction method shows excellent performance in the experiments.

1. Introduction
Representation of input data for neural networks is important and can significantly
affect the classification performance of neural networks. The selection of input
representation is related to the general pattern recognition process of selecting
input classification variables, which strongly affects classifier design. This means
that if the input variables show significant differences from one class to another,
the classifier can be designed more easily and with better performance. Therefore,
the selection of variables is a key problem in pattern recognition, and is termed
feature selection or feature extraction (Fukunaga 1990). Feature extraction can,
thus, be used to transform the input data and in some way find the best input
representation for neural networks.
Feature selection is generally considered a process of mapping the original
measurements into more effective features. If the mapping is linear, the mapping
function is well defined and the task is simply to find the coefficients of the
linear function that maximize or minimize a criterion. Therefore, if a proper
criterion exists for evaluating the effectiveness of features, well-developed
techniques of linear algebra can be used for simple criteria. On the other hand, if
the criterion is too complicated for an analytical solution, iterative techniques
can be applied. In many applications of pattern recognition, there are important
features which are non-linear rather than linear functions of the original
measurements. In that case, the basic problem is to find a proper non-linear mapping
for the given data. Since a general algorithm to systematically generate non-linear
mappings is not available, the selection of features becomes very problem oriented
(Fukunaga 1990).
Neural networks have been applied successfully in various fields. A characteristic
of neural networks is that they may need a long training time but are relatively
fast data classifiers. However, the principal reason for using neural network
methods for classification of multisource remote sensing and geographic data is
that these methods are distribution-free. Since multisource data are in general of
multiple types, the data from the various sources can have different statistical
distributions. The neural network approach does not require explicit modelling of
the data from each source. In addition, neural network methods have been shown to
approximate class-conditional probabilities for their entire training set in the
mean-squared sense (Ruck et al. 1990). Consequently, there is no need to treat the
data sources independently as in many statistical methods (Benediktsson et al. 1990,
Benediktsson and Swain 1992). The neural network approach also avoids the problem in
statistical multisource analysis of specifying how much influence each data source
should have in the classification (Benediktsson and Swain 1992).
For high-dimensional data, large neural networks (with many inputs and a large
number of hidden neurons) are often used. The training time of a large neural
network can be very long. Also, the training methods for neural networks are based
on estimating the weights and biases for the networks. If the neural networks are
large, then many parameters need to be estimated based on a finite number of
training samples. In that case, overfitting can possibly be observed, that is, the
neural networks may not generalize well, although high classification accuracy can
be achieved for training data. Also, for high-dimensional data the curse of
dimensionality or the Hughes phenomenon (Fukunaga 1990) may occur. Hence, it is
necessary to reduce the input dimensionality for the neural network in order to
obtain a smaller network which performs well both in terms of training and test
classification accuracies. This leads to the importance of feature extraction for
neural networks, that is, to find the best representation of input data in a lower
dimensional space where the representation does not lead to a significant decrease
in overall classification accuracy as compared to the one obtained in the original
feature space. However, few feature extraction algorithms are available for neural
networks. Though some of the conventional feature extraction methods, such as
principal component analysis (PCA) and discriminant analysis (DA), might be used,
such methods do not take full advantage of the way neural networks define complex
decision boundaries.
Recently, it was shown (Lee and Landgrebe 1993a, b, c) that all the features
necessary to achieve the same classification accuracy as in the original feature
space, for a given classifier, can be obtained from the decision boundary defined by
the classifier. The method is called decision boundary feature extraction (DBFE).
The DBFE method takes full advantage of the characteristics of a classifier by
selecting features directly from its decision boundary. Therefore, this method
should be considered as a candidate for both feature extraction and input data
representation for neural network classifiers.
This paper applies several feature extraction methods for neural networks to reduce
multisource remote sensing and geographic data to relatively few features. The goal
is to do this without much loss in overall classification accuracy.

2. Feature extraction
Feature extraction can be viewed as finding a set of vectors that represent an
observation while reducing the dimensionality. In pattern recognition, it is
desirable to extract features that are focused on discriminating between classes.
Although a reduction in dimensionality is desirable, it must be achieved without
sacrificing the discriminative power of classifiers. The development of feature
extraction methods has been one of the most important problems in the field of
pattern analysis and has been studied extensively. Feature extraction methods can be
both unsupervised and supervised, and also linear and non-linear. Here we
concentrate on linear feature extraction methods for neural networks and then leave
the neural networks with the classification task.
The question of how input data should be represented for a neural network is an
important one and strongly affects the classification performance of the neural
network.
Some authors (Blonda et al. 1994, Lee et al. 1994) have suggested principal
component analysis (PCA) as a feature extraction method for neural networks, but
here it will be shown empirically that the method is not optimal in terms of
classification accuracy. Below, three different linear feature extraction methods
are discussed. For all these methods a feature matrix is defined and the eigenvalues
of the feature matrix are ordered in decreasing order along with their corresponding
eigenvectors. The number of input dimensions corresponds to the number of
eigenvectors selected (Fukunaga 1990). The transformed data are determined by

Y = W^T X                                                            (1)

where W is the transformation matrix composed of the eigenvectors of the feature
matrix, X is the data in the original feature space, and Y is the transformed data
in the new feature space.
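
As a minimal illustration of equation (1) (a sketch of ours, not from the paper;
the function name and the convention of storing samples as rows of X are
assumptions), the leading eigenvectors of a given feature matrix can be selected and
the data projected as follows:

```python
import numpy as np

def eigen_feature_transform(feature_matrix, X, n_features):
    """Order the eigenpairs of a symmetric feature matrix and project the
    data onto the leading eigenvectors, i.e. Y = W^T X of equation (1)."""
    eigvals, eigvecs = np.linalg.eigh(feature_matrix)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                  # re-order: decreasing
    W = eigvecs[:, order[:n_features]]                 # transformation matrix W
    return X @ W                                       # rows of X become rows of Y
```

The three methods discussed below differ mainly in the feature matrix whose
eigenvectors are used.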

2.1. Principal component analysis
One of the most widely used transforms for signal representation and data
compression is the principal component (Karhunen-Loeve) transformation.
To find the necessary transformation of X to Y in equation (1), the global
covariance matrix Σ_X for the original data set is estimated. Then the
eigenvalue-eigenvector decomposition of the covariance matrix Σ_X is determined,
that is,

Σ_X = W Λ W^T                                                        (2)

where Λ is a diagonal matrix with the eigenvalues of Σ_X in decreasing order and W
is a normalized matrix with the corresponding eigenvectors of Σ_X. With this choice
of the transformation matrix in equation (1), it is easily seen that the covariance
matrix for the transformed data is Σ_Y = Λ.
Although the principal component transformation is optimal for signal representation
in the sense that it provides the smallest mean squared error for a given number of
features, quite often the features defined by this transformation are not optimal
with regard to class separability (Fukunaga 1990). In feature extraction for
classification, it is not the mean squared error but the classification accuracy
that must be considered as the primary criterion for feature extraction.
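
A compact sketch of the transformation just described, assuming numpy and samples
stored as rows (the function name is ours):

```python
import numpy as np

def pca_features(X, n_features):
    """Principal component (Karhunen-Loeve) feature extraction via the
    eigendecomposition of the global covariance matrix, equation (2)."""
    Xc = X - X.mean(axis=0)                      # centre the data
    cov = np.cov(Xc, rowvar=False)               # global covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]            # decreasing eigenvalues
    W = eigvecs[:, order[:n_features]]           # transformation matrix of eq. (1)
    proportion = eigvals[order] / eigvals.sum()  # variance retained per component
    return Xc @ W, proportion
```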

2.2. Discriminant analysis
The principal component transformation is based upon the global covariance matrix.
Therefore, it is explicitly not sensitive to inter-class structure. It often works
as a feature reduction tool because classes are frequently distributed in the
direction of the maximum data scatter. Discriminant analysis is a method which is
intended to enhance separability. A within-class scatter matrix, S_W, and a
between-class scatter matrix, S_B, are defined:

S_W = Σ_i P(ω_i) S_i                                                 (3)

S_B = Σ_i P(ω_i) (M_i − M_0)(M_i − M_0)^T                            (4)

M_0 = Σ_i P(ω_i) M_i                                                 (5)

where M_i is the mean vector for the ith class, S_i is the covariance matrix for the
ith class, and P(ω_i) is the prior probability of the ith class. The criterion of
optimization may be defined as

J = tr(S_W^{-1} S_B)                                                 (6)

where tr(·) denotes the trace of a matrix. New feature vectors are selected to
maximize the criterion.
The necessary transformation from X to Y in equation (1) is found by taking the
eigenvalue-eigenvector decomposition of the matrix S_W^{-1} S_B and then taking the
transformation matrix as the normalized eigenvectors corresponding to the
eigenvalues in decreasing order. However, this method does have some shortcomings.
For example, since discriminant analysis mainly utilizes class mean differences, the
feature vectors selected by discriminant analysis are not reliable if mean vectors
are near to one another. Since the lumped covariance matrix is used in the
criterion, discriminant analysis may lose information contained in class covariance
differences. Also, for M classes the maximum rank of S_B is M − 1 since S_B is
dependent on M_0. Usually S_W is of full rank and, therefore, the maximum rank of
S_W^{-1} S_B is M − 1. This indicates that at maximum M − 1 features can be
extracted by this approach. Another problem is that the criterion function in
equation (6) generally does not have a direct relationship to the error probability.
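
A sketch of the corresponding computation (ours, under the same row-wise data
convention; class labels are assumed to be given in a vector y):

```python
import numpy as np

def discriminant_features(X, y, n_features):
    """Discriminant analysis: eigenvectors of S_W^{-1} S_B ordered by
    decreasing eigenvalue, equations (3)-(6); for M classes at most M - 1
    eigenvalues are non-zero, so only n_features <= M - 1 is meaningful."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)                                  # P(omega_i)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    M0 = priors @ means                                       # equation (5)
    S_W = sum(p * np.cov(X[y == c], rowvar=False)
              for p, c in zip(priors, classes))               # equation (3)
    S_B = sum(p * np.outer(m - M0, m - M0)
              for p, m in zip(priors, means))                 # equation (4)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]                    # criterion (6)
    W = eigvecs[:, order[:n_features]].real                   # transformation matrix
    return X @ W
```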

2.3. Decision boundary feature extraction
Lee and Landgrebe (1993a, b, c) showed that discriminantly informative features and
discriminantly redundant features can be extracted from the decision boundary
itself. They also showed that discriminantly informative feature vectors have a
component which is normal to the decision boundary at least at one point on the
decision boundary. Further, discriminantly redundant feature vectors are orthogonal
to a vector normal to the decision boundary at every point on the decision boundary.
In Lee and Landgrebe (1993a, b) a decision boundary feature matrix (DBFM) was
defined to extract discriminantly informative features and discriminantly redundant
features from the decision boundary:

Definition: The decision boundary feature matrix
Let N_i be the normal vector to the decision boundary at a point on the decision
boundary for a given pattern classification problem. Let R_i be a collection of
points on the decision boundary which have the same normal vector N_i. Then the
decision boundary feature matrix is defined as

Σ_DBFM = Σ_i N_i N_i^T W(R_i)                                        (7)

W(R_i) = (area of R_i) / (total area of decision boundary).          (8)

It can be shown that the rank of the DBFM is the smallest dimension where the same
classification accuracy can be obtained as in the original feature space. Also, the
eigenvectors of the DBFM corresponding to non-zero eigenvalues are the necessary
feature vectors to achieve the same classification accuracy as in the original
feature space (Lee and Landgrebe 1993a, c).
Lee and Landgrebe (1993b) use a non-parametric procedure to find the decision
boundary numerically. From the decision boundary, normal vectors N(X) are estimated
using a gradient approximation, giving the estimates N_i. Then the effective
decision boundary feature matrix is estimated using the normal vectors as

Σ_EDBFM = Σ_i N_i N_i^T.                                             (9)

Next, the eigenvalue-eigenvector decomposition of the effective decision boundary
feature matrix, Σ_EDBFM, is calculated and the normalized eigenvectors corresponding
to non-zero eigenvalues are used as the transformation matrix from X to Y in
equation (1). Theoretically, the eigenvectors corresponding to non-zero eigenvalues
will give the same classification accuracy as in the original feature space.
However, in practice, few eigenvalues are actually zero. Thus, a threshold is used.
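
The following is a rough sketch, of our own construction, of how Σ_EDBFM of
equation (9) could be estimated for a two-class problem; it is not the exact
non-parametric procedure of Lee and Landgrebe (1993b). It assumes a discriminant
function g(x) whose sign gives the decision, and pairs of samples assigned to
different classes from which boundary points are located by bisection:

```python
import numpy as np

def edbfm_features(g, boundary_pairs, n_features, eps=1e-3):
    """Estimate the effective decision boundary feature matrix of equation (9)
    from sample pairs (x_a, x_b) that a two-class discriminant g assigns to
    different classes: locate a boundary point on each joining segment by
    bisection, estimate the unit normal there by a finite-difference gradient
    of g, and accumulate the outer products N_i N_i^T."""
    dim = len(np.asarray(boundary_pairs[0][0]))
    S = np.zeros((dim, dim))
    for xa, xb in boundary_pairs:
        lo, hi = np.asarray(xa, float), np.asarray(xb, float)
        for _ in range(50):                          # bisect until g(mid) ~ 0
            mid = 0.5 * (lo + hi)
            if np.sign(g(mid)) == np.sign(g(lo)):
                lo = mid
            else:
                hi = mid
        grad = np.array([(g(mid + eps * e) - g(mid - eps * e)) / (2 * eps)
                         for e in np.eye(dim)])      # gradient gives normal vector
        N = grad / np.linalg.norm(grad)
        S += np.outer(N, N)                          # accumulate N_i N_i^T
    eigvals, eigvecs = np.linalg.eigh(S / len(boundary_pairs))
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_features]]            # columns give W of equation (1)
```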

3. Classification and feature extraction for neural networks
Neural network classifiers have been demonstrated to be attractive alternatives to
conventional classifiers for classification of remote sensing and geographic data
(Benediktsson et al. 1990, 1993), which is important especially when a convenient
statistical model does not exist.
Most neural network methods are based on the optimization (minimization) of a cost
functional:

E = Σ_{p=1}^{n} e_p = (1/2) Σ_{p=1}^{n} Σ_{j=1}^{m} (t_{pj} − o_{pj})^2    (10)

where p is a pattern number, n is the sample size, t_{pj} is the desired output of
the jth output neuron, o_{pj} is the actual output of the jth output neuron, and m
is the number of output neurons. The most commonly used minimization approach is
gradient descent optimization of the cumulative squared error at the output of the
network. Gradient descent has been shown to be computationally wasteful, and in this
paper we apply the less wasteful conjugate-gradient optimization for multilayer
neural networks with one hidden layer (Barnard 1992).
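
As an illustration of the cost functional (10) and of conjugate-gradient training,
the sketch below uses scipy's general-purpose CG optimizer as a stand-in for the
CGBP algorithm of Barnard (1992); the network layout, parameter packing, and sigmoid
activations are our assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def mlp_cost(params, X, T, n_in, n_hid, n_out):
    """Cumulative squared error of equation (10) for a one-hidden-layer
    network with sigmoid units; params is the flattened weight/bias vector."""
    i = 0
    W1 = params[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
    b1 = params[i:i + n_hid]; i += n_hid
    W2 = params[i:i + n_hid * n_out].reshape(n_hid, n_out); i += n_hid * n_out
    b2 = params[i:i + n_out]
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))   # hidden-layer outputs
    O = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))   # network outputs o_pj
    return 0.5 * np.sum((T - O) ** 2)          # E = 1/2 sum_p sum_j (t - o)^2

def train_mlp_cg(X, T, n_hid, seed=0):
    """Minimise the cost with a conjugate-gradient optimiser."""
    n_in, n_out = X.shape[1], T.shape[1]
    rng = np.random.default_rng(seed)
    n_params = n_in * n_hid + n_hid + n_hid * n_out + n_out
    x0 = 0.1 * rng.standard_normal(n_params)
    res = minimize(mlp_cost, x0, args=(X, T, n_in, n_hid, n_out), method='CG')
    return res.x
```

In practice CGBP computes the gradient with backpropagation, whereas scipy falls
back to finite differences here; the sketch is only meant to make the structure of
equation (10) concrete.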
Several authors have proposed the use of neural networks for feature extraction
(Lampinen and Oja 1995, Mao and Jain 1995, Oja 1995). All these authors concentrate
on proposing neural networks which perform feature extraction. In their works, the
neural networks can be non-linear and either supervised or unsupervised feature
extractors. However, they did not focus on data representation and feature
extraction for neural networks. In contrast, Fakhr et al. (1994) have proposed a
neural network for nearest neighbour classification and linear feature extraction,
but their feature extractor is not specified and they perform no analysis of
different feature extraction methods.
Of interest here is to investigate what kind of linear feature extraction is
desirable for neural networks. Linear feature extraction of input data for neural
networks should be beneficial since simpler classifiers can be trained more easily
with low dimensional data than with high dimensional data.

4. Experiments
The data used in the experiment, the Anderson River data set, are a multisource
remote sensing and geographic data set made available by the Canada Centre for
Remote Sensing (CCRS) (Goodenough et al. 1987). The imagery involves a
2.8 km × 2.8 km forestry site in the Anderson River area of British Columbia,
Canada. The area is characterized by rugged topography, with terrain elevations
ranging from 300 m to 1100 m above sea level. The forest cover is primarily
coniferous, with Douglas fir predominating up to approximately 1050 m elevation, and
cedar, hemlock, and spruce types predominating at higher elevations (Goodenough
et al. 1987).
Six data sources were used:
(i) Airborne Multispectral Scanner System (AMSS) with 11 spectral data channels
(10 channels from 380 to 1100 nm and 1 channel from 8 to 14 μm);
(ii) Steep Mode Synthetic Aperture Radar (SAR) with four data channels (X-HH, X-HV,
L-HH, L-HV);
(iii) Shallow Mode SAR with four data channels (X-HH, X-HV, L-HH, L-HV);
(iv) elevation data (one data channel, where elevation in meters = 61.996 + 7.2266
× pixel value);
(v) slope data (one data channel, where slope in degrees = pixel value);
(vi) aspect data (one data channel, where aspect in degrees = 2 × pixel value); the
conversions in (iv)-(vi) are illustrated in the sketch following this list.
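
For illustration (a sketch of ours; the function and array names are placeholders),
the conversions in items (iv)-(vi) amount to:

```python
import numpy as np

def topographic_layers(elev_px, slope_px, aspect_px):
    """Convert the three single-channel ancillary images from pixel values
    to physical units, following the relations in items (iv)-(vi) above."""
    elevation_m = 61.996 + 7.2266 * np.asarray(elev_px, float)  # meters
    slope_deg = np.asarray(slope_px, float)                     # degrees
    aspect_deg = 2.0 * np.asarray(aspect_px, float)             # degrees
    return elevation_m, slope_deg, aspect_deg
```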
The AMSS and SAR data were acquired during the week of 25 July to 31 July 1978.
Each channel comprises an image of 256 lines and 256 columns. All of the images are
spatially co-registered, with a spatial resolution of 12.5 m.
There are 19 information classes in the ground reference map provided by CCRS.
In the experiments, only the six largest ones were used, as listed in table 1. Here,
training samples were selected uniformly, giving 10 per cent of the total sample size.
Test samples were then selected randomly from the rest of the labelled data.
To estimate the separabilities between the information classes for the AMSS and SAR
data sources, Jeffries-Matusita (JM) distances (Swain 1978) were computed. The
average pairwise JM-distance separabilities are shown in table 2 for the AMSS and
SAR data sources. The values in table 2 indicate that the Anderson River data are
very difficult to classify. The AMSS source is apparently the most separable of the
multidimensional data sources. Although it only has an average separability of
1.199, it is much more separable than the SAR data sources. Since these three
multidimensional data sources are not very separable for this forest area, the
topographic data may be expected to help in classifying the data more accurately
than can be achieved using the remote sensing data alone.

Table 1. Training and test samples for information classes in the experiment on the
Anderson River data.

Class number    Information class                           Training size    Test size
1               Douglas fir (31-40 m)                        971             1250
2               Douglas fir (21-30 m)                        551              817
3               Douglas fir + other species (31-40 m)        548              701
4               Douglas fir + lodgepole pine (21-30 m)       542              705
5               Hemlock + cedar (31-40 m)                    317              405
6               Forest clearings                            1260             1625
                Total                                       4189             5503

Table 2. Average pairwise JM-distances for three of the data sources (maximum
JM-distance is 1.414).

Data source     Average JM-distance
AMSS            1.19877
SAR shallow     0.46305
SAR steep       0.43109
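
For reference, a sketch (ours) of the pairwise JM distance under a Gaussian model of
each class, which is the form given by Swain (1978); the entries in table 2 would
then be averages of this quantity over all class pairs for a given source:

```python
import numpy as np

def jm_distance(m1, S1, m2, S2):
    """Jeffries-Matusita distance between two classes modelled as Gaussians
    (mean vectors m1, m2 and covariance matrices S1, S2); it saturates at
    sqrt(2) ~ 1.414 for perfectly separable classes, the maximum quoted in
    table 2."""
    S = 0.5 * (S1 + S2)                      # average covariance matrix
    dm = m1 - m2
    bhatt = (0.125 * dm @ np.linalg.solve(S, dm)
             + 0.5 * np.log(np.linalg.det(S)
                            / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))
    return np.sqrt(2.0 * (1.0 - np.exp(-bhatt)))
```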
The conjugate-gradient backpropagation (CGBP) algorithm with one hidden layer
(Barnard 1992) was trained on the original data with 15, 20, 25, 30, and 35 hidden
neurons. Each version of the CGBP network had 22 inputs and six outputs, and was
trained six times with different initializations. Then the overall average
accuracies were computed for each version. The average results for the CGBP networks
with 30 and 35 hidden neurons are shown in figure 1 as a function of the number of
training iterations. With 30 hidden neurons 74.83 per cent overall accuracy was
reached for training data and 72.18 per cent for test data. In comparison, the
network with 35 hidden neurons gave an overall accuracy of 74.48 per cent for
training data and 72.18 per cent for test data.
Principal component analysis (PCA) was performed on the data. The eigenvalues of the
global covariance matrix are shown in table 3. From the table it can be seen that
approximately 99 per cent of the variance in the data was preserved in 14 features,
about 95 per cent of the variance in eight features, and 85 per cent in four
features. The CGBP with one hidden layer was trained on the PCA transformed data
with a different number of input features. In each case, the number of hidden
neurons was twice the number of input features, except for the 22 feature case where
30 hidden neurons were used. The classification results for the PCA are shown in
table 4. From table 4 it can be seen that there was only about 1 per cent decrease
in overall training and test accuracies when 14 input features were used instead of
22. When fewer than 14 features were used, the classification accuracies decreased
more significantly.
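
A small sketch (ours) of how the number of retained features can be read off from
the cumulative eigenvalue proportions of table 3:

```python
import numpy as np

def features_for_variance(eigvals, target=0.99):
    """Smallest number of leading eigenvalues whose cumulative proportion
    reaches the target, e.g. target=0.99 gives 14 features for table 3."""
    props = np.sort(np.asarray(eigvals, float))[::-1] / np.sum(eigvals)
    return int(np.searchsorted(np.cumsum(props), target) + 1)
```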

Figure 1. Anderson River data: average results for the CGBP. The upper curves represent
training results and the lower curves (marked with *) represent test results.

Table 3. Eigenvalues for principal component analysis (PCA).

Number of features    Eigenvalue    Proportion    Accumulation
 1                    0.787         53.97          53.97
 2                    0.279         19.12          73.09
 3                    0.131          9.01          82.09
 4                    0.054          3.70          85.80
 5                    0.051          3.49          89.29
 6                    0.032          2.23          91.52
 7                    0.026          1.81          93.32
 8                    0.0192         1.32          94.65
 9                    0.0161         1.10          95.75
10                    0.0140         0.96          96.71
11                    0.0095         0.65          97.36
12                    0.0084         0.58          97.94
13                    0.0076         0.52          98.45
14                    0.0066         0.46          98.91
15                    0.0057         0.39          99.30
16                    0.0037         0.25          99.55
17                    0.0025         0.17          99.73
18                    0.0023         0.16          99.88
19                    0.0011         0.08          99.96
20                    0.0005         0.03          99.99
21                    0.0001         0.01         100.00
22                    0.0000         0.00         100.00

Table 4. Classification accuracies for principal component analysis (PCA).

Number of features    Overall training accuracy    Overall test accuracy
 1                    34.27                        34.61
 2                    47.84                        47.66
 3                    56.06                        55.29
 4                    58.04                        58.28
 5                    64.05                        62.86
 6                    64.14                        62.76
10                    71.19                        68.82
14                    73.11                        70.50
22                    74.28                        71.50

Table 5. Eigenvalues for discriminant analysis (DA).

Number of features    Eigenvalue    Proportion    Accumulation
1                     1.50          56.29          56.29
2                     0.64          23.92          80.21
3                     0.29          10.92          91.12
4                     0.17           6.27          97.39
5                     0.07           2.61         100.00

Discriminant analysis (DA) was then performed on the data. Since there were six
information classes in the data, it was known that discriminant analysis would at
maximum give only five input features using the criterion tr(S_W^{-1} S_B). The
eigenvalues of the matrix S_W^{-1} S_B are shown in table 5. From the table it can
be seen that approximately 97 per cent of the variance according to the
tr(S_W^{-1} S_B) criterion was preserved in four features, and about 91 per cent in
three features. The CGBP with one hidden layer was trained on the DA transformed
data with a different number of input features. In each case, the number of hidden
neurons was twice the number of input features. The classification results for the
DA are shown in table 6. From table 6 it can be seen that the use of DA for feature
extraction resulted in significantly less accurate classification by the neural
network classifiers. The results are only comparable to the PCA results in table 4
for five or fewer features. For those few features the DA results are slightly
higher in terms of classification accuracies. However, the DA definitely suffered
from only being able to give five total features.
Finally, decision boundary feature extraction (DBFE) was performed on the data (Lee
and Landgrebe 1992). The eigenvalues of the decision boundary feature matrix are
shown in table 7. From the table it can be seen that approximately 99 per cent of
the variance in the data was preserved in 10 features, and about 94 per cent in six
features. The CGBP with one hidden layer was trained on the DBFE transformed data
with a different number of input features. In each case, the number of hidden
neurons was twice the number of input features, except for the 22 feature case where
30 hidden neurons were used. The classification results for the DBFE are shown in
table 8. From table 8 it can be seen that there was only about 1 per cent decrease
in overall training and test accuracies when 10 input features were used instead of
22. Then, the performance decreased about 2 per cent in terms of training and test
accuracies when six features were used instead of 10. These results indicate that
smaller neural network classifiers can be used to obtain similar results to the ones
in the original feature space. Comparing the DBFE results in table 8 to the
corresponding PCA results in table 4 it is clear that the DBFE always outperformed
the PCA in terms of classification accuracies (see also figures 2 and 3). For
example, the accuracies for the DBFE with six features were similar to the PCA
results with 10 features.

Table 6. Classification accuracies for discriminant analysis (DA).

Number of features    Overall training accuracy    Overall test accuracy
1                     47.15                        45.84
2                     52.48                        51.35
3                     59.40                        57.75
4                     61.51                        60.37
5                     63.22                        61.92

Table 7. Eigenvalues of decision boundary feature matrix (DBFM).

Number of features    Eigenvalue    Proportion    Accumulation
 1                    1.200         37.75          37.75
 2                    0.810         25.29          63.04
 3                    0.360         11.31          74.35
 4                    0.270          8.35          82.71
 5                    0.250          7.84          90.55
 6                    0.110          3.50          94.05
 7                    0.086          2.69          96.74
 8                    0.039          1.22          97.96
 9                    0.019          0.58          98.54
10                    0.016          0.51          99.05
11                    0.011          0.33          99.38
12                    0.008          0.24          99.62
13                    0.004          0.14          99.77
14                    0.004          0.11          99.88
15                    0.002          0.07          99.94
16                    0.001          0.02          99.97
17                    0.001          0.02          99.98
18                    0.000          0.01          99.99
19                    0.000          0.00         100.00
20                    0.000          0.00         100.00
21                    0.000          0.00         100.00
22                    0.000          0.00         100.00
The overall classi® cation accuracies based on the original 22 features are summar-
ized in table 9 (the CGBP results are given at convergence) and ® gures 2 and 3.
There it can be seen that the classification accuracies with the DBFE were comparable
to the ones obtained for the original data when 30 hidden neurons were used.

Table 8. Classification accuracies for decision boundary feature extraction (DBFE).

Number of features    Overall training accuracy    Overall test accuracy
 1                    46.90                        45.72
 2                    56.40                        55.18
 3                    58.83                        57.04
 4                    68.28                        66.13
 5                    71.10                        68.27
 6                    71.50                        69.29
10                    73.95                        71.49
22                    74.75                        72.48

Figure 2. Anderson River data: average training results for the different feature
extraction methods as a function of the number of features used.
Figure 3. Anderson River data: average test results for the different feature
extraction methods as a function of the number of features used.

Table 9. Overall training and test accuracies for the CGBP applied to the Anderson
River data set.

Method                                                      Training accuracy    Test accuracy
Original data (15 hidden neurons)                           71.44%               69.28%
Original data (20 hidden neurons)                           73.56%               70.94%
Original data (25 hidden neurons)                           73.40%               70.93%
Original data (30 hidden neurons)                           74.83%               72.18%
Original data (35 hidden neurons)                           74.48%               71.43%
Discriminant analysis (10 hidden neurons)                   63.22%               61.92%
Principal component analysis (30 hidden neurons)            74.43%               71.50%
Decision boundary feature extraction (30 hidden neurons)    74.75%               72.48%

However, the DBFE outperformed both the PCA and DA in terms of overall
classification accuracies of training and test data. The standard deviations of the
classifications for the different feature extraction methods are shown in figures 4
and 5, for training and test data, respectively. From these figures it can be seen
that the DBFE does have lower classification variance than the PCA, especially for
the lower dimensional data. All these results clearly demonstrate that the PCA is
not an optimal input representation method for neural network classifiers. On the
other hand, excellent classification results were achieved by using the DBFE.

Figure 4. Anderson River data: standard deviations for classifications of training
data for the different feature extraction methods as a function of the number of
features used.

Figure 5. Anderson River data: standard deviations for classifications of test data
for the different feature extraction methods as a function of the number of features
used.

5. Conclusion
Several linear feature extraction methods were considered for neural networks. The
methods included principal component analysis, discriminant analysis, and the
recently proposed decision boundary feature extraction. Although principal component
analysis can be shown to be optimal for data representation, it can be improved upon
in terms of classification accuracies. Feature extraction methods are important for
neural network classifiers and can be used to find the best representation of input
data for the networks, since the performance of neural network classifiers is
dependent on the input representation.
In experiments, neural networks were used to classify multisource remote sensing and
geographic data. Few feature extraction methods have been proposed for multisource
data, but such data usually cannot be modelled by a simple multivariate statistical
model. The decision boundary feature extraction method not only showed the best
performance of the feature extraction methods in terms of classification accuracies
when the dimensionality was reduced, but also gave the best performance for test
data when the full 22 dimensional feature set was used. Since the decision boundary
feature reduction algorithm does not assume any underlying probability density
functions for the data, it takes full advantage of the distribution-free nature of
neural networks, and of how neural network models define complex decision
boundaries. With a reduced feature set, it is possible to obtain simpler classifiers,
but in the experiments these simpler classifiers gave similar accuracies to
classifiers applied to the original data. The decision boundary feature extraction
method also gives a feature matrix which has full rank, that is, as many features
can be used as are in the original feature space. In contrast, discriminant analysis
generally does not give a full rank matrix in the original feature space. Therefore,
discriminant analysis should be considered a less attractive method than decision
boundary feature extraction for input representation for neural network classifiers.
With its outstanding performance on the difficult data set, decision boundary
feature extraction should be considered both an excellent feature extraction method
and a desirable method for input data representation for neural networks.

Acknowledgments
The authors are very grateful to Dr Chulhee Lee of Yonsei University and Professor
David A. Landgrebe of Purdue University for their invaluable assistance in preparing
this paper. The authors also thank Dr Joseph Hoffbeck of AT&T for providing his
Matlab data analysis software to us. The Anderson River SAR/MSS data set was
acquired, preprocessed, and loaned by the Canada Centre for Remote Sensing,
Department of Energy, Mines and Resources, of the Government of Canada. This work
was funded in part by the Icelandic Science Council and the Research Fund of the
University of Iceland.

References
Barnard, E., 1992, Optimization for training neural nets. I.E.E.E. Transactions on
Neural Networks, 3, 232-240.
Benediktsson, J. A., and Swain, P. H., 1992, Consensus theoretic classification
methods. I.E.E.E. Transactions on Systems, Man, and Cybernetics, 22, 688-704.
Benediktsson, J. A., Swain, P. H., and Ersoy, O. K., 1990, Neural network approaches
versus statistical methods in classification of multisource remote sensing data.
I.E.E.E. Transactions on Geoscience and Remote Sensing, 28, 540-552.
Benediktsson, J. A., Swain, P. H., and Ersoy, O. K., 1993, Conjugate-gradient neural
networks in classification of multisource and very-high-dimensional data.
International Journal of Remote Sensing, 14, 2883-2903.
Blonda, P., la Forgia, V., Pasquaeriello, G., and Satalino, G., 1994, Multispectral
classification by a modular neural network architecture. Proceedings of the 1994
International Geoscience and Remote Sensing Symposium (New York: I.E.E.E. Press),
pp. 1873-1875.
Fakhr, W., Kamel, M., and Elmasry, M. I., 1994, The adaptive feature extraction
nearest neighbour classifier. Proceedings of the 1994 World Congress on Neural
Networks, 3 (Hillsdale, New Jersey: Lawrence Erlbaum), pp. 123-128.
Fukunaga, K., 1990, Introduction to Statistical Pattern Recognition, 2nd edn (New
York: Academic Press).
Goodenough, D. G., Goldberg, M., Plunkett, G., and Zelek, J., 1987, The CCRS
SAR/MSS Anderson River data set. I.E.E.E. Transactions on Geoscience and Remote
Sensing, 25, 360-367.
Lampinen, J., and Oja, E., 1995, Distortion tolerant pattern recognition based on
self-organizing feature extraction. I.E.E.E. Transactions on Neural Networks, 6,
539-547.
Lee, C., and Landgrebe, D. A., 1992, Decision boundary feature selection for neural
networks. Proceedings of the I.E.E.E. International Conference on Systems, Man and
Cybernetics (New York: I.E.E.E. Press), pp. 1053-1057.
Lee, C., and Landgrebe, D. A., 1993a, Feature extraction and classification
algorithms for high dimensional data. Technical Report TR-EE 93-1, School of
Electrical Engineering, Purdue University, West Lafayette, Indiana, U.S.A.
Lee, C., and Landgrebe, D. A., 1993b, Decision boundary feature extraction for
non-parametric classifiers. I.E.E.E. Transactions on Systems, Man and Cybernetics,
23, 433-444.
Lee, C., and Landgrebe, D. A., 1993c, Feature extraction based on decision
boundaries. I.E.E.E. Transactions on Pattern Analysis and Machine Intelligence, 15,
388-400.
Lee, B., Kim, D., Cho, Y., Lee, H., and Hwang, H., 1994, Reduction of input nodes
for shift invariant second order neural networks using principal component analysis
(PCA). Proceedings of the 1994 World Congress on Neural Networks, 3 (Hillsdale, New
Jersey: Lawrence Erlbaum), pp. 144-149.
Mao, J., and Jain, A. K., 1995, Artificial neural networks for feature extraction
and multivariate data projection. I.E.E.E. Transactions on Neural Networks, 6,
296-317.
Oja, E., 1995, PCA, ICA, and non-linear Hebbian learning. Proceedings of the
International Conference on Artificial Neural Networks (ICANN '95), held in Paris,
France, on 9-13 October 1995 (Paris: EC2 & Cie), pp. 89-94.
Ruck, D. W., Rogers, S. K., Kabrisky, M., Oxley, M. E., and Suter, B. W., 1990, The
multi-layer perceptron as an approximation to a Bayes optimal discrimination
function. I.E.E.E. Transactions on Neural Networks, 1, 296-298.
Swain, P. H., 1978, Fundamentals of pattern recognition in remote sensing. In Remote
Sensing: The Quantitative Approach, edited by P. H. Swain and S. Davis (New York:
McGraw-Hill), pp. 136-187.
