
IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 14, NO. 12, DECEMBER 2017

Zero-Shot Learning of SAR Target Feature Space With Deep Generative Neural Networks

Qian Song, Student Member, IEEE, and Feng Xu, Senior Member, IEEE

Abstract— Zero-shot learning (ZSL) is of critical importance for practical synthetic aperture radar (SAR) automatic target recognition (ATR), as training samples are not always available for all targets and all observation configurations. We propose a novel generative deep neural network framework for ZSL of SAR ATR. The key component of the framework is a generative deconvolutional neural network, referred to as the generator. It learns a faithful hierarchical representation of known targets while automatically constructing a continuous SAR target feature space spanned by orientation-invariant features and the orientation angle. This space is then used as a reference to design and initialize an interpreter convolutional neural network, which is inversely symmetric to the generator network. The interpreter network is trained to map any input SAR image, including those of unseen targets, into the target feature space. In a preliminary experiment with the Moving and Stationary Target Acquisition and Recognition data set, seven targets are used to train the generator and interpreter networks. The eighth target is then used to test the interpreter, which correctly maps it to a reasonable spot in the space spanned by the previous seven targets and also estimates its orientation.

Index Terms— Deep generative neural network, orientation-invariant feature space, synthetic aperture radar (SAR).

Manuscript received July 21, 2017; revised September 5, 2017 and September 27, 2017; accepted September 28, 2017. Date of publication November 10, 2017; date of current version December 4, 2017. This work was supported in part by the National Key R&D Program of China under Grant 2017YFB0502700 and in part by NSFC under Grant 61571134. (Corresponding author: Feng Xu.) The authors are with the Key Laboratory for Information Science of Electromagnetic Waves (MoE), Fudan University, Shanghai 200433, China (e-mail: fengxu@fudan.edu.cn). Color versions of one or more of the figures in this letter are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LGRS.2017.2758900

I. INTRODUCTION

SYNTHETIC aperture radar (SAR) images are very different from optical images and are difficult to interpret because of the microwave wavelength employed and the phase-coherent nature of SAR imaging. Nevertheless, a SAR image contains rich information about the target and scene under observation, e.g., geometry, material, and structure. Human interpretation of SAR imagery, which needs experienced experts to find small targets in massive SAR images, is challenging, time consuming, and impractical in the big-data era of remote sensing [1]. After being trained on enough samples, deep learning can imitate the mechanism of the human brain, learn the latent features of targets, and help machines interpret SAR images automatically. It has revolutionized first the computer vision area and then many other machine learning areas, including SAR image interpretation, e.g., automatic target recognition (ATR) [2], terrain surface classification [3], and parameter inversion [4]. Chen et al. [2] first applied a convolutional neural network (CNN) to SAR ATR and achieved a state-of-the-art accuracy of 99% on the ten-class supervised classification task of the Moving and Stationary Target Acquisition and Recognition (MSTAR) benchmark data set.

Although powerful in classification, a CNN, as a supervised discriminative network, is highly sensitive to the selection and size of the training samples and is thus susceptible to overfitting; it is particularly well known for its weakness in generalization. Its performance degrades rapidly when applied to different observation configurations or when tasked to discriminate variant target types. It becomes completely useless for new target types that have never been seen in the training samples, even though recognizing such targets is a practical need in many applications. This is a major hindrance to the practical application of CNNs to SAR ATR.

In the deep learning regime, this is an important research topic, which has motivated the so-called zero-shot learning (ZSL) [5], [6] and one-shot learning (OSL) [7]. ZSL requires no training samples, while OSL requires only a few training samples or just one. ZSL/OSL is often achieved by semantic representation, which maps low-level image features into a mid-to-high-level space, where the distance calculated between two samples reveals their relationship. Usually, semantic representations can be categorized into two types: attribute-based binary vectors [8] and word-based continuous vectors [9]. The former requires manually defining the attributes of targets and then assigning the corresponding labels to each training sample, which is too costly in terms of annotation. The latter works only for common label words with simple meanings, e.g., car, airplane, and dog. This is not the case for specific classes of SAR targets, such as the eight vehicles in the MSTAR data set, e.g., T72 and T62.

Generative neural networks [10], [11], as opposed to discriminative neural networks, are often used for unsupervised learning such as representation learning. This letter addresses the ZSL problem of SAR ATR using deep generative neural networks. It learns to mimic SAR images by training a generative deep neural network (DNN), which automatically learns a hierarchical representation of SAR target features from the given SAR images. In the meantime, it constructs a continuous SAR target feature space in which the target orientation and the orientation-invariant intrinsic features are disentangled. Such a feature space is spanned by the known target samples and is used as the reference to interpret targets that have never been seen in the training samples. Thereafter, a deep CNN is trained to continuously map any input SAR image to the learned target feature space. Thus, it can faithfully interpret unseen target images.

The remainder of this letter is organized as follows. The overall architecture is explained in Section II, while Section III presents the results of the generative neural networks on the MSTAR data set. Section IV gives the results of the interpreter CNN. Section V concludes this letter.

II. OVERALL ARCHITECTURE

Suppose we have c_s known targets with n_s samples S = {c_s, v_s, F_s, X_s} and c_t unseen targets with n_t samples T = {v_t, F_t, X_t}, where X denotes the SAR images and c and v are the label and orientation information of the targets. The key idea of this letter is to construct a feature space and learn the mapping from SAR images to orientations and orientation-invariant features F_s using c_s, v_s, and X_s of S. The learned mapping can then be used to infer v_t and F_t of the unseen targets T given X_t.

Fig. 1. Proposed framework of the generative DNN-based SAR target feature space construction and interpretation.

Fig. 1 illustrates the overall framework of DNNs for the proposed ZSL SAR ATR. In the forward generative direction, the inputs are the discrete labels and the orientation information of the targets. The labels are first fed into a fully connected constructor NN, which constructs a continuous target space spanned by orientation-invariant features; these are then concatenated with the orientation information to form the complete feature vector. Subsequently, this complete feature vector is fed into a deconvolutional DNN, i.e., the generator DNN, which generates pseudo-SAR images of the corresponding target at a particular orientation. The generative networks, i.e., both the constructor and the generator DNNs, are trained using SAR images of known targets. They learn from the data to construct a continuous feature space in which all known targets find their corresponding positions.

Once the orientation-invariant feature vector is constructed, it is used as the goal to train the inverse direction, i.e., from an input SAR image to its feature space. The inverse direction is a simple CNN that maps a SAR image to the orientation-invariant feature space and at the same time extracts its orientation information. Such a CNN is referred to as the interpreter deep neural network in this letter, and it is trained on real SAR images of known targets.

The goal of the whole network is to construct the interpretable feature space F automatically. In previous works, F was often defined manually or semimanually, or learned automatically from available textual documents; in our framework, it is learned automatically from SAR images using the constructor–generator framework. The constructor derives the distribution of F given the target labels c, and the interpreter DNN estimates the distribution of F given real SAR images, with the goal of

\min \; L\big(p(F \mid c) \,\|\, p(F \mid X_{x \sim D})\big). \qquad (1)

To obtain p(F|c), the generator DNN minimizes a loss function, defined as the L2 norm between the generated and the actual images, with respect to the parameters θ_c and θ_g of the constructor and generator DNNs

\min_{\theta_c, \theta_g} L\big(G(C(c; \theta_c), v; \theta_g) \,\|\, X\big). \qquad (2)

The learned θ_g then acts as prior knowledge for the interpreter DNN. Note that, in order to ensure that the feature space is physically sound, we pretrain the constructor NN on the training samples by manually defining a prototype F_c^0 for each class; thus, θ_c is initialized by

\min_{\theta_c} L\big(C(c; \theta_c) \,\|\, F_c^0\big). \qquad (3)

After obtaining p(F|c), the interpreter aims at minimizing the loss defined between p(F|c) and p(F|X_{x∼D}). The whole network is trained with only a limited number of target types, i.e., the c_s classes of S; however, the trained interpreter DNN can also be utilized to interpret other types of targets, such as the c_t types of T, which is the principle of ZSL.
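To make these objectives concrete, the following is a minimal PyTorch-style sketch of the image-reconstruction loss in (2) and the prototype pretraining loss in (3). It is our own illustration rather than the authors' implementation; the `constructor` and `generator` modules, the tensor shapes, and the prototype tensor are assumed placeholders.

```python
import torch
import torch.nn.functional as F_nn

def generator_loss(constructor, generator, c_onehot, v, x_real):
    """Eq. (2): L2 distance between generated and real SAR image chips.

    c_onehot : (B, n_classes) one-hot target labels
    v        : (B, 2) orientation vectors [cos(phi), sin(phi)]
    x_real   : (B, 1, H, W) real SAR images of the known targets
    """
    f = constructor(c_onehot)          # (B, 2) orientation-invariant features
    x_fake = generator(f, v)           # (B, 1, H, W) pseudo-SAR images
    return F_nn.mse_loss(x_fake, x_real)

def constructor_pretrain_loss(constructor, c_onehot, f_prototype):
    """Eq. (3): pull C(c; theta_c) toward a manually defined class prototype F0_c."""
    return F_nn.mse_loss(constructor(c_onehot), f_prototype)
```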

Compared to conventional generative neural networks, the proposed framework includes a supervised constructor NN, which maps known labels to a continuous intrinsic feature space. The constructor NN is trained simultaneously with the deconvolutional generator network. The key is to take advantage of hierarchical representation learning to construct the desired feature space.

III. CONSTRUCTOR–GENERATOR DNN OF SAR IMAGES

The proposed framework is implemented on the well-known SAR ATR benchmark, the MSTAR data set [12], which is widely used for SAR target recognition and classification tasks. In this letter, eight types of vehicle targets, observed by an X-band one-foot-resolution SAR over 360° of orientation, are used. For each type, there are ∼300 images. The orientation (i.e., aspect angle) of each SAR image can be obtained from the header of its binary file. To test the ZSL capability, only seven types (i.e., c_s = 7) of targets are used in training; the eighth type is used for testing. This mimics the fact that not all targets are available as training samples, which is the key challenge for ZSL.
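The zero-shot split described here (train on seven vehicle types, hold out the eighth entirely) can be expressed as a small helper; the dictionary-of-file-lists input format is an assumed convention for illustration, not the authors' data layout.

```python
def zero_shot_split(samples_by_class: dict, unseen_class: str):
    """Split MSTAR-style samples so that one class is never seen during training.

    samples_by_class : mapping from class name to a list of image file paths
    unseen_class     : the held-out class (e.g., "T72")
    """
    train = {k: v for k, v in samples_by_class.items() if k != unseen_class}
    test = {unseen_class: samples_by_class[unseen_class]}
    return train, test
```

For example, `train, test = zero_shot_split(files, "T72")` would reproduce the seven-versus-one protocol used in this letter.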
SONG AND XU: ZSL OF SAR TARGET FEATURE SPACE 2247

Fig. 2. Constructor–generator network for MSTAR data.

The forward-pass constructor–generator network architecture is illustrated in Fig. 2. The inputs of the network are the one-hot target label c and the orientation vector v of the target. In this case, the constructor is a two-layer fully connected network, with the first layer consisting of 20 neurons and the output layer consisting of two neurons corresponding to a 2-D feature space. Since the number of available target types is limited, the feature space is designed to be 2-D, which is also easier to visualize. Note that the feature space can be of higher dimension if a more sophisticated feature partition is needed. The label c is mapped to a continuous orientation-invariant feature space F via this constructor NN C(c; θ_c). On the other side, the orientation information is represented as a 2-D vector v = [cos φ, sin φ]^T, where φ ∈ [0°, 360°) denotes the orientation angle. Note that the sin/cos representation is employed to reflect the 360° periodicity [10].

The orientation-invariant feature F and the orientation-only feature v are concatenated and fed into the generator DNN, which works as an image generator. The generator consists of a three-layer fully connected network followed by a deep deconvolutional network. For each deconvolutional layer, the up-pooling operation is realized by setting the stride to 2. The final output image is matched with the real SAR image using an L2 loss function.
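The architecture described in this section can be summarized in a PyTorch sketch. Only the constructor widths (20 and 2 neurons) and the stride-2 deconvolutions are taken from the text; the fully connected widths, the channel counts, and the 64 × 64 output size are our assumptions, since the letter does not list them.

```python
import torch
import torch.nn as nn

class Constructor(nn.Module):
    """Maps a one-hot label c to a 2-D orientation-invariant feature F."""
    def __init__(self, n_classes: int = 7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_classes, 20), nn.ReLU(), nn.Linear(20, 2))

    def forward(self, c):
        return self.net(c)

class Generator(nn.Module):
    """Maps the concatenated [F, v] (2 + 2 dims) to a pseudo-SAR image."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(               # three fully connected layers (widths assumed)
            nn.Linear(4, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 256 * 4 * 4), nn.ReLU(),
        )
        self.deconv = nn.Sequential(           # stride-2 deconvolutions double the map size each layer
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 16 -> 32
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),                # 32 -> 64 output image
        )

    def forward(self, f, v):
        h = self.fc(torch.cat([f, v], dim=1)).view(-1, 256, 4, 4)
        return self.deconv(h)
```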
The constructor–generator network is then trained via regular stochastic gradient descent using all training samples of the seven known target types. All the weights are randomly initialized except θ_c, which is initialized by (3).

Fig. 3. Visualization of generated images during training.

Fig. 4. Convergence of MSE during training.

Fig. 3 shows the generated SAR images during the training process. The network converges after 200 epochs. It is interesting to find that the network first learns the location of the targets (e.g., at the second epoch), then learns the general characteristics of the targets (e.g., at the 72nd epoch, the target shape and shadow area are observable), and finally learns the texture of the target image (e.g., at the 177th epoch). Fig. 4 shows the mean squared error (MSE) between the real and generated SAR images at different training epochs. The network is trained on one TITAN X GPU, and training takes about 2 s per epoch.

Fig. 5. Comparison between real SAR images and generated images.

Fig. 5 compares nine samples of the real SAR images with the corresponding generated SAR images. It is interesting to note that the generated SAR images appear smoother and speckle free. However, they preserve the distinct features of the target itself, e.g., the target geometric profile, shadow, and brightness. Apparently, the generator has learned to capture the intrinsic characteristics of the data set. This capability could potentially be exploited for speckle reduction applications.

Fig. 6. Autoconstructed target feature space as spanned by the seven targets included in the training samples.

Now let us look at the constructed orientation-invariant feature space spanned by the 2-D F. The seven types of targets as mapped to the feature space are shown in Fig. 6, along with the optical images of each target. The constructed feature space is not directly linked to any physical parameters. However, it faithfully reflects the distances or similarities among different targets. For example, BRDM2, BTR60, and 2S1 are located nearby, as all three are armored vehicles. D7, the backhoe, and ZIL131, the truck, are located farthest away from the rest of the targets, as they differ greatly in either shape or scale. Note that the learned distribution might be sensitive to the network initialization, but the basic topology reflecting the mutual distances among different targets should be stable.

Fig. 7. Generated arbitrary target full-aspect SAR images.

To demonstrate the efficacy of the generator DNN, we sample arbitrarily from the feature space and then examine the generated fake SAR images. For example, we can sample from the orientation-invariant feature space a point between any two targets. Then, full-aspect SAR images of such an imaginary target can be generated. Fig. 7 shows examples of generating 0°–360° SAR images of imaginary targets between D7 and T62, between T62 and 2S1, and between 2S1 and ZIL131, respectively. We believe that the generative DNN is crucial in constructing a continuous feature space that faithfully reflects the features of each target and the distances among them.
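The sampling experiment of Fig. 7 can be imitated with a few lines of code: linearly interpolate between the feature-space positions of two known targets and sweep the orientation vector through a trained generator with the call signature of the earlier sketch. The midpoint weight and the 10° step are arbitrary illustrative choices.

```python
import math
import torch

@torch.no_grad()
def imaginary_target_sweep(generator, f_a, f_b, alpha: float = 0.5, step_deg: float = 10.0):
    """Generate full-aspect images of an imaginary target between two known ones.

    f_a, f_b : (2,) feature-space positions of two known targets
    alpha    : interpolation weight between the two targets
    """
    f_mix = (1.0 - alpha) * f_a + alpha * f_b
    images = []
    for phi_deg in torch.arange(0.0, 360.0, step_deg):
        phi = math.radians(float(phi_deg))
        v = torch.tensor([math.cos(phi), math.sin(phi)])
        img = generator(f_mix.unsqueeze(0), v.unsqueeze(0))  # (1, 1, H, W)
        images.append(img.squeeze(0))
    return torch.stack(images)  # (n_angles, 1, H, W) full-aspect image stack
```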
IV. INTERPRETER DNN FOR UNSEEN SAR TARGET

Fig. 8. Inverse interpreter DNN for MSTAR data.

Once we have constructed the target feature space by training the constructor–generator DNN, the interpreter DNN can be easily designed and trained. The interpreter DNN is symmetric to the generator network, but in the backward direction (Fig. 8). It is a CNN, as opposed to the generative DNN. Its architecture is the same as that of the generator DNN except that the input and output are swapped at each layer. In addition, the weights of the interpreter DNN are initialized with the same weights as the generator DNN. This serves as a weak regularization binding the inverse interpreter to the generator, whose SAR representative power has already been tested.

The key difference between the interpreter DNN and a conventional supervised classifier DNN is that the target output is a continuous feature space instead of discrete labels. Another key factor of the interpreter DNN is that it has to be the inverse network of a successfully trained generator network. This guarantees that the interpreter will correctly map an input SAR image to the continuous feature space. Again, the generative DNN plays a critical role in constructing and training the interpreter DNN.
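A minimal sketch of the interpreter training loss follows, under the assumption that the interpreter outputs a four-dimensional vector [F, v] and that the two regression terms are weighted equally (the letter does not specify the weighting). The initialization of the interpreter from the generator weights is omitted here, since the exact layer-by-layer mapping is not given.

```python
import torch
import torch.nn.functional as F_nn

def interpreter_loss(interpreter, constructor, x_real, c_onehot, v_true):
    """Train the interpreter to recover [F, v] from a real SAR image.

    The regression target for F is the constructor output C(c), so the
    interpreter is pulled toward the feature space built on the generator side.
    """
    with torch.no_grad():
        f_target = constructor(c_onehot)   # (B, 2) target feature-space position
    pred = interpreter(x_real)             # (B, 4): predicted [F, v]
    f_pred, v_pred = pred[:, :2], pred[:, 2:]
    return F_nn.mse_loss(f_pred, f_target) + F_nn.mse_loss(v_pred, v_true)
```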
Fig. 9. Distribution of test samples of known targets as interpreted in the feature space.

Fig. 10. Orientation angle estimation of test samples of known targets. (Left) Scatterplot. (Right) Error histogram.

The interpreter DNN is trained using 80% of the real SAR images of the seven known types, and the remaining 20% is used to validate the fitness of the network. The GPU takes 36 s to train the network and 0.2 s to test one SAR image. Fig. 9 shows the distribution of the 20% validation samples as interpreted in the orientation-invariant feature space. It can be seen that the seven known types are well separated and clustered around their respective centers. An overall classification accuracy of 96.8% can be achieved if we simply apply nearest-neighbor clustering. Fig. 10 shows the estimated orientation angle versus the true orientation angle. The standard deviation of the orientation estimation error is 16°.
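The two evaluation results reported here can be computed as sketched below; interpreting "nearest neighbor clustering" as assignment to the nearest known-class center (the mean interpreted feature of each training class) is our assumption.

```python
import torch

def classify_nearest_center(f_pred, class_centers):
    """Assign each interpreted feature to the nearest known-class center (Fig. 9 style).

    f_pred        : (N, 2) interpreted features of test images
    class_centers : (K, 2) mean feature of each known class
    """
    d = torch.cdist(f_pred, class_centers)   # (N, K) Euclidean distances
    return d.argmin(dim=1)                   # predicted class index per sample

def orientation_error_deg(v_pred, phi_true_deg):
    """Decode phi from [cos, sin] with atan2 and wrap the error into [-180, 180)."""
    phi_pred = torch.rad2deg(torch.atan2(v_pred[:, 1], v_pred[:, 0]))
    return (phi_pred - phi_true_deg + 180.0) % 360.0 - 180.0
```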
Finally, let us examine the ZSL capability of the interpreter network. We pick the T72 target from the remaining target types, which are not included in the training samples. This means that the interpreter has never seen any SAR image of a T72. In the ideal case, we want the interpreter network to be able to interpret T72 SAR images in the context of the seven known targets.

Fig. 11. T72 SAR images as interpreted by the zero-shot learned interpreter DNN. (a) Distribution of the "new target" T72 in the feature space. (b) Similarity of T72 subtypes A04 and A05 to T62 and 2S1. (c) Orientation estimation error of T72-A04 and T72-A05.

Feeding 0°–360° SAR images of the T72 (subtypes A04 and A05) into the interpreter generates the corresponding scatter points in the feature space as well as the corresponding orientation angles. Fig. 11(a) shows the clusters of T72-A04 and T72-A05 in the target feature space. Both T72 subtypes are distributed among T62, 2S1, BTR60, and BRDM2, reflecting the fact that the T72 looks similar to these armored vehicles. The T72-A05 is shifted slightly toward T62, while the T72-A04 is closer to 2S1. In Fig. 11(b), we compare the optical and SAR images of these four targets, from which we believe that the interpreter DNN is able to accurately reflect the similarity/dissimilarity of a new unseen target to known targets. Thus, one could roughly say that the unseen target T72 is an armored vehicle similar to T62 and 2S1 but not a truck like ZIL131 or a backhoe like D7. Note that the clusters of unseen targets are more dispersed, which means that the interpreter is more uncertain about unseen targets.

On the other hand, the interpreter can also extract the orientation angle of an unseen target. Fig. 11(c) shows the error histograms of T72-A04 and T72-A05. The interpreter performs reasonably well in this task, with a minor degradation in terms of standard deviation error (∼35°).

V. CONCLUSION

ZSL or OSL is of particular importance for SAR ATR applications due to the lack of training samples for many targets and many observation configurations. This letter proposed a generative DNN-based framework for ZSL of SAR ATR, which was demonstrated on the MSTAR data set, where samples of only seven targets were used in training and the eighth target type was used for testing. The generator DNN employs an L2-norm loss function and successfully learns a hierarchical representation of the selected seven types of SAR target images. From the generated images of the training targets and the sampled images of imaginary targets, we see that the generator DNN helps to construct a continuous feature space spanned by the selected seven types. The generator DNN plays a critical role not only in constructing a physically sound feature space but also in initializing and regularizing the interpreter DNN, which is later trained to interpret unseen targets in the context of known targets. The interpreter DNN is inversely symmetric to the generator DNN, and it is trained to map an input SAR image to the feature space. The interpretation results of the unseen T72 targets demonstrated the efficacy of the proposed ZSL framework: it correctly maps the T72 to the feature space and estimates its orientation angle.

REFERENCES

[1] F. Xu, Y.-Q. Jin, and A. Moreira, "A preliminary study on SAR advanced information retrieval and scene reconstruction," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 10, pp. 1443–1447, Oct. 2016.
[2] S. Chen, H. Wang, F. Xu, and Y.-Q. Jin, "Target classification using the deep convolutional networks for SAR images," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 8, pp. 4806–4817, Aug. 2016.
[3] Y. Zhou, H. Wang, F. Xu, and Y.-Q. Jin, "Polarimetric SAR image classification using deep convolutional neural networks," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 12, pp. 1935–1939, Dec. 2016.
[4] L. Wang, K. A. Scott, L. Xu, and D. A. Clausi, "Sea ice concentration estimation during melt from dual-pol SAR scenes using deep convolutional neural networks: A case study," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 8, pp. 4524–4533, Aug. 2016.
[5] C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, "Zero-shot learning through cross-modal transfer," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 935–943.
[6] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong, "Transductive multi-view zero-shot learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 11, pp. 2332–2345, Nov. 2015.
[7] L. Fei-Fei, R. Fergus, and P. Perona, "One-shot learning of object categories," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 594–611, Apr. 2006.
[8] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong, "Learning multimodal latent attributes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 2, pp. 303–316, Feb. 2014.
[9] A. Frome et al., "DeViSE: A deep visual-semantic embedding model," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 2121–2129.
[10] A. Dosovitskiy, J. T. Springenberg, M. Tatarchenko, and T. Brox. (2015). "Learning to generate chairs, tables and cars with convolutional networks." [Online]. Available: https://arxiv.org/abs/1411.5928
[11] I. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[12] The Air Force Moving and Stationary Target Recognition Database. Accessed: Sep. 2013. [Online]. Available: https://www.sdms.afrl.af.mil/datasets/mstar/
