Article

Two-Stream Deep Fusion Network Based on VAE and CNN for Synthetic Aperture Radar Target Recognition

National Laboratory of Radar Signal Processing, Xidian University, Xi’an 710071, China
* Author to whom correspondence should be addressed.
Remote Sens. 2021, 13(20), 4021; https://doi.org/10.3390/rs13204021
Submission received: 17 August 2021 / Revised: 22 September 2021 / Accepted: 28 September 2021 / Published: 9 October 2021
(This article belongs to the Special Issue Advances in SAR Image Processing and Applications)

Abstract

Radar target recognition methods usually use only a single type of high-resolution radar signal, e.g., high-resolution range profiles (HRRP) or synthetic aperture radar (SAR) images. In fact, the SAR imaging procedure yields both the HRRP data and the corresponding SAR image simultaneously, and although the information contained in the two is not exactly the same, both are important for radar target recognition. Therefore, in this paper, we propose a novel end-to-end two-stream fusion network that makes full use of the different characteristics obtained from modeling HRRP data and SAR images, respectively, for SAR target recognition. The proposed fusion network contains two separate streams in the feature extraction stage: one takes advantage of a variational auto-encoder (VAE) network to acquire the latent probabilistic distribution characteristic from the HRRP data, and the other uses a lightweight convolutional neural network, LightNet, to extract the 2D visual structure characteristics from the SAR images. Following the feature extraction stage, a fusion module integrates the latent probabilistic distribution characteristic and the structure characteristic to reflect target information more comprehensively and sufficiently. The main contribution of the proposed method consists of two parts: (1) different characteristics from the HRRP data and the SAR image are used effectively for SAR target recognition, and (2) an attention weight vector in the fusion module adaptively integrates the different characteristics from the two sub-networks. On the HRRP data and SAR images of the MSTAR and civilian vehicle datasets, the proposed method improved recognition rates by at least 0.96% and 2.16%, respectively, compared with current SAR target recognition methods.

Graphical Abstract

1. Introduction

Synthetic aperture radar (SAR) target recognition is a development of radar automatic target recognition (RATR) technology. Because of the all-weather, all-day and long-distance perception capabilities of SAR, SAR target recognition plays an important role in both military and civil fields [1,2,3,4]. Given the overwhelming amount of SAR data available, SAR target recognition is urgently required and has attracted wide attention worldwide.
As a type of data widely used in RATR [5,6,7,8,9,10], high-resolution range profile (HRRP) data can be obtained simultaneously with the corresponding SAR image in the procedure of SAR imaging. HRRP data obtained from SAR echoes have been widely used for target recognition [11,12]. Figure 1 shows the relationship between HRRP data and the SAR image based on the classical range–Doppler algorithm (RDA) [13]. HRRP data are a 1D distribution of the radar cross section and can be obtained by the modulus operation after range compression of the received SAR echoes. HRRP target recognition has received widespread attention in the RATR community due to the relatively low complexity of signal acquisition [1,2,3,4]. A SAR image is a 2D image of the target derived by coherently processing high-range resolution radar echoes and conducting translational motion compensation by means of range cell migration correction (RCMC). SAR images are easier and more intuitive to understand, being interpretable for human visual perception, as each pixel value reflects the surface microwave reflection intensity.
Feature extraction is an important part of target recognition, and the quality of the extracted features directly affects recognition performance. RATR based on HRRP data and RATR based on SAR images have both gone through a process from manual feature extraction to deep feature extraction [1,2,3,4,14,15], which has led to better recognition performance. However, most existing RATR methods based on HRRP data or SAR images use only a single type of data. As shown in Figure 1, due to the different generation mechanisms, the information contained in the HRRP data and the SAR images is not exactly the same. Since the HRRP data and the SAR images each represent the original SAR echoes from only one aspect, using the two data sources together can yield a more complete representation of the original SAR echoes, whereas modeling a complete interpretation using only unimodal data is theoretically insufficient. Therefore, to reveal more complete information, we set out to formulate a novel framework that fuses the characteristics obtained from modeling HRRP data and the SAR image for radar target recognition. To the best of our knowledge, this is the first time that HRRP data and SAR images have been comprehensively utilized for radar target recognition.
In this paper, we propose an end-to-end two-stream fusion network. The first stream takes the HRRP data as its input and draws support from the VAE, a deep probabilistic model, to effectively extract the latent probabilistic distribution features. The other stream takes the SAR image as its input; in this stream, a lightweight CNN, LightNet, is utilized to extract 2D visual structure features. A fusion module with an attention mechanism is exploited to integrate the different characteristics extracted from the two signal types into a global space, obtaining a single, compact representation for radar target recognition that reflects target information more comprehensively and sufficiently. In the fusion module, an automatically learned attention weight vector is used to adaptively integrate the different characteristics, controlling the contribution of each feature to the overall output feature on a per-dimension basis and remarkably improving recognition performance. Finally, the fused feature is fed into a softmax layer to predict the classification results. More specifically, the main contributions of the proposed two-stream deep fusion network for target recognition are as follows:
  • Considering that both the SAR image and the corresponding HRRP data, whose information content is not exactly the same, can be obtained simultaneously in the procedure of SAR imaging, we apply two different sub-networks, VAE and LightNet, in the proposed deep fusion network to mine the different characteristics from the average profiles of the HRRP data and from the SAR image, respectively. Through joint utilization of these two types of characteristics, the target representation is more comprehensive and sufficient, which is beneficial for the target recognition task. Moreover, the proposed network is a unified framework which can be jointly optimized end to end.
  • To integrate the latent feature of the VAE and the structure feature of LightNet, a novel fusion module is developed in the proposed fusion network. The proposed fusion module takes advantage of the latent feature and the structure feature to automatically learn an attention weight, which is then used to adaptively integrate the two features. Compared with the plain concatenation operator, the proposed fusion module achieves better recognition performance.
The rest of this paper is arranged as follows. Section 2 reviews related work on RATR based on HRRP data and SAR images. Section 3 introduces the novel two-stream fusion network. In Section 4, experiments based on measured radar datasets and their corresponding analysis are presented to verify the target recognition performance of the proposed two-stream fusion method. Finally, the conclusions are presented in Section 5.

2. Related Work

2.1. Radar Target Recognition

Traditional radar target recognition methods are mostly based on manual feature extraction. These hand-crafted features are inappropriate when there is insufficient prior knowledge of the application. Meanwhile, these features are mainly lower-level representations, e.g., textural features and local physical structural features, which cannot represent higher-level, abstract information.
Recently, deep learning has made progress by leaps and bounds in computer vision tasks due to its powerful representation capacity.
In HRRP target recognition, owing to the successful application of deep neural networks in various tasks, several deep neural networks have been developed for HRRP data. Some works focus on selecting suitable networks for HRRP recognition, such as the stacked auto-encoder (SAE) [14], the denoising auto-encoder (DAE) [5] and the recurrent neural network (RNN) [16,17]. Other works focus on how to use HRRP data reasonably, for example by using the average profile of HRRP data or sequential HRRP data [18]. Nevertheless, the above-mentioned neural networks for HRRP recognition only obtain point estimates of the latent features and lack a description of the underlying probabilistic distribution. Considering that HRRP data do have statistical distribution characteristics, as described in [19,20,21,22,23,24], probabilistic statistical models have been exploited to describe the underlying probabilistic distribution; they can use prior information on a solid theoretical basis, and an appropriate prior will enhance model performance. Meanwhile, probabilistic statistical models possess robustness and flexibility in modeling [25]. At present, several probabilistic statistical models have been developed to describe HRRP data [23,26,27,28]. Nevertheless, traditional probabilistic models need to preset the distribution pattern of the data, such as a Gaussian or Gamma distribution, which is relatively simple and limits the ability to fit the original data distribution [29]. In addition, since traditional probabilistic models are based on shallow architectures with simple linear mapping structures, they are only good at learning linear features. Different from traditional probabilistic models, the VAE [6,30,31] introduces the neural network into probabilistic modeling. Neural networks stack nonlinear layers to form a deep structure; this nonlinear capability makes the data fitting of the VAE more accurate, which reduces the performance degradation caused by inaccurate data fitting, and the deep structure of the VAE can mine deep latent features of the data with stronger feature separability. Because the VAE has an explicit latent feature to represent the distribution characteristics of the data, the latent variable is often directly used as the representational information of the sample for classification tasks, including HRRP target recognition [7,32,33], and has achieved good performance.
At present, the VAE is a prevailing generative model, and the generative adversarial network (GAN) is also a well-known and popular generative model. Although the VAE and the GAN both belong to generative models and are usually mentioned together, they differ in many aspects. In the VAE, there is an explicit latent feature that represents the distribution characteristics of the data; therefore, in practical applications of the VAE, in addition to sample generation, the latent variable is often directly used as the representational information of the sample for classification and recognition tasks. In contrast, restricted by its inherent mechanism, the GAN has no explicit feature that represents the distribution characteristics of the data, so applications of the GAN focus on sample generation and transfer learning.
In the target recognition of SAR images, the auto-encoder (AE) [1,3] and the RBM [2], two widely used unsupervised deep neural network structures, have also been employed and achieve good performance. Among deep neural networks, the CNN has become the dominant deep learning approach, as in the VGG network [34] or ResNet. CNN architectures usually comprise multiple convolutional layers (each followed by an activation layer), pooling layers, and one or more fully connected layers. In CNNs, the local connections and weight sharing of the convolution operation and the pooling operation effectively reduce the number of parameters and the complexity, resulting in invariance to translation and distortion, which makes the learned features more robust [4]. Another advantage of CNNs is that they can utilize convolution kernels to extract 2D visual structure information, from the apparent to the abstract, through layer-by-layer learning. This visual structure information plays a vital role in image recognition [35,36,37].
In this paper, VAE and CNN are used as sub-networks for the HRRP data and the SAR image, respectively.

2.2. Information Fusion

In recent years, with the development of sensor technology, the diversity of information forms, the quantity of information, the complexity of information relations, and the demands for timeliness, accuracy and reliability in information processing have become unprecedented. Therefore, information fusion technology has developed rapidly. Information fusion denotes the process of combining data from different sensors or information sources to obtain new or more precise knowledge on physical quantities, events or situations [38].
According to the abstraction level of the information, information fusion methods can be divided into three categories: data-level fusion [39], feature-level fusion [40] and decision-level fusion [41]. Data-level and decision-level fusion are the two most easily implemented information fusion methods, but their performance improvements are also limited. Recently, it has also become an important research topic to comprehensively and effectively use the various kinds of information in radar data, such as multi-temporal [42] and multi-view [43] data, to achieve better model performance. An inverse synthetic aperture radar (ISAR) target recognition method based on both range profile (RP) data and ISAR images was proposed, based on decision-level fusion of the classification results of the RP data and the ISAR images [44]. Feature-level fusion is the most effective method of information fusion, and it is often used as an effective means to improve performance in deep learning research. Several works focusing on image segmentation also use feature-level fusion to fuse multi-level features [45,46,47]. However, those works fuse features of the same data at different scales, while this paper fuses features extracted from different data through their respective feature extraction networks.

3. Two-Stream Deep Fusion Network Based on VAE and CNN

The framework of the proposed two-stream deep fusion network for target recognition is depicted in Figure 2. As shown in Figure 2, the framework is briefly introduced as follows.
  • Data acquisition: as can be seen from Figure 1, the complex-valued high-range resolution radar echoes can be obtained after range compression of the received SAR echoes. The HRRP data are then obtained through the modulus operation. At the same time, based on the complex-valued high-range resolution radar echoes, the complex-valued SAR image is obtained through azimuth focusing processing, and the real-valued SAR image commonly used for target recognition is obtained by taking the modulus of the complex-valued SAR image.
  • VAE branch: based on the HRRP data, the average profile of the HRRP is obtained by preprocessing. Then, the average profile is fed into the VAE branch to acquire the latent probabilistic distribution as a representation of the target information.
  • LightNet branch: the other branch takes the SAR image as input and draws support from a lightweight convolutional architecture, LightNet, to extract the 2D visual structure information as another essential representation of the target information.
  • Fusion module: the fusion module is employed to integrate the distribution representation and the visual structure representation to reflect more comprehensive and sufficient information for target recognition. The fusion module merges the VAE branch and the LightNet branch into a unified framework which can be trained in an end-to-end manner.
  • Softmax classifier: finally, the integrated feature is fed into a usual softmax classifier to predict the category of target.
In Section 3.1, Section 3.2, Section 3.3, Section 3.4, Section 3.5 and Section 3.6, some important components, including the acquisition of the HRRP data and the real-valued SAR image from high-range resolution echoes, the VAE branch, the LightNet branch, the fusion module, the loss function and the training procedure, are introduced concretely.

3.1. Acquisition of the HRRP Data and the Real-Valued SAR Image from High-Range Resolution Echoes

Figure 1 in the Introduction gives the data acquisition procedure of the HRRP data and the real-valued SAR image from the received SAR echoes based on the RDA. The received SAR echoes are obtained from the radar-received signals through dechirping and matched filtering. The RDA SAR imaging algorithm can be divided into two steps: range focusing processing and azimuth focusing processing. The range focusing processing includes, in turn, the range fast Fourier transformation (FFT), range compression and the range IFFT, after which the high-range resolution radar echoes are obtained. The azimuth focusing processing includes, in turn, the azimuth FFT, RCMC, azimuth compression and the azimuth IFFT.
Based on the high-range resolution radar echoes, the HRRP data are obtained through the modulus operation. At the same time, based on the complex-valued high-range resolution radar echoes, the complex-valued SAR image is obtained through the azimuth focusing processing described above, and the real-valued SAR image commonly used for target recognition is obtained by taking the modulus of the complex-valued SAR image. According to this description of the SAR imaging procedure, the complex-valued SAR image is obtained from the high-range resolution radar echoes. Furthermore, given the complex-valued SAR image, the corresponding high-range resolution radar echoes and HRRP data can also be acquired [48,49].
Considering the mechanism inherent in the modulus operation, the modulus operation for generating the HRRP data and the modulus operation for generating the real-valued SAR image have different information loss characteristics. Therefore, although the HRRP data and the real-valued SAR images used in the proposed method keep a one-to-one correspondence, they can no longer be converted into each other after the modulus operation. In other words, the information contained in the HRRP data and the real-valued SAR images used in the proposed method is not exactly the same; each can only represent the original high-range resolution radar echoes from one aspect. Therefore, the features extracted from the HRRP data cannot be derived from the SAR images with certainty.
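To make the data flow concrete, the following minimal NumPy sketch mirrors the relationship described above: both data products come from the same range-compressed complex echoes, the HRRP data via a modulus operation and the real-valued SAR image via azimuth focusing followed by a modulus operation. The echo values, array sizes and the azimuth matched filter are illustrative placeholders only; a real RDA chain would also apply RCMC before azimuth compression.

```python
# Illustrative sketch of the data flow in Figure 1 / Section 3.1 (not a full RDA implementation).
import numpy as np

def azimuth_focus(range_compressed: np.ndarray, fm_rate: float = 100.0,
                  prf: float = 1000.0) -> np.ndarray:
    """Placeholder azimuth focusing: azimuth FFT -> hypothetical matched filter -> azimuth IFFT."""
    n_az = range_compressed.shape[0]
    f_az = np.fft.fftfreq(n_az, d=1.0 / prf)            # azimuth frequency axis
    matched = np.exp(1j * np.pi * f_az ** 2 / fm_rate)  # hypothetical azimuth matched filter
    spectrum = np.fft.fft(range_compressed, axis=0)
    return np.fft.ifft(spectrum * matched[:, None], axis=0)

rng = np.random.default_rng(0)
# Hypothetical range-compressed complex echoes: (azimuth pulses, range cells).
echoes = rng.standard_normal((128, 256)) + 1j * rng.standard_normal((128, 256))

hrrp = np.abs(echoes)                   # HRRP data: modulus of the complex echoes
complex_image = azimuth_focus(echoes)   # complex-valued SAR image
sar_image = np.abs(complex_image)       # real-valued SAR image used for recognition
```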

3.2. The VAE Branch

Before radar HRRP statistical modeling, there are some issues that should be considered in practical applications. The first is the time-shift sensitivity of HRRP, for which centroid alignment [50] is commonly used as the time-shift compensation technique. Amplitude-scale sensitivity can be eliminated through amplitude-scale normalization, such as $l_2$ normalization. Considering the target-aspect sensitivity [15,32], it has been demonstrated that the average profile has a smoother and more concise signal form than a single HRRP and can better reflect the scattering property of the target in a given aspect-frame. From the perspective of signal processing, the average profile represents the target's stable physical structure information in a frame [8,9,51]. One important characteristic of the average profile is that it can depress the speckle effect of HRRPs. Furthermore, the average profile also suppresses the impact of noise spikes and the amplitude fluctuation property.
According to the literature [8,10,51], the definition of the average profile is
$x_{AP} = \left[ \frac{1}{M}\sum_{i=1}^{M} x_{P_i}(1), \; \frac{1}{M}\sum_{i=1}^{M} x_{P_i}(2), \; \ldots, \; \frac{1}{M}\sum_{i=1}^{M} x_{P_i}(r) \right]^{T} = \frac{1}{M}\sum_{i=1}^{M} x_{P_i}$ (1)
where $\{x_{P_i}\}_{i=1}^{M}$ is an HRRP frame, with the $i$th HRRP sample $x_{P_i} = [x_{P_i}(1), x_{P_i}(2), \ldots, x_{P_i}(r)]^{T}$, and $r$ is the dimension of the HRRP samples.
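As a small illustration of Equation (1), the NumPy sketch below computes the average profile from an HRRP frame that is assumed to be already aligned and amplitude-normalized, as discussed above; the frame shape is a placeholder.

```python
# Sketch of Equation (1): the average profile is the element-wise mean of the
# M HRRP samples in a frame.
import numpy as np

def average_profile(hrrp_frame: np.ndarray) -> np.ndarray:
    """hrrp_frame: array of shape (M, r) holding M aligned HRRP samples of dimension r."""
    return hrrp_frame.mean(axis=0)   # x_AP, shape (r,)

# Example: a frame of 8 random 256-dimensional profiles.
x_ap = average_profile(np.abs(np.random.randn(8, 256)))
print(x_ap.shape)  # (256,)
```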
The VAE holds that the sample space can be generated from a latent variable space, that is, sampling latent variables from a simpler latent variable space can generate the real samples within the sample space. The latent variable in the VAE can describe the distribution characteristics of the data. The framework of the VAE is illustrated in Figure 3. Given the observations $\{x_{AP}^{(n)}\}_{n=1}^{N}$ with $N$ samples, the VAE exploits an encoder model that takes $x_{AP}$ as input and outputs the mean $\mu$ and the standard deviation $\sigma$ of the latent variable $z$. Assuming the encoder model can be represented as $f_{VAE\_E}$ with parameter $\phi$, which is also known as the inference model $q_{\phi}(z|x_{AP})$, the encoder of the VAE can be formulated as follows:
$\mu, \sigma = f_{VAE\_E}(x_{AP}; \phi)$ (2)
Here, the reparametrization trick is adopted to sample from the posterior $z \sim q_{\phi}(z|x_{AP})$ as follows:
$z = \mu + \sigma \odot \epsilon$ (3)
where $\epsilon \sim \mathcal{N}(0, I)$, and $\odot$ represents the element-wise product.
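As a quick illustration of Equation (3), a minimal PyTorch sketch of the sampling step is given below; expressing the sample as a deterministic transform of $\mu$ and $\sigma$ keeps the path differentiable so that gradients can flow back into the encoder.

```python
# Minimal sketch of the reparameterization trick in Equation (3).
import torch

def reparameterize(mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    eps = torch.randn_like(sigma)   # eps ~ N(0, I)
    return mu + sigma * eps         # element-wise product, differentiable w.r.t. mu and sigma
```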
Then, with the latent variable $z$ as the input, the decoder model $f_{VAE\_D}$ with parameter $\theta$ outputs the reconstructed sample $\hat{x}_{AP}$, which can be formulated as follows:
$\hat{x}_{AP} = f_{VAE\_D}(z; \theta)$ (4)
The decoder model is also known as a generative process with the probabilistic distribution $p_{\theta}(x_{AP}|z)$.
The goal of the VAE model is to use the arbitrary distribution $q_{\phi}(z|x_{AP})$ to approximate the true posterior distribution $p_{\theta}(z|x_{AP})$. Formally, as shown in Equation (5), the KL divergence is used to measure the similarity between $q_{\phi}(z|x_{AP})$ and $p_{\theta}(z|x_{AP})$, as follows:
$KL\left(q_{\phi}(z|x_{AP}) \,\|\, p_{\theta}(z|x_{AP})\right) = \log p_{\theta}(x_{AP}) - L_{B}(\theta, \phi; x_{AP})$ (5)
where
$L_{B}(\theta, \phi; x_{AP}) = \mathbb{E}_{q_{\phi}(z|x_{AP})}\left[\log p_{\theta}(x_{AP}|z)\right] - KL\left(q_{\phi}(z|x_{AP}) \,\|\, p_{\theta}(z)\right)$ (6)
is the variational evidence lower bound (ELBO) [52,53].
For the given observations, $p_{\theta}(x_{AP})$ is a constant. Thus, minimizing $KL\left(q_{\phi}(z|x_{AP}) \,\|\, p_{\theta}(z|x_{AP})\right)$ is equivalent to maximizing the ELBO. Therefore, the loss of the VAE on the data $x_{AP}$ can be written as follows:
$L_{VAE}(\theta, \phi; x_{AP}) = -L_{B}(\theta, \phi; x_{AP}) = -\mathbb{E}_{q_{\phi}(z|x_{AP})}\left[\log p_{\theta}(x_{AP}|z)\right] + KL\left(q_{\phi}(z|x_{AP}) \,\|\, p_{\theta}(z)\right)$ (7)
In Equation (7), the first term can be regarded as reconstruction loss, which also can be written as follows:
$-\mathbb{E}_{q_{\phi}(z|x_{AP})}\left[\log p_{\theta}(x_{AP}|z)\right] = \left\| x_{AP} - \hat{x}_{AP} \right\|_{2}^{2}$ (8)
This teaches the decoder to reconstruct the data and incurs a cost if the output of the decoder cannot reconstruct the data accurately. Usually, we can use the $l_2$-norm between the original data $x_{AP}$ and the reconstructed data $\hat{x}_{AP}$ as the reconstruction loss. The second term is the KL divergence between the encoder's distribution $q_{\phi}(z|x_{AP})$ and the prior $p_{\theta}(z)$. Typically, if we let the prior over the latent variables be the centered isotropic multivariate Gaussian $p_{\theta}(z) = \mathcal{N}(z; 0, I)$, the KL divergence in Equation (7) can be computed as follows:
$KL\left(q_{\phi}(z|x_{AP}) \,\|\, p_{\theta}(z)\right) = -\frac{1}{2}\sum_{j=1}^{J}\left(1 + \log \sigma_{j}^{2} - \mu_{j}^{2} - \sigma_{j}^{2}\right)$ (9)
where $\mu_{j}$ and $\sigma_{j}$ represent the $j$th elements of $\mu$ and $\sigma$, respectively, and $J$ denotes the dimensionality of the latent variable.
Then, Equation (7) can be rewritten as follows:
$L_{VAE}(\theta, \phi; x_{AP}) = \left\| x_{AP} - \hat{x}_{AP} \right\|_{2}^{2} - \frac{1}{2}\sum_{j=1}^{J}\left(1 + \log \sigma_{j}^{2} - \mu_{j}^{2} - \sigma_{j}^{2}\right)$ (10)
where $\|\cdot\|_{2}$ denotes the $l_2$-norm.
In practice, the encoder model is implemented with a three-layer, fully connected neural network whose layers have 512, 256 and 128 units, respectively. The decoder model is also implemented with a three-layer, fully connected neural network whose layers have 128, 256 and 512 units, respectively. The dimensions of the latent variable $z$, the mean $\mu$ and the standard deviation $\sigma$ are set to 50.
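For concreteness, the PyTorch sketch below shows one possible realization of the VAE branch with the layer sizes stated above (encoder 512, 256, 128; decoder 128, 256, 512; latent dimension 50) together with the loss of Equation (10). The input dimension, the ReLU activations between hidden layers and the two linear heads for $\mu$ and $\log\sigma^2$ are illustrative assumptions, not implementation details taken from the architecture tables.

```python
# Sketch of the VAE branch (Section 3.2); hyperparameters marked below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEBranch(nn.Module):
    def __init__(self, input_dim: int = 256, latent_dim: int = 50):  # input_dim is assumed
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(128, latent_dim)       # head for mu
        self.fc_logvar = nn.Linear(128, latent_dim)   # head for log(sigma^2)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, input_dim),                # project back to the input dimension
        )

    def forward(self, x_ap: torch.Tensor):
        h = self.encoder(x_ap)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        sigma = torch.exp(0.5 * logvar)
        z = mu + sigma * torch.randn_like(sigma)      # reparameterization, Eq. (3)
        x_hat = self.decoder(z)                       # reconstruction, Eq. (4)
        return z, x_hat, mu, logvar

def vae_loss(x_ap, x_hat, mu, logvar):
    # Equation (10): reconstruction term plus KL divergence to N(0, I).
    recon = F.mse_loss(x_hat, x_ap, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```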

3.3. The LightNet Branch

Among deep neural networks, CNNs have made remarkable progress due to their characteristics of local connection and weight sharing. CNNs take advantage of convolution kernels to extract 2D visual structure information through layer-by-layer processing. Many excellent convolutional network architectures, such as VGG and ResNet, have come to dominate many fields. Nevertheless, considering the limited data volume, these networks still have too many parameters for the task of SAR target recognition. Therefore, we apply a lightweight CNN, called LightNet, which has very few parameters and can achieve comparable performance.
The LightNet architecture is mainly comprised of convolutional layers and pooling layers. Each convolutional layer is followed by a rectified linear unit (ReLU) activation function and a batch normalization layer, which allows the network to use much higher learning rates and be less careful about initialization [54]. The architecture of LightNet is shown in Table 1. LightNet contains only five convolutional layers. The kernel size of the first convolutional layer is 11 × 11, a relatively large kernel chosen to gain a larger receptive field, and the kernel sizes of the following three convolutional layers are 5 × 5. Considering that the fully connected layer usually placed at the final position of a network to transform feature maps into a feature vector has many parameters, we use a convolutional layer with a 3 × 3 kernel and no padding in its place to generate the feature vector from the feature maps. This convolutional layer has fewer parameters than a fully connected layer and, compared with global pooling, can not only learn more abstract features but also adjust the dimension of the feature vector.
The LightNet branch takes the SAR image $x_{I}$ as input and extracts the 2D visual structure information $m$ as another essential representation of the target information. Assuming $f_{LNet}$ represents the LightNet with parameter $\psi_{LNet}$, the LightNet branch can be formulated as follows:
$m = f_{LNet}(x_{I}; \psi_{LNet})$ (11)
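A possible PyTorch realization of such a branch is sketched below. It follows the description above (five convolutional layers, an 11 × 11 first kernel, three 5 × 5 kernels, a final 3 × 3 convolution without padding replacing the fully connected layer, and ReLU plus batch normalization after each convolution), but the channel counts, strides, pooling placement and the 128 × 128 input size are placeholder assumptions; the actual values are those given in Table 1.

```python
# Illustrative LightNet-style branch; layer hyperparameters are assumptions.
import torch
import torch.nn as nn

def conv_block(c_in: int, c_out: int, k: int, **kw) -> nn.Sequential:
    # Convolution followed by ReLU and batch normalization, as described in the text.
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, **kw),
                         nn.ReLU(inplace=True),
                         nn.BatchNorm2d(c_out))

class LightNetBranch(nn.Module):
    def __init__(self, feat_dim: int = 50):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 16, 11, stride=2, padding=5),  # 11x11 kernel, large receptive field
            nn.MaxPool2d(2),
            conv_block(16, 32, 5, padding=2),
            nn.MaxPool2d(2),
            conv_block(32, 64, 5, padding=2),
            nn.MaxPool2d(2),
            conv_block(64, 128, 5, padding=2),
            nn.AdaptiveAvgPool2d(3),                     # reduce to 3x3 (sketch-only choice)
        )
        # 3x3 convolution with no padding in place of a fully connected layer.
        self.to_vector = nn.Conv2d(128, feat_dim, kernel_size=3, padding=0)

    def forward(self, x_img: torch.Tensor) -> torch.Tensor:
        h = self.features(x_img)          # (B, 128, 3, 3)
        v = self.to_vector(h)             # (B, feat_dim, 1, 1)
        return v.flatten(1)               # structure feature m, shape (B, feat_dim)

# Example: a batch of four 128x128 single-channel SAR chips.
m = LightNetBranch()(torch.randn(4, 1, 128, 128))
print(m.shape)  # torch.Size([4, 50])
```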

3.4. Fusion Module

In the feature extraction stage, a VAE model is employed on the HRRP data to extract the latent probabilistic distribution information as one feature, and the lightweight LightNet is used on the SAR image to extract the structure feature. In the neural network framework, the most common feature fusion approaches are concatenation and element-wise addition. The concatenation operation combines multiple original features along the feature dimension to generate a fused feature whose dimension equals the sum of the original feature dimensions. Although it is simple to realize, the dimension of the fused feature is relatively high, which puts greater pressure on the subsequent classifier, including an increase in the number of parameters and in the cost of optimizing them. Element-wise addition is also a common feature fusion method: the fused feature is obtained by adding the original features element by element, which keeps the dimension consistent with the original features and requires fewer parameters in the subsequent classifier than concatenation. In essence, however, element-wise addition assumes that the importance of the different features is the same.
To reflect the target information more comprehensively and sufficiently, a novel fusion module is exploited to integrate the latent feature obtained from the VAE and the structure feature obtained from LightNet, which also merges the two streams into a unified framework with end-to-end joint optimization. The proposed fusion module is a further extension of element-wise addition inspired by the gated recurrent unit (GRU) [55]. On the one hand, we use an attention weight vector, not a single value, to integrate the different features; that is, we no longer assume that every dimension of a feature vector shares the same weight, but rather each dimension has its own weight coefficient. By considering the differences in the importance of each feature more carefully, the influence of features that contribute more to the target task on the fused feature is increased, and likewise the influence of features that contribute less is weakened. On the other hand, compared with traditional, empirically set weight values, the attention weight vector is learned automatically according to the target task, which allows adaptive adjustment of the feature weights with the samples and categories.
Figure 4 shows the flowchart of the fusion module. At first, the latent feature $z$ and the structure feature $m$ are fed into fully connected layers, respectively, to generate the features $\tilde{Z} \in \mathbb{R}^{d \times 1}$ and $\tilde{M} \in \mathbb{R}^{d \times 1}$:
$\tilde{Z} = \mathrm{ReLU}(W_{Z} z), \quad \tilde{M} = \mathrm{ReLU}(W_{M} m)$ (12)
where the features $\tilde{Z}$ and $\tilde{M}$ have the same dimension $d$, which was set to 50 in the experiments, $\mathrm{ReLU}$ denotes the ReLU activation operation, and $W_{Z}$ and $W_{M}$ are the parameters of the respective fully connected layers. Here, the fully connected layers are applied not only to further map the two features into a global space, but also to make their dimensions and element ordering correspond consistently for the subsequent element-wise addition, i.e., the $i$th element of $\tilde{Z}$ and the $i$th element of $\tilde{M}$ are in one-to-one correspondence.
Then, the latent feature $z$ and the structure feature $m$ are concatenated into a long feature vector, and a fully connected layer is used on the long feature vector to learn the attention vector $\alpha \in \mathbb{R}^{d \times 1}$:
$\alpha = \mathrm{sigmoid}\left(W_{\alpha} [z, m]\right)$ (13)
where $\mathrm{sigmoid}$ denotes the sigmoid activation operation and $W_{\alpha}$ denotes its parameter. Owing to the sigmoid activation, the values in the attention vector lie in the range $(0, 1)$. The attention mechanism here is derived from the selective attention behavior of the human brain when processing information: the brain scans the total information quickly to find the focus area and then invests more attention resources in this area to obtain more detailed information relevant to the target task, while suppressing other useless information. This greatly improves the screening of high-value information from a large quantity of information. Similar to the selective attention mechanism of human beings, the core goal of the attention mechanism used here is to select the information that is most critical to the current task from a large quantity of information. Therefore, a fully connected layer with activation is used to simulate the neurons in the human brain. The input of the fully connected layer is all of the sample information, i.e., all the features of the sample. By using the fully connected layer to sense all the information, we can determine the focus features and then invest more attention on them while suppressing other useless information. That is to say, the output of the fully connected layer tells us where to focus and to what degree, so it is called the attention weight vector.
Finally, the attention vector $\alpha$ is used as a weight to sum $\tilde{Z}$ and $\tilde{M}$. Since the values of $\alpha$ lie in the range $(0, 1)$, the values of $1 - \alpha$ also lie in the range $(0, 1)$. The attention vector can be regarded as a weight vector that controls the contribution of the feature $\tilde{Z}$ to the overall output of the unit. In turn, considering weight normalization, the weight of the feature $\tilde{M}$ can be directly obtained through the operation $1 - \alpha$ without an extra learning process. More concretely, the attention vector $\alpha$ is element-wise multiplied with the feature $\tilde{Z}$, the vector $1 - \alpha$ is element-wise multiplied with the feature $\tilde{M}$, and an element-wise sum is then used to integrate these two features:
$F = \alpha \odot \tilde{Z} + (1 - \alpha) \odot \tilde{M}$ (14)
where $\odot$ represents the element-wise multiplication, $1$ is a vector whose elements are all one, and $F$ represents the fused feature.
Assuming $f_{fusion}$ represents the overall fusion module with parameter $\psi_{fusion}$, the fusion module can be summarized as follows:
$F = f_{fusion}(z, m; \psi_{fusion})$ (15)
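The sketch below expresses Equations (12)-(14) as a small PyTorch module. The feature dimension follows the text ($d = 50$); the presence of bias terms and the parameter initialization are assumptions of this sketch.

```python
# Sketch of the attention-weighted fusion of Section 3.4 (Equations (12)-(14)).
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self, z_dim: int = 50, m_dim: int = 50, d: int = 50):
        super().__init__()
        self.fc_z = nn.Linear(z_dim, d)              # W_Z in Eq. (12)
        self.fc_m = nn.Linear(m_dim, d)              # W_M in Eq. (12)
        self.fc_alpha = nn.Linear(z_dim + m_dim, d)  # W_alpha in Eq. (13)

    def forward(self, z: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        z_tilde = torch.relu(self.fc_z(z))
        m_tilde = torch.relu(self.fc_m(m))
        alpha = torch.sigmoid(self.fc_alpha(torch.cat([z, m], dim=1)))
        return alpha * z_tilde + (1.0 - alpha) * m_tilde   # Eq. (14)

# Example: fuse a batch of latent features z and structure features m.
F_fused = FusionModule()(torch.randn(4, 50), torch.randn(4, 50))
print(F_fused.shape)  # torch.Size([4, 50])
```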

3.5. Loss Function

Following the feature extraction stage and the fusion module, the fused feature $F$ is fed into a softmax layer to predict the classification results $\{\hat{y}^{(n)}\}_{n=1}^{N}$, which can be formulated as follows:
$\hat{y} = f_{c}(F; \psi_{c})$ (16)
where $f_{c}$ represents a usual softmax classifier with parameter $\psi_{c}$.
The supervised constraint ensures that the predicted label $\hat{y}_{n}$ is close to the true label $y_{n}$ via the cross-entropy loss function, as follows:
$L_{label}(y_{n}, \hat{y}_{n}) = -\sum_{k=1}^{K} y_{nk} \log \hat{y}_{nk}$ (17)
where K represents the number of classes.
Therefore, the total loss function of the proposed deep fusion network for target recognition is a combination of $L_{label}$ and $L_{VAE}$ (described in Equation (10)), as follows:
$L_{total} = L_{label} + L_{VAE}$ (18)
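As an illustration, the total loss of Equation (18) can be assembled from a standard cross-entropy term and the VAE terms of Equations (8) and (9). The helper below is a sketch that assumes the decoder reconstruction and the $(\mu, \log\sigma^2)$ outputs come from a VAE branch such as the one sketched in Section 3.2; no extra balancing coefficient between the terms is used.

```python
# Sketch of the total loss in Equation (18): supervised cross-entropy plus VAE loss.
import torch
import torch.nn.functional as F

def total_loss(logits, labels, x_ap, x_hat, mu, logvar):
    l_label = F.cross_entropy(logits, labels)                      # Eq. (17)
    recon = F.mse_loss(x_hat, x_ap, reduction="sum")               # Eq. (8)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # Eq. (9)
    return l_label + recon + kl                                    # Eq. (18)
```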

3.6. Training Procedure

Based on the total loss function $L_{total}$, the backpropagation algorithm with stochastic gradient descent (SGD) is used to jointly optimize the proposed network end to end. The total training procedure of the proposed network is outlined in Algorithm 1.
Algorithm 1. Training Procedure of the Proposed Network
1. Set the architecture of the proposed network, including the number of fully connected layers, the units in each fully connected layer, the size of convolutional kernels, the strides and the number of channels, and so on.
2. Initialize the network parameters $\phi$, $\theta$, $\psi_{LNet}$, $\psi_{fusion}$ and $\psi_{c}$.
3. while not converged do
4. Randomly sample a mini-batch $\{X_{b}\}_{b=1}^{B}$ and its corresponding labels $\{y_{b}\}_{b=1}^{B}$ from the whole dataset.
5. Based on each datum $X_{b}$ in the mini-batch, generate the average profile $x_{b}^{AP}$ and the SAR image $x_{b}^{I}$.
6. Sample random noise $\epsilon \sim \mathcal{N}(0, I)$ for re-parameterization.
7. With $x_{b}^{AP}$ as input, generate the latent distribution representation $z$ using Equations (2) and (3), and then generate the reconstruction $\hat{x}_{AP}$ based on Equation (4).
8. With $x_{b}^{I}$ as input, generate the structure information $m$ using Equation (11).
9. Based on Equation (15), generate the integrated feature $F$.
10. Based on the integrated feature $F$, obtain the prediction $\hat{y}$ with Equation (16).
11. Compute the total loss $L_{total}$.
12. Update the network parameters $\phi$, $\theta$, $\psi_{LNet}$, $\psi_{fusion}$ and $\psi_{c}$ via SGD on the total loss $L_{total}$.
13. end while
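A compact PyTorch rendering of Algorithm 1 is sketched below. It assumes the VAEBranch, LightNetBranch, FusionModule and total_loss helpers from the earlier sketches and a data loader that yields (average profile, SAR image, label) triples; the learning rate, momentum and epoch count are arbitrary placeholders rather than the settings used in the experiments.

```python
# Condensed sketch of Algorithm 1 (end-to-end joint optimization with SGD).
import torch
import torch.nn as nn

class TwoStreamFusionNet(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.vae = VAEBranch()              # Section 3.2 sketch
        self.lightnet = LightNetBranch()    # Section 3.3 sketch
        self.fusion = FusionModule()        # Section 3.4 sketch
        self.classifier = nn.Linear(50, n_classes)  # softmax is applied inside the loss

    def forward(self, x_ap, x_img):
        z, x_hat, mu, logvar = self.vae(x_ap)               # steps 6-7
        m = self.lightnet(x_img)                            # step 8
        fused = self.fusion(z, m)                           # step 9
        return self.classifier(fused), x_hat, mu, logvar    # step 10

def train(model, loader, epochs: int = 50, lr: float = 1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):                                 # "while not converged"
        for x_ap, x_img, y in loader:                       # steps 4-5
            logits, x_hat, mu, logvar = model(x_ap, x_img)
            loss = total_loss(logits, y, x_ap, x_hat, mu, logvar)  # step 11
            opt.zero_grad()
            loss.backward()
            opt.step()                                      # step 12
```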

4. Results

4.1. Experimental Data Description

Experiments were carried out based on the HRRP data and SAR images of the moving and stationary target acquisition and recognition (MSTAR) dataset, which was collected by an HH-polarization SAR sensor working in the X-band with a 0.3 × 0.3 m resolution in spotlight mode [56]. The MSTAR dataset, a measured benchmark dataset, is widely used for evaluating SAR target recognition performance. It includes ten different ground military targets, i.e., BMP2 (tank), BTR70 (armored vehicle), T72 (tank), BTR60 (armored vehicle), 2S1 (cannon), BRDM (truck), D7 (bulldozer), T62 (tank), ZIL131 (truck) and ZSU234 (cannon). Among them, BMP2 and T72 have variants in the test stage. The depression angles of the samples for each target category are 15° and 17°, and the aspect angles cover a range from 0° to 360°. Referring to the existing literature, this paper focuses on two experimental scenarios: three-target SAR target recognition and ten-target SAR target recognition. The specific details of the experimental data settings for these two scenarios are listed in Table 2 and Table 3, respectively. Optical image examples of the ten different targets are shown in Figure 5, and the corresponding SAR image examples are listed in Figure 6.
For the MSTAR data, we use the complex-valued SAR images provided by the U.S. Defense Advanced Research Projects Agency and the U.S. Air Force Research Laboratory to recover the high-range resolution radar echoes in reverse without information loss, in accordance with reference [11]. Then, based on the high-range resolution radar echoes, the HRRP data can be generated, as shown in Figure 1. Examples of the average profiles of the generated HRRP data of the ten different targets are listed in Figure 6. The real-valued SAR images were directly obtained by taking the modulus of the complex-valued SAR images.

4.2. Evaluation Criteria

For the quantitative analysis, we use two widely used criteria, namely, the overall accuracy and the average accuracy, as the evaluation criteria to evaluate target recognition performance.
$\mathrm{overall\ accuracy} = \frac{\sum_{i=1}^{N_{C}} T_{r_{i}}}{\sum_{i=1}^{N_{C}} Q_{i}}$ (19)
$\mathrm{average\ accuracy} = \frac{1}{N_{C}} \sum_{i=1}^{N_{C}} \frac{T_{r_{i}}}{Q_{i}}$ (20)
where $T_{r_{i}}$ represents the number of test samples recognized correctly in class $i$, $Q_{i}$ represents the total number of test samples in class $i$, and $N_{C}$ represents the number of classes.
The higher the values of the overall accuracy and the average accuracy, the better the performance of the target recognition method.
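A minimal NumPy sketch of these two criteria, assuming integer class labels, is as follows:

```python
# Sketch of Equations (19) and (20) computed from true and predicted labels.
import numpy as np

def overall_and_average_accuracy(y_true: np.ndarray, y_pred: np.ndarray):
    classes = np.unique(y_true)
    overall = float(np.mean(y_true == y_pred))                        # Eq. (19)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    average = float(np.mean(per_class))                               # Eq. (20)
    return overall, average

# Example with three classes.
print(overall_and_average_accuracy(np.array([0, 0, 1, 1, 2, 2]),
                                   np.array([0, 1, 1, 1, 2, 0])))
```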

4.3. Three-Target MSTAR Data Experiments

In this section, we discuss the effectiveness of the proposed method on the three-target MSTAR data. The confusion matrix of the proposed deep fusion network on the three-target MSTAR data is given in Table 4. The confusion matrix is a widely used performance evaluation tool for target recognition: each row represents the actual category, each column represents the predicted category, and the elements denote the probabilities that the targets are recognized as a certain class; in particular, the elements on the diagonal represent the recognition accuracy. From Table 4, it is easy to see that the accuracy on BTR70 was 0.9898, the accuracy on T72 was 0.9880, and the accuracy on BMP2 was 0.9642, which shows that the proposed method has good recognition performance.
In order to further validate the efficiency of the proposed method, we compared it with some traditional SAR target recognition methods, i.e., directly applying the amplitude feature of the original SAR images, principal component analysis (PCA), the template matching method, dictionary learning and JDSR (DL-JDSR) [57], sparse representation in the frequency domain (SRC-FT) [58], and Riemannian manifolds [59]. Moreover, the proposed method was compared with other deep learning-based target recognition methods without data augmentation, as seen in Table 5. The compared deep learning-based target recognition methods include the original auto-encoder (AE), the denoising AE (DAE), a linear SVM, the Euclidean distance restricted AE [3] (Euclidean-AE), the VGG convolutional neural network (VGGNet), A-ConvNets [60], the early feature fusion of a model-based geometric hashing (MBGH) approach and a CNN approach (MBGH+CNN with EFF) [61], the compact convolutional autoencoder (CCAE) [62], ResNet-18 [63], ResNet-34 [63] and DenseNet [64]. Figure 7 shows the accuracy results of the proposed method and the above-mentioned compared methods intuitively, and Table 5 lists their detailed accuracies on the three-target MSTAR data along with their overall and average accuracies.
From Figure 7 we can clearly see that, compared with the original image method and the PCA, template matching, DL-JDSR, SRC-FT and Riemannian manifold methods, our proposed method performs better on both overall accuracy and average accuracy. The proposed method also yields higher overall and average accuracy than the compared deep learning methods, i.e., AE, DAE, Euclidean-AE, VGGNet, A-ConvNet, MBGH+CNN with EFF, ResNet-18, ResNet-34 and DenseNet. As shown in Table 5, for the BMP2 and T72 types, which have variants in the test stage, the accuracies of the proposed method reached 0.9642 and 0.9880, respectively, outperforming all other compared target recognition methods. For the BTR70 type, which does not contain variants, the template matching method and VGGNet could correctly recognize all test samples; at the same time, the accuracies of the proposed method, DL-JDSR and A-ConvNet were 0.9898, which is very close to 1. In terms of the two comprehensive evaluation criteria in Table 5, overall accuracy and average accuracy, the proposed method is at least 0.96% and 0.52% higher, respectively, than the other compared methods.

4.4. Ten-Target MSTAR Data Experiments

In this section, we evaluate the target recognition performance of the proposed method with the ten-target MSTAR data. Similar to Section 4.3, the confusion matrix is shown first, in Table 6. From Table 6 we can see that the accuracy for all target types, except T72, was over 0.97. The best accuracy was achieved on ZIL131, for which all test samples were correctly classified. The accuracies on BTR60, D7 and ZSU234 were close to 1, and the worst accuracy was over 0.94.
We compare the performance of the proposed method with the original image, PCA, template matching, DL-JDSR, AE, DAE, Euclidean-AE, VGG network, A-ConvNets, MBGH+CNN with EFF, ResNet-18, ResNet-34 and DenseNet methods in Figure 8 and Table 7.
As shown in Figure 8 and Table 7, our proposed method outperforms all the other compared methods. In particular, for the first nine types, i.e., BMP2, BTR70, T72, BTR60, 2S1, BRDM, D7, T62 and ZIL131, the proposed method yielded the highest accuracy. For the ZSU234 type, the A-ConvNet method had the highest accuracy, and the proposed method followed closely with an accuracy of 0.9964. In terms of overall accuracy, the proposed method is at least 4% higher than the other compared methods, and it is also about 4% higher in terms of average accuracy.

4.5. Model Analysis

4.5.1. Ablation Study

In order to gain a better understanding of the network's behavior and prove that the fusion of HRRP data and SAR images is beneficial to SAR target recognition, an ablation study, in which one or more components of the network are removed or replaced, is adopted to see how each component affects the performance. Therefore, in this sub-section, several controlled experiments were designed; except for the examined components, the rest of the settings remained consistent. The ablation study results on the three-target MSTAR data are summarized in Table 8. In Table 8, addition denotes the element-wise addition fusion operation and concatenation denotes the concatenation fusion operation, which are usually adopted as fusion modules in multi-stream network architectures [66,67]. From rows 1 and 2 in Table 8, it can be observed that the recognition accuracy using only the HRRP data through the VAE model was 0.8813 for overall accuracy and 0.8399 for average accuracy, while the recognition accuracy using only the SAR images through LightNet was 0.9487 for overall accuracy and 0.9612 for average accuracy. The VAE model and LightNet extract different features from different domains, and both achieve good recognition performance. Nevertheless, comparing rows 1 and 2 with rows 3, 4, 5 and 6, it can be observed that fusing the latent feature of the HRRP data obtained from the VAE and the structure feature of the SAR images obtained from LightNet reflects target information more comprehensively and sufficiently and achieves better recognition performance. Furthermore, as shown in rows 3, 4, 5 and 6, on the basis of fusing the VAE and LightNet, the performance improvements brought by the different fusion modules differ. The decision-level fusion module had a 0.9278 overall accuracy and a 0.9357 average accuracy, which were lower than the accuracies obtained using only LightNet; in fact, simple decision-level fusion can indeed provide robust performance but can hardly achieve the best performance. The element-wise addition module had a 0.9568 overall accuracy and a 0.9664 average accuracy, the concatenation module had a 0.9648 overall accuracy and a 0.9715 average accuracy, and the proposed fusion module produced a markedly superior recognition accuracy of 0.9780 for overall accuracy and 0.9807 for average accuracy. From this comparison we can see that the proposed fusion module achieved the best fusion performance.

4.5.2. Feature Analysis

The quantitative performance of the proposed method has been evaluated above through comparisons with existing methods and detailed ablation studies. In this sub-section, we adopt t-SNE [68] to visualize, on the three-target data, the fused feature learned by the proposed method, the features learned by the VAE model and LightNet, and the amplitude feature of the original SAR images, as shown in Figure 9. From Figure 9, it can be observed that the features learned by the proposed fusion network show a better feature distribution, in which each class gathers more closely and the margins between classes are much more distinct compared with the other features.

4.5.3. FLOPs Analysis

In Table 9, we give the floating point operations (FLOPs) for the VAE branch, the LightNet branch, the proposed network and the VGG network for comparison.
By analyzing the calculation principle of the convolutional layer, the computational complexity of one convolutional layer is $C_{in,cl} \cdot C_{out,cl} \cdot K_{cl}^{2} \cdot M_{out,cl}^{2}$, where $C_{in,cl}$ and $C_{out,cl}$ are the numbers of channels in the input and output feature maps of the convolutional layer, $K_{cl}$ is the size of the convolution kernel, and $M_{out,cl}$ is the size of the output feature map. For one fully connected layer, the computational complexity is $N_{in,fl} \cdot N_{out,fl}$, where $N_{in,fl}$ is the number of input nodes of the fully connected layer and $N_{out,fl}$ is the number of output nodes. Therefore, according to the architecture and details of the LightNet branch shown in Table 1, we obtain the FLOPs of the LightNet branch as 2.3 × 10^7 by substituting the relevant parameters into the complexity formula. Similarly, according to the introduction of the VAE and the details of its architecture presented in Section 3.2, the FLOPs of the VAE branch are 3.4 × 10^5. In the proposed network, besides the LightNet branch and the VAE branch, there is a fusion module with 1.25 × 10^4 FLOPs, so the total FLOPs of the proposed network are 2.33525 × 10^7. By substituting the relevant parameters of the VGG network, ResNet-18, ResNet-34 and DenseNet, their FLOPs are 5.14 × 10^9, 1.9 × 10^9, 3.6 × 10^9 and 5.7 × 10^9, respectively.
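The two per-layer formulas above can be evaluated directly; the short sketch below does so for a hypothetical convolutional layer and a hypothetical fully connected layer (the layer parameters are placeholders, not the actual LightNet or VAE settings).

```python
# Sketch of the per-layer complexity formulas used in Section 4.5.3.
def conv_flops(c_in: int, c_out: int, k: int, m_out: int) -> int:
    # C_in * C_out * K^2 * M_out^2 multiply-accumulates for one convolutional layer.
    return c_in * c_out * k * k * m_out * m_out

def fc_flops(n_in: int, n_out: int) -> int:
    # N_in * N_out multiply-accumulates for one fully connected layer.
    return n_in * n_out

# Example: an 11x11 convolution from 1 to 16 channels with a 64x64 output map,
# and a 512-to-256 fully connected layer.
print(conv_flops(1, 16, 11, 64))   # 7929856
print(fc_flops(512, 256))          # 131072
```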
It can be seen from Table 9 that although the VGG network, ResNet-18, ResNet-34 and DenseNet have deeper architectures and require more FLOPs, the recognition performance of these methods on all datasets was lower than that of the proposed method.

4.6. Experiments on Civilian Vehicle Dataset

The civilian vehicle dataset was provided by the U.S. Air Force Research Laboratory. The sensor collecting the civilian vehicle data is a high-resolution circular SAR operating in the X-band. The dataset includes ten different civilian vehicle targets, i.e., Toyota Camry, Honda Civic 4dr, 1993 Jeep, 1999 Jeep, Nissan Maxima, Mazda MPV, Mitsubishi, Nissan Sentra, Toyota Avalon and Toyota Tacoma. The aspect angles cover from 0° to 360°, and the depression angle of the samples for each target category is 30°. The HH channel was used for training and the VV channel was used for testing, and the number of training and test samples in each category was 360. Importantly, the provided data are high-range resolution radar echoes; for the proposed method, the HRRP data and the real-valued SAR images were obtained according to the procedure shown in Figure 1.
We compared the performance of the proposed method with some SAR target recognition methods, including directly applying a linear SVM to the original SAR images, PCA followed by a linear SVM, the template matching method, DL-JDSR, AE, DAE, the VGG network, A-ConvNet, MBGH+CNN with EFF, ResNet-18, ResNet-34 and DenseNet, in Figure 10 and Table 10. Here, since the number of test samples in each category is the same, the overall accuracy and the average accuracy are identical, as can be seen from Equations (19) and (20); therefore, only the total accuracy is listed in Table 10. As shown in Figure 10 and Table 10, our proposed method outperforms all the other compared methods. In particular, for the 1993 Jeep, 1999 Jeep and Toyota Avalon, the proposed method yielded the highest accuracy. For the other categories, the accuracy of our method was not the highest, but it was still among the best. In terms of total accuracy, the proposed method was at least 2.1% higher than the other compared methods.

5. Conclusions

In this paper, considering that both SAR images and the corresponding HRRP data, whose information content is not exactly the same, can be obtained simultaneously in the procedure of SAR imaging, we formulated a novel end-to-end two-stream fusion network framework to fuse the characteristics obtained from modeling HRRP data and SAR images for radar target recognition. The proposed fusion network contains two separate streams in the feature extraction stage: one takes advantage of a VAE network to acquire the latent probabilistic distribution from the HRRP data, and the other uses LightNet to extract 2D visual structure information from the SAR images. The proposed fusion module integrates these two types of characteristics to reflect target information more comprehensively and sufficiently, and it also merges the two streams into a unified framework with end-to-end joint training. The experimental results on the MSTAR dataset and the civilian vehicle dataset show that the proposed two-stream fusion method has clear performance advantages over conventional target recognition methods and other deep learning-based target recognition methods, demonstrating the superiority of the proposed method.
Although the proposed target recognition method offers a significant improvement in performance, it is also limited in speed. Since the proposed method contains two branches, its running time on one test sample is slightly higher than that of a single branch. In the future, we will further explore increasing the speed with the support of parallel computing and algorithm optimization.

Author Contributions

Conceptualization, L.D.; methodology, L.D.; software, L.L. and Y.G.; validation, L.L., K.R., J.C., L.D. and Y.G.; investigation, Y.W.; resources, L.D.; writing—original draft preparation, L.L.; writing—review and editing, Y.G. and L.D.; visualization, L.L.; supervision, L.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation of China under Grant 61771362 and in part by the 111 Project.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, S.; Wang, H. SAR target recognition based on deep learning. In Proceedings of the International Conference on Data Science and Advanced Analytics (DSAA), Paris, France, 19–21 October 2015. [Google Scholar]
  2. Cui, Z.; Cao, Z.; Yang, J.; Ren, H. Hierarchical Recognition System for Target Recognition from Sparse Representations. Math. Probl. Eng. 2015, 2015 Pt 17, 6. [Google Scholar] [CrossRef]
  3. Deng, S.; Du, L.; Li, C.; Ding, J.; Liu, H. SAR automatic target recognition based on euclidean distance restricted autoencoder. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3323–3333. [Google Scholar] [CrossRef]
  4. Housseini, A.E.; Toumi, A.; Khenchaf, A. Deep Learning for Target recognition from SAR images. In Proceedings of the 2017 Seminar on Detection Systems Architectures and Technologies (DAT), Algiers, Algeria, 20–22 February 2017. [Google Scholar]
  5. Yan, H.; Zhang, Z.; Gang, X.; Yu, W. Radar HRRP recognition based on sparse denoising autoencoder and multi-layer perceptron deep model. In Proceedings of the 2016 Fourth International Conference on Ubiquitous Positioning, Indoor Navigation and Location Based Services (UPINLBS), Shanghai, China, 2–4 November 2016. [Google Scholar]
  6. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  7. Du, C.; Chen, B.; Xu, B.; Guo, D.; Liu, H. Factorized Discriminative Conditional Variational Auto-encoder for Radar HRRP Target Recognition. Signal Process. 2019, 158, 176–189. [Google Scholar] [CrossRef]
  8. Du, L.; Liu, H.; Bao, Z.; Zhang, J. Radar automatic target recognition using complex high-resolution range profiles. IET Radar Sonar Navig. 2007, 1, 18–26. [Google Scholar] [CrossRef]
  9. Du, L. Noise Robust Radar HRRP Target Recognition Based on Multitask Factor Analysis With Small Training Data Size. IEEE Trans. Signal Process. 2012, 60, 3546–3559. [Google Scholar]
  10. Xing, M. Properties of high-resolution range profiles. Opt. Eng. 2002, 41, 493–504. [Google Scholar] [CrossRef]
  11. Zhang, X.Z.; Huang, P.K. Multi-aspect SAR target recognition based on features of sequential complex HRRP using CICA. Syst. Eng. Electron. 2012, 34, 263–269. [Google Scholar]
  12. Masahiko, N.; Liao, X.J.; Carin, L. Target identification from multi-aspect high range-resolution radar signatures using a hidden Markov model. IEICE Trans. Electron. 2004, 87, 1706–1714. [Google Scholar]
  13. Tan, X.; Li, J. Rang-Doppler imaging via forward- backward sparse Bayesian learning. IEEE Trans. Signal Process. 2010, 58, 2421–2425. [Google Scholar] [CrossRef]
  14. Zhao, F.; Liu, Y.; Huo, K.; Zhang, S.; Zhang, Z. Radar HRRP Target Recognition Based on Stacked Autoencoder and Extreme Learning Machine. Sensors 2018, 18, 173. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  15. Feng, B.; Chen, B.; Liu, H. Radar HRRP target recognition with deep networks. Pattern Recognit. 2017, 61, 379–393. [Google Scholar] [CrossRef]
  16. Pan, M.; Liu, A.L.; Yu, Y.Z.; Wang, P.; Li, J.; Liu, Y.; Lv, S.S.; Zhu, H. Radar HRRP target recognition model based on a stacked CNN-Bi-RNN with attention mechanism. IEEE Trans. Geosci. Remote Sens. 2021, 61, 1–14, online published. [Google Scholar] [CrossRef]
  17. Chen, W.C.; Chen, B.; Peng, X.J.; Liu, J.; Yang, Y.; Zhang, H.; Liu, H. Tensor RNN with Bayesian nonparametric mixture for radar HRRP modeling and target recognition. IEEE Trans. Signal Process. 2021, 69, 1995–2009. [Google Scholar] [CrossRef]
  18. Peng, X.; Gao, X.Z.; Zhang, Y.F. An adaptive feature learning model for sequential radar high resolution range profile recognition. Sensors 2017, 17, 1675. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  19. Jacobs, S.P. Automatic Target Recognition Using High-Resolution Radar Range-Profiles; ProQuest Dissertations Publishing: Morrisville, NC, USA, 1997. [Google Scholar]
  20. Webb, A.R. Gamma mixture models for target recognition. Pattern Recognit. 2000, 33, 2045–2054. [Google Scholar] [CrossRef]
  21. Copsey, K.; Webb, A. Bayesian gamma mixture model approach to radar target recognition. IEEE Trans. Aerosp. Electron. Syst. 2003, 39, 1201–1217. [Google Scholar] [CrossRef]
  22. Du, L.; Liu, H.; Zheng, B.; Zhang, J. A two-distribution compounded statistical model for Radar HRRP target recognition. IEEE Trans. Signal Process. 2006, 54, 2226–2238. [Google Scholar]
  23. Du, L.; Liu, H.; Bao, Z. Radar HRRP Statistical Recognition: Parametric Model and Model Selection. IEEE Trans. Signal Process. 2008, 56, 1931–1944. [Google Scholar] [CrossRef]
  24. Du, L.; Wang, P.; Zhang, L.; He, H.; Liu, H. Robust statistical recognition and reconstruction scheme based on hierarchical Bayesian learning of HRR radar target signal. Expert Syst. Appl. 2015, 42, 5860–5873. [Google Scholar] [CrossRef]
  25. Park, S.C.; Park, M.K.; Kang, M.G. Super-Resolution Image Reconstruction: A Technical Overview. IEEE Signal Process. Mag. 2003, 20, 21–36. [Google Scholar] [CrossRef] [Green Version]
  26. Wang, P.; Shi, L.; Lan, D.; Liu, H.; Xu, L.; Bao, Z. Radar HRRP Statistical Recognition With Local Factor Analysis by Automatic Bayesian Ying-Yang Harmony Learning. Front. Electr. Electron. Eng. China 2011, 6, 300–317. [Google Scholar] [CrossRef]
  27. Chen, J.; Du, L.; He, H.; Guo, Y. Convolutional factor analysis model with application to radar automatic target recognition. Pattern Recognit. 2019, 87, 140–156. [Google Scholar] [CrossRef]
  28. Pan, M.; Du, L.; Wang, P.; Liu, H.; Bao, Z. Noise-Robust Modification Method for Gaussian-Based Models With Application to Radar HRRP Recognition. IEEE Geosci. Remote Sens. Lett. 2013, 10, 55–62. [Google Scholar] [CrossRef]
  29. Chen, H.; Guo, Z.Y.; Duan, H.B.; Ban, D. A genetic programming-driven data fitting method. IEEE Access 2020, 8, 111448–111458. [Google Scholar] [CrossRef]
  30. Rezende, D.J.; Mohamed, S.; Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014. [Google Scholar]
  31. Doersch, C. Tutorial on Variational Autoencoders. arXiv 2016, arXiv:1606.05908. [Google Scholar]
  32. Ying, Z.; Bo, C.; Hao, Z.; Wang, Z. Robust Variational Auto-Encoder for Radar HRRP Target Recognition. In Proceedings of the International Conference on Intelligent Science & Big Data Engineering, Dalian, China, 22–23 September 2017. [Google Scholar]
  33. Chen, J.; Du, L.; Liao, L. Class Factorized Variational Auto-encoder for Radar HRRP Target Recognition. In Proceedings of the 2020 IEEE Radar Conference (RadarConf20), Florence, Italy, 21–25 September 2020. [Google Scholar]
  34. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  35. Min, R.; Lan, H.; Cao, Z.; Cui, Z. A Gradually Distilled CNN for SAR Target Recognition. IEEE Access 2019, 7, 42190–42200. [Google Scholar] [CrossRef]
  36. Huang, X.; Yang, Q.; Qiao, H. Lightweight two-stream convolutional neural network for SAR target recognition. IEEE Geosci. Remote Sens. Lett. 2020, 18, 667–671. [Google Scholar] [CrossRef]
  37. Cho, J.; Chan, G. Multiple feature aggregation using convolutional neural networks for SAR image-based automatic target recognition. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1882–1886. [Google Scholar] [CrossRef]
  38. Ruser, H.; Leon, F.P. Information fusion—An overview. Tech. Mess. 2006, 74, 93–102. [Google Scholar] [CrossRef]
  39. Jiang, L.; Yan, L.; Xia, Y.; Guo, Q.; Fu, M.; Lu, K. Asynchronous multirate multisensor data fusion over unreliable measurements with correlated noise. IEEE Trans. Aerosp. Electron. Syst. 2017, 53, 2427–2437. [Google Scholar] [CrossRef]
  40. Rasti, B.; Ghamisi, P.; Plaza, J.; Plaza, A. Fusion of hyperspectral and LiDAR data using sparse and low-rank component analysis. IEEE Trans. Geosci. Remote Sens. 2017, 55, 6354–6365. [Google Scholar] [CrossRef] [Green Version]
  41. Bassford, M.; Painter, B. Intelligent bio-environments: Exploring fuzzy logic approaches to the honeybee crisis. In Proceedings of the 2016 12th International Conference on Intelligent Environments (IE), London, UK, 14–16 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 202–205. [Google Scholar]
  42. Mehra, A.; Jain, N.; Srivastava, H.S. A novel approach to use semantic segmentation based deep learning networks to classify multi-temporal SAR data. Geocarto Int. 2020, 1–16. [Google Scholar] [CrossRef]
  43. Pei, J.; Huang, Y.; Huo, W.; Zhang, Y.; Yang, J.; Yeo, T.S. SAR automatic target recognition based on multiview deep learning framework. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2196–2210. [Google Scholar] [CrossRef]
  44. Choi, I.O.; Jung, J.H.; Kim, S.H.; Kim, K.T.; Park, S.H. Classification of targets improved by fusion of range profile and the inverse synthetic aperture radar image. Prog. Electromagn. Res. 2014, 144, 23–31. [Google Scholar] [CrossRef] [Green Version]
  45. Wang, L.X.; Weng, L.G.; Xia, M.; Liu, J.; Lin, H. Multi-resolution supervision network with an adaptive weighted loss for desert segmentation. Remote Sens. 2021, 13, 1–18. [Google Scholar]
  46. Shang, R.H.; Zhang, J.Y.; Jiao, L.C.; Li, Y.; Marturi, N.; Stolkin, R. Multi-scale Adaptive feature fusion network for segmentation in remote sensing images. Remote Sens. 2020, 12, 872. [Google Scholar] [CrossRef] [Green Version]
  47. Chen, J.; He, F.; Zhang, Y.; Sun, G.; Deng, M. SPMF-net: Weakly supervised building segmentation by combining superpixel pooling and multi-scale feature fusion. Remote Sens. 2020, 12, 1049. [Google Scholar] [CrossRef] [Green Version]
  48. Liao, X.; Runkle, P.; Carin, L. Identification of ground targets from sequential high-range-resolution radar signatures. IEEE Trans. Aerosp. Electron. Syst. 2002, 38, 1230–1242. [Google Scholar] [CrossRef]
  49. Zhang, X.; Liu, Z.; Liu, S.; Li, G. Time-Frequency Feature Extraction of HRRP Using AGR and NMF for SAR ATR. J. Electr. Comput. Eng. 2015, 2015, 340–349. [Google Scholar] [CrossRef]
  50. Chen, B.; Liu, H.; Bao, Z. Analysis of three kinds of classification based on different absolute alignment methods. Mod. Radar 2006, 28, 58–62. [Google Scholar]
  51. Lan, D.; Liu, H.; Zheng, B.; Xing, M. Radar HRRP Target Recognition Based on Higher Order Spectra. IEEE Trans. Signal Process. 2005, 53, 2359–2368. [Google Scholar] [CrossRef]
  52. Beal, M. Variational Algorithms for Approximate Bayesian Inference. Ph.D. Thesis, University College London, London, UK, 2003. [Google Scholar]
  53. Nielsen, F.B. Variational Approach to Factor Analysis and Related Models. Master’s Thesis, Informatics and Mathematical Modelling, Technical University of Denmark, Copenhagen, Denmark, 2004. [Google Scholar]
  54. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015. [Google Scholar]
  55. Gulcehre, C.; Cho, K.; Pascanu, R.; Bengio, Y. Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Nancy, France, 15–19 September 2014. [Google Scholar]
  56. The Sensor Data Management System. Available online: https://www.sdms.afrl.af.mil/index.php?collection=mstar (accessed on 10 September 2015).
  57. Sun, Y.; Du, L.; Wang, Y.; Wang, Y.; Hu, J. SAR automatic target recognition based on dictionary learning and joint dynamic sparse representation. IEEE Geosci. Remote Sens. Lett. 2017, 13, 1777–1781. [Google Scholar] [CrossRef]
  58. Dong, G.; Liu, H.; Kuang, G.; Chanussot, J. Target recognition in SAR images via sparse representation in the frequency domain. Pattern Recognit. 2019, 96, 106972. [Google Scholar] [CrossRef]
  59. Dong, G.; Kuang, G. Target recognition in SAR images via classification on Riemannian manifolds. IEEE Geosci. Remote Sens. Lett. 2014, 12, 199–203. [Google Scholar] [CrossRef]
  60. Chen, S.; Wang, H.; Xu, F.; Jin, Y.Q. Target Classification Using the Deep Convolutional Networks for SAR images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 1–12. [Google Scholar] [CrossRef]
  61. Theagarajan, R.; Bhanu, B.; Erpek, T.; Hue, Y.K.; Schwieterman, R.; Davaslioglu, K.; Shi, Y.; Sagduyu, Y.E. Integrating deep learning-based data driven and model-based approaches for inverse synthetic aperture radar target recognition. Opt. Eng. 2020, 59, 051407. [Google Scholar] [CrossRef]
  62. Guo, J.; Wang, L.; Zhu, D.; Hu, C. Compact convolutional autoencoder for SAR target recognition. IET Radar Sonar Navig. 2020, 14, 967–972. [Google Scholar] [CrossRef]
  63. He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  64. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  65. Yu, M.; Dong, G.; Fan, H.; Kuang, G. SAR Target Recognition via Local Sparse Representation of Multi-Manifold Regularized Low-Rank Approximation. Remote Sens. 2018, 10, 211. [Google Scholar] [CrossRef] [Green Version]
  66. Mou, L.; Schmitt, M.; Wang, Y.; Zhu, X.X. A CNN for the identification of corresponding patches in SAR and optical imagery of urban scenes. In Proceedings of the 2017 Joint Urban Remote Sensing Event (JURSE), Dubai, United Arab Emirates, 6–8 March 2017. [Google Scholar]
  67. Hu, J.; Mou, L.; Schmitt, A.; Zhu, X.X. FusioNet: A two-stream convolutional neural network for urban scene classification using PolSAR and hyperspectral data. In Proceedings of the 2017 Joint Urban Remote Sensing Event (JURSE), Dubai, United Arab Emirates, 6–8 March 2017. [Google Scholar]
68. Van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. Acquisition of HRRP data and real-valued SAR images from the received SAR echoes.
Figure 2. Framework of the proposed two-stream deep fusion network. Black solid arrows indicate the acquisition of the inputs to the two sub-network branches, blue arrows the information flow in the VAE model, green arrows the information flow in LightNet, brown arrows the information flow in the fusion module, and red arrows the final classifier. μ and σ denote the mean and standard deviation learned by the VAE encoder. Dotted lines indicate the loss calculation.
Figure 3. Architecture of the VAE with Gaussian distribution assumption.
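As a concrete illustration of the structure in Figure 3, the following is a minimal PyTorch sketch of a VAE with a diagonal-Gaussian latent distribution: the encoder outputs μ and log σ², a latent code is drawn with the reparameterization trick, and training minimizes a reconstruction term plus the KL divergence to a standard normal prior. The 256-point HRRP length, the layer widths, and the class name GaussianVAE are illustrative assumptions, not the exact settings of the proposed network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianVAE(nn.Module):
    """Minimal VAE sketch for 1D HRRP vectors (illustrative sizes)."""
    def __init__(self, input_dim=256, hidden_dim=128, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps, with eps ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction term + KL(q(z|x) || N(0, I))
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Usage on a dummy batch of HRRP vectors
x = torch.rand(8, 256)
model = GaussianVAE()
x_recon, mu, logvar = model(x)
loss = vae_loss(x, x_recon, mu, logvar)
```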
Figure 4. Flowchart of the proposed fusion module.
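The fusion module in Figure 4 integrates the two feature vectors with an adaptively learned attention weight vector. The sketch below shows one plausible realization under that description: the features are concatenated, a small gating network predicts one weight per feature dimension, and the reweighted vector feeds the classifier. The layer sizes, the sigmoid gate, and the name AttentionFusion are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative attention-weighted fusion of two feature vectors."""
    def __init__(self, dim_hrrp=32, dim_sar=100, num_classes=10):
        super().__init__()
        fused_dim = dim_hrrp + dim_sar
        # Small network that predicts one attention weight per fused dimension
        self.attention = nn.Sequential(
            nn.Linear(fused_dim, fused_dim // 2), nn.ReLU(),
            nn.Linear(fused_dim // 2, fused_dim), nn.Sigmoid(),
        )
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, feat_hrrp, feat_sar):
        fused = torch.cat([feat_hrrp, feat_sar], dim=1)  # simple concatenation
        weights = self.attention(fused)                  # adaptive attention weights
        return self.classifier(weights * fused)          # element-wise reweighting

# Usage with dummy features from the two streams
logits = AttentionFusion()(torch.rand(8, 32), torch.rand(8, 100))
```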
Figure 5. SAR image examples of ten different targets in the MSTAR dataset.
Figure 6. Examples of the average profiles of the generated HRRP data for the ten targets in the MSTAR dataset.
Figure 7. Three-target accuracies obtained by different SAR target recognition methods. (a) Overall accuracy; (b) average accuracy.
Figure 8. Ten-target accuracies obtained by different SAR target recognition methods. (a) Overall accuracy; (b) average accuracy.
Figure 9. t-SNE visualization of the learned features for (a) the original amplitude feature, (b) the VAE model, (c) LightNet, and (d) the proposed fusion network.
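The embeddings in Figure 9 can be reproduced in outline with the t-SNE implementation in scikit-learn [68]. The snippet below projects a feature matrix to two dimensions and colours the points by class label; the arrays features and labels are placeholders for the learned features and the ground-truth labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data: one learned feature vector per test sample
features = np.random.rand(500, 132)
labels = np.random.randint(0, 10, 500)

# Embed the features into 2D and plot, colouring points by class
embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=5)
plt.title("t-SNE of learned features")
plt.show()
```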
Figure 10. Ten-target accuracy on civilian vehicle data obtained by different SAR target recognition methods.
Table 1. The architecture of the LightNet used in our method.

| Input | Operator | Kernel Size | Number of Channels | Strides |
|---|---|---|---|---|
| 128 × 128 × 1 | Convolution | 11 | 16 | 2 |
| 62 × 62 × 16 | Pooling | 2 | 16 | 2 |
| 31 × 31 × 16 | Convolution | 5 | 32 | 1 |
| 27 × 27 × 32 | Pooling | 2 | 32 | 2 |
| 14 × 14 × 32 | Convolution | 5 | 64 | 1 |
| 10 × 10 × 64 | Pooling | 2 | 64 | 2 |
| 5 × 5 × 64 | Convolution | 5 | 128 | 1 |
| 3 × 3 × 128 | Convolution | 3 | 100 | 1 |
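A direct PyTorch transcription of the layer list in Table 1 might look like the sketch below. The kernel sizes, channel numbers, and strides follow the table, whereas the padding values, the ReLU activations, and the final global pooling are assumptions (Table 1 does not specify them), so the intermediate feature-map sizes differ slightly from those listed.

```python
import torch
import torch.nn as nn

# Layer sequence from Table 1; padding and activation choices are illustrative assumptions.
lightnet = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=11, stride=2, padding=5), nn.ReLU(),  # 128x128x1 input
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(64, 128, kernel_size=5, stride=1, padding=2), nn.ReLU(),
    nn.Conv2d(128, 100, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # assumed global pooling to obtain a 100-d feature vector
    nn.Flatten(),
)

feature = lightnet(torch.rand(1, 1, 128, 128))
print(feature.shape)  # torch.Size([1, 100])
```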
Table 2. Details of Training and Test Samples for the Three-Target Dataset.

| Dataset | BMP2 (C21) | BMP2 (9566) | BMP2 (9563) | BTR70 (C71) | T72 (132) | T72 (S7) | T72 (812) |
|---|---|---|---|---|---|---|---|
| Training samples (17°) | 233 | 0 | 0 | 233 | 232 | 0 | 0 |
| Test samples (15°) | 196 | 196 | 195 | 196 | 196 | 191 | 195 |
Table 3. Details of Training and Test Samples for the Ten-Target Dataset.

| Dataset | BMP2 | BTR70 | T72 | BTR60 | 2S1 | BRDM | D7 | T62 | ZIL131 | ZSU234 |
|---|---|---|---|---|---|---|---|---|---|---|
| Training samples (17°) | 233 (C21) | 233 | 232 (132) | 255 | 299 | 298 | 299 | 299 | 299 | 299 |
| Test samples (15°) | 196 (C21), 196 (9566), 195 (9563) | 196 | 196 (132), 191 (S7), 195 (812) | 195 | 274 | 274 | 274 | 273 | 274 | 274 |
Table 4. Confusion Matrix of the Proposed Method on Three-Target Data.

| Type | BMP2 | BTR70 | T72 |
|---|---|---|---|
| BMP2 | 0.9642 | 0.0034 | 0.0324 |
| BTR70 | 0.0102 | 0.9898 | 0 |
| T72 | 0.0103 | 0.0017 | 0.9880 |
Table 5. Detailed Accuracies of Different Types on the Three-Target Data via Some SAR Recognition Methods.

| Method | BMP2 | BTR70 | T72 | Overall Accuracy | Average Accuracy |
|---|---|---|---|---|---|
| Proposed method | 0.9642 | 0.9898 | 0.9880 | 0.9780 | 0.9807 |
| Original image | 0.7325 | 0.9643 | 0.9278 | 0.8491 | 0.8748 |
| PCA | 0.8330 | 0.9541 | 0.9106 | 0.8835 | 0.8992 |
| Template matching | 0.9148 | 1 | 0.9244 | 0.9311 | 0.9464 |
| DL-JDSR [65] | 0.9301 | 0.9898 | 0.9312 | 0.9391 | 0.9503 |
| AE | 0.8756 | 0.9439 | 0.8351 | 0.8681 | 0.8848 |
| DAE | 0.7922 | 0.9796 | 0.9519 | 0.8871 | 0.9079 |
| Euclidean-AE [3] | 0.9421 | 0.9388 | 0.9416 | 0.9414 | 0.9408 |
| VGGNet | 0.8859 | 1 | 0.9485 | 0.9289 | 0.9448 |
| A-ConvNets | 0.9199 | 0.9898 | 0.9399 | 0.9385 | 0.9498 |
| ResNet-18 | 0.9642 | 1 | 0.9485 | 0.9626 | 0.9709 |
| ResNet-34 | 0.9676 | 0.9847 | 0.9622 | 0.9678 | 0.9715 |
| DenseNet | 0.8756 | 0.9592 | 0.8625 | 0.8821 | 0.8991 |
| SRC-FT [58] | 0.9625 | 1 | 0.9519 | 0.9631 | 0.9715 |
| Riemannian manifolds [59] | 0.9574 | 0.9847 | 0.9570 | 0.9610 | 0.9664 |
| CCAE [62] | 0.9523 | 1 | 0.9742 | 0.9684 | 0.9755 |
| MBGH+CNN with EFF | 0.9387 | 0.9643 | 0.9313 | 0.9389 | 0.9448 |
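For clarity on the two summary metrics in Tables 5 and 7, assuming the usual definitions: overall accuracy is the fraction of all test samples classified correctly, whereas average accuracy is the unweighted mean of the per-class accuracies (the diagonal of the row-normalised confusion matrix). The short scikit-learn sketch below illustrates the difference; y_true and y_pred are placeholder arrays.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder ground-truth labels and predictions for a 3-class problem
y_true = np.random.randint(0, 3, 600)
y_pred = np.random.randint(0, 3, 600)

cm = confusion_matrix(y_true, y_pred)            # rows: true class, columns: predicted class
overall_accuracy = np.trace(cm) / cm.sum()       # correct samples / total samples
per_class_accuracy = np.diag(cm) / cm.sum(axis=1)
average_accuracy = per_class_accuracy.mean()     # unweighted mean over classes

print(f"overall: {overall_accuracy:.4f}, average: {average_accuracy:.4f}")
```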
Table 6. Confusion Matrix of the Proposed Method on Ten-Target Data.

| Type | BMP2 | BTR70 | T72 | BTR60 | 2S1 | BRDM | D7 | T62 | ZIL131 | ZSU234 |
|---|---|---|---|---|---|---|---|---|---|---|
| BMP2 | 0.9710 | 0.0034 | 0.0239 | 0.0017 | 0 | 0 | 0 | 0 | 0 | 0 |
| BTR70 | 0.0051 | 0.9949 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T72 | 0.0275 | 0.0034 | 0.9433 | 0 | 0.0069 | 0 | 0 | 0.0034 | 0.0069 | 0.0086 |
| BTR60 | 0 | 0 | 0.0051 | 0.9846 | 0.0103 | 0 | 0 | 0 | 0 | 0 |
| 2S1 | 0.0109 | 0.0036 | 0.0073 | 0 | 0.9745 | 0 | 0 | 0 | 0.0036 | 0 |
| BRDM | 0.0109 | 0 | 0 | 0.0036 | 0 | 0.9818 | 0 | 0 | 0.0036 | 0 |
| D7 | 0 | 0 | 0 | 0 | 0 | 0 | 0.9927 | 0 | 0.0073 | 0 |
| T62 | 0 | 0 | 0.0183 | 0 | 0 | 0 | 0 | 0.9707 | 0.0110 | 0 |
| ZIL131 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ZSU234 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0036 | 0.9964 |
Table 7. Detailed Accuracies of Different Types in the Ten-Target Data via Some SAR Recognition Methods.

| Method | BMP2 | BTR70 | T72 | BTR60 | 2S1 | BRDM | D7 | T62 | ZIL131 | ZSU234 | Overall Accuracy | Average Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proposed method | 0.9710 | 0.9949 | 0.9433 | 0.9846 | 0.9745 | 0.9818 | 0.9927 | 0.9707 | 1 | 0.9964 | 0.9760 | 0.9810 |
| Original image | 0.6899 | 0.8673 | 0.7131 | 0.7897 | 0.4453 | 0.9307 | 0.8905 | 0.7802 | 0.9124 | 0.9562 | 0.7774 | 0.7975 |
| PCA | 0.7070 | 0.8520 | 0.7715 | 0.8051 | 0.6971 | 0.7920 | 0.9598 | 0.8645 | 0.7956 | 0.9453 | 0.8030 | 0.8190 |
| Template matching | 0.8637 | 0.9235 | 0.6993 | 0.9179 | 0.8577 | 0.8869 | 0.9818 | 0.9670 | 0.9307 | 0.9745 | 0.8758 | 0.9003 |
| DL-JDSR [65] | 0.8876 | 0.9388 | 0.8625 | 0.8821 | 0.8905 | 0.9161 | 0.9854 | 0.9670 | 0.9234 | 0.9818 | 0.9148 | 0.9235 |
| AE | 0.8245 | 0.9082 | 0.6735 | 0.8718 | 0.9161 | 0.9562 | 0.9708 | 0.9451 | 0.9234 | 0.9891 | 0.8704 | 0.8979 |
| DAE | 0.7155 | 0.8980 | 0.7371 | 0.7436 | 0.5657 | 0.9599 | 0.9088 | 0.8498 | 0.9380 | 0.9672 | 0.8089 | 0.8284 |
| Euclidean AE [3] | 0.8790 | 0.9286 | 0.7955 | 0.9179 | 0.9380 | 0.9672 | 0.9891 | 0.9414 | 0.9453 | 0.9964 | 0.9129 | 0.9298 |
| VGGNet | 0.7683 | 0.9745 | 0.8872 | 0.9270 | 0.9818 | 0.9964 | 0.9780 | 0.9891 | 0.9854 | 0.8883 | 0.9166 | 0.9376 |
| A-ConvNets | 0.8961 | 0.9745 | 0.7887 | 0.9641 | 0.9197 | 0.9818 | 0.9526 | 0.9597 | 0.9891 | 1 | 0.9219 | 0.9426 |
| ResNet-18 | 0.9216 | 0.9541 | 0.8488 | 0.8923 | 0.9562 | 0.9161 | 0.9051 | 0.9341 | 0.8869 | 0.9562 | 0.9107 | 0.9171 |
| ResNet-34 | 0.8842 | 0.9286 | 0.8677 | 0.7795 | 0.9343 | 0.8759 | 0.9011 | 0.9234 | 0.9161 | 0.9380 | 0.8932 | 0.8949 |
| DenseNet | 0.9438 | 0.9745 | 0.9158 | 0.8410 | 0.9453 | 0.9526 | 0.9194 | 0.9745 | 0.9234 | 0.9708 | 0.9360 | 0.9361 |
| MBGH+CNN with EFF | 0.8518 | 0.8929 | 0.9553 | 0.8307 | 0.8723 | 0.9015 | 0.8864 | 0.8942 | 0.9343 | 0.8942 | 0.8876 | 0.8914 |
Table 8. Ablation Study.

| VAE Stream | LightNet Stream | Fusion Module | Overall Accuracy | Average Accuracy |
|---|---|---|---|---|
| ✓ | | N/A | 0.8813 | 0.8399 |
| | ✓ | N/A | 0.9487 | 0.9612 |
| ✓ | ✓ | Decision-level fusion | 0.9278 | 0.9357 |
| ✓ | ✓ | Addition | 0.9568 | 0.9664 |
| ✓ | ✓ | Concatenation | 0.9648 | 0.9715 |
| ✓ | ✓ | Proposed fusion module | 0.9780 | 0.9807 |
Table 9. FLOPs.

| | VAE Branch | LightNet Branch | Proposed Network | VGG Network | ResNet-18 | ResNet-34 | DenseNet |
|---|---|---|---|---|---|---|---|
| FLOPs | 3.4 × 10^5 | 2.3 × 10^7 | 2.33525 × 10^7 | 5.14 × 10^9 | 1.9 × 10^9 | 3.6 × 10^9 | 5.7 × 10^9 |
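The per-network counts in Table 9 can be approximated layer by layer. Under the common convention of counting one multiply-accumulate per kernel weight per output element, a convolutional layer costs k² · C_in · C_out · H_out · W_out operations; the helper below follows that convention and is only an illustrative sketch, not the exact counting rule used for Table 9.

```python
def conv2d_macs(c_in, c_out, k, h_out, w_out):
    """Approximate multiply-accumulate count of one 2D convolution layer."""
    return k * k * c_in * c_out * h_out * w_out

# Example: the first LightNet convolution in Table 1 (62 x 62 x 16 output)
print(conv2d_macs(c_in=1, c_out=16, k=11, h_out=62, w_out=62))  # 7441984, i.e., ~7.4e6 MACs
```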
Table 10. Detailed Accuracies of Different Types on Civilian Vehicle Data via Some SAR Recognition Methods.

| Method | Toyota Camry | Honda Civic 4dr | 1993 Jeep | 1999 Jeep | Nissan Maxima | Mazda MPV | Mitsubishi | Nissan Sentra | Toyota Avalon | Toyota Tacoma | Total Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Proposed method | 0.8694 | 0.9472 | 1 | 0.9306 | 0.9639 | 0.9528 | 0.9111 | 0.9889 | 1 | 0.9667 | 0.9530 |
| Original image | 0.9666 | 0.9306 | 0.9861 | 0.6528 | 0.7833 | 1 | 0.9 | 0.6111 | 1 | 1 | 0.8830 |
| PCA | 0.9444 | 0.9389 | 0.9556 | 0.6944 | 0.8278 | 1 | 0.9583 | 0.7306 | 1 | 1 | 0.9050 |
| Template matching | 0.9083 | 0.9194 | 0.9389 | 0.8722 | 0.9167 | 0.9444 | 0.8639 | 0.8333 | 0.9444 | 0.9861 | 0.9128 |
| DL-JDSR | 0.8833 | 0.9806 | 0.9750 | 0.9111 | 0.9639 | 0.8222 | 0.9583 | 0.9444 | 0.85 | 1 | 0.9289 |
| AE | 0.9944 | 0.9639 | 0.9389 | 0.8722 | 0.8778 | 0.9694 | 0.9639 | 0.6333 | 1 | 1 | 0.9213 |
| DAE | 0.9889 | 0.9722 | 0.9917 | 0.85 | 0.8833 | 0.9722 | 0.9278 | 0.6833 | 0.9972 | 1 | 0.9267 |
| VGGNet | 0.8278 | 0.7694 | 0.9611 | 0.7 | 0.9361 | 0.9139 | 0.8417 | 0.9194 | 0.9750 | 0.9250 | 0.8769 |
| A-ConvNets | 0.8694 | 0.9306 | 0.9972 | 0.8444 | 0.9917 | 0.9528 | 0.7306 | 0.9972 | 1 | 0.9917 | 0.9305 |
| ResNet-18 | 0.9452 | 0.9345 | 0.9764 | 0.8857 | 0.9248 | 0.9934 | 0.7756 | 0.7911 | 1 | 0.9847 | 0.9211 |
| ResNet-34 | 0.9437 | 0.9647 | 0.9713 | 0.6793 | 0.9537 | 0.9769 | 0.8985 | 0.8865 | 0.9691 | 1 | 0.9247 |
| DenseNet | 0.9608 | 0.9762 | 0.9842 | 0.9136 | 0.9157 | 0.9845 | 0.8223 | 0.7567 | 1 | 1 | 0.9314 |
| MBGH+CNN with EFF | 0.8757 | 0.9827 | 0.9801 | 0.8348 | 0.9214 | 0.9187 | 0.8467 | 0.8709 | 0.9422 | 0.9341 | 0.9017 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
