End-to-End Radar HRRP Target Recognition Based on Integrated Denoising and Recognition Network

Abstract: For high-resolution range profile (HRRP) radar target recognition in a low signal-to-noise ratio (SNR) scenario, traditional methods frequently perform denoising and recognition separately. In addition, they assume equal contributions of the target and the noise regions during feature extraction and fail to capture the global dependency. To tackle these issues, an integrated denoising and recognition network, namely, IDR-Net, is proposed. The IDR-Net achieves denoising through the denoising module after adversarial training, and learns the global relationship of the generated HRRP sequence using the attention-augmented temporal encoder. Furthermore, a hybrid loss is proposed to integrate the denoising module and the recognition module, which enables end-to-end training, reduces the information loss during denoising, and boosts the recognition performance. The experimental results on the measured HRRPs of three types of aircraft demonstrate that IDR-Net achieves higher recognition accuracy and greater robustness to noise than traditional methods.


Introduction
The high-resolution range profile (HRRP) of a target represents the 1D projection of its scattering centers along the radar line of sight (LOS), as shown in Figure 1. Compared with the 2D inverse synthetic aperture radar (ISAR) image, the HRRP is easier to acquire, store, and process. Moreover, it contains abundant structural signatures of the target such as the shape, size, and location of the main parts. Currently, automatic radar target recognition based on HRRP has received increasing attention in the radar automatic target recognition (RATR) community [1][2][3][4][5].
In real-world situations, however, strong noise leads to a low signal-to-noise ratio (SNR) and hinders effective feature extraction. To deal with this issue, the available methods implement denoising first and then carry out feature extraction and recognition [19]. In terms of deep neural networks, however, such two-stage processing prohibits end-to-end training, resulting in complicated processing as well as long operational time. Furthermore, decoupling denoising from recognition ignores the requirements that effective recognition places on noise suppression and signal extraction. Therefore, it is natural to study a network structure that integrates denoising and recognition to boost the performance and efficiency.
Traditional HRRP recognition methods are mainly divided into three categories: (1) feature domain transformation [6][7][8]; (2) statistical modeling [3][4][5][9,10]; and (3) kernel methods [11,12]. The first category obtains features in the transformation domain, e.g., the bispectra domain [6], by data projection, and then designs proper classifiers for HRRP recognition. The over-dependency on prior knowledge, however, degrades performance and robustness in complex scenarios where priors are improper or unavailable. The second category establishes statistical models by imposing specific distributions, e.g., Gaussian [5], on the HRRP, which may result in limited data description capability, optimization space, and generalization performance. The third category projects the HRRP to a higher-dimensional feature space through kernels. In order to obtain satisfactory recognition and generalization performance, however, the kernels should be carefully designed, e.g., by kernel optimization based on the localized kernel Fisher criterion [12].
In recent years, deep learning [14] has received intensive attention in HRRP recognition. Unlike traditional methods that rely heavily on hand-designed features, methods based on deep learning are data-driven, i.e., they extract features of the HRRP automatically through typical structures such as the autoencoder (AE) [15,16], the convolutional neural network (CNN) [17][18][19][20], and the recurrent neural network (RNN) [21,22]. The proposed method belongs to this category. Constituted by the encoder and the decoder, the AE attempts to output a copy of the input data by reconstructing it in an unsupervised fashion. In particular, the encoded, i.e., compressed, data in the middle serves as the recognition feature, which is then fed into the classifier for recognition [15,16]. The traditional CNN [17] extracts hierarchical spatial features from the input by cascaded convolutional and pooling layers, whereas it fails to capture the temporal information [18][19][20]. In view of this, the RNN [21] has a sequential architecture to process the current input and historical information simultaneously, so as to capture the temporal information of the target. However, it assumes that both the target and noise regions contribute equally to HRRP recognition, which may result in limited performance [22].
Mimicking human vision, the attention mechanism [23][24][25][26] captures long-term information and dependencies between input sequence elements by measuring the importance of the input to the output. Attention models have been designed for HRRP recognition [27][28][29][30][31], such as the target-attentional convolutional neural network (TACNN) [28], the target-aware recurrent attentional network (TARAN) [29], and the stacked CNN-Bi-RNN (CNN-Bi-RNN) [30]. TACNN, which is based on CNN, fails to make full use of the temporal correlation of HRRP, whereas TARAN, which is based on RNN and its variants, has difficulties in network training, parallelization, and long-term memory representation. Furthermore, CNN-Bi-RNN fuses the advantages of CNN and RNN and uses an attention mechanism to adjust the importance of features. In recent years, self-attention [32], which relates different positions of a single sequence to compute a global representation, has achieved efficient and parallel sequence modeling and feature extraction. Specifically, it acquires the attention score by calculating the correlation between the query vector and the key vector, and then applies the score as a weight to the value vector to form the output. Since the self-attention mechanism explicitly models the interactions between all elements in the sequence, it is a feature extractor of global information with long-term memory. Moreover, the global random access of the self-attention mechanism facilitates the fast and parallel modeling of long sequences. For HRRP recognition, self-attention has been added before the convolutional long short-term memory (ConvLSTM) [33] in order to focus on more significant range cells. Because the main recognition structure, i.e., the LSTM, is still a variant of RNN, it fails to directly use the different importance between features for recognition.
Remote Sens. 2022, 14, 5254
In addition, although the networks proposed by the existing methods have certain noise robustness, they fail to achieve satisfactory recognition results under low SNR.
Traditionally, HRRP denoising is implemented prior to feature extraction, and typical denoising methods include least mean square (LMS) [34,35], recursive least squares (RLS) [36], and eigen subspace techniques [37]. Such techniques, however, rely heavily on domain expertise and fail to estimate the model order (i.e., the number of signal components) accurately at low SNR. Recently, the generative adversarial network (GAN) has been introduced as a novel way to train a generative model, which learns complex distributions through the adversarial training between the generator and the discriminator [38]. Currently, GAN has been successfully applied to data generation [39,40], image conversion and classification [41,42], speech enhancement [43], and so on, which provides an effective way for blind HRRP denoising.
In a nutshell, three problems hinder effective recognition of the noisy HRRP: the separated HRRP denoising and recognition processes, the inability to distinguish the contributions of the target regions and noisy regions during feature extraction, and the incompetence in long-term/global dependency acquisition. Specifically, the output of the classifier cannot be fed back to the denoising process, so significant signal components may be suppressed during denoising. Meanwhile, the different intensity information of each component of the HRRP cannot be effectively utilized in the recognition process. Therefore, it is natural to integrate the tasks of denoising and recognition through elaborately designed deep architectures, under the guidance of a proper loss.
Aiming at the above issues, this paper proposes the integrated denoising and recognition network, namely, IDR-Net, to achieve effective HRRP denoising and recognition. The network consists of two modules, i.e., the denoising module and the recognition module. Specifically, the generator in the denoising module maps the noisy HRRP to the denoised one after adversarial training, which is then fed into the attention-augmented recognition module to output the target label. In particular, a new hybrid loss function is used to guide the denoising of HRRP. The main contributions of this paper include the following:
(a) To tackle the issue that separated HRRP denoising and recognition hinder end-to-end training and may suppress signal components that are significant for recognition, an integrated denoising and recognition model, i.e., the IDR-Net, is designed, which denoises the low-SNR HRRP through the denoising module and outputs the category label through the recognition module. To the best of our knowledge, our method integrates denoising and recognition for the first time, realizing end-to-end training and achieving better recognition performance.
(b) To tackle the issue of long-term and global dependency acquisition of HRRP, the recognition module adopts the attention-augmented temporal encoder with parallelized and global sequential feature extraction. In particular, the attention score is generated with emphasis on the important input data to weight the feature vector and facilitate recognition.
(c) A new hybrid loss is proposed, which, for the first time in HRRP recognition, combines the denoising loss and the classification loss as the loss function. By this means, the recognition module is integrated with the generator, thereby reducing the information loss during denoising and enhancing the inter-class dissimilarity.
The remainder of this paper is organized as follows: Section 2 discusses the related work, including the modelling of HRRP and the basic principles of GAN. Section 3 provides the detailed structure of the proposed IDR-Net. Section 4 presents the data set and experimental results with detailed analysis. Finally, Section 5 concludes this paper and discusses the future work.

HRRP Modeling
The high-resolution range profile (HRRP) is a 1D signature of an object, which represents the time-domain response of a target to a high-range-resolution radar pulse [13]. The complex-valued HRRP of the target at the $n$th pulse can be expressed as
$$x_C(n) = e^{j\theta(n)} \left[x_1(n), x_2(n), \ldots, x_M(n)\right]^T$$
where $\theta(n)$ is the initial phase induced by translation. For the $m$th, $m \in [1, M]$, range cell, $x_m(n) = \sum_{p=1}^{P} \sigma_{mp} e^{j\varphi_{mp}(n)}$ is the amplitude, where $P$ is the number of scattering centers; $\sigma_{mp}$ is the radar cross section of the $p$th scattering center; and $\varphi_{mp}(n)$ is the phase induced by the rotation of the $p$th scattering center. In addition, $T$ denotes vector transpose. Then, we obtain the real-valued HRRP by taking the modulus of $x_C(n)$, i.e.,
$$x(n) = |x_C(n)| = \left[|x_1(n)|, |x_2(n)|, \ldots, |x_M(n)|\right]^T$$
Generally, the HRRP is characterized by: (1) translation sensitivity; (2) amplitude sensitivity; and (3) aspect sensitivity. Specifically, the translational motion of the target leads to unknown shifts among HRRPs along the range/temporal dimension, and the variation of the distance between the target and radar causes amplitude fluctuation. Moreover, each scattering center has its own amplitude and phase characteristics, and these are combined as vectors to provide a net amplitude and phase return in the associated range cell, i.e., $x_m(n)$. These interference effects between scattering centers can give rise to rapid changes of the HRRP with aspect angle. To alleviate the sensitivities discussed above, we perform HRRP alignment and normalization, and then generate the training set utilizing HRRPs with various aspect angles.
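The scattering-center model above can be illustrated with a minimal NumPy sketch. The function name `simulate_hrrp`, the random amplitudes and phases, and the choice of four scattering centers per range cell are assumptions made only for illustration; they are not the measured data or the authors' code. The sketch shows that taking the modulus removes the common initial phase $\theta(n)$.

```python
import numpy as np

def simulate_hrrp(sigma, phi, theta):
    """Return the complex and real-valued HRRP of one pulse.

    sigma: (M, P) array of scattering amplitudes sigma_mp (hypothetical values)
    phi:   (M, P) array of rotation-induced phases in radians
    theta: scalar initial phase induced by translation
    """
    # x_m(n) = sum over p of sigma_mp * exp(j * phi_mp(n)) for each range cell m
    x_cells = np.sum(sigma * np.exp(1j * phi), axis=1)
    # The common phase term multiplies every range cell
    x_complex = np.exp(1j * theta) * x_cells
    # The real-valued HRRP is the modulus of the complex HRRP
    return x_complex, np.abs(x_complex)

rng = np.random.default_rng(0)
M, P = 256, 4  # 256 range cells; P = 4 is an illustrative assumption
sigma = rng.uniform(0.0, 1.0, size=(M, P))
phi = rng.uniform(0.0, 2.0 * np.pi, size=(M, P))

# Same scatterers, two different initial phases
_, hrrp_a = simulate_hrrp(sigma, phi, theta=0.3)
_, hrrp_b = simulate_hrrp(sigma, phi, theta=2.1)
```

Because the modulus discards $e^{j\theta(n)}$, `hrrp_a` and `hrrp_b` coincide; the remaining translation, amplitude, and aspect sensitivities are the ones handled by alignment, normalization, and multi-aspect training.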

GAN
GAN is a deep learning framework for estimating generative models via adversarial training [38], which sidesteps the difficulty in approximating many intractable probabilistic computations. In general, a GAN consists of two adversarial models: a generator G to capture the data distribution, and a discriminator D to estimate the probability that a sample comes from the training data rather than G. That is to say, G attempts to generate samples close to the real samples so that the discriminator D cannot distinguish them; at the same time, D attempts to distinguish real samples from generated ones.
Both G and D can be non-linear mapping functions, e.g., deep neural networks, and are trained following the two-player min-max game with the value function
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$
where $\mathbb{E}[\cdot]$ is the expectation; $x$ is a sample from the real distribution $p_{data}(x)$; and $z$ is the noise from the latent distribution $p_z(z)$. By minimizing $\log(1 - D(G(z)))$, parameters of G are adjusted to map $z$ into a new sample which is expected to have distribution $p_g$. Ideally, $p_g$ should be as close to $p_{data}$ as possible. By maximizing $\log D(x)$, parameters of D are adjusted to distinguish the generated samples from the true ones. In practice, G and D are trained alternately until convergence.
Traditional GAN is an unconditioned generative model, that is, there is no control on modes of the data being generated. In view of this, the conditional GAN (CGAN) [44] conditions the model on additional information and directs the data generation process. Specifically, it performs the conditioning by feeding the extra information $y$ to G and D in the training process. Then, the objective function becomes
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x \mid y)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z \mid y))\right)\right]$$
Currently, CGAN has been successfully applied to style transformation, such as image denoising [45] and image-to-image translation [46].
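As a rough numerical illustration of the value function, the sketch below estimates $V(D, G)$ from batches of discriminator outputs. The function name `gan_value` and the sample values are hypothetical, and the discriminator outputs are assumed to lie in (0, 1); this is the standard log-form GAN objective, not necessarily the exact loss used later in the IDR-Net.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1).

    d_real: D evaluated on real samples
    d_fake: D evaluated on generated samples G(z)
    eps:    small constant that keeps the logarithm finite
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# A discriminator that separates real from generated samples well
v_strong = gan_value([0.99, 0.98], [0.01, 0.02])
# A discriminator that is fully fooled and outputs 0.5 everywhere
v_fooled = gan_value([0.5, 0.5], [0.5, 0.5])
```

D tries to push the value up (towards 0), whereas a generator that fully fools D drives it to $-\log 4$, which is the value at the theoretical equilibrium $p_g = p_{data}$.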

Network Structure
This section introduces the structure of IDR-Net, which consists of the denoising module and the recognition module. Firstly, the denoising module implements HRRP denoising through the generator. Then, the denoised HRRP is fed into the recognition module, which calculates the attention weights, extracts the features, and outputs the classification label. The framework of IDR-Net is shown in Figure 2, and the detailed structures will be introduced in Sections 3.1-3.3.

The Denoising Module
The denoising module treats HRRP denoising as a style transformation problem of converting the noisy HRRP into a clean HRRP. For this purpose, the generator G and the discriminator D are designed according to the dimensionality of HRRP and trained with conditional information. Specifically, the generator G maps the noisy HRRP $x_{noisy}$ to the denoised HRRP $x_{denoised}$, and the discriminator D distinguishes $x_{denoised}$ from the real noise-free HRRP $x_{clean}$. Below, we will discuss detailed structures of G and D.

The Generator
According to the principles of GAN, the output of the generator G, i.e., $x_{denoised}$, should resemble the real noise-free HRRP $x_{clean}$ as closely as possible, so that the discriminator D cannot distinguish $x_{denoised}$ from $x_{clean}$. In the IDR-Net, the non-linear mapping from $x_{noisy}$ to $x_{denoised}$ is achieved by an encoder and a decoder with symmetrical structures, as shown in Figure 3. For instance, "conv1D 16@15 1_2" denotes 1D convolution with a kernel size of 15 and a stride of 2, whereas "deconv1D 64@15 1_2" denotes deconvolution with a kernel size of 15 and a stride of 2. For HRRP denoising, the kernel size is set to 15 × 1 with a stride of 2 for each convolutional layer. The dimension of the input is 256 × 1, and the dimensions of the output feature maps of the encoder layers are 128 × 16, 64 × 32, 32 × 32, 16 × 64, and 8 × 64, respectively. Then, the output $c$ of the last layer in the encoder is fed into the decoder, where the dimensions of the outputs of the layers are 16 × 64, 32 × 32, 64 × 32, 128 × 16, and 256 × 1, respectively. The last layer outputs the denoised HRRP $x_{denoised}$.
Figure 3. Detailed structure of the generator.
As illustrated in the upper left part of the training process in Figure 2, the generator connects the output of each encoding layer and the output of the symmetrical decoding layer along the channel dimension through skip connection. By this means, it directly transfers the low-level features to the decoder without compression and facilitates gradient propagation.
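The feature-map dimensions quoted for the generator follow directly from the stride of 2. The short sketch below reproduces them under the assumption of "same" padding, so that each strided convolution halves the length and each deconvolution doubles it; the helper names and the padding choice are illustrative assumptions, while the channel counts are read off the dimensions given in the text.

```python
def encoder_shapes(length=256, channels=(16, 32, 32, 64, 64), stride=2):
    """Output (length, channels) of each strided 1D convolution."""
    shapes = []
    for c in channels:
        # Ceil division: "same" padding is assumed, so only the stride matters
        length = -(-length // stride)
        shapes.append((length, c))
    return shapes

def decoder_shapes(length=8, channels=(64, 32, 32, 16, 1), stride=2):
    """Output (length, channels) of each strided 1D deconvolution."""
    shapes = []
    for c in channels:
        # A deconvolution with stride 2 doubles the temporal length
        length = length * stride
        shapes.append((length, c))
    return shapes

enc = encoder_shapes()
dec = decoder_shapes()
```

Note that these are the output sizes of each layer; with skip connections, the input to a decoding layer additionally carries the channels of the matching encoder output concatenated along the channel dimension.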

The Discriminator
In the discriminator D, the noise-free HRRP $x_{clean}$ and the denoised HRRP $x_{denoised}$ are each concatenated with the same noisy signal $x_{noisy}$ to obtain the pairs $(x_{clean}, x_{noisy})$ and $(x_{denoised}, x_{noisy})$. Conditioned on $x_{noisy}$, these pairs are then adopted as the real and generated samples, respectively, and fed into the network, as shown by the lower part of the training process of Figure 2. By introducing the conditional information $x_{noisy}$, we increase the similarity between the real samples and the generated ones, thereby facilitating the initial training stage of the network. That is, the outputs of G become closer to the real samples, and the capability of D to distinguish the real samples from the generated ones is enhanced.
As shown in Figure 4, the discriminator is composed of a series of 1D convolutional layers and fully connected layers, which gives it a certain robustness to feature position. The size and number of the first five convolutional kernels are the same as those of the encoder in G. Moreover, LeakyReLU [45] with non-zero derivative is added to each convolutional layer, and the dimensions of the output feature maps are 128 × 16, 64 × 32, 32 × 32, 16 × 64, and 8 × 64, respectively. Then, a convolutional layer with a kernel size of 1 × 1 and a stride of 1 is utilized to flatten the 2D feature map into a 1D vector. Finally, the fully connected layer outputs a scalar to indicate whether the current sample is real or generated.

The Recognition Module
The recognition module determines the category label of the denoised sample $x_{denoised}$ given by G. To exploit the sequential information among range cells of a single HRRP, we slide a sampling window of fixed size continuously to generate the HRRP sequence. As discussed in Section 1, the traditional attention mechanism is confined to the inherent order of the sequence and thus processes only two adjacent time steps at a time. Therefore, it is essentially a local perception model and is unable to capture the global relationship of the entire sequence in parallel. To deal with this issue, the recognition module captures the long-term dependence efficiently through the attention-augmented temporal encoder, as shown in Figure 5.
Considering an HRRP, the sequence $X_{seq} = [x_1, \cdots, x_N]^T$ is generated by sliding the sampling window with length $d_w$ and step size $d_w/2$, where $x_s \in \mathbb{R}^{d_w}$, $s = 1, \ldots, N$, and $N$ is the number of segments. Then, a weight matrix $W_{map} \in \mathbb{R}^{d_w \times d}$ maps $X_{seq}$ linearly to obtain the embedding vectors $E^0 = [e^0_1; e^0_2; \ldots; e^0_N]$ satisfying
$$E^0 = X_{seq} W_{map}$$
where $d$ is the hidden size.
Considering the position invariance of the attention mechanism, a learnable position encoding $P \in \mathbb{R}^{N \times d}$ is added to $E^0$, so as to better capture the sequential features, i.e.,
$$Z^0 = E^0 + P$$
After that, the $L$-layer attention-augmented temporal encoder calculates the attention score from $Z^0$ on the temporal dimension. Assuming that the input of the $l$th layer ($l = 1, \ldots, L$) of the encoder is $Z^{l-1} \in \mathbb{R}^{N \times d}$, the key $K^l$, query $Q^l$, and value $V^l$ of the $l$th layer are calculated as follows:
$$k^l_n = \mathrm{LN}(z^{l-1}_n) W^l_k, \quad q^l_n = \mathrm{LN}(z^{l-1}_n) W^l_q, \quad v^l_n = \mathrm{LN}(z^{l-1}_n) W^l_v$$
where $k^l_n$, $q^l_n$, and $v^l_n$ are row vectors of $K^l$, $Q^l$, and $V^l$, respectively; $z^{l-1}_n$ is the $n$th row of $Z^{l-1}$; $W^l_k$, $W^l_q$, and $W^l_v \in \mathbb{R}^{d \times d}$ are dimension transformation matrices; and $\mathrm{LN}(\cdot)$ is the layer normalization, which calculates the mean and variance over all features of each input, i.e.,
$$\mathrm{LN}(z) = \rho \odot \frac{z - \mu}{\sqrt{\sigma^2 + \xi}} + b$$
where $\rho$ and $b$ are variable parameters; $\mu$ is the mean value and $\sigma^2$ is the variance; and $\xi$ is a small nonzero value. The attention score of the $l$th layer $A^l \in \mathbb{R}^{N \times d}$ can be calculated by:
$$A^l = \mathrm{softmax}\left(\frac{Q^l (K^l)^T}{\sqrt{d}}\right) V^l$$
To accelerate convergence, the residual $Z^{l-1}$ is added to $A^l$ and layer normalization is performed:
$$H^l = \mathrm{LN}(A^l + Z^{l-1})$$
Then, it is fed into a feedforward neural network (FFN), i.e.,
$$\mathrm{FFN}(H^l) = \mathrm{ReLU}(H^l W_1) W_2 \tag{12}$$
where $\mathrm{ReLU}(\cdot)$ is the rectified linear unit [47,48]; $W_1 \in \mathbb{R}^{d \times d_f}$ and $W_2 \in \mathbb{R}^{d_f \times d}$ are weight matrices in the FFN; and $d_f$ is the corresponding dimension. Furthermore, we add $H^l$ to (12) and perform layer normalization to obtain $Z^l \in \mathbb{R}^{N \times d}$:
$$Z^l = \mathrm{LN}(\mathrm{FFN}(H^l) + H^l)$$
After the $L$-layer encoder, $Z^L$ is vectorized into a feature vector $s \in \mathbb{R}^{1 \times (N \times d)}$. Finally, the category label $y \in \mathbb{R}^{1 \times K}$ is obtained by passing $s$ through fully connected layers with weights $W_1 \in \mathbb{R}^{(N \times d) \times d_f}$ and $W_2 \in \mathbb{R}^{d_f \times K}$, where $K$ is the number of target categories and $d_f$ is the dimension of the fully connected layers.
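The sequence generation and one layer of the attention-augmented temporal encoder can be sketched in NumPy as follows. This is a simplified illustration rather than the authors' TensorFlow implementation: the weights are random, a single head is used, the scaling by $\sqrt{d}$ follows the standard self-attention form, one pair of layer-normalization parameters is shared by all three normalizations, and $d_f = 256$ is a hypothetical value. The window length 6 and hidden size 128 are the values reported in the training section.

```python
import numpy as np

def sliding_windows(x, d_w):
    """Split a 1D HRRP into windows of length d_w with step d_w // 2."""
    step = d_w // 2
    n = (len(x) - d_w) // step + 1
    return np.stack([x[i * step : i * step + d_w] for i in range(n)])

def layer_norm(z, rho, b, xi=1e-6):
    """Normalize each row over its features, then scale and shift."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return rho * (z - mu) / np.sqrt(var + xi) + b

def softmax(a):
    """Row-wise softmax with the usual max subtraction for stability."""
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_layer(z, p):
    """One encoder layer: self-attention, residual + LN, FFN, residual + LN."""
    d = z.shape[-1]
    zn = layer_norm(z, p["rho"], p["b"])
    q, k, v = zn @ p["Wq"], zn @ p["Wk"], zn @ p["Wv"]
    weights = softmax(q @ k.T / np.sqrt(d))       # (N, N) pairwise weights
    a = weights @ v                               # attention output A^l
    h = layer_norm(a + z, p["rho"], p["b"])       # H^l
    ffn = np.maximum(h @ p["W1"], 0.0) @ p["W2"]  # ReLU(H W1) W2
    return layer_norm(ffn + h, p["rho"], p["b"]), weights

rng = np.random.default_rng(0)
d_w, d, d_f = 6, 128, 256            # d_f is a hypothetical choice
x = rng.random(256)                  # stand-in for a denoised HRRP
X_seq = sliding_windows(x, d_w)      # (N, d_w)
N = X_seq.shape[0]

W_map = rng.normal(scale=0.1, size=(d_w, d))
P = rng.normal(scale=0.1, size=(N, d))   # learnable in practice, random here
Z0 = X_seq @ W_map + P

params = {
    "Wq": rng.normal(scale=0.1, size=(d, d)),
    "Wk": rng.normal(scale=0.1, size=(d, d)),
    "Wv": rng.normal(scale=0.1, size=(d, d)),
    "W1": rng.normal(scale=0.1, size=(d, d_f)),
    "W2": rng.normal(scale=0.1, size=(d_f, d)),
    "rho": np.ones(d),
    "b": np.zeros(d),
}
Z1, attn = encoder_layer(Z0, params)
```

Each row of `attn` sums to one and spans all $N$ segments, which is what gives every segment direct access to the whole sequence in a single, parallel matrix product.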

Construction of the Hybrid Loss
Traditional methods implement HRRP denoising and recognition separately, under the guidance of different losses. Such manipulation may lose important signal components beneficial to recognition. In view of this, we introduce the recognition loss to the value function of GAN and design the hybrid loss for integrated training of the IDR-Net. By this means, the recognition module is associated with the generator in the training process, thereby boosting the HRRP denoising, feature extraction, and recognition performance.
Since the generator and discriminator are trained alternately, this paper expresses the losses of G and D in the denoising module separately, as (14) and (15). These losses are composed of the CGAN loss $L_{base}$; the gradient penalty term $L_{GP}$; the regularization terms $L_{l_1}$ and $L_{l_2}$; and the recognition loss $L_{class}$. Moreover, $\lambda_{GP}$, $\rho_1$, $\rho_2$, and $\beta$ denote the corresponding coefficients. Below, each term will be discussed in detail.
To facilitate sample generation, i.e., HRRP denoising, we introduce the supervised learning strategy by adding $x_{noisy}$ to the loss function of GAN, which gives the CGAN loss $L_{base}$, where $G(\cdot)$ denotes the generated samples and $D(\cdot)$ denotes the discriminative score. To avoid gradient explosion or vanishing and obtain a well-trained model, the gradient penalty term $L_{GP}$ is introduced. In addition, $L_{l_1}$ and $L_{l_2}$ measure the similarity between the denoised sample and the clean one through the $l_1$ and $l_2$ norms of their difference. In particular, the recognition loss $L_{class}$ is added to the loss of G, which is expressed as
$$L_{class} = -\sum_{k=1}^{K} t_k \log y_k$$
where $t_k$ is the $k$th entry of the true label $t$, and $y_k$ is the $k$th entry of the predicted label $y$. By omitting the irrelevant terms in (14) and (15), this paper finally obtains the losses of the generator and the discriminator of the IDR-Net, i.e., (21) and (22), where $\lambda$ is a regularization coefficient and $\alpha \in (0, 1)$ adjusts the proportion of $L_{l_1}$ and $L_{l_2}$.
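To make the role of each generator-side term concrete, the sketch below combines a similarity term, a classification term, and an adversarial term into one scalar. It is only an illustration under stated assumptions: the mean reduction, the squared form of the $l_2$ term, the plain weighted sum, and the function names are all hypothetical, and the adversarial term is passed in as a precomputed number because its exact form is not reproduced here.

```python
import numpy as np

def l1_term(x_denoised, x_clean):
    """Mean absolute difference between denoised and clean HRRP."""
    return np.mean(np.abs(x_denoised - x_clean))

def l2_term(x_denoised, x_clean):
    """Mean squared difference between denoised and clean HRRP."""
    return np.mean((x_denoised - x_clean) ** 2)

def class_term(y_pred, t_true, eps=1e-12):
    """Cross-entropy between one-hot label t and predicted probabilities y."""
    return -np.sum(t_true * np.log(y_pred + eps))

def generator_side_loss(adv, x_denoised, x_clean, y_pred, t_true,
                        rho1, rho2, beta):
    """Weighted sum of adversarial, similarity, and recognition terms."""
    return (adv
            + rho1 * l1_term(x_denoised, x_clean)
            + rho2 * l2_term(x_denoised, x_clean)
            + beta * class_term(y_pred, t_true))

# Toy values: the denoised profile is off by one in every range cell
x_clean = np.zeros(4)
x_denoised = np.ones(4)
y_pred = np.array([0.5, 0.25, 0.25])   # predicted class probabilities
t_true = np.array([1.0, 0.0, 0.0])     # one-hot true label

loss = generator_side_loss(0.0, x_denoised, x_clean, y_pred, t_true,
                           rho1=1.0, rho2=1.0, beta=1.0)
```

Because the classification term depends on the generator output through the recognition module, its gradient reaches the generator, which is the mechanism by which recognition can guide denoising.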

Data Sets and Pre-Processing
In this section, we adopt the measured HRRPs of three types of aircraft, i.e., An-26, Cessna Citation S/II, and Yak-42, to design the experiments of network validation and performance analysis. The optical images and typical HRRPs are illustrated in Figure 6, and the sizes are listed in Table 1. Projections of the flight paths on the ground plane are illustrated in Figure 7, with the radar located at the origin (0, 0). The radar pulse repetition frequency is 400 Hz, the bandwidth is 400 MHz, and the range resolution is 0.375 m. The echoes are divided into several segments, and the corresponding flight paths are indicated by integers ranging from 1 to 7. The number of samples for each data segment is listed in Table 2.
As a matter of routine [16,18][27][28][29][30], the training set is constructed by sampling the 5th and the 6th HRRP segments of An-26, the 6th and the 7th HRRP segments of Cessna Citation S/II, and the 2nd and the 5th HRRP segments of Yak-42, whereas the test set is constructed by sampling the remaining HRRP segments. Moreover, the sampling interval is 20, and the numbers of training and test samples are 7398 and 16,656, respectively. Such settings cover a wider range of aspect angles and mitigate the aspect sensitivity of HRRP. The division of the data set and the number of samples are shown in Table 3. For the translation sensitivity, we align the samples by calculating the centroid of each HRRP [49], which is assumed to be constant in a short observation time.
To eliminate the amplitude sensitivity, $l_2$-norm normalization is applied to each HRRP through $\bar{x} = x / \|x\|_2$, where $x$ denotes the aligned HRRP and $\|\cdot\|_2$ denotes the $l_2$ norm.
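The two pre-processing steps can be sketched as follows. This is a minimal illustration, assuming a real-valued amplitude profile, a power-weighted centroid, and a circular shift that moves the centroid to the middle range cell; the alignment procedure of [49] may differ in its details, and the function names are hypothetical.

```python
import math


def centroid_align(profile):
    """Circularly shift a profile so its power-weighted centroid sits at the centre cell."""
    n = len(profile)
    power = [v * v for v in profile]
    total = sum(power)
    if total == 0:
        return list(profile)  # nothing to align in an all-zero profile
    centroid = sum(i * p for i, p in enumerate(power)) / total
    shift = n // 2 - int(round(centroid))
    # Output cell i takes the value that was `shift` cells to its left.
    return [profile[(i - shift) % n] for i in range(n)]


def l2_normalize(profile):
    """Scale a profile to unit l2 norm to remove amplitude sensitivity."""
    norm = math.sqrt(sum(v * v for v in profile))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero profile")
    return [v / norm for v in profile]
```

Note that the linear centroid used here is only approximate when the target wraps around the edges of the range window.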

Training and Testing Process
We train the generator, the recognition module, and the discriminator of the IDR-Net alternately using the losses described in (21) and (22). Specifically, we calculate the gradients through back-propagation [50] and update the network parameters through root mean square propagation (RMSprop) gradient descent [51]. This method adaptively adjusts the learning rate and descends faster than conventional gradient descent. In the update steps, $k$ is the index of iterations; $\theta_k$ is the set of network parameters at the kth iteration; $\partial L / \partial \theta_k$ is the partial derivative of the loss $L$ with respect to $\theta_k$; $\phi$ is the momentum coefficient; $\eta$ is the learning rate; $\delta$ is a small positive number to avoid a zero divisor; and $\odot$ is the element-wise product.
During training, the network is trained using noise-added measured data. The detailed training process of the IDR-Net is shown in Algorithm 1, where $\theta^G_k$, $\theta^D_k$, and $\theta^R_k$ represent the parameter sets of the generator G, the discriminator D, and the recognition module, respectively, at the kth iteration; and $x_k$ represents the output of the generator G.
Algorithm 1. Iterative alternating training process of the IDR-Net.
Additionally, the number of neurons in the fully connected layer in D is 8; the length of the sliding window in the recognition module is 6; the number of layers in the attention-augmented temporal encoder is 5; and $d$ is set to 128. The detailed description of the hyperparameters is shown in Table 4. These parameters were determined empirically to improve the performance of the IDR-Net. In the testing process, as shown in the lower part of Figure 2, we fix the weights of G and the recognition module, feed the noisy test sample to G, and then obtain the category label from the recognition module.
The IDR-Net is implemented in TensorFlow, and the training and testing phases are run on an NVIDIA GTX 1080Ti GPU.
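For readers unfamiliar with RMSprop, one parameter update can be sketched as below. This is the commonly used form of the rule (a running average of squared gradients followed by a scaled step), written in plain Python rather than TensorFlow; the function name, the default values, and the placement of δ outside the square root are assumptions, not the settings of Table 4.

```python
import math


def rmsprop_step(theta, grad, sq_avg, lr=0.001, phi=0.9, delta=1e-8):
    """Apply one RMSprop update and return (new_theta, new_sq_avg).

    theta, grad and sq_avg are equally long lists of floats; sq_avg holds
    the running average of squared gradients from the previous iteration.
    """
    new_sq_avg = [phi * r + (1.0 - phi) * g * g for r, g in zip(sq_avg, grad)]
    new_theta = [
        t - lr * g / (math.sqrt(r) + delta)
        for t, g, r in zip(theta, grad, new_sq_avg)
    ]
    return new_theta, new_sq_avg


# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, sq_avg = [1.0], [0.0]
for _ in range(50):
    grad = [2.0 * t for t in theta]
    theta, sq_avg = rmsprop_step(theta, grad, sq_avg, lr=0.01)
```

Because each step is divided by the root of the running squared gradient, the effective step size stays on the order of the learning rate regardless of the gradient scale.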

Recognition Results
We treat the original HRRPs of the three aircraft as noise-free samples and generate the noisy training and test samples by adding Gaussian noise. We then preprocess them following the steps introduced in Section 4.1. Given SNRs of 5 dB, 10 dB, and 15 dB, the confusion matrix, overall accuracy (OA), and per-class accuracy (PA) of the IDR-Net on the noisy test sets are shown in Table 5. Each column of the confusion matrix denotes the true category label, whereas each row denotes the predicted label. The recognition accuracy is 77.97%, 85.30%, and 88.44% for SNRs of 5 dB, 10 dB, and 15 dB, respectively. Moreover, the recognition accuracy of Yak-42 is higher than that of An-26 and Cessna Citation S/II, which may be due to the similar sizes and trajectories of An-26 and Cessna Citation S/II.
To evaluate the denoising performance, we calculate the root mean square error (RMSE) between the denoised and the noise-free HRRPs. A smaller RMSE indicates better denoising performance. For the test sets with different SNRs, the statistical histograms of the RMSE before and after denoising are shown in Figure 8. By comparing the images in the same column, we observe that the histogram shifts to the left after denoising, demonstrating the effectiveness of the generator.
To explain the feature extraction ability of the IDR-Net explicitly, Figure 9 visualizes the deep features of the noisy test samples for SNRs of 5 dB, 10 dB, and 15 dB, respectively, by applying the t-distributed stochastic neighbor embedding (t-SNE) [52] to the output of the fully connected layer in the recognition module. Specifically, the first row demonstrates the separability of the original noisy samples, whereas the second row demonstrates the separability of the denoised ones. The red, green, and blue markers represent features of the An-26, Cessna Citation S/II, and Yak-42 aircraft, respectively. It is observed that the separability of the three aircraft is boosted after denoising and attention-augmented temporal feature extraction.
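Generating a noisy sample at a prescribed SNR can be sketched as follows. The sketch assumes a real-valued profile and defines the SNR as the ratio of average signal power to noise variance; the paper does not spell out its exact SNR definition, so this convention and the function names are assumptions.

```python
import math
import random


def add_gaussian_noise(profile, snr_db, rng=None):
    """Return the profile plus white Gaussian noise at the requested SNR in dB."""
    rng = rng or random.Random()
    signal_power = sum(v * v for v in profile) / len(profile)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in profile]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB of a noisy profile relative to its clean version."""
    signal_energy = sum(c * c for c in clean)
    noise_energy = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_energy / noise_energy)
```

For a single short profile the measured SNR fluctuates around the target value, since the noise variance is only matched on average.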
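The RMSE between two profiles follows the usual definition, sketched here for a single pair of equally long real-valued HRRPs. The function name is hypothetical, and any averaging over the test set is left out.

```python
import math


def rmse(denoised, clean):
    """Root mean square error between a denoised profile and its clean reference."""
    if len(denoised) != len(clean) or not clean:
        raise ValueError("profiles must be non-empty and of equal length")
    squared_error = sum((d - c) ** 2 for d, c in zip(denoised, clean))
    return math.sqrt(squared_error / len(clean))
```

Computing this value once for the noisy input and once for the generator output of each test sample yields the two histograms that are compared per SNR.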

Ablation Study
To demonstrate the validity of the denoising module (including the generator and the discriminator), the integrated denoising and recognition architecture, and the hybrid loss, we design two models: (1) the recognition network, i.e., the IDR-Net without the denoising module; and (2) the two-stage network, which carries out HRRP denoising and recognition separately. Similar to the IDR-Net, we feed the noisy samples into the recognition network for training and testing, and the loss function is expressed as Equation (20).
The two-stage network first performs HRRP denoising through the denoising module of the IDR-Net. Then, it feeds the denoised HRRPs into the recognition network to obtain the class label. It is worth noting that the denoising module is first trained on the noisy samples and the corresponding noise-free samples using its own loss function. Then, the weights of the generator are fixed, and the denoised samples together with their labels are used to train the recognition network with the loss function given in Equation (20).
Detailed configurations for the two models and the IDR-Net are listed in Table 6, and the corresponding recognition accuracies are listed in Table 7. It can be found that the IDR-Net achieves the highest recognition accuracy for SNR of 5 dB, 10 dB, and 15 dB.

Compared with the recognition network, the recognition accuracy of the IDR-Net is improved by about 2%, demonstrating the effectiveness of the denoising module. Because the denoising module and the recognition module are trained separately in the two-stage model, the denoised samples may lose information that is beneficial to recognition. In contrast, the IDR-Net achieves integrated denoising and recognition through the hybrid loss, so that the denoising module is guided to generate samples that facilitate recognition. Therefore, the recognition accuracy of the IDR-Net is about 3% higher than that of the two-stage network.

Contrast Experiments
Although many methods for HRRP recognition have been proposed in recent years, they either design two networks for denoising and recognition separately, such as the SMTR-Net [19], or directly design networks that are not robust to noise, such as DPmTRNN [1] and RFRAN [31]. Below, we compare the performance of the IDR-Net with traditional HRRP recognition methods and recently proposed methods with certain noise robustness, i.e., the linear support vector machine (LSVM) [27], the CNN [18], the TACNN [28], the TARAN [29], the CNN-Bi-RNN [30], and the class factorized complex variational autoencoder (CFCVAE) [16]. Among them, the LSVM is a traditional kernel method, which has satisfactory recognition and generalization performance. The remaining methods are deep models. Specifically, the CNN effectively extracts the local structural information of the HRRP; the TACNN is an attention-augmented CNN, where the learned attention coefficients better represent the importance of each local feature in the recognition task; the TARAN is an attention-augmented RNN that captures the temporal dependence and considers the contributions of different range cells during feature extraction; the CNN-Bi-RNN fuses the advantages of the CNN and the RNN and uses an attention mechanism to adjust the importance of features; and the CFCVAE is a variant of the AE, which improves the feature characterization ability through multiple class-decoders.
Comparisons of the recognition accuracies between the available models and the IDR-Net are listed in Table 8, where the proposed model achieves the highest recognition accuracy on the noisy test sets with different SNRs. Because traditional recognition methods mainly utilize shallow models, they have limited data description capabilities. In contrast, deep neural networks are data-driven and can extract hierarchical features conducive to HRRP recognition. As demonstrated by Table 8, the recognition accuracies of the deep models are significantly higher than that of the traditional method. However, the CNN fails to capture the sequential relationships, whereas the methods based on traditional attention cannot describe the global information of the HRRP sequence. To tackle these issues, the IDR-Net suppresses the impact of noise on HRRP feature extraction through the denoising module, and then uses the attention-augmented temporal encoder to extract the global information in parallel, thereby effectively boosting the recognition accuracy and the robustness to noise.

Conclusions
To integrate the denoising and recognition of HRRPs in low SNR scenarios, this paper proposes the IDR-Net, which converts the noisy HRRP to a denoised HRRP through adversarial training, and realizes global relationship extraction through the self-attention mechanism. The hybrid loss is designed to preserve significant features beneficial to recognition during denoising and to facilitate end-to-end training. The experimental results on the measured HRRP data have demonstrated that the IDR-Net has higher recognition accuracy and stronger robustness to noise than traditional methods.
In the future, we will focus on studying effective feature extraction and recognition of HRRP under complex conditions such as data deficiency and deformation, and on exploring sequential features for HRRP sequence recognition.