Attention-Based Recurrent Temporal Restricted Boltzmann Machine for Radar High Resolution Range Profile Sequence Recognition

High Resolution Range Profile (HRRP) recognition has attracted considerable attention in the field of Radar Automatic Target Recognition (RATR). However, traditional HRRP recognition methods fail to model high-dimensional sequential data efficiently and have poor anti-noise ability. To deal with these problems, a novel stochastic neural network model named Attention-based Recurrent Temporal Restricted Boltzmann Machine (ARTRBM) is proposed in this paper. The RTRBM is utilized to extract discriminative features and the attention mechanism is adopted to select the major ones. The RTRBM models high-dimensional HRRP sequences efficiently because it can extract the temporal and spatial correlation between adjacent HRRPs. The attention mechanism, which has been used in sequential recognition tasks including machine translation and relation classification, makes the model pay more attention to the features that matter most for recognition. Therefore, the combination of the RTRBM and the attention mechanism allows our model to extract more internally related features and to choose the important parts of them. Additionally, the model performs well on noise-corrupted HRRP data. Experimental results on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset show that our proposed model outperforms other traditional methods, which indicates that ARTRBM extracts, selects, and utilizes the correlation information between adjacent HRRPs effectively and is suitable for high-dimensional or noise-corrupted data.


Introduction
A high-resolution range profile (HRRP) is the amplitude of the coherent summation of the complex time returns from target scatterers in each range cell, which represents the projection of the complex returned echoes from the target scattering centers onto the radar line-of-sight (LOS) [1]. HRRP recognition has been studied for decades in the field of RATR because an HRRP contains important structural information such as the target size and the distribution of scattering points [1][2][3][4]. In addition, HRRPs are easy to obtain, store, and process. A large number of scholars have conducted extensive research on the HRRP recognition problem [1,[5][6][7]. The reported methods can be summarized as extracting features from HRRPs after dividing the full range of target radar aspect angles into several frames and performing target detection to select the region of interest in an HRRP. The difference between these methods lies in the feature extraction. Common feature extraction techniques include HRRP templates, HRRP stochastic modeling, time-frequency transform features, and invariant features [8,9]. These feature extraction techniques all have clear physical meaning and are conducive to practical adoption.

The proposed model is evaluated on the SAR data of the MSTAR dataset [37]. Experimental results indicate the superior performance of the proposed model against the HMM, Class RBM, and Principal Component Analysis (PCA). Additionally, the proposed model can still achieve a satisfactory accuracy when the intensity of noise is lower than −15 dB, which confirms its strong robustness to noise. This paper is organized as follows. In Section 2, the RBM and RTRBM are briefly introduced as preparation for the proposed method. In Section 3, the proposed model for sequential HRRP recognition is presented in detail, followed by the training method for the proposed model in Section 4. After that, several experiments on the MSTAR dataset are performed to evaluate our model in Section 5. Lastly, we conclude our work in Section 6.

Preliminaries
In this section, we will go over the salient properties of the Restricted Boltzmann Machine (RBM) briefly and then give preliminaries about Recurrent Temporal Restricted Boltzmann Machine (RTRBM), which is a temporal extension of RBMs.

Restricted Boltzmann Machine
The RBM is an undirected graphical model that uses a layer of hidden variables h = [h_1, h_2, · · · , h_M] to model a joint distribution over the visible variables v = [v_1, v_2, · · · , v_N] [16]. The graphical depiction of the RBM is given in Figure 1. The two layers are fully connected to each other by a weight matrix W, but there are no connections between units within the same layer [28,38]. For HRRP-based RATR, the visible units can hold an HRRP sample and the hidden layer can be used to extract features. The RBM defines the joint distribution over the visible units v and hidden units h as [24]

P(v, h) = \frac{1}{Z} \exp[-E(v, h)],

where Z = \sum_{v} \sum_{h} \exp[-E(v, h)] is the partition function, obtained by summing over all possible pairs of visible and hidden vectors. Additionally, E is the energy function

E(v, h) = -h^{T} W v - b^{T} v - c^{T} h,

where Θ = {W, b, c} consists of the model parameters: W ∈ R^{M×N} is the weight matrix connecting the visible and hidden vectors, and b ∈ R^{N} and c ∈ R^{M} are the biases of the visible and hidden layers, respectively.
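As an illustration of the energy function and the factorized conditionals above, the following is a minimal NumPy sketch with toy dimensions and randomly initialized (untrained) parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_energy(v, h, W, b, c):
    # E(v, h) = -h^T W v - b^T v - c^T h
    return -(h @ W @ v) - (b @ v) - (c @ h)

def p_h_given_v(v, W, c):
    # P(h_i = 1 | v) = sigma(sum_j W_ij v_j + c_i)
    return sigmoid(W @ v + c)

def p_v_given_h(h, W, b):
    # P(v_j = 1 | h) = sigma(sum_i W_ij h_i + b_j)
    return sigmoid(W.T @ h + b)

rng = np.random.default_rng(0)
N, M = 6, 4                           # toy visible / hidden sizes
W = 0.1 * rng.standard_normal((M, N)) # W in R^{M x N}, as defined above
b, c = np.zeros(N), np.zeros(M)
v = rng.integers(0, 2, N).astype(float)
h = rng.integers(0, 2, M).astype(float)
print(rbm_energy(v, h, W, b, c))
print(p_h_given_v(v, W, c))
```

Because the conditionals factorize over units, each layer can be sampled in one shot given the other, which is what makes block Gibbs sampling cheap for the RBM.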

Recurrent Temporal Restricted Boltzmann Machine
The Recurrent Temporal Restricted Boltzmann Machine is a generative model for high-dimensional sequences, constructed by unrolling multiple RBMs over time. In detail, the RBM at time step t is connected to the one at time step t − 1 through the weight matrix W_hh and is conditioned on it. This dependency on ĥ(t−1) is the major difference compared with the RBM. It is worth noting that this horizontal deep architecture is different from Deep Belief Networks (DBNs), which stack RBMs vertically [39]. Therefore, the RTRBM can extract more sequence information and performs better in many application scenarios such as radar HRRP target recognition.

The graphical model of the RTRBM is illustrated in Figure 2. The model has five parameters {W, W_hh, ĥ(0), b, c}. Here, W is the weight matrix between the visible and hidden layers of the RBM at each time frame, W_hh stands for the directed weights connecting the hidden layers at time steps t − 1 and t, and ĥ(0) is the vector of initial mean-field values of the hidden units. The motivation for this choice of ĥ(t) is that, using the RBM associated with time instant t, E(h(t) | v(t)) = ĥ(t); i.e., it is the expected value of the hidden-unit vector. In addition, b(t) and c(t) are the biases of the visible and hidden layers, respectively. In the RTRBM, the RBM at time frame t is conditioned on the one at time step t − 1 through a set of time-dependent model parameters, such as the visible and hidden layer biases b(t) and c(t), which depend on ĥ(t−1) [40].

Meanwhile, ĥ(t) is the mean-field value of h(t), which is computed recursively as

\hat{h}^{(t)} = \sigma\left(W v^{(t)} + W_{hh} \hat{h}^{(t-1)} + c\right).

Given the hidden input ĥ(t−1) (t > 1), the conditional distributions factorize and take the form

P\left(h_i^{(t)} = 1 \mid v^{(t)}, \hat{h}^{(t-1)}\right) = \sigma\left(\sum_{j} W_{ij} v_j^{(t)} + c_i + \sum_{k} (W_{hh})_{ik} \hat{h}_k^{(t-1)}\right).
Therefore, the joint probability distribution of the visible and hidden units of the RTRBM with length T takes the form [21]

P\left(v^{(1:T)}, h^{(1:T)}\right) = \prod_{t=1}^{T} \frac{1}{Z\left(\hat{h}^{(t-1)}\right)} \exp\left[-E\left(v^{(t)}, h^{(t)}; \hat{h}^{(t-1)}\right)\right],

where Z(ĥ(t−1)) denotes the normalization factor of the RBM at time step t and E(v(t), h(t); ĥ(t−1)) is the energy function at time step t, which is defined by

E\left(v^{(t)}, h^{(t)}; \hat{h}^{(t-1)}\right) = -h^{(t)T} W v^{(t)} - b^{T} v^{(t)} - c^{T} h^{(t)} - h^{(t)T} W_{hh} \hat{h}^{(t-1)}.
Furthermore, given the hidden inputsĥ (1) ,ĥ (2) , · · · ,ĥ (T) , all the RBMs are decoupled. Therefore, sampling can be performed using block Gibbs sampling for each RBM independently. This fact is useful in deriving the CD algorithm, which is a stochastic approximation and utilizes a few Gibbs sampling steps to estimate the gradient of parameters [18,41].
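The mean-field recursion and the factorized conditional above can be sketched as follows. This is a toy sketch with randomly initialized parameters and illustrative shapes, not the trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rtrbm_mean_field(V, W, W_hh, c, h0):
    """Compute h_hat(t) = sigma(W v(t) + W_hh h_hat(t-1) + c) for t = 1..T."""
    h_prev = h0
    h_hats = []
    for v_t in V:                       # V has shape (T, N)
        h_prev = sigmoid(W @ v_t + W_hh @ h_prev + c)
        h_hats.append(h_prev)
    return np.stack(h_hats)             # shape (T, M)

def sample_h_given_v(v_t, h_hat_prev, W, W_hh, c, rng):
    # P(h_i(t) = 1 | v(t), h_hat(t-1)) factorizes over hidden units
    p = sigmoid(W @ v_t + W_hh @ h_hat_prev + c)
    return (rng.random(p.shape) < p).astype(float)

rng = np.random.default_rng(1)
T, N, M = 5, 8, 4                       # toy sequence length and layer sizes
V = rng.integers(0, 2, (T, N)).astype(float)
W = 0.1 * rng.standard_normal((M, N))
W_hh = 0.1 * rng.standard_normal((M, M))
c = np.zeros(M)
h0 = np.full(M, 0.5)                    # initial mean-field values h_hat(0)
H = rtrbm_mean_field(V, W, W_hh, c, h0)
print(H.shape)                          # (5, 4)
```

Because all RBMs decouple once the ĥ(t) are fixed, `sample_h_given_v` can be run independently per time step, which is exactly what the block Gibbs sampling in the CD procedure exploits.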

The Proposed Model
Based on the original RTRBM, the newly proposed model brings the idea of the attention mechanism, which is named Attention based RTRBM. The graphical structure of the proposed model is demonstrated in Figure 3. In the proposed model, RTRBM is utilized to extract features from the input data and store the extracted features in the hidden vector. A new hidden layer s is introduced to RTRBM by the weighted sum in all hidden layers for the reason of measuring the role of each hidden vector in recognition tasks and then the new hidden layer is used for classification.
In the context of radar HRRP recognition, the input data v = [v_1, v_2, · · · , v_N] is the raw HRRP sequence and the output y is a sequence of class labels. Each feature vector is extracted by the RTRBM, which is treated as an encoder to form a sequential representation. The upper half of Figure 3 represents the attention mechanism in the ARTRBM model. The fundamental principle of the attention mechanism is that the classifier pays more attention to the major part of the extracted feature vectors rather than to all of them.
As is shown in Figure 3, α t stands for the weight coefficient for the hidden layer at time step t. The layer s is determined by the hidden layer of each time step and W ys corresponds to the weight matrix, which connects the layer s and output layer y. Additionally, y is a vector representing the class label in which all values are set to 0 except at the position corresponding to a label y, which is set to 1.
To describe the process of our model in detail, the flowchart of ARTRBM is shown below.
As shown in Figure 4, the basic process of the attention mechanism can be summarized in three steps. First, the feature energies e_j and weight coefficients α_j, which represent the contribution of the extracted feature vectors to recognition, are computed. Afterward, the final hidden layer s, which is determined by the hidden layers of all time steps, is constructed. Finally, the layer s is used for the final classification task.
In the attention mechanism, the final feature vector s is obtained by the weighted summation of the hidden layers over time:

s = \sum_{j=1}^{T} \alpha_j \hat{h}^{(j)}, \qquad (8)
where the weight coefficient α_j can be defined as

\alpha_j = \frac{\exp(e_j)}{\sum_{k=1}^{T} \exp(e_k)}, \qquad (9)

where e_j = V_a tanh(W_a ĥ(j)) corresponds to the hidden-layer energy at time frame j. The weight coefficient α_j represents the role of the hidden-layer feature ĥ(j) in recognition, and the attention mechanism [30,41,42] is determined by it. By training the parameters V_a and W_a, the model can assign the hidden layer ĥ(j) a different weight at each moment, which makes the model focus on the parts that play a major role in the recognition task.
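The three attention steps (energy, softmax weights, weighted sum) can be sketched in NumPy as follows. The shapes and parameter values are toy assumptions for illustration:

```python
import numpy as np

def attention_pool(H, W_a, V_a):
    """e_j = V_a tanh(W_a h(j)); alpha = softmax(e); s = sum_j alpha_j h(j)."""
    E = np.tanh(H @ W_a.T) @ V_a        # energies e_j, shape (T,)
    E = E - E.max()                     # shift for numerical stability
    alpha = np.exp(E) / np.exp(E).sum() # softmax weights, they sum to 1
    s = alpha @ H                       # weighted sum of the hidden vectors
    return s, alpha

rng = np.random.default_rng(2)
T, M, A = 15, 4, 3                      # toy sequence length and layer sizes
H = rng.random((T, M))                  # stand-ins for h_hat(1..T)
W_a = 0.1 * rng.standard_normal((A, M))
V_a = 0.1 * rng.standard_normal(A)
s, alpha = attention_pool(H, W_a, V_a)
print(alpha)                            # weights over time steps, sum to ~1
print(s)                                # final feature vector s
```

Since the weights form a probability distribution over time steps, `s` stays in the same range as the hidden vectors, and inspecting `alpha` shows which time frames the classifier attends to.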

Learning the Parameters of the Model
In the proposed model, the RTRBM plays the role of the encoder, which describes the joint probability distribution p(v(1:T), h(1:T); ĥ(t−1)). According to Equations (3) and (7), the energy function can be computed as shown below.
In order to learn the parameters, we first need the partial derivatives of log P(v_1, v_2, · · · , v_T) with respect to the parameters. We use the CD approximation [15,17] to compute these derivatives, which requires the gradients of the energy function (10) with respect to all the model parameters. We then separate the energy function into two terms, H and Q_2, so that the gradient of E with respect to the parameters splits into two parts. It is straightforward to calculate the gradient ∂H/∂Θ, while calculating ∂Q_2/∂Θ is more complex. To compute ∂Q_2/∂Θ, we first compute ∂Q_2/∂ĥ(t), which can be obtained recursively using the back-propagation-through-time (BPTT) algorithm (Rumelhart, Hinton et al., 1986) and the chain rule. The model parameters Θ can then be updated via gradient ascent, which is shown in the equation below.

The gradient ∂E/∂Θ splits accordingly, and ⟨∂H/∂Θ⟩ represents the ensemble mean of the gradient ∂H/∂Θ under the model distribution, which can be expressed using the equation below.
Therefore, Equation (12) can be derived; the specific forms of ∂H/∂Θ and ∂Q_2/∂Θ are given in Appendix A.

We extract the features from the input data with the RTRBM model, which are stored in ĥ(j) at every time step. We then use ĥ(j) as the input to the attention mechanism and compute the final hidden layer s using Equation (8). To learn the parameters of the attention mechanism, we need an appropriate objective function. Here we use a close variant of perplexity known as the cross entropy, which represents the divergence between the entropy calculated from the predicted distribution and that of the correct prediction label (and can be interpreted as the distance between these two distributions). It is computed over all the units of the layer s as

f_{Cross}(\theta, D_{train}) = -\sum_{n} \ln p(y_n \mid s_n), \qquad (15)

where D_train = {(s_n, y_n)} is the set of training examples, n is the index of the training sample, s_n = (s_n^1, s_n^2, · · · , s_n^T) is the final hidden layer, and y_n = (y_n^1, y_n^2, · · · , y_n^T) corresponds to the target labels. By substituting Equations (8) and (9) into the objective function (15), the gradient ∂f_Cross(θ, D_train)/∂θ can be calculated, where y and ŷ denote the correct label and the output label, respectively, W_ys is the weight matrix that connects the layer s and the label vector y, and the logistic function σ(x) = (1 + exp(−x))^{−1} is applied to each element of its argument vector. Therefore, the gradients ∂F(y_n | s_n)/∂θ can be computed exactly. The brief derivation and results are shown in Appendix B.
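To make the classification objective concrete, the following sketch pairs a softmax output layer with the cross-entropy loss and its gradient with respect to W_ys. The dimensions, the softmax output layer, and the single update step are illustrative assumptions, not the paper's exact training loop:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(s, y_onehot, W_ys):
    """f = -ln p(y | s), with p(y | s) = softmax(W_ys s)."""
    p = softmax(W_ys @ s)
    return -np.log(p[y_onehot.argmax()])

def grad_W_ys(s, y_onehot, W_ys):
    # For softmax + cross entropy, df/dW_ys = (p - y) s^T
    p = softmax(W_ys @ s)
    return np.outer(p - y_onehot, s)

rng = np.random.default_rng(3)
K, M = 3, 4                             # 3 target classes, toy feature size
s = rng.random(M)                       # stand-in for the attention output s
y = np.eye(K)[0]                        # one-hot class label
W_ys = 0.1 * rng.standard_normal((K, M))
before = cross_entropy(s, y, W_ys)
W_ys = W_ys - 0.1 * grad_W_ys(s, y, W_ys)  # one gradient-descent step
after = cross_entropy(s, y, W_ys)
print(before, after)                    # the loss decreases after the step
```

One small step along the negative gradient lowers the loss here because the cross entropy is smooth and convex in W_ys for a fixed s.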
The pseudo code of the model parameter update for the proposed model is summarized in Algorithm 1, which is shown below.

Experiments
In order to evaluate the proposed recognition model, several experiments on the MSTAR dataset are presented. First, the arrangement of the training and testing HRRP sequences is introduced in Section 5.1. Afterward, two experiments with different purposes are presented in Section 5.2. The first experiment compares the performance of our proposed model with several comparative models, and the second tests the recognition ability of our model under different noise intensities.


The Dataset
In order to compare our results with those in other papers more easily, the publicly available MSTAR (Moving and Stationary Target Acquisition and Recognition) dataset, which has been widely used in related research, was chosen for our experiments [12]. MSTAR is funded by DARPA and is the standard dataset for SAR automatic target recognition algorithms. In more detail, the MSTAR dataset includes 10 kinds of target data (X band) under different azimuth angles, and we chose the three most similar targets for the experiment: the T72 main battle tank, the BMP2 armored personnel carrier, and the BTR70 armored personnel carrier. In order to make the MSTAR dataset suitable for our model, we first transformed each two-dimensional SAR image into a one-dimensional HRRP vector to train our proposed model. The HRRPs of the three targets are shown in Figure 5. All three classes of targets cover 0 to 360 degrees of aspect angle and their range and azimuth resolutions are 0.3 m [43,44]. In the dataset, each target is imaged at depression angles of 15° and 17°. The HRRPs at a 17° depression angle were used as the training data while those at 15° were used as the test data. The sizes of the training and testing datasets are briefly illustrated in Table 1.

We divided the 360° of aspect angles into 50 aspect frames uniformly, so each frame covers 7.2°. In each frame, an HRRP is sampled at intervals of 0.1°; therefore, each frame contains 72 HRRPs. The composition of the sequential HRRP datasets is shown in Figure 6. To make the process clearer, suppose that each HRRP sequence contains L (L ≥ T) HRRPs; the steps to construct the sequential HRRPs are shown in Algorithm 2 [45].

Algorithm 2.
The composition of the sequential HRRP datasets.
Step 1: Start from the aspect frame 1 to L. The first HRRPs in frame 1 to L are chosen to form the first HRRP sequence with length L. Slide one HRRP to the right and the second HRRPs in aspect frame 1 to L are chosen to form the second HRRP sequence. Repeat this algorithm until the end of each frame.
Step 2: Slide one frame to the right and repeat step 1 to construct the following sequences.
Step 3: Repeat step 2 until the end of all aspect frames. If the remaining frame is less than L, then the first L − 1 frames are cyclically used one by one to form the remaining sequences.
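Steps 1 and 2 of the procedure above amount to a sliding-window index computation, which can be sketched as follows. The frame and HRRP counts are toy values, and the cyclic tail handling of Step 3 is omitted for brevity:

```python
def build_sequences(frames, L):
    """frames: list of aspect frames, each a list of HRRPs (e.g. 72 per frame).
    A sequence takes the k-th HRRP from L consecutive frames (Steps 1-2)."""
    n_per_frame = len(frames[0])
    sequences = []
    for start in range(len(frames) - L + 1):   # Step 2: slide one frame right
        for k in range(n_per_frame):           # Step 1: slide one HRRP right
            sequences.append([frames[start + f][k] for f in range(L)])
    return sequences

# Toy data: 5 frames of 3 "HRRPs" each (string labels instead of real profiles)
frames = [[f"f{i}h{k}" for k in range(3)] for i in range(5)]
seqs = build_sequences(frames, L=3)
print(len(seqs))        # (5 - 3 + 1) * 3 = 9 sequences
print(seqs[0])          # ['f0h0', 'f1h0', 'f2h0']
```

Each sequence therefore spans L adjacent aspect frames at the same within-frame position, which is what gives adjacent HRRPs in a sequence their temporal and spatial correlation.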
In many studies, the clutter is removed to get "clean" HRRPs. We directly used the raw HRRPs; the only preprocessing was normalizing the magnitude of each HRRP by its total energy. This setting makes the experiments closer to real recognition scenarios.
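The energy normalization step can be sketched as below, assuming "total energy" means the sum of squared amplitudes, so that each normalized profile has unit energy:

```python
import numpy as np

def energy_normalize(x, eps=1e-12):
    # Divide by the square root of the total energy sum_i x_i^2,
    # so the normalized profile has unit energy.
    return x / (np.sqrt(np.sum(x ** 2)) + eps)

x = np.array([3.0, 4.0])                 # toy 2-cell "profile"
print(energy_normalize(x))               # [0.6 0.8]
print(np.sum(energy_normalize(x) ** 2))  # ~1.0
```

This removes overall amplitude scale differences between profiles while preserving their shape.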


Experiment 1: Investigating the Influence of Hidden Layer Size on Recognition Performance
In this experiment, we will investigate the influence of the size of the hidden layer on recognition performance. In order to explore this problem, two groups of contrastive experiments were organized for different purposes. The first group is aimed at comparing the performance of the Attention-based RTRBM model with contrast models on different hidden layer sizes while the second is to investigate whether the attention mechanism really works and how much effect it has on performance.
Before conducting the experiments, we first analyzed the influence of the sequence length T of the RTRBM model. Table 2 shows that stable test accuracy can be achieved when T is increased beyond 15. In addition, the recognition rate can be further improved by adding hidden units. Therefore, to seek a balance between recognition accuracy and computational complexity, T = 15 is adopted for the recognition task. The results are shown in Figure 7, where the test accuracy is computed by averaging the test results of the three targets. It can be seen in Figure 7 that Attention-based RTRBM achieves superior recognition performance against the other two models. Additionally, our proposed model achieves optimal recognition accuracy at each hidden layer size, which shows its strong ability to deal with high-dimensional sequences. The explanation for this result is that the proposed model can extract more separable features through the RTRBM and make better use of them with the attention mechanism. Class RBM with averaged HRRPs does not perform as well as the other two models, but reaches a satisfactory recognition accuracy when the number of hidden nodes is increased to 384, which reflects that Class RBM needs more hidden units to reach high recognition accuracy.
We designed another baseline using PCA to reduce the dimensionality of the input data: 15 features are retained after PCA and the classifier is a Support Vector Machine (SVM). We repeated this baseline five times, and the average test accuracy is 91.22%. Since the PCA+SVM contrast experiment does not contain hidden units, we mark its result at 512 hidden units in Figure 7 so that it can be compared with the best results of the other methods. Additionally, the test performance of the HMM model is lower than 80% when the sequence length is 15, as reported in Reference [12]; similarly, we mark the HMM result at 512 hidden units in Figure 7 to compare it with the best results of the other methods. We can then conclude from Figure 7 that the correlation matrix between adjacent hidden layers helps the RTRBM extract more discriminative features and the weight coefficients let the attention mechanism select more separable features, which means that ARTRBM is more suitable for the radar HRRP sequence recognition task.
To gain insight into the performance of the three methods on different targets, we list the confusion matrices for the three targets in Table 3. The number of hidden units for all methods is 384. As shown in Table 3, the misclassification of BMP2 lowers the average accuracy. One possible reason is that the features learned by the three models are not discriminative enough to recognize the true targets; another is that we train the models on only one variant of BMP2 (Sn_C9563) but test them on three variants, and the three variants of BMP2 (shown in Figure 8) have a low similarity to each other, lower than that among the three variants of T72. Nevertheless, our proposed model still achieves higher accuracy than the two contrast models on the classification of BMP2, which indicates that Attention-based RTRBM is a better choice when there is a great difference between the training and testing datasets.
In the second group of contrastive experiments, we designed several comparison methods without the attention mechanism; the purpose is to investigate the impact of the attention mechanism on recognition performance. The feature information extracted by the RTRBM is contained in the hidden layers, expressed as ĥ(1), ĥ(2), ⋯, ĥ(T). We use the features of the first, middle, and last time frames and the average over all time frames as input data, respectively, and classify each with a Single-Layer Perceptron (SLP).
As shown in Figure 9, the proposed model achieves higher recognition accuracy than the other four methods at all hidden layer sizes. This result indicates that the attention mechanism can select discriminative features more efficiently than methods that use the average ĥ(t) or any single ĥ(t). It is worth noting that choosing the average ĥ(t) performs better than the other three contrastive experiments. In addition, as the time step t increases, the RTRBM+SLP models perform better. This is not surprising, since a later ĥ(t) contains more temporal and spatial correlation information accumulated through the correlation matrix W_hh. However, even the RTRBM+SLP model using ĥ(T) still performs worse than our proposed model. Therefore, the attention mechanism greatly contributes to the recognition performance.

Experiment 2: Investigating the Influence of SNR on Recognition Performance
For applications in real scenarios, white Gaussian noise at different Signal-to-Noise Ratios (SNRs), increasing from −10 dB to 30 dB, was added to the testing data to investigate the robustness of the proposed model. The test data with different SNRs are shown in Figure 10, where white Gaussian noise of each SNR is superimposed on the test HRRP sequence. Each row in the figure represents the index of the range cell while each column shows the index of the testing sample. We use T72 as an example, which contains 5820 HRRP samples.
In this example, we trained the ARTRBM using the HRRP sequence with T = 15 and 384 hidden units. We chose the Class RBM with 384 hidden units as one contrast experiment, with the data input method that connects 15 HRRPs end to end, which performs better than all other contrastive experiments in Experiment 1. Another contrast experiment uses PCA to reduce the dimension of the input data to 15, with the Support Vector Machine (SVM) as the classifier. Figure 11 shows the recognition performance of the three models at different SNRs. It is obvious that our proposed model achieves better performance than the other two models at all SNR levels, and it gains an advantage of more than 10% over the other two models at −10 dB. Additionally, the testing accuracy stays stable at a high level, near the average accuracy in Table 2 (0.9488), when the SNR is higher than 15 dB, which reflects that our proposed model has a certain anti-noise ability. The accuracy of the proposed model decreases to about 65% as the SNR decreases, whereas this number is less than 55% for the Class RBM. This result shows the strong anti-noise ability of ARTRBM. Considering the working environment of a radar system, the training samples are often corrupted by noise, so the model we propose is a better choice for the HRRP sequence recognition task.
Figure 11. Recognition performance of models tested with different SNRs.
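The noise-corruption step in this experiment can be sketched as follows. The paper does not spell out its normalization, so the convention below (noise power set relative to the measured signal power) is an assumption, and the function name is ours.

```python
import math
import random

# Sketch of corrupting a test HRRP with additive white Gaussian noise at
# a target SNR in dB, as in Experiment 2. The noise variance is chosen so
# that signal_power / noise_power = 10^(snr_db / 10); this normalization
# is an assumed convention, not taken from the paper.

def add_awgn(signal, snr_db, rng=random):
    power = sum(x * x for x in signal) / len(signal)
    noise_power = power / (10 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]
```

Sweeping `snr_db` from −10 to 30 in steps reproduces the kind of test-set corruption shown in Figure 10.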

Conclusions
In this paper, an attention-based RTRBM is proposed for target recognition based on the HRRP sequence. Compared with the reported methods, the proposed method has some compelling advantages. First, it introduces a correlation matrix between the hidden layers to extract more correlation information, which makes the extracted features hold both the previous and the current information. Second, it efficiently deals with high dimensional sequential data and performs better than the Class RBM using two different data input methods. Third, it effectively chooses and utilizes the important parts of the extracted features, outperforming the RTRBM+SLP models that use different input features. Additionally, the proposed model performs well in the case of strong noise, which indicates strong robustness to noise. In the near future, to better solve the problem of sequential HRRP recognition, we plan to combine other deeper models with an attention mechanism as a classifier for RTRBM or other sequential feature extraction models. Furthermore, in order to make the model more applicable to real scenarios, we will conduct related experiments in the cases of different waveforms and pulse recurrence intervals (PRIs), and in the case of the training and testing phases at different angular sampling rates. Finally, we will attempt to develop a model that sets the length of the attention mechanism adaptively, so that T does not need to be set by experience, which may achieve better performance.
Author Contributions: X.G. and Y.Z. conceived and designed the experiments. X.P. contributed the MSTAR dataset. Y.Z. performed the experiments. Y.Z. and X.P. analyzed the data. Y.Z. and J.Y. wrote the paper. X.L. supervised this paper.
