A Neural Network with Convolutional Module and Residual Structure for Radar Target Recognition Based on High-Resolution Range Profile

In conventional neural networks, a deep architecture is required to achieve high recognition accuracy. However, deepening the network may cause saturation, wherein the recognition accuracy decreases as the number of network layers increases. To tackle this problem, a neural network model incorporating a micro convolutional module and a residual structure is proposed. The model has few hyper-parameters and can be extended flexibly. In addition, to further enhance the separability of features, a novel loss function integrating boundary constraints and center clustering is proposed. Experimental results on a simulated dataset of HRRP signals obtained from thirteen 3D CAD object models show that the presented model achieves higher recognition accuracy and robustness than other common network structures.


Introduction
The high-resolution range profile (HRRP) of a target is the projection of the target's scattering centers along the radar line of sight, and it carries numerous target characteristics (e.g., size and structure). HRRP can be acquired, processed and stored easily, and it requires little computation and supports real-time processing. For this reason, it has long been a critical data source for target recognition. Researchers are able to extract separable features from HRRP to classify and identify a range of targets. Previous HRRP-based radar target classification and recognition placed primary emphasis on feature extraction based on researchers' prior knowledge and experience, as well as on the optimization and fusion of classification algorithms. Common features include time domain characteristics [1,2] (e.g., the original image, central moments, structure contour, and strong scattering points), frequency domain characteristics [3,4] (e.g., the power spectrum), and polarization domain characteristics [5,6] (e.g., the polarization ratio and polarization matrix).
Fueled by advances in computer technology and deep learning theory, deep learning has become a research hotspot in various fields [7][8][9][10]. It has been extensively employed in radar target detection, recognition, and classification. At the same time, HRRP and CNN also have significant applications in the field of unmanned aerial vehicles and unmanned surface vehicles [11,12]. Deep learning-based object recognition relies on a neural network for feature extraction, and HRRP-based radar recognition can be achieved in the same way. This field has attracted a great deal of attention from researchers, and considerable new achievements have been made, which are presented below. There are many methods for enhancing the recognition accuracy of neural networks, such as improving the structure of the neural network, optimizing the loss function, and increasing the training data. To be specific, the neural network structures used for HRRP target recognition mainly include the autoencoder (AE) and the convolutional neural network (CNN). Most methods that optimize the loss function to enhance the recognition effect are aimed at face recognition, and many of them can be applied to HRRP target recognition.
Designing a good neural network structure is one of the most efficient and challenging approaches to enhancing classification performance. Given sufficient data, the learning ability of the model can be enhanced by increasing the depth and width of the neural network. AlexNet [25] and VGG [26] have both demonstrated that model recognition accuracy is positively correlated with network depth within a certain range. Nevertheless, as the network depth increases, gradient explosion, gradient vanishing, and saturation of recognition accuracy may occur during back propagation in CNN training. By introducing a residual learning framework, He et al. [27] addressed the degradation problem, in which the accuracy saturates and then degrades rapidly as the network depth rises. However, to enhance the recognition effect, the residual learning framework still requires further increases in network depth.
In this study, an efficient and extensible convolutional module is presented by optimizing the residual learning framework. The convolutional module contains a left and a right branch. The structure of the left branch simulates the effect of deepening and widening the network. The skip structure of the right branch transfers features and gradients more effectively. The convolutional module is capable of achieving the recognition effect of a deep network with fewer network parameters. Additionally, a novel loss function is proposed to enhance the recognition accuracy by combining central clustering with an additive margin strategy. The features extracted under the novel loss function are characterized by larger inter-class variations, smaller intra-class variations, and stronger separability. Moreover, by stacking convolutional modules that share the same topology, the presented model can be extended to classification tasks of varying difficulty. According to the experimental results with a simulated dataset of HRRP, the presented model achieves higher recognition accuracy than conventional algorithms. The rest of this paper is organized as follows. Section 2 presents the composition and structure of the one-dimensional convolutional network. The design of the convolutional module and the loss function is elucidated in Section 3. The experimental performance of the model is demonstrated in Section 4 from different aspects. Lastly, concluding remarks are given in Section 5.

One-Dimensional Convolutional Neural Network
CNN refers to a type of feedforward neural network that involves convolution calculation. Owing to its translation invariance in the calculation process, it is capable of avoiding complex preprocessing (e.g., HRRP data alignment) and exhibits higher robustness. The model employed in this study is a convolutional neural network. First, the basic structure of CNN is introduced, which covers five parts, namely, the input layer, convolutional layer, pooling layer, fully connected layer, and output layer. The CNN structure for HRRP is illustrated in Figure 1. The input layer acts as the start of the neural network and generally requires simple preprocessing so that the data have identical dimensions and satisfy the same distribution characteristics. Preprocessing reduces the effect of amplitude perturbation on the features extracted from different HRRP data and enhances the robustness of the model. It also makes it easier for the gradient descent method to find the minimum during iteration, so the model can converge faster. It can be performed in the two steps below:
1. Normalize the amplitude of HRRP. The data after amplitude normalization of the nth HRRP is expressed as $\bar{x}_n = x_n / \max(|x_n|)$, where $\max(|x_n|)$ denotes the maximum absolute value of all elements in the HRRP.
2. Subtract the mean value of the normalized HRRP data from the respective element.
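The two preprocessing steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `preprocess_hrrp` is hypothetical, the mean is assumed to be taken over each profile individually, and the handling of an all-zero profile is an added convention.

```python
import numpy as np

def preprocess_hrrp(x):
    """Amplitude-normalize one HRRP, then subtract its mean (per-profile mean assumed)."""
    x = np.asarray(x, dtype=float)
    peak = np.max(np.abs(x))
    if peak == 0:
        # All-zero profile: nothing to scale (assumed convention, not from the paper).
        return x.copy()
    x_norm = x / peak                 # step 1: x_n / max(|x_n|)
    return x_norm - x_norm.mean()     # step 2: remove the mean of the normalized data

profile = np.array([0.0, 2.0, 4.0])
print(preprocess_hrrp(profile))
```

After step 1 the largest absolute amplitude is 1, and after step 2 the profile has zero mean, which is what makes profiles of different echo strength comparable at the network input.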
The major function of the convolutional layer is to extract the features of the input data. In Figure 1, the first convolutional layer contains 16 convolution kernels, and the second convolutional layer contains 32 convolution kernels. Each convolution kernel is composed of weight coefficients and a bias. In deep learning, the weight initialization method of the neural network plays an important role in the convergence speed and performance of the model. Common weight initialization methods include random initialization, Xavier initialization [28], and He initialization [29]. Random initialization may cause gradient vanishing when the neural network is deep. To solve this problem, the Xavier initialization method was proposed. When used in conjunction with the Tanh activation function, the Xavier initialization method makes the output of the activation function of each network layer obey a Gaussian distribution, so gradient vanishing is avoided. However, when it is used with the Relu activation function, the problem of gradient vanishing still exists. The He initialization method proposed in [29] solves the problem of gradient vanishing when the Relu activation function is used. The convolution kernel convolves the input data, adds the bias, and then applies the activation function. The output of the convolutional layer is the extracted feature. The calculation process can be written as:

$$x_j^l = f\left(\sum_i x_i^{l-1} * k_{ij}^l + b_j^l\right)$$

where $x_j^l$ denotes the output of the jth channel of the lth convolutional layer, and $x_i^{l-1}$ is the ith input vector of that layer. $f(\cdot)$ refers to the activation function, for which the Relu function is employed. $k_{ij}^l$ is the convolution kernel vector of the jth channel of convolutional layer l that corresponds to the ith input vector. $b_j^l$ is the bias of the jth channel of convolutional layer l, and $*$ represents the convolution operation.
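As a concrete illustration of one convolutional layer, the sketch below assumes the standard form in which output channel j is the Relu of the sum over input channels of (input channel i convolved with kernel k_ij) plus a bias b_j. The function name `conv1d_layer`, the array layout, and the use of "valid" convolution without padding or stride are assumptions made for this example.

```python
import numpy as np

def conv1d_layer(x, k, b):
    """One 1-D convolutional layer with Relu activation.

    x: (C_in, L) input channels, k: (C_in, C_out, K) kernels, b: (C_out,) biases.
    Returns an array of shape (C_out, L - K + 1).
    """
    c_in, c_out, _ = k.shape
    outputs = []
    for j in range(c_out):
        # Sum the convolutions of every input channel with its kernel for channel j.
        acc = sum(np.convolve(x[i], k[i, j], mode="valid") for i in range(c_in))
        outputs.append(np.maximum(acc + b[j], 0.0))  # add bias, then Relu
    return np.stack(outputs)

x = np.array([[1.0, 2.0, 3.0, 4.0]])      # one input channel of length 4
k = np.array([[[1.0, 0.0, -1.0]]])        # one kernel of size 3
print(conv1d_layer(x, k, np.array([0.0])))
```

Note that `np.convolve` performs a true convolution (the kernel is flipped), whereas most deep learning frameworks implement cross-correlation; for learned kernels the distinction does not affect what the layer can represent.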
The parameters of the convolutional layer consist of convolution kernel size, step size, filling category, as well as activation function. The common activation functions cover the Sigmoid function, the Relu function, etc. For various parameters, the convolutional layer exhibits different characteristics.
The pooling layer selects among the features extracted by the convolutional layer and reduces their dimension by down-sampling. Max-pooling, mean-pooling and mix-pooling are common pooling methods.
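A short sketch of max-pooling and mean-pooling on a one-dimensional feature follows. The helper name `pool1d` is hypothetical, and non-overlapping windows (stride equal to the window size, with any trailing remainder dropped) are assumed.

```python
import numpy as np

def pool1d(x, size=2, mode="max"):
    """Down-sample the last axis with non-overlapping windows of length `size`."""
    x = np.asarray(x, dtype=float)
    n = (x.shape[-1] // size) * size          # drop a trailing remainder, if any
    windows = x[..., :n].reshape(*x.shape[:-1], n // size, size)
    if mode == "max":
        return windows.max(axis=-1)
    return windows.mean(axis=-1)

feature = [1.0, 3.0, 2.0, 8.0, 5.0]
print(pool1d(feature, mode="max"))    # keeps the strongest response per window
print(pool1d(feature, mode="mean"))   # averages each window
```

With a window of 2, the feature length is halved, which is the dimension reduction described above.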
On the whole, the fully connected layer is placed at the end of the neural network. Its major function is to arrange the features extracted from the previous layer into a one-dimensional vector. The whole CNN outputs target-related outcomes through the output layer classifier. The common classifiers are softmax and SVM. In the task of target recognition, the output of CNN can cover the category, size and central coordinates of the target. The learning process of CNN usually updates parameters iteratively by back propagation, and stable identification results are obtained by minimizing the error calculated by the loss function.

Design of Convolutional Module
The depth of neural networks is critical. The deep convolutional neural network is capable of extracting and fusing features of different levels for end-to-end target recognition. Nevertheless, the deepening of network layers will cause saturated recognition accuracy. To address this problem, residual structure is introduced, as illustrated in Figure 2.

The residual block in the residual structure consists of convolutional layers, and the number of convolutional layers in Figure 2 is 2. The residual structure outputs the sum of the input feature and the output of the last convolutional layer, which is expressed by

$$x_{l+1} = F(x_l) + x_l$$
where $x_l$ and $x_{l+1}$ represent the input and output feature vectors of the residual block, respectively, and $F(x_l)$ denotes the mapping of the residual block.
Research results reveal that the saturation of recognition accuracy in deep networks can be effectively addressed by fitting the residual mapping $F(x_l)$ instead of directly fitting the desired mapping $F(x_l) + x_l$ [27]. In particular, if the network has already extracted the optimal features required for classification, the residual structure only needs to carry out the identity mapping of the skip connection to preserve the maximal recognition accuracy. For neural networks, driving a residual block to zero is more efficient than using multilayer neural networks to fit an identity mapping. Figure 3 presents the structure of the convolutional module proposed in this study based on the residual structure, where conv denotes the convolutional layer.
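The identity-mapping argument can be made concrete with a toy residual block. In this sketch the two convolutional layers of the residual branch are replaced by two plain matrix multiplications with a Relu in between, purely as a stand-in; the names `residual_branch` and `residual_block` are hypothetical.

```python
import numpy as np

def residual_branch(x, w1, w2):
    """F(x): two linear layers with a Relu in between (stand-in for two conv layers)."""
    return w2 @ np.maximum(w1 @ x, 0.0)

def residual_block(x, w1, w2):
    """Output of the block: F(x) + x, i.e. residual branch plus skip connection."""
    return residual_branch(x, w1, w2) + x

x = np.array([1.0, -2.0, 3.0])
zeros = np.zeros((3, 3))
# With all weights at zero the branch contributes nothing and the block is the identity.
print(residual_block(x, zeros, zeros))
```

Setting the branch weights to zero reproduces the input exactly, whereas a plain stack of layers would have to learn specific non-zero weights to pass its input through unchanged.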
The convolutional module proposed in this article is set up as a highly modular network structure that exhibits high expansibility. The features extracted by the upper layer network act as the input of this layer, and the input will pass through two branches, as shown in Figure 3. In the left branch, the convolution kernel of 1 × 1 is adopted to fuse the features between layers first. Subsequently, the fused features are split into x branches according to the number of layers. Each branch contains 3 layers of features, and all branches adopt a convolution kernel of 3 × 1 to extract features; the step size is 2. Since the step size of the convolution kernel is 2, the number of layers of the output feature remains unchanged, and the dimension is halved. Next, the features of all branches are concatenated. Moreover, the size of x can be ascertained according to the complexity of the classification tasks.
The larger x is, the easier it is to extract strongly separable features, and the better the recognition effect is in more complex classification tasks. Such a structure is similar to Inception [30]. Nevertheless, the size and number of the convolution kernels for each branch in Inception are customized step by step. In the convolutional module proposed in this article, a small-scale convolution kernel of 3 × 1 is uniformly chosen to simplify the structure design while ensuring the recognition effect. After concatenation, the features are fused again with a convolution kernel of 1 × 1, and the number of feature layers is increased. In the left branch in Figure 3, the number of feature layers increases from N to 4N/3. Then, according to the number of layers, the features are split into two parts to prepare for the subsequent fusion of the features of the two branches, where the number of layers of features for add is N, and the number of layers of features for concatenate is N/3, as shown in Figure 3. The right branch directly uses a convolution kernel of 1 × 1 to fuse the input features and raise the number of feature layers. The features are likewise separated into two parts according to the number of layers: the number of layers of features for add is N, and the number of layers of features for concatenate is 2N/3. Lastly, the corresponding features in the left and right branches are added or concatenated, as illustrated in Figure 3.
Sensors 2020, 20, 586
Compared with the input of the convolutional module, the dimension of the output features is halved, and the number of layers is doubled. The right branch has a similar effect to the residual network, making the transfer of features and gradients more efficient. Because of the right branch, each layer of the convolutional module can acquire information from the loss function and the original input, which facilitates the exploitation of shallow features. Thus, the problem that the recognition accuracy decreases as the number of network layers increases is avoided.
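The channel bookkeeping of the module can be checked with a few lines of arithmetic. The sketch below only tracks the number of feature layers (channels) described in the text; the function name `module_channel_plan` is hypothetical, N is assumed to be divisible by 3, and the reading that the first 1 × 1 convolution of the left branch produces 3 channels per branch is an interpretation of "each branch contains 3 layers of features".

```python
def module_channel_plan(n_in, x_branches):
    """Channel counts inside one convolutional module with n_in input channels."""
    if n_in % 3 != 0:
        raise ValueError("n_in is assumed to be divisible by 3")
    left_add, left_cat = n_in, n_in // 3          # left branch output: 4N/3 = N + N/3
    right_add, right_cat = n_in, 2 * n_in // 3    # right branch output: N + 2N/3
    return {
        "left_split_channels": 3 * x_branches,    # 3 channels per branch (assumed reading)
        "left_total": left_add + left_cat,
        "right_total": right_add + right_cat,
        "added": n_in,                            # the two N-channel parts are summed
        "concatenated": left_cat + right_cat,     # N/3 + 2N/3 = N
        "output": n_in + left_cat + right_cat,    # N + N = 2N
    }

print(module_channel_plan(n_in=9, x_branches=3))
```

For N = 9 the added part keeps 9 channels and the concatenated part contributes 3 + 6 = 9 more, so the module outputs 18 channels, which matches the statement that the number of layers is doubled.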

Design of Loss Function
The loss function measures the difference between the predicted value and the real value. Softmax loss commonly acts as the loss function for multi-class convolutional neural networks. However, from the clustering perspective, the features extracted under softmax loss may display larger intra-class variations than inter-class variations. In other words, the features extracted under softmax loss are not discriminative enough, since they still display significant intra-class variations. When there are too many target types, the features will overlap, which is not conducive to object classification. To solve this problem, numerous solutions have been proposed in face recognition [31][32][33][34][35]. They primarily focus on increasing inter-class variations and lowering intra-class variations. For softmax loss, features of the same class can be brought closer together by strengthening the boundary constraints between various targets, which also increases the inter-class variations of targets. The formulation of the original softmax loss is defined as:

$$L_S = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{W_{y_i}^T x_i}}{\sum_{j=1}^{n} e^{W_j^T x_i}}$$

where x represents the input of the last fully connected layer. $x_i \in R^d$ denotes the ith deep feature, belonging to the $y_i$th class. d indicates the feature dimension. $W_j \in R^d$ refers to the jth column of the weights $W \in R^{d \times n}$ in the last fully connected layer. $W_{y_i}^T x_i$ denotes the target logit of the ith sample. m and n represent the size of the mini-batch and the number of classes, respectively.
The design of the proposed loss function builds on the additive margin softmax loss (AM-softmax), which is used in face recognition [35]. In addition, to constrain the intra-class variations of features, a loss function named margin center (referred to as MC), integrating the additive margin and a center constraint, is proposed. The loss function uses the additive margin to increase the inter-class variations of features; the center constraint is also employed to reduce the intra-class variations of features. As a result, the inter-class variations of features are larger, the intra-class variations are smaller, and the separability of features is enhanced. The formulation of the loss function proposed in this study is given by

$$L_{AMSC} = L_{AMS} + \lambda L_C$$

$$L_{AMS} = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{s(\cos\theta_{y_i}-\mu)}}{e^{s(\cos\theta_{y_i}-\mu)}+\sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}}$$

$$L_C = \frac{1}{2}\sum_{i=1}^{m}\left\|x_i - c_{y_i}\right\|_2^2$$

where $\theta_j$ is the angle between $W_j$ and $x_i$, the hyper-parameter s is adopted to scale the cosine values, and the cosine values represent the similarity between the features. $\mu$ controls the size of the margin between the class boundaries of the features, and $\lambda$ weights the center term. $c_{y_i} \in R^d$ denotes the $y_i$th class center of features, and $c_{y_i}$ is updated continually as the features of each batch change. $L_{AMS}$ adopts the specific form $\psi(\theta) = \cos\theta - \mu$ to introduce the additive margin property and enhances the recognition effect by increasing the inter-class variations of features. $L_C$ constructs a class center for the features of each class of target and penalizes the features far away from the class center. Accordingly, the features of each class become more compact, the intra-class variations are lowered, and the inter-class variations are promoted. The gradients of $L_C$ with respect to $x_i$ and the update equation of $c_{y_i}$ are computed as:

$$\frac{\partial L_C}{\partial x_i} = x_i - c_{y_i}$$

$$\Delta c_j = \frac{\sum_{i=1}^{m}\delta(y_i = j)\,(c_j - x_i)}{1+\sum_{i=1}^{m}\delta(y_i = j)}$$

where $\delta(\cdot) = 1$ if the condition is satisfied (here, when the ith sample belongs to class j); otherwise, $\delta(\cdot) = 0$. Under the constraint of the joint loss function $L_{AMSC}$, the learning details in the network can be summarized as follows (Algorithm 1):
Step 1: while not converged do
Step 2: $t \leftarrow t + 1$.
Step 3: compute the joint loss by $L^t_{AMSC} = L^t_{AMS} + \lambda L^t_C$.
Step 4: compute the backpropagation error.
Step 5: update the parameters $W$ to obtain $W^{t+1}$.
Step 6: update the parameters $c_j$ to obtain $c_j^{t+1}$.
Step 7: update the parameters $\theta_c$ to obtain $\theta_c^{t+1}$.
Step 8: end while
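A joint loss of the form L_AMS + λL_C can be sketched in NumPy as follows. This is a minimal illustration that assumes the standard AM-softmax and center-loss formulations from the face recognition literature; the function names, the default values of `s`, `mu`, `lam` and `alpha`, and the choice to apply the center term to the unnormalized features are all assumptions rather than settings taken from the paper.

```python
import numpy as np

def am_softmax_loss(x, W, y, s=30.0, mu=0.35):
    """Additive-margin softmax: the target logit uses s * (cos(theta) - mu)."""
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)   # (m, d)
    w_unit = W / np.linalg.norm(W, axis=0, keepdims=True)   # (d, n)
    cos = x_unit @ w_unit                                    # cosine similarities (m, n)
    rows = np.arange(x.shape[0])
    logits = s * cos
    logits[rows, y] = s * (cos[rows, y] - mu)                # margin on the true class only
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, y].mean()

def center_loss(x, y, centers):
    """Half the summed squared distance of each feature to its class center."""
    return 0.5 * np.sum((x - centers[y]) ** 2)

def joint_loss(x, W, y, centers, lam=0.01, s=30.0, mu=0.35):
    return am_softmax_loss(x, W, y, s, mu) + lam * center_loss(x, y, centers)

def update_centers(x, y, centers, alpha=0.5):
    """Move each class center toward the batch features of that class."""
    new_centers = centers.copy()
    for j in range(centers.shape[0]):
        members = x[y == j]
        delta = (centers[j] - members).sum(axis=0) / (1 + len(members))
        new_centers[j] = centers[j] - alpha * delta
    return new_centers

x = np.array([[2.0, 0.0], [0.0, 1.0]])   # two 2-D features
y = np.array([0, 1])                     # their class labels
W = np.eye(2)                            # last-layer weights, one column per class
centers = np.zeros((2, 2))
print(joint_loss(x, W, y, centers), update_centers(x, y, centers))
```

In a full training loop, the network weights would be updated by back propagation of the joint loss while `update_centers` is applied once per mini-batch, which mirrors Steps 3 to 7 of the algorithm.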

Design of Model Structure
The block diagram of the presented model is shown in Figure 4. It primarily includes an initial convolutional layer, several convolutional modules with the same topology connected sequentially, and two final fully connected layers. The dimension of the latter fully connected layer is 2, which is conducive to visualizing the features extracted by the model and analyzing the clustering effect of the features.
In Figure 4, the numbers in brackets represent the data dimension after the data passes through this layer, consistent with Figure 3. The output data dimensions of each convolutional module and the first fully connected layer are determined according to the number of convolutional modules. Lastly, the result of the output layer is one-dimensional data that represents the target types. The number of target types in this study is 13. In the presented model, a one-dimensional convolution kernel with a scale of 7 × 1 is taken for the initial convolutional layer. Selecting a relatively large convolution kernel in the first layer of the network is conducive to the extraction of features (e.g., contour and texture in the HRRP). After each convolution operation in this model, batch normalization and Relu activation are performed on the extracted features. Since the Relu activation function is used, the He initialization method is chosen for all weight initialization of the proposed model.

Data Set Construction
On the whole, there are two ways to obtain the target echo signal, namely the measured method and the theoretical calculation method. Since most ship targets are non-cooperative targets, it is very difficult to obtain the HRRP from field measurement. In this study, 13 ship models were built by 3D Max, and HRRP was calculated by FEKO. FEKO is 3D electromagnetic field simulation software, and is an abbreviation of "FEldberechnung für Körper mit beliebiger Oberfläche", in German. When calculating the HRRP of a ship, the ship is stationary, and the HRRP of the ship in different directions is obtained by changing the incident direction of the electromagnetic wave. Since the ship is stationary when calculating HRRP, we do not apply three-dimensional rotation around the different Cartesian axes. The set simulation parameters include the center frequency of the radar as 10 GHz, the bandwidth as 80 MHz, the number of frequency sampling points as 256, the calculated azimuth range as 0-360°, and the interval as 1°. The grazing angle is 10°. The obtained HRRP has 256 range cells, with the corresponding length of each range cell as 1.875 m. The model and amplitude normalized HRRP of one of the ships are illustrated in Figure 5. Models of all of the ship targets are presented in Figure 6.

In Figure 5b, the horizontal axis and the vertical axis represent HRRP length and azimuth angle, respectively. For each ship, 360 HRRPs are obtained. To provide enough samples for neural network training and prevent over-fitting, the dataset should be expanded. The process is as follows:
1. Translation interception of HRRP. As revealed by Figure 5, when HRRP is calculated, the coordinate axis coincides with the center of the ship, so the effective HRRP information is generally in the middle region. However, when the radar detects the target, the echo signal may be incomplete or partially missing. Accordingly, the first step of data expansion is the translation interception of HRRP. Since each HRRP is one-dimensional data, only a one-dimensional translation interception is applied. The HRRP is shifted to the left and right by 32 and 64 range cells in turn. The data shifted out of range is discarded, and the blank part is filled with 0, as presented in Figure 7. The number of samples is increased fivefold by taking those HRRP samples that overlap but are not identical. It should be noted that the translation interception of HRRP is to simulate the partially missing echo signal, and there is no spatial transformation performed on the object during the HRRP acquisition and expansion process.
2. Random noise is added to the translated HRRP data. Gaussian white noise was added to the data 10 times, and the data after adding noise meets a certain SNR.
2/3 of the target data of each class of ship are randomly taken as the training dataset and 1/3 as the testing dataset. In the database, the training dataset samples and the testing dataset samples were 156,000 and 78,000, respectively.
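The two expansion steps and the resulting sample counts can be sketched as follows. The function names are hypothetical, and since the text only states that the noisy data "meets a certain SNR", the value `snr_db=20.0` and the definition of SNR as mean signal power over noise variance are assumptions made for this example.

```python
import numpy as np

def shift_zero_fill(hrrp, shift):
    """Shift a 1-D HRRP by `shift` range cells (positive = right), padding with zeros."""
    out = np.zeros_like(hrrp)
    if shift > 0:
        out[shift:] = hrrp[:-shift]
    elif shift < 0:
        out[:shift] = hrrp[-shift:]
    else:
        out[:] = hrrp
    return out

def add_awgn(hrrp, snr_db, rng):
    """Add white Gaussian noise so that mean signal power / noise power equals the SNR."""
    noise_power = np.mean(hrrp ** 2) / 10 ** (snr_db / 10)
    return hrrp + rng.normal(0.0, np.sqrt(noise_power), hrrp.shape)

def augment(hrrp, snr_db=20.0, n_noisy=10, shifts=(0, -64, -32, 32, 64), seed=0):
    """Return every (shift, noise realization) combination for one HRRP."""
    rng = np.random.default_rng(seed)
    hrrp = np.asarray(hrrp, dtype=float)
    return np.stack([add_awgn(shift_zero_fill(hrrp, s), snr_db, rng)
                     for s in shifts for _ in range(n_noisy)])

samples = augment(np.ones(256))
print(samples.shape)                 # 5 shifts x 10 noise realizations per HRRP
print(13 * 360 * samples.shape[0])   # ships x azimuths x augmented copies
```

With 13 ships, 360 azimuth angles, 5 translated versions and 10 noise realizations, the total is 234,000 samples, which agrees with the 156,000 training and 78,000 testing samples reported for the 2/3 and 1/3 split.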

Model Identification Performance Analysis
In this section, the performance of the presented model is analyzed in three aspects. The first part shows the effect of different loss functions on recognition. The second part analyzes the advantages of the presented model over the comparison models. The third part analyzes how increasing the model complexity affects recognition.
All the networks were trained from scratch. The number of iterations was set to 200. The learning rate began at 0.01 and was halved every 20 training iterations. The Adam optimizer was employed to update the network weights. Mini-batch gradient descent was applied, and the number of training samples per batch was 512.
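The step-decay schedule described above can be written as a one-line function. The function name is hypothetical, and counting iterations from zero so that the first halving takes effect at iteration 20 is an assumption.

```python
import math

def learning_rate(iteration, base_lr=0.01, drop_every=20, factor=0.5):
    """Step decay: start at base_lr and multiply by `factor` every `drop_every` iterations."""
    return base_lr * factor ** (iteration // drop_every)

for it in (0, 19, 20, 40, 199):
    print(it, learning_rate(it))
```

Over 200 iterations the rate is halved nine times, so the final iterations run at 0.01 × 0.5⁹, roughly 2 × 10⁻⁵.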

Effect of Loss Function on Recognition Effect
The hyper-parameters of the presented model are limited to three types: the number of convolutional modules, the number of left branches in the modules and the parameters of the joint loss function.
To verify the effectiveness of the proposed structure and loss function, model A, with low complexity, is built first. Model A has 4 convolutional modules, each with 3 left branches. The parameters of the joint loss function are fine-tuned according to the recognition results. Table 1 lists the structure and parameters of each stage in model A. Each convolutional layer is followed by batch normalization and ReLU activation. The parameter count of each stage covers both convolution kernel parameters and batch normalization parameters. For instance, the parameter count of the initial convolutional layer is given as 63 + 36, i.e., 63 convolution kernel parameters and 36 batch normalization parameters. Model A has 37,538 parameters in total. First, the effect of different loss functions on recognition is compared under the structure of model A. The loss functions compared are L_S, L_AMS and L_SC.

Classification Effect Comparison of Loss Function L_AMS and Loss Function L_S
The hyper-parameter µ in L_AMS constrains the boundaries between features, and s scales the cosine values. In [35], it was reported that s does not increase and the network converges relatively slowly if s is learned. Thus, s is fixed at 30, which is a sufficiently large value, and experiments are performed to examine the sensitivity to the parameter µ.
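To make the roles of s and µ concrete, the following NumPy sketch assumes that L_AMS takes the common additive-margin softmax form, in which the margin µ is subtracted from the target-class cosine before scaling by s. The definition used in this paper is given in an earlier section and may differ in detail; the function name `am_softmax_loss` and the array shapes are illustrative assumptions.

```python
import numpy as np

def am_softmax_loss(features, weights, labels, s=30.0, mu=0.05):
    """Additive-margin softmax loss (assumed form of L_AMS).

    features: (n, d) array, weights: (d, c) array, labels: (n,) int array.
    """
    # Cosine similarity between L2-normalised features and class weights.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = f @ w
    # Subtract the margin from the target-class cosine only, then scale by s.
    rows = np.arange(len(labels))
    margin = np.zeros_like(cos)
    margin[rows, labels] = mu
    logits = s * (cos - margin)
    # Numerically stable log-softmax followed by the mean negative log-likelihood.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 2))
    w = rng.normal(size=(2, 3))
    y = rng.integers(0, 3, size=8)
    print(am_softmax_loss(x, w, y, mu=0.0), am_softmax_loss(x, w, y, mu=0.05))
```

With µ = 0 this reduces to an ordinary softmax loss on scaled cosines. A positive µ makes the target class harder to satisfy, which is what pushes the class boundaries apart.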
On the datasets with SNRs of 0, 5, 10 and 15 dB, s is fixed at 30 and µ is varied from 0 to 1 to compare the recognition accuracy of model A under loss functions L_AMS and L_S. Recognition accuracy is the percentage of testing-dataset samples that are classified correctly, and the simulation results are presented in Figure 8. As suggested by Figure 8, compared with the conventional loss function L_S, the loss function L_AMS improves the recognition accuracy to a certain extent under all SNR conditions. In addition, the lower the SNR of the dataset, the greater the enhancement in recognition accuracy. At each SNR, the enhancement generally declines as the boundary constraint strength µ rises. The effective range of µ is also small when the SNR is low: in Figure 8a, it extends only from 0 to 0.25 at an SNR of 0 dB, and beyond this range the recognition accuracy falls below that obtained with L_S alone. When L_AMS is adopted for ship target recognition on our dataset, the boundary constraint strength should therefore not be too large; a value of 0.05 is generally appropriate and works across a wide range of SNRs.
To show more intuitively the effect of loss function L_AMS on the separability of the features extracted by model A, the testing dataset at an SNR of 15 dB is visualized using the 2-D features of the second fully connected layer of model A, as shown in Figure 9. With L_AMS, the angular region occupied by the features of each class becomes smaller, the inter-class variations become larger, and the features are more separable. The scale of the features also increases with L_AMS; that is, the features of the same class become more slender in their spatial distribution.

Classification Effect Comparison of Loss Function L_SC and Loss Function L_S
Fusing the loss function L_S with the loss function L_C through the weight λ yields L_SC. The hyper-parameter α in L_SC controls the learning rate of the feature centers, and λ balances the two functions. Experimental results reveal that the recognition accuracy fluctuates only slightly when α varies. To simplify model design and optimization, α is therefore fixed at 0.6, and experiments are conducted to investigate the sensitivity to λ on the datasets under different SNR conditions. The simulation results are listed in Table 2. Comparing the recognition accuracy in Table 2 with the L_AMS results in Figure 8 suggests that L_SC is more robust to noise but improves recognition accuracy only to a limited extent, indicating that reducing the intra-class variations of features alone cannot significantly enhance the recognition performance of the model. To show how L_SC establishes the feature centers, the 2-D features of the second fully connected layer of model A are visualized every 50 iterations for the dataset with an SNR of 15 dB and λ = 0.6, as shown in Figure 10.
As suggested by Figure 10a, the initial features of each class are inseparable, and the initial recognition accuracy is only about 0.3274. As the iterations proceed and the parameters are updated, the features of the samples gradually separate and concentrate around their class centers, and the recognition accuracy rises with the feature separability. The comparison between (c) and (d) in Figure 10 suggests that although the recognition accuracy is not improved between 100 and 150 iterations, the features become more clustered and the intra-class variations gradually decrease. As suggested by the comparison of Figure 10e,f, although the training dataset exhibits stronger feature separability and higher recognition accuracy, the testing dataset has a similar feature distribution and recognition accuracy. The model therefore shows no obvious overfitting, and the extracted features generalize well. Compared with the features extracted by model A under L_AMS and L_S in Figure 9, the features extracted under L_SC have a smaller scale: the distribution range narrows from [−400, 400] to [−3, 3]. The spatial distribution of the features changes from divergent to aggregated by class, and the intra-class difference is smaller.
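The interplay of λ and α discussed in this subsection can be illustrated with a small NumPy sketch. It assumes that L_C has the usual center-loss form (half the squared distance of each feature to its class center, averaged over the batch here) and that the centers are updated from batch statistics at rate α. These forms and the function names `center_loss`, `update_centers` and `joint_loss` are assumptions for illustration, not the definitions used in the paper.

```python
import numpy as np

def center_loss(features, labels, centers):
    """Assumed L_C: half the mean squared distance of features to their class centers."""
    diff = features - centers[labels]
    return 0.5 * np.mean(np.sum(diff ** 2, axis=1))

def update_centers(features, labels, centers, alpha=0.6):
    """Move each class center towards its batch features at rate alpha."""
    new_centers = centers.copy()
    for j in np.unique(labels):
        members = features[labels == j]
        # Damped average offset between the center and its members.
        delta = (centers[j] - members).sum(axis=0) / (1.0 + len(members))
        new_centers[j] = centers[j] - alpha * delta
    return new_centers

def joint_loss(softmax_loss, features, labels, centers, lam=0.6):
    """Assumed L_SC = L_S + lambda * L_C, with L_S supplied by the caller."""
    return softmax_loss + lam * center_loss(features, labels, centers)

if __name__ == "__main__":
    feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
    labels = np.array([0, 0, 1])
    centers = np.zeros((2, 2))
    print(center_loss(feats, labels, centers))
    centers = update_centers(feats, labels, centers)
    print(centers, center_loss(feats, labels, centers))
```

A larger λ weights the clustering term more heavily, so the features are pulled more tightly around their centers. α only sets how quickly the centers follow the features, which is consistent with the reported observation that accuracy is not very sensitive to it.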

Classification Effect Comparison between Loss Function L_SC and Others
The results above show that the boundary constraint strength µ of the loss function L_AMS can significantly improve the recognition accuracy, although µ should not be overly large when the SNR is low. The weight λ of the loss function L_SC is more adaptable and improves the intra-class aggregation of features over a larger range of values, but its enhancement of recognition accuracy is limited.
In this section, we verify the enhancement of recognition accuracy obtained with the joint loss function L_AMSC, where s, α and µ are fixed at 30, 0.6 and 0.05, respectively. The recognition accuracy of model A for different values of λ under different SNR conditions is listed in Table 3.
Table 3. Recognition accuracy of model A for different values of λ in loss function L_AMSC on datasets under different SNR conditions.
Table 3 also gives the recognition accuracy obtained with loss functions L_AMS, L_SC and L_S; for L_AMS and L_SC, the best accuracy over the tested parameter values is reported. As suggested by Table 3, the joint loss function L_AMSC improves the recognition accuracy consistently under different SNR conditions, and the enhancement is greater when the SNR of the dataset is relatively low. The features extracted by model A for λ values of 0.001, 0.01, 0.1 and 1 are visualized in Figure 11.
As suggested by Figure 11, as λ increases, the intra-class differences of the features gradually become smaller, and the features of each class gradually converge to the class center. The spatial distribution range of the features narrows from [−20, 15] to [−1.5, 1.5]. Judging by both the recognition accuracy and the intra-class aggregation of features, the recognition performance is best at λ = 0.1.

Analysis of the Recognition Effect of the Presented Model and the Comparison Model
In this section, common HRRP-based target recognition algorithms are selected as comparison models to verify the effectiveness of the presented model and loss function. The conventional machine learning algorithms compared are KNN [36], LSVM [37], RBF-SVM [38], RF [39] and NB [40]. The neural network algorithms compared are CNN [18], Stack Sparse Auto Encoder and K-Nearest Neighbor (sDSAE&KNN) [24], and Stack Convolutional Auto Encoder (SCAE) [41]. The hyper-parameters of each comparison algorithm are fine-tuned for the highest recognition accuracy. Tables 4-6 list the structure and parameters of each neural network comparison model. The pooling layers in all models are max-pooling, and batch normalization is performed after each convolutional layer in CNN.
Since model complexity is associated with recognition accuracy, the neural network comparison models are designed with a number of parameters similar to that of model A. The total number of parameters is used here to represent the complexity of a model. As suggested by the tables above, the neural network models in descending order of complexity are: sDSAE&KNN, SCAE, CNN, model A.
First, the recognition performance of all models is compared on the dataset with SNR = 5 dB. Figure 12 shows the best recognition accuracy of each comparison model and of model A with a variety of loss functions. As suggested by Figure 12, every variant based on the proposed structure (model A) achieves better recognition accuracy than the comparison models, and model A combined with the joint loss function L_AMSC achieves the highest accuracy among all models. This verifies the effectiveness of both the proposed structure and the joint loss function.
In addition, the neural network models generally outperform the conventional machine learning models. Among the neural network models, those containing convolution kernels (model A, CNN, SCAE) achieve prominent recognition results, and those based on the convolutional neural network (model A, CNN) outperform those based on the auto-encoder (SCAE, sDSAE&KNN). During the expansion of the dataset, translation interception is performed to simulate, to some extent, target occlusion and information loss in the echo signal. The higher accuracy of the convolutional neural network-based models indicates that the convolution kernel helps the model extract effective separable features from the echo signals of different targets, achieve better recognition, and remain less susceptible to incomplete echo signal information.
The optimal recognition results of each neural network model under different SNR conditions are listed in Table 7. As suggested by Table 7, the SNR noticeably affects the recognition accuracy of each model, and the accuracy of every model rises with the SNR of the dataset. Compared with the comparison models, model A has the fewest parameters and the lowest complexity, yet it achieves the highest recognition accuracy at every SNR. By improving the network structure and loss function, the presented model achieves better recognition with lower model complexity and exhibits better generalization performance and noise robustness. It should be noted that the calculation process of the proposed model is more complicated, so it takes more time to identify a target.

Effect of Model Complexity on Recognition Effect
The experimental results above verify the effectiveness of the proposed structure and loss function. Since the recognition accuracy of a model is positively correlated with its depth and width within a certain range, three models of different complexity are designed in this section by selecting different parameters, and their recognition performance is compared. Model A refers to the model adopted in Section 4.2.1. Model B is obtained by increasing the number of convolutional modules in model A to 5, and model C is obtained by increasing the number of left branches in each module of model A to 6. The details of the structure and parameters of each stage in models B and C are listed in Tables 8 and 9.
The optimal recognition results of each model under different SNR conditions are listed in Table 10. The proposed joint loss function L_AMSC is used in all models, with parameter values s = 30, α = 0.6, µ = 0.05 and λ = 0.1. As revealed by Table 10, the depth and width of the model directly affect recognition accuracy. Compared with model A, models B and C both improve the recognition accuracy noticeably, and the improvement is more pronounced when the SNR is low. Table 10 also lists the time required by models A, B and C to process each HRRP. Compared with model A, the computation times of models B and C increase because of the increased model complexity; however, the increase in computation time is small relative to the increase in the number of parameters. To compare the convergence speed of the models of different complexity, the recognition accuracy and loss curves during training on the dataset with an SNR of 15 dB are plotted in Figure 13.
Figure 13 reveals that the recognition accuracy and loss curves of models B and C fluctuate more dramatically during training but converge faster than those of model A. In the initial 60 iterations, the loss on the testing dataset declines rapidly and the recognition accuracy increases rapidly. By 60 iterations, the models are already relatively well trained. Subsequently, until the end of training, the loss and recognition accuracy gradually converge to stable values. In the meantime, the loss curve of model A in Figure 13b is always higher than those of models B and C, and a gap remains until the end of training. Since the L_C term in the joint loss function L_AMSC indicates the intra-class difference of the extracted features, this suggests that the features extracted by models B and C undergo intra-class aggregation more effectively. The feature visualizations of the models, illustrated in Figures 14 and 15, also verify this conclusion.
Though the model becomes more complex as its depth and width increase, the presented model is capable of extracting deeper and more stable separable features from HRRP data for identification, thereby making the model more adaptable to SNR.

Conclusions
In this study, a neural network model integrating a micro convolutional module and a residual structure is proposed to classify ship targets based on HRRP. The model has few hyper-parameters, is easy to extend and achieves high recognition accuracy. The convolutional module is a simple and highly modular network structure with strong scalability. The left branch structure of the convolutional module simulates the effect of deepening and widening the network, while the skip structure of the right branch transfers features and gradients more effectively. The presented model can thus increase the utilization of shallow features while lowering the risk of vanishing gradients and recognition rate saturation. In addition, a novel loss function combining boundary constraint and center clustering is developed. The features extracted with this loss function are characterized by larger inter-class variations, smaller intra-class variations and stronger separability. The effects of the loss function and model complexity on recognition accuracy are analyzed by simulation experiments. Compared with other commonly used network structures, the presented model achieves higher recognition accuracy with fewer model parameters, together with good generalization performance and robustness.
Funding: This research received no external funding.