Underwater Acoustic Target Recognition with a Residual Network and the Optimized Feature Extraction Method

Abstract: Underwater Acoustic Target Recognition (UATR) remains one of the most challenging tasks in underwater signal processing due to the difficulty of acquiring labeled data, the time-space varying intrinsic characteristics of targets, and interference from other noise sources. Although some deep learning methods have been shown to achieve state-of-the-art accuracy, recognition accuracy can be further improved by designing a Residual Network and optimizing feature extraction. To give a more comprehensive representation of the underwater acoustic signal, we first propose three-dimensional fusion features along with the SpecAugment data augmentation strategy. Afterward, an 18-layer Residual Network (ResNet18), which contains the center loss function with an embedding layer, is designed to train the aggregated features with an adaptable learning rate. Recognition experiments are conducted on a ship-radiated noise dataset recorded in a real environment, and the accuracy of 94.3% indicates that the proposed method is appropriate for underwater acoustic recognition problems and substantially surpasses other classification methods.


Introduction
As a key technology to promote the intelligence of underwater acoustic equipment systems, the recognition of underwater acoustic target-radiated noise is one of the most important research directions of underwater acoustic signal processing [1]. One reason is that underwater environments are complex and changing, making it difficult to identify underwater objects [2,3]. Besides, although recent years have seen an increasing number of water vessels, the acquisition of large-scale labeled data has been greatly hindered by confidentiality and cost. Another drawback, concerning data quality, is that background noise and interference from other noise sources inevitably exist.
Generally, Underwater Acoustic Target Recognition (UATR) is performed by well-trained sonar operators, which can be inaccurate due to long working hours and may be affected by weather conditions [4]. Hence, developing a robust recognition system to replace the human work of identifying ship-radiated noise is of great importance. From a technical perspective, efforts are consistently made to improve classification accuracy in the aspects of feature extraction and classifier training [5].
Much work attempts to extract hand-crafted features from ship-radiated noise and feed them into different kinds of classifiers. On the one hand, for the traditional machine learning feature extraction process, Support Vector Machine (SVM) [6,7] and Principal Component Analysis (PCA) [8] methods are widely used. For example, Meng et al. propose a method that directly uses the wave structure with an SVM [6]. Wei et al. [7] present an extraction method based on the 1 1/2 D spectrum and PCA. Features derived from the Mel filters of Mel Frequency Cepstral Coefficients (MFCC) and the Log-Mel Spectrogram (LM) are two widely used features in Environment Sound Classification (ESC) tasks [9,10] with acceptable performance. Although such features originate from the speech or sound field, the effectiveness of MFCC and its first-order and second-order differential features has been proven for underwater acoustic target recognition [8]. Besides, a considerable number of research works indicate that fusion features can give a more comprehensive representation of environmental sounds [11]. For the recognition of underwater acoustic targets, Meng et al. [6] exploit the fusion feature of zero-crossing wavelength, peak-to-peak amplitude, and zero-crossing-wavelength difference. On the other hand, the design of the neural networks plays an important role in achieving competitive performance together with optimized feature extraction. For example, a time-delay neural network (TDNN) and a convolutional neural network (CNN) are introduced for UATR in [12]. Testolin et al. [13] present an innovative method that accurately detects and tracks underwater moving targets from the reflections of an active acoustic emitter. Their system is based on a computationally and energy-efficient preprocessing stage carried out with a deep convolutional denoising autoencoder (CDA), whose output is fed to a probabilistic tracking method based on the Viterbi algorithm. Testolin et al. [14] have also shown that transfer learning can be a viable approach in scenarios where tagged data are often lacking, and they evaluate the feasibility of a model based on Recurrent Neural Networks (RNN) for scenarios requiring online processing of the reflection sequence. Shen et al. [15] introduce auditory-inspired convolutional neural networks trained on raw underwater acoustic signals.
For ShipsEar [16], a machine learning method using a GMM-based classifier trained with the standard expectation maximization (EM) algorithm can be used as a baseline, whose best classification rate is 75.4%. As the results obtained by typical machine learning methods are not very high, methods based on deep learning models are worth researching. Li et al. [5] introduce a feature optimization approach with Deep Neural Networks (DNN) and an optimized loss function and achieve an accuracy of 84%. Yang et al. [17] propose a so-called competitive Deep Belief Net (cDBN) for UATR. Luo et al. [18] present a UATR method based on the Restricted Boltzmann Machine (RBM), which achieves an accuracy of 93.17% on the ShipsEar dataset. Ke et al. [4] propose a novel recognition method of four steps, including preprocessing, pretraining, fine-tuning, and recognition, which achieves a recognition accuracy of 93.28%.
In this paper, we propose three-dimensional fusion features along with the SpecAugment data augmentation strategy and an 18-layer Residual Network (ResNet18) containing the center loss function with an embedding layer to achieve good accuracy. The remaining parts of this paper are organized as follows. Section 2 introduces the framework of the classification method for UATR. Section 3 describes the feature aggregation scheme of the three-dimensional fusion features and the 18-layer Residual Network (ResNet18). Section 4 presents the experimental results on ShipsEar with the proposed method and other methods. Section 5 gives the conclusion.

Description of the Classification Method
As depicted in Figure 1, the classification method mainly contains several steps, i.e., preprocessing, feature extraction, Residual Network training, and embedding layer design. Note that the underwater acoustic waves have different time durations; therefore, the preprocessing steps of truncation and framing are necessary. Afterward, we perform the feature extraction process, which contains feature calculation, feature fusion, and feature augmentation, to obtain the three-dimensional feature. Instead of using a general CNN or DNN structure, we design an 18-layer Residual Network named "ResNet18" together with the center loss function. As a deep network, ResNet18 can learn very informative representations in the training process. To avoid overfitting, we train the aggregated features with early stopping and an adaptable learning rate. Besides, aiming to increase the inter-class distance and decrease the intra-class distance, the embedding layers are well-tuned, where the center loss function is used together with softmax instead of softmax alone.

Figure 1. The depiction of the classification method.

Preprocessing
Although the practically captured data from the environment may have variable duration, underwater acoustic processing methods based on CNN, DNN, or ResNet networks generally need the length of the input sample to be fixed. Therefore, preprocessing is a prerequisite before feeding signals of different lengths to the designed networks. One feasible way is to split the underwater acoustic signal into several frames of fixed length using a sliding window of appropriate width. Considering that an underwater acoustic signal can be regarded as stationary over a very short time, framing of the recorded signal can be performed. This naturally increases the number of samples, as some parts of the underwater acoustic signal are reused, which can be viewed as a form of data augmentation [19]. For simplicity, our framing method does not use overlapping, and no window function is applied. Note that each preprocessed frame must be labeled with the label of the raw recording. Besides, for underwater acoustics, a sampling rate of 20,480 Hz may be considered a good trade-off between the quality of the input sample and the computational cost of the model. Figure 2 demonstrates the framing of a signal of the passenger type into several frames (s and s + 1) with no overlap.
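As a minimal illustration, the non-overlapping framing described above could be sketched as follows; the signal, its duration, and the label are hypothetical stand-ins rather than details from the paper:

```python
import numpy as np

def frame_signal(signal, frame_len):
    """Split a 1-D signal into non-overlapping frames of fixed length.
    Trailing samples that do not fill a whole frame are discarded, and
    every frame inherits the label of the raw recording."""
    n = len(signal) // frame_len
    return signal[:n * frame_len].reshape(n, frame_len)

# Example with a hypothetical 23 s recording at the 20,480 Hz rate.
sr = 20480
signal = np.random.randn(23 * sr)        # stand-in for a recorded clip
frames = frame_signal(signal, 5 * sr)    # shape: (4, 102400)
labels = ["passenger"] * len(frames)     # each frame keeps the raw label
```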

Feature Extraction

Figure 3 gives a graphical representation of the procedure for extracting the final features, which contains several steps, i.e., feature calculation, feature fusion, and feature augmentation. In general, different features can capture different characteristics of the underwater acoustic signal. Therefore, multiple features can be combined to exploit their complementary information for further improvement of the feature design. At the stage of feature calculation and feature fusion, we extract the Log-Mel spectrogram (LM) as the first channel, the MFCC as the second channel, and the composition of Chroma, Contrast, Tonnetz, and Zero-crossing rate, called CCTZ, as the third channel. The first two channels both have a feature size of 60 × 41, where 60 denotes the number of bands and 41 denotes the number of frames. As for the third channel, the feature sizes of Chroma, Contrast, Tonnetz, and Zero-crossing rate are 24 × 41, 6 × 41, 6 × 41, and 1 × 41, respectively. To form the new feature set CCTZ, these features are sequentially stacked, with the remaining rows padded by zeros. Afterward, at the stage of feature augmentation, we apply SpecAugment five times on the LM feature, leaving the other channels unchanged. SpecAugment originates from a data augmentation method for speech recognition and is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) [20]. The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps [20]. The resulting three-dimensional feature matrix is composed of the three channels mentioned above.

Figure 3. The depiction of the feature extraction process.
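For concreteness, the following Python sketch outlines how the three channels could be assembled with librosa, together with a simple SpecAugment-style masking step. The hop length, the feature alignment by truncation/padding, and the mask widths are assumptions (chosen so that a 5 s clip at 20,480 Hz yields 41 frames); the paper does not specify these values, and time warping is omitted for brevity. In the paper's scheme, the masking would be applied five times to the LM channel only.

```python
import numpy as np
import librosa

SR, N_FRAMES, HOP = 20480, 41, 2560  # hop chosen so a 5 s clip gives 41 frames

def fit(f):
    """Truncate/zero-pad a feature matrix to N_FRAMES columns."""
    f = f[:, :N_FRAMES]
    return np.pad(f, ((0, 0), (0, N_FRAMES - f.shape[1])))

def three_channel_feature(y):
    lm = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=SR, n_mels=60, hop_length=HOP))
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=60, hop_length=HOP)

    chroma = librosa.feature.chroma_stft(y=y, sr=SR, n_chroma=24, hop_length=HOP)
    contrast = librosa.feature.spectral_contrast(y=y, sr=SR, n_bands=5,
                                                 hop_length=HOP)  # 6 rows
    tonnetz = librosa.feature.tonnetz(
        chroma=librosa.feature.chroma_cqt(y=y, sr=SR, hop_length=HOP))  # 6 rows
    zcr = librosa.feature.zero_crossing_rate(y, hop_length=HOP)         # 1 row

    # CCTZ: stack 24 + 6 + 6 + 1 = 37 rows, zero-pad the remaining rows to 60.
    cctz = np.zeros((60, N_FRAMES))
    stacked = np.vstack([fit(f) for f in (chroma, contrast, tonnetz, zcr)])
    cctz[:stacked.shape[0]] = stacked

    return np.stack([fit(lm), fit(mfcc), cctz], axis=-1)  # shape: 60 x 41 x 3

def spec_augment(spec, max_f=8, max_t=8, rng=np.random):
    """Mask one random frequency band and one random time block with zeros."""
    s = spec.copy()
    f = rng.randint(0, max_f + 1)
    f0 = rng.randint(0, s.shape[0] - f + 1)
    s[f0:f0 + f, :] = 0
    t = rng.randint(0, max_t + 1)
    t0 = rng.randint(0, s.shape[1] - t + 1)
    s[:, t0:t0 + t] = 0
    return s
```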

Structure of the ResNet18
The number of convolutional layers plays a key role in detecting high-level concepts [19]. The residual learning framework can decrease the difficulty of training deep neural networks with many layers [21]. Motivated by this, the architecture of ResNet18, which is a modified version of the general ResNet50, is shown in Figure 4. The details are listed as follows:
• Stage 0: The input layer is zero-padded with 3 × 3 padding. After this processing, the shape changes from 60 × 41 × 3 to 66 × 47 × 3.
• Stage 1: The first stage consists of a convolutional layer of 64 filters of size 3 × 3 with a stride of 2 × 2. Batch normalization is applied, followed by a Rectified Linear Unit (ReLU) as the activation function and max-pooling of 3 × 3 with a stride of 2 × 2.
• Stage 2: The second stage consists of a convolutional block named "block-2a," an identity block named "block-2b," and an identity block named "block-2c." The structures of the convolutional block and the identity block are shown as Conv_block and Id_block in Figure 4, respectively.
• Stage 3: The third stage consists of a convolutional block named "block-3a," an identity block named "block-3b," and an average pooling layer of 2 × 2. After processing, the shape of the output changes from 15 × 11 × 512 to 7 × 5 × 512.
• Stage 4: The stage contains the flatten layer and the designed module of center loss. The number of hidden units of the first fully connected layer is 17,920. As for the designed module of center loss, the second fully connected layer, with 24 hidden units, connects to the first layer, followed by the Parametric Rectified Linear Unit (PReLU) as the activation function. The last fully connected layer has 5 hidden units.

Figure 4. The overall framework of an 18-layer Residual Network (ResNet18).

We should mention that, for the identity block, nonlinear activation helps the designed networks to learn more complex features. The activation function is defined as

f(x_i) = max(0, x_i) + α_i min(0, x_i),

where x_i is the input of the nonlinear activation on the ith channel and α_i is a learnable coefficient controlling the slope of the negative part.
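To make the block structure concrete, here is a hedged Keras sketch of the identity block (Id_block) and convolutional block (Conv_block). The kernel sizes, the placement of the activations, the use of PReLU in the identity block, and the 1 × 1 projection shortcut are assumptions consistent with common ResNet practice, not details read off Figure 4:

```python
from keras import layers

def id_block(x, filters):
    """Identity block: two 3 x 3 convolutions plus an unmodified shortcut."""
    y = layers.Conv2D(filters, (3, 3), padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.PReLU(shared_axes=[1, 2])(y)   # learnable slope per channel
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.add([y, x])                    # skip connection
    return layers.PReLU(shared_axes=[1, 2])(y)

def conv_block(x, filters, stride=2):
    """Convolutional block: the shortcut is projected by a strided 1 x 1
    convolution so that its shape matches the main branch."""
    y = layers.Conv2D(filters, (3, 3), strides=stride, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    y = layers.BatchNormalization()(y)
    s = layers.Conv2D(filters, (1, 1), strides=stride)(x)
    s = layers.BatchNormalization()(s)
    y = layers.add([y, s])
    return layers.Activation('relu')(y)
```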

The Embedding Layer with the Center Loss Function and Softmax
To address the face verification problem, center loss [22] was first exploited to compensate for the softmax loss in 2016. For the UATR task, we improve generalization by leveraging this simple but useful cost function, since it learns a center for the features of each class and meanwhile tries to pull the deep features of the same class close to the corresponding center [5]. Basically, consider a classification task with dataset {(x_i, y_i)}_{i=1}^{N} consisting of N samples x_i and their corresponding labels y_i ∈ {1, 2, . . . , Y}.
Here, x_i is embedded into a new vector f(x_i) by a DNN [17]. The center loss function used here is defined as:

L = -∑_{i=1}^{N} log( exp(W_{y_i}^T f(x_i) + b_{y_i}) / ∑_{j=1}^{Y} exp(W_j^T f(x_i) + b_j) ) + λ ∑_{i=1}^{N} D(f(x_i), c_{y_i}),

where the first term denotes the softmax loss, the second term denotes the penalty term for clustering, and λ is a balance parameter within [0, 1]. Here, c_{y_i} is the center of class y_i, and the function D(·) stands for the distance function. In the softmax term, j indexes the classes, W_j is the weight, b_j is the bias, and Y is the total number of classes. During training, the center loss encourages samples of the same class to move closer to their learnable class center.
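As an illustration of the combined objective, the following numpy sketch evaluates the loss for a mini-batch. Using the squared Euclidean distance for D and a moving-average update of the centers are assumptions, since the paper does not spell them out:

```python
import numpy as np

def softmax_center_loss(logits, feats, labels, centers, lam=0.3):
    """Softmax cross-entropy plus the center loss penalty (numpy sketch).
    logits: (N, Y) scores, feats: (N, d) embeddings f(x_i),
    labels: (N,) integer labels, centers: (Y, d) class centers."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    ce = -np.log(p[np.arange(len(labels)), labels])  # softmax loss per sample
    center = np.sum((feats - centers[labels]) ** 2, axis=1)  # D(f(x_i), c_{y_i})
    return np.mean(ce + lam * center)

def update_centers(feats, labels, centers, alpha=0.5):
    """Move each class center toward the mean of its current samples."""
    for c in np.unique(labels):
        centers[c] += alpha * (feats[labels == c].mean(axis=0) - centers[c])
    return centers
```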

Dataset Description and Preparation
The detailed description of the dataset of the ship-radiated noise called ShipsEar (available at http://atlanttic.uvigo.es/underwaternoise/), which contains a total of 91 records of 11 vessel types and one background noise class, is presented in the literature [16]. During 2012 and 2013, the researchers recorded the sounds of many different classes of ships on the Spanish Atlantic coast. The recordings were made with autonomous acoustic digitalHyd SR-1 recorders, manufactured by MarSensing Lda (Faro, Portugal). This compact recorder includes a hydrophone with a nominal sensitivity of −193.5 dB re 1 V/1 µPa and a flat response in the 1 Hz to 28 kHz frequency range.
To keep consistent with other classification methods, the 11 vessel types are merged into four experimental classes, and the background noise type forms a fifth class, as shown in Table 1. Before preprocessing, the original records are truncated into sound clips with a duration of 5 s, yielding 1956 clips. The original signals are recorded at different sample rates. During preprocessing, each sound clip is further separated into 41 frames with zero overlap.
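A minimal preparation sketch, assuming librosa is used for loading and resampling, might look as follows; the file path and the class label are hypothetical:

```python
import librosa

TARGET_SR = 20480   # sampling rate used in the paper
CLIP_SEC = 5        # clip duration in seconds

def load_clips(path, label):
    """Resample a ShipsEar record to 20,480 Hz and cut it into 5 s clips."""
    y, _ = librosa.load(path, sr=TARGET_SR)   # librosa resamples on load
    clip_len = CLIP_SEC * TARGET_SR
    return [(y[i:i + clip_len], label)
            for i in range(0, len(y) - clip_len + 1, clip_len)]

# Hypothetical usage; class letters follow the merged scheme of Table 1.
clips = load_clips("shipsear/record_01.wav", "A")
```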

Experimental Result
The proposed method is verified on a computer with four Nvidia GeForce RTX 2080Ti GPUs and a Core i7-6900K CPU. Furthermore, the deep learning framework of the proposed model is implemented using Keras 2.2.4 with TensorFlow 1.12.0 as the backend. When training the model, the batch size and the maximum number of epochs (each epoch includes one training cycle over all training data) are set to 128 and 200, respectively. To accelerate the training process, an early stopping strategy is used: training is stopped if the validation loss is not reduced by more than 0.00005 within 20 successive epochs. Besides, an adaptable learning rate strategy is adopted, where the initial value is 0.001 and the value is reduced to 60% of the former value every 20 epochs. Using such a strategy, the practical number of epochs used is 88. As shown in Figure 5a, the training loss and validation loss are rapidly reduced within about 20 epochs and then gradually decrease without overfitting, owing to the design of the adaptable learning rate. Meanwhile, Figure 5b shows that the accuracy improves at a varying speed: only small improvements are obtained after the turning point, and the best accuracy on the validation data is 0.946. We can observe that Figure 5a,b describe the same training process.
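In Keras, the early stopping and learning-rate schedule described above could be configured as follows; this is a plausible reading of the strategy rather than the authors' exact code:

```python
from keras.callbacks import EarlyStopping, LearningRateScheduler

# Stop when the validation loss fails to improve by more than 5e-5
# within 20 successive epochs.
early_stop = EarlyStopping(monitor='val_loss', min_delta=5e-5, patience=20)

# Start at 0.001 and decay to 60% of the former value every 20 epochs.
lr_schedule = LearningRateScheduler(lambda epoch: 1e-3 * 0.6 ** (epoch // 20))

# model.fit(x_train, y_train, batch_size=128, epochs=200,
#           validation_data=(x_val, y_val),
#           callbacks=[early_stop, lr_schedule])
```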
Figure 5. The training and validation accuracy and loss when training the model.

We evaluate our ResNet18 model with the three-dimensional feature. Twenty percent of the data is used as the validation set and 10% of the data is used as the test set, while each network is trained on 70% of the dataset with augmentation. Table 2 depicts the detailed results of the classification system in terms of precision, recall, and F1-score. It is clear that the recognition accuracy of each class is higher than 0.90 and the average precision, recall, and F1-score are all 0.943, where the support denotes the number of samples of each test class. For convenience of comparison, classifier performance is measured using the classification accuracy, defined as the average precision. The ability of the described classifier to identify different vessels is indicated by the fact that there is no confusion between the background noise class E and the four vessel classes A-D [16]. The classes with the best results are A (background noise) and B (fishing boats, trawlers, mussel boats, tugboats, and the dredger), with classification rates of 0.970 and 0.958, respectively. The poorest results are obtained for C (motorboats, pilot boats, and sailboats).
Although the acoustic dataset contains high background noise in shallow water, the overall performance is still satisfactory.  Figure 6 shows the confusion matrix of the proposed method on the ShipsEar dataset, where classes 0-4 denote classes A-E, respectively. Values along the diagonal indicate the number of samples classified correctly for each specific class. It shows that class C is the hardest class for the proposed classifier, while all other classes are well separated.
Figure 6. The confusion matrix of the proposed method.

It is worth noting that parameters such as the number of filters, the filter size, and the number of layers, as well as training hyperparameters such as the batch size, the initial learning rate, and the early stopping patience, are choices tuned through the training and validation process in the experiments. Selecting other parameters is also feasible; however, different structural parameters could incur a performance loss, while different training hyperparameters do not have a significant impact but exhibit slightly inferior performance.


Experimental Analysis
The effectiveness of three parts of the proposed approach is assessed, i.e., (1) the advantage of the feature extraction method, (2) the advantage of the ResNet18 model, and (3) the contribution of the embedding layer with the center loss function and softmax. Besides, a comparison between the performance of the described method and that of methods from the literature is presented.

Experiment A: The Advantage of the Feature Extraction Method
This experiment is designed to verify that the three-dimensional features used in the ResNet18 model yield better results than other feature extraction methods. As mentioned in [17], MFCC features have been experimentally proven to be among the best hand-crafted features for recognition, so we use MFCC features and their deltas (∆) and delta-deltas (∆²) for comparison. Besides, considering that LM is a simple but effective feature, we also analyze its performance. The results of the different feature extraction methods are shown in Table 3. From the results of MFCC and MFCC + ∆ + ∆², we can see that deltas and delta-deltas do not contribute to improving the average accuracy, even though the number of parameters increases due to the extra two dimensions. Using the LM feature extracted from the raw signal, we achieve an average accuracy of 0.906. Therefore, such a feature discriminates between classes similarly to MFCC features. Compared with MFCC and LM, the recognition accuracy of LM + MFCC improves by 0.027 and 0.005, respectively. An important remark is that LM + MFCC + CCTZ achieves the highest recognition accuracy of 0.943, which surpasses the average accuracy of LM + MFCC by 0.032, although the extraction of such a three-dimensional feature increases the computation time.
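For reference, the MFCC + ∆ + ∆² comparison feature could be computed with librosa as below; the clip path is hypothetical, and the hop length is an assumption chosen to match the 60 × 41 feature size:

```python
import numpy as np
import librosa

# Hypothetical 5 s clip; hop length chosen to give 41 frames at 20,480 Hz.
y, sr = librosa.load("clip.wav", sr=20480)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=60, hop_length=2560)  # 60 x 41
d1 = librosa.feature.delta(mfcc)            # first-order deltas
d2 = librosa.feature.delta(mfcc, order=2)   # second-order delta-deltas
feat = np.stack([mfcc, d1, d2], axis=-1)    # 60 x 41 x 3 comparison feature
```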

Experiment B: The Advantage of the ResNet18 Model
This experiment is designed to compare the performance of the described classifier with other typical models. Different networks, named "CNN-1, CNN-2, LSTM, CRNN, and ResNet18," are exploited for comparison.
As shown in Table 4, CNN-1 and CNN-2 share the same network structure, while their features are MFCC and LM + MFCC + CCTZ, respectively. The first layer exploits 32 filters of size 3 × 3 with the hyperbolic tangent ("tanh") activation function, followed by 2 × 2 max-pooling. The second layer uses 64 filters of size 3 × 3 with tanh activation, followed by 2 × 2 max-pooling. The last layer utilizes 128 filters of size 3 × 3 with tanh activation, followed by 2 × 2 max-pooling and batch normalization. Afterward, the fully connected dense layers consist of 1024, 128, and 5 nodes, respectively. The optimizer is Adam with a learning rate of 0.001 and a decay factor of 1 × 10⁻⁶. The batch size and the number of epochs are set to 128 and 50, respectively. With CNN-2 fed by the optimized feature of LM + MFCC + CCTZ, we achieve an average accuracy of 0.906, which surpasses the 0.845 of CNN-1. The LSTM network with the LM feature only achieves an average accuracy of 0.852. Its first two LSTM layers have 256 and 128 outputs with batch normalization, respectively, and the dropout is set to 0.2. Afterward, the fully connected part consists of 256 and 5 nodes, respectively. The optimizer is Adam.
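A hedged Keras sketch of the CNN-1/CNN-2 baseline follows; the activation of the two hidden dense layers is not stated in the text and is assumed to be ReLU:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Flatten, Dense
from keras.optimizers import Adam

model = Sequential([
    Conv2D(32, (3, 3), activation='tanh', input_shape=(60, 41, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='tanh'),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation='tanh'),
    MaxPooling2D((2, 2)),
    BatchNormalization(),
    Flatten(),
    Dense(1024, activation='relu'),   # activation assumed
    Dense(128, activation='relu'),    # activation assumed
    Dense(5, activation='softmax'),
])
model.compile(optimizer=Adam(lr=0.001, decay=1e-6),
              loss='categorical_crossentropy', metrics=['accuracy'])
```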
As for the CRNN network, the first layer has 64 filters of size (5,5) with L2 regularization (λ = 0.01) and ReLU activation; batch normalization is used, and the dropout rate is 0.25. The second and third layers are LSTM layers with 64 units each, also L2-regularized (λ = 0.01); the recurrent dropout rate is 0.5, the activation function is tanh, and the dropout rate is 0.25. The fourth and fifth layers are time-distributed layers with 128 and 64 nodes; their activation functions are ReLU, and the dropout is 0.25. The sixth layer is a dense layer with 5 nodes. The optimizer is Adam with a learning rate of 0.001 and a decay factor of 1 × 10⁻⁶. Such a method achieves an average accuracy of 0.885.
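The CRNN baseline could be sketched as follows; how the convolutional output is converted into a sequence for the LSTM layers is not specified in the text, so the Permute/Reshape step (treating the 37 remaining time frames as the sequence axis) is an assumption:

```python
from keras.models import Sequential
from keras.layers import (Conv2D, BatchNormalization, Dropout, Permute,
                          Reshape, LSTM, TimeDistributed, Dense, Flatten)
from keras.optimizers import Adam
from keras.regularizers import l2

model = Sequential([
    Conv2D(64, (5, 5), activation='relu', kernel_regularizer=l2(0.01),
           input_shape=(60, 41, 3)),                  # output: 56 x 37 x 64
    BatchNormalization(),
    Dropout(0.25),
    Permute((2, 1, 3)),                               # put the time axis first
    Reshape((37, 56 * 64)),                           # 37 steps of 3584 dims
    LSTM(64, return_sequences=True, activation='tanh',
         kernel_regularizer=l2(0.01), dropout=0.25, recurrent_dropout=0.5),
    LSTM(64, return_sequences=True, activation='tanh',
         kernel_regularizer=l2(0.01), dropout=0.25, recurrent_dropout=0.5),
    TimeDistributed(Dense(128, activation='relu')),
    Dropout(0.25),
    TimeDistributed(Dense(64, activation='relu')),
    Dropout(0.25),
    Flatten(),
    Dense(5, activation='softmax'),
])
model.compile(optimizer=Adam(lr=0.001, decay=1e-6),
              loss='categorical_crossentropy', metrics=['accuracy'])
```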
Table 4 gives better insight into the performance through the detailed per-class accuracy. The ResNet18 with the proposed feature extraction method has an advantage over the other methods.
Besides, we also compare the proposed method with other methods. The baseline is based on the study in [16], which shows that the basic machine learning method reaches an accuracy of 0.754. Besides, the accuracy achieved by the ResNet18 model, as well as that achieved by the other state-of-the-art approaches RBM + BP [18] and RSSD [4] described in the literature, is presented in Table 5. Our method achieves an accuracy of 0.943. These results indicate that our ResNet18 achieves a significant improvement in UATR with three-dimensional features as input. Given that the split of the dataset may differ between methods, the comparison is not fully rigorous. However, to our knowledge, the best accuracy reported for ShipsEar is obtained by the proposed method.

Model            Accuracy
Baseline [16]    0.754
RBM + BP [18]    0.932
RSSD [4]         0.933
ResNet18 + 3D    0.943

Experiment C: The Contribution of the Embedding Layer with the Center Loss Function and Softmax

In this section, the effect of different loss functions is investigated. Although softmax is the most frequently used cost function, using softmax alone easily leads to overfitting: to reduce the cost, training tends to simply increase the magnitude of the outputs fed to softmax. To penalize such behavior, another loss function named "uniform loss" is a possible choice, which additionally tries to fit the uniform distribution and is defined as:

L = L_S + λ ∑_{i=1}^{N} D(p_i, u),

where L_S denotes the softmax loss as defined above, p_i is the predicted class distribution of sample x_i, and u = (1/Y, . . . , 1/Y) denotes the uniform distribution. Note that λ is a balance parameter within [0, 1] and is set to 0.3 in all the classification experiments. Table 6 summarizes the impact of the different loss functions on the mean accuracy. The results indicate that selecting a good loss function gives better classification results. Using the center loss leads to further improvements of 3.3% and 0.9% in average accuracy compared with softmax and uniform loss, respectively. It can be seen in Table 6 that although the accuracy of class A with the center loss model is slightly worse than with the others, the average performance is better. We can conclude that the performance is improved by utilizing the embedding layer with the center loss function and softmax.
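A small numpy sketch of the uniform loss described above, assuming D is the squared Euclidean distance between the predicted distribution and the uniform one (the paper does not state the exact form of D):

```python
import numpy as np

def uniform_loss(logits, labels, lam=0.3):
    """Softmax cross-entropy plus a pull toward the uniform distribution."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    ce = -np.log(p[np.arange(len(labels)), labels])  # softmax loss per sample
    u = 1.0 / p.shape[1]                             # uniform probability 1/Y
    reg = np.sum((p - u) ** 2, axis=1)               # D(p_i, u), assumed form
    return np.mean(ce + lam * reg)
```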

Conclusions
In this paper, we designed a Residual Network called ResNet18 and an optimized feature extraction method. The three-dimensional fusion features, along with the SpecAugment data augmentation strategy, showed their advantages. Besides, the center loss function and softmax with the embedding layer also contributed to the best performance when training the aggregated features with an adaptable learning rate. The accuracy of 94.3% on the ShipsEar dataset indicates that the proposed method achieves state-of-the-art accuracy. The results show that the proposed method can extract high-level abstract information that is beneficial to classification. The proposed method provides good technical support for the target classification and recognition function of sonar systems.