A Lightweight Network Model Based on an Attention Mechanism for Ship-Radiated Noise Classification

Abstract: Recently, deep learning has been widely used in ship-radiated noise classification. Avoiding high computational costs is an important research direction for improving classification efficiency. We propose a lightweight squeeze and excitation residual network 10 (LW-SEResNet10). In ablation experiments of LW-SEResNet10, using ResNet10 instead of ResNet18 reduced the number of parameters by 56.1% while keeping the accuracy equivalent to that of ResNet18. The improved accuracy indicates that ReLU6 enhanced the model stability and that the attention mechanism captured the channel dependence. The ReLU6 activation function does not introduce additional parameters, and the parameters introduced by the attention mechanism account for 0.2‰ of the model parameters. The 3D dynamic MFCC feature performs better than MFCC, Mel-spectrogram, 3D dynamic Mel-spectrogram, and CQT. Moreover, the LW-SEResNet10 model is compared with ResNet and two classic lightweight models. The experimental results show that the proposed model achieves higher classification accuracy and is lightweight in terms of not only the model parameters but also the time consumption. LW-SEResNet10 also outperforms the state-of-the-art model CRNN-9 by 3.1% and ResNet by 3.4%, and it matches the accuracy of the AudioSet-pretrained STM, achieving a trade-off between accuracy and model efficiency.


Introduction
Ship-radiated noise includes airborne noise and underwater noise. Airborne noise mainly influences human health [1,2]. For military and security reasons, we pay more attention to underwater noise. At present, applying deep learning methods to ship-radiated noise feature learning and classification has become a hot research topic [3–8]. Ship-radiated noise feature learning and classification is an important research direction of underwater acoustic target recognition. From the perspective of sound generation mechanisms, the target radiated noise is considered to be mainly composed of mechanical noise, propeller noise, and hydrodynamic noise [9]. The differences in the radiated noise of different target types are reflected in the following aspects: (1) mechanical noise generated by different shipborne mechanical equipment; (2) propeller radiation characteristics that differ with the propeller parameters; and (3) hydrodynamic noise that differs with the hull structure. Therefore, ship-radiated noise can reflect ship attributes and be used to classify ship categories. Compared with shallow models, deep neural networks can learn more abstract and invariant features from a large dataset [10,11]. Deep learning can therefore not only learn feature representations automatically from the raw signal, but also perform further deep feature extraction and even feature fusion based on hand-crafted features such as the Mel frequency cepstral coefficient (MFCC) [12], constant Q transform (CQT) [13], wavelet features [14,15], DEMON and LOFAR spectra [16], and high-order spectral features [17,18]. These traditional feature analysis methods effectively reduce information redundancy and the computational cost of the back-end model.
Due to the difficulty and high cost of marine experiments, effective samples of ship-radiated noise data are insufficient [19]. Insufficient data leads to overfitting of large-scale deep network models, which then converge poorly, ultimately affecting the classification accuracy [20]. To solve this problem, Gao et al. [21] used a deep convolutional generative adversarial network (DCGAN) to expand the training set, improving the classification effect. Jiang et al. [22] proposed a modified DCGAN model to augment data for targets with a small sample size. Using a GAN to generate ship-radiated noise data can effectively address the scarcity of samples, but the training of generative networks is time-consuming. Yang et al. [12] proposed an improved competitive deep belief network (DBN), which addresses the problem of insufficient training samples by pre-training the DBN with a large amount of unlabeled ship-radiated noise. Jin et al. [23] used a CNN pre-trained on the ImageNet dataset [24] and fine-tuned the network with a small sample of fish image data to effectively solve an underwater image classification problem. These pre-training methods, as transfer learning methods, require considering the similarity between data, tasks, or models and need to preload model parameters. A measurement of similarity needs to be defined. Negative transfer may occur when the source domain data and target domain data are not similar or when the deep model is not good enough to find a transferable feature. In our study, a lightweight network is designed to improve classification accuracy under small-sample conditions.
The large-scale residual network (ResNet) [25] has been found to be redundant in the field of computer vision. Gao et al. [26] randomly removed many layers of ResNet during the training process, which did not affect the convergence of the algorithm, and the removal of the middle layers had little effect on the final results, illustrating that ResNet has redundancy. For underwater acoustic target recognition, a depth search experiment for a multiscale residual deep neural network (MSRDN) [27] was conducted. The results show that the original MSRDN with a depth of 101 layers is redundant. Xue et al. [28] observed that the recognition rate decreases as the number of residual layers increases, which also indicates the redundancy of ResNet. Therefore, it is feasible to reduce the model parameters while maintaining the model performance. In our work, the model parameters are reduced by shrinking the number of residual units in ResNet.
Meanwhile, large-scale deep models have the problem of high computation costs [29]. For practical applications, a trade-off between accuracy and model efficiency is necessary. Efficiency here means a lower computation cost or time cost. To develop efficient deep models, recent works in the field of computer vision usually focus on structural design [30,31], low-rank factorization [32], and knowledge distillation [33,34]. For underwater acoustic target recognition, Lei et al. [35] proposed that avoiding high computational costs is an important future direction of underwater acoustic information processing. Jiang et al. [22] proposed the S-ResNet model, which obtains good classification accuracy while significantly reducing the complexity of the model, achieving a good trade-off between classification accuracy and model complexity. In that work, the parameters and floating-point operations (FLOPs) of the model are used to measure the model's complexity. However, on actual equipment, due to a variety of optimized calculation operations, the theoretical parameters and FLOPs cannot accurately measure the actual time consumption of the model [31]. Therefore, in our study, in addition to using the theoretical parameters, we also measure the complexity of the model according to the actual time consumption. Tian et al. [36] designed a lightweight MSRDN using lightweight network design techniques, in which 64.18% of parameters and 79.45% of FLOPs are removed from the original MSRDN with a small loss of accuracy. They also measured the time cost under the same hardware and software platforms. In our study, we measure time costs on different platforms.
Our study utilizes the structural design technique to design our lightweight network. Lightweight networks are defined as having fewer model parameters or faster run times. The proposed model, namely lightweight squeeze and excitation residual network 10 (LW-SEResNet10), aims to inhibit overfitting and achieve high accuracy and high efficiency.
Firstly, shrinking the number of residual units in the ResNet reduces the number of parameters. Secondly, the attention mechanism called "squeeze-excitation" (SE) block [37] with low parameters is introduced into the proposed model. The attention mechanism [38] can help the network give different weights to each part of the input features, extracting more critical and important information. The attention mechanism is integrated into a residual unit structure, which helps to capture the correlation between features, and the representation generated by convolution networks can be strengthened. Thirdly, the ReLU6 activation function [39] is employed to increase the model stability. The ReLU6 activation function does not introduce additional parameters. Moreover, the 3D dynamic MFCC feature is used as the input of the proposed model. The 3D dynamic MFCC feature effectively compresses the raw time-domain information of the target radiated noise signal, while extracting the higher-order dynamic time information of the signal. To verify the lightweight nature and superiority of our proposed model, we compare the proposed model with the ResNet and the classical lightweight network models MobileNet V2 [30] and ShuffleNet V2 [31] in the field of computer vision in terms of parameters, time consumption, accuracy, and noise mismatch.
The remainder of this article is organized as follows. Section 2 provides an overview of our ship-radiated noise classification method in detail. Experiments are presented in Section 3, and Section 4 concludes this article.

System Overview
This section describes the proposed ship-radiated noise classification framework. The first part introduces the proposed lightweight model. The second part introduces the extraction method of the 3D dynamic MFCC feature.
ResNet [25] was proposed to deal with the degradation of deep neural networks. In contrast with ordinary neural networks, the ResNet model implements a cross-layer connection through the residual unit structure. The architecture of the residual unit is shown in Figure 1. The basic residual unit is shown in Figure 1a. The residual unit contains two types of connections: one is a nonlinear mapping connection similar to an ordinary neural network, which generally consists of two to three convolutional layers, and the other is a short-cut connection. Denoting the input of a residual unit as x, the nonlinear mapping (i.e., the residual mapping) as F(x), and the computed result of the residual unit as H(x), their relationship can be expressed as:
$$H(x) = F(x; w_1, w_2, \ldots, w_N) + x$$
where $w_1, w_2, \ldots, w_N$ are the weights of the convolutional layers and δ is the ReLU activation function applied between them (for a two-layer unit, $F(x) = w_2\,\delta(w_1 x)$). When the residual unit performs backpropagation, the gradient is expressed as:
$$\frac{\partial H(x)}{\partial x} = \frac{\partial F(x; w_1, w_2, \ldots, w_N)}{\partial x} + 1$$
Due to the constant 1, the vanishing of gradients during backpropagation is avoided. ResNet learns F(x) + x by iterative training, rather than learning H(x) directly. Learning the residual F(x) converges more easily than learning the mapping between x and H(x) directly, and can achieve higher classification accuracy. Figure 1b shows the downsampled residual unit, with dashed lines indicating short-cut connections. In ResNet, not all residual units have pooling layers, so a convolutional layer is needed to implement downsampling. This is implemented by setting the stride of the convolutional layer to 2 (s = 2) to change the shape of the residual mapping.
Meanwhile, since the two branches of a residual unit are summed, the shape and dimension of the input x and the residual mapping F(x) must be consistent. When the residual mapping F(x) is downsampled, x must also be downsampled in the short-cut connection, which is implemented by a 1 × 1 convolution layer with a stride of 2 (s = 2); the downsampled x and F(x) are then added.
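The shape constraint on the downsampled residual unit can be checked with simple arithmetic. The sketch below is only an illustration: the input size of 32 × 54 and the padding of 1 on the 3 × 3 convolution are assumptions, not values taken from the paper. It shows that a stride-2 3 × 3 convolution in the residual branch and a stride-2 1 × 1 convolution in the short-cut branch produce the same spatial shape, whereas an identity short-cut would not.

```python
def conv_out_size(n, kernel, stride, padding):
    """Output length of a convolution along one axis (floor convention)."""
    return (n + 2 * padding - kernel) // stride + 1

# Hypothetical feature-map size entering a downsampled residual unit.
height, width = 32, 54

# Residual branch: 3x3 convolution, stride 2 (padding 1 is an assumption).
residual_shape = (conv_out_size(height, 3, 2, 1), conv_out_size(width, 3, 2, 1))

# Short-cut branch: 1x1 convolution, stride 2, no padding.
shortcut_shape = (conv_out_size(height, 1, 2, 0), conv_out_size(width, 1, 2, 0))

# An identity short-cut keeps the input shape and could not be added to F(x).
identity_shape = (height, width)

print(residual_shape, shortcut_shape, identity_shape)
```

Both branches halve each spatial dimension, so the element-wise sum of the downsampled x and F(x) is well defined.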

The Proposed Lightweight Squeeze and Excitation Residual Network 10 (LW-SEResNet10)
In this study, ResNet18 is shrunk in order to reduce the number of parameters. The proposed model is shown in Figure 2. The 18-layer ResNet is reduced to 10 layers (nine convolutional layers and one fully connected layer). The input is the extracted 3D dynamic MFCC feature, and the classification layer is a fully connected (FC) layer with LogSoftmax, which outputs the distribution of each sample over all classes as the basis for judging the sample class. The LogSoftmax function can be expressed as follows:
$$\mathrm{LogSoftmax}(x_i) = \log\left(\frac{\exp(x_i)}{\sum_{j=1}^{N} \exp(x_j)}\right)$$
where x denotes the output of the fully connected layer, whose dimension N corresponds to the number of classes. $\mathrm{LogSoftmax}(x_i)$ is the log-probability that the predicted sample belongs to class i. Taking the logarithm of Softmax turns multiplication into addition, which reduces the amount of calculation while preserving the monotonicity of the function.
In Figure 2, the nine convolutional layers are named Conv1 to Conv9, k denotes the size of the convolutional kernel, s denotes the stride, and 64, 128, 256, and 512 are the numbers of convolutional kernels. Max pooling (Maxpool) and average pooling (Avgpool) are used for downsampling. Batch normalization (BN) is applied after each convolutional layer. By normalizing the data of each batch, the network convergence is accelerated while vanishing and exploding gradients are prevented. Because ReLU is linear in the region x > 0, activations may become too large and affect the stability of the proposed model. To offset the linear growth of the ReLU activation function, this paper uses the ReLU6 [39] activation function instead of ReLU. ReLU6 limits linear activation to the range of 0 to 6, preventing the values from exploding. The ReLU6 activation function can be expressed as follows:
$$\mathrm{ReLU6}(x) = \min(\max(0, x), 6)$$
In addition, the proposed model integrates the SE block [37] as an attention mechanism after the Conv2 layer to capture the channel dependencies; the specific framework is shown in Figure 3. Firstly, a global average pooling operation [40] aggregates the feature maps (the output of the Conv2 layer) to generate channel statistics, which is named the "squeeze" operation. The "squeeze" formula is expressed as follows:
$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$
where $z \in \mathbb{R}^{C}$ denotes the channel statistics, $z_c$ is its c-th element, $u_c$ is the c-th feature map of spatial size H × W, and C is the number of channels.
Secondly, the "excitation" operation captures the channel dependence across all feature maps of the ship-radiated noise signals. Two fully connected operations (FC1 layer and FC2 layer) with the nonlinear activation function ReLU6 encode and decode the channel statistics, respectively. Acting like an unsupervised autoencoder, these operations reconstruct the channel statistics adaptively, which represents the channel information effectively. A sigmoid activation function is then applied to normalize the reconstructed channel statistics. The "excitation" formula is expressed as follows:
$$s = F_{ex}(z, W) = \sigma(W_2\,\delta(W_1 z))$$
where s denotes the channel statistics after the "excitation" operation, δ denotes the ReLU6 function, σ denotes a sigmoid activation function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$. r denotes the reduction ratio, which is set to 8.
The "excitation" operation can be regarded as a soft threshold, similar to the gate mechanism of long short-term memory, that implements the "forget" and "memory" functions of the network. Finally, the channel statistics s are multiplied channel-wise with U (the feature maps after the Conv2 layer) to recalibrate the channel weights of U. The soft threshold mechanism highlights the weight of important information in the channel statistics. These two operations can be regarded as an attention mechanism for channel information.
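The squeeze, excitation, and recalibration steps can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: the feature maps, the weight matrices `W1` and `W2`, and the sizes (C = 4, r = 2) are made-up values, and biases are omitted.

```python
import math

def relu6(v):
    return min(max(0.0, v), 6.0)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def se_block(U, W1, W2):
    """U: list of C feature maps (each an H x W nested list).
    Returns the recalibrated maps and the per-channel weights s."""
    # Squeeze: global average pooling gives one statistic per channel.
    z = [sum(sum(row) for row in u) / (len(u) * len(u[0])) for u in U]
    # Excitation: FC1 -> ReLU6 -> FC2 -> sigmoid (biases omitted).
    hidden = [relu6(v) for v in matvec(W1, z)]
    s = [sigmoid(v) for v in matvec(W2, hidden)]
    # Recalibration: every element of channel c is multiplied by s[c].
    out = [[[s_c * v for v in row] for row in u] for u, s_c in zip(U, s)]
    return out, s

# Toy input with C = 4 channels of size 2 x 2 and reduction ratio r = 2.
U = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 0.0]],
     [[1.0, 1.0], [1.0, 1.0]],
     [[2.0, 2.0], [2.0, 2.0]]]
W1 = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]         # shape (C/r) x C
W2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]   # shape C x (C/r)
scaled, weights = se_block(U, W1, W2)
print(weights)
```

Each weight lies strictly between 0 and 1, so the block can attenuate or largely preserve a channel but never changes its sign, which matches the soft-threshold reading given above.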
The SE block introduces additional parameters only through the two fully connected layers of the gating mechanism, which occupy only a small part of the capacity of the network model. Without considering the bias, the total number of weight parameters introduced by the two fully connected layers can be expressed by the following equation:
$$\frac{2}{r} \sum_{s=1}^{S} C_s^2$$
where r is the reduction ratio, S is the number of residual units (s = 1, 2, ..., S), and $C_s$ is the dimension of the output channels of unit s. In the proposed model, the SE block is added only after the Conv2 layer in the first residual unit. Since the Conv2 layer outputs 64 channels and the reduction ratio is set to 8, the number of additional parameters introduced is 1024, or about 0.001M.
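The parameter count can be verified directly. The sketch below evaluates the formula for the single SE block used in the proposed model, and, as a hypothetical comparison, for one block at each of the four channel widths (64, 128, 256, 512); the latter configuration is an assumption made only to show how the cost grows with the channel dimension.

```python
def se_param_count(channels_per_block, reduction):
    """Weights (no biases) of the two FC layers of each SE block:
    C * (C / r) + (C / r) * C = 2 * C^2 / r."""
    return sum(2 * c * c // reduction for c in channels_per_block)

# One SE block after Conv2 (64 channels, r = 8), as in the proposed model.
single_block = se_param_count([64], 8)

# Hypothetical variant: one SE block per channel width.
four_blocks = se_param_count([64, 128, 256, 512], 8)

print(single_block, four_blocks)
```

The cost grows quadratically with the channel width, which is why placing the block after an early 64-channel layer is the cheapest option.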

Feature Extraction
In contrast with image information, ship-radiated noise signals are nonstationary time sequences that vary randomly with time. Using the time-domain signal directly as the input of a network model gives an end-to-end method that simplifies the classification procedure, but it places a much higher computation cost on the back-end model. Extracting features in advance in the network front-end can greatly reduce the computational cost of the back-end model.
In this study, the 3D dynamic MFCC is applied as the input of our proposed network models. The extraction procedure is shown in Figure 4. The first step is to extract the MFCC feature. The frame length is set to 2048 samples, and adjacent frames overlap by 75% of the frame length. A Hanning window whose length equals the frame length is applied to each frame before the Fourier transform. A short-time Fourier transform is applied to each frame, and the power spectrum is obtained by taking the squared magnitude. The short-term power spectrum is a comprehensive characterization of ship-radiated noise, including 2D information in the frequency and time domains. The short-term power spectrum of each frame is filtered by a bank of 128 Mel filters, and the logarithm is taken to obtain the Mel-spectrogram. The Mel scale is commonly used because it fits human hearing, which is approximately linear in frequency below 1000 Hz and logarithmic above 1000 Hz [41]. The MFCC feature is obtained by applying the discrete cosine transform to the logarithmic Mel-spectrogram [42]. The shape of the MFCC is (128 × N), where N is the number of frames. For a 5 s acoustic signal with a sample rate of 22,050 Hz, the shape of the MFCC is (128 × 216).
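The frame count of 216 follows from the framing parameters. The short calculation below reproduces it under one assumption that the text does not state: that the signal is padded so that frames are centered on their time stamps, a convention some audio toolkits use by default. Without padding, the same parameters would give a different count.

```python
frame_length = 2048                                  # samples per frame
overlap = 0.75                                       # fraction shared by adjacent frames
hop_length = int(frame_length * (1 - overlap))       # step between frame starts

sample_rate = 22050                                  # Hz
duration_s = 5
num_samples = sample_rate * duration_s

# Assumed convention: centered frames with padding at both ends.
num_frames_centered = 1 + num_samples // hop_length

# Alternative convention: only frames that fit entirely inside the signal.
num_frames_unpadded = 1 + (num_samples - frame_length) // hop_length

print(hop_length, num_frames_centered, num_frames_unpadded)
```

Only the centered convention reproduces the stated shape of (128 × 216), which suggests, but does not prove, that this is the convention behind the reported dimensions.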
The MFCC feature is static. To add dynamic information to the static MFCC feature, we add the delta feature and double-delta feature to form a multi-dimensional dynamic feature; both are obtained by a local estimate of the difference of the input MFCC feature along the time axis. The delta feature and double-delta feature provide information on the dynamics of the feature over time. Assuming that the MFCC at frame t is $C_t$, the corresponding delta-spectral feature $D_t$ is defined following [43], where m denotes the number of adjacent frames: $D_t$, the delta coefficient of the MFCC at frame t, is calculated from the static coefficients $C_{t+m}$ and $C_{t-m}$. Similarly, the double-delta MFCC is defined by a subsequent delta operation on the delta MFCC. The extracted MFCC, delta MFCC, and double-delta MFCC are combined to obtain the 3D dynamic MFCC. The final input feature shape of the proposed network models is (128 × 216 × 3). Figure 5 shows the time-domain waveform and the extracted MFCC, delta MFCC, and double-delta MFCC features of a sailboat's radiated noise in the ShipsEar [19] database.
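To make the construction of the three-channel feature concrete, here is a minimal sketch. It assumes the simplest difference form, $D_t = C_{t+m} - C_{t-m}$ with edge frames clamped; the exact definition in [43] may include weighting or normalization, so the function `delta`, the value m = 2, and the toy data are illustrative assumptions only.

```python
def delta(frames, m=2):
    """frames: list over time of coefficient vectors.
    Returns C[t+m] - C[t-m] per frame, clamping indices at the edges
    (assumed form; the definition in [43] may differ)."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        ahead = frames[min(t + m, last)]
        behind = frames[max(t - m, 0)]
        out.append([a - b for a, b in zip(ahead, behind)])
    return out

# Toy static feature: 6 frames x 2 coefficients.
# The first coefficient grows linearly in time, the second is constant.
static = [[float(t), 1.0] for t in range(6)]

d1 = delta(static)   # delta feature
d2 = delta(d1)       # double-delta: the same operation applied to the delta

# Stack the three features as channels. This toy layout is
# frames x coefficients x 3, whereas the paper uses 128 x 216 x 3.
dynamic = [[(s, a, b) for s, a, b in zip(fs, fa, fb)]
           for fs, fa, fb in zip(static, d1, d2)]
print(d1)
```

The constant coefficient yields a zero delta everywhere, while the linearly growing one yields a positive delta, which is the time-varying information that the static MFCC alone does not expose.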

Experimental Data
We used the ShipsEar [19] database to evaluate the performance of the proposed models. ShipsEar is a database of real ship-radiated noise recordings made on the Spanish Atlantic coast. Access to all data was permitted by the database authors. The database contains a total of 90 recordings of 11 vessel types and one background noise class. The 11 vessel types are combined into four classes, each of which contains one or more vessel types. The details are listed in Table 1. The database is preprocessed as follows. All recordings are resampled to 22,050 Hz. We segment all signals into fixed-length frames of 5 s, which results in 2223 labeled sound samples. The next step is to divide the sample sets. Because the imbalance of the ShipsEar samples would degrade the model accuracy, the classes in each sample set are evenly distributed when dividing the sample sets. The total of 2223 samples is divided into the training set, validation set, and testing set according to the ratio of 7:2:1, with sample sizes of 1556, 445, and 222, respectively.
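One way to keep the class proportions equal across the three sets is a stratified split, sketched below. This is an interpretation of the procedure, not the authors' code: the function `stratified_split`, the seed, and the toy label counts are hypothetical. The last lines check that rounding a 7:2:1 split of 2223 samples gives the reported set sizes.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split sample indices into train/val/test so that every class
    contributes to each set in (approximately) the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to the test set
    return train, val, test

# Hypothetical imbalanced label list (not the real ShipsEar counts).
labels = ["A"] * 100 + ["B"] * 50 + ["E"] * 20
train, val, test = stratified_split(labels)

# Set sizes implied by a 7:2:1 split of the 2223 samples.
total = 2223
n_train, n_val = round(0.7 * total), round(0.2 * total)
sizes = (n_train, n_val, total - n_train - n_val)
print(len(train), len(val), len(test), sizes)
```

Splitting per class rather than over the pooled samples prevents a rare class from being absent from the validation or testing set.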

Hyperparameter and Cost Function Setup
The models are trained with the stochastic gradient descent (SGD) optimizer [44] with momentum (set to 0.9) and L2 regularization (set to $4 \times 10^{-5}$), which effectively suppresses sample noise interference. Training runs for 30 epochs. The learning rate during training is the initial learning rate (set to 0.001) multiplied by the cosine learning rate decay function [45], which speeds up training. The minibatch size is set to 4. The cross-entropy error [46] is used as the cost function.
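The learning-rate schedule can be sketched as follows. The form used here, a half-cosine from the initial rate down to zero over the 30 epochs with per-epoch updates and no warm-up or restarts, is an assumption; the paper only states that the initial rate is multiplied by a cosine decay function [45].

```python
import math

def cosine_lr(epoch, total_epochs=30, base_lr=0.001):
    """Assumed schedule: base_lr * 0.5 * (1 + cos(pi * epoch / total_epochs))."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

# Learning rate used in each of the 30 training epochs.
schedule = [cosine_lr(e) for e in range(30)]
print(schedule[0], schedule[15], schedule[29])
```

The rate stays close to its initial value in the first epochs and decays smoothly towards zero at the end, reaching half the initial value at the midpoint.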

Evaluation Metric
The performance of all neural models used in this study is evaluated by accuracy, which is computed by the following expression:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP is the number of positive samples predicted to be positive, FN is the number of positive samples predicted to be negative, FP is the number of negative samples predicted to be positive, and TN is the number of negative samples predicted to be negative. For the noise mismatch experiment, the performance of all neural models is evaluated by F1-score, which is computed as:
$$F_1 = \frac{2 \times P \times R}{P + R}$$
where P is Precision and R is Recall. The F1-score is the harmonic mean of Precision and Recall.
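Both metrics are easy to compute from the four confusion counts, as the sketch below shows. The counts are made up for illustration, and Precision and Recall use their standard definitions, TP / (TP + FP) and TP / (TP + FN), which the text does not spell out.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts for one class.
acc = accuracy(tp=40, tn=50, fp=5, fn=5)

# Precision 0.75 and recall 0.5: the harmonic mean is pulled
# towards the smaller of the two values.
f1 = f1_score(tp=30, fp=10, fn=30)
arithmetic_mean = (0.75 + 0.5) / 2
print(acc, f1, arithmetic_mean)
```

Because the harmonic mean penalizes a large gap between Precision and Recall, the F1-score is a stricter summary than their arithmetic mean.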

Experimental Results
The experiments were run on the Ubuntu 18.04.1 x64 operating system with an Intel(R) Core(TM) i9-9920X CPU @ 3.50 GHz and an NVIDIA GeForce GTX 2080 Ti GPU. To provide an efficient implementation, the proposed model (together with the other models) is parallelized on the graphics processing unit (GPU) using CUDA and the NVIDIA CUDA® Deep Neural Network library (cuDNN) 7.6.3 with the PyTorch 1.7 framework. The experimental results are discussed in this section.

Ablation Experiments
To demonstrate the performance of the proposed model, we conducted some ablation experiments.

• Model ablation experiments: The essence of the attention mechanism is to recalibrate the original feature map by capturing the channel dependence of the feature map. In the proposed model, the attention mechanism is differentiable, and the attention weights can be updated by the backpropagation algorithm. Therefore, the attention mechanism is highly portable and can be integrated after any of the convolutional layers (Conv1 to Conv9) in the proposed model. Table 2 compares the effect of the position of the attention mechanism on the proposed model. The validation accuracy is employed to verify model performance. Since the testing set is not involved in model training, the testing accuracy can evaluate the model performance objectively.
[Table 2; only the last row is recoverable: LW-SEResNet10 (3,5,7,9), 0.960, 0.964, 4.765M]
In Table 2, the (1) to (9) after the model name indicate the position of the attention mechanism after the corresponding convolutional layer (Conv1 to Conv9). We also investigated the effect of adding the attention mechanism at multiple locations on the model performance. LW-SEResNet10 (2,4,6,8) represents adding four attention mechanisms between the two convolutional layers of all residual units. LW-SEResNet10 (3,5,7,9) represents adding four attention mechanisms after the two convolutional layers of all residual units. As can be seen from Table 2, LW-SEResNet10 (2) and LW-SEResNet10 (9) achieve the best validation accuracy. LW-SEResNet10 (2), LW-SEResNet10 (3), LW-SEResNet10 (6), LW-SEResNet10 (7), and LW-SEResNet10 (2,4,6,8) achieve the best testing accuracy. LW-SEResNet10 (1), LW-SEResNet10 (2), and LW-SEResNet10 (3) have the fewest parameters. Therefore, LW-SEResNet10 (2) has the best validation accuracy and testing accuracy together with the lowest number of model parameters, making it the most efficient structure.
Further, the experimental results show that the addition of the multiple attention mechanisms not only fails to yield a gain in accuracy, but also increases the number of model parameters.
As can be seen from Table 3, compared with ResNet18, ResNet10 shows no significant change in validation accuracy or testing accuracy while reducing the number of parameters by 56.1%, which indicates that ResNet18 is redundant on the ShipsEar dataset. The effect of the attention mechanism and the ReLU6 activation function on the performance of the proposed model is also shown in Table 3. The attention mechanism introduces 0.001M parameters, accounting for 0.2‰ of the model parameters. The training process of the four models is shown in Figure 6. It can be seen that adding the ReLU6 activation function and adding the attention mechanism each inhibit the overfitting of the proposed model. The attention mechanism adaptively recalibrates the extracted deep features, which enhances their stability.

• Feature ablation experiments: We make a horizontal comparison of the static Mel-spectrogram and MFCC features and their 3D dynamic counterparts. Table 4 compares the accuracy of the proposed model under different features. The experimental results show that the four Mel-filtered time-frequency features can reveal the inherent attributes of the target signals, making the targets separable. The proposed model combined with the 3D dynamic MFCC feature has the highest classification accuracy. The MFCC feature simulates the auditory characteristics of human ears and has good classification performance. Considering the complexity of the marine environment, the radiated noise in the target signals is nonstationary. The delta feature and double-delta feature extract the correlation of adjacent MFCC time frames, capture the time-varying characteristics in the complex marine environment, and show good performance in our classification task. The classification accuracy of the proposed model under the CQT [47] feature is given in Table 5. The 64, 84, and 120 under the dimension in Table 5 denote the number of frequency bins of CQT, while the 216 denotes the number of frames. It can be seen that the impact of different dimensions on accuracy is small. The logarithmic scale is used for CQT, which significantly improves the accuracy. The best classification effect is obtained when the sample dimension is 84 × 216 × 1. The experimental results are consistent with the literature [48], that is, the CQT feature is better than the Mel-spectrogram feature. From the overall results in Tables 4 and 5, it can be observed that the 3D dynamic MFCC feature gives better overall accuracy than the other features, with an accuracy of 0.964.

Comparison Experiments
In this part, the proposed model is compared with ResNet and two classical lightweight network models in terms of the model parameters, time consumption, and classification accuracy.

•
The comparison between parameters and accuracy: We compare the number of model parameters and the accuracy. Table 6 shows the testing accuracy of the multiple network models under different features, and Table 7 shows the number of parameters of these models. MobileNetV2 and ShuffleNetV2 are two classic lightweight neural networks that use depth-wise separable convolution to reduce the model parameters. The numbers after the MobileNetV2 and ShuffleNetV2 in Tables 6 and 7 [...]

For a given neural network model, the accuracy obtained with different features reflects how strongly the model depends on particular features. A good classification model does not depend on one particular feature, that is, it is robust [35]. Figure 7 shows the testing accuracy of the multiple network models under different features. Taking the ResNet models as an example, the accuracy of ResNet50 varies greatly across features, whereas the accuracy of ResNet18 varies only a little. Therefore, ResNet18 has low dependence on these features. The proposed model behaves similarly to ResNet18 in terms of feature dependence. In addition, it can be seen that the best-matching feature differs from model to model. The MFCC-based features generally perform better than the Mel-spectrogram-based features.



•
The comparison between time consumptions: For a comprehensive view of the time consumption, we also conducted a set of experiments on the central processing unit (CPU). Figures 8 and 9 show the training time (averaged over 30 epochs) and the inference time, respectively, of the multiple network models on the CPU and the GPU. Training time is the time the model takes to complete one epoch over the training set and the validation set. Inference time is the time consumed on the testing set. In Figure 8, the proposed model has the shortest training time on the GPU and a training time similar to that of ShuffleNetV2 (1.0) on the CPU. In Figure 9, the proposed model has the shortest inference time on the GPU and an inference time similar to that of ShuffleNetV2 (1.5) on the CPU.

Figures 8 and 9 also show that, on the GPU, ShuffleNetV2 and MobileNetV2 take longer than ResNet18 despite having fewer parameters. One important reason is that ShuffleNetV2 and MobileNetV2 employ depth-wise separable convolution to reduce the model parameters. Depth-wise separable convolution divides a convolutional operation into a depth-wise convolution layer and multiple point-wise convolution layers, which increases the number of convolutional layers. The CPU generally computes serially, so the higher cache hit rate obtained in exchange for the increased number of layers speeds up the computation. On a GPU, however, parallel computation with sufficiently large video memory does not speed up depth-wise separable convolution. Therefore, ShuffleNetV2 and MobileNetV2 are better suited to implementation on the CPU. The proposed LW-SEResNet10 performs efficiently on both the CPU and the GPU.
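The parameter saving and the extra layer that come with depth-wise separable convolution can be made concrete with a short calculation. The sketch below assumes the usual decomposition into one k × k depth-wise convolution followed by one 1 × 1 point-wise convolution, ignores biases, and uses hypothetical channel counts that are not taken from Tables 6 and 7.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in, c_out, k):
    """Weights of a depth-wise k x k convolution plus a 1 x 1 point-wise
    convolution (bias omitted)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 128 output channels, 3 x 3 kernel.
standard = standard_conv_params(64, 128, 3)     # 73728 weights, one layer
separable = separable_conv_params(64, 128, 3)   # 8768 weights, two layers
```

For this hypothetical layer the separable form needs roughly 8.4 times fewer weights, but it replaces one convolutional layer with two sequential ones, which is the extra depth discussed above.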

•
Optimization and comparing the performance of various models: Comparing the performance of various models on the ShipsEar database, we observe that LW-SEResNet10 performs somewhat worse than the newest STM + AudioSet [8]. To further optimize the proposed model, we use the adaptive moment estimation (Adam) optimizer [50], as in [8,17,51]. The optimizer uses L2 regularization (set to 4 × 10⁻⁵). With this setting, LW-SEResNet10 achieves an accuracy of 0.977, which matches that of the newest STM + AudioSet. We can see in Table 8 [...]

In addition, we also compared the parameters of LW-SEResNet10 and STM. STM + AudioSet initializes the model from an already trained network, namely the AST model [49] pre-trained on the AudioSet dataset. As mentioned above, the proposed model has 5.4% of the parameters of AST. To sum up, LW-SEResNet10 matches the accuracy of STM + AudioSet, significantly reduces the computation cost of the model, and realizes a trade-off between accuracy and model efficiency.

Table 8. The accuracy and parameters of different models.
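The optimizer setting described above, Adam with an L2 penalty of 4 × 10⁻⁵, can be sketched as a single update step. This sketch assumes that the L2 term is added to the gradient before the moment estimates are updated, and it uses the default Adam hyperparameters (learning rate 0.001, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸); the learning rate and the exact form of the regularization used in the experiments are not given in this section, and the function name is hypothetical.

```python
import math


def adam_l2_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, l2=4e-5):
    """One Adam update with an L2 penalty folded into the gradient.

    params, grads, m, v: lists of floats of equal length.
    t: 1-based step counter used for bias correction.
    Returns the updated (params, m, v).
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        g = g + l2 * p                              # L2 term (assumed form)
        m_i = beta1 * m_i + (1.0 - beta1) * g       # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g   # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)            # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

Even when the loss gradient is zero, the penalty term moves a non-zero weight slightly toward zero, which is how the regularization discourages large weights.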

Noise Mismatch Experiment
Whether a deep model can maintain high classification performance when its input samples are disturbed is a measure of the robustness of the model. To measure the noise robustness of all models, we conducted a noise mismatch experiment. To construct noise mismatch conditions, we added white Gaussian noise to the testing dataset at signal-to-noise ratios (SNRs) of −20 dB, −15 dB, −10 dB, −5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. The 3D dynamic MFCC feature of the testing dataset at each SNR is fed to the trained network models to test the classification performance. Table 9 shows the F1-score of the various models under different SNRs. It can be observed that the models with residual structures obtain higher F1-scores, while the models without residual structures are more sensitive to white Gaussian noise. The experimental results indicate that the residual structures can suppress white Gaussian noise, and that the residual-based model is noise-robust in our ship-radiated noise classification task. The proposed model performs better under noise mismatch conditions than the two classic lightweight models.
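Constructing the noise mismatch conditions amounts to scaling white Gaussian noise so that the ratio of signal power to noise power hits the target SNR. The sketch below is one way to do this in plain Python; it assumes that the SNR is defined on the average power of each clip, and the function names are hypothetical.

```python
import math
import random


def add_white_gaussian_noise(signal, snr_db, rng=None):
    """Return a copy of `signal` with white Gaussian noise at `snr_db`.

    The noise standard deviation is chosen so that the expected ratio of
    average signal power to noise power equals 10 ** (snr_db / 10).
    """
    rng = rng or random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean clip and its noisy version."""
    p_signal = sum(x * x for x in clean) / len(clean)
    p_noise = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(p_signal / p_noise)


# The nine mismatch conditions listed above: -20 dB to 20 dB in 5 dB steps.
SNR_LEVELS_DB = list(range(-20, 25, 5))
```

Each trained model would then be evaluated once per entry of `SNR_LEVELS_DB`, with features extracted from the noisy clips rather than the clean ones.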

Conclusions
This paper proposes a lightweight ship-radiated noise classification network model, called LW-SEResNet10. The model design achieves both high accuracy and high efficiency. Starting from ResNet, the model is made lightweight by reducing the number of residual units. The attention mechanism and the ReLU6 activation function are used to suppress overfitting and thereby improve classification performance. In addition, the model takes a 3D dynamic MFCC feature as input to optimize the overall classification system. The experimental results on the ShipsEar database demonstrate the effectiveness of the system.
In the experiments, we analyze the classification accuracy and efficiency of multiple models, the dependence of these models on features, and the influence of a noise mismatch between training and testing on classification performance. These extensive experiments support the advantages of the proposed method. As a classic network, ResNet remains superior after model compression and redesign, and it can meet the demand for high accuracy and efficiency in the field of ship-radiated noise classification.